Comparative Analysis And Visualization Of Scalable Gene Sequences Based On Apache Spark

Posted on:2020-05-03

Degree:Master

Type:Thesis

Country:China

Candidate:T Z Liu

Full Text:PDF

GTID:2370330626956923

Subject:Software engineering

Abstract/Summary:

As society enters the digital information age, the sheer volume of data being generated poses many technical challenges. In recent years, the digitization of genetic information has led to exponential growth in biological genetic data. These gene sequences carry detailed genetic information, and processing them raises many computational challenges. Among these, the comparative analysis of short-read DNA sequences is an important problem in bioinformatics. As short-read DNA sequence data grows exponentially, there is an urgent need for a way to store, read, and analyze it so that even biologists without much computing expertise can complete comparative analyses of short-read gene sequence data quickly and conveniently. To address this problem, this thesis makes the following contributions:

(1) A scalable short-read DNA sequence alignment analysis system is designed on top of Spark. It makes full use of Apache Spark's parallel computing features and its relational processing module, Spark SQL. By keeping data in memory, the Spark framework enables efficient data reuse, which significantly reduces data access time and thus greatly improves query performance. Experiments show that the system achieves data scalability on a Spark parallel computing cluster. In addition, Spark was used to analyze input data consisting of 1000 publicly available genes in VCF format (size: 1.2 TB), and the evaluation of the final results showed good system performance.

(2) A web-based interactive prototype system is implemented in which users specify search conditions and run searches through Spark SQL over data held in memory. The system then presents the best-matching results, which makes it considerably easier to use.

(3) A fast Spark-based DNA sequence alignment algorithm, called the Spark-DNA sequence alignment analysis algorithm, is proposed. The algorithm uses Apache Spark features such as broadcast variables, partitioned joins, caching, and in-memory computation to optimize performance. It was evaluated against the SparkBWA tool and the MapReduce-based CloudBurst algorithm. The results show that the Spark-DNA sequence alignment algorithm outperforms both SparkBWA and CloudBurst, providing a speedup in the 101-702 range for short reads of the human genome.

The experimental evaluation shows that Apache Spark provides a very good solution for the comparative analysis of short-read DNA sequences.
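The abstract does not give the details of the Spark-DNA algorithm, so the following is only a minimal sketch of the general idea behind combining a shared lookup table with partitioned reads. It uses plain Python and a generic seed-and-verify approach rather than the thesis's method. All function names, the toy reference string, and the parameters `k` and `max_mismatches` are hypothetical. In a real Spark job, the k-mer index would play the role of a broadcast variable and each partition of reads would be processed by a separate task.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the positions where it starts.

    Assumption: in a Spark job this table would be shared with every
    executor as a broadcast variable; here it is an ordinary dict.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def hamming(a, b):
    """Count mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, index, k, max_mismatches):
    """Return sorted reference positions where the read matches with at
    most max_mismatches substitutions (no insertions or deletions).

    Non-overlapping k-mers of the read are used as exact-match seeds;
    each seed hit is verified by comparing the whole read.
    """
    hits = set()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


def align_partition(reads, reference, index, k, max_mismatches):
    """Align one partition of reads (stand-in for a per-partition Spark task)."""
    return [(r, align_read(r, reference, index, k, max_mismatches)) for r in reads]


def partition(items, n):
    """Split items into n round-robin partitions."""
    return [items[i::n] for i in range(n)]


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTTAACG"  # toy reference, not real data
    reads = ["TAGCCGAT", "GGCTTAAC", "TAGCAGAT", "TTTTTTTT"]
    k = 4
    index = build_kmer_index(reference, k)
    for part in partition(reads, 2):
        for read, positions in align_partition(part, reference, index, k, 1):
            print(read, positions)
```

The sketch only handles substitutions. It relies on at least one seed matching exactly, which holds here because each 8-base read contributes two non-overlapping 4-mers and at most one mismatch is allowed.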

Keywords/Search Tags:

Big data mining, Analysis of gene sequences, Apache Spark

Related items

1	MODIS SST Fast Retrieval Method Based On Apache Spark
2	Design And Implementation Of A Spark Autotuning System
3	A Research On Distributed Logistics Optimization Algorithm Based On Spark
4	Research On Parallelization Of Spatial Data Mining Clustering Algorithm Based On SPARK
5	Study Of Data Mining Methods For Gene Express Analysis
6	The Research And Implementation Of Pairwise Comparison Task Parallel Of Gene Sequences On Spark
7	Research And Implementation Of Data Mining Algorithm For Structural Health Monitoring Based On Hadoop/Spark
8	A Data-mining Platform For Functional Differentiation Analysis Between EST-based Transcriptomes
9	A Research On Data Mining Of Gene Profiles
10	Research On Data Mining Methods Of Gene Expression Profile