Comparative Analysis And Visualization Of Scalable Gene Sequences Based On Apache Spark

Posted on: 2020-05-03 | Degree: Master | Type: Thesis
Country: China | Candidate: T Z Liu
GTID: 2370330626956923 | Subject: Software engineering
Abstract/Summary:
As society enters the digital information age, the volume of data being generated creates many technical challenges. The recent digitization of genetic information has led to exponential growth in biological genetic data. These gene sequences carry a great deal of detailed genetic information, which poses many computational challenges. Among them, the comparative analysis of short-read DNA sequences is an important problem in bioinformatics. As short-read DNA sequence data grows exponentially, there is an urgent need for tools that let even biologists without much computing background store, read, and analyze this data and complete comparative analyses of short-read gene sequences quickly and conveniently. To address this problem, this thesis makes the following contributions:

(1) It designs a scalable short-read DNA sequence alignment analysis system based on Spark, which makes full use of Apache Spark's parallel computing features and its relational processing module, Spark SQL. By keeping data in memory, the Spark framework enables efficient data reuse, which significantly reduces data access time and thus greatly improves query performance. Experiments show that the system scales with data size on a Spark parallel computing cluster. In addition, Spark was used to analyze 1,000 publicly available gene data sets in VCF format (size: 1.2 TB) as input, and the evaluation of the final results showed good system performance.

(2) It further implements a web-based interactive prototype system in which users specify search conditions and run searches, through Spark SQL, on the data held in memory. The system presents the best-matching results to the user, which makes it considerably easier to use.

(3) It proposes a fast Spark-based DNA sequence alignment algorithm, called the Spark-DNA sequence alignment analysis algorithm. The algorithm uses Apache Spark features such as broadcast variables, partitioned joins, caching, and in-memory computation to optimize performance. Its performance was evaluated against the SparkBWA tool and the MapReduce-based CloudBurst algorithm. The results show that the Spark-DNA sequence alignment algorithm outperforms both SparkBWA and CloudBurst, providing a speedup in the range of 101 to 702 for short reads of the human genome.

The experimental evaluation shows that Apache Spark provides a very good solution for the comparative analysis of short-read DNA sequences.
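
The search workflow described in (2), where a user supplies conditions and the system queries data held in memory, can be illustrated with a minimal sketch. This is not the thesis's implementation: Python's built-in sqlite3 in-memory database stands in for Spark SQL so the example runs without a cluster, the VCF records are made up for illustration, and the names parse_vcf, load_variants, and search are hypothetical. Only the first five fixed VCF columns (CHROM, POS, ID, REF, ALT) are used.

```python
import sqlite3

# Made-up VCF content for illustration only; real files carry more columns
# (QUAL, FILTER, INFO, per-sample genotypes) and many header lines.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t10177\trs_demo1\tA\tAC\n"
    "1\t10352\trs_demo2\tT\tTA\n"
    "2\t10235\trs_demo3\tT\tTA\n"
)


def parse_vcf(text):
    """Yield (chrom, pos, id, ref, alt) tuples, skipping '#' header lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt = line.split("\t")[:5]
        yield chrom, int(pos), vid, ref, alt


def load_variants(text):
    """Load parsed records into an in-memory table (stand-in for a cached Spark table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants "
        "(chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", parse_vcf(text))
    return conn


def search(conn, chrom, start, end):
    """Return variants on one chromosome within [start, end], ordered by position."""
    cur = conn.execute(
        "SELECT id, pos, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


conn = load_variants(VCF_TEXT)
print(search(conn, "1", 10000, 10300))
```

In a Spark setting, the table would presumably be a cached DataFrame registered as a temporary view, and the same kind of SQL statement would be submitted through Spark SQL; that mapping is an assumption, since the abstract does not give the schema or the queries used.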
Keywords/Search Tags: Big data mining, Gene sequence analysis, Apache Spark