Research On Distributed Storage And Sequence Alignment Of DNA Data Based On HBase

Posted on: 2019-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: S X Wen
Full Text: PDF
GTID: 2370330590465799
Subject: Computer technology
Abstract/Summary:
With large-scale genomics projects such as the 1000 Genomes Project under way, high-throughput Next Generation Sequencing (NGS) technology has significantly reduced the cost of sequencing, and human DNA sequencing capacity and speed have grown explosively. The sequence data generated by NGS platforms provides a large number of samples for life science research, making NGS one of the most important and fastest-growing data sources in bioinformatics. However, the terabytes of raw data produced by NGS experiments pose a great challenge to the management and analysis of sequence data. Traditional data storage and analysis software relies on older hardware architectures and does not scale well with rapidly growing sequence data; its computing power is limited, and data security cannot be guaranteed. Researchers in biology are therefore constantly seeking new computing solutions that let computing power keep pace with sequencing capacity.

This thesis first reviews the theoretical background: the basic theory of DNA sequences, their text storage formats, and commonly used databases. It then analyzes the development and core ideas of biological sequence alignment algorithms. It introduces the architectural features of HDFS, HBase, and MapReduce in the Hadoop distributed framework, and concludes that Hadoop-related technologies are naturally suited to storing and analyzing DNA sequences.

The thesis then focuses on hierarchical storage of DNA sequence data in HBase. It analyzes in detail the HBase storage mechanism, the RegionSplit principle, and the specific splitting process, examines methods for optimizing RegionSplit, and proposes an improved storage scheme. In this scheme, a hierarchically structured Rowkey is designed with reference to current sequence database classification standards, and the best Region split point is determined by a customized hierarchical Region Split algorithm. Experiments with different sequence data import schemes and testing methods show that the storage scheme delivers better performance and high throughput in data scanning and distributed computing.

Finally, to address retrieval and comparison of massive DNA sequence collections, the thesis demonstrates, both in theory and in practice, the feasibility of implementing sequence similarity retrieval on an HBase database. Based on a detailed study of the parallel sequence alignment algorithm CloudBurst, it proposes improvements that address existing problems in the algorithm and applies the improved algorithm to HBase. Multi-dimensional comparative experiments show the advantages of the improved CloudBurst algorithm and verify that the HBase database can carry out sequence similarity comparison efficiently.
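
The abstract does not give the Rowkey layout or the split algorithm, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes a hypothetical three-level key of the form `division|organism|accession` and a hypothetical helper, `hierarchical_split_point`, that chooses the category boundary closest to the middle of a sorted key range, so that a split never cuts through a category at the chosen level. It is plain Python and does not call any HBase API.

```python
def make_rowkey(division, organism, accession):
    """Build a hierarchical row key (hypothetical layout, not from the thesis)."""
    return f"{division}|{organism}|{accession}"


def hierarchical_split_point(sorted_keys, level=1):
    """Return the key at the prefix boundary nearest the midpoint.

    level=1 considers boundaries between divisions,
    level=2 boundaries between division|organism groups.
    Returns None when every key shares the same prefix at that level.
    """
    def prefix(key):
        return "|".join(key.split("|")[:level])

    # Positions where the prefix changes are the only allowed split points.
    boundaries = [
        i for i in range(1, len(sorted_keys))
        if prefix(sorted_keys[i]) != prefix(sorted_keys[i - 1])
    ]
    if not boundaries:
        return None

    # Choose the boundary that balances the two halves best.
    mid = len(sorted_keys) / 2
    best = min(boundaries, key=lambda i: abs(i - mid))
    return sorted_keys[best]


if __name__ == "__main__":
    keys = sorted([
        make_rowkey("PRI", "Homo_sapiens", "A1"),
        make_rowkey("PRI", "Homo_sapiens", "A2"),
        make_rowkey("PRI", "Pan_troglodytes", "B1"),
        make_rowkey("ROD", "Mus_musculus", "C1"),
        make_rowkey("ROD", "Mus_musculus", "C2"),
        make_rowkey("ROD", "Rattus_norvegicus", "D1"),
    ])
    print(hierarchical_split_point(keys, level=1))
```

Because keys that share a prefix sort next to each other, splitting only at prefix boundaries keeps each category within a single Region, which is one plausible reason a hierarchical key could help scan performance.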
Keywords/Search Tags:DNA sequence, HBase, Hierarchical, MapReduce, CloudBurst