Research On Distributed Storage And Sequence Alignment Of DNA Data Based On HBase

Posted on:2019-10-02

Degree:Master

Type:Thesis

Country:China

Candidate:S X Wen

Full Text:PDF

GTID:2370330590465799

Subject:Computer technology

Abstract/Summary:

With the implementation of large-scale genetic engineering projects such as the 1000 Genomes Project, the emergence of high-throughput Next Generation Sequencing (NGS) technology has significantly reduced the cost of sequencing, and human DNA sequencing capacity and speed have grown explosively. The sequence data generated by NGS platforms provide a large number of samples for life science research, and NGS platforms are among the most important and most rapidly expanding data sources in bioinformatics. However, the terabytes of raw data produced by NGS experiments pose a great challenge to the management and analysis of sequence data. Traditional data storage and analysis software relies on older hardware architectures and does not scale well in the face of rapidly growing sequence data; its computing power is limited and data security cannot be guaranteed. Researchers in biology are therefore constantly seeking new computing solutions that let computing power keep pace with sequencing capacity.

This paper first carries out theoretical groundwork, describing the basic theory of DNA sequences, their text storage formats, and commonly used databases. Second, it reviews the development and underlying ideas of biological sequence alignment algorithms. It then introduces the architectural features of HDFS, HBase, and MapReduce in the Hadoop distributed framework, and concludes that Hadoop-related technologies have a natural advantage for DNA sequence storage and analysis.

The paper then focuses on hierarchical storage of DNA sequence data in HBase. It analyzes in detail the HBase storage mechanism, the RegionSplit principle, and the specific splitting process, examines RegionSplit optimization methods, and proposes an improved storage scheme. In this scheme, a hierarchical Rowkey is designed with reference to the classification standards of current sequence databases, and the best Region split point is determined by a customized hierarchical Region Split algorithm. Experiments with different sequence data import schemes and test methods show that the storage scheme achieves better performance and high throughput in data scanning and distributed computing.

Finally, for the problems of retrieving and comparing massive DNA sequences, the feasibility of implementing sequence similarity retrieval in the HBase database is demonstrated both in theory and in practice. Through a detailed study of the parallelized sequence alignment algorithm CloudBurst, the paper proposes improvements addressing existing problems of the CloudBurst algorithm and applies the improved algorithm to HBase. Multi-dimensional comparison experiments show the advantages of the improved CloudBurst algorithm and verify that the HBase database can carry out sequence similarity comparison efficiently.
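
The abstract does not give the actual Rowkey layout or split algorithm, so the following is only a minimal Python sketch of the general idea of a hierarchical Rowkey with a hierarchy-aware split point. The key levels (division, organism, accession), the `|` separator, and the function names `make_rowkey`, `prefix`, and `choose_split_point` are all hypothetical and chosen for illustration. It assumes that a split should fall on a boundary between top-level groups, as close as possible to the middle of the Region, so that rows sharing a prefix stay in one Region.

```python
SEP = "|"  # hypothetical level separator, not taken from the thesis


def make_rowkey(division, organism, accession):
    """Build a hierarchical row key: division|organism|accession (assumed levels)."""
    return SEP.join((division, organism, accession))


def prefix(rowkey, depth):
    """Return the first `depth` levels of a row key."""
    return SEP.join(rowkey.split(SEP)[:depth])


def choose_split_point(sorted_keys, depth=1):
    """Pick the row key at the hierarchy boundary nearest the middle.

    Returns None when all keys share the same prefix at this depth,
    i.e. there is no boundary to split on.
    """
    mid = len(sorted_keys) // 2
    # Indexes where the prefix changes from the previous key.
    boundaries = [
        i for i in range(1, len(sorted_keys))
        if prefix(sorted_keys[i], depth) != prefix(sorted_keys[i - 1], depth)
    ]
    if not boundaries:
        return None
    best = min(boundaries, key=lambda i: abs(i - mid))
    return sorted_keys[best]


# Illustrative keys only; HBase stores rows sorted by row key.
keys = sorted(make_rowkey(*t) for t in [
    ("BCT", "E.coli", "A001"),
    ("BCT", "E.coli", "A002"),
    ("PRI", "H.sapiens", "B001"),
    ("PRI", "H.sapiens", "B002"),
    ("PRI", "H.sapiens", "B003"),
    ("VRL", "HIV-1", "C001"),
])
split = choose_split_point(keys)
print(split)
```

In this toy data the plain midpoint would fall inside the `PRI` group, whereas the sketch moves the split to the first `PRI` key so that no top-level group is divided. In HBase itself such logic would presumably live in a custom split policy written in Java; the sketch only illustrates the selection rule.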

Keywords/Search Tags:

DNA sequence, HBase, Hierarchical, MapReduce, CloudBurst

Related items

1	Research On Storage And Retrieval Of Land Cover Data Based On HBase And Multilevel Grid Index
2	Biological Data Storage Based On Hbase And Analysis Of DNA Sequence
3	Research On K-mer Frequency Counting Algorithm Of DNA Sequence Based On MapReduce
4	Online Scheduling On Two Hierarchical Uniform Machines And MapReduce Scheduling Problem
5	Storage And Processing System Of Marine Data Based On Hadoop
6	Research On Cloud Storage Of Spatial Data In HBase
7	Research On The Evaluation Of Full House Earthquake Damage Based On Cloud Computing Technology Based On Big Data
8	Research On Storage And Indexing Of Spatiotemporal Big Data Based On HBase Database
9	Research On Shortest Path Algorithm Of GIS Network Based On Hadoop
10	Distributed File System Management Technology Research Based On The Massive Remote Sensing Image Data