
Research On Cloud Platform Oriented Efficient Storage Compression Of Bioinformatics Data

Posted on: 2016-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: J R Wang
GTID: 2180330476954937
Subject: Biomedical engineering
Abstract/Summary:
Biological sequencing is a basic technology of molecular biology, used to determine the gene sequences of organisms. The volume of data generated by sequencing projects is growing exponentially. In the era of big data, how to store and analyze these biological data is a major problem faced by biologists. Data compression can effectively reduce the amount of data and improve the utilization of limited bandwidth when data are transmitted over a network, thus easing the pressure of data growth. DNA sequences are the main data object in biomedical research; they contain direct repeats and approximately duplicated gene fragments produced by self-replication and gene mutation. Traditional compression methods, however, compress such data poorly. Research on high-performance DNA sequence compression methods is therefore important for the efficient storage of biological data.

DNA-sequence-specific compression methods are still at the research stage. Early compression algorithms used text compression methods to reduce redundant data and achieved a considerable improvement in compression over general-purpose data compression methods. With the development of high-throughput sequencing technology, many resequencing datasets have been produced to explore genetic differences between individuals; these data contain redundancy within a single sequence as well as redundancy shared across multiple sequences. Recent DNA sequence compression methods aim to find the differences between sequences in order to reduce the amount of data.

In this thesis, building on traditional compression methods, we study DNA sequence compression technology and classify and summarize existing work. Using the idea of sequence comparison to optimize and improve existing compression methods, we design Gcompress (Genome compress), a compression method for collections of DNA sequences.
Gcompress has two compression modes: one reduces redundancy within a single sequence by exploiting local similarity within an individual, and the other reduces the similar data shared among multiple sequences from different individuals. Both modes use a dictionary-based compression algorithm combined with Huffman coding, which effectively reduces redundant data among sequences and gives good compression results. We compared the performance of Gcompress with the general-purpose compression software gzip and with existing leading DNA sequence compression methods. The experimental results show that, compared with gzip, the single-sequence mode of Gcompress achieves a higher compression rate with lower time consumption, while the multi-sequence mode maintains a good compression ratio and effectively improves compression speed relative to the comparison algorithms.

In addition, we combine the Map/Reduce model with the single-sequence mode to implement a block-based distributed compression method, which reduces redundancy effectively by exploiting the local correlation of the data. By using the computational resources of a cloud platform, this relieves the processing pressure caused by large single sequences and provides technical support for the efficient storage and transmission of shared data.
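
The block-based idea described above can be illustrated with a minimal sketch. It is not the Gcompress implementation, whose details are not given in this abstract. As a stand-in for the dictionary-plus-Huffman coder it uses Python's `zlib` (DEFLATE, which combines LZ77 dictionary matching with Huffman coding), and as a stand-in for a Map/Reduce cluster it uses a local thread pool. The names `split_blocks`, `compress_block`, `compress_sequence`, `decompress_sequence` and the value of `BLOCK_SIZE` are hypothetical and chosen only for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size; the abstract does not state the value used.
BLOCK_SIZE = 1 << 16


def split_blocks(seq: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut one long sequence into fixed-size blocks (the last may be shorter)."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def compress_block(block: bytes) -> bytes:
    """'Map' step: compress one block independently of all others.

    zlib/DEFLATE is only a stand-in for a dictionary + Huffman coder.
    """
    return zlib.compress(block, 9)


def compress_sequence(seq: bytes, block_size: int = BLOCK_SIZE,
                      workers: int = 4) -> list:
    """Compress all blocks in parallel and collect them in original order.

    A thread pool stands in for the cluster; pool.map preserves input order,
    which plays the role of the ordered 'reduce' step.
    """
    blocks = split_blocks(seq, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_block, blocks))


def decompress_sequence(parts: list) -> bytes:
    """Invert compress_sequence by decompressing and concatenating blocks."""
    return b"".join(zlib.decompress(p) for p in parts)


if __name__ == "__main__":
    # Synthetic, highly repetitive sequence used only as demo input.
    seq = b"ACGTACGTTTGACA" * 5000
    parts = compress_sequence(seq, block_size=10000)
    assert decompress_sequence(parts) == seq
    print(len(seq), sum(len(p) for p in parts))
```

Because each block is compressed on its own, repeats that span block boundaries are not exploited; this sketch only captures redundancy that is local to a block, which is the trade-off for being able to process blocks in parallel.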
Keywords/Search Tags: DNA sequence compression, Huffman, multiple sequence compression, LZ