Lossless Compression of High-Throughput DNA Sequence Data

Posted on: 2016-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Zhang
GTID: 2180330464456907
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Advances in next-generation sequencing (NGS) have greatly promoted research on genomic analysis, hereditary disease diagnosis, food security, and other fields. The exponential growth of NGS data outpaces both the decrease in storage cost and the increase in network bandwidth, which poses great challenges for data storage and transmission. Efficient compression of NGS data has been widely adopted to address this ‘big data’ problem. In this dissertation, state-of-the-art NGS data compression technologies are comprehensively reviewed and systematically evaluated, and new reference-based compression algorithms are proposed for FASTQ, the most popular raw NGS data format. The main contributions of this study are as follows:

(1) High-throughput DNA sequence data are classified into genomic sequences and raw NGS sequencing data. A survey of their storage formats and compression methods is provided, and the compression methods are systematically compared through extensive experiments.

(2) A reference-based compression algorithm called FQZip is proposed for FASTQ data. FQZip first separates the three components of an input FASTQ file, namely metadata, short reads, and quality scores, into three data streams and then compresses each stream independently according to its own characteristics. In particular, repeats in the metadata are identified and compressed with LZMA, while redundancy in the quality scores is handled by run-length coding and arithmetic coding. Reads are aligned against a homologous reference genome with the external alignment tool BWA, and the alignment results are then compressed using arithmetic coding, Huffman coding, and LZMA. Experimental results on real-world NGS data indicate that FQZip achieves a better compression ratio than other state-of-the-art NGS data compression methods.

(3) A lightweight reference-based compression algorithm called LWFQZip is proposed as an improved version of FQZip.
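The three-stream decomposition described for FQZip in contribution (2) can be illustrated with a minimal Python sketch. It is not the FQZip implementation: the function names (split_fastq, run_length_encode, compress_metadata) are hypothetical, the input is assumed to use the common four-line FASTQ record layout, and the arithmetic-coding and read-alignment stages are omitted. Only the stream separation, run-length coding of quality scores, and LZMA compression of metadata are shown.

```python
import lzma
from itertools import groupby


def split_fastq(text):
    """Split FASTQ text into metadata, read, and quality-score streams.

    Assumes every record occupies exactly four lines:
    @identifier, bases, '+', quality string.
    """
    lines = text.strip().splitlines()
    metadata, reads, qualities = [], [], []
    for i in range(0, len(lines), 4):
        metadata.append(lines[i])
        reads.append(lines[i + 1])
        qualities.append(lines[i + 3])
    return metadata, reads, qualities


def run_length_encode(quality):
    """Collapse runs of identical quality characters into (char, count) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(quality)]


def compress_metadata(metadata):
    """Compress the metadata stream with LZMA from the standard library."""
    return lzma.compress("\n".join(metadata).encode("ascii"))


# Tiny hypothetical input with two records.
SAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII##\n"

if __name__ == "__main__":
    meta, reads, quals = split_fastq(SAMPLE)
    print(meta, reads)
    print([run_length_encode(q) for q in quals])
    print(len(compress_metadata(meta)), "bytes of compressed metadata")
```

Keeping the streams separate lets each one be passed to the coder that suits its statistics, which is the motivation the abstract gives for the decomposition.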
Following the decompose-and-compress framework of FQZip, LWFQZip is equipped with a lightweight mapping model based on a k-mer index, which quickly aligns reads against the reference sequence(s) and produces concise alignment results for storage. Read alignment in LWFQZip is significantly faster than in FQZip. LWFQZip achieves an average compression ratio of 0.144 on eight real-world NGS data sets, which is comparable or superior to other state-of-the-art lossless NGS data compression algorithms.

This dissertation presents two efficient reference-based compression algorithms for FASTQ and a fast sequence alignment method, which together contribute to state-of-the-art applications for NGS data storage and transmission. They are expected to serve as candidate solutions for relieving the pressure brought by high-throughput DNA sequencing.
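The general idea of mapping reads through a k-mer index can be sketched as follows. This is an illustrative Python example and not the LWFQZip mapping model: the function names (build_kmer_index, map_read), the single-seed strategy, and the max_mismatches parameter are assumptions made for the sketch. A read is stored as a reference position plus a list of mismatching bases, which is one plausible form of a concise alignment result.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (position, [(offset, read_base), ...]) for the best hit, or None.

    Only the first k-mer of the read is used as a seed (an assumption of
    this sketch); each candidate position is then verified base by base.
    """
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run past the end of the reference
        mismatches = [
            (i, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) > max_mismatches:
            continue
        if best is None or len(mismatches) < len(best[1]):
            best = (pos, mismatches)
    return best


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    index = build_kmer_index(reference, 4)
    print(map_read("ACGTTAGC", reference, index, 4))
    print(map_read("ACGTTAGG", reference, index, 4))
```

Reads that find no acceptable hit return None here; a complete compressor would need a fallback path that stores such reads verbatim.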
Keywords/Search Tags: Next-generation sequencing, DNA sequence compression, Reference-based compression, Reference-free compression, FASTQ