Lossless Compression Of High-throughput DNA Sequence Data

Posted on:2016-01-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y P Zhang

Full Text:PDF

GTID:2180330464456907

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:

Advances in next-generation sequencing (NGS) have greatly promoted research on genomic analysis, hereditary disease diagnosis, food security, and related fields. The exponential growth of NGS data outpaces both the decrease in storage cost and the increase in network bandwidth, which poses great challenges for data storage and transmission. Efficient compression techniques for NGS data have been widely adopted to address this 'big data' issue. In this dissertation, state-of-the-art NGS data compression technologies are comprehensively reviewed and systematically evaluated through experiments. New reference-based compression algorithms are proposed for FASTQ, the most popular raw NGS data format. The main contributions of this study are as follows:

(1) High-throughput DNA sequence data are classified into genomic sequences and raw NGS sequencing data. A survey of their storage formats and compression methods is provided, and the compression methods are systematically compared using extensive experimental results.

(2) A reference-based compression algorithm called FQZip is proposed for FASTQ data. FQZip first separates the three components of an input FASTQ file, namely metadata, short reads, and quality scores, into three data streams and then compresses each stream independently according to its own characteristics. In particular, the repeats in the metadata are identified and compressed with LZMA, while the redundancy in the quality scores is handled by run-length coding and arithmetic coding. Reads are aligned against a homologous reference genome by the external alignment tool BWA, and the alignment results are then compressed using arithmetic coding, Huffman coding, and LZMA. Experimental results on real-world NGS data indicate that FQZip obtains a better compression ratio than other state-of-the-art NGS data compression methods.

(3) A light-weight reference-based compression algorithm called LWFQZip is proposed as an improved version of FQZip. Following the decomposition-based compression framework of FQZip, LWFQZip is equipped with a light-weight mapping model based on a k-mer index, which quickly aligns reads against the reference sequence(s) and produces concise alignment results for storage. Alignment in LWFQZip is significantly faster than in FQZip. LWFQZip achieves an average compression ratio of 0.144 on eight real-world NGS data sets, which is comparable or superior to other state-of-the-art lossless NGS data compression algorithms.

This dissertation presents two efficient reference-based compression algorithms for FASTQ and a fast sequence alignment method, which together contribute to state-of-the-art applications for NGS data storage and transmission. They are expected to serve as candidate solutions for relieving the pressure brought by high-throughput DNA sequencing.
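The decomposition framework and the k-mer index mapping described in the abstract can be illustrated with a small Python sketch. It is not the FQZip or LWFQZip implementation: all function names are hypothetical, the seeding strategy (looking up only the first k-mer of each read and keeping the hit with the fewest mismatches) is an assumption, and the arithmetic and Huffman coding stages are omitted. It only shows how a FASTQ file might be split into three streams, how quality scores can be run-length encoded, and how a read can be stored as a reference position plus a list of mismatches.

```python
import lzma
from collections import defaultdict
from itertools import groupby


def split_fastq(text):
    """Split FASTQ text into metadata, read, and quality-score streams."""
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    metadata = lines[0::4]   # '@' identifier lines
    reads = lines[1::4]      # nucleotide sequences
    qualities = lines[3::4]  # quality strings ('+' separator lines are skipped)
    return metadata, reads, qualities


def run_length_encode(s):
    """Collapse runs of equal characters into (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Seed with the read's first k-mer and keep the hit with fewest mismatches.

    Returns (position, [(offset, read_base), ...]) or None if no seed matches.
    This seeding rule is an illustrative assumption, not the thesis's method.
    """
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run past the end of the reference
        mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
        if best is None or len(mismatches) < len(best[1]):
            best = (pos, mismatches)
    return best


# Toy input: two FASTQ records and a short made-up reference sequence.
fastq = (
    "@read1\nACGTACGT\n+\nIIIIHHHH\n"
    "@read2\nGTACGAAC\n+\nIIIIIIII\n"
)
reference = "TTACGTACGTACCAAC"

metadata, reads, qualities = split_fastq(fastq)
packed_metadata = lzma.compress("\n".join(metadata).encode("ascii"))
quality_runs = [run_length_encode(q) for q in qualities]
index = build_kmer_index(reference, k=4)
alignments = [map_read(r, reference, index, k=4) for r in reads]
```

In this toy setup each read is reduced to a start position and a short mismatch list, which is the kind of concise alignment record that a later entropy-coding stage could compress further.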

Keywords/Search Tags:

Next-generation sequencing, DNA sequence compression, Reference-based compression, Reference-free compression, FASTQ

Related items

1	Research On Fast Migration Algorithm Between Reference Gene Compression Libraries
2	Research On Compression And Assembly Of Biological Sequencing Data Based On Non-reference Genomes
3	Optimization And Implementation Of Lossless Compression Of Gene Sequencing Data
4	The Research Of Reference-based Compression Specified For Sequence Data
5	Research Of Reference-based Genome Sequence Data Compression Algorithm
6	High-throughput Genome Resequencing Data Compression Algorithm Based On Self-index Structure
7	Research On The Third-generation DNA Sequencing Data Compression Method
8	Compression Of DNA Sequences Based On Reference Sequences And Weighting Of Context Models
9	Research On Compression And Indexing Methods For High-throughput Sequencing Data
10	Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array