
Optimization And Implementation Of Lossless Compression Of Gene Sequencing Data

Posted on: 2019-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: G Liu
GTID: 2370330563991548
Subject: Information and Communication Engineering
Abstract/Summary:
Since the Human Genome Project, rapid advances in gene sequencing technology have greatly reduced both the time and the cost of sequencing. As a result, clinical applications of gene sequencing have become increasingly widespread, and the volume of sequencing data has grown explosively. However, storage hardware prices are falling far more slowly than sequencing data is growing, which makes data storage a bottleneck for the gene sequencing industry. Efficient compression is an effective way to relieve this bottleneck.

Based on a survey of existing compression algorithms for FASTQ files generated by human genome sequencing, this thesis proposes DFQZ, a lossless compression algorithm with high compression efficiency. Compression proceeds in two steps. The first step converts each part of the data into a more concise representation based on its characteristics and writes an encoded file. The second step applies a general-purpose compression algorithm such as LZMA or ZPAQ to the encoded file for further gain.

The characteristics of DFQZ include:
1) Before compression, DFQZ divides the FASTQ file into three parts (basic information, gene sequence, and quality values) and designs a different compression algorithm for each part.
2) For the basic information part, an index search scheme divides the basic information into an ID part and a description part; the index search yields a record number and separate X and Y values. DFQZ is also compatible with different versions of FASTQ.
3) For the gene sequence part, the biological characteristics of the data are exploited: the gene sequences are aligned to a known reference sequence, and only the alignment position and the alignment result are saved, which greatly improves compression efficiency. The alignment uses BWA, the most widely recognized alignment algorithm in the industry. DFQZ also supports a more portable alignment method based on hash table lookup.
4) The BWA alignment scheme is optimized and accelerated to increase compression speed.
5) For the two FASTQ files generated by paired-end sequencing, the BWA alignment scheme can compress both files simultaneously, which improves the compression ratio.
6) Both a fast mode and a best mode are provided to suit different needs.

In summary, this thesis provides DFQZ, a lossless compression algorithm for FASTQ files generated from human genome sequencing. Compared with other compression algorithms, it makes better use of the characteristics of the data and improves the compression ratio, which helps the gene sequencing industry address its data storage problem. All test data used in the experiments were real gene sequencing data.
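The two-step idea described in the abstract (separate the FASTQ streams, encode reads against a reference, then hand the result to a general-purpose compressor) can be illustrated with a minimal Python sketch. This is not DFQZ's actual format or code: the function names, the record layout, and the seed length `k` are hypothetical, the alignment is a simplified hash-table lookup on the first `k` bases that only handles substitutions (it is not BWA), and Python's standard `lzma` module stands in for the second compression step.

```python
import lzma


def split_fastq(text):
    """Split FASTQ text into three streams: headers, sequences, qualities."""
    lines = text.strip().split("\n")
    headers, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        headers.append(lines[i][1:])   # drop the leading '@'
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])     # lines[i + 2] is the '+' separator
    return headers, seqs, quals


def build_index(reference, k):
    """Hash table mapping each k-mer to its first position in the reference."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], pos)
    return index


def encode_read(read, reference, index, k):
    """Encode a read as position plus mismatches, or keep it raw if no seed hit."""
    pos = index.get(read[:k])
    if pos is None or pos + len(read) > len(reference):
        return ("raw", read)
    window = reference[pos:pos + len(read)]
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return ("aln", pos, len(read), mismatches)


def decode_read(record, reference):
    """Rebuild the original read from its encoded record (lossless)."""
    if record[0] == "raw":
        return record[1]
    _, pos, length, mismatches = record
    bases = list(reference[pos:pos + length])
    for i, base in mismatches:
        bases[i] = base
    return "".join(bases)


def compress_stream(lines):
    """Second step: general-purpose compression of one encoded stream."""
    return lzma.compress("\n".join(lines).encode("ascii"))


def decompress_stream(blob):
    """Inverse of compress_stream."""
    return lzma.decompress(blob).decode("ascii").split("\n")
```

In this sketch a read that matches the reference exactly costs only a position and a length, and a read with a few substitutions costs a few extra (offset, base) pairs, which is the intuition behind saving only the position and the result of the alignment. A practical encoder would also need to handle insertions, deletions, reverse-complement matches, and a compact binary serialization of the records.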
Keywords/Search Tags: Gene Sequencing, FASTQ, Compression, Read Alignment, Reference Sequence