
Research On Lossless Compression Algorithms For FASTQ Files

Posted on: 2019-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: X Liu
GTID: 2370330572951755
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of gene sequencing technology, the cost of sequencing has plummeted and the volume of sequencing data has risen sharply, which complicates data analysis and use. FASTQ is one of the most widely used formats for storing sequencing data, so a compression algorithm designed for FASTQ files is needed for efficient storage and transmission. This thesis proposes FTComp, a lossless compression algorithm for FASTQ files that works in two stages.

The first stage preprocesses the FASTQ file. Based on the features of the FASTQ format, FTComp classifies and extracts the data and generates three files: an identifier file, a DNA file, and a quality score file. For the identifier file, FTComp partitions the identifier sequences into several regions and applies a different method to each region according to its text features. For the DNA file, FTComp packs the nucleotides using group coding and run-length coding. For the quality score file, FTComp preprocesses the sequences with run-length coding.

In the second stage, FTComp compresses the data generated by the first stage with HyBWT, a lossless compression algorithm also proposed in this thesis. HyBWT first applies the Burrows-Wheeler transform (BWT) to the text, then constructs a wavelet tree to represent the transformed text succinctly, and finally compresses the wavelet tree with hybrid coding, which completes the second stage of FTComp.

The experiments consist of two parts: the influence of parameters on the compression ratio, and a comparison between FTComp and other compression algorithms. The first experiment tests how the shape of the wavelet tree and the block size of its bit vector affect the compression ratio. The results show that, because a FASTQ file contains three kinds of sequence data with different characteristics, the parameters should be set separately for each kind to achieve a higher compression ratio. The second experiment compares FTComp with five algorithms: two classic text compression algorithms, Gzip and Bzip2, and three industry-leading lossless compression algorithms for FASTQ files, DSRC2, Quip, and LFQC. It measures the compression ratio, compression speed, and decompression speed of the six algorithms on eight FASTQ data sets. The results show that FTComp performs well in terms of compression ratio, reducing the space occupied by the data by about 80% on average. Its compression ratio is very close to that of LFQC, exceeds those of the industrial-grade FASTQ compressors DSRC2 and Quip, and is clearly better than those of Gzip and Bzip2. In terms of compression and decompression speed, FTComp performs stably and is about 5 to 10 times faster than LFQC.
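The preprocessing stage described above can be illustrated with a minimal Python sketch. It is not FTComp's actual implementation: the function names `split_fastq`, `run_length_encode`, and `run_length_decode` are hypothetical, and the sketch assumes four-line FASTQ records with unwrapped sequences and a bare `+` separator line. It covers only the split into three streams and a plain run-length coding; the region-based identifier handling, group coding of nucleotides, and the second-stage compressor are omitted.

```python
from itertools import groupby


def split_fastq(text):
    """Split FASTQ text into identifier, DNA, and quality-score streams.

    Assumption (not from the thesis): every record is exactly four lines,
    and the '+' separator line carries no extra text that must be kept.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    identifiers, dna, quality = [], [], []
    for i in range(0, len(lines), 4):
        header, bases, plus, scores = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(scores):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        identifiers.append(header[1:])  # drop the leading '@'
        dna.append(bases)
        quality.append(scores)
    return identifiers, dna, quality


def run_length_encode(seq):
    """Encode a string as a list of (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(seq)]


def run_length_decode(pairs):
    """Invert run_length_encode, so the coding is lossless."""
    return "".join(symbol * count for symbol, count in pairs)


if __name__ == "__main__":
    sample = "@r1\nAACCCG\n+\nIIIIH#\n@r2\nTTTTGA\n+\n!!IIII\n"
    ids, dna, quality = split_fastq(sample)
    for scores in quality:
        encoded = run_length_encode(scores)
        assert run_length_decode(encoded) == scores  # round trip
        print(encoded)
```

Quality strings often contain long runs of the same score, which is why run-length coding is a natural preprocessing step for that stream; for example, `run_length_encode("IIIIH#")` returns `[('I', 4), ('H', 1), ('#', 1)]`.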
Keywords/Search Tags: FASTQ format, Lossless compression, BWT, Wavelet tree, Hybrid coding