
The Research Of Reference-based Compression Specified For Sequence Data

Posted on: 2020-12-07  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Chen  Full Text: PDF
GTID: 2370330620952517  Subject: Software engineering
Abstract/Summary:
With the development of sequencing technology, the volume of gene data is growing rapidly and poses challenges for storage and transmission. Compression is commonly used to reduce this volume, but the results are often unsatisfactory. Reference-based compression, which exploits the similarity between a reference genome and the target gene data, can address this problem. However, current reference-based methods suffer from low matching rates, and their compression ratios leave room for improvement. This thesis therefore presents reference-based algorithms for compressing FASTQ files. The main contributions are as follows.

(1) A reference-based compression algorithm called TBFQC, which combines the Karp-Rabin algorithm with the Levenshtein distance algorithm. The algorithm divides a FASTQ file into identifiers, sequences and quality scores, and compresses each of the three parts according to its characteristics. Identifiers are compressed with the Deflate algorithm. Sequences are mapped to the reference using the Karp-Rabin algorithm combined with the Levenshtein distance algorithm and then compressed. Quality scores are run-length encoded. The output of these steps is then further compressed with the XZ compressor.

(2) A reference-based compression algorithm called FastqBow, which employs the Bowtie index. This algorithm also compresses the three parts of a FASTQ file separately, each in a different way. Identifiers are encoded with an incremental-coding-like scheme. For sequences, the Bowtie index is used to map each target sequence to the reference genome, and the encoded mapping result replaces the target sequence. Quality scores are encoded with statistics-based byte coding. FastqBow then applies a context model combined with arithmetic coding to compress the data further.

(3) Experiments were carried out on gene sequencing data from 10 different species. The results show that the average compression ratios of TBFQC and FastqBow are 7.5 and 9.8, which are 2 times and 3 times that of GZIP, a general-purpose compressor. FastqBow achieved the best compression ratio among the state-of-the-art FASTQ compressors compared. In the comparison of matching ratios between the two schemes, FastqBow matched more bases than TBFQC, which explains why it achieves the better compression ratio.
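The TBFQC steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how Karp-Rabin and Levenshtein distance are combined, so the sketch assumes a seed-and-verify scheme (a rolling-hash search for a read prefix, followed by an edit-distance check). The function names, `seed_len`, `max_dist`, and the hash parameters are all hypothetical.

```python
import lzma


def split_fastq(text):
    """Split FASTQ text into identifier, sequence and quality streams."""
    lines = text.strip().split("\n")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):  # each record spans 4 lines
        ids.append(lines[i])
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])
    return ids, seqs, quals


def run_length_encode(s):
    """Return [(char, count), ...] for consecutive runs in s."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def levenshtein(a, b):
    """Edit distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def karp_rabin_find(reference, seed, base=256, mod=1_000_003):
    """Return all positions where seed occurs in reference (rolling hash)."""
    k = len(seed)
    if k == 0 or k > len(reference):
        return []
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    target = window = 0
    for i in range(k):
        target = (target * base + ord(seed[i])) % mod
        window = (window * base + ord(reference[i])) % mod
    hits = []
    for pos in range(len(reference) - k + 1):
        # Compare strings on a hash match to rule out collisions.
        if window == target and reference[pos:pos + k] == seed:
            hits.append(pos)
        if pos + k < len(reference):
            window = ((window - ord(reference[pos]) * high) * base
                      + ord(reference[pos + k])) % mod
    return hits


def map_read(reference, read, seed_len=8, max_dist=2):
    """Assumed seed-and-verify mapping: return (position, distance) or None."""
    best = None
    for pos in karp_rabin_find(reference, read[:seed_len]):
        dist = levenshtein(read, reference[pos:pos + len(read)])
        if dist <= max_dist and (best is None or dist < best[1]):
            best = (pos, dist)
    return best


def xz_compress(payload):
    """Final general-purpose pass over an already encoded byte stream."""
    return lzma.compress(payload)
```

A read that maps successfully can be stored as a position plus a small set of differences instead of its full base string, which is where the gain of a reference-based scheme comes from. Reads that return `None` would need a fallback encoding, which this sketch leaves out.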
Keywords/Search Tags: Reference-based compression, FASTQ, Karp-Rabin algorithm, Levenshtein distance algorithm, Bowtie index algorithm