The Research Of Reference-based Compression Specified For Sequence Data

Posted on:2020-12-07

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Chen

Full Text:PDF

GTID:2370330620952517

Subject:Software engineering

Abstract/Summary:

With the development of sequencing technology, the volume of genomic data is increasing rapidly, posing challenges for storage and transmission. Compression is commonly used to reduce this volume, but the results are unsatisfactory. Reference-based compression, which exploits the similarity between a reference genome and the target sequence data, can address this problem. However, existing methods suffer from low matching rates, and their compression ratios leave room for improvement. This thesis therefore proposes reference-based algorithms for compressing FASTQ files. The main contributions are as follows.

(1) A reference-based compression algorithm called TBFQC is presented, which combines the Karp-Rabin algorithm with the Levenshtein distance algorithm. The algorithm divides a FASTQ file into identifiers, sequences, and quality scores, and compresses each of the three parts according to its characteristics. Identifiers are compressed with the Deflate algorithm. Sequences are first mapped to the reference using the Karp-Rabin algorithm combined with the Levenshtein distance algorithm, and then compressed. Quality scores are run-length encoded. The results of these steps are then compressed further with the XZ compressor.

(2) A reference-based compression algorithm called FastqBow is presented, which employs a Bowtie index. This algorithm also compresses the three parts of a FASTQ file separately, each in a different way. Identifiers are encoded with an incremental-coding-like scheme. For sequences, the Bowtie index is used to map each target sequence to the reference genome, and the encoded mapping result replaces the target sequence. Quality scores are encoded with statistics-based byte coding. FastqBow then applies a context model combined with arithmetic coding to compress the data further.

(3) Experiments were carried out on sequencing data from 10 different species. The results show that the average compression ratios of TBFQC and FastqBow are 7.5 and 9.8, about 2 and 3 times that of GZIP, a general-purpose compressor, respectively. FastqBow achieved the best compression ratio among the state-of-the-art FASTQ compressors compared. In a comparison of the matching ratios of the two schemes, FastqBow matched more bases than TBFQC, which explains its better compression ratio.
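The TBFQC-style steps described in the abstract (splitting a FASTQ file into three streams, run-length encoding quality scores, and mapping a read to a reference with a Karp-Rabin seed verified by Levenshtein distance) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the seed length `k`, the edit threshold `max_edits`, the hash base and modulus, and the choice to seed only on the first `k` bases of each read are all assumptions made for illustration.

```python
def split_fastq(text):
    """Split FASTQ text into identifier, sequence and quality-score streams."""
    lines = text.strip().split("\n")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):  # 4 lines per record; the third is '+'
        ids.append(lines[i])
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])
    return ids, seqs, quals


def run_length_encode(s):
    """Return a list of (symbol, run length) pairs for a string."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rolling_hashes(s, k, base=257, mod=(1 << 61) - 1):
    """Yield (position, Karp-Rabin hash) for every length-k window of s."""
    if len(s) < k:
        return
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in s[:k]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    for i in range(1, len(s) - k + 1):
        h = ((h - ord(s[i - 1]) * high) * base + ord(s[i + k - 1])) % mod
        yield i, h


def build_index(reference, k):
    """Map each k-mer hash of the reference to the positions where it occurs."""
    index = {}
    for pos, h in rolling_hashes(reference, k):
        index.setdefault(h, []).append(pos)
    return index


def levenshtein(a, b):
    """Edit distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def map_read(read, reference, index, k, max_edits):
    """Return a reference position for the read, or None if no match is found."""
    if len(read) < k:
        return None
    _, seed_hash = next(rolling_hashes(read, k))
    for pos in index.get(seed_hash, []):
        window = reference[pos:pos + len(read)]
        # Compare the seed text to rule out hash collisions, then check edits.
        if window[:k] == read[:k] and levenshtein(read, window) <= max_edits:
            return pos
    return None


# Toy data (made up for this example).
reference = "ACGTACGTTAGCCGATAGGCTA"
fastq = "@r1\nTAGCCGATAG\n+\nIIIIIIHHHH\n@r2\nTAGCCGTTAG\n+\nIIIIHH####\n"

ids, seqs, quals = split_fastq(fastq)
index = build_index(reference, k=4)
positions = [map_read(s, reference, index, k=4, max_edits=2) for s in seqs]
quality_runs = [run_length_encode(q) for q in quals]
```

In this toy example both reads map to reference position 8: the first matches exactly and the second differs by one substitution, so a compressor could store the position plus the edits in place of the bases. For the remaining stages, Python's standard `zlib` and `lzma` modules provide Deflate-based and XZ compression respectively.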

Keywords/Search Tags:

Reference-based compression, FASTQ, Karp-Rabin algorithm, Levenshtein distance algorithm, Bowtie index algorithm

Related items

1	High-throughput Genome Resequencing Data Compression Algorithm Based On Self-index Structure
2	Research On Fast Migration Algorithm Between Reference Gene Compression Libraries
3	Lossless Compression Of High-throughput DNA Sequence Data
4	Implementations Of Two Improved Versions Of The AKS Primality Testing Algorithm
5	Research On Point Cloud Compression Algorithm Based On Geometric Feature Constraint
6	Research On High Performance Biological Data Compression Algorithm Based On Heterogeneous Computing Platform
7	Lossless Reference DNA Data Compression Method Based On ICBDS Optimization
8	Research On DNA Sequences Compression Algorithm Based On Statistical Theory
9	Research On Lossless Compression Algorithms For FASTQ Files
10	Research Of Reference-based Genome Sequence Data Compression Algorithm