Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array

Posted on:2020-10-14

Degree:Master

Type:Thesis

Country:China

Candidate:W Q Wu

Full Text:PDF

GTID:2370330602959052

Subject:Electronic and communication engineering

Abstract/Summary:

PDF Full Text Request

In recent years,since the genomic data grows exponentially,a huge amount of data has been generated,result in that modern storage technologies are fail to meet the actual storage requirements of data sets.Therefore,it poses a huge challenge to the transmission and storage of data.Although there are now general-purpose data compression software that can compress genomic data,these compression tools cannot efficiently compress the characteristics of the genome.Therefore,it is necessary to explore some novel compression algorithms,which can compress gene sequence data efficiently,on the basic of the characteristics of gene sequence.In this paper,a lossless data compression algorithm based on reference genome is proposed.Using the high similarity of the genomes between the same species,the target genome sequence is replaced by that the fragment of the target genome which is same as the reference genome sequence,and the portion of the target sequence which is different with the reference genome sequence in length and position,which greatly reduce the size of the intermediate file.Considering that the proposed algorithm in this paper is applied by general computers,it is necessary to use a lightweight search engine to accomplish the matching task.Therefore,the suffix array technique is used as the index structure in the search for the above matching process.Although the suffix array takes up less memory,when the target characters determines the left and right boundaries,the algorithm has a large amount of computation,so the binary search method is used to reduce the algorithmic computation.Therefore,for the proposed algorithm of this paper,the suffix array technique and the binary search method both are used to find the longest match of the target sequence segment along the reference gene sequence.This article also improved the intermediate result file and improved the compression ratio of the intermediate file.Finally,the intermediate file is compressed by efficient entropy coding scheme.The experimental results shown that the proposed algorithm requires up to 20.5 minutes to compress the human complete genome sequence of the 3GB FASTA format file.After 30 groups of human genomes were compressed by the proposed algorithm,the size ranged from 4.99 MB to 29.96 MB,of which 21 groups were less than 20 MB.For memory requirements,the proposed algorithm requires a maximum of 1.5GB of memory capacity,which is far superior to the existing excellent HiRGC genome sequence data compression algorithm.It can be run on a normal home computer and be valuable in practical application scenarios.

Keywords/Search Tags:

genomic sequence, suffix array, lossless compression, sequence matching

PDF Full Text Request

Related items

1	Studied On Gene Sequence Alignment Based On Mixed Suffix Tree And Suffix Array
2	Research Of Reference-based Genome Sequence Data Compression Algorithm
3	Design And Optimization For A Biological Sequence Matching System Supporting Scoring Matrices
4	Lossless Comprssion Of High-throughput DNA Sequence Data
5	Ultra-large Multiple Sequence Alignment Based On Distributed Computing
6	Sequence Assembly Algorithms For Next-generation Sequencing Technology Research
7	LM-Suffix: Research On Gene Sequence Index Structure Based On Suffix Tree
8	Optimization And Implementation Of Lossless Compression Of Gene Sequencing Data
9	Research On Cloud Platform Oriented Efficient Storage Compression Of Bioinformatics Data
10	A Lossless Compression Method Of Seismic Data