Font Size: a A A

Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array

Posted on:2020-10-14Degree:MasterType:Thesis
Country:ChinaCandidate:W Q WuFull Text:PDF
GTID:2370330602959052Subject:Electronic and communication engineering
Abstract/Summary:PDF Full Text Request
In recent years,since the genomic data grows exponentially,a huge amount of data has been generated,result in that modern storage technologies are fail to meet the actual storage requirements of data sets.Therefore,it poses a huge challenge to the transmission and storage of data.Although there are now general-purpose data compression software that can compress genomic data,these compression tools cannot efficiently compress the characteristics of the genome.Therefore,it is necessary to explore some novel compression algorithms,which can compress gene sequence data efficiently,on the basic of the characteristics of gene sequence.In this paper,a lossless data compression algorithm based on reference genome is proposed.Using the high similarity of the genomes between the same species,the target genome sequence is replaced by that the fragment of the target genome which is same as the reference genome sequence,and the portion of the target sequence which is different with the reference genome sequence in length and position,which greatly reduce the size of the intermediate file.Considering that the proposed algorithm in this paper is applied by general computers,it is necessary to use a lightweight search engine to accomplish the matching task.Therefore,the suffix array technique is used as the index structure in the search for the above matching process.Although the suffix array takes up less memory,when the target characters determines the left and right boundaries,the algorithm has a large amount of computation,so the binary search method is used to reduce the algorithmic computation.Therefore,for the proposed algorithm of this paper,the suffix array technique and the binary search method both are used to find the longest match of the target sequence segment along the reference gene sequence.This article also improved the intermediate result file and improved the compression ratio of the intermediate file.Finally,the intermediate file is compressed by efficient entropy coding scheme.The experimental results shown that the proposed algorithm requires up to 20.5 minutes to compress the human complete genome sequence of the 3GB FASTA format file.After 30 groups of human genomes were compressed by the proposed algorithm,the size ranged from 4.99 MB to 29.96 MB,of which 21 groups were less than 20 MB.For memory requirements,the proposed algorithm requires a maximum of 1.5GB of memory capacity,which is far superior to the existing excellent HiRGC genome sequence data compression algorithm.It can be run on a normal home computer and be valuable in practical application scenarios.
Keywords/Search Tags:genomic sequence, suffix array, lossless compression, sequence matching
PDF Full Text Request
Related items