| Sequencing technology has developed from first-generation Sanger sequencing to the era of third-generation sequencing (TGS). With the development of bioinformatics, third-generation sequencing has surpassed second-generation sequencing to become the mainstream research direction, and has in turn promoted the development of bioinformatics. However, the characteristics of third-generation sequencing data bring many new challenges. The mismatch between the volume of sequence data generated and the storage capacity of databases, and the mismatch between the growth rate of data and the growth of computing capacity, are urgent problems to be solved. To cope with the rapid growth of sequencing data, compressing the short-read data generated by sequencing is an effective approach compared with increasing storage capacity or reducing data generation. By analyzing the existing mainstream second-generation compression algorithms, this paper proposes a compression algorithm designed specifically for third-generation resequencing data. Building on this algorithm, an improved decompression algorithm implements partial (local) decompression, and an insertion compression algorithm is designed. The main contents of this paper are as follows. (1) This paper discusses the development of sequencing technology, compares second-generation and third-generation sequencing technologies and the characteristics of the data they generate, analyzes the mainstream storage formats for sequencing data, and studies in depth the structure of gene sequencing data and compression algorithms for biological data, which lays the foundation for the subsequent algorithm design. (2) LYZIP, a compression framework for third-generation resequencing data, is designed; it adopts an appropriate compression strategy for each data stream. For the base sequence, which accounts for the largest proportion of resequencing data, a compression algorithm, TPBWT, based on a self-index inverse prefix sequence transformation of the data stream, is proposed. Based on this compression algorithm, the decompression process is improved and a partial decompression algorithm is proposed, which makes it easier for downstream software to display and analyze compressed files. Experiments show that the algorithm has good compression performance, which paves the way for the insertion compression algorithm that follows. (3) To address the slow sorting of resequencing data, an insertion compression strategy is proposed and the related algorithms are designed. On the basis of partial decompression, insertion operations make it possible to omit the sorting process, which reduces both the processing time and the compression time of sequencing data. Experimental results show that insertion compression takes less time than compression that sorts the sequences first. In summary, this paper designs LYZIP, a compression framework for resequencing DNA sequence data. On the basis of this framework, partial decompression is implemented, and insertion compression is then built on top of it. Experiments show that these algorithms have good compression performance, can help address the storage problem of DNA sequencing data, and provide a reference for the compression of third-generation or more advanced sequencing data in the future. |