
Research On Lossless Compression Of High-throughput Genome Data

Posted on: 2020-08-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R J Wang
Full Text: PDF
GTID: 1360330590472807
Subject: Computer application technology
Abstract/Summary:
With the rapid development of high-throughput genome sequencing technologies and the sharp decline in sequencing costs, genome sequencing data and the genome sequence data assembled from it have grown exponentially. How to store and transmit these massive volumes of high-throughput genome data effectively is an urgent problem in biomedicine and bioinformatics. Genome data compression has become an effective way to address it: compressing genome data efficiently so that it requires less storage space and lower transmission cost. However, the high complexity of genome data, the high throughput of sequencing data, and the limitations of existing genome sequencing technologies pose great challenges for efficient and rapid compression of high-throughput genome data. This thesis focuses on lossless compression methods for genome sequence data and genome sequencing data. Its main contributions are as follows.

Firstly, current lossless compression methods for genome sequence data use a fixed context order to predict base probabilities. To address this shortcoming, the correlation between the first-order information entropy and the genome compression result is studied in depth, and an entropy-based lossless compression method for genome sequence data is proposed. The method calculates the first-order entropy of the genome sequence and uses it to dynamically determine the parameters of the finite-context model used for compression. Compression experiments on all bacterial genome data verified the effectiveness of the proposed method.

Secondly, existing lossless compression methods for genome sequence data use only part of the base information when predicting base probabilities, which limits prediction quality. A lossless compression method for genome sequence data based on deep learning is proposed. The method uses convolutional neural networks to extract local features from genome sequence data and recurrent neural networks to extract global features. It then fully integrates the local and global features to form a model that predicts base probabilities. Compression experiments on real human mitochondrial genome sequence data verified the effectiveness of the proposed method.

Thirdly, to address the bucket index errors caused by sequencing errors in current genome sequencing data, a lossless compression method for genome sequencing data based on sequencing error correction is proposed. The method analyses and corrects base errors in the sequencing data so that reads are allocated to more appropriate buckets. This increases the density of redundant data within each bucket and thereby improves the compression result. The effectiveness of the method was verified in compression experiments on five groups of real genome sequencing data.

Fourthly, bucket-based compression methods for genome sequencing data still rely on traditional text compression schemes and do not fully exploit the characteristics of genome sequencing data. To address this problem, a lossless compression method for genome sequencing data based on the de Bruijn graph is proposed. After granulating the sequence data, a de Bruijn graph is constructed and each read is represented as a path in that graph. Because the de Bruijn graph is constructed dynamically, the original graph no longer needs to be stored, which saves storage space and yields better compression performance. The effectiveness of the method was verified in compression experiments on eight groups of real genome sequencing data.
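The first contribution combines two standard ingredients: an entropy measured on the sequence, and a finite-context model whose order is the parameter being tuned. The Python sketch below illustrates both under stated assumptions and is not the thesis's implementation. It assumes that "first-order entropy" means the Shannon entropy of single-base frequencies, and that the finite-context model is an adaptive order-k model with additive smoothing. The names `base_entropy`, `fcm_code_length`, and the parameter `alpha` are hypothetical. The abstract does not say how the entropy is mapped to model parameters, so that step is left out.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def base_entropy(seq):
    """Shannon entropy, in bits per base, of the single-base frequencies."""
    total = len(seq)
    if total == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def fcm_code_length(seq, order, alpha=1.0):
    """Estimated bits needed to code seq with an adaptive order-k
    finite-context model (the first `order` bases are not counted).

    Each base is predicted from the `order` bases before it, using counts
    seen so far plus additive smoothing `alpha` (a hypothetical default).
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    bits = 0.0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        table = counts[context]
        seen = sum(table.values())
        p = (table[seq[i]] + alpha) / (seen + alpha * len(ALPHABET))
        bits -= math.log2(p)    # ideal arithmetic-coding cost of this base
        table[seq[i]] += 1      # update the model after coding
    return bits


if __name__ == "__main__":
    sample = "ACGTACGTTTGACCA" * 20
    print("entropy (bits/base):", base_entropy(sample))
    for k in (0, 2, 4):
        print("order", k, "estimated bits:", fcm_code_length(sample, k))
```

Comparing the estimated code length across several orders shows why a single fixed order is not ideal for every genome, which is the motivation the abstract gives for choosing the model parameters per sequence.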
Keywords/Search Tags: Genome data, Lossless compression, Entropy, Deep learning, Sequencing error correction, De Bruijn graph