
Research On Compression Of Large Collections Of Genomes And Its Parallelization Algorithm

Posted on: 2022-02-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H C Yao
Full Text: PDF
GTID: 1480306557962899
Subject: Information security

Abstract/Summary:
Genomic data is highly valued by the international community for its social and scientific importance. With the continuous advancement of genome sequencing technology, sequencing speed keeps increasing while its cost keeps decreasing. Human genomic data is therefore growing exponentially, and this growth will accelerate further. Because it far outpaces the growth of storage capacity and network bandwidth, the storage, backup and migration of genomic data face great challenges, and genome compression algorithms are an important way to alleviate them. Genomic data records all the genetic information of an organism; it is a key national resource, and its security is an important part of national security. Therefore, while improving the efficiency of genomic data storage, it is also essential to ensure its safety.

This thesis aims to store large collections of genomes more efficiently and safely. Through an in-depth study of existing genome compression algorithms at home and abroad, it addresses, step by step, the core matching algorithm of genome compression, compression of large genome collections, single-machine multi-threaded parallel compression, distributed parallel compression, and genomic stream data compression. Specifically, the thesis makes the following three innovative contributions:

(1) For the core algorithm of genome compression, this thesis proposes MLGM (Maximum Likelihood Genome Matching), a genome sequence matching algorithm based on a maximum likelihood model that improves the compression ratio achievable by existing matching algorithms. Taking into account the biological characteristics of genome sequences, three types of mutations are widespread among them: insertion, deletion and substitution. Unlike maximum exact matching algorithms, MLGM uses storage overhead, rather than fragment length, as the criterion for selecting matching fragments. After exact matching fragments are found, they serve as anchor points from which fragments are united by comparing INDELs and substitutions; the union iterates until the optimal maximum likelihood matching fragment is found. Compared with the exact matching used by existing genome compression algorithms, MLGM can represent the difference information between sequences with less storage overhead. Experimental results show that MLGM improves the average compression ratio by 16.64% over the matching algorithm used by memRGC, the best existing genome compression algorithm.

(2) On the basis of MLGM, HRCM, an efficient compression algorithm for large genome collections based on a two-dimensional hash matrix, is proposed, filling the lack of an efficient lossless compression algorithm for multiple sequences. Before first-order compression, HRCM extracts the basic DNA bases and the auxiliary information of each sequence file separately, which improves matching speed and reduces memory consumption. After first-order compression, the results are not stored directly; instead, the first-order results of selected sequences are used to build a two-dimensional hash matrix for second-order compression, and the first-order results of all to-be-compressed sequences are compressed again against this hash index. Repetitions among the first-order results of the individual sequences are removed, and an efficient representation of the first-order and second-order results is designed, greatly reducing the total storage overhead of the compression results. The second-order compression mechanism also reduces the influence of reference selection. Moreover, HRCM adopts a pipeline compression mechanism whose memory footprint is independent of the number of to-be-compressed sequences, which enables it to compress large collections of genome sequences. Experimental results show that when compressing 1,100 human genomes, the average compression ratio of HRCM reaches 2347:1, 5.8 times that of the best genome compression algorithm memRGC. Moreover, because the reference sequence is preprocessed and its index built only once, HRCM compresses 46.4% faster than the fastest compression algorithm HiRGC, and its robustness is the best among all algorithms compared.

(3) To address the long compression time of large genome collections, several efficient parallel compression algorithms are proposed. By analysing the dependent and parallelizable parts of the collection compression algorithm, the genomic data flow and the algorithm are redesigned around a variety of parallelization technologies, yielding MtGC, a multi-threaded parallel compression algorithm for a single machine; HadoopGC, a Hadoop-based compression algorithm; and SparkGC, a Spark-based compression algorithm for distributed file systems. The three parallel algorithms solve different problems and complement each other. MtGC speeds up single-machine compression by compressing non-reference sequences in parallel with a thread pool. HadoopGC solves genome compression on Hadoop, the most widely used distributed computing platform; it designs the genomic data input and output algorithms for the distributed file system, the data distribution strategy on the computing platform, and the memory and file management algorithms. First-order compression is completed in the Map phase and second-order compression in the Reduce phase, each phase using multi-node computing power to achieve parallelism. SparkGC builds on Spark's in-memory distributed data sets and real-time stream processing; it supports not only batch compression of large genome collections but also compression of genomic data streams. SparkGC combines Spark operators with the collection compression algorithm, realizes distributed parallel compression through RDD transformations, and designs RDD memory caching and serialization mechanisms to improve memory utilization. Experimental results show that with a thread pool of size 4, MtGC compresses 1,100 human genomes 2.8 times faster on average than HRCM; on four computing nodes, HadoopGC is 3.3 times faster than HRCM; and SparkGC is 3.85 times faster than HRCM on a single working node and 10 times faster on four, shortening the time to compress 1,100 human genomes from 60 hours to 6 hours with excellent scalability. Moreover, the design and theoretical analysis of the three parallel strategies can also guide the parallelization of other genome processing algorithms.

The research results of this thesis solve the problems of batch compression of large genome collections, genomic stream data compression, and their parallelization. They improve the compression ratio, compression speed and applicability for large genome collections, show excellent robustness and scalability, and will alleviate the genomic data explosion to a certain extent. They also solve the problem of genomic data storage on big-data machines built with domestic processors: the domestic processor ensures independent control of the storage device, and referential genome compression helps ensure the safety of the genomic data.
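The anchor-then-extend idea behind reference-based matching can be sketched as below. This is an illustrative sketch only, not MLGM itself: the function names, the k-mer length and the per-triple cost constant are hypothetical, and the cost function merely illustrates the principle of judging fragments by storage overhead rather than by raw length.

```python
from collections import defaultdict

def find_anchors(ref, target, k=8):
    """Find exact matches between ref and target to use as anchor points.

    Returns a list of (ref_pos, target_pos, length) triples: each k-mer
    hit in the reference index is greedily extended to a maximal exact
    match, in the spirit of anchor-based genome matching.
    """
    index = defaultdict(list)              # k-mer -> positions in ref
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    anchors = []
    j = 0
    while j <= len(target) - k:
        hits = index.get(target[j:j + k])
        if hits:
            i, length = hits[0], k
            # Extend the exact match to the right as far as it goes.
            while (i + length < len(ref) and j + length < len(target)
                   and ref[i + length] == target[j + length]):
                length += 1
            anchors.append((i, j, length))
            j += length
        else:
            j += 1
    return anchors

def encoded_cost(anchors, target_len, triple_cost=12):
    """Storage-overhead criterion (hypothetical constants): each stored
    (pos, pos, length) triple costs triple_cost units, and every target
    base not covered by a match costs one literal unit."""
    matched = sum(length for _, _, length in anchors)
    return triple_cost * len(anchors) + (target_len - matched)
```

Under this criterion, a single long fragment that tolerates a few substitutions can beat several short exact fragments, which is the intuition behind uniting anchors across INDELs and substitutions.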
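The effect of a second-order pass over first-order results can be illustrated with a minimal sketch, assuming first-order compression yields a list of (ref_pos, length) match tuples per sequence. HRCM's actual two-dimensional hash matrix matches partial runs across sequences; this hypothetical version only deduplicates whole result lists, which is enough to show why restating near-identical sequences against each other shrinks the output.

```python
def second_order_compress(first_order_results):
    """Deduplicate repeated first-order results across sequences.

    first_order_results: dict mapping sequence id -> list of
    (ref_pos, length) match tuples. A result list already emitted by an
    earlier sequence is replaced with a reference to that sequence.
    """
    seen = {}     # result run (as a tuple, so it is hashable) -> owner id
    output = {}
    for seq_id, matches in first_order_results.items():
        key = tuple(matches)
        if key in seen:
            output[seq_id] = ("ref", seen[key])   # point at earlier copy
        else:
            seen[key] = seq_id
            output[seq_id] = ("raw", matches)     # store the run once
    return output
```

Because human genomes are highly similar, many sequences produce the same (or overlapping) first-order runs against a common reference, so the second-order pass pays for its index many times over; it also blunts the effect of an unlucky reference choice, since redundancy between the to-be-compressed sequences themselves is still captured.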
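The thread-pool strategy used by the single-machine parallel algorithm can be sketched as follows, with `zlib` standing in for the real reference-based compressor and all names hypothetical; the pool size of 4 mirrors the reported MtGC experiment.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

def compress_one(item):
    """Compress one named sequence; zlib is a stand-in for the actual
    reference-based compressor applied to each non-reference sequence."""
    name, seq = item
    return name, zlib.compress(seq.encode())

def parallel_compress(sequences, workers=4):
    """Compress non-reference sequences in parallel with a thread pool.

    sequences: dict of name -> DNA string. Each sequence is an
    independent task, so the pool keeps all workers busy until the
    collection is exhausted.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(compress_one, sequences.items()))
```

The design point is that once the shared reference index is built, compressing each remaining sequence is independent work, which is exactly the shape of problem a thread pool (or, at larger scale, Map tasks and RDD partitions) parallelizes well.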
Keywords/Search Tags:Genome compression, Parallelization, Multi-threading, Parallel computing, Genome security