Font Size: a A A

Large Scale Ultra-long Biological Sequence Clustering

Posted on:2021-04-28Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z K YinFull Text:PDF
GTID:1360330632957838Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the development of the new sequencing technology,the data generating speed has been faster than the famous Moore's Law,and genomic data is now growing at a rate of more than 10 times over 12-18 months.More than 1.5 million people will be sequenced during China's 13th Five-Year Plan period,there will be 300-500GB of sequencing data for each people,thus,there will be about EB data in total.In metagenomic analysis,the volume of sequencing data of a soil sample could be as large as 50 TB.How to efficiently process this large-scale biological data is the main challenge.Biological sequence clustering is of vital importance in modern biology and life science.For example,clustering is of great significance in redundant removal,sequence classification and species analysisThe research contents in this dissertation focus on large-scale biological sequence clustering,and speeding up this process by high-performance computing technologies We have solved four research problems(1)The I/O efficiency of processing large-scale biological data:The first problem we need to solve is how to efficiently parse the sequence data when processing large-scale data.This is the very first step of sequence cluster-ing.Thus,we have proposed a high-performance I/O framework for processing biological sequence data.This framework is designed for multi-core platforms and supports the widely used FASTA and FASTQ formats.And the framework has been used in RabbitMash to eliminate the I/O performance bottleneck.Rab-bitMash works as one of the most important kernels in our clustering analysis.In addition,this framework has also been applied to RabbitQC as a case study Thanks to this I/O framework,RabbitQC outperforms other competitors(2)The similarity estimation of ultra-long sequences:In order to accelerate sequence similarity estimation,we have proposed Rabbit-Mash,a highly tuned and efficient sequence similarity estimator on modern multi-core platforms.RabbitMash is based on MinHash algorithm and uses Jaccard Index to measure sequence similarities.MinHash algorithm is far more efficient than alignment-based algorithms especially when dealing with long sequences.RabbitMash takes full advantage of modern hardware including multi-threading,vectorization and fast storage devices.RabbitiMash is able to calculate the dis-tance matrix of 1.1TB genomic data in 5 minutes,with is 2-3 orders of magnitude faster than the alignment-based methods.Furthermore,we have developed the RabbitSketch library based on RabbitMash.Besides the MinHash algorithm,RabbitSketch also supports other popular sketching algorithms such as HistoS-ketch,OrderMinHash,and HyperLogLog.RabbitSketch provides both C++and Python APIs for users with different research background(3)The extensibility of hierarchical clustering for large-scale data:In this dissertation,we have introduced a hierarchical clustering method based on sparse distance matrix which supports ultra-long sequences.In this method,firstly,RabbitMash is used to calculate sparse distance matrix,then the sparse matrix is sorted in ascending order according to distance,and finally we use the hcluster algorithm to cluster the sorted sparse matrix.In addition to calculating the sparse distance matrix,another performance bottleneck in this method is sorting the sparse distance matrix.To eliminate this performance bottleneck,we propose a hybrid sorting algorithm based on merging and sorting network(4)Porting greedy incremental clustering algorithm to distributed systems:We have also introduced a greedy incremental clustering framework on distributed systems in this dissertation.Normally the scalability of greedy incremental clus-tering is limited by the insufficient single-node resources.This framework breaks this limitation by mapping this greedy clustering algorithm to distributed com-puting clusters.Our strong scaling test shows that the framework has achieved a near linear speedup on a cluster with 10 nodes(200 CPU cores).In conclusion,we have proposed several methods for large-scale biological sequence clustering.And our methods are able to process 1.1TB sequence data in hours.In addition,we have also proposed a series of applications and libraries we used in our clustering method.And these applications and libraries can serve as useful resources for accelerating other large-scale genome analysis tasks in bioinformatics.
Keywords/Search Tags:Clustering, High-Performance Computing, Bioinformatics, Sketch Algorithms, SIMD
PDF Full Text Request
Related items