
The Research Of Metagenome Sequence Analysis Optimization

Posted on: 2021-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: K X Li
Full Text: PDF
GTID: 2370330614456800
Subject: Control theory and control engineering
Abstract/Summary:
Next-generation sequencing technology produces large volumes of sequence data, which brings unprecedented opportunities for life science research. However, because of sequencing error rates and limited computing resources, analyzing massive genome sequence data remains a major obstacle. SpaRC (Spark Read Clustering) is a pre-assembly tool built on the Apache Spark platform that clusters sequenced reads directly. SpaRC is highly scalable and can handle massive sequencing datasets by adding computing resources horizontally. However, SpaRC has some problems, such as difficult parameter selection and small clusters. To address these problems, this thesis optimizes and improves SpaRC so that it can be applied to large-scale sequencing datasets and can isolate sequence clusters at the level of individual genomes, thereby providing high-quality data for subsequent sequence assembly. The main research work of this thesis is as follows:

(1) To lay the groundwork for optimizing SpaRC, the theories and technologies involved in scalable metagenome sequence analysis are reviewed. This covers the representation and storage formats of metagenome sequencing data, the big data processing engine Apache Spark, the cloud computing platform AWS EMR, and the clustering principle of SpaRC.

(2) A parameter optimization strategy based on the Bayesian method is proposed to automatically select the optimal parameters for different datasets. SpaRC has many parameters, and their values strongly affect clustering performance, so it is difficult to select the optimal parameters for a specific dataset by hand. In this thesis, Bayesian parameter optimization based on a Gaussian process is used to search for the optimal parameters on a small dataset, and those parameters are then applied to the corresponding large dataset. Experimental results show that Bayesian parameter optimization can effectively improve SpaRC performance.

(3) A re-clustering strategy based on "global clustering" is proposed to merge clusters that come from the same genome into a larger cluster, and then to reconstruct the individual genome. The clusters produced by SpaRC are small, and each cluster corresponds to only part of a genome. This thesis introduces a global vector to re-cluster the SpaRC output. The experimental results show that global clustering can effectively recover individual genomes, which provides a new approach for discovering the genomes of new species.
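
The Gaussian-process-based parameter search described in item (2) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not name the acquisition function, the tuned parameters, or the quality metric, so the code below assumes an upper-confidence-bound acquisition, a single parameter normalized to [0, 1], and a hypothetical `toy_quality` function standing in for "run SpaRC on the small dataset and score the clusters".

```python
import math
import random

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar parameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)       # K^-1 y
    v = solve(K, k_star)       # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(1.0 - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)

def bayes_opt(objective, n_init=3, n_iter=12, kappa=2.0, seed=0):
    """Maximize objective on [0, 1]; returns (best_x, best_y, history)."""
    rng = random.Random(seed)
    grid = [i / 200 for i in range(201)]
    xs = [rng.random() for _ in range(n_init)]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        mean_y = sum(ys) / len(ys)
        centered = [y - mean_y for y in ys]   # GP assumes zero mean

        def ucb(x):
            m, s = gp_posterior(xs, centered, x)
            return m + kappa * s              # favor high mean or high uncertainty

        candidates = [g for g in grid if g not in xs]
        x_next = max(candidates, key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=ys.__getitem__)
    return xs[best], ys[best], list(zip(xs, ys))

def toy_quality(x):
    """Hypothetical stand-in for clustering quality on the small dataset."""
    return math.exp(-((x - 0.62) ** 2) / 0.02)

if __name__ == "__main__":
    best_x, best_y, _ = bayes_opt(toy_quality)
    print(f"best parameter = {best_x:.3f}, quality = {best_y:.3f}")
```

In the workflow the abstract describes, each call to the objective would correspond to one clustering run on the small dataset, and the best parameter found would then be reused on the corresponding large dataset.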
Keywords/Search Tags: Metagenome, SpaRC, Bayesian parameter optimization, Global clustering