Research Of Classifying Large-scale Metagenomics Data Based On Next-generation Sequencing

Posted on: 2019-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Jiang
Full Text: PDF
GTID: 2370330566980050
Subject: Computer application technology
Abstract/Summary:
Metagenomics has enabled new discoveries and plays an irreplaceable role in the study of microorganisms, because it investigates microbial DNA sequences recovered directly from environmental samples, without requiring laboratory culturing. Because the DNA is obtained directly from environmental samples, the sequences of all the microbes present are mixed together. One fundamental task in metagenomic analysis is therefore to determine the taxonomy of DNA sequence fragments. However, with the application of next-generation sequencing technology, metagenomic datasets keep growing, and classifying them becomes increasingly difficult. At the same time, the classification of metagenomic data directly affects subsequent metagenomic research. How to classify metagenomic sequences efficiently and effectively has thus become an important problem. The available methods for classifying metagenomic data (also known as metagenomic binning methods) fall into two categories: supervised and unsupervised. The main contributions of this thesis are:

(1) For supervised methods, we propose EnSVMB, a metagenomic fragment classification method that uses an ensemble of SVMs together with BLAST to classify fragments accurately. EnSVMB first trains multiple linear SVMs on different k-mer features, then integrates them and divides the fragments into a confident set and a diffident set. Empirical study shows that accuracy on the confident set is significantly better than on the diffident set. In particular, the accuracy, sensitivity and specificity of EnSVMB are higher than 95%, 90% and 97%, respectively, on the confident set, but lower than 88%, 60% and 75% on the diffident set. To further improve performance on the diffident set, EnSVMB uses the best BLAST hits to reclassify the fragments in that set. Results show that EnSVMB can efficiently and effectively divide fragments into confident and diffident sets, and that it achieves higher accuracy, higher sensitivity and more true positives than related state-of-the-art methods at a comparable runtime cost.

(2) For unsupervised methods, we propose BMC3C, a method based on ensemble clustering and graph partitioning. In addition, we incorporate a new metagenomic feature, codon usage, into BMC3C. BMC3C begins by searching for a proper number of clusters and then repeatedly applies k-means clustering with different initializations to cluster contigs. A weighted graph, in which each node represents a contig, is then derived from these clusterings. If two contigs are frequently grouped into the same cluster, the weight between them is high; otherwise it is low. Finally, BMC3C uses a graph partitioning technique (Normalized Cut) to partition the weighted graph into subgraphs, each corresponding to a cluster. We conduct experiments on both simulated and real-world datasets to evaluate BMC3C and other state-of-the-art tools. Results show that BMC3C outperforms these tools. To the best of our knowledge, this is the first time that codon usage features and ensemble clustering have been used in unsupervised metagenomic classification. Results also show that ensemble clustering and codon usage features both contribute to the improved performance of BMC3C.
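The confident/diffident split described for EnSVMB in contribution (1) can be illustrated with a minimal sketch. The abstract does not state how the SVM outputs are integrated, so the agreement rule below (a fragment is confident when a minimum fraction of ensemble members vote for the same label) is an assumption, and the function names `kmer_frequencies` and `split_by_agreement` and the parameter `min_agreement` are hypothetical. The SVM training and the BLAST reclassification step are not shown.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return the normalized k-mer frequency vector of seq over the ACGT alphabet.

    K-mers containing other symbols (for example N) are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def split_by_agreement(votes, min_agreement=1.0):
    """Split fragments into confident and diffident sets.

    votes maps a fragment id to the list of labels predicted by the
    ensemble members (one label per member, e.g. one SVM per k-mer feature).
    ASSUMPTION: the agreement rule is illustrative, not the thesis's rule.
    """
    confident, diffident = {}, []
    for frag_id, labels in votes.items():
        label, n = Counter(labels).most_common(1)[0]
        if n / len(labels) >= min_agreement:
            confident[frag_id] = label      # keep the ensemble label
        else:
            diffident.append(frag_id)       # candidate for BLAST reclassification
    return confident, diffident
```

In this sketch, only the fragments returned in the diffident list would be passed on to a BLAST search, which keeps the more expensive alignment step limited to the fragments the ensemble disagrees on.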
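The weighted graph used by BMC3C in contribution (2) can be sketched as a co-association matrix. The abstract only says that the weight is high when two contigs are frequently grouped together, so defining the weight as the fraction of k-means runs in which the two contigs share a cluster is an assumption, and the function name `coassociation_weights` is hypothetical. The k-means runs themselves and the Normalized Cut partitioning are not shown.

```python
def coassociation_weights(runs):
    """Build an n x n weight matrix from repeated clustering runs.

    runs is a list of label lists, one per k-means run; runs[r][i] is the
    cluster assigned to contig i in run r.
    ASSUMPTION: weight = fraction of runs in which contigs i and j share
    a cluster. The diagonal is left at 0.0 (no self-edges).
    """
    n = len(runs[0])
    weights = [[0.0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    weights[i][j] += 1.0
                    weights[j][i] += 1.0
    return [[w / len(runs) for w in row] for row in weights]
```

Cluster labels only need to be consistent within a run, not across runs, because the matrix records co-membership rather than the label values. The resulting symmetric matrix is the kind of input a graph partitioning step such as Normalized Cut would take.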
Keywords/Search Tags: Metagenomics, SVM, BLAST, Ensemble clustering, Codon usage