
Metagenomic Contig Binning Based On Machine Learning

Posted on: 2021-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: C He
Full Text: PDF
GTID: 2370330629953812
Subject: Engineering
Abstract/Summary:
Metagenomics studies the genes and functions of microbes using DNA sequences extracted directly from environmental samples, and it has become one of the most important methods in microbiology. Contigs are the basic objects of metagenomic study, and contig binning can improve the completeness of the potential microbial genomes that are recovered. Contig binning is one of the most important problems in metagenomics, but the accuracy of current methods still needs improvement, and their running time is too long for practical bioinformatics studies. This thesis proposes a new contig binning method based on manifold learning and the K-Means clustering algorithm. The main contributions and conclusions are as follows:

(1) Contig feature engineering based on gradient boosting. To determine which contig sequence features matter most for contig binning, this thesis uses gradient boosting to evaluate and select them. Sequence features are first extracted from the contig data. A gradient boosting model is then trained on all of the extracted features and tuned by grid search. Feature importances are obtained from the trained model together with the proposed multi-component feature importance calculation method. The most important contig sequence feature is identified by analyzing the sorted importances: on the standard contig dataset Strain Mock, the 4-mer is selected as the most important feature for contig binning.

(2) Contig binning based on manifold learning and K-Means. Metagenomic contig features are hard to handle because of their high dimensionality, so this thesis studies a feature dimension reduction method based on manifold learning, which aims to obtain a low-dimensional manifold embedding of the binning features. Existing estimates of the number of bins are inaccurate and computationally inefficient, so this thesis estimates a bin number much closer to the true one using the manifold embedding and Bayesian inference. Specifically, a Bayesian-inference Gaussian mixture model clustering algorithm efficiently and precisely estimates the initial bin number k, and the final estimated bin number K is then obtained from a series of K-Means clusterings scored with the silhouette index. Because current metagenomic contig binning methods often suffer from low precision and low efficiency, this thesis implements a fast contig binning method that applies K-Means clustering to the manifold embedding of the binning features, using the bin number K estimated above, to produce the binning results.

Experiments and analysis are conducted to evaluate the method in (2). Twelve bin number estimation methods are compared, and the efficiency of the Bayesian-inference-based bin number estimation method is confirmed. For contig binning, the proposed method is compared with three other binning methods on the metagenomic benchmark datasets Strain Mock and Species Mock. The experimental results show that the proposed method performs better than MetaBAT, COCACOLA and SolidBin in terms of ACC, ARI and NMI. On Species Mock it reaches 0.99864, 0.99813 and 0.99723 in ACC, NMI and ARI respectively, and running time efficiency is improved by 80%. The machine-learning-based metagenomic contig binning method proposed in this thesis satisfies the demands of practical application.
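As an illustration of the 4-mer sequence feature mentioned in contribution (1), the following is a minimal sketch of how a 4-mer frequency vector could be computed for a single contig. It is not the thesis's implementation: the function name `kmer_frequencies`, the choice to skip windows containing ambiguous bases such as N, and the normalization to relative frequencies are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=4):
    """Return relative k-mer frequencies in a fixed lexicographic k-mer order.

    Hypothetical helper for illustration only; the thesis does not specify
    its exact feature extraction procedure.
    """
    # All 4**k possible k-mers over ACGT, in a stable order (AAAA, AAAC, ...).
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows with non-ACGT characters (e.g. N) are ignored,
    # because only k-mers from the fixed list are looked up.
    valid = [counts.get(km, 0) for km in kmers]
    total = sum(valid)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in valid]


# Toy contig, made up for this example.
contig = "ACGTACGTNACGT"
features = kmer_frequencies(contig)
```

With k = 4 every contig maps to a 256-dimensional vector regardless of its length, which is the kind of high-dimensional feature that the dimension reduction step in contribution (2) would then embed into a lower-dimensional space.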
Keywords/Search Tags:Metagenomics, machine learning, manifold embedding, gradient boosting, bayesian inference