
Feature Selection And Disease Prediction Of Intestinal Metagenomic Data Based On Machine Learning

Posted on: 2021-07-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D F Li
Full Text: PDF
GTID: 1480306575454274
Subject: Computer Science and Technology
Abstract/Summary:
Driven by next-generation sequencing technology, metagenomics research has undergone tremendous changes in both breadth and depth. With the rapid increase in metagenomic data size, various machine learning methods have been adopted to perform different tasks on large metagenomic datasets. Human gut metagenomic data is characterized by small sample sizes, high dimensionality and complicated relationships with hosts, so several issues should be addressed in microbiota feature selection and disease prediction analysis: (1) Diseases such as colorectal cancer are strongly affected by age, gender and similar factors, yet current feature selection procedures either ignore such confounding factors or treat them as ordinary variables, which may cause false positive results, poor interpretability and poor prediction performance. (2) For chronic metabolic diseases such as obesity and hyperlipidemia, which are influenced by genetics, lifestyle, diet and other factors, classification performance is not ideal, with accuracy on validation sample sets lower than 0.7, which limits the clinical application of gut metagenomic data. (3) Many researchers find it hard to afford sequencing and analysis costs and therefore need to make effective use of published control samples, so the question of how many cases must be matched to achieve good classification performance needs to be discussed. This dissertation addresses these issues in the study of human intestinal metagenomics, taking real practical demands into account.

To address the increase in false positive features caused by confounding factors, a feature screening method based on a causal inference model was proposed for intestinal metagenomic data. In this method, an explicit expression for the causal risk ratio (CRR) that accounts for confounding factors was derived from the causal inference model, and disease-related features were screened more accurately according to their CRR values. The results showed that: (1) On colorectal cancer related gut metagenomic data affected by age, all 402 features selected by CRR values differed significantly between the case and control groups (Wilcoxon test, p < 0.01), whereas among the top 402 features selected by the generalized additive model (GAM) method, which controls for the confounding factors, only 37.8% differed significantly between groups. (2) Features screened by each method were then used to build disease prediction models. In three independent validation datasets, the average AUC values of the model based on CRR features were 0.928, 0.886 and 0.849, all significantly higher than those of the model based on GAM features (0.885, 0.852 and 0.775, respectively; t test, p < 0.01).

To address the poor prediction performance in chronic metabolic disease research, it was proposed that features engineered from multiple taxonomy levels be added to the predictive model. According to the taxonomic system established by Linnaeus, microorganisms can be classified at six levels: phylum, class, order, family, genus and species. At present, only genus-level features are typically used, and data from the other taxonomy levels is not effectively utilized. In this method, features from the other taxonomy levels were added to the genus-level features and combined with a logarithmic transformation of the metagenomic abundance values to build the disease prediction model. Seven specific schemes were set up, and their improvement was evaluated on four predictive models. The results showed that the best of the seven schemes was the combination of multiple taxonomy levels after taking the logarithm of the abundance values. Among the four models, the SVM showed the largest increase in AUC, reaching 9%, followed by the L1-regularized regression model at 6%. For the other two models, the increase was between 1% and 3.9% because their initial AUC values were already high.

To determine the minimum number of case samples needed to match published control samples, the influence of different group proportions on disease prediction was studied systematically using methods for imbalanced datasets. First, a control sample set with a large sample size was constructed. Then three datasets from different sources were used to evaluate three approaches for handling imbalanced data: the deep factorization machines algorithm (Deep FM), the synthetic minority oversampling technique (SMOTE) and random under-sampling. The results show that SMOTE over-sampling and Deep FM were each applicable to different datasets, so the choice between them should be made in the actual study. Random under-sampling was suitable for all three disease datasets, which indicates that dataset imbalance has a great influence on prediction from intestinal metagenomic data. It is therefore suggested that, provided the dataset is balanced, the disease group should contain at least 30 samples, and that disease prediction performance tends to be stable with more than 60. This conclusion was applied to the project design of a gut metagenomics study of childhood undernutrition in China. In that study, 65 disease samples and 61 healthy samples were collected, and the AUC of disease prediction using random forest was 0.9.
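
As an illustration of the random under-sampling idea described above, here is a minimal sketch in plain Python. It is not the dissertation's implementation: the function name `random_undersample`, the toy class sizes (30 disease samples against 200 controls) and the random 5-value abundance vectors are all assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes until every class
    has as many samples as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is cut down to the size of the smallest class.
    target = min(len(idxs) for idxs in by_class.values())

    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))  # sampling without replacement
    kept.sort()  # preserve the original sample order

    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (assumed): 30 disease samples and 200 published controls,
# each represented by 5 made-up abundance values.
data_rng = random.Random(42)
labels = ["disease"] * 30 + ["control"] * 200
samples = [[data_rng.random() for _ in range(5)] for _ in labels]

balanced_samples, balanced_labels = random_undersample(samples, labels, seed=1)
print(Counter(balanced_labels))
```

After the call, both classes contain 30 samples, so a classifier trained on `balanced_samples` sees a balanced dataset. Repeating the draw with different seeds and averaging the resulting AUC values is one common way to reduce the variance that comes from discarding controls at random.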
Keywords/Search Tags: Human gut metagenomics, Machine learning, Causal inference model, Feature engineering, Disease predictive modeling, Imbalanced dataset