| The gut microbiome profoundly affects human health and disease. A large number of studies have shown the feasibility of developing novel clinical interventions based on the gut microbiome. However, confusing or missing metadata in public databases, inconsistent analysis processes and results between studies, and confounding factors in gut microbiota analysis have greatly held back the discovery of disease biomarkers. To improve comparison between studies, a standardized pipeline is needed to minimize the influence of differing analysis methods on the results. We therefore created GMrepo (data repository for Gut Microbiota) and included detailed, manually curated metadata. Precomputed species/genus relative abundances, prevalence within and across phenotypes, and pairwise co-occurrence information are all available in GMrepo. Marker taxa identified from pairs of phenotypes on a per-study basis are also included to support inter-study and inter-disease comparisons. Furthermore, to help users quickly access the data of interest, we equipped GMrepo with a metadata-based graphical query builder (data selector) that lets users create complex and flexible queries with a few clicks. So far, GMrepo has collected 66,133 samples/runs from 295 projects, covering 94 phenotypes (healthy and diseases), and identified 291 disease-related marker species and 112 marker genera from the comparison of 16 pairs of phenotypes. All GMrepo data are available for download and can be accessed at https://gmrepo.humangut.info/home. To verify the clinical application value of the marker taxa, we also created a machine learning toolbox, GM.classify, to provide a unified analysis framework. GM.classify is easy to use and supports multi-threading. Its built-in functions include data preprocessing, feature selection, model training, model evaluation, and external validation. GM.classify supports the construction of binary
classification models using the random forest algorithm. This study focused on the identification of gut-microbiome-derived biomarkers for two common digestive diseases. Using differential abundance tests, we identified 25 biomarkers from 7 colorectal cancer studies that are consistent across multiple studies, and showed that the classification model constructed by GM.classify works well with these consistent biomarkers (average AUC: 0.84). For liver cirrhosis, we used GM.classify to obtain biomarkers that were consistent across multiple studies. However, careful examination revealed that the biomarkers of liver cirrhosis were strongly affected by treatment regimens. For example, a random forest model built on two genera known to be associated with drug treatment can accurately distinguish cirrhosis samples from controls (cross-validation AUC > 0.88). In conclusion, this study constructed a database with rich content and intuitive interfaces, and developed an easy-to-use machine learning toolbox. Furthermore, our analysis of two common digestive diseases supports the clinical application value of marker taxa and emphasizes the importance of controlling for confounding factors in microbiome research. |
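
The AUC values quoted above summarize how reliably a classifier's scores rank case samples above control samples. As a minimal sketch, independent of GM.classify (whose interface is not described here), the metric can be computed directly from its pairwise definition. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration only and are not data from this study.

```python
from itertools import product


def roc_auc(labels, scores):
    """Return the AUC as the probability that a randomly chosen case
    (label 1) receives a higher score than a randomly chosen control
    (label 0). Tied scores count as half a win."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score, control_score in product(cases, controls):
        if case_score > control_score:
            wins += 1.0
        elif case_score == control_score:
            wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: 1 = disease sample, 0 = healthy control.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical classifier outputs (e.g. predicted disease probability).
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. In the toy data, 8 of the 9 case/control pairs are ranked correctly.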