
Simulation And Strategy Of Association Analysis Methods In Zero-Inflated Microbiome Data

Posted on: 2024-07-27
Degree: Master
Type: Thesis
Country: China
Candidate: Z Fan
GTID: 2530306923956449
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background: The collective genome of the microbial community constitutes an extended human genome, providing genetic and metabolic capabilities that humans are not born with. Researchers often use correlation methods to explore the relationship between microorganisms and host genomes or metabolomes. However, real microbiome data have unique and complex characteristics. First, microbiome data are usually sparse: many observations for each microorganism are zeros, a property known as zero inflation. Second, microbiome data may be overdispersed, that is, the variance of a variable is greater than its mean. Third, nonlinear relationships between microorganisms and other omics features take various functional forms. Finally, there is the "curse of dimensionality": the number of variables is larger, or much larger, than the number of samples. These characteristics may not only reduce the power of correlation methods but also lead to biased or wrong results. We designed a comprehensive parametric simulation study to evaluate the performance of five correlation methods on data with different characteristics: the Pearson product-moment correlation coefficient, the Spearman rank correlation coefficient, the zero-inflated negative binomial (ZINB) model, the Maximal Information Coefficient (MIC) and mutual information (MI). Based on the simulation results, we developed a framework, "Corr-ZI", for association analysis in microbiome data with different data characteristics. The performance of Corr-ZI was further evaluated using the iHMP and TCGA datasets. We also explored the relationships between the microbiome, metabolome and genome in these two real datasets.

Methods: In the simulation study, we generated microbiome data based on negative binomial regression and logistic regression, and generated normally distributed data as the other variable in the correlation analysis. The methods evaluated included the linear correlation analysis methods Pearson, Spearman and the ZINB model, and the nonlinear correlation analysis methods MIC and MI. We simulated a comprehensive set of scenarios: (1) different proportions of zero observations (10%-90%); (2) different degrees of dispersion; (3) different functional relationships (linear, quadratic, sigmoid, sine and cosine); (4) different sample sizes; (5) different covariate effects (-0.9 to 0.9); (6) different distributions of zero values. The evaluation criteria included the true positive rate (TPR), the false positive rate (FPR), receiver operating characteristic (ROC) curves and the area under the curve (AUC). Based on the simulation results, we designed the "Corr-ZI" framework to determine the most appropriate method for correlation analysis in microbiome data. The five correlation methods and Corr-ZI were further evaluated on public databases, namely the Integrative Human Microbiome Project (iHMP) and The Cancer Genome Atlas (TCGA). Case study one used microbiome and metabolome data from the iHMP for inflammatory bowel disease, with 69 samples covering 459 metabolites and 69 microbial genera. Case study two used microbiome and genome data for colorectal cancer from the TCGA database, with 110 samples covering 300 differentially expressed genes and 1386 microorganisms.

Results: The simulation study showed that the TPR of each method decreased as the zero inflation rate (ZIR) and the dispersion increased. In the linear scenario, the average TPR of the ZINB model across ZIR scenarios was 14% and 39% higher than that of Pearson and Spearman, respectively. When ZIR = 50% and dispersion = 1, the TPR of the ZINB model, Pearson and Spearman was 0.86, 0.67 and 0.18, respectively. When ZIR = 20% and dispersion = 1, the TPR of the ZINB model, Pearson and Spearman was 0.92, 0.87 and 0.61, respectively. In addition, the performance of Pearson and Spearman was sensitive to the distribution of zero observations. The TPR of Pearson and Spearman was higher when the signs of β and y were equal; when the signs of β and y were not equal, the TPR of the two methods decreased by about 24% and 28%, respectively. In the nonlinear scenario, the MI method had a higher TPR than MIC when the ZIR was greater than 40%. When the dispersion was 1 and the ZIR was 30%, 50% and 70%, the TPR was 0.87, 0.29 and 0.08 for the MIC method and 0.62, 0.50 and 0.34 for the MI method, respectively. The FPR of all five correlation methods was controlled at around 0.05 across ZIRs and dispersions. According to the ROC analysis, the ZINB model and Pearson had a higher AUC than the other three non-parametric methods in the linear situation, whereas the MIC and MI methods performed better in the nonlinear situation. Based on these simulation results, we designed the "Corr-ZI" framework. The performance of the five methods and the Corr-ZI framework was evaluated on two datasets. The microbiome data in the iHMP and TCGA showed high and low zero inflation, respectively. Corr-ZI found 586 and 10757 pairs of significant associations between microorganisms and other features in the iHMP and TCGA data, respectively, more than any of the five individual correlation methods. Furthermore, after correction for the false discovery rate, Corr-ZI found significant associations that were overlooked by the other methods. For example, only the Corr-ZI framework found correlations between palmitoylethanolamide and Bifidobacterium, Parasutterella, Akkermansia, Bacteroides and Escherichia_Shigella, among others. Previous studies have demonstrated that these associations are of practical relevance. In addition, the relationships between microorganisms and metabolites took various functional forms. The "U-shaped" correlation could be detected by the MIC and MI methods.

Conclusions: The simulation studies showed that the ZINB model controlled the FPR well and had a high TPR in data with a high ZIR. The ZINB model and the Pearson method had similar TPRs when the ZIR was less than 20%. The TPR of the MIC was sensitive to the ZIR, and the MI method was more suitable for nonlinear association analysis when the ZIR was greater than 40%. The MIC and MI methods were more suitable for nonlinear association analysis, while the parametric statistical methods, the ZINB model and Pearson, were more suitable for linear association analysis. Overall, the performance of all five methods decreased in microbiome data with high zero inflation and dispersion, which makes it difficult to choose a correlation method for practical analysis. Corr-ZI is able to handle complex microbiome data characteristics and integrates the output of the more powerful correlation method for the situation at hand. In both case studies, the Corr-ZI framework showed higher performance and found more meaningful correlations in data with either high or low zero inflation.
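
To make the simulation design described in the abstract more concrete, the following is a minimal Python sketch of one linear scenario: zero-inflated, overdispersed counts are generated against a normally distributed covariate, and the rejection rates of the Pearson and Spearman tests are estimated over repeated draws. It is an illustration under stated assumptions, not the thesis code. The function names (`simulate_zinb`, `rejection_rate`), the intercept of 1.0, the gamma-Poisson construction of the negative binomial, and the choice to draw excess zeros independently of the covariate are all assumptions; the abstract describes a logistic regression for the zero part, which this sketch simplifies to a constant probability. The ZINB model, MIC and MI are not included.

```python
import numpy as np
from scipy import stats


def simulate_zinb(n, beta, zero_prob, dispersion, rng):
    """Return (x, y) for one simulated dataset (hypothetical helper).

    x ~ N(0, 1). The count part is negative binomial with log-mean
    1.0 + beta * x (intercept 1.0 is an assumption) and variance
    mu + dispersion * mu**2. Each count is then set to zero with
    probability zero_prob, independently of x (a simplification).
    """
    x = rng.normal(size=n)
    mu = np.exp(1.0 + beta * x)
    shape = 1.0 / dispersion
    # Gamma-Poisson mixture: gamma rates with mean mu give NB counts.
    lam = rng.gamma(shape, mu / shape)
    y = rng.poisson(lam)
    y[rng.random(n) < zero_prob] = 0  # excess (structural) zeros
    return x, y


def rejection_rate(n_rep, alpha=0.05, seed=0, **sim_kwargs):
    """Fraction of replicates in which each test rejects at level alpha.

    With beta != 0 this estimates the TPR; with beta == 0 it estimates
    the FPR.
    """
    rng = np.random.default_rng(seed)
    hits = {"pearson": 0, "spearman": 0}
    for _ in range(n_rep):
        x, y = simulate_zinb(rng=rng, **sim_kwargs)
        if stats.pearsonr(x, y)[1] < alpha:
            hits["pearson"] += 1
        if stats.spearmanr(x, y)[1] < alpha:
            hits["spearman"] += 1
    return {name: count / n_rep for name, count in hits.items()}


if __name__ == "__main__":
    for zero_prob in (0.2, 0.5, 0.8):
        rates = rejection_rate(
            n_rep=200, n=100, beta=0.5, zero_prob=zero_prob, dispersion=1.0
        )
        print(f"excess-zero probability {zero_prob}: {rates}")
```

Because the parameter values here are illustrative, the printed rates are not expected to reproduce the TPR figures reported in the abstract; the sketch only shows how TPR and FPR can be estimated by repeated simulation under a chosen zero inflation rate and dispersion.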
Keywords/Search Tags:Association analysis, Microbiome, Zero inflation, Nonlinear, Mutual information