Two Algorithms Improving Statistical Power, Accuracy And Computational Efficiency In Multi-locus Genome-wide Association Studies

Posted on: 2018-04-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Cox Lwaka Tamba
Full Text: PDF
GTID: 1360330575967135
Subject: Bioinformatics
Abstract/Summary:
Many traits of biomedical, agricultural, or evolutionary importance are quantitative in nature. Variation in these traits is often due to the effects of multiple genetic loci as well as environmental factors. Knowledge of the number, locations, effects, and identities of such genetic loci can lead to new biological insights. A genome-wide association study (GWAS) examines a large number of single-nucleotide polymorphisms (SNPs) in a limited sample of hundreds of individuals, which makes it a variable selection problem in a high-dimensional dataset. Recent technological developments generate vast amounts of GWAS data, running into tens of millions of SNPs that need to be tested for association with a trait of interest. Although many approaches have been developed, there is still a need for computationally fast algorithms that ensure high power in quantitative trait nucleotide (QTN) detection, high accuracy in effect estimation, and a low false positive rate.

Although single-locus GWAS approaches with polygenic background and population structure controls have been widely used, some significant loci fail to be detected, and the effects of identified loci cannot be estimated. This is because these approaches do not consider the joint effect of multiple genetic markers on a trait. In addition, the Bonferroni correction used to set the significance threshold under multiple testing is too stringent, so many relevant loci are missed. Penalized regression models are multi-locus in nature, so a less stringent significance criterion can be adopted. They consider the joint effect of multiple genetic markers on a trait, and they can shrink some marker effects to zero, which is appropriate because only a small subset of SNPs is usually associated with the trait of interest. However, penalized regression models fail when the number of markers is several times larger than the sample size. The solution therefore lies in reducing the number of markers before employing a shrinkage method in a multi-locus genetic model.

We addressed this issue in the first part of our work by developing an algorithm called ISIS EM-BLASSO, which reduces the number of SNPs to a moderate number before estimating the QTN effects in a multi-locus model. An iterative modified sure independence screening approach reduces the number of SNPs to a moderate number, and EM-Bayesian LASSO then estimates the effects of all selected SNPs for accurate QTN detection.

In the second part of our work, we addressed the computational challenge that GWAS faces because of the large number of SNPs that need to be tested for association with a trait of interest. We developed a fast mrMLM algorithm for multi-locus GWAS, called FASTmrMLM. We accelerated the mrMLM algorithm by using the idea behind GEMMA together with matrix transformations and identities. The target functions and their derivatives, in vector/matrix form, for each marker scan are transformed into simple forms that are easy and efficient to evaluate during each optimization step. All potentially associated QTNs are then estimated in a multi-locus model by the EM-Empirical Bayes and/or LARS algorithm.

We performed Monte Carlo simulation studies to confirm the effectiveness of the new approaches (ISIS EM-BLASSO and FASTmrMLM). We sampled SNP genotypes from the Arabidopsis thaliana data and set six QTNs. We then simulated phenotypic values under three genetic backgrounds: no polygenic background added, polygenic background added, and epistatic effect added. In the first part of this work, each of the 1000 simulated samples was analyzed by the ISIS EM-BLASSO, EMMA, SCAD, FarmCPU, and mrMLM methods. In the second part, each of the 1000 simulated samples was analyzed by the FASTmrMLM, mrMLM, FarmCPU, GEMMA, and EMMA methods. We computed the power, mean squared error (MSE), false positive rate (FPR), and running time in each case study to validate the new approaches. The effectiveness of the new procedures was further confirmed by analyzing six flowering-related traits of Arabidopsis thaliana.

The main results of our work are as follows:

1. To validate the new method, ISIS EM-BLASSO, three Monte Carlo simulation experiments were conducted to compare it with four methods (EMMA, SCAD, FarmCPU, and mrMLM). In the first simulation experiment, the average powers across the six simulated QTNs for ISIS EM-BLASSO, EMMA, SCAD, FarmCPU, and mrMLM were 70.0, 46.0, 52.8, 41.9, and 68.6 (%), respectively. The same trends were observed in the other simulation experiments. In paired t-tests between ISIS EM-BLASSO and the other methods, ISIS EM-BLASSO had significantly higher power than EMMA, SCAD, and FarmCPU (P-value = 0.001 to 0.007) in the first simulation experiment. Although the difference between ISIS EM-BLASSO and mrMLM was not significant, ISIS EM-BLASSO had slightly higher power than mrMLM. ISIS EM-BLASSO therefore had the highest power in QTN detection.

The average MSE values across the six simulated QTNs for ISIS EM-BLASSO, EMMA, SCAD, FarmCPU, and mrMLM were 0.0812, 0.5432, 0.2030, 0.2824, and 0.0934, respectively, in the first simulation experiment. In paired t-tests between ISIS EM-BLASSO and the other four methods, the MSE of ISIS EM-BLASSO was significantly lower than that of EMMA and SCAD. The differences between ISIS EM-BLASSO and the other two methods (mrMLM and FarmCPU) were not significant, although ISIS EM-BLASSO had slightly lower MSE than both. The same trends were observed across all the simulation experiments. Indeed, reducing the number of SNPs increases the accuracy of effect estimation and the power of QTN detection.

Despite having the highest accuracy in QTN effect estimation, ISIS EM-BLASSO had slightly higher Type 1 error rates (false positive rates) than SCAD, EMMA, FarmCPU, and mrMLM. Even so, all the Type 1 error rates were less than 0.04%. In the first simulation study, the Type 1 error rates for ISIS EM-BLASSO, EMMA, SCAD, FarmCPU, and mrMLM were 3.25E-2, 3.25E-2, 1.9E-2, 1.78E-2, and 1.99E-2, respectively, whereas in the second simulation study the false positive rates were 3.47E-2, 1.66E-2, 2.19E-2, 1.74E-2, and 2.34E-2, respectively. ISIS EM-BLASSO was also the fastest of the methods compared: it took 3%, 16%, 20%, and 50% of the computing time of the EMMA, mrMLM, SCAD, and FarmCPU methods, respectively. The new method reduces the number of markers scanned to a moderate number, which reduces the computing time.

ISIS EM-BLASSO detected 14, 11, 23, 21, 9, and 11 SNPs significantly associated with the six traits studied, respectively. The SNPs detected for each trait were used to conduct a multiple linear regression analysis, and AIC and BIC values were calculated. ISIS EM-BLASSO showed lower AIC and BIC values for nearly all traits, indicating that the SNPs it detected fit the data better than those detected by the other methods. The total numbers of known genes in the proximity of the SNPs detected for the six traits were 67, 22, 15, and 13 for ISIS EM-BLASSO, mrMLM, FarmCPU, and EMMA, respectively. ISIS EM-BLASSO thus detected more known genes than the other methods. ISIS EM-BLASSO also identified 50 new genes.

2. To validate the new method, FASTmrMLM, the same three Monte Carlo simulation experiments were used to compare it with four methods (mrMLM, FarmCPU, GEMMA, and EMMA). FASTmrMLM took less than 50% of the running time taken by mrMLM. In the first simulation, the running times (Intel Core i5-4570 CPU at 3.20 GHz, 7.88 GB memory) for the FASTmrMLM, mrMLM, FarmCPU, GEMMA, and EMMA methods were 6.25, 13.77, 5.12, 2.57, and 68.77 hours, respectively. FASTmrMLM therefore substantially speeds up mrMLM. Although GEMMA and FarmCPU had lower computational times than FASTmrMLM, their statistical power and parameter estimation accuracy were worse than those of FASTmrMLM. The same trends were observed across all the simulations.

In the first simulation experiment, the average powers across the six simulated QTNs for FASTmrMLM, mrMLM, FarmCPU, GEMMA, and EMMA were 68.8, 68.6, 41.9, 46.0, and 46.0 (%), respectively. In paired t-tests between FASTmrMLM and the other methods, FASTmrMLM had significantly higher power than FarmCPU, GEMMA, and EMMA (P-value = 0.004 to 0.012). Although the difference between FASTmrMLM and mrMLM was not significant (P-value = 0.688), FASTmrMLM had slightly higher power than mrMLM, implying that FASTmrMLM had the highest power in QTN detection.

The average MSE values across the six simulated QTNs for FASTmrMLM, mrMLM, FarmCPU, GEMMA, and EMMA in the first simulation experiment were 0.0775, 0.0933, 0.2824, 0.5467, and 0.5432, respectively. In paired t-tests between FASTmrMLM and the other four methods, the MSE of FASTmrMLM was significantly lower than that of GEMMA and EMMA (P-value = 0.009 to 0.020). The differences between FASTmrMLM and the other two methods (mrMLM and FarmCPU) were not significant (P-value = 0.110 to 0.806), although FASTmrMLM had slightly lower MSE than both. The same trend held across all simulations. Therefore, FASTmrMLM had the highest accuracy in the estimation of QTN effects.

FASTmrMLM effectively controlled the FPR in QTN detection in all the simulation experiments. In the first simulation experiment, the FPR values for the FASTmrMLM, mrMLM, FarmCPU, GEMMA, and EMMA methods were 1.80E-2, 1.99E-2, 1.78E-2, 3.25E-2, and 3.25E-2 (%), respectively. This indicates that FASTmrMLM had nearly the lowest FPR, even though a less stringent selection criterion was adopted.

FASTmrMLM identified 17, 15, 14, 17, 14, and 15 SNPs significantly associated with the six traits studied, respectively. The SNPs identified for each trait were used to conduct a multiple linear regression analysis, and the corresponding AIC and BIC values were calculated. FASTmrMLM had lower AIC and BIC values for nearly all the traits, indicating that the SNPs it identified fit the data better than those identified by the other methods. FASTmrMLM, mrMLM, FarmCPU, and GEMMA/EMMA identified in total 52, 22, 15, and 13 known genes, respectively, in the proximity of the detected SNPs. The new method thus identified more known genes than the other methods. FASTmrMLM also detected 26 new genes.

We have developed reliable GWAS methods. ISIS EM-BLASSO is an alternative multi-locus GWAS method, and FASTmrMLM is a fast and reliable algorithm for multi-locus GWAS.
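
The screen-then-shrink idea behind ISIS EM-BLASSO (first reduce the markers to a moderate number, then estimate their effects jointly with a shrinkage method) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. It assumes a single, non-iterative pass of marginal-correlation screening, substitutes an ordinary LASSO fitted by coordinate descent for EM-Bayesian LASSO, and runs on synthetic 0/1 genotypes rather than the Arabidopsis thaliana data. All function names, the number of markers kept (50), and the penalty (0.1) are hypothetical choices made for illustration.

```python
import numpy as np


def standardize(X):
    """Center each marker column and scale it to unit variance."""
    Xc = X - X.mean(axis=0)
    return Xc / Xc.std(axis=0)


def screen_markers(Xs, yc, keep):
    """Keep the `keep` markers with the largest absolute marginal
    covariance with the centered trait (proportional to correlation,
    because the marker columns are standardized)."""
    score = np.abs(Xs.T @ yc) / len(yc)
    return np.sort(np.argsort(score)[::-1][:keep])


def lasso_cd(Xs, yc, lam, n_iter=200):
    """LASSO by cyclic coordinate descent on standardized columns.
    Stand-in for EM-Bayesian LASSO (an assumption of this sketch)."""
    n, p = Xs.shape
    beta = np.zeros(p)
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += Xs[:, j] * beta[j]      # take marker j out of the fit
            rho = Xs[:, j] @ resid / n       # its covariance with the residual
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft threshold
            resid -= Xs[:, j] * beta[j]      # put marker j back
    return beta


def screen_then_shrink(X, y, keep=50, lam=0.1):
    """Screen markers marginally, then fit a multi-locus penalized model."""
    Xs = standardize(X.astype(float))
    yc = y - y.mean()
    kept = screen_markers(Xs, yc, keep)
    beta = lasso_cd(Xs[:, kept], yc, lam)
    nonzero = beta != 0
    return kept[nonzero], beta[nonzero]


# Synthetic example: 200 individuals, 2000 markers, 3 simulated QTNs.
rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.integers(0, 2, size=(n, p))          # 0/1 genotypes of inbred lines
causal = np.array([10, 500, 1500])           # hypothetical QTN positions
y = X[:, causal] @ np.array([2.0, -2.0, 2.0]) + rng.normal(0.0, 0.5, n)

selected, effects = screen_then_shrink(X, y)
print("selected markers:", selected)
print("estimated effects (standardized scale):", np.round(effects, 3))
```

Screening cuts the 2000 markers to 50 before the joint model is fitted, so the penalized regression works in a setting where the number of markers is smaller than the sample size. The estimated effects are on the standardized-genotype scale and are shrunk toward zero by the penalty, so they are not directly comparable to the simulated effect sizes.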
Keywords/Search Tags: Correlation learning, GWAS, Penalization, SNP, multi-locus model, mixed linear model