Joint Analysis Strategies Of SKAT And Penalized Regression And Their Application In Genetic Association Studies

Posted on: 2017-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: J G Zhang
GTID: 2284330503965225
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background: Genome-wide association studies (GWAS) have identified many genetic variants that confer susceptibility to a variety of complex diseases. Nevertheless, a large proportion of heritability remains unexplained by GWAS results. Rare variants arose recently in human evolution and have not been subjected to much selective pressure, so they tend to be functional and to have stronger genetic effects. With the rapid development of second-generation sequencing technology, more and more genetic association data sets that include rare variants are emerging. However, because rare variants occur at low frequency in the population, conventional statistical methods have difficulty detecting their effects, which poses new challenges for current analysis methods.

In the early stage of GWAS, single-variant association tests were constrained by the corrected significance level, which resulted in low efficiency. Because many disease-related genes contain several variants with functional effects, especially rare variants, grouping the variants within the same gene to aggregate their effects is a useful analysis strategy. SKAT (sequence kernel association test) is representative of such methods: it handles linkage disequilibrium between genetic variants and allows different variants to have effects of different directions and magnitudes. However, all these methods operate on a single gene or region of interest (ROI), ignoring information contained in other genes or outside gene boundaries.

Genome-wide genetic association data are high-dimensional, noisy, and severely collinear. Adding penalty functions to traditional least squares and likelihood estimation is a useful way to address these problems. Since LASSO (least absolute shrinkage and selection operator) was proposed by Tibshirani in 1996, many novel statistical methods based on the idea of penalization have been developed.
In 2005, Zou et al. proposed the elastic net (EN), which combines ridge regression and LASSO and is effective when the number of variables is much larger than the number of observations. In 2009 and 2012, Breheny et al. and Huang et al. proposed cMCP and Gel for bi-level variable selection, which selects not only the important groups but also the important members within those groups. This excludes the influence of excessive noise, but the theory and application of these methods need further study.

SKAT, which is based on genes or regions of interest, can only perform statistical inference at the group level and cannot estimate the effects of individual variants. Penalized regression cannot provide P values for statistical tests. This study therefore proposes two-stage analysis strategies that combine the advantages of the two types of methods. We apply the two-stage strategies and bi-level variable selection to genetic association data and evaluate their properties in order to provide methodological guidance for genetic association studies.

Method: SKAT, LASSO, EN, the two-stage strategies (SKAT+EN, SKAT+LASSO, EN+SKAT, LASSO+SKAT), and the bi-level variable selection models (cMCP, Gel) were applied to genome-wide association studies and candidate gene association studies to compare their performance. These methods were further applied to a case-control genetic association study of the lncRNAs H19, HOTAIR, MALAT1, and MEG3 and liver cancer to demonstrate their practical application.

For the genome-wide association study, the data came from Genetic Analysis Workshop 18 (GAW18) and include 849 individuals. We chose the variants on chromosome 3 (532092 SNPs in 1141 genes) as independent variables and simulated diastolic blood pressure (DBP) as the outcome variable.
The evaluation indices include sensitivity, specificity, the Youden index, the selection rate, and the correlation coefficient with its P value.

For the candidate gene association study, the GAW18 data were used again, with the variants that are all associated with the outcome variable (119 SNPs in 35 genes) as independent variables in the models. We used 200 simulated values of DBP for the 849 individuals as outcome variables to evaluate the power of the methods, and Q1 as the outcome variable to evaluate the type I error. Absolute error and relative error were added to the evaluation indices.

Finally, the optimal statistical analysis strategies were applied to the case-control genetic association study of the lncRNAs H19, HOTAIR, MALAT1, and MEG3 and liver cancer to demonstrate their practical application in future studies.

Result: 1. In the genome-wide association study, the gene-level results show that the highest average sensitivity, 0.595, was obtained by SKAT, and the highest average specificity, 0.906, by SKAT+LASSO. SKAT had the highest Youden index (0.112), followed by SKAT+EN (0.086). The SNP-level results indicate that EN had the highest sensitivity and SKAT+LASSO the highest specificity. EN+SKAT had the highest Youden index, followed by EN. EN and EN+SKAT selected the largest numbers of truly associated SNPs. MAP4, the gene contributing most to DBP, had the highest selection rate across the statistical analyses; its selection rate was associated with the number of SNPs within the gene and the proportion of DBP variance explained. SNPs 48040283 and 47957996 were selected most and second most often by the models; both belong to MAP4, and their effect sizes ranked 1st and 5th among all SNPs.

2. In the candidate gene association study, at the gene level, EN and LASSO had the highest power, 0.638 and 0.616 respectively.
SKAT and its joint analysis strategies had the lower type I error rates among all strategies. The SNP-level results indicate that EN and LASSO had the highest power, and that SKAT+EN and SKAT+LASSO had the lowest type I error rates. In addition, at both the gene and SNP levels, the power of EN+SKAT was lower than that of EN and LASSO, but its type I error rate was much lower than theirs. MAP4 was the gene with the highest selection rate, which was related to the number of SNPs within the gene and the proportion of DBP variance explained. The SNPs selected by the models with the largest coefficients for DBP were not associated with MAF or with the true coefficients for DBP, but were related to the proportion of DBP variance explained. Across the statistical methods, neither absolute error nor relative error was correlated with MAF or with the proportion of variance explained, but absolute error was related to the true coefficient of the SNP.

3. This study used several strategies to analyze a hospital-based case-control genetic association study of liver cancer. The results show that traditional univariate logistic regression identified only rs151191249 as possibly associated with liver cancer, whereas EN and LASSO screened out 11 and 10 SNPs, respectively, that are possibly associated with liver cancer.

Conclusion: 1. In the genome-wide association study, EN+SKAT could screen out a small number of variants associated with the phenotype from large-scale data. This method not only has high sensitivity but also restrains false positives, and it could provide clues for future studies of genetic mechanisms.

2. In the candidate gene association study, EN showed its power and efficiency: it identified more SNPs associated with the phenotype than logistic regression did, and it met the purpose of our study.
In addition, although the EN+SKAT strategy has lower power than EN, it significantly reduces the probability of type I error and is worth using in candidate gene studies.
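
The abstract compares methods by sensitivity, specificity, and the Youden index at the gene and SNP levels. As a rough illustration of how such indices could be computed for one selection result, here is a minimal Python sketch. It assumes the standard definitions (Youden index = sensitivity + specificity - 1); the function name, argument names, and the toy gene labels are hypothetical and are not taken from the thesis.

```python
def selection_metrics(selected, causal, all_items):
    """Sensitivity, specificity, and Youden index for one selection result.

    selected  : items (genes or SNPs) a method flagged as associated
    causal    : items that are truly associated in the simulation
    all_items : every item that was tested (assumed to contain the other two)
    """
    selected, causal, all_items = set(selected), set(causal), set(all_items)
    tp = len(selected & causal)              # truly associated and selected
    fn = len(causal - selected)              # truly associated but missed
    fp = len(selected - causal)              # selected but not associated
    tn = len(all_items - selected - causal)  # correctly left out
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
    }


# Toy example with made-up labels: 10 tested genes, 4 truly associated.
genes = [f"g{i}" for i in range(1, 11)]
metrics = selection_metrics(
    selected=["g1", "g2", "g5"],
    causal=["g1", "g2", "g3", "g4"],
    all_items=genes,
)
print(metrics)
```

In a simulation with many replicates, such as the 200 simulated DBP values mentioned above, these per-replicate values would typically be averaged; whether the thesis averaged them in exactly this way is not stated in the abstract.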
Keywords/Search Tags: genetic association study, SKAT, penalized regression, GAW18, liver cancer, lncRNA