
The Performances Of Four SNP Set Association Studies In Genome-wide Association Study

Posted on: 2014-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: M Cai
GTID: 2284330485994940
Subject: Epidemiology and Health Statistics
Abstract/Summary:
It is widely believed that genetic variants play an important role in the etiology of common diseases. With the rapid development of high-throughput genotyping technology, large numbers of SNPs can be genotyped simultaneously in a genome-wide association study (GWAS), which yields high-dimensional data. Because of the large number of predictors, association analysis based on individual SNPs requires multiple-testing adjustment to keep the overall type I error rate under control. After this correction, the significance level becomes so strict that the power of the test drops and truly causal SNPs may be missed.

More and more researchers are therefore interested in multi-locus (SNP set) association studies, and some have considered gene-, region- or SNP set-based association analysis. These approaches group SNPs into a SNP set according to genomic features and then test the joint effect of the set. Several studies have shown that treating SNP sets rather than individual SNPs as the unit of analysis may alleviate the multiple-testing problem, allow the association between genetic variation and complex diseases to be assessed at the gene or set level, and improve power. Methods that treat SNP sets as higher-level units include principal component analysis (PCA), supervised principal component analysis (SPCA), kernel principal component analysis (KPCA) and sliced inverse regression (SIR).

In this study, we compare the performance of PCA, SPCA, KPCA and SIR in genome-wide association studies using simulated and real datasets, and we use SPCA to analyze GWAS data on genetic susceptibility to lung cancer.

The simulations are as follows:
1. Simulations based on virtual datasets: Simulated SNP sets are generated under a null model, a single causal SNP model and a two causal SNPs model. The datasets are based on virtual structures in which the LD structure and the MAF of the SNPs are set artificially. Scenarios cover three LD structures (r² = 0.2, r² = 0.5 or r² = 0.8 for any two SNPs) and three MAFs (MAF = 0.05, 0.1 or 0.2 for all SNPs).
2. Simulations based on real genes: Simulated SNP sets are generated under a null model, a single causal SNP model and a multiple causal SNPs model. The datasets are generated from the phased haplotypes of the CEU samples available from the website of the International HapMap Project. Association analyses are carried out on the simulated data with all four methods.

The main results of the study are as follows:
1. Results based on virtual datasets: All four methods control the type I error at the specified significance level. With a single causal SNP, SPCA has the highest power in most situations. When MAF is fixed at 0.05, 0.1 or 0.2 and LD is set to 0.2, the powers of PCA, KPCA and SIR are approximately equal. When LD is 0.5, KPCA is more powerful than PCA. Under a strong LD structure, the efficiency of KPCA is close to that of SPCA. In most situations, the efficiency of SIR is lower than that of the other methods. In the scenarios with two causal SNPs, the trends in power are nearly the same as under the single causal SNP model, but the power in every two causal SNPs scenario is clearly higher than in the corresponding single causal SNP scenario.
2. Results based on actual datasets: All four methods control the empirical type I error at the specified significance level. In general, all methods have power when the causal SNP is in high LD with the other SNPs. In most cases SPCA still has the highest power, followed by KPCA. When the MAF of the causal SNP is low, all four methods have low power, only about 10%.
3. Real data analysis: To investigate performance on real data systematically, we apply SPCA to the Nanjing, Beijing and pooled lung cancer GWAS datasets. With the Nanjing data as the screening set (p <= 0.0001) and the Beijing data as the validation set (p <= 0.05), 13 genes are validated.

Conclusion: The results on simulated and real datasets show that SPCA is superior to the other methods in extracting information from SNPs, and it is recommended for gene-based or SNP set-based analysis in genome-wide association studies.
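As a rough illustration of the SNP set idea described above, the following Python sketch tests one simulated SNP set with a PCA-based approach: the genotypes are reduced to their leading principal components, and the joint association of those components with case-control status is assessed. It is a minimal sketch under stated assumptions, not the procedure used in the thesis. The function names (`simulate_snp_set`, `top_pcs`, `r_squared`, `pca_set_test`), the 80% variance cutoff, the permutation test on a linear-model R² (chosen here in place of a logistic regression test), and all simulation parameters are hypothetical. In particular, `latent_corr` controls the correlation of a latent Gaussian variable and is not the LD r² of the resulting SNPs.

```python
import numpy as np


def simulate_snp_set(n, n_snps, maf, latent_corr, rng):
    """Toy genotypes (0/1/2 minor-allele counts) with correlated SNPs.

    Assumption: each haplotype thresholds an equicorrelated latent Gaussian;
    latent_corr is NOT the LD r^2 between the resulting SNPs.
    """
    def haplotype():
        common = rng.standard_normal((n, 1))
        own = rng.standard_normal((n, n_snps))
        z = np.sqrt(latent_corr) * common + np.sqrt(1.0 - latent_corr) * own
        # The lowest `maf` fraction of each column carries the minor allele.
        return (z <= np.quantile(z, maf, axis=0)).astype(int)

    return haplotype() + haplotype()


def top_pcs(G, var_explained=0.8):
    """Scores of the leading PCs that together reach `var_explained`."""
    Gc = G - G.mean(axis=0)
    U, s, _ = np.linalg.svd(Gc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(ratio, var_explained)) + 1, len(s))
    return U[:, :k] * s[:k]


def r_squared(y, X):
    """R^2 of a least-squares fit of y on X (intercept handled by centering)."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)


def pca_set_test(G, y, var_explained=0.8, n_perm=199, seed=0):
    """Permutation p-value for the joint association of the top PCs with y."""
    rng = np.random.default_rng(seed)
    pcs = top_pcs(G, var_explained)
    observed = r_squared(y, pcs)
    hits = sum(r_squared(rng.permutation(y), pcs) >= observed
               for _ in range(n_perm))
    return (1 + hits) / (1 + n_perm)


# Toy scenario: 10 SNPs with MAF 0.2, the first SNP is causal.
rng = np.random.default_rng(42)
G = simulate_snp_set(n=600, n_snps=10, maf=0.2, latent_corr=0.5, rng=rng)
logit = -1.0 + 1.2 * G[:, 0]  # hypothetical effect size
y = (rng.random(600) < 1.0 / (1.0 + np.exp(-logit))).astype(float)
p_value = pca_set_test(G, y)
print(f"SNP set permutation p-value: {p_value:.3f}")
```

Because the whole set yields a single p-value, a genome-wide analysis of this kind corrects for the number of sets rather than the number of SNPs. A supervised variant in the spirit of SPCA would first keep only the SNPs most strongly associated with the outcome and then run the same steps on that subset.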
Keywords/Search Tags: Genome-wide association study, SNP set, Principal component analysis, Supervised principal component analysis, Kernel principal component analysis, Sliced inverse regression