Statistical Methods For Interaction Analysis In High Dimensional Data And The Application In Genome-wide Association Study Of Lung Cancer

Posted on: 2014-01-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Y Zhang
Full Text: PDF
GTID: 1224330398493385
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Despite the great success of genome-wide association studies (GWAS) since 2005, the identified single nucleotide polymorphisms (SNPs) with main effects account for only a small proportion of the genetic variation in complex diseases. Both external factors (environmental exposure) and internal factors (genetic mutation) contribute to complex diseases. Neglecting gene-environment and/or gene-gene interactions is one of the most important reasons for the missing heritability in GWAS.

Hundreds of thousands of SNPs are now available in a GWAS. Because of the complexity of the statistical algorithms and/or the limited computation speed of the software, traditional methods for interaction analysis are not appropriate for high dimensional data. Many novel methods for GWAS interaction analysis have been proposed since 2007, each with its own advantages and disadvantages. Meanwhile, there is no dedicated method for high-order interaction analysis. Thus, firstly, we carried out a systematic comparative analysis of ten representative methods. Secondly, we proposed a novel method for high-order interaction analysis. Thirdly, we proposed a three-step strategy to reduce high dimensional data to low dimensional data when detecting high-order interactions. Finally, we applied the proposed strategy to a real GWAS dataset for genome-wide epistasis analysis. The thesis is organized as follows.

In Section 1, we carried out a systematic comparative analysis, based on a literature review, of ten methods implemented in seven software packages: BOOST, BiForce, iLOCi, SIXPAC_D, SIXPAC_R, SIXPAC_lod, SNPRuler, AntEpiSeeker_pruned, AntEpiSeeker_raw and TEAM. Simulation 1 and Simulation 2 were designed to detect a single epistatic interaction and more than one epistatic interaction, respectively. Both simulations indicate that two methods (BOOST and BiForce) can be recommended for interaction analysis, since they control the type I error and have acceptable power.
BOOST performs the same as BiForce, indicating that "screening before testing" is a reasonable approach to dimension reduction. SIXPAC_lod only supports datasets in which SNPs are coded under a dominant or recessive genetic model. The type I error of SIXPAC_lod was inflated up to 15% when detecting more than one epistatic interaction. However, it has higher power than BOOST or BiForce in all scenarios, suggesting that SNPs should be coded under a dominant or recessive genetic model when the sample size is limited. BOOST and BiForce are flexible in SNP coding (additive, dominant or recessive). Thus, we recommend these two as the best methods for GWAS interaction analysis. Both simulations show that two other methods (AntEpiSeeker_raw and TEAM) perform best at filtering out noise SNPs. They control the type I error for noise SNPs and have high power to detect SNPs regardless of whether a main effect or an interaction effect exists. In Simulation 3, BOOST and BiForce are the fastest tools. They can finish an exhaustive search for epistasis on a genome-wide scale in a few days.

In Section 2, we proposed a new method, iterative entropy epistasis (IEE), within an information-theoretic framework. IEE is appropriate for detecting high-order interactions regardless of the linkage disequilibrium (LD) structure among the SNPs. Simulation 4 and Simulation 5 were designed to evaluate the performance of IEE from the perspectives of statistical methodology and real application, respectively. Intensive simulations indicate that IEE controls the type I error at the nominal level and exhibits higher power than the log-linear model and other entropy-based methods. Additionally, with fewer iterations, IEE runs faster than the log-linear model. The lower the iteration accuracy of IEE, the faster it runs. In Simulation 6, we found that IEE maintained its original performance at 25% and 50% iteration accuracy when detecting one-order and high-order interactions, respectively. Thus, the calculation speed was improved 4-fold and 2-fold, respectively.
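
To make the idea of an entropy-based epistasis measure concrete, the following is a minimal sketch of the standard interaction information (information gain) for two SNPs and a binary phenotype. It is not the IEE algorithm itself, whose iterative details the abstract does not give; the function names `entropy`, `mutual_information` and `interaction_information` are hypothetical and chosen for illustration only.

```python
import numpy as np

def entropy(*columns):
    """Shannon entropy (in nats) of the joint distribution of discrete columns."""
    joint = np.stack(columns, axis=1)            # one row per subject
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def interaction_information(a, b, y):
    """I(A,B;Y) - I(A;Y) - I(B;Y): information about Y carried jointly by
    A and B beyond their separate (main-effect) contributions.
    Generic illustration only, not the thesis's IEE statistic."""
    joint_info = entropy(a, b) + entropy(y) - entropy(a, b, y)
    return joint_info - mutual_information(a, y) - mutual_information(b, y)
```

On a toy XOR pattern (`a = [0, 0, 1, 1]`, `b = [0, 1, 0, 1]`, `y = [0, 1, 1, 0]`), neither SNP is informative alone but the pair determines the phenotype, so the measure equals ln 2; when the phenotype simply copies `a`, it is zero.
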
In Section 3, we proposed a three-step strategy for high-order interaction analysis in GWAS. The first step is fast screening using the Kirkwood superposition approximation (KSA), which filters out a large proportion of noise SNPs. The second step is testing using IEE, which further removes false positive results. The final step is confirmation using a logistic regression model, which provides the statistical significance of the interactions. The strategy is referred to as KIL. Simulation 7 indicates that the KSA statistics are no smaller than those of IEE, and that KSA is the fastest of the three compared with IEE and the logistic regression model. Thus, KSA is suitable for fast screening without missing potential positive interactions. IEE is faster than the logistic regression model and is appropriate for screening epistasis in high dimensional data. In Simulation 8, KIL reduces the computational burden to as low as 30%-40% of the original. Meanwhile, it retains, on average, more than 92% of the power of the logistic regression model. Compared with KSA and the logistic regression model, the integrated strategy controls the type I error and largely preserves power.

In Section 4, we carried out, for the first time, an exhaustive search for gene-gene interactions and gene-smoking interactions, as well as for biological pathways, in a GWAS of lung cancer in Han Chinese populations.

(1) Gene-gene interaction analysis. We adopted a three-stage case-control design. The first stage is the GWAS discovery stage. The second and third are the replication stages. In total, 13,392 subjects (6,377 cases and 7,015 controls) were collected, with 591,370 genotyped SNPs. Four pairs of epistatic loci were screened out using the KIL strategy. Among them, only rs2562796-rs16832404 was successfully validated in the two independent replication stages. In the discovery stage, the interaction OR = 2.58, 95% CI = 2.24-2.97, P = 1.37×10^-39. In replication 1, the interaction OR = 1.17, 95% CI = 0.99-1.38, P = 6.37×10^-2. In replication 2, the interaction OR = 1.21, 95% CI = 1.06-1.38, P = 4.61×10^-3.
In the combined dataset of the three stages, the interaction OR = 1.33, 95% CI = 1.23-1.43, P = 1.03×10^-13. We also performed stratified analyses by age, gender, smoking, etc. The identified pair of epistatic loci remained significant in the subpopulations. Additionally, we observed a cluster of interaction signals in the genotype imputation analysis.

(2) Gene-environment interaction analysis. We adopted a two-stage case-control design. The populations are the same as those of the first two stages described above. In total, we used 8,440 subjects (3,865 cases and 4,575 controls). Six SNPs showed potential interaction with smoking in the GWAS discovery stage. Only two SNPs (rs1316298 and rs4589502) were successfully validated in the replication stage. In the discovery stage, the interaction P values for rs1316298 and rs4589502 are 4.15×10^-5 and 2.61×10^-5, respectively. In the replication stage, the interaction P values are 8.87×10^-4 and 4.40×10^-2, respectively. SNP rs1316298 has an antagonistic interaction with smoking, whereas rs4589502 has a synergistic interaction with smoking. The interaction P values are 6.73×10^-6 and 3.84×10^-6, respectively, in the combined dataset of the two stages. In the genotype imputation analysis, we also observed a cluster of SNPs in high or low LD with these identified SNPs, which contribute to lung cancer risk interactively with smoking.

(3) Biological pathway analysis. We performed epistasis analysis based on biological pathway information. The GWAS of lung cancer is composed of two independent studies: the Nanjing study and the Beijing study. We performed an exhaustive search for pathways based on the KEGG (Kyoto Encyclopedia of Genes and Genomes) and BioCarta databases in the Nanjing study. The significant pathways were then replicated in the Beijing study. As a result, four pathways (achPathway, At1rPathway, metPathway and rac1Pathway) were successfully validated, with P values of 0.012, 0.022, 0.010 and 0.005, respectively, in the combined data of the two studies.
Sensitivity analysis was performed using a different SNP-to-gene mapping strategy or by removing the genes that overlap among the four pathways. The results indicated that our findings were robust. Then, we performed an exhaustive search for interactions among representative SNPs in each pathway. We identified only one epistatic pair (rs17057065-rs17194885). The interaction P values were 4.98×10^-2, 4.42×10^-2 and 4.69×10^-3 in the Nanjing study, the Beijing study and the GWAS, respectively.

Simulation experiments and real data analysis provide evidence that KIL is an effective and efficient way to detect epistasis in GWAS. Both environmental exposure and genetic mutation contribute to lung cancer risk interactively.

This study is highlighted by the four innovations below:

(1) We carried out a systematic comparative analysis of ten methods to evaluate their statistical performance. Our findings provide evidence for the selection of appropriate methods in GWAS interaction analysis.

(2) We proposed a novel high-order interaction analysis method, IEE. It is robust even at 50% iteration accuracy, and qualifies as a fast-screening method for interaction analysis of high dimensional data.

(3) We proposed a three-step KIL strategy for high-order interaction analysis. It is effective, in both statistical performance and computation speed, for reducing high dimensional data.

(4) We were the first to carry out an exhaustive search for gene-gene and gene-environment interactions, as well as for biological pathways, in a GWAS of lung cancer in the Han Chinese population. Our findings may provide novel insight into the multifactorial etiology of lung cancer.
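
As an illustration of the fast-screening step in the KIL strategy summarised above, the sketch below computes one plausible KSA-based score for a SNP pair and a phenotype: the Kullback-Leibler divergence between the observed joint distribution and its Kirkwood superposition approximation, p(a,b)p(a,y)p(b,y) / (p(a)p(b)p(y)). The abstract does not define the thesis's KSA statistic, so the renormalisation, the choice of divergence, and the name `ksa_divergence` with its `levels` parameter are assumptions made for illustration.

```python
import numpy as np

def ksa_divergence(a, b, y, levels=(3, 3, 2)):
    """KL divergence (nats) between the empirical joint p(a, b, y) and its
    renormalised Kirkwood superposition approximation.
    Hypothetical screening score, not the thesis's exact KSA statistic."""
    counts = np.zeros(levels)
    # Tally the genotype-genotype-phenotype contingency table.
    np.add.at(counts, (np.asarray(a), np.asarray(b), np.asarray(y)), 1)
    p = counts / counts.sum()

    # Pairwise and single-variable marginals.
    p_ab, p_ay, p_by = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)
    p_a, p_b, p_y = p_ab.sum(axis=1), p_ab.sum(axis=0), p_ay.sum(axis=0)

    num = p_ab[:, :, None] * p_ay[:, None, :] * p_by[None, :, :]
    den = p_a[:, None, None] * p_b[None, :, None] * p_y[None, None, :]
    ksa = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    ksa /= ksa.sum()          # the raw approximation need not sum to one

    mask = p > 0              # cells with p > 0 always have ksa > 0
    return float((p[mask] * np.log(p[mask] / ksa[mask])).sum())
```

A score near zero means the pairwise marginals already explain the three-way table, so the pair could be filtered out; a pure XOR pattern with binary coding (`levels=(2, 2, 2)`) gives ln 2, since no pairwise marginal carries the signal.
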
Keywords/Search Tags: Genome-wide association study, High dimensional data, Gene-environment interaction, Gene-gene interaction, Biological pathway, Dimensionality reduction strategy, Statistical method, Data mining, Lung cancer, Entropy