An Assessment Of Random Forest For Detecting Interactions In Case-control Data Of Lung Cancer

Posted on: 2013-05-27  Degree: Master  Type: Thesis
Country: China  Candidate: J J Zhu
GTID: 2234330374492859  Subject: Epidemiology and Health Statistics
Abstract/Summary:
With the rapid development of high-throughput genotyping platforms, researchers can genotype large numbers of genes and SNPs in association studies, which yields high-dimensional data. Statisticians have established dimension reduction models to identify main effects and interactions in such data. Tree-based analysis is a family of nonparametric statistical methods that carry out regression and classification through iterative partitioning. These methods can reduce the number of variables that need to be retained for further study, and random forest is an excellent example.

In this study, we used the principles and methods of random forest to perform two analyses, one based on simulated data sets and one based on a real data set. We used multi-stage analysis on high-dimensional genotyping data. The main contents are as follows:

(1) We generated simulated data based on the phased haplotypes of CHB samples from the website of the International HapMap Project (HapMap Data Rel 27 Phase II+III, Feb09, on NCBI B36 assembly, dbSNP b126). The simulations were used to demonstrate the validity of the random forest method in identifying main effects and interactions.

(2) To investigate systematically the performance of random forests as a SNP screening procedure and an interaction predictor, we assembled a dataset of 580 single-nucleotide polymorphisms (SNPs) from 2331 lung cancer patients and 3077 controls. The accuracy of the random forest method in screening SNPs and predicting interactions was assessed against traditional logistic regression. The higher-order SNP-by-SNP and SNP-by-environment interactions identified by the classification and regression tree (CART) were further analyzed using the likelihood ratio test.

The main results of the study are as follows:

(1) The simulation studies show that if interactions among SNPs exist, they will be exploited within the trees, and the variable importance scores will reflect the interaction effects. Random forests can detect main effects and interactions simultaneously. In particular, when the interacting SNPs have no main effects, the random forest method can detect the interactions where traditional methods cannot.

(2) The real data set analysis covered 580 SNPs in 20 classical candidate DNA repair genes in 2 pathways and investigated the effects of these variants using the dimension reduction method of random forest. The random forest algorithm identified thirty-three important SNPs that had the highest importance scores and the lowest classification error rates. Univariate logistic regression on the 580 SNPs yielded less information than the two-stage methods, and the random forest analysis contained all of the results identified by individual logistic regression.

Random forest is a useful tool in the exploration of potential interacting loci. Our study suggests that after identifying the top-ranked SNPs and other variables, multiple complementary analytic strategies, including logistic regression and CART, can be performed to identify interactions.
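The claim that tree-based methods can exploit interactions between SNPs that show no main effects can be illustrated with a small worked example. The sketch below is not the thesis's analysis and does not fit a random forest. It only computes the Gini impurity decrease (the quantity that CART-style trees and random forest importance scores are commonly built on) for a hypothetical pair of SNPs. The cell counts, the binary carrier coding, and the names `cells`, `gini`, and `impurity_decrease` are all assumptions made for illustration.

```python
def gini(cases, controls):
    """Gini impurity of a node holding the given numbers of cases and controls."""
    n = cases + controls
    if n == 0:
        return 0.0
    p = cases / n
    return 2.0 * p * (1.0 - p)


# Hypothetical counts, keyed by (carrier status at SNP A, carrier status at SNP B).
# Each value is (cases, controls). Risk is raised only when exactly one of the
# two SNPs carries the variant, so neither SNP has a marginal (main) effect.
cells = {
    (0, 0): (20, 80),
    (0, 1): (80, 20),
    (1, 0): (80, 20),
    (1, 1): (20, 80),
}


def impurity_decrease(cells, positions):
    """Gini decrease obtained by partitioning subjects on the SNPs in `positions`."""
    total_cases = sum(c for c, _ in cells.values())
    total_controls = sum(k for _, k in cells.values())
    n = total_cases + total_controls

    # Pool the cells that share the same genotype at the chosen SNP positions.
    groups = {}
    for genotype, (cases, controls) in cells.items():
        key = tuple(genotype[i] for i in positions)
        pooled_cases, pooled_controls = groups.get(key, (0, 0))
        groups[key] = (pooled_cases + cases, pooled_controls + controls)

    # Size-weighted impurity of the child nodes after the partition.
    weighted = sum((c + k) / n * gini(c, k) for c, k in groups.values())
    return gini(total_cases, total_controls) - weighted


gain_a = impurity_decrease(cells, [0])      # split on SNP A alone
gain_b = impurity_decrease(cells, [1])      # split on SNP B alone
gain_ab = impurity_decrease(cells, [0, 1])  # split on A, then on B within each branch

print(f"SNP A alone: {gain_a:.3f}")
print(f"SNP B alone: {gain_b:.3f}")
print(f"A and B jointly: {gain_ab:.3f}")
```

With these assumed counts, splitting on either SNP alone leaves the case fraction at one half in every branch, so a single-SNP test sees nothing, while the two-level partition separates high-risk from low-risk cells. A greedy tree would also see no gain at its first split in this extreme setting. Random forests can still reach the second split because they restrict each split to a random subset of candidate variables and grow deep trees.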
Keywords/Search Tags: Case-control, Polymorphism, Interaction, Machine learning, Random forest