| Identifying susceptibility single nucleotide polymorphisms (SNPs) relevant to common complex diseases is one of the central goals of genome-wide association studies (GWAS). Most studies have used a single-locus statistical test to detect the association of each individual SNP with the phenotype; however, multiple genetic factors and their interaction effects are thought to contribute to complex diseases. Some existing methods perform poorly in detecting disease loci with little marginal effect, and because they search exhaustively for interacting variants, a problem that is combinatorial in nature, they are computationally infeasible. Random Forest (RF) is a powerful machine learning technique that has been proposed for discovering disease-related SNPs. It uses bootstrap sampling to produce many datasets, and a decision tree such as CART is trained on each dataset. RF produces measures of variable importance, such as the Gini importance measure and the permutation importance measure, that can be used to rank genetic variants. Previous investigations have found that, in high-dimensional data, RF variable importance measures capture marginal effects rather than interaction effects. As the number of genetic variants decreases, the probability of detecting SNP interaction effects increases. Relief is another powerful filter method that produces weights for ranking features. Relief F improved on Relief by increasing the number of nearest neighbours to k. The success of Relief F has been explained by large-margin theory, which has also inspired feature ranking methods similar to Relief F. Relief F was designed to detect strong interactions but is sensitive to noise. We have carried out research on genome-wide SNP detection, and our main contributions are outlined below: 1. To increase the power to detect interacting loci, we propose a backward feature elimination method that combines RF and Relief F.
This method uses backward feature elimination to iteratively remove the lowest-ranked SNPs. In each iteration, we first rank the SNPs with Relief F, then re-rank a portion of the lowest-ranked SNPs using the RF Gini importance measure, and finally remove the SNPs that rank lowest. 2. We investigated the formulas for RF Gini importance and the Relief F weight and found that they are related: the two weight formulas share a common factor, in a way analogous to the L2 norm and L1 norm in the elastic net for linear regression. Based on this relationship, we propose a second, elastic-net-style method that also combines RF and Relief F and uses a similar backward feature elimination procedure to iteratively remove the lowest-ranked SNPs. In each iteration, we obtain Gini importances from RF and weights from Relief F, compute a weighted average of the two to produce a ranking, and then remove a portion of the lowest-ranked SNPs. Using a wide range of simulated data and a real AMD dataset, we demonstrate that the two methods significantly outperform RF and Relief F alone and are promising for practical use in detecting disease loci in common complex diseases. |
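
The backward elimination loop of the second method can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two scorer callables stand in for an RF Gini importance computation and a Relief F weight computation, and the names `backward_eliminate`, `alpha`, `drop_fraction`, and `n_keep` are hypothetical. The min-max rescaling before averaging is also an assumption, added only so that the two score types are on a comparable scale. The toy scores are made-up numbers for demonstration.

```python
from typing import Callable, Dict, List

# A scorer takes the SNPs still in play and returns one score per SNP.
# In practice these would wrap an RF Gini importance computation and a
# Relief F weight computation (both assumed here, not implemented).
Scorer = Callable[[List[str]], Dict[str, float]]


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; all-equal scores map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / span for name, value in scores.items()}


def backward_eliminate(
    snps: List[str],
    rf_scorer: Scorer,
    relief_scorer: Scorer,
    alpha: float = 0.5,          # hypothetical mixing weight for the RF score
    drop_fraction: float = 0.2,  # hypothetical share of SNPs removed per round
    n_keep: int = 2,             # hypothetical stopping size
) -> List[str]:
    """Iteratively drop the lowest-ranked SNPs by a weighted-average score."""
    remaining = list(snps)
    while len(remaining) > n_keep:
        # Re-score on the current subset, as both measures depend on it.
        gini = min_max(rf_scorer(remaining))
        relief = min_max(relief_scorer(remaining))
        combined = {
            s: alpha * gini[s] + (1.0 - alpha) * relief[s] for s in remaining
        }
        # Drop at least one SNP, but never go below n_keep.
        n_drop = max(1, int(len(remaining) * drop_fraction))
        n_drop = min(n_drop, len(remaining) - n_keep)
        ranked = sorted(remaining, key=lambda s: combined[s], reverse=True)
        remaining = ranked[: len(ranked) - n_drop]
    return remaining


# Toy stand-ins for the two scorers: fixed lookup tables of made-up scores.
TOY_GINI = {"rs1": 0.9, "rs2": 0.1, "rs3": 0.5, "rs4": 0.2, "rs5": 0.05, "rs6": 0.3}
TOY_RELIEF = {"rs1": 0.2, "rs2": 0.8, "rs3": 0.6, "rs4": 0.1, "rs5": 0.0, "rs6": 0.1}


def toy_rf_scorer(snps: List[str]) -> Dict[str, float]:
    return {s: TOY_GINI[s] for s in snps}


def toy_relief_scorer(snps: List[str]) -> Dict[str, float]:
    return {s: TOY_RELIEF[s] for s in snps}


selected = backward_eliminate(
    list(TOY_GINI), toy_rf_scorer, toy_relief_scorer,
    alpha=0.5, drop_fraction=0.34, n_keep=3,
)
print(selected)
```

In this toy run, a SNP that scores moderately on both measures can outrank one that scores highly on only one of them, which is the kind of behaviour a weighted average of the two rankings is meant to allow. The first method would differ in the body of the loop: rank with Relief F first, then re-rank only the bottom portion with RF Gini importance before removal.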