Study On Disease-associated SNPs Based On Genetic Model And Random Forest

Posted on: 2018-04-01 | Degree: Master | Type: Thesis
Country: China | Candidate: J Z Ju | Full Text: PDF
GTID: 2334330515472759 | Subject: Applied statistics
Abstract/Summary:
The completion of the Human Genome Project and the development of high-throughput sequencing technology, in particular chip-based technology, have made genome-wide association studies (GWAS) possible. Identifying disease-associated single nucleotide polymorphisms (SNPs) is an attractive and promising problem in GWAS: if such SNPs can be roughly determined before biological experiments are carried out, they can guide the experiments and greatly reduce their cost. Whole-genome data has two important characteristics: heavy noise and high dimensionality. Complex diseases are generally caused by the interaction of multiple SNPs, which challenges traditional statistical methods that can only study the relationship between a single SNP and a disease. The Random Forest model is known for handling high-dimensional data and selecting important feature variables, which makes it very attractive to bioinformatics researchers. However, because the dimensionality of whole-genome data is so large, even Random Forest has difficulty finding the relevant SNPs among so much noise.

Based on two basic premises, that complex disease is caused by the interaction of a small number of SNPs and that whole-genome data is high-dimensional and very noisy, this thesis proposes a method for selecting SNPs based on genetic models and Random Forest. Because Random Forest has difficulty selecting features from whole-genome data pervaded by noise, the first step screens the data set with the Dominant, Recessive, Co-Dominant, and Over-Dominant genetic models. The results show that most SNPs are independent of the disease, so this step removes a large amount of noise. In the second step, SNPs are selected from the screened data set with the Random Forest model.

We compared the performance of Random Forest on the original data set and on the screened data set; the predictive rate on the screened data set increased by 30%. We then compared Random Forest with SVM, GBDT, NB, and KNN. The results show that Random Forest outperforms them while remaining a relatively simple algorithm. Its other competitive advantages are that it parallelizes well and that it ranks variables by importance. Random Forest is therefore well suited to selecting disease-associated SNPs from whole-genome data. We then designed a Random Forest-based algorithm to select SNPs. To verify the SNPs selected by the genetic model and Random Forest method, a logistic regression model was used to validate them. Finally, the selected SNP loci were found to interact.
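
The first, genetic-model screening step described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual procedure: the abstract does not say which association test is used, so a Pearson chi-square test at the 0.05 level is assumed here, genotypes are assumed to be coded as minor-allele counts (0, 1, 2), and the names `GENETIC_MODELS`, `chi_square`, and `passes_screen` are hypothetical.

```python
from collections import Counter

# Assumption: genotypes are coded as the number of minor alleles,
# 0 (AA), 1 (Aa), 2 (aa). Each genetic model recodes them differently.
GENETIC_MODELS = {
    "dominant": lambda g: int(g >= 1),       # Aa or aa versus AA
    "recessive": lambda g: int(g == 2),      # aa versus AA or Aa
    "over-dominant": lambda g: int(g == 1),  # Aa versus AA or aa
    "co-dominant": lambda g: g,              # AA, Aa, aa as three classes
}

# Chi-square critical values at alpha = 0.05 for 1 and 2 degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991}


def chi_square(codes, labels):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    n = len(codes)
    cell = Counter(zip(codes, labels))  # observed count per table cell
    row = Counter(codes)                # marginal counts of recoded genotypes
    col = Counter(labels)               # marginal counts of case/control
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (cell[(r, c)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


def passes_screen(genotypes, labels):
    """Keep a SNP if any of the four models shows association (assumed rule)."""
    for encode in GENETIC_MODELS.values():
        stat, dof = chi_square([encode(g) for g in genotypes], labels)
        # dof is 0 when only one genotype class is observed; skip that model.
        if dof in CRITICAL_05 and stat > CRITICAL_05[dof]:
            return True
    return False


# Toy data: 10 cases (label 1) followed by 10 controls (label 0).
labels = [1] * 10 + [0] * 10
associated_snp = [2, 2, 1, 1, 1, 1, 2, 1, 1, 0] + [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
noise_snp = [0, 1, 2, 0, 1, 0, 1, 0, 2, 0] * 2  # same distribution in both groups

kept = [name for name, snp in [("associated", associated_snp), ("noise", noise_snp)]
        if passes_screen(snp, labels)]
```

On the toy data only the first SNP survives the screen; SNPs kept this way would then be passed to the Random Forest step. A real analysis would also need a multiple-testing correction and larger cell counts than this example uses.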
Keywords/Search Tags: Disease-associated SNPs, Genetic model, Random Forest, Genome-wide association studies