
Research and Implementation of Genome-Wide Association Study Techniques Based on MapReduce

Posted on: 2016-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: W C Xia
Full Text: PDF
GTID: 2370330542457249
Subject: Computer software and theory
Abstract/Summary:
With the progress of the Human Genome Project, biologists have characterized genome-wide SNP (Single Nucleotide Polymorphism) data in ever greater detail, which poses new challenges for GWAS (genome-wide association studies). The number of SNP loci across the whole genome ranges from thousands to tens of thousands, and differences at these loci are the fundamental reason that humans show different congenital phenotypes. Selecting SNP loci through genome-wide association analysis can provide strong support for disease prevention and treatment, among other fields. Selecting SNP loci is essentially a feature selection problem, but existing feature selection methods run into several problems on genome-wide SNP data. On the one hand, traditional feature selection methods that do not use machine learning can select only single SNP loci and pairwise SNP-SNP interactions, not multi-locus interactions, which cannot be ignored in SNP data. On the other hand, although feature selection methods based on machine learning can select multi-locus interaction features, they often suffer from the "curse of dimensionality" when faced with high-dimensional data and small samples.

Given these problems and the characteristics of genome-wide SNP data, this thesis presents a new framework for SNP locus selection in genome-wide association analysis. The framework has four steps. The first step divides the high-dimensional SNP feature sequence into many small regions using sequence mining. The second step uses the idea of a minimum independent dominating set to select a subset of association regions that covers the result of the first step. The third step reduces the association regions from the second step: taking interactions between features into account, this thesis defines an approximate measure of strongly relevant features, the consistency contribution rate, based on the definition of strongly relevant features, and designs an algorithm that removes features that are not strongly relevant from the association regions. The fourth step obtains a high-order, non-redundant interacting feature subset from the result of the third step. The existing high-order non-redundant interacting feature subset selection algorithm, NIFS, cannot be applied directly to data with class labels and can handle only binary feature values, so this thesis generalizes and improves it.

Finally, the proposed solution is implemented on the MapReduce framework, where the computing power of a cluster markedly reduces the running time of the algorithm. Experimental analysis shows that the solution achieves high-order, non-redundant interacting SNP locus selection on a genome-wide scale and performs well in disease-locus selection for diabetes and in population classification.
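The abstract does not give the exact definition of the consistency contribution rate. As a rough illustration only, the sketch below uses one common formulation from consistency-based feature selection: the contribution of a feature is the increase in inconsistency rate when that feature is dropped from a subset. The function names, the toy genotype data, and the formula itself are assumptions made for illustration, not the thesis's method. The grouping step is written as a map-then-reduce over samples to hint at how such a count might be distributed.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, features):
    """Fraction of samples that disagree with the majority class among
    samples sharing the same values on `features` (assumed definition)."""
    # "Map" phase: key each sample by its value pattern on the chosen features.
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[f] for f in features)
        groups[pattern][label] += 1
    # "Reduce" phase: per pattern, count samples outside the majority class.
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(rows)

def consistency_contribution(rows, labels, features, f):
    """Increase in inconsistency rate when feature `f` is removed from `features`."""
    rest = [g for g in features if g != f]
    return inconsistency_rate(rows, labels, rest) - inconsistency_rate(rows, labels, features)

# Hypothetical toy data: the label is the XOR of features 0 and 1; feature 2 is noise.
rows = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0),
        (0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
labels = [0, 1, 1, 0, 0, 1, 1, 0]

contributions = {f: consistency_contribution(rows, labels, [0, 1, 2], f) for f in (0, 1, 2)}
print(contributions)
```

On this toy data the two interacting features each receive a non-zero contribution while the noise feature receives zero, even though neither interacting feature predicts the label on its own. That is the kind of interaction a single-locus test would miss.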
Keywords/Search Tags: Feature Interaction, Consistency Contribution Rate, NIFS, MapReduce