| High-dimensional, small-sample data are common in many fields, such as bioinformatics. However, far less research addresses this kind of problem than addresses ‘big data’. Research on high-dimensional, small-sample problems often requires combining the relevant theoretical background from the application field. This thesis provides general ideas for studying such problems by exploring the pathogenic loci of genetic diseases. The thesis is divided into three parts, each exploring the relationship between genetic loci and genetic diseases. In the first part, we use known disease information to explore the pathogenic loci, which is essentially a feature selection problem. We apply filter methods (chi-square test and information gain), a wrapper method (stepwise logistic regression) and embedded methods (LASSO regression and random forest) for feature selection. A "voting" method is then used to fuse the outputs of these methods into the final feature selection result. In the second part, we use the selected features to build a model for disease prediction. We first propose a numerical simulation method based on "crossover" and "mutation" to expand the sample. We then establish a logistic regression model, a random forest model and an adaptive neural network model. The results show that the random forest and the adaptive neural network achieve better prediction performance than logistic regression. The third part addresses how to determine pathogenesis using multiple traits related to genetic diseases. This part is important when disease information or other sample labels are absent. A common method for studying this type of problem is canonical correlation analysis (CCA). Considering that the relationship between a genetic locus and a trait may be non-linear, we introduce the kernel canonical correlation analysis (KCCA) algorithm as an improvement. The results show that when the sample label is unknown, we can introduce other related variables and select features by examining the degree of correlation, linear or non-linear, between each feature and these variables. |
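
The "voting" fusion described in the first part can be illustrated with a minimal sketch. It is not the thesis implementation: it uses only the two filter scores (chi-square and information gain) on synthetic genotype data, and the names `vote_select`, `k` and `min_votes`, as well as the rule "keep a locus if at least `min_votes` methods rank it in their top `k`", are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic between a categorical feature and the labels."""
    n = len(labels)
    f_counts, y_counts = Counter(feature), Counter(labels)
    joint = Counter(zip(feature, labels))
    stat = 0.0
    for f, nf in f_counts.items():
        for y, ny in y_counts.items():
            expected = nf * ny / n
            stat += (joint.get((f, y), 0) - expected) ** 2 / expected
    return stat


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by conditioning on the feature."""
    n = len(labels)
    conditional = 0.0
    for f, nf in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == f]
        conditional += (nf / n) * entropy(subset)
    return entropy(labels) - conditional


def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def vote_select(X, y, k=5, min_votes=2):
    """Keep features that at least `min_votes` scoring methods place in their top k.

    Assumed voting rule for illustration; the thesis may weight or combine
    the individual methods differently.
    """
    columns = list(zip(*X))
    votes = Counter()
    for scorer in (chi_square_score, information_gain):
        scores = [scorer(col, y) for col in columns]
        votes.update(top_k(scores, k))
    return sorted(j for j, v in votes.items() if v >= min_votes)


# Synthetic example (hypothetical data): 40 samples, 30 loci coded 0/1/2.
# The label depends only on locus 3, mimicking a single pathogenic locus.
random.seed(0)
X = [[random.randint(0, 2) for _ in range(30)] for _ in range(40)]
y = [1 if row[3] >= 1 else 0 for row in X]

selected = vote_select(X, y, k=5, min_votes=2)
print(selected)
```

Requiring agreement between methods is one simple way to damp the instability of any single ranking when samples are few; wrapper and embedded methods could be added to the loop as further voters in the same way.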