Exploration Of Pathogenic Loci Of Genetic Diseases And Research On A Kind Of High-dimensional Small Sample Problem

Posted on:2021-02-05

Degree:Master

Type:Thesis

Country:China

Candidate:B C Liu

Full Text:PDF

GTID:2514306455481864

Subject:Applied Statistics

Abstract/Summary:

High-dimensional small sample data are common in many fields, such as bioinformatics. However, this kind of problem has received far less research attention than "big data". Research on high-dimensional small sample data often requires drawing on the theoretical background of the application domain. The thesis provides general ideas for studying high-dimensional small sample problems by exploring the pathogenic loci of genetic diseases. It is divided into three parts that explore the relationship between genetic loci and genetic diseases. In the first part, we use known disease information to explore the pathogenic loci, which is essentially a feature selection problem. The thesis uses filter methods (chi-square test and information gain), a wrapper method (stepwise logistic regression) and embedded methods (LASSO regression and random forest) for feature selection. A "voting" method fuses the results of these methods to produce the final feature selection. In the second part, we use the selected features to build a model for predicting disease. We first propose a numerical simulation method based on "crossover" and "mutation" to expand the sample. We then fit a logistic regression model, a random forest model and an adaptive neural network model. The results show that the random forest and the adaptive neural network predict better than logistic regression. The third part is about how to determine the pathogenesis using multiple traits related to genetic diseases. This part matters when disease information or other sample labels are unavailable. A common method for this type of problem is canonical correlation analysis (CCA). Because the relationship between a genetic locus and a trait may be non-linear, we introduce kernel canonical correlation analysis (KCCA) as an improvement. The results show that when the sample label is unknown, we can introduce other related variables and select features by examining the degree of correlation, linear or non-linear, between each feature and these variables.
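
The "voting" fusion mentioned in the first part of the abstract can be illustrated with a minimal sketch. The abstract does not give the voting rule, so the threshold used here (a feature is kept if at least `min_votes` methods selected it) is an assumption, as are the function name `vote_select` and the locus identifiers, which are hypothetical placeholders and not data from the thesis.

```python
from collections import Counter


def vote_select(selections, min_votes):
    """Keep features chosen by at least `min_votes` selection methods.

    `selections` maps a method name to the list of features it selected.
    The result is ordered by vote count (descending), then by name.
    The threshold rule is an assumption; the abstract only says "voting".
    """
    votes = Counter()
    for chosen in selections.values():
        # set() ensures each method casts at most one vote per feature
        votes.update(set(chosen))
    kept = [feature for feature, count in votes.items() if count >= min_votes]
    return sorted(kept, key=lambda feature: (-votes[feature], feature))


# Hypothetical outputs of the five selection methods named in the abstract.
selections = {
    "chi_square": ["rs101", "rs205", "rs330"],
    "information_gain": ["rs101", "rs205", "rs412"],
    "stepwise_logistic": ["rs101", "rs330"],
    "lasso": ["rs101", "rs205"],
    "random_forest": ["rs205", "rs330", "rs518"],
}

selected = vote_select(selections, min_votes=3)
print(selected)
```

Raising `min_votes` makes the fused selection stricter; with five methods, `min_votes=3` corresponds to a simple majority.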

Keywords/Search Tags:

High-dimensional small sample, Feature selection, LASSO, Random Forest, Adaptive Neural Network, KCCA

Related items

1	Research On Feature Selection Method For Chinese Medicine Metabolomics Data Based On Lasso
2	Research On Machine Learning Method Of High Dimensional Small Sample (Medical) Data
3	The Application Of Random Forest And Support Vector Machine In High Dimensional Transcriptome Data Of Breast Cancer
4	Prediction Of Local Recurrence Of Head And Neck Cancer Unimodality Based On Small Sample And High-dimensional Gene Expression Data
5	The Research Of Classification Method Of Tumor Data Based On BP Neural Network
6	Selection Of Tb Susceptible Genes Based On Improved Random Forest Algorithm
7	The Application Of Random Survival Forest In High Dimensional Genomic Data Of Cancer
8	Statistical Learning Based On Thyroid Cancer Staging Characteristic Genes And Prognostic Genes Selection Study
9	Research On Swarm Intelligence Feature Selection Algorithm For Small Sample (Medical) Data
10	Application Of Multi-objective Fuzzy Clustering Method In Cancer Classification