The Application Of Random Forest And Support Vector Machine In High Dimensional Transcriptome Data Of Breast Cancer

Posted on:2020-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:Z W Guo

GTID:2404330590964965

Subject:Epidemiology and Health Statistics

Abstract/Summary:

Objective: Random forest and support vector machine (SVM) algorithms were applied to breast cancer gene expression data to screen for genes differentially expressed between triple-negative and non-triple-negative breast cancer, providing additional candidate targets for clinical diagnosis and new drug development.

Methods: 1. Gene expression data for 989 breast cancer patients were obtained from TCGA, with 60483 genes measured per patient. 2. Dimensionality was reduced by two methods: t-tests with FDR control, and random forest. The genes retained by each method were then ranked by importance using SVM, SVM recursive feature elimination (SVM-RFE) and random forest. Random forest and SVM classifiers were used for variable selection, with forward selection adding variables step by step in order of importance. Cross-validation was used to choose the feature set that predicted triple-negative breast cancer most accurately as the final feature subset. 3. Data processing and analysis were carried out in R 3.5.1, mainly with the randomForest, e1071 and sigFeature packages, together with caret.

Results: 1. After t-test filtering with FDR control, 18702 genes remained; after random forest dimensionality reduction with ntree set to 100000, 6326 genes remained. 2. With random forest used for both ranking and modeling, every evaluation index of the model was optimal when 8 variables were selected. 3. With random forest ranking and SVM modeling, the Youden index and recall reached their maximum when a single variable was selected, indicating that the ESR1 gene has a strong influence on triple-negative breast cancer. Overall model performance was best with 8 selected variables for the FDR-reduced data and with 5 selected variables for the random-forest-reduced data. 4. Random forest importance ranking was applied after each of the two dimensionality reduction methods (t-test and random forest), giving two models. Six of the top 8 genes were the same in both models, and the remaining two still ranked near the top in both. 5. The highest Youden index was 0.8271 for the gene importance ranking model based on SVM-RFE and 0.8392 for the model based on SVM weights (w^2). Both methods were less effective than the random forest method based on the decrease in Gini index. 6. The evaluation indexes for random forest dimensionality reduction were inferior to those for t-test FDR dimensionality reduction. As a prediction model, SVM had much higher recall than random forest, whereas random forest had higher accuracy than SVM; overall, SVM classified better than random forest.

Conclusions: 1. Judged by the model evaluation indexes, t-test-based FDR dimensionality reduction performed better than random forest dimensionality reduction in this study. 2. Gene importance rankings based on the random forest importance score were more stable and accurate than those based on SVM and SVM-RFE. 3. In binary classification models of gene expression data, SVM was better than random forest on each evaluation index. 4. For variable selection in high-dimensional gene expression data, a reasonable approach is to apply t-test-based FDR dimensionality reduction first, then rank variable importance with random forest, and finally build the prediction model with SVM. Most of the genes selected in this way were related to the diagnosis, metastasis or poor prognosis of cancer.
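
The two quantities the abstract leans on most, FDR-based filtering of per-gene t-test p-values and the Youden index used to compare models, can be illustrated with a minimal Python sketch. It is not the thesis's code (the work itself was done in R). The abstract does not say which FDR procedure was used, so the Benjamini-Hochberg adjustment here is an assumption, as are the 0.05 cutoff, the gene names, the p-values and the confusion-matrix counts, all of which are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used as the FDR procedure; the abstract does not name one.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def youden_index(tp, fn, tn, fp):
    """Youden's J = sensitivity (recall) + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical genes and t-test p-values (illustrative, not from the thesis).
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
p_values = [0.001, 0.2, 0.03, 0.8]

q_values = benjamini_hochberg(p_values)
# Keep genes whose adjusted p-value is below an assumed FDR cutoff of 0.05.
kept = [g for g, q in zip(genes, q_values) if q < 0.05]

# Hypothetical confusion-matrix counts for a triple-negative classifier.
j = youden_index(tp=90, fn=10, tn=80, fp=20)

print(kept)
print(round(j, 4))
```

In a full pipeline, the genes that pass the filter would next be ranked by importance and added one at a time to a classifier, with the Youden index computed under cross-validation at each step.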

Keywords/Search Tags:

High dimensional transcriptome data, Random forest, Support vector machines, Recursive feature elimination, Forward variable selection

Related items

1	Gastric Cancer Characteristic Gene Selection And Survival Analysis Based On Gene Expression Data
2	Support Vector Data Description-based Feature Selection Method And Its Application
3	Selection Of Tb Susceptible Genes Based On Improved Random Forest Algorithm
4	Analysis Of Cancer Gene Data Base On Random Forest And Support Vector Machine
5	Principle Component Analysis And Recursive Feature Elimination Based Support Vector Machine Classification Methods Research
6	Study On The Calculation Method Of Individualized Medical
7	Variable Selection Methods Based On Variable Importance Measurement From Random Forest And Its Application In Diagnosis Of Tumor Typing
8	Research On Risk Prediction Of Diabetes Based On Random Forest And Support Vector
9	Statistical Learning Based On Thyroid Cancer Staging Characteristic Genes And Prognostic Genes Selection Study
10	Study On The Prediction Of Intracranial Hypertension Based On Waveform Feature Extraction And Support Vector Machines Classification