The Application of Random Forest and Support Vector Machine in High-Dimensional Transcriptome Data of Breast Cancer

Posted on: 2020-01-15  Degree: Master  Type: Thesis
Country: China  Candidate: Z W Guo
GTID: 2404330590964965  Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: Random forest and support vector machine (SVM) algorithms were applied to breast cancer gene expression data to screen for genes differentially expressed between triple-negative and non-triple-negative breast cancer, providing additional reference targets for clinical diagnosis and new drug development.

Methods: 1. Gene expression data for 989 breast cancer patients were obtained from TCGA; 60483 genes were measured in each patient. 2. The t-test and random forest were each used for dimensionality reduction. SVM, support vector machine recursive feature elimination (SVM-RFE) and random forest were then used to rank the importance of the genes retained by each of the two reduction methods. Random forest and SVM classifiers were used for variable selection, with a forward variable selection method that added variables step by step in order of importance. Cross-validation was used to choose, as the final feature subset, the feature set that predicted triple-negative breast cancer most accurately. 3. R 3.5.1 was used for data processing and analysis. The main packages were randomForest, e1071 and sigFeature, with caret as a supporting package.

Results: 1. After t-test dimensionality reduction with FDR control, 18702 genes remained. After random forest dimensionality reduction with ntree set to 100000, 6326 genes remained. 2. With random forest used for both ranking and modeling, every evaluation index of the model was optimal when 8 variables were selected. 3. With random forest ranking and SVM modeling, the Youden index and recall reached their maximum when a single variable was selected, indicating that the ESR1 gene had a great impact on triple-negative breast cancer. The overall evaluation of the model was best with 8 selected variables after FDR dimensionality reduction and with 5 selected variables after random forest dimensionality reduction. 4. Random forest importance ranking was applied after each of the two reduction methods (t-test and random forest), giving two models. Six of the top 8 genes were the same in both models, and the remaining two genes also ranked near the top in both. 5. The highest Youden index was 0.8271 for the gene importance ranking model based on SVM-RFE and 0.8392 for the model based on the SVM w^2 criterion. Both methods were less effective than the random forest method based on mean decrease in the Gini index. 6. The evaluation indexes obtained after random forest dimensionality reduction were inferior to those obtained after t-test FDR dimensionality reduction. As predictive models, the SVM had a much higher recall than the random forest and the random forest had a higher accuracy than the SVM, but on the whole the SVM classified better than the random forest.

Conclusions: 1. In this study, t-test-based FDR dimensionality reduction performed better than random forest dimensionality reduction according to the model evaluation indexes. 2. Gene importance rankings based on the random forest importance score were more stable and accurate than those based on SVM and SVM-RFE. 3. In binary classification models of gene expression data, every evaluation index of the SVM was better than that of the random forest. 4. For variable selection in high-dimensional gene expression data, a reasonable approach is to apply t-test FDR dimensionality reduction first, then rank variable importance with random forest, and finally build the prediction model with an SVM. Most of the genes selected by this approach were related to the diagnosis, metastasis or poor prognosis of cancer.
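
The forward variable selection step and the Youden index used as an evaluation index can be illustrated with a minimal sketch. It is written in plain Python rather than the R packages named in the abstract, and it is not the thesis code. The names `youden_index`, `forward_select` and `score_subset` are hypothetical; `score_subset` stands in for whatever cross-validated evaluation of a classifier (for example an SVM) is used, and the gene names other than ESR1 are placeholders.

```python
from typing import Callable, List, Sequence, Tuple


def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden's J = sensitivity + specificity - 1, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def forward_select(
    ranked_genes: Sequence[str],
    score_subset: Callable[[Sequence[str]], float],
    max_genes: int,
) -> Tuple[List[str], float]:
    """Add genes one at a time in importance order and keep the best-scoring prefix.

    ranked_genes : genes sorted from most to least important
                   (assumed to come from an earlier ranking step).
    score_subset : hypothetical callback returning an evaluation score
                   (e.g. a cross-validated Youden index) for a gene subset.
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(1, min(max_genes, len(ranked_genes)) + 1):
        subset = list(ranked_genes[:k])
        score = score_subset(subset)
        if score > best_score:  # strict comparison: ties keep the smaller subset
            best_subset, best_score = subset, score
    return best_subset, best_score


if __name__ == "__main__":
    # Toy scorer: the score depends only on subset size (illustration only).
    toy_scores = {1: 0.5, 2: 0.8, 3: 0.8}
    genes = ["ESR1", "GENE_B", "GENE_C"]
    print(forward_select(genes, lambda s: toy_scores[len(s)], max_genes=3))
    print(youden_index(tp=90, fn=10, tn=80, fp=20))
```

Because ties keep the smaller subset, the sketch favors the most compact gene set among equally scoring candidates; whether the thesis applied the same tie-breaking rule is not stated.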
Keywords/Search Tags:High dimensional transcriptome data, Random forest, Support vector machines, Recursive feature elimination, Forward variable selection