The Application Of Random Survival Forest In High Dimensional Genomic Data Of Cancer

Posted on: 2019-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Feng
Full Text: PDF
GTID: 2394330566479393
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: Random Survival Forest (RSF) is a machine learning method that extends Random Forest to survival analysis. This study applies RSF to gene expression data on lymph node metastasis in breast cancer patients in order to evaluate how well RSF performs on high-dimensional cancer genomic data.
Method: 1. The data come from the DNA microarray study of primary breast tumors by Van't Veer et al. [1] in the Netherlands. This study selected 78 patients who had no lymph node metastasis at the time of selection; each case has 4751 genes. 2. Random Survival Forest analysis, Cox regression analysis and ROC analysis were performed in R 3.4.3 using the randomForestSRC, survival and survivalROC packages. 3. The data were randomly divided into a training set (2/3) and a test set (1/3). RSF models were fitted repeatedly at different values of ntree, and the optimal value was selected. An RSF model was then built with this optimal parameter, and the importance of each variable was evaluated. Forward variable selection was performed in descending order of variable importance, and the most influential loci were then selected from the 4751 genes by running the RSF algorithm again. On the selected data set, a traditional Cox regression model was used to analyze the influencing factors. Finally, cross-validation was used to draw ROC curves for the Cox regression model containing the statistically significant gene loci, and the model was evaluated by its average AUC.
Results: 1. The optimal ntree for the RSF model was 10000. 2. RSF selected the 25 most influential gene loci for breast cancer metastasis. 3. The Cox regression model finally identified 9 statistically significant gene loci. The protective loci were NM015955, NM003748, Contig43983RC and AB020713; the risk loci were NM000436, NM001204, Contig55574RC, NM018964 and Contig37562RC. 4. AUC decreased as time increased, but the cross-validated AUC remained above 0.85, indicating that the model was reliable.
Conclusion: 1. As the number of survival trees in the RSF model increases, the error rate decreases and tends to stabilize. The value of ntree should be adjusted several times to find the optimal parameter. 2. For gene expression data on lymph node metastasis in breast cancer patients, the variables selected by RSF give high prediction accuracy. The error rate on the test set was lower than on the training set, which indicates good generalization ability. 3. Combining the RSF model with Cox regression can handle high-dimensional survival data effectively. The RSF model selects a set of important variables suitable for traditional Cox regression analysis. Analyzing this set with the Cox regression model identifies the meaningful variables and specifies whether each variable is protective or harmful with respect to the endpoint event.
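The "error rate" used above to compare ntree values can be illustrated with a short sketch. It assumes the error rate is 1 minus Harrell's concordance index, which is how randomForestSRC reports prediction error for survival forests. The thesis itself used R; the Python function names and the toy data below are hypothetical, and tied event times are simply skipped, which is a simplification.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of usable pairs in which the subject with the
    shorter observed time also has the higher predicted risk."""
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is usable only if the shorter time is an observed event.
        # Tied times are skipped here for simplicity (an assumption).
        if times[a] == times[b] or not events[a]:
            continue
        usable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


def error_rate(times, events, risk_scores):
    """Prediction error as 1 - C; 0.5 corresponds to random guessing."""
    return 1.0 - concordance_index(times, events, risk_scores)


# Toy data (hypothetical): follow-up times, event flags (1 = event, 0 = censored).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(error_rate(times, events, [0.9, 0.7, 0.4, 0.1]))  # risk ordered correctly
print(error_rate(times, events, [0.5, 0.5, 0.5, 0.5]))  # uninformative risk
```

Under this definition, an error rate that falls and then flattens as ntree grows means that the forest's risk ranking has stopped improving, which matches the stabilization described in the conclusion.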
Keywords/Search Tags: High dimension, Genome, Random Forest, Random Survival Forest, Cox regression model, AUC