
Analysis Of Machine Learning In Survival Analysis

Posted on: 2023-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Jiao
GTID: 2544307058997639
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objectives: Time-to-event data are common in medical research, and survival analysis is a statistical method that combines two factors: the follow-up outcome and the follow-up time of the research subject. The traditional statistical methods mainly include the log-rank test, Cox regression and parametric regression. In recent years, with the development of machine learning, some studies have applied machine learning methods to survival data. This study conducted a benchmark experiment based on two research questions. Question 1: How does the performance of traditional statistical models (Cox regression models only) and machine learning survival models vary with the conditions of simulated survival datasets, including the strength of associations between variables, the censored proportion, the sample size, the number of variables, and failure to meet the proportional hazards assumption? Question 2: Do machine learning-based models provide more accurate predictions than traditional models on survival data? The benchmark experiment compares the traditional Cox proportional hazards regression model (CPH) with three types of machine learning modeling methods (penalized Cox regression, random forest-based methods and boosting-based methods). The Cox regression model (CPH), Elastic Net Cox regression model (EN-Cox), Random Survival Forest (RSF), Gradient Boosting Machine (GBM) and extreme gradient boosting (XGBoost) were built to answer the two research questions. The performance differences between the machine learning models and the Cox regression model were explored through data simulation and benchmark experiments.

Material and methods: The simulation study used the Weibull distribution with different censoring ratios (20%, 50%, 80%), different association strengths (0, 0.5, 0.8), different sample sizes (100, 200, 400), different numbers of variables (8, 30, 50), non-proportional hazards and other conditions, for a total of 89 scenarios. K-fold cross-validation was used to evaluate the models constructed by the five algorithms, with Harrell’s C-index as the performance measure. Simulation studies I and II were each simulated 500 times, and simulation study III was simulated 100 times. The interactive analysis used 3 real datasets from R packages (diabetic, mgus, ova), 3 datasets from Python packages (breast_cancer, gbsg2, whas500), and 5 TCGA cancer RNAseq (HTSeq-Counts) datasets (TCGA-BRCA, TCGA-HNSC, TCGA-KIRC, TCGA-LIHC, TCGA-STAD); the 4 machine learning-based models were then compared with the Cox proportional hazards model. Nested cross-validation was also used to evaluate the performance of the 5 models based on Harrell’s C-index. In addition, global and local interpretation were performed on the final model, which was constructed using a train/test split. This study used R v4.1.0 and Python v3.8.8 for data processing, data analysis and visual interpretation.

Results: The simulation results showed that the performance of the 5 models under the Weibull distribution increased as the censoring rate increased. The CPH model performed best under the different simulation conditions without requiring model tuning. Even on survival data that did not satisfy the proportional hazards assumption, the CPH and EN-Cox models outperformed the other three machine learning models when there were few variables. The other three machine learning models nevertheless have merits: once the number of variables reached a certain level (m = 30 or 50), survival data with strong correlation between variables were better suited to machine learning methods. The results of the interactive analysis showed that, overall, the CPH model was relatively unstable, and the EN-Cox model did not fit well on some datasets, with a C-index of 0.5. The performance of the remaining 3 machine learning models was relatively close, while GBM-Cox was more stable on some datasets. XGBoost scored slightly lower than GBM-Cox on the TCGA datasets. The performance of RSF and GBM-Cox was relatively close; RSF appeared better on some datasets but was often accompanied by instability. In addition, the results of Friedman’s test on the five models across different numbers of datasets were not statistically significant, meaning that there was not enough evidence to show that the algorithms differed statistically in prediction accuracy. Finally, a comparison of the visual interpretation of the results of the 5 models was provided. The SHAP summary plots show the contribution of each feature in the best model, trained on the training set, to the model output on the test set. This is very useful for checking the impact of relevant features on the risk of cancer death in cancer prognosis research. The partial dependence plots can provide information about the distribution of features. The SHAP library also identifies nonlinear interactions and shows how salient features interact with other features, capturing and quantifying the magnitude of joint contributions, which helps to complement clinical intuition for the risk stratification of patients. By comparing the SHAP values of different models with similar performance, it is possible to check whether the models have learned consistent information from the training data.

Conclusions: Based on the results of the benchmark experiment (simulations and interactive analysis), the CPH model could in theory outperform the machine learning-based models. In complex real-world scenarios, however, the CPH model was not always better than the machine learning-based models. Furthermore, machine learning-based models, which do not rely on many assumptions, have great application value. This study also highlighted the need to apply interpretability methods, through which SHAP values can provide more multifaceted and valuable insights for survival analysis. Comparing the SHAP values of different models with similar performance makes it possible to check whether the models have learned consistent information from the training data.
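
The two building blocks named in the methods, Weibull-distributed survival times with censoring and Harrell’s C-index, can be illustrated with a minimal Python sketch. This is not the thesis code: the function names, the Weibull shape and scale, the exponential censoring mechanism and the censoring rate are all assumptions chosen for illustration, and the abstract does not state the values actually used.

```python
import math
import random


def simulate_weibull(n, betas, shape=1.5, scale=0.1, censor_rate=0.05, seed=0):
    """Simulate right-censored data under a Weibull proportional hazards model.

    All parameter values are illustrative assumptions, not the thesis settings.
    Returns a list of (observed_time, event_observed, covariates) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in betas]
        lin_pred = sum(b * xi for b, xi in zip(betas, x))
        # Inverse-transform sampling from
        # S(t | x) = exp(-scale * exp(lin_pred) * t**shape)
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        event_time = (-math.log(u) / (scale * math.exp(lin_pred))) ** (1.0 / shape)
        # Assumed censoring mechanism: independent exponential censoring times
        censor_time = rng.expovariate(censor_rate)
        rows.append((min(event_time, censor_time), event_time <= censor_time, x))
    return rows


def harrell_c_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]. It is concordant when the subject who failed earlier
    has the higher risk score; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    true_betas = [1.0, -0.5]
    data = simulate_weibull(300, true_betas, seed=1)
    times = [row[0] for row in data]
    events = [row[1] for row in data]
    # Score each subject with the true linear predictor, standing in for a fitted model
    risks = [sum(b * xi for b, xi in zip(true_betas, row[2])) for row in data]
    print("censored fraction:", 1 - sum(events) / len(events))
    print("C-index:", harrell_c_index(times, events, risks))
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the value of 0.5 reported for EN-Cox on some datasets indicates a model with no discriminative ability. In a benchmark like the one described, the risk scores would come from each fitted model on held-out folds rather than from the true coefficients.
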
Keywords/Search Tags: Survival analysis, Cox proportional hazards model, Elastic Net Cox regression, Random Survival Forest, Gradient Boosting Machine, XGBoost