| Purpose: Machine learning is increasingly applied in medicine. Random survival forest (RSF), an extension of random forest to survival analysis, is one of its most representative algorithms. This study compares the performance of RSF and traditional Cox proportional hazards regression in constructing a prognostic model for non-metastatic colorectal adenocarcinoma, and compares the variable selection ability of RSF and Lasso. The better methods are then used to select variables and construct a predictive model, and the value of its clinical application is evaluated. Methods: Data were obtained from the SEER (Surveillance, Epidemiology, and End Results) database of the National Cancer Institute of the United States, using the dataset "Incidence - SEER 18 Custom Data (with additional treatment fields), Nov 2018 Sub (1975-2016 varying)". A total of 13,866 patients with first-onset colorectal adenocarcinoma without distant metastasis, who underwent surgical treatment and were pathologically diagnosed from 2010 to 2011, were included. The Kaplan-Meier method was used for univariate analysis, and eligible variables were selected for subsequent model construction. Prognostic models were constructed with RSF and with Cox proportional hazards regression; their out-of-bag error rates and integrated Brier scores were compared, and the better method was selected for final modeling. Following the variable importance ranking given by RSF, variables were eliminated step by step by backward selection, and Cox regression models were constructed successively. Following the order in which Lasso shrinks the regression coefficient of each independent variable to 0 and removes it from the model, a comparable series of models was constructed; the two series were compared, and the better was selected as the 
variable selection method. Provided that model performance was not significantly affected, the more important variables were selected to construct the final model, and a nomogram was drawn to predict the patient's risk of death at 1, 3, and 5 years. Finally, the model was validated internally and externally, and the time-dependent receiver operating characteristic curve (tdROC) and calibration plot were used to evaluate its performance and generalization ability. Results: A total of 13,866 patients were included in the study, of whom 4,385 died; the median follow-up time was 70 months. In univariate analysis, gender had no significant effect on prognosis; all other variables were included in subsequent model construction. In non-metastatic colorectal adenocarcinoma, the Cox proportional hazards regression model had a lower out-of-bag error rate and a smaller integrated Brier score than the random survival forest model, indicating higher discrimination and accuracy, and was therefore the better modeling method. In addition, the variable importance ranking given by random survival forest was more concise and accurate than the ranking obtained from Lasso coefficient shrinkage. From high to low, the ranking was: age, lymph node ratio, T stage, carcinoembryonic antigen, tumor deposit, chemotherapy status, marital status, perineural invasion, pathological type, tumor differentiation, race, tumor size, and tumor location. Random survival forest was therefore used to select variables, under the premise that model performance was not significantly affected. Based on age, lymph node ratio, T stage, carcinoembryonic antigen, tumor deposit, marital status, and chemotherapy status, a Cox regression model was constructed and a nomogram was drawn to visualize it. Internal validation showed that the area under the curve (AUC) of the tdROC was 0.793, 0.769, and 0.753 at 1, 3, and 5 years respectively, and the corresponding calibration 
plot performed well, with Brier scores of 0.055, 0.125, and 0.169 respectively. In external validation, the AUC was 0.82, 0.789, and 0.766 at 1, 3, and 5 years respectively. The corresponding calibration plot also performed well, with Brier scores of 0.045, 0.127, and 0.177 respectively. Taken together, internal and external validation indicate that the model's predictive ability is stable and reliable and that it generalizes well. Conclusion: 1. Although machine learning models represented by random survival forest are widely used in many fields and perform well, Cox proportional hazards regression outperformed random survival forest in constructing prognostic models for non-metastatic colorectal adenocarcinoma. 2. Random survival forest can concisely and accurately measure the contribution of each variable to the model. Lymph node ratio, tumor deposit, marital status, and carcinoembryonic antigen are of great prognostic value; in particular, the prognostic value of lymph node ratio surpasses that of T stage, which deserves further study. 3. We used random survival forest for variable selection and constructed a Cox proportional hazards regression model for non-metastatic colorectal adenocarcinoma based on age, lymph node ratio, T stage, carcinoembryonic antigen, tumor deposit, marital status, and chemotherapy status, and drew a nomogram to make it easier to use. The model performs well on internal and external data and has some clinical application value. |
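
The out-of-bag error rate used to compare the two models can be illustrated with a small example. In common random survival forest software the error rate is reported as 1 minus Harrell's concordance index; the abstract does not say which implementation was used, so that definition is an assumption here. The function names and the toy data below are hypothetical and are not taken from the study. This is a minimal sketch that skips pairs with tied survival times, not a full implementation.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk score orders correctly.

    times  -- observed follow-up times
    events -- 1 if death was observed, 0 if censored
    risks  -- predicted risk scores (higher means worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed death.
        # Simplification (assumption): pairs with tied times are skipped.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier death has the higher predicted risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def error_rate(times, events, risks):
    """Error rate in the 1 - C sense assumed in the lead-in."""
    return 1.0 - concordance_index(times, events, risks)


# Made-up toy data: four patients, the second one censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 0, 1, 1]
toy_risks = [0.9, 0.4, 0.1, 0.2]
print(error_rate(toy_times, toy_events, toy_risks))  # 0.25
```

In the toy data, four pairs are comparable and three are ordered correctly, so C is 0.75 and the error rate is 0.25. A lower error rate means better discrimination, which is the sense in which the two models are ranked.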
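
The univariate analysis relies on the Kaplan-Meier estimator, which multiplies, over each distinct death time, the fraction of at-risk patients who survive that time: S(t) = Π (1 - d_i / n_i). The following is a minimal pure-Python sketch of that product; the function name and the toy data are illustrative assumptions, not the study's code or data, and no confidence intervals or log-rank test are included.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death.

    times  -- observed follow-up times
    events -- 1 if death was observed, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censorings at t
        if deaths:
            # Conditional probability of surviving past t given being at risk at t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Everyone observed at t (dead or censored) leaves the risk set afterwards.
        at_risk -= leaving
    return curve


# Made-up toy data: five patients, two of them censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 3))
```

Censored patients contribute to the risk set up to their last follow-up but never trigger a drop in the curve, which is why the estimator is suited to follow-up data such as that described in the Methods.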