Objective: Hepatic encephalopathy is an extremely serious complication of cirrhosis and the most common cause of death in various liver diseases. Its clinical manifestations are complex and diverse, its prognosis is poor and its cure rate is low, making it one of the main reasons for the low survival rate of patients with cirrhosis. It is therefore clinically important to build a risk prediction model for hepatic encephalopathy in patients with cirrhosis. Because the incidence of hepatic encephalopathy in cirrhosis is low, the clinical data are class-imbalanced, and traditional machine learning and statistical models perform poorly at identifying the positive minority class. This paper aims to construct a risk prediction model for cirrhosis complicated by hepatic encephalopathy based on resampling and ensemble learning, and to explore the effectiveness of combining resampling techniques with heterogeneous ensemble algorithms for modelling class-imbalanced data, in order to provide a basis for the prevention and early identification of hepatic encephalopathy in patients with cirrhosis.

Methods: Data were collected from patients with cirrhosis who had complete medical records in the Department of Gastroenterology at the First Affiliated Hospital of Shanxi Medical University between January 2006 and December 2015.
(1) First, multivariable logistic regression, SVM-RFE and Elastic Net were used for initial screening of the feature variables, selecting those significantly associated with cirrhosis complicated by hepatic encephalopathy.
(2) For the feature sets retained after initial screening, resampling techniques including SMOTE, Borderline-SMOTE, SVM-SMOTE and SMOTE-Tomek links were used to address the class imbalance, and homogeneous ensemble classification algorithms such as random forest (RF), gradient boosting decision tree (GBDT) and extreme gradient boosting (XGBoost) were
combined to construct classification models for cirrhosis complicated by hepatic encephalopathy. The homogeneous ensemble classifiers were evaluated in terms of accuracy, precision, recall, F1 score and area under the ROC curve (AUC), and compared with the single classifiers support vector machine (SVM), logistic regression and naive Bayes, to select the three models with the best overall performance.
(3) Finally, a Stacking heterogeneous ensemble model was constructed using these three models as base classifiers and logistic regression and a multilayer perceptron (MLP) as meta-classifiers. The same performance metrics were used to identify the optimal model for risk prediction of cirrhosis complicated by hepatic encephalopathy.

Results:
1. Sixty-eight of the 950 patients with cirrhosis developed hepatic encephalopathy, an incidence of 7.16%. Initial screening of 24 variables by logistic regression, SVM-RFE and Elastic Net retained a cumulative total of 14 variables. Seven variables were retained by all three methods: hepatorenal syndrome, depression, elevated total bilirubin, prolonged prothrombin time, infection, electrolyte disturbance and hepatogenic diabetes mellitus. The correlation coefficients among the variables in the feature sets retained by the three methods were all below 0.35.
2. Base classifier selection with resampling: the feature variables retained by SVM-RFE were more reasonable and gave better modelling performance than those retained by logistic regression and Elastic Net. Classification models built after resampling with SMOTE, Borderline-SMOTE, SVM-SMOTE or SMOTE-Tomek links performed better overall than the model built on the imbalanced data, with the SVM-SMOTE method being the
best. The models built with the ensemble classification algorithms RF, GBDT and XGBoost outperformed those built with the single classification algorithms SVM, logistic regression and naive Bayes.
3. Stacking heterogeneous ensemble model: the Stacking model with RF, GBDT and XGBoost as base classifiers and MLP as meta-classifier gave the best risk prediction for cirrhosis complicated by hepatic encephalopathy under SVM-RFE feature screening and SVM-SMOTE resampling, with an AUC of 0.956, accuracy of 0.879, precision of 0.841, recall of 0.932 and F1 score of 0.886.

Conclusion:
1. Logistic regression, SVM-RFE and Elastic Net, applied to the initial screening of 24 variables, retained slightly different variable sets. Seven variables were common to all three: hepatorenal syndrome, depression, total bilirubin, prothrombin time, infection, electrolyte disturbance and hepatogenic diabetes mellitus. With model performance as the evaluation criterion, the feature variables retained by SVM-RFE were more reasonable and gave better modelling performance than those retained by logistic regression and Elastic Net.
2. Classification models built after resampling with SMOTE, Borderline-SMOTE, SVM-SMOTE or SMOTE-Tomek links outperformed the model built on the imbalanced data, with SVM-SMOTE performing best.
3. The models built with the homogeneous ensemble classification algorithms RF, GBDT and XGBoost outperformed those built with the single classification algorithms SVM, logistic regression and naive Bayes, with RF performing best.
4. The Stacking heterogeneous ensemble model with RF, GBDT and XGBoost as base classifiers and MLP as meta-classifier gave the best risk prediction for cirrhosis complicated by hepatic encephalopathy under SVM-RFE feature screening and SVM-SMOTE resampling
techniques.
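
The resampling methods compared above all build on SMOTE-style interpolation: new minority samples are placed on the line segment between an existing minority sample and one of its nearest minority neighbours. The following is a minimal sketch of that idea, assuming plain SMOTE rather than the SVM-SMOTE variant the study found best, and using toy two-feature data that only mirrors the cohort's class sizes (68 positive, 882 negative). The function name `smote`, its parameters and the random features are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (plain SMOTE, not SVM-SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data with the cohort's class sizes only (68 positive, 882 negative).
# The two features are random numbers, not patient variables.
data_rng = random.Random(42)
positives = [[data_rng.gauss(2.0, 1.0), data_rng.gauss(2.0, 1.0)] for _ in range(68)]
negatives = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(882)]

# Oversample the minority class until both classes have the same size.
new_points = smote(positives, n_new=len(negatives) - len(positives))
balanced_positives = positives + new_points
print(len(balanced_positives), len(negatives))  # prints: 882 882
```

In a modelling workflow, oversampling of this kind is normally applied to the training portion only, so that synthetic samples do not leak into the data used to compute accuracy, precision, recall, F1 score and AUC.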