Classifying Recurrence Rate For Patients With DLBCL Using Imbalanced Data And Machine Learning Methods

Posted on: 2021-01-09  Degree: Master  Type: Thesis
Country: China  Candidate: L Wang  Full Text: PDF
GTID: 2404330623475904  Subject: Public health
Abstract/Summary:
Objective: Diffuse large B-cell lymphoma (DLBCL) is the most common hematopoietic malignancy. With standard R-CHOP chemotherapy, some patients with advanced-stage disease achieve complete remission (CR). Nevertheless, approximately 30% to 40% of patients develop refractory disease or relapse because of drug resistance. Treatment options for relapsed/refractory DLBCL are limited, and survival is poor. Predicting each patient's recurrence risk could help clinicians choose chemotherapy regimens that extend long-term remission. Because DLBCL recurrence is affected by many factors, and because the data are imbalanced in a way that substantially degrades model accuracy, current strategies cannot meet this need. We therefore build high-accuracy predictive models of DLBCL recurrence to inform clinical strategy.

Methods: In the strategy selection stage, we first define 48 model-building strategies composed of 9 methods for handling imbalanced data, 2 machine learning methods, logistic regression, and 4 ensemble methods. Second, we build classifiers and probability models on 12 public datasets with each of these strategies. We apply Platt scaling to each probability model for calibration and tabulate the performance of all strategies, by model assessment metric, as a reference table. Third, from this table we select the 5 best strategies as candidates for data with an imbalance ratio between 3 and 5. In the DLBCL recurrence modeling stage, we build classifiers and probability models for recurrence on 9 DLBCL recurrence datasets with the 5 candidate strategies. The models with the highest assessment values are chosen as the final recurrence models.

Results: (1) The 5 candidates are random forest with RACOG-sampled data, and Stacking ensembles with unbalanced data, SMOTE-sampled data, RACOG-sampled data, and a cost-sensitive matrix, respectively. (2) Disease stage, HBV, Ki-67, GCB, and URI are the variables common to all recurrence models. (3) The Stacking ensemble model built with imbalanced data performs best for the 2-year, 3-year, and 5-year recurrence models:
2-year: classifier Accuracy=0.9129, Sensitivity=0.9073, F score=0.9132, AUC=0.9129, G-means=0.9129; probability model AUC=0.9710, RMSE=0.2798, MXE=0.2796, Cal mean=0.0112, BS1=0.0817, BS0=0.0756, BSall=0.0783.
3-year: classifier Accuracy=0.9132, Sensitivity=0.8684, F score=0.9086, AUC=0.9132, G-means=0.9115; probability model AUC=0.9578, RMSE=0.2651, MXE=0.2512, Cal mean=0.0227, BS1=0.0992, BS0=0.0418, BSall=0.0703.
5-year: classifier Accuracy=0.9134, Sensitivity=0.8762, F score=0.9098, AUC=0.9134, G-means=0.9125; probability model AUC=0.9597, RMSE=0.2627, MXE=0.2524, Cal mean=0.0234, BS1=0.0951, BS0=0.0413, BSall=0.0690.

Conclusion: (1) Datasets processed with VSURF perform best across all databases. (2) Disease stage, HBV, Ki-67, GCB, and URI are the variables common to all recurrence models. (3) The Stacking model built with imbalanced data performs best among all models.
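
The abstract reports G-means, RMSE, and the Brier scores BS1, BS0, and BSall without defining them. The sketch below shows one common way to compute them. The definitions are assumptions, not taken from the thesis: G-mean as the geometric mean of sensitivity and specificity, BS1 and BS0 as the mean squared error of the predicted probability over recurrence and non-recurrence cases respectively, and BSall over all cases. The reported figures are at least consistent with BSall = RMSE^2 (for example, 0.2798^2 is approximately 0.0783). The function names and the toy data are hypothetical.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # recall on the minority (recurrence) class
    specificity = tn / (tn + fp)  # recall on the majority class
    return math.sqrt(sensitivity * specificity)


def brier_scores(y_true, p_hat):
    """Class-wise and overall Brier scores plus RMSE (assumed definitions)."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, p_hat)]
    positives = [e for e, t in zip(squared_errors, y_true) if t == 1]
    negatives = [e for e, t in zip(squared_errors, y_true) if t == 0]
    bs_all = sum(squared_errors) / len(squared_errors)
    return {
        "BS1": sum(positives) / len(positives),
        "BS0": sum(negatives) / len(negatives),
        "BSall": bs_all,
        "RMSE": math.sqrt(bs_all),  # so BSall equals RMSE squared
    }


# Toy data with an imbalance ratio of 2 (four negatives, two positives).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]               # hard labels from a classifier
p_hat = [0.9, 0.4, 0.2, 0.1, 0.3, 0.6]    # calibrated probabilities

if __name__ == "__main__":
    print("G-mean:", g_mean(y_true, y_pred))
    print(brier_scores(y_true, p_hat))
```

Reporting BS1 and BS0 separately matters on imbalanced data because a low overall Brier score can hide poor probability estimates on the minority class.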
Keywords/Search Tags: Diffuse large B-cell lymphoma, imbalanced data, machine learning method, probability calibration