Prediction Of Disease Indices Based On Ensemble Learning

Posted on: 2022-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: R Li
GTID: 2544306323969859
Subject: Applied Mathematics
Abstract/Summary:
As health awareness has grown, so has the demand for curing and preventing disease. Cardiovascular and cerebrovascular disease, a common chronic condition, poses a major threat to health and life. If information closely related to the disease can be mined from existing medical data, it will greatly help prevention. This thesis predicts five disease indicators from medical data. The research proceeds in four stages.

In the data processing and feature engineering stage, mixed numeric and text data are first divided according to the discrimination ratio and then converted to standard numerical data using regular-expression matching rules and text-to-number mapping rules. Feature extraction and feature crosses are then built from the text feature information, and feature selection is completed by calculating the out-of-bag error.

In the second stage, a random forest model and a LightGBM model are built, with parameters optimized by GridSearchCV and Scikit-Optimize. LightGBM outperforms the random forest model in both training efficiency and prediction accuracy. The rationality of the model can be verified by analyzing the actual relationship between its important features and the disease indicators.

In the third stage, the loss function values of the models are compared under five test set split ratios, and the results show that the LightGBM model predicts more accurately. The XGBoost model is weaker than LightGBM but far better than the random forest model.

In the fourth stage, the Focal Loss function for binary classification is improved and used to balance samples in multi-class classification. Finally, model performance is further improved by stacking.
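The conversion step in the first stage can be illustrated with a minimal sketch. The abstract does not give the actual matching or mapping rules, so the names `NUMBER_PATTERN`, `TEXT_TO_VALUE`, `to_numeric`, and `numeric_ratio` below are hypothetical, the mapping table is invented for illustration, and reading the "discrimination ratio" as the share of entries in a column that contain a number is an assumption rather than the thesis's definition.

```python
import re
from typing import Iterable, Optional

# Hypothetical text-to-number mapping; the thesis's real rules are not stated.
TEXT_TO_VALUE = {
    "negative": 0.0,
    "normal": 0.0,
    "positive": 1.0,
}

# Matches the first integer or decimal number in a string,
# e.g. "5.6" in "5.6 mmol/L".
NUMBER_PATTERN = re.compile(r"[-+]?\d+(?:\.\d+)?")


def to_numeric(raw: Optional[str]) -> Optional[float]:
    """Convert one raw field to a float, or None if no rule applies."""
    if raw is None:
        return None
    text = raw.strip().lower()
    if not text:
        return None
    match = NUMBER_PATTERN.search(text)
    if match:
        # Regular-expression rule: keep the first number found.
        return float(match.group())
    # Mapping rule: look the whole text up in the table.
    return TEXT_TO_VALUE.get(text)


def numeric_ratio(values: Iterable[Optional[str]]) -> float:
    """Fraction of non-empty entries that contain a number (assumed reading
    of the ratio used to split numeric-like and text-like columns)."""
    non_empty = [v for v in values if v is not None and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if NUMBER_PATTERN.search(v))
    return hits / len(non_empty)
```

Under these assumptions, `to_numeric("5.6 mmol/L")` returns `5.6`, `to_numeric(" Positive ")` returns `1.0`, and an unrecognized string such as `"unknown"` yields `None` so that it can be handled as a missing value later.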
Keywords/Search Tags: Regular Expression, Random Forest, GridSearchCV, LightGBM