Application Of A New Tree-based Ensemble Learning Method In The Study Of Factors Related To Digestive Diseases In Middle-aged And Elderly People

Posted on: 2022-06-03
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhang
GTID: 2494306554488664
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: This study proposes a new tree-based ensemble learning approach that draws on the principle of boosting trees and the ideas behind Deep Forest. The approach was applied to data from the China Health and Retirement Longitudinal Study by "manually" building a multi-layer random forest model to screen variables related to digestive diseases. The population was first classified, and separate models were then built for each group to screen variables, with the aim of using a better model to explain the variables associated with digestive diseases and of providing strategies and ideas for related research.

Methods:
1. Data came from the 2015 wave of the China Health and Retirement Longitudinal Study, which covers Chinese people aged 45 and above. Respondents were asked, "Have you been diagnosed with conditions listed below by a doctor?" Answering "yes" to the item "Stomach or other digestive disease (except for tumor or cancer)" was defined as having a digestive disease.
2. Undersampling was used to obtain a balanced sample, on which 500 random forest models were built; before each model was fitted, the training set and the test set were drawn at random. A misclassification rate was calculated for each sample from how often its predicted status disagreed with its actual status. Different misclassification rates were then tried as cut-off values for dividing the samples into two parts, and the best cut-off was chosen according to the evaluation indexes. The two resulting groups were defined as the "T population" and the "F population".
3. A random forest model was built for each of the two populations, and the evaluation indexes of each model were calculated. These models were compared with the model built on the undersampled data, and the best model was selected to screen the factors influencing digestive diseases.
4. The analysis used the Random Forest Classifier in Python 3.7 and the GLM package in R 3.6.

Results:
1. The database contained 21095 samples and 4349 variables. Merging 10 individual files left 13420 people and 4349 variables. After cleaning and sorting, 12378 valid samples and 389 variables remained, comprising 3044 samples with digestive disease and 9334 without. Because the data were imbalanced, undersampling was used to draw 3267 cases from the people without digestive diseases, and all 3044 cases with digestive diseases were retained.
2. 500 random forest models were built for classification, and misclassification rates of 5%, 10%, 15%, ..., 95% were each taken as the cut-off for dividing the samples into two parts. Random forest models were then built for the two groups separately. After jointly considering accuracy, precision, specificity, sensitivity, the Youden index, and F1, the cut-off that gave the best overall performance across the two models was selected as the optimal classification point; this misclassification rate was 60%. People with a misclassification rate greater than or equal to 60% were defined as the "F population", and those with a misclassification rate below 60% as the "T population". A total of 4176 people were classified into the T population, of whom 1739 had been diagnosed with digestive diseases, a prevalence of 41.6%. The F population comprised 2135 people, of whom 1305 had been diagnosed with digestive diseases, a prevalence of 61.1%.
3. Random forest models were built separately for the T population and the F population and compared with the random forest model fitted directly before classification. Accuracy increased from 0.6432 to 0.9339 (T) and 0.9082 (F), precision from 0.6589 to 0.9665 (T) and 0.8898 (F), specificity from 0.7284 to 0.9788 (T) and 0.8134 (F), sensitivity from 0.5534 to 0.8701 (T) and 0.9692 (F), the Youden index from 0.2818 to 0.8489 (T) and 0.7826 (F), and F1 from 0.6432 to 0.9158 (T) and 0.9278 (F). Both models improved substantially.
4. After the random forest models were built for the two groups, the important variables for digestive diseases were selected according to the variable importance measures and entered into a logistic model. The results showed that the factors influencing digestive disease were largely the same in the two groups but acted in opposite directions: the greater the risk a variable conferred in the T population, the greater its protective effect in the F population. These variables included psychological factors, physical pain indicators, and other related diseases.
5. Factors that differentiated the two groups were BMI, self-reported health satisfaction, presence of kidney disease, head and neck pain, difficulty extending the arms above the shoulders, not having been told by a doctor to be hospitalized in the past year, and self-medicating in the past month.

Conclusions:
1. When facing interdisciplinary, multi-dimensional data, data cleaning and collation should be carried out first. For example, if the study subjects may include different types of people, or if the pathogenic factors of the disease are complex, the population can be subdivided and each subgroup analyzed separately.
2. The tree-based ensemble learning algorithm proposed in this study can classify the population in such data by stacking multiple layers of random forest models, which improves model performance.
3. In this study, the population was ultimately divided into two categories. Factors related to digestive diseases include psychological factors, physical pain, and other diseases; however, the same factors affect the disease differently in different groups of people.
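The undersampling step reported in the Results can be sketched as follows. This is a minimal illustration using only the counts given in the abstract (3044 cases, 9334 non-cases, 3267 non-cases retained); the function name `undersample`, the label coding (1 = digestive disease, 0 = none), and the seed are assumptions, not details from the thesis.

```python
import random


def undersample(labels, n_majority, majority_label=0, seed=0):
    """Keep every minority sample and a random subset of the majority class.

    Returns the sorted indices of the retained samples.
    Hypothetical helper: the thesis does not describe its sampling code.
    """
    rng = random.Random(seed)
    majority = [i for i, lab in enumerate(labels) if lab == majority_label]
    minority = [i for i, lab in enumerate(labels) if lab != majority_label]
    if n_majority > len(majority):
        raise ValueError("cannot draw more majority samples than exist")
    kept_majority = rng.sample(majority, n_majority)  # sampling without replacement
    return sorted(minority + kept_majority)


# Counts taken from the abstract; the label vector itself is synthetic.
labels = [1] * 3044 + [0] * 9334
kept = undersample(labels, n_majority=3267)

n_cases = sum(labels[i] for i in kept)
print(len(kept), n_cases, len(kept) - n_cases)
```

The retained sample size (3044 + 3267) equals the combined size of the T and F populations reported later (4176 + 2135), which is consistent with the population split being performed on the undersampled data.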
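One way to read the misclassification-rate step is sketched below. The abstract does not state how the rate is computed, so this sketch assumes that a sample's rate is the number of times it was misclassified divided by the number of times it fell in a test set. The classifier is passed in as a callable; `nearest_mean_classifier` is a toy stand-in used only to keep the example dependency-free, and all function names, the test fraction, and the toy data are hypothetical.

```python
import random


def misclassification_rates(X, y, fit_predict, n_rounds=500, test_frac=0.3, seed=0):
    """Per-sample share of wrong predictions over repeated random splits.

    fit_predict(X_train, y_train, X_test) must return predicted labels.
    Assumption: rate = times misclassified / times in the test set.
    """
    rng = random.Random(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    wrong, tested = [0] * n, [0] * n
    for _ in range(n_rounds):
        idx = list(range(n))
        rng.shuffle(idx)  # new random train/test split for every model
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        for i, pred in zip(test_idx, preds):
            tested[i] += 1
            wrong[i] += int(pred != y[i])
    return [w / t if t else None for w, t in zip(wrong, tested)]


def split_population(rates, threshold=0.60):
    """T population: rate < threshold; F population: rate >= threshold."""
    t_pop = [i for i, r in enumerate(rates) if r is not None and r < threshold]
    f_pop = [i for i, r in enumerate(rates) if r is not None and r >= threshold]
    return t_pop, f_pop


def nearest_mean_classifier(X_train, y_train, X_test):
    """Toy stand-in for a random forest (illustration only)."""
    means = {}
    for label in set(y_train):
        vals = [x[0] for x, lab in zip(X_train, y_train) if lab == label]
        means[label] = sum(vals) / len(vals)
    return [min(means, key=lambda lab: abs(x[0] - means[lab])) for x in X_test]


# Synthetic data: the last two samples carry labels that contradict the pattern.
X = [[0.1], [0.2], [0.3], [0.9], [0.8], [0.7], [0.15], [0.85]]
y = [0, 0, 0, 1, 1, 1, 1, 0]
rates = misclassification_rates(X, y, nearest_mean_classifier,
                                n_rounds=200, test_frac=0.25)
t_pop, f_pop = split_population(rates, threshold=0.60)
print(t_pop, f_pop)
```

To approximate the thesis setup, `fit_predict` could wrap scikit-learn's `RandomForestClassifier` (fit on the training split, predict on the test split), and the threshold could be scanned over 5%, 10%, ..., 95% before settling on a cut-off such as the 60% reported in the abstract.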
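The evaluation indexes named in the abstract follow the standard confusion-matrix definitions, which the sketch below spells out. The confusion-matrix counts in the example are invented for illustration; only the sensitivity, specificity, and precision values in the final checks are taken from the abstract, and the helper names are assumptions.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification indexes from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": youden_index(sensitivity, specificity),
        "f1": f1_score(precision, sensitivity),
    }


def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


def f1_score(precision, sensitivity):
    """F1 = harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts, not from the thesis.
example = classification_metrics(tp=8, fp=2, tn=6, fn=4)
print({name: round(value, 4) for name, value in example.items()})

# Youden index recomputed from the sensitivity and specificity in the abstract.
print(round(youden_index(0.5534, 0.7284), 4))  # model before classification
print(round(youden_index(0.8701, 0.9788), 4))  # T population
print(round(youden_index(0.9692, 0.8134), 4))  # F population

# F1 recomputed from the precision and sensitivity reported for T and F.
print(round(f1_score(0.9665, 0.8701), 4))
print(round(f1_score(0.8898, 0.9692), 4))
```

The recomputed Youden indexes and the T and F population F1 values agree with the figures reported in the Results to four decimal places.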
Keywords/Search Tags: Digestive diseases, Ensemble learning, Factors related to digestive diseases, Random forest