| Objective: To apply the Random Forest, AdaBoost, and LightGBM algorithms to screen variables related to fracture in the China Health and Nutrition Survey. A logistic regression model was then built from the important variables screened by the best-performing model to explore the factors related to fracture and their direction of effect. Methods: 1. This study was based on the 2009 China Health and Nutrition Survey, an international collaborative project between the Carolina Population Center at the University of North Carolina at Chapel Hill and the National Institute for Nutrition and Health at the Chinese Center for Disease Control and Prevention. The question “HISTORY OF BONE FRACTURE?” in the physical examination questionnaire served as the dependent variable: subjects who responded “yes” were classified as fracture patients, and those who responded “no” were classified as healthy controls. 2. Based on references in the same field, 98 variables possibly related to fracture were selected from 1580 variables for further analysis. These variables came from 9 datasets: “demography information”, “biomarker”, “individual jobs”, “nutrition”, “physical examination”, “physical activity”, “medical insurance”, “health care”, and “individual education”. After the datasets were combined, variables with more than 20% missing values were deleted, and correlation analysis was used to explore the correlations between variables; variables with a correlation coefficient greater than 0.8 were selectively retained. 3. A hybrid sampling algorithm was applied to address the class imbalance in the data. The Random Forest, AdaBoost, and LightGBM algorithms were then applied to the balanced data. Sensitivity, specificity, area under the ROC curve, and log loss were calculated to compare the fit of the 3 models, and the best model was used to screen out important factors related to bone fracture. The initially screened important variables were used to build the logistic regression model, and odds ratios (OR) were calculated to evaluate the direction of effect of the important variables. 4. Python 3.8 was used to 
establish the Random Forest, AdaBoost, and LightGBM models, and SPSS 21.0 was used to build the logistic regression model. Results: 1. After the 7 datasets were combined, 5998 subjects remained, of whom 5822 had no history of bone fracture and 176 had a history of bone fracture. After hybrid sampling, 6000 subjects remained, of whom 3042 had no history of fracture and 2958 had a history of fracture. After data cleaning, 79 variables remained and were included in further analysis. 2. The Random Forest, AdaBoost, and LightGBM models were built on the balanced data. The best parameters for Random Forest were n_estimators=3000 and max_depth=5. The best parameters for AdaBoost were n_estimators=1000, learning_rate=0.10, and algorithm=“SAMME.R”. The best parameters for LightGBM were num_boost_round=100, num_leaves=29, and boosting_type=’gbdt’. 3. Random Forest had a sensitivity of 96.3%, a specificity of 91.5%, an area under the ROC curve of 0.939, and a log loss of 0.540. AdaBoost had a sensitivity of 90.6%, a specificity of 77.6%, an area under the ROC curve of 0.916, and a log loss of 0.686. LightGBM had a sensitivity of 99.7%, a specificity of 91.3%, an area under the ROC curve of 0.955, and a log loss of 0.438. LightGBM had the best fit of the three machine learning models. 4. A logistic regression model was built from the important variables screened out by LightGBM to further screen the variables and explain their direction of effect; age and gender, as known factors related to fracture, were also entered into the model for adjustment. The logistic regression identified the following factors related to fracture, with OR per 1-unit increase: blood platelets, 1.002 (95% CI: 1.001~1.003); blood glucose, 1.137 (95% CI: 1.094~1.181); serum insulin, 1.013 (95% CI: 1.008~1.019); APO_A, 1.006 (95% CI: 1.004~1.008); and age, 1.012 (95% CI: 1.008~1.016). With increases in these 
factors, fracture prevalence increased, so they were risk factors for fracture. The ORs per 1-unit increase were 0.972 (95% CI: 0.962~0.983) for serum total protein, 0.986 (95% CI: 0.981~0.992) for mean arterial pressure, 0.981 (95% CI: 0.976~0.986) for alanine transaminase (ALT), 0.991 (95% CI: 0.986~0.995) for high-density lipoprotein cholesterol, 0.999 (95% CI: 0.998~0.999) for lipoprotein(a), 0.371 (95% CI: 0.296~0.464) for serum magnesium, and 0.983 (95% CI: 0.978~0.987) for serum creatinine; with increases in these factors, fracture prevalence decreased, so they were protective factors against fracture. The OR for males compared with females was 0.826 (95% CI: 0.711~0.959). Conclusions: 1. In this study, LightGBM fit the data better than Random Forest and AdaBoost. The fit of these machine learning models on other datasets needs further validation. 2. Risk factors for fracture included increases in blood platelets, blood glucose, serum insulin, APO_A, and age. Protective factors included increases in serum total protein, mean arterial pressure, ALT, high-density lipoprotein cholesterol, lipoprotein(a), serum magnesium, and serum creatinine. Females were more susceptible to fracture than males. |
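
The four evaluation indices named in the Methods (sensitivity, specificity, area under the ROC curve, and log loss) can be illustrated with a minimal pure-Python sketch. This is not the study's code: the function names, the 0.5 classification threshold, the coding 1 = fracture and 0 = control, and the toy labels and probabilities are all assumptions made for illustration.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn); labels are assumed coded 1 = fracture, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def sensitivity(y_true, y_pred):
    """Share of true fracture cases that the model flags as fracture."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)

def specificity(y_true, y_pred):
    """Share of true controls that the model flags as control."""
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the predicted fracture probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0 or 1
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

def roc_auc(y_true, y_prob):
    """Probability that a random case scores above a random control (ties count 0.5)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data, not taken from the survey.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

print("sensitivity:", sensitivity(y_true, y_pred))
print("specificity:", specificity(y_true, y_pred))
print("AUC:", roc_auc(y_true, y_prob))
print("log loss:", log_loss(y_true, y_prob))
```

As a reference point when reading the reported log loss values, a model that predicts a probability of 0.5 for every subject has a log loss of ln 2 ≈ 0.693, whatever the class balance.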
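
The odds ratios reported above are per 1-unit increase, so it may help to see how an OR relates to a logistic regression coefficient and how it scales over several units. The sketch below assumes a Wald-type 95% confidence interval (z = 1.96) on the log-odds scale; the abstract does not state how SPSS computed the intervals, and the function names are hypothetical. Only the OR and CI values passed in come from the text.

```python
import math

Z_95 = 1.96  # assumed normal quantile for a two-sided 95% Wald interval

def odds_ratio_with_ci(beta, se, z=Z_95):
    """Convert a logistic coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def coefficient_from_or(or_value, ci_lower, ci_upper, z=Z_95):
    """Recover an approximate coefficient and standard error from a reported OR and CI."""
    beta = math.log(or_value)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return beta, se

def or_for_change(or_per_unit, units):
    """OR for a change of `units`, assuming the log-odds are linear in the variable."""
    return or_per_unit ** units

# Serum magnesium as reported: OR 0.371 (95% CI: 0.296~0.464).
beta_mg, se_mg = coefficient_from_or(0.371, 0.296, 0.464)
or_mg, lo_mg, hi_mg = odds_ratio_with_ci(beta_mg, se_mg)
print("magnesium beta and SE:", beta_mg, se_mg)
print("round-trip OR and CI:", or_mg, lo_mg, hi_mg)

# Age as reported: OR 1.012 per unit; implied OR for a 10-unit increase.
or_age_10 = or_for_change(1.012, 10)
print("age OR over 10 units:", or_age_10)
```

An OR close to 1 per unit, such as 1.012 for age, can still correspond to a noticeable effect over a wider range: under the linearity assumption, a 10-unit increase gives an OR of about 1.13. An OR below 1, such as the one for serum magnesium, corresponds to a negative coefficient.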