Objective: This study used the 2015 China Health and Retirement Longitudinal Study database to compare variable screening by a simple random forest, a random forest combined with a logistic regression model (joint model), and a simple logistic regression model, and then explored the factors associated with metabolic syndrome based on the best model, to provide a scientific basis for the prevention of metabolic syndrome.

Methods: 1. This study used data from the 2015 China Health and Retirement Longitudinal Study, which was conducted among people aged 45 years and older in China. The subjects were divided into a metabolic syndrome group and a group without metabolic syndrome according to the unified criteria proposed in the 2009 joint interim statement of the International Diabetes Federation, the National Heart, Lung, and Blood Institute, the American Heart Association, the World Heart Federation, the International Atherosclerosis Society, and the International Association for the Study of Obesity. 2. After the data files were merged, 13,420 observation units and 4,033 variables remained after excluding undiagnosed observation units and variables associated with metabolic syndrome; 12,330 observation units and 343 variables remained after deleting variables and observation units with 10% or more missing values; variables with correlation coefficients > 0.8 were then excluded, leaving 12,330 observation units and 304 variables. 3. A simple random forest, the joint model and a simple logistic regression model were constructed on the collated data. The joint model first used random forest to obtain an importance score for each variable, sorted the variables by importance score from largest to smallest, and then combined random forest with forward variable selection to screen the best combination of variables; the screened variables were then used to build a logistic regression model. The differences in the screened variables and model evaluation indexes among
the simple random forest, the joint model and the simple logistic regression model were compared.

Results: 1. The final sample comprised 12,330 subjects, 5,792 males and 6,538 females, and the prevalence of metabolic syndrome was 38.79%. The prevalence was 29.40% in males, 47.11% in females, 46.03% in urban areas and 34.41% in rural areas. 2. The random forest was built in combination with forward variable selection, and 41 variables were chosen as the best combination after considering model evaluation indexes such as accuracy and the Youden index, as well as variable categories. 3. Comparison of model evaluation indexes: the simple random forest had the best evaluation indexes, such as accuracy and the Youden index, followed by the joint model and then the simple logistic regression model; the BIC of the joint model was better than that of the simple logistic regression model. Comparison of variable screening: the simple random forest selected 41 variables, including 11 blood test indicators, 9 physical test indicators, 15 economic factor variables, 4 "noise variables" and 2 other variables; the simple logistic regression model selected 31 variables, including 9 blood test indicators, 4 physical test indicators, 6 economic factor variables, 3 "noise variables" and 9 other variables; the joint model selected 20 variables, including 9 blood test indicators, 4 physical test indicators, 4 economic factor variables and 4 other variables, and the "noise variables" were eliminated from the joint model. Compared with the simple random forest, 9 variables not selected by the joint model were not related to metabolic syndrome and 8 were related to it; compared with the simple logistic regression model, 7 variables not selected by
the joint model were not related to metabolic syndrome and 3 were related to it. 4. Results of the joint model factor analysis: the odds ratios (ORs) of the blood test indicators C-reactive protein, glycosylated hemoglobin, uric acid, leukocytes, serum cystatin C, hemoglobin, HDL cholesterol, blood urea nitrogen and creatinine were 1.020, 1.738, 1.294, 1.071, 2.589, 1.062, 0.991, 0.977 and 0.710, respectively. The ORs of pulse, upper arm length, height and weight were 1.014, 0.961, 0.986 and 1.111, respectively. Among the economic factors, the ORs of food and drink in the past week (excluding eating out and buying cigarettes and alcohol) and utilities in the past month were less than 1, and the ORs of personal income and postage and electricity in the past month were greater than 1. The ORs of age, gender and work intensity were 6.266, 1.026 and 1.145.

Conclusions: 1. When analyzing large samples with many variables, a random forest model can be used to filter the variables first and a logistic regression model can then be used to analyze the filtered variables, which reduces the number of noise variables and increases the accuracy and interpretability of the model. 2. The risk of metabolic syndrome was higher in women than in men; age, personal income and work intensity were associated with metabolic syndrome. C-reactive protein, leukocytes, uric acid, serum cystatin C, hemoglobin and glycosylated hemoglobin were positively associated with metabolic syndrome, whereas creatinine, high-density lipoprotein cholesterol and blood urea nitrogen were negatively associated with it. Upper arm length, height, weight and pulse were also associated with metabolic syndrome.
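
The screening step of the joint model (rank variables by importance, then add them in that order and keep the best-scoring combination) can be sketched roughly as follows. This is an illustrative sketch only, not the study's code. It assumes the forward selection walks down the importance ranking one variable at a time, and it replaces the random forest with placeholder inputs: the variable names, the importance scores and the `toy_evaluate` function are all hypothetical. The Youden index is computed in its standard form, sensitivity + specificity - 1.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def youden_index(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Youden index J = sensitivity + specificity - 1 (labels: 1 = case, 0 = non-case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def forward_select_by_importance(
    importance: Dict[str, float],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Add variables in descending importance order; keep the best-scoring prefix.

    `evaluate` is assumed to fit a model on the given variables and return a
    score where higher is better (for example accuracy or the Youden index).
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate(subset)
        if score > best_score:  # strict: ties keep the smaller subset
            best_subset, best_score = list(subset), score
    return best_subset, best_score


# Hypothetical importance scores; in practice these would come from a fitted
# random forest.
importance = {
    "hba1c": 0.30,
    "weight": 0.25,
    "uric_acid": 0.20,
    "noise_a": 0.05,
    "noise_b": 0.02,
}
informative = {"hba1c", "weight", "uric_acid"}


def toy_evaluate(subset: List[str]) -> float:
    """Stand-in for model fitting: informative variables help, noise hurts."""
    gain = sum(0.10 for v in subset if v in informative)
    penalty = sum(0.03 for v in subset if v not in informative)
    return 0.5 + gain - penalty


selected, score = forward_select_by_importance(importance, toy_evaluate)
print(selected, round(score, 3))
```

In a real analysis, `evaluate` would refit the random forest on each candidate subset and return a held-out accuracy or Youden index, and the selected variables would then be entered into a logistic regression model to obtain odds ratios.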