Analysis Of Related Factors Of Metabolic Syndrome In The Middle-aged And Elderly Aged 45 And Over Based On The Variable Selection Strategy Of Random Forest And Logistic Regression Model

Posted on:2022-07-13

Degree:Master

Type:Thesis

Country:China

Candidate:Y Q Yuan

GTID:2494306554988659

Subject:Epidemiology and Health Statistics

Abstract/Summary:

Objective: This study used the 2015 China Health and Retirement Longitudinal Study (CHARLS) database to compare variable screening by three approaches: a simple random forest, a random forest combined with a logistic regression model (the joint model), and a simple logistic regression model. The best of these models was then used to explore the factors associated with metabolic syndrome, to provide a scientific basis for its prevention.

Methods: 1. Data came from the 2015 China Health and Retirement Longitudinal Study, which was conducted among people aged 45 years and older in China. Subjects were divided into a metabolic syndrome group and a group without metabolic syndrome according to the unified criteria proposed in the 2009 joint interim statement of the International Diabetes Federation, the National Heart, Lung, and Blood Institute, the American Heart Association, the World Heart Federation, the International Atherosclerosis Society, and the International Association for the Study of Obesity. 2. After the data files were merged and undiagnosed observation units and variables associated with metabolic syndrome were excluded, 13,420 observation units and 4,033 variables remained. Deleting variables and observation units with 10% or more missing values left 12,330 observation units and 343 variables. Excluding variables with correlation coefficients > 0.8 left 12,330 observation units and 304 variables. 3. A simple random forest, the joint model, and a simple logistic regression model were constructed on the collated data. The joint model first used the random forest to obtain an importance score for each variable, sorted the variables by importance score from largest to smallest, and then combined the random forest with forward variable selection to screen the best combination of variables; the selected variables were then entered into a logistic regression model. The selected variables and the model evaluation indexes of the simple random forest, the joint model, and the simple logistic regression model were compared.

Results: 1. The final sample comprised 12,330 subjects (5,792 males and 6,538 females), with a prevalence of metabolic syndrome of 38.79%. The prevalence was 29.40% in males and 47.11% in females, and 46.03% in urban areas and 34.41% in rural areas. 2. The random forest was constructed in combination with forward variable selection, and 41 variables were selected as the best combination after considering model evaluation indexes such as accuracy and the Youden index, together with the variable categories. 3. Comparison of model evaluation indexes: the simple random forest had the best accuracy and Youden index, followed by the joint model and then the simple logistic regression model; the BIC of the joint model was better than that of the simple logistic regression model. Comparison of variable screening: the simple random forest selected 41 variables, including 11 blood test indicators, 9 physical test indicators, 15 economic factor variables, 4 "noise variables", and 2 other variables. The simple logistic regression model selected 31 variables, including 9 blood test indicators, 4 physical test indicators, 6 economic factor variables, 3 "noise variables", and 9 other variables. The joint model selected 20 variables, including 9 blood test indicators, 4 physical test indicators, 4 economic factor variables, and 4 other variables, and it eliminated the "noise variables". Compared with the simple random forest, 9 variables not selected by the joint model were not related to metabolic syndrome and 8 were related to it; compared with the simple logistic regression model, 7 variables not selected by the joint model were not related to metabolic syndrome and 3 were related to it. 4. Factor analysis with the joint model: among the blood tests, the ORs were 1.020, 1.738, 1.294, 1.071, 2.589, 1.062, 0.991, 0.977 and 0.710 for C-reactive protein, glycosylated hemoglobin, uric acid, leukocytes, serum cystatin C, hemoglobin, HDL cholesterol, blood urea nitrogen and creatinine, respectively. The ORs of pulse, upper arm length, height and weight were 1.014, 0.961, 0.986 and 1.111, respectively. Among the economic factors, the ORs were less than 1 for spending on food and drink in the past week (excluding eating out and buying cigarettes and alcohol) and for utilities in the past month, and greater than 1 for personal income and for postage and electricity in the past month. The ORs of age, gender and work intensity were 6.266, 1.026 and 1.145.

Conclusions: 1. When analyzing large samples with many variables, a random forest can be used to filter the variables first, and a logistic regression model can then be used to analyze the filtered variables; this reduces the number of noise variables and improves the accuracy and interpretability of the model. 2. The risk of metabolic syndrome was higher in women than in men; age, personal income and work intensity were associated with metabolic syndrome. C-reactive protein, leukocytes, uric acid, serum cystatin C, hemoglobin and glycosylated hemoglobin were positively correlated with metabolic syndrome. Creatinine, high-density lipoprotein cholesterol and blood urea nitrogen were negatively correlated with metabolic syndrome. Upper arm length, height, weight, and pulse were also associated with metabolic syndrome.
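
The screening and evaluation steps described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the thesis's implementation: the function names, the `score_subset` callback, and the choice to evaluate nested "top-k" subsets in importance order are assumptions made for illustration, since the abstract does not specify how the forward selection was run. The formulas for the Youden index, the BIC, and the conversion of a logistic regression coefficient to an odds ratio are the standard ones.

```python
import math


def youden_index(tp, fn, tn, fp):
    """Youden's J = sensitivity + specificity - 1, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor with logistic coefficient beta."""
    return math.exp(beta)


def ranked_forward_selection(importances, score_subset):
    """Pick the best 'top-k' subset of variables ranked by importance.

    importances:  dict mapping variable name -> importance score
                  (e.g. from a random forest; assumed to be given).
    score_subset: hypothetical callback that fits/evaluates a model on a
                  list of variable names and returns a score (higher is better),
                  for example accuracy or the Youden index.
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]           # add variables one at a time, most important first
        score = score_subset(subset)
        if score > best_score:        # keep the smallest subset reaching the best score
            best_subset, best_score = subset, score
    return best_subset, best_score
```

In practice `score_subset` would refit the random forest on each candidate subset and return a validation metric; the variables returned would then be entered into a logistic regression model, whose coefficients `odds_ratio` converts to ORs of the kind reported in the Results.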

Keywords/Search Tags:

random forest, joint model, variable screening, metabolic syndrome, influencing factors

Related items

1	Variable Selection Methods Based On Variable Importance Measurement From Random Forest And Its Application In Diagnosis Of Tumor Typing
2	Study On Syndrome Differentiation Model Of Rheumatoid Arthritis Based On Random Forest
3	Construction Of A Better Prognostic Model For Non-Metastatic Colorectal Adenocarcinoma Based On Random Survival Forest And Cox Proportional Hazard Regression
4	Selection Of Tb Susceptible Genes Based On Improved Random Forest Algorithm
5	Research On The Spatial Distribution Characteristics Of Arthritis In Middle-aged And Old People In China And The Risk Assessment Model Of Arthritis Based On Influencing Factors
6	Research On The Treatment Cost And Influencing Factors Of Infectious And Parasitic Diseases In Gansu Province From 2015 To 2020
7	Based Supervision Of Singular Value Decomposition And The Class Of Random Forest Decision-making Method, Tumor Characteristics, Genetic Screening Studies
8	Analysis Of The Prognostic Factors Of Patients With Bullous Pemphigoid Based On Random Forest Model
9	The Application Of Random Survival Forest In High Dimensional Genomic Data Of Cancer
10	Study On MDROS Infection Prediction Model Of ICU Patients Based On Random Forest