Research On Gestational Diabetes Risk Prediction And Online Calculation Based On Machine Learning Algorithm

Posted on: 2021-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: H W Liu
GTID: 2504306470976509
Subject: Applied Statistics
Abstract/Summary:
Background: Pregnant women with gestational diabetes mellitus (GDM) are susceptible to dystocia and neonatal metabolic abnormalities. A predictive model that assesses the risk of GDM in the early stages of pregnancy would allow lifestyle interventions to be taken in advance to reduce that risk. The purpose of this study is therefore to build a prediction model for GDM based on machine learning algorithms, with traditional logistic regression as the baseline model.

Methods: Pregnant women who attended early pregnancy check-ups and established a maternal health manual in Tianjin between July 1, 2010 and September 30, 2012 were included in the study cohort, which comprised 19,669 women at 4-12 weeks of gestation. All women filled out the "Early Pregnancy Health Questionnaire", which collected basic information and anthropometric data. At 24-28 weeks of gestation, all women were routinely given a 50-g glucose challenge test (GCT) at a community health service center. Women with a positive GCT were informed and advised to go to the Tianjin Women and Children’s Health Care Center for a standard 75-g 2-hour oral glucose tolerance test (OGTT). At the same time, all women were asked to complete the "Midterm Health Questionnaire", which covers basic information and anthropometric information. The variables used to construct the predictive model were age at pregnancy, education level, monthly family income, family history of diabetes, gravidity, parity, waist circumference, hip circumference, alanine aminotransferase (ALT), pre-pregnancy BMI, fasting blood glucose, systolic blood pressure, diastolic blood pressure, and weight change. The data set was randomly divided, stratified by outcome (GDM or not), into a training set (70%) and a test set (30%). The training set was used to train the models, and the test set was used to test their performance. The prediction models were logistic regression, lasso, random forest, XGBoost (eXtreme Gradient Boosting), and support vector machine (SVM), with logistic regression serving as the baseline. To avoid overfitting, five-fold cross-validation was applied to the training set. After the range of each hyperparameter was determined, grid search was used to obtain the prediction results under every combination of hyperparameters. The models were assessed with respect to discrimination and calibration. Because the data set is imbalanced, the precision-recall curve was used as the primary criterion for discrimination and the receiver operating characteristic (ROC) curve as the secondary criterion. Calibration was measured by the Hosmer-Lemeshow test combined with a calibration plot. Models with poor calibration were recalibrated using isotonic regression or Platt’s method. The best model was embedded in the backend of a web page as an API to build a GDM risk prediction system that provides real-time, accurate prediction of GDM risk for pregnant women.

Results: Based on the completeness of key information such as GDM status and previous diabetes, 19,331 pregnant women were included in the analysis, of whom 1,484 (7.6%) had GDM. The XGBoost model had the best predictive performance. Its AUPRC (area under the precision-recall curve) was 0.212 (95% CI, 0.201-0.223), which is 5.1% higher than the baseline logistic regression model, 4.9% higher than lasso, 3.9% higher than random forest, and 2.8% higher than the support vector machine. Its AUROC (area under the receiver operating characteristic curve) was 0.739 (95% CI, 0.712-0.766), an improvement of 5.4% over baseline logistic regression, 4.8% over lasso, 5.2% over random forest, and 1.1% over the support vector machine. After recalibration by Platt’s method, the XGBoost model showed better calibration (Hosmer-Lemeshow test, P = 0.313). According to the variable importance results from the XGBoost model, waist circumference, pre-pregnancy fasting blood glucose, pre-pregnancy BMI, and ALT were the most important predictors of GDM risk. The online version of the GDM risk prediction system was built with the shiny package in R. Users can visit the web page (https://liuhongwei.shinyapps.io/GDM_RISK_SCORE) to calculate their risk of GDM online.

Conclusion: Compared with the traditional logistic model and other common machine learning models, the XGBoost model established in this study has better predictive performance. For imbalanced samples, using the precision-recall curve as the primary evaluation index assesses the performance of a prediction model more accurately. Deploying the model through a web page applies it to real situations and increases the value of the prediction model. Moreover, the prediction system meets a public health need and has great application value.
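The outcome-stratified 70/30 split and the two discrimination measures described in the Methods can be sketched in plain Python. This is a minimal illustration, not the thesis code (the thesis states only that the web system was built with shiny in R). The function names, the seeds, and the random "model" are all assumptions, and the labels are rebuilt from the cohort counts in the abstract rather than taken from real data.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx), shuffling and cutting each outcome class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank  # precision at the rank of each positive
    return ap / n_pos

def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney) formula; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, i in enumerate(order, start=1) if y_true[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Labels rebuilt from the abstract's counts: 1,484 GDM cases among 19,331 women.
labels = [1] * 1484 + [0] * (19331 - 1484)
train_idx, test_idx = stratified_split(labels)
y_test = [labels[i] for i in test_idx]
test_prev = sum(y_test) / len(y_test)

# Hypothetical uninformative "model": random scores unrelated to the outcome.
rng = random.Random(1)
random_scores = [rng.random() for _ in y_test]
ap_random = average_precision(y_test, random_scores)
auc_random = auroc(y_test, random_scores)

print(f"test prevalence: {test_prev:.3f}")
print(f"random-score AUPRC: {ap_random:.3f}, AUROC: {auc_random:.3f}")
```

An uninformative score yields an AUPRC close to the outcome prevalence and an AUROC close to 0.5. The reported AUPRC of 0.212 is therefore best read against a chance level near the 7.6% prevalence, not against 0.5, which is one reason the precision-recall curve is informative for imbalanced outcomes.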
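The calibration step (Hosmer-Lemeshow test, then recalibration with Platt’s method) can be illustrated in the same way. The sketch below rests on several assumptions: the data are synthetic, the overconfident raw score is constructed on purpose, the sigmoid is fitted by plain maximum likelihood without the target smoothing of Platt’s original method, and the fit reuses the evaluation data for brevity, where a real analysis would use held-out data. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, n_iter=25):
    """Fit p = sigmoid(a*score + b) by Newton's method on the log-likelihood."""
    prev = sum(labels) / len(labels)
    a, b = 0.0, math.log(prev / (1 - prev))  # start at the base rate
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            w = p * (1 - p)
            g_a += (p - y) * s
            g_b += p - y
            h_aa += w * s * s
            h_ab += w * s
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: (a, b) -= H^{-1} g for the 2x2 Hessian H
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b

def hosmer_lemeshow(probs, labels, groups=10):
    """Return (statistic, p-value) over equal-size risk groups; groups must be even."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / len(chunk)))
    # Chi-square survival function with df = groups - 2 (closed form for even df).
    k = (groups - 2) // 2
    half = stat / 2
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return stat, p_value

# Synthetic cohort: true log-odds t, outcome drawn from sigmoid(t).
rng = random.Random(0)
true_logit = [rng.gauss(-2.7, 1.0) for _ in range(5000)]
y = [1 if rng.random() < sigmoid(t) else 0 for t in true_logit]
raw_score = [3 * t + 1 for t in true_logit]  # deliberately overconfident score

uncalibrated = [sigmoid(s) for s in raw_score]
a, b = fit_platt(raw_score, y)
calibrated = [sigmoid(a * s + b) for s in raw_score]

hl_before, p_before = hosmer_lemeshow(uncalibrated, y)
hl_after, p_after = hosmer_lemeshow(calibrated, y)
print(f"Platt coefficients: a={a:.3f}, b={b:.3f}")
print(f"HL before: {hl_before:.1f} (P={p_before:.3g}); after: {hl_after:.1f} (P={p_after:.3g})")
```

A large Hosmer-Lemeshow P value, such as the P = 0.313 reported for the recalibrated XGBoost model, means the test finds no evidence of a mismatch between predicted and observed risk across the risk groups.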
Keywords/Search Tags: Gestational diabetes mellitus, Machine learning, Prognostic prediction model, XGBoost, Precision-Recall curve