Diabetes is a chronic disease that seriously affects people’s health. According to World Health Organization statistics, the number of adults with diabetes worldwide has exceeded 400 million, which means that on average nearly 1 in 11 adults has diabetes, and the number has kept rising in the past few years. Gestational Diabetes Mellitus (GDM) is a common type of diabetes. It is an abnormality of glucose metabolism first detected during pregnancy, and its prevalence is increasing year by year, seriously harming the health of both mother and fetus. Therefore, the prevention and prediction of GDM are increasingly important.

Machine learning algorithms and data mining techniques are being studied and applied ever more widely and deeply in the medical and health field, and some researchers have already used them to predict the risk of certain diseases. However, when medical data are modeled for disease prediction and etiology analysis, problems such as numerous outliers, many missing values, insufficient sample size, and imbalance between positive and negative samples are often encountered. As a result, the accuracy of earlier predictive diagnostic models on such data cannot meet practical needs. Ensemble learning is a class of algorithms well suited to real-world enterprise data. By combining base models, it integrates weak learners into a strong learner that generalizes better than a single algorithm.

In this paper, 1200 real medical records containing physical examination indicators, physiological information, and genetic information were used to build a prediction model for GDM based on several ensemble learning algorithms. With the main goal of improving the predictive ability of the model, the main research work is as follows. The first part concerns data preprocessing and feature selection. First, the
missing values of each feature are filled with a null value, and the continuous features are divided into equal-width bins to calculate the information value of each feature, which represents the importance of a single feature. The features are then sorted by importance from largest to smallest, and the optimal feature subset is determined by combining this ranking with a forward search algorithm. Finally, 25 features are selected for the prediction model: TG, BMI, VAR00007, AST, SNP37, SNP20, age, SNP11, SNP46, SNP53, SNP31, SNP43, pregnant times, SNP40, systolic pressure, SNP3, hsCRP, SNP6, wbc, diastolic pressure, SNP5, SNP35, SNP52, SNP34, and Cr. Feature selection reduces the interference of useless and insignificant features, reduces the risk of overfitting, and improves the generalization performance of the model.

The second part improves the generalization ability of the model through averaging and stacking algorithms. Logistic regression, decision tree, random forest, AdaBoost, and GBDT models were built separately, and random search and grid search were used to find the best hyperparameters for each model. The models with the best hyperparameters were then combined by simple averaging and by stacking. The comparison and analysis give the following results: ensemble learning algorithms are better at handling poor-quality data with many missing or abnormal values; among the individual models, random forest and GBDT perform best in accuracy, and GBDT performs best in AUC and F1-score; simple averaging slightly improves prediction performance, while stacking is more effective at improving the generalization of the prediction model. Evaluated by AUC, F1-score, and accuracy, the stacking model outperforms all the other models.

This study can provide auxiliary support for doctors’ diagnostic decisions, reduce the probability of misdiagnosing GDM, and enrich the application of ensemble learning in the medical field.
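
The information-value ranking used in the feature selection step can be illustrated with a minimal sketch. It assumes the common definition IV = Σ (p_i − q_i) · ln(p_i / q_i), where p_i and q_i are the shares of positive and negative samples falling in bin i. The function names `equal_width_bins` and `information_value`, the parameters `n_bins` and `eps`, the additive smoothing for empty bins, and the choice to place missing values in their own bin are all assumptions made for illustration; the abstract does not specify these details.

```python
import math

def equal_width_bins(values, n_bins):
    """Map each value to one of n_bins equal-width bins.

    Missing values (None) go to a separate "missing" bin (an assumption).
    Requires at least one non-missing value.
    """
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    bins = []
    for v in values:
        if v is None:
            bins.append("missing")
        else:
            # Clamp so the maximum value lands in the last bin.
            bins.append(min(int((v - lo) / width), n_bins - 1))
    return bins

def information_value(values, labels, n_bins=5, eps=0.5):
    """Information value of one feature against binary labels (1 = positive)."""
    bins = equal_width_bins(values, n_bins)
    counts = {}  # bin -> (positive count, negative count)
    for b, y in zip(bins, labels):
        pos, neg = counts.get(b, (0, 0))
        counts[b] = (pos + (y == 1), neg + (y == 0))
    total_pos = sum(p for p, _ in counts.values())
    total_neg = sum(n for _, n in counts.values())
    k = len(counts)
    iv = 0.0
    for pos, neg in counts.values():
        # Additive smoothing (hypothetical choice) keeps the log finite
        # when a bin contains only one class.
        p = (pos + eps) / (total_pos + eps * k)
        q = (neg + eps) / (total_neg + eps * k)
        iv += (p - q) * math.log(p / q)
    return iv

def rank_features(columns, labels, n_bins=5):
    """Sort feature names by information value, largest first."""
    scores = {name: information_value(col, labels, n_bins)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A ranking produced this way could then seed a forward search that adds features one at a time in ranked order and keeps each one only if a validation metric improves; that search loop is not shown here.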