With the continuing development of smart medical care, predicting diseases from historical diagnostic data has become a new research trend. In medical big data, the growing number of data types and dimensions provides more reference points for disease prediction. Applying machine learning and data mining to medical data extracts more of the latent information that is valuable for prediction, which makes it possible to screen high-risk groups early and reduces the difficulty of diagnosis. However, two major problems remain in disease prediction: outlier detection is not closely combined with medical knowledge, and the accuracy of prediction models is low.

To address the weak link between outlier detection and medical knowledge, this thesis proposes an outlier detection method based on BiF (Box-plot-iForest). Using the median in a box plot to represent the overall level has certain limitations, so the box-plot boundaries that separate abnormal values are adjusted according to medical knowledge, which gives the judgment of abnormal values a practical basis. Isolation Forest (iForest) is then used to further detect outliers. Handling the detected abnormal values reduces their influence on the classification and prediction of gestational diabetes mellitus and thereby improves prediction accuracy.

To address the low accuracy of prediction models, this thesis presents a prediction model based on the Stacking ensemble method. Random Forest, XGBoost, LightGBM and CatBoost serve as the base learners in the first layer of the Stacking ensemble, and, to avoid overfitting, a logistic regression model is selected as the meta-learner in the second layer.

The data come from the Tianchi gestational diabetes mellitus data set and the Pima Indian diabetes data set. Prediction performance is evaluated by accuracy, precision, recall and area under the ROC curve (AUC). The experimental results show that the Stacking ensemble model outperforms Random Forest, XGBoost, LightGBM and CatBoost used individually, with higher accuracy and stable performance. It plays a positive role in screening high-risk groups early and reduces the difficulty of diagnosing gestational diabetes.
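
The two-stage idea summarized above (box-plot screening followed by iForest, then a two-layer Stacking model) can be illustrated with a minimal Python sketch using scikit-learn. It is not the thesis implementation, and several points are assumptions made for illustration: the helper names `boxplot_mask`, `bif_outlier_mask` and `build_stacking_model` are hypothetical; "adjusting the boundary according to medical knowledge" is read here as widening the statistical fences to a clinically plausible range supplied by the caller; iForest is fitted only on rows that pass the box-plot step; flagged rows are simply dropped; and `GradientBoostingClassifier` stands in for XGBoost, LightGBM and CatBoost so that the example needs no extra packages. The data are synthetic, not the Tianchi or Pima data sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, IsolationForest,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split


def boxplot_mask(x, plausible_low, plausible_high, k=1.5):
    """Flag values outside box-plot fences widened to a plausible range.

    plausible_low / plausible_high are hypothetical knowledge-based limits:
    a value is flagged only if it lies beyond both the statistical fence
    and the plausible limit on that side.
    """
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    low = min(q1 - k * iqr, plausible_low)
    high = max(q3 + k * iqr, plausible_high)
    return (x < low) | (x > high)


def bif_outlier_mask(X, limits, contamination=0.05, random_state=0):
    """Combine per-feature box-plot flags with Isolation Forest flags.

    limits maps a column index to its (plausible_low, plausible_high) pair.
    """
    box = np.zeros(len(X), dtype=bool)
    for col, (low, high) in limits.items():
        box |= boxplot_mask(X[:, col], low, high)

    # Assumption: iForest only sees rows that survived the box-plot step.
    keep = ~box
    iso = np.zeros(len(X), dtype=bool)
    forest = IsolationForest(contamination=contamination,
                             random_state=random_state)
    iso[keep] = forest.fit_predict(X[keep]) == -1  # -1 marks an outlier
    return box | iso


def build_stacking_model(random_state=0):
    """Two-layer stacking: tree ensembles below, logistic regression on top."""
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=50,
                                      random_state=random_state)),
        # Stand-in for XGBoost / LightGBM / CatBoost.
        ("gb", GradientBoostingClassifier(random_state=random_state)),
    ]
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,  # meta-learner is trained on out-of-fold base predictions
    )


if __name__ == "__main__":
    X, y = make_classification(n_samples=400, n_features=8, random_state=0)
    X[0, 0] = 50.0  # inject one implausible reading
    outliers = bif_outlier_mask(X, {0: (-10.0, 10.0)})
    X_clean, y_clean = X[~outliers], y[~outliers]

    X_tr, X_te, y_tr, y_te = train_test_split(
        X_clean, y_clean, test_size=0.25, stratify=y_clean, random_state=0)
    model = build_stacking_model().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]

    print("accuracy ", accuracy_score(y_te, pred))
    print("precision", precision_score(y_te, pred))
    print("recall   ", recall_score(y_te, pred))
    print("AUC      ", roc_auc_score(y_te, prob))
```

Training the meta-learner on cross-validated base predictions (the `cv` argument) rather than on in-sample predictions, together with the low capacity of logistic regression, is one common way to limit overfitting in the second layer. Swapping the stand-in for the actual XGBoost, LightGBM and CatBoost classifiers would only require adding them to `base_learners`, assuming those packages are installed.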