| Objective: Preterm birth is the major cause of neonatal and childhood disease burden and mortality worldwide, and preterm birth rates have been rising globally in recent years. In addition to its substantial contribution to the health burden, the effects of preterm birth may persist throughout life in some survivors. At present, the mechanisms and causes of preterm birth remain unclear and, because specialized screening is costly, there is no effective early-warning assessment of preterm birth for pregnant women with schizophrenia; most studies have been limited to the general pregnant population. Therefore, this study uses electronic medical record data of pregnant women with schizophrenia, combines demographic variables, lifestyle factors, underlying maternal conditions, prenatal care and hospital characteristics, and applies different feature selection methods and ensemble learning algorithms to construct a risk prediction model of preterm birth. Methods: We used data from the Cerner Health Facts database on pregnant women with schizophrenia whose medical records documented a pregnancy outcome of spontaneous preterm or term birth between January 2001 and December 2016. Demographic variables, lifestyle factors, underlying maternal conditions, prenatal care and hospital characteristics were collected. Because preterm and term birth outcomes were imbalanced in this population, we compared three resampling methods, SMOTE, Borderline-SMOTE and ADASYN, to deal with the class imbalance. Two feature selection algorithms, Recursive Feature Elimination-Random Forest (RFE-RF) and Boruta, were used in preprocessing to eliminate redundant features. Logistic Regression (LR), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGBoost), Random Forest (RF) and a Stacking ensemble strategy were used to build preterm birth risk prediction models with fivefold cross-validation. The Stacking ensemble strategy combines multiple algorithms in two layers: in this study, SVM, MLP, XGBoost and RF served as the primary classifiers, and LR served as the secondary classifier that combines their outputs into the final prediction. Accuracy, Precision, Recall, F1 score and Area Under the ROC Curve (AUC) were used to compare the predictive ability of the single algorithms and the Stacking ensemble. Results: A total of 18,277 pregnant women with schizophrenia were identified, of whom 2,687 (14.7%) delivered preterm and 15,590 (85.3%) delivered at term. Among the three resampling methods (SMOTE, Borderline-SMOTE and ADASYN), ADASYN gave the best model performance. Feature selection was performed on 44 variables. RFE-RF selected 12 variables: partner smoking, obesity, anaemia, teaching facility, maternal smoking, hypnotics, pre-pregnancy hypertension, partner drinking, antipsychotics, age, previous adverse pregnancy and previous cesarean section. Boruta selected 20 variables: partner smoking, obesity, anaemia, hypnotics, partner drinking, maternal smoking, parity, previous cesarean section, antipsychotics, teaching facility, area type, pre-pregnancy hypertension, age, pre-pregnancy diabetes, education, psychological counselling, substance abuse, maternal drinking, previous adverse pregnancy and thyroid autoimmunity. Among the five single classification algorithms (LR, SVM, MLP, XGBoost and RF), RF performed best on both the RFE-RF and Boruta feature subsets, and all five models performed better overall on the Boruta feature subset than on the RFE-RF feature subset. The Stacking ensemble built on either feature subset outperformed the five single classification algorithms. The Stacking ensemble model based on the RFE-RF feature subset achieved an accuracy of 86.48%, precision of 87.37%, recall of 85.69%, F1 score of 0.8650 and AUC of 0.9249; the Stacking ensemble model based on the Boruta feature subset achieved an accuracy of 95.34%, precision of 93.27%, recall of 93.77%, F1 score of 0.9352 and AUC of 0.9776. Overall, the Stacking ensemble model with SVM, MLP, XGBoost and RF as the primary classifiers and LR as the secondary classifier, combined with ADASYN resampling and Boruta feature selection, showed the best performance for predicting the risk of preterm birth. Conclusions: Of the three methods used to handle the class imbalance (SMOTE, Borderline-SMOTE and ADASYN), ADASYN performed best and was used to balance the samples, which improved the predictive performance of the preterm birth risk model. The RFE-RF and Boruta feature selection methods were used to screen variables, which can reduce the influence of redundant features on predictive performance. With ADASYN resampling and Boruta feature selection, the Stacking ensemble model with SVM, MLP, XGBoost and RF as the primary classifiers and LR as the secondary classifier performed best for predicting the risk of preterm birth in pregnant women with schizophrenia and outperformed every single model. It can support early identification of pregnant women with schizophrenia at high risk of preterm birth and provides a methodological reference for predicting obstetric outcomes such as preterm birth. |
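
The two-layer Stacking strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the Cerner Health Facts records, substitutes scikit-learn's `GradientBoostingClassifier` for XGBoost, and omits the ADASYN resampling and Boruta feature selection steps. The helper names `build_stacking_model` and `evaluate` and all hyperparameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_stacking_model(random_state=0):
    """Four primary classifiers combined by a logistic regression."""
    primary = [
        # SVC exposes decision_function, which the stacker uses as its output.
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=random_state))),
        ("mlp", make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                          random_state=random_state),
        )),
        # Assumption: gradient boosting from scikit-learn stands in for XGBoost.
        ("gb", GradientBoostingClassifier(random_state=random_state)),
        ("rf", RandomForestClassifier(n_estimators=100,
                                      random_state=random_state)),
    ]
    # cv=5: the secondary classifier is trained on out-of-fold predictions
    # of the primary classifiers from fivefold cross-validation.
    return StackingClassifier(
        estimators=primary,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )


def evaluate(model, X_test, y_test):
    """Compute the five metrics named in the abstract on held-out data."""
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "auc": roc_auc_score(y_test, proba),
    }


# Synthetic, imbalanced stand-in data (about 15% positives); not study data.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = build_stacking_model().fit(X_train, y_train)
scores = evaluate(model, X_test, y_test)
print(scores)
```

In a fuller pipeline, a resampler such as ADASYN would be applied to the training folds only, so that synthetic samples never leak into the data used for evaluation.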