In recent years, ensemble learning algorithms have attracted much attention in machine learning because of their ability to improve predictive performance. Random forest and XGBoost, as outstanding representatives of ensemble learning, perform well in many fields such as healthcare, intrusion detection, and speech recognition. However, when applied to imbalanced data sets, neither algorithm can correctly classify the positive class, whose sample size is small, which leads to low classification accuracy and large generalization error. In practical applications, identifying positive samples is often the focus of data analysis, and the consequences of misclassifying them are far more serious than those of misclassifying negative samples. Considering that classification results on imbalanced data sets are easily dominated by the large negative class, this paper combines ensemble learning algorithms with data-level resampling methods for imbalanced data sets to construct models with higher classification performance. Specifically, random forest and XGBoost are combined with SMOTE oversampling, random undersampling (RUS), and SMOTETomek hybrid sampling to construct six models: RUS-RF, SMOTE-RF, SMOTETomek-RF, RUS-XGBoost, SMOTE-XGBoost, and SMOTETomek-XGBoost. In the empirical analysis, the Adult data set from the UCI repository is chosen, and the results are compared with those on the Bank Marketing data set and the Credit Card data set, which have imbalance ratios different from that of the Adult data set. AUC and G-mean are selected as performance metrics, and RF and XGBoost are used as benchmark models; the classification performance of each model is observed after parameter tuning. The comparative experiments show that: (1) overall, the models based on XGBoost classify better than the models based on random forest; (2) in terms of model selection, when the sample size is sufficient, the RUS-XGBoost model achieves the highest AUC and G-mean values and is therefore more suitable as a classification model for imbalanced data sets than the other models; (3) in terms of data resampling methods, the models that use random undersampling achieve better classification results than those that use SMOTE oversampling or SMOTETomek hybrid sampling.
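One of the pipelines described above (random undersampling combined with a random forest, evaluated by AUC and G-mean) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a synthetic imbalanced data set generated with scikit-learn in place of the UCI Adult data, and the sample counts, class weights, and forest size are illustrative choices only. Random undersampling is implemented directly by dropping majority-class rows, and G-mean is computed from the confusion matrix as the geometric mean of sensitivity and specificity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic imbalanced data standing in for the Adult data set
# (roughly 9:1 negative-to-positive ratio; illustrative assumption)
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Random undersampling (RUS): drop majority-class rows until the
# training set is balanced with the minority class
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0]
neg_keep = rng.choice(neg, size=len(pos), replace=False)
keep = np.concatenate([pos, neg_keep])
rng.shuffle(keep)

# RUS-RF model: random forest trained on the undersampled data
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_tr[keep], y_tr[keep])

# Evaluate on the untouched (still imbalanced) test split
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
auc = roc_auc_score(y_te, proba)
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
# G-mean: geometric mean of sensitivity (TPR) and specificity (TNR)
gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(f"AUC={auc:.3f}  G-mean={gmean:.3f}")
```

The same structure applies to the other five models by swapping the resampling step (e.g. SMOTE or SMOTETomek from the imbalanced-learn package) or the classifier (an XGBoost model in place of the random forest).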