| Classification problems arise frequently in everyday decision-making. Traditional classification algorithms assume that the dataset is balanced or that every class carries the same misclassification cost, but real-world datasets are generally unbalanced, especially in fields such as medical diagnosis and commodity recommendation. Studying classification algorithms for unbalanced datasets is therefore valuable for solving practical problems. This paper first reviews the literature and describes in detail the existing solutions to the unbalanced classification problem at the dataset level and at the classification algorithm level. It then proposes the Hybrid XGBoost model, which combines a re-sampling algorithm with the XGBoost algorithm, to handle binary classification on unbalanced datasets, and applies the model to predicting users' commodity preferences. To build the preference prediction model, 31 feature variables are selected from four aspects, and six models are established to predict whether a user will purchase the recommended product B: a logistic regression model, a hybrid logistic regression model, an AdaBoost model, a random forest model, an XGBoost model, and a Hybrid XGBoost model. A comparative analysis using Recall, F1, AUC, and other indicators shows that the EasyEnsemble-XGB model has the best prediction performance. An analysis of the feature importance of the EasyEnsemble-XGB model yields five important features, which help portray the target user more accurately. For practical applications of unbalanced classification models, this paper proposes outputting classification labels with a threshold adjusted to the actual business goal instead of the default 0.5. |
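
The threshold adjustment mentioned in the abstract can be illustrated with a minimal sketch. It assumes the business goal is expressed as a minimum recall on the positive (purchase) class; the helper names `recall_at` and `pick_threshold`, the target value, and the scores and labels are all hypothetical and are not taken from the paper.

```python
def recall_at(scores, labels, threshold):
    """Recall of the positive class when predicting 1 for score >= threshold."""
    true_positives = sum(
        1 for s, y in zip(scores, labels) if y == 1 and s >= threshold
    )
    positives = sum(labels)
    return true_positives / positives if positives else 0.0


def pick_threshold(scores, labels, target_recall):
    """Return the largest observed score whose recall meets the target."""
    for candidate in sorted(set(scores), reverse=True):
        if recall_at(scores, labels, candidate) >= target_recall:
            return candidate
    # Fallback: the lowest score labels every sample as positive.
    return min(scores)


# Hypothetical predicted probabilities and true labels (1 = purchased).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.80]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

# Assumed business goal: catch at least 90% of actual buyers.
threshold = pick_threshold(scores, labels, target_recall=0.9)
predicted = [int(s >= threshold) for s in scores]
```

On this toy data the default 0.5 cutoff recovers only one of the three buyers, while the selected threshold recovers all three at the cost of more false positives. In practice the threshold would be chosen on a validation set rather than on the data used for the final evaluation.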