In recent years, ensemble learning algorithms have attracted much attention in machine learning because of their ability to improve predictive performance. Random forest and XGBoost, as outstanding representatives of ensemble learning, perform well in many fields such as healthcare, intrusion detection, and speech recognition. However, when applied to imbalanced data sets, neither algorithm can correctly classify the positive class, whose sample size is small, which leads to low classification accuracy and large generalization error. In practical applications, identifying positive samples is often the focus of data analysis, and the consequences of misclassifying them are far more serious than those of misclassifying negative samples. Considering that classification results on imbalanced data sets are easily dominated by the large negative class, this paper combines ensemble learning algorithms with data-level resampling methods for imbalanced data sets to construct models with higher classification performance. Specifically, random forest and XGBoost are combined with SMOTE oversampling, random undersampling (RUS), and SMOTETomek hybrid sampling to construct six models: RUS-RF, SMOTE-RF, SMOTETomek-RF, RUS-XGBoost, SMOTE-XGBoost, and SMOTETomek-XGBoost. In the empirical analysis, the Adult data set from the UCI repository is chosen, and the results are compared with those on the Bank Marketing data set and the Credit Card data set, which have imbalance ratios different from that of the Adult data set. AUC and G-mean are selected as performance metrics, and RF and XGBoost are used as benchmark models; the classification performance of each model is observed after parameter tuning. The comparative experiments show that: (1) overall, the models based on XGBoost classify better than the models based on random forest; (2) in terms of model selection, when the sample size is sufficient, the RUS-XGBoost model achieves the highest AUC and G-mean values and is therefore more suitable as a classification model for imbalanced data sets than the other models; (3) in terms of data resampling methods, the models that use random undersampling achieve better classification results than those that use SMOTE oversampling or SMOTETomek hybrid sampling.
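One of the pipelines described above (random undersampling combined with a random forest, evaluated by AUC and G-mean) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a synthetic imbalanced data set generated with scikit-learn in place of the UCI Adult data, and the sample counts, class weights, and forest size are illustrative choices only. Random undersampling is implemented directly by dropping majority-class rows, and G-mean is computed from the confusion matrix as the geometric mean of sensitivity and specificity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic imbalanced data standing in for the Adult data set
# (roughly 9:1 negative-to-positive ratio; illustrative assumption)
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Random undersampling (RUS): drop majority-class rows until the
# training set is balanced with the minority class
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0]
neg_keep = rng.choice(neg, size=len(pos), replace=False)
keep = np.concatenate([pos, neg_keep])
rng.shuffle(keep)

# RUS-RF model: random forest trained on the undersampled data
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_tr[keep], y_tr[keep])

# Evaluate on the untouched (still imbalanced) test split
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
auc = roc_auc_score(y_te, proba)
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
# G-mean: geometric mean of sensitivity (TPR) and specificity (TNR)
gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(f"AUC={auc:.3f}  G-mean={gmean:.3f}")
```

The same structure applies to the other five models by swapping the resampling step (e.g. SMOTE or SMOTETomek from the imbalanced-learn package) or the classifier (an XGBoost model in place of the random forest).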