Research On Classification And Application Of Unbalanced Data Based On Resampling And Ensemble Learning

Posted on: 2024-01-07  Degree: Master  Type: Thesis
Country: China  Candidate: M Zhang  Full Text: PDF
GTID: 2568307052983449  Subject: Computer application technology
Abstract/Summary:
The rapid development of computer technology has produced large volumes of complex information, and how to extract useful information from it remains an open question. Classification algorithms in machine learning play an indispensable role here. Traditional classification methods assume that the number of samples in each class and the cost of misclassifying each class do not differ greatly. In practice, however, data imbalance is common, and traditional methods are poorly suited to it, so the classification of unbalanced data is a problem of practical significance.

Both under-sampling and over-sampling have disadvantages: the former discards information contained in the majority class, and the latter is prone to overfitting. Researchers have therefore proposed alternative approaches. The Balance Cascade method removes correctly classified majority-class samples after each round of training, which mitigates the loss of potentially useful information under plain under-sampling and improves efficiency. However, each new training set is generated without any processing of the minority-class samples; that is, the minority-class data are the same in every round, which may affect the training of the base classifiers.

To address these problems, increase the diversity of the base classifiers, and reduce variance during training, this paper proposes an improved Balance Cascade algorithm based on a Bootstrap ensemble of XGBoost, namely the b Cascade algorithm. It draws the positive (minority-class) samples by Bootstrap sampling and samples the negative (majority-class) data according to the steps of the Balance Cascade algorithm, so that the two sample sets are the same size; the two sets are then combined into a balanced training set. With XGBoost as the base classifier, the procedure is iterated T times, and the trained base classifiers are combined to obtain the final ensemble.

For cases where the imbalance ratio is relatively large, this paper proposes a second algorithm, bs-b C, which builds on the b Cascade algorithm by adding Borderline-SMOTE oversampling. The minority-class samples are oversampled before the base classifiers are trained, which increases both the sample diversity of the training set and the number of samples available for each round of training.

Finally, the two proposed algorithms are applied to public unbalanced data sets related to medical diagnosis and bank marketing, with F1-score, G-mean and AUC used to evaluate performance. Comparison experiments examine the improvement to sampling and the improvement to the base classifier separately. The two algorithms are also compared with other commonly used algorithms, and the results show that they are generally better than the other algorithms on all three metrics, and so can provide assistance for medical diagnosis and bank marketing.
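The sampling loop described in the abstract (Bootstrap sampling of the minority class, an equally sized draw from a shrinking majority pool, T rounds, combined base classifiers) can be illustrated with a minimal Python sketch. It is not the thesis implementation. All names (`CentroidStump`, `cascade_bootstrap_fit`, `ensemble_predict`, `g_mean`) are hypothetical, the base learner is a toy one-dimensional stand-in rather than XGBoost, the classifiers are combined by majority vote (the abstract does not say how), and the threshold adjustment of the original Balance Cascade method is omitted. The rule that the majority pool is only shrunk while it stays at least as large as the minority class is also an assumption.

```python
import math
import random
from statistics import mean


class CentroidStump:
    """Toy stand-in base learner (NOT XGBoost): 1-D nearest-class-mean rule."""

    def fit(self, xs, ys):
        self.mu_pos = mean(x for x, y in zip(xs, ys) if y == 1)
        self.mu_neg = mean(x for x, y in zip(xs, ys) if y == 0)
        return self

    def predict(self, xs):
        return [1 if abs(x - self.mu_pos) < abs(x - self.mu_neg) else 0
                for x in xs]


def cascade_bootstrap_fit(pos, neg, rounds, make_learner, seed=0):
    """Train `rounds` base learners on balanced sets (simplified sketch)."""
    rng = random.Random(seed)
    pool = list(neg)                      # majority-class pool, shrinks over time
    learners = []
    for _ in range(rounds):
        # Bootstrap the minority class: draw |pos| samples WITH replacement.
        boot_pos = rng.choices(pos, k=len(pos))
        # Draw an equally sized majority subset WITHOUT replacement.
        sub_neg = rng.sample(pool, min(len(pos), len(pool)))
        xs = boot_pos + sub_neg
        ys = [1] * len(boot_pos) + [0] * len(sub_neg)
        model = make_learner().fit(xs, ys)
        learners.append(model)
        # Cascade step: keep only majority samples the model still gets wrong.
        hard = [x for x, p in zip(pool, model.predict(pool)) if p == 1]
        # Assumption: only shrink the pool if it can still balance the classes.
        if len(hard) >= len(pos):
            pool = hard
    return learners


def ensemble_predict(learners, xs):
    """Combine base learners by majority vote (an assumed combination rule)."""
    votes = [m.predict(xs) for m in learners]
    return [1 if 2 * sum(col) >= len(learners) else 0 for col in zip(*votes)]


def g_mean(y_true, y_pred):
    """Geometric mean of recall and specificity; needs both classes present."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = len(y_true) - n_pos
    return math.sqrt((tp / n_pos) * (tn / n_neg))
```

In a realistic setting the stand-in learner would be replaced by an XGBoost classifier on multi-dimensional features. For the bs-b C variant, the minority class would first be oversampled with Borderline-SMOTE before this loop is entered.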
Keywords/Search Tags: unbalanced data, bootstrap, Borderline-SMOTE oversampling, ensemble learning