
Empirical Research on Imbalanced Classification Based on Cost-Sensitive Learning

Posted on: 2020-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: L Dai
Full Text: PDF
GTID: 2417330578453147
Subject: Applied Statistics
Abstract/Summary:
In classification tasks in data mining, data skew (class imbalance) refers to a large difference in the number of samples belonging to different categories in the training data. Traditional classification algorithms achieve high accuracy on roughly balanced data sets, but they are not robust on imbalanced data. The reason is that models which maximize classification accuracy, or equivalently minimize the error rate, cannot reflect the special importance of the minority class. In general, once the ratio of positive to negative samples falls below 1:3, a classifier begins to favor the negative class; when every sample is predicted as negative the error rate is minimized, yet such a model has no practical value. The classification error rate is the performance metric corresponding to the zero-one loss, and it implicitly assumes a balanced sample distribution. The commonly used classification loss functions serve as approximations of the zero-one loss, so as the loss function is optimized the error rate also decreases. From the perspectives of both algorithm training and model evaluation, formally symmetric loss functions and performance metrics lack tolerance for imbalanced data.

This thesis combines loss-function correction with performance-metric selection. Using loan default data whose positive-to-negative sample ratio is close to 1:13, the XGBoost framework is employed to study the influence of different processing methods on classification performance. First, an exploratory analysis of the target variable is carried out and features potentially strongly correlated with the target are mined. Missing values are analyzed and handled along both the sample and feature dimensions, and outliers are examined. The data are then cleaned and brought into a uniform format, applying encoding methods suited to tree models, and a correlation plot is used to eliminate multicollinearity between features. Feature-engineering methods such as discretization (binning) of numerical features, vectorization of categorical features, and bad-rate derivation are applied, and features are selected by the node-splitting gain of the tree model.

ROC-AUC is used as the main performance metric. Cross-validation and parameter tuning yield a baseline model trained with the cross-entropy loss. Replacing the cross-entropy loss with a weighted cross-entropy loss and with the Focal Loss, and tuning the newly introduced hyperparameters, produces models based on loss-function correction whose classification performance on the test set exceeds the baseline. A comparison experiment is also carried out: following the idea of online hard example mining (OHEM), the loss value is used as the sampling criterion, and heuristic oversampling and undersampling are performed respectively. On the empirical data the corrected loss functions outperform the baseline, while the sampling methods perform worse than it, and the reasons for this are analyzed.
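A minimal sketch of the baseline setup described above, assuming a Python XGBoost workflow and an already-built DMatrix named dtrain; the hyperparameter values and the scale_pos_weight of 13.0 (taken from the stated 1:13 class ratio) are illustrative assumptions, not the thesis's actual configuration:

import xgboost as xgb

# 'dtrain' is assumed to be an xgb.DMatrix built from the cleaned loan data.
params = {
    "objective": "binary:logistic",   # cross-entropy loss baseline
    "eval_metric": "auc",             # ROC-AUC as the main performance metric
    "max_depth": 6,                   # illustrative values, not tuned settings
    "eta": 0.1,
}
baseline_cv = xgb.cv(params, dtrain, num_boost_round=300, nfold=5,
                     early_stopping_rounds=30, seed=0)

# Weighted cross-entropy variant: scale_pos_weight multiplies the gradient
# contribution of positive samples; 13.0 mirrors the ~1:13 class ratio.
weighted_params = dict(params, scale_pos_weight=13.0)
weighted_cv = xgb.cv(weighted_params, dtrain, num_boost_round=300, nfold=5,
                     early_stopping_rounds=30, seed=0)

Setting scale_pos_weight is one standard way of realizing a weighted (cost-sensitive) cross-entropy loss in XGBoost, since it simply rescales the minority-class gradients.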
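The Focal Loss replacement can be realized as a custom XGBoost objective. The sketch below is an assumed implementation, not the thesis code: the gradient is derived analytically from FL(p) = -a*y*(1-p)^g*log(p) - (1-a)*(1-y)*p^g*log(1-p) with p = sigmoid(margin), while the Hessian is approximated by a central finite difference for brevity.

import numpy as np

def make_focal_obj(alpha=0.25, gamma=2.0):
    """Binary focal loss as an XGBoost custom objective (assumed sketch).

    FL(p) = -alpha*y*(1-p)**gamma*log(p) - (1-alpha)*(1-y)*p**gamma*log(1-p),
    where p = sigmoid(margin). With gamma = 0 this reduces to an
    alpha-weighted cross-entropy, which is a quick sanity check.
    """
    def grad_fn(margin, y):
        p = np.clip(1.0 / (1.0 + np.exp(-margin)), 1e-7, 1.0 - 1e-7)
        # Analytic d(FL)/d(margin), split by label.
        g_pos = -alpha * ((1 - p) ** (gamma + 1)
                          - gamma * p * (1 - p) ** gamma * np.log(p))
        g_neg = (1 - alpha) * (p ** (gamma + 1)
                               - gamma * (1 - p) * p ** gamma * np.log(1 - p))
        return y * g_pos + (1 - y) * g_neg

    def focal_obj(preds, dtrain):
        y = dtrain.get_label()
        grad = grad_fn(preds, y)
        # Hessian via central finite difference on the gradient; the analytic
        # second derivative is lengthy and easy to get wrong.
        eps = 1e-4
        hess = (grad_fn(preds + eps, y) - grad_fn(preds - eps, y)) / (2 * eps)
        return grad, np.maximum(hess, 1e-7)  # keep the Hessian positive

    return focal_obj

# booster = xgb.train(params, dtrain, num_boost_round=300,
#                     obj=make_focal_obj(alpha=0.25, gamma=2.0))

The modulating factor (1-p)^gamma down-weights easy, well-classified samples, so training focuses on the hard (often minority-class) examples; alpha and gamma are the newly introduced hyperparameters the abstract says were tuned.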
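The OHEM-inspired comparison experiment could be sketched as follows (an assumed helper, not the thesis code): per-sample losses from the baseline model serve as the sampling criterion, undersampling easy negatives and oversampling hard positives; the ratio parameters are illustrative.

import numpy as np

def per_sample_ce(p, y, eps=1e-7):
    """Per-sample cross-entropy loss from baseline predicted probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def ohem_style_indices(p, y, neg_keep_ratio=0.5, pos_dup_factor=2, seed=0):
    """Resampled row indices with the loss value as the sampling criterion.

    Undersampling keeps only the hardest share of negatives; oversampling
    draws extra copies of positives with probability proportional to their
    loss. All ratios here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    loss = per_sample_ce(p, y)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

    # Undersample: sort negatives by loss and keep only the hardest ones.
    n_keep = int(len(neg) * neg_keep_ratio)
    hard_neg = neg[np.argsort(loss[neg])[::-1][:n_keep]]

    # Oversample: duplicate hard positives in proportion to their loss.
    weights = loss[pos] / loss[pos].sum()
    extra_pos = rng.choice(pos, size=(pos_dup_factor - 1) * len(pos), p=weights)

    return np.concatenate([pos, extra_pos, hard_neg])

A commonly cited risk of such loss-driven resampling, independent of this thesis, is that the hardest samples include mislabeled or borderline cases, so duplicating them can amplify noise.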
Keywords/Search Tags: data skew, imbalanced classification, cost-sensitive, loss function, classification difficulty, XGBoost