
Empirical Research on Imbalanced Classification Based on Cost-Sensitive Learning

Posted on: 2020-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: L Dai
Full Text: PDF
GTID: 2417330578453147
Subject: Applied Statistics
Abstract/Summary:
In classification tasks in data mining, data skew (class imbalance) refers to a large difference in the number of samples belonging to different categories in the training data. Traditional classification algorithms achieve high accuracy on roughly balanced data sets, but they are not robust on imbalanced data. The reason is that models which maximize classification accuracy, or equivalently minimize the error rate, cannot reflect the special importance of the minority class. In general, once the ratio of positive to negative samples falls below 1:3, a classifier begins to favor the negative class; when every sample is predicted as negative the error rate is minimized, yet such a model has no practical value. The classification error rate is the performance metric corresponding to the zero-one loss, and it implicitly assumes a balanced sample distribution. The commonly used classification loss functions serve as approximations of the zero-one loss, so as the loss function is optimized the error rate also decreases. From the perspectives of both algorithm training and model evaluation, formally symmetric loss functions and performance metrics lack tolerance for imbalanced data.

This thesis combines loss-function correction with performance-metric selection. Using loan default data whose positive-to-negative sample ratio is close to 1:13, the XGBoost framework is employed to study the influence of different processing methods on classification performance. First, an exploratory analysis of the target variable is carried out and features potentially strongly correlated with the target are mined. Missing values are analyzed and handled along both the sample and feature dimensions, and outliers are examined. The data are then cleaned and brought into a uniform format, applying encoding methods suited to tree models, and a correlation plot is used to eliminate multicollinearity between features. Feature-engineering methods such as discretization (binning) of numerical features, vectorization of categorical features, and bad-rate derivation are applied, and features are selected by the node-splitting gain of the tree model.

ROC-AUC is used as the main performance metric. Cross-validation and parameter tuning yield a baseline model trained with the cross-entropy loss. Replacing the cross-entropy loss with a weighted cross-entropy loss and with the Focal Loss, and tuning the newly introduced hyperparameters, produces models based on loss-function correction whose classification performance on the test set exceeds the baseline. A comparison experiment is also carried out: following the idea of online hard example mining (OHEM), the loss value is used as the sampling criterion, and heuristic oversampling and undersampling are performed respectively. On the empirical data the corrected loss functions outperform the baseline, while the sampling methods perform worse than it, and the reasons for this are analyzed.
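A minimal sketch of the baseline setup described above, assuming a Python XGBoost workflow and an already-built DMatrix named dtrain; the hyperparameter values and the scale_pos_weight of 13.0 (taken from the stated 1:13 class ratio) are illustrative assumptions, not the thesis's actual configuration:

import xgboost as xgb

# 'dtrain' is assumed to be an xgb.DMatrix built from the cleaned loan data.
params = {
    "objective": "binary:logistic",   # cross-entropy loss baseline
    "eval_metric": "auc",             # ROC-AUC as the main performance metric
    "max_depth": 6,                   # illustrative values, not tuned settings
    "eta": 0.1,
}
baseline_cv = xgb.cv(params, dtrain, num_boost_round=300, nfold=5,
                     early_stopping_rounds=30, seed=0)

# Weighted cross-entropy variant: scale_pos_weight multiplies the gradient
# contribution of positive samples; 13.0 mirrors the ~1:13 class ratio.
weighted_params = dict(params, scale_pos_weight=13.0)
weighted_cv = xgb.cv(weighted_params, dtrain, num_boost_round=300, nfold=5,
                     early_stopping_rounds=30, seed=0)

Setting scale_pos_weight is one standard way of realizing a weighted (cost-sensitive) cross-entropy loss in XGBoost, since it simply rescales the minority-class gradients.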
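The Focal Loss replacement can be realized as a custom XGBoost objective. The sketch below is an assumed implementation, not the thesis code: the gradient is derived analytically from FL(p) = -a*y*(1-p)^g*log(p) - (1-a)*(1-y)*p^g*log(1-p) with p = sigmoid(margin), while the Hessian is approximated by a central finite difference for brevity.

import numpy as np

def make_focal_obj(alpha=0.25, gamma=2.0):
    """Binary focal loss as an XGBoost custom objective (assumed sketch).

    FL(p) = -alpha*y*(1-p)**gamma*log(p) - (1-alpha)*(1-y)*p**gamma*log(1-p),
    where p = sigmoid(margin). With gamma = 0 this reduces to an
    alpha-weighted cross-entropy, which is a quick sanity check.
    """
    def grad_fn(margin, y):
        p = np.clip(1.0 / (1.0 + np.exp(-margin)), 1e-7, 1.0 - 1e-7)
        # Analytic d(FL)/d(margin), split by label.
        g_pos = -alpha * ((1 - p) ** (gamma + 1)
                          - gamma * p * (1 - p) ** gamma * np.log(p))
        g_neg = (1 - alpha) * (p ** (gamma + 1)
                               - gamma * (1 - p) * p ** gamma * np.log(1 - p))
        return y * g_pos + (1 - y) * g_neg

    def focal_obj(preds, dtrain):
        y = dtrain.get_label()
        grad = grad_fn(preds, y)
        # Hessian via central finite difference on the gradient; the analytic
        # second derivative is lengthy and easy to get wrong.
        eps = 1e-4
        hess = (grad_fn(preds + eps, y) - grad_fn(preds - eps, y)) / (2 * eps)
        return grad, np.maximum(hess, 1e-7)  # keep the Hessian positive

    return focal_obj

# booster = xgb.train(params, dtrain, num_boost_round=300,
#                     obj=make_focal_obj(alpha=0.25, gamma=2.0))

The modulating factor (1-p)^gamma down-weights easy, well-classified samples, so training focuses on the hard (often minority-class) examples; alpha and gamma are the newly introduced hyperparameters the abstract says were tuned.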
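The OHEM-inspired comparison experiment could be sketched as follows (an assumed helper, not the thesis code): per-sample losses from the baseline model serve as the sampling criterion, undersampling easy negatives and oversampling hard positives; the ratio parameters are illustrative.

import numpy as np

def per_sample_ce(p, y, eps=1e-7):
    """Per-sample cross-entropy loss from baseline predicted probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def ohem_style_indices(p, y, neg_keep_ratio=0.5, pos_dup_factor=2, seed=0):
    """Resampled row indices with the loss value as the sampling criterion.

    Undersampling keeps only the hardest share of negatives; oversampling
    draws extra copies of positives with probability proportional to their
    loss. All ratios here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    loss = per_sample_ce(p, y)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

    # Undersample: sort negatives by loss and keep only the hardest ones.
    n_keep = int(len(neg) * neg_keep_ratio)
    hard_neg = neg[np.argsort(loss[neg])[::-1][:n_keep]]

    # Oversample: duplicate hard positives in proportion to their loss.
    weights = loss[pos] / loss[pos].sum()
    extra_pos = rng.choice(pos, size=(pos_dup_factor - 1) * len(pos), p=weights)

    return np.concatenate([pos, extra_pos, hard_neg])

A commonly cited risk of such loss-driven resampling, independent of this thesis, is that the hardest samples include mislabeled or borderline cases, so duplicating them can amplify noise.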
Keywords/Search Tags: data skew, imbalanced classification, cost-sensitive, loss function, classification difficulty, XGBoost