
Research On Methods Of Imbalanced Data Set Classification

Posted on: 2021-03-02    Degree: Master    Type: Thesis
Country: China    Candidate: L Wang    Full Text: PDF
GTID: 2518306047988129    Subject: Statistics
Abstract/Summary:
The problem of imbalanced data set classification is widespread in real life. Because the minority class samples carry important information and are difficult to classify, the classification of imbalanced data sets has attracted increasing attention from scholars. To date, research on the classification of imbalanced data sets has focused mainly on the data level and the algorithm level. This paper improves the Synthetic Minority Oversampling Technique (SMOTE) at the data level and the cost-sensitive Support Vector Machine (DEC) at the algorithm level. The specific work can be summarized as follows.

Firstly, this paper explains why the minority class samples are difficult to classify and introduces, from multiple perspectives, methods for dealing with the imbalanced classification problem. It focuses on the principles of the SMOTE algorithm and of classical classifiers such as the Support Vector Machine (SVM), and points out their limitations when handling imbalanced data sets.

Secondly, this paper proposes a new oversampling algorithm, DB-MCSMOTE, which combines the density-based clustering algorithm DBSCAN with an improved SMOTE. It overcomes shortcomings of the SMOTE algorithm such as synthesizing highly similar samples, ignoring within-class imbalance, and extending the classification region of the minority class. DB-MCSMOTE uses DBSCAN to cluster the minority class samples and to filter out noise samples. According to the cluster density distribution function and oversampling weight defined in this paper, different numbers of minority samples are generated for clusters of different densities, reducing both the between-class and within-class imbalance of the data set. In the oversampling phase, DB-MCSMOTE applies the Midpoint Centroid Synthetic Minority Oversampling Technique (MCSMOTE) to oversample along the lines connecting location-distant minority class samples in each cluster, which improves the diversity of the synthetic samples and restrains them from intruding into the region of the majority class. Experimental results on a synthetic data set and on real UCI data sets verify the effectiveness of the DB-MCSMOTE algorithm.

Finally, this paper improves the DEC algorithm. To overcome the shortcomings that DEC is susceptible to noise samples when determining the classification hyperplane and is highly sensitive to the distribution of the minority class samples, a cost-sensitive Support Vector Machine based on oversampling and different penalty terms, DB-DPCSVM, is proposed. DB-DPCSVM redefines the penalty factor of the DEC algorithm: according to the local density and relative density ratio proposed in this paper, normal instances and noise instances are given different penalty factors, so that the model ignores the misclassification cost of noise samples and the position of the classification hyperplane is corrected. In addition, before the training phase, DB-DPCSVM uses the DB-MCSMOTE algorithm to synthesize new minority samples, which overcomes the defect that the sparse distribution of minority class samples causes large fluctuations in the shape of the DEC classification hyperplane. Experimental results on real UCI data sets verify the effectiveness of the DB-DPCSVM algorithm.
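As background to the SMOTE principle reviewed above, the following minimal Python sketch shows the classic SMOTE interpolation step: each synthetic point lies on the line between a minority sample and one of its k nearest minority-class neighbours. The function name and parameters are illustrative and are not taken from the thesis.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=None):
    """Classic SMOTE: interpolate between a minority sample and a random
    one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # column 0 is the sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # pick one of the k neighbours
        synthetic.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)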
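The DB-MCSMOTE step can be sketched in the same spirit. This is an illustrative reconstruction under stated assumptions: the abstract does not give the exact cluster density distribution function, oversampling weight, or MCSMOTE synthesis rule, so the spread-based weighting and the midpoint-to-centroid interpolation below are stand-ins for the thesis's definitions, not the author's method.

import numpy as np
from sklearn.cluster import DBSCAN

def db_mcsmote(X_min, n_new, eps=0.5, min_samples=5, seed=None):
    """Sketch of the DB-MCSMOTE idea: cluster the minority class with DBSCAN,
    drop noise points, give sparser clusters a larger share of the synthetic
    samples, and synthesise near the midpoint of a location-distant pair,
    pulled toward the cluster centroid (illustrative stand-in for MCSMOTE)."""
    rng = np.random.default_rng(seed)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_min)
    clusters = [X_min[labels == c] for c in sorted(set(labels)) if c != -1]
    if not clusters:
        return np.empty((0, X_min.shape[1]))

    # Toy oversampling weight: more spread-out clusters receive more samples.
    spreads = np.array([c.std(axis=0).mean() + 1e-12 for c in clusters])
    shares = np.round(n_new * spreads / spreads.sum()).astype(int)

    synthetic = []
    for cluster, m in zip(clusters, shares):
        centroid = cluster.mean(axis=0)
        for _ in range(m):
            i = rng.integers(len(cluster))
            dist = np.linalg.norm(cluster - cluster[i], axis=1)
            a, b = cluster[i], cluster[dist.argmax()]   # a distant pair
            midpoint = (a + b) / 2.0
            # New point lies between the pair midpoint and the centroid,
            # which keeps it inside the cluster's own region.
            synthetic.append(midpoint + rng.random() * (centroid - midpoint))
    return np.asarray(synthetic)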
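Likewise, the DB-DPCSVM idea of giving normal and noise instances different penalty factors can be approximated with per-sample weights in a standard SVM. The neighbour-based noise score below is a hypothetical substitute for the thesis's local density and relative density ratio, and class_weight="balanced" stands in for the DEC-style class costs; none of these names or constants come from the thesis.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def fit_density_weighted_svm(X, y, k=7, C=1.0):
    """Sketch of the DB-DPCSVM idea: a cost-sensitive SVM whose per-sample
    penalties are scaled down for likely noise points, so their
    misclassification cost barely moves the separating hyperplane."""
    X, y = np.asarray(X), np.asarray(y)

    # Noise score: fraction of opposite-class points among the k nearest
    # neighbours (a stand-in for the local density / relative density ratio).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # column 0 is the point itself
    opposite = (y[idx[:, 1:]] != y[:, None]).mean(axis=1)
    sample_weight = 1.0 - 0.9 * opposite      # suspected noise -> small penalty

    # DEC-style class costs: minority-class errors are penalised in
    # proportion to the imbalance ratio.
    clf = SVC(kernel="rbf", C=C, class_weight="balanced")
    clf.fit(X, y, sample_weight=sample_weight)
    return clf

In a full pipeline mirroring the order described in the abstract, the synthetic minority samples from db_mcsmote would be appended to the training set before this fit.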
Keywords/Search Tags: Imbalanced data, Classification, Oversampling, Support vector machines, Cost-sensitive