
Research On Methods Of Imbalanced Data Set Classification

Posted on: 2021-03-02    Degree: Master    Type: Thesis
Country: China    Candidate: L Wang    Full Text: PDF
GTID: 2518306047988129    Subject: Statistics
Abstract/Summary:
The problem of imbalanced data set classification is widespread in real life. Because the minority class samples carry important information and are difficult to classify, the classification of imbalanced data sets has attracted increasing attention from scholars. To date, research on the classification of imbalanced data sets has focused mainly on the data level and the algorithm level. This paper improves the Synthetic Minority Oversampling Technique (SMOTE) at the data level and the cost-sensitive Support Vector Machine (DEC) at the algorithm level. The specific work can be summarized as follows.

Firstly, this paper explains why the minority class samples are difficult to classify and introduces, from multiple perspectives, methods for dealing with the imbalanced classification problem. It focuses on the principles of the SMOTE algorithm and of classical classifiers such as the Support Vector Machine (SVM), and points out their limitations when handling imbalanced data sets.

Secondly, this paper proposes a new oversampling algorithm, DB-MCSMOTE, which combines the density-based clustering algorithm DBSCAN with an improved SMOTE. It overcomes shortcomings of the SMOTE algorithm such as synthesizing highly similar samples, ignoring within-class imbalance, and extending the classification region of the minority class. DB-MCSMOTE uses DBSCAN to cluster the minority class samples and to filter out noise samples. According to the cluster density distribution function and oversampling weight defined in this paper, different numbers of minority samples are generated for clusters of different densities, reducing both the between-class and within-class imbalance of the data set. In the oversampling phase, DB-MCSMOTE applies the Midpoint Centroid Synthetic Minority Oversampling Technique (MCSMOTE) to oversample along the lines connecting location-distant minority class samples in each cluster, which improves the diversity of the synthetic samples and restrains them from intruding into the region of the majority class. Experimental results on a synthetic data set and on real UCI data sets verify the effectiveness of the DB-MCSMOTE algorithm.

Finally, this paper improves the DEC algorithm. To overcome the shortcomings that DEC is susceptible to noise samples when determining the classification hyperplane and is highly sensitive to the distribution of the minority class samples, a cost-sensitive Support Vector Machine based on oversampling and different penalty terms, DB-DPCSVM, is proposed. DB-DPCSVM redefines the penalty factor of the DEC algorithm: according to the local density and relative density ratio proposed in this paper, normal instances and noise instances are given different penalty factors, so that the model ignores the misclassification cost of noise samples and the position of the classification hyperplane is corrected. In addition, before the training phase, DB-DPCSVM uses the DB-MCSMOTE algorithm to synthesize new minority samples, which overcomes the defect that the sparse distribution of minority class samples causes large fluctuations in the shape of the DEC classification hyperplane. Experimental results on real UCI data sets verify the effectiveness of the DB-DPCSVM algorithm.
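As background to the SMOTE principle reviewed above, the following minimal Python sketch shows the classic SMOTE interpolation step: each synthetic point lies on the line between a minority sample and one of its k nearest minority-class neighbours. The function name and parameters are illustrative and are not taken from the thesis.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=None):
    """Classic SMOTE: interpolate between a minority sample and a random
    one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # column 0 is the sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # pick one of the k neighbours
        synthetic.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)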
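The DB-MCSMOTE step can be sketched in the same spirit. This is an illustrative reconstruction under stated assumptions: the abstract does not give the exact cluster density distribution function, oversampling weight, or MCSMOTE synthesis rule, so the spread-based weighting and the midpoint-to-centroid interpolation below are stand-ins for the thesis's definitions, not the author's method.

import numpy as np
from sklearn.cluster import DBSCAN

def db_mcsmote(X_min, n_new, eps=0.5, min_samples=5, seed=None):
    """Sketch of the DB-MCSMOTE idea: cluster the minority class with DBSCAN,
    drop noise points, give sparser clusters a larger share of the synthetic
    samples, and synthesise near the midpoint of a location-distant pair,
    pulled toward the cluster centroid (illustrative stand-in for MCSMOTE)."""
    rng = np.random.default_rng(seed)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_min)
    clusters = [X_min[labels == c] for c in sorted(set(labels)) if c != -1]
    if not clusters:
        return np.empty((0, X_min.shape[1]))

    # Toy oversampling weight: more spread-out clusters receive more samples.
    spreads = np.array([c.std(axis=0).mean() + 1e-12 for c in clusters])
    shares = np.round(n_new * spreads / spreads.sum()).astype(int)

    synthetic = []
    for cluster, m in zip(clusters, shares):
        centroid = cluster.mean(axis=0)
        for _ in range(m):
            i = rng.integers(len(cluster))
            dist = np.linalg.norm(cluster - cluster[i], axis=1)
            a, b = cluster[i], cluster[dist.argmax()]   # a distant pair
            midpoint = (a + b) / 2.0
            # New point lies between the pair midpoint and the centroid,
            # which keeps it inside the cluster's own region.
            synthetic.append(midpoint + rng.random() * (centroid - midpoint))
    return np.asarray(synthetic)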
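Likewise, the DB-DPCSVM idea of giving normal and noise instances different penalty factors can be approximated with per-sample weights in a standard SVM. The neighbour-based noise score below is a hypothetical substitute for the thesis's local density and relative density ratio, and class_weight="balanced" stands in for the DEC-style class costs; none of these names or constants come from the thesis.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def fit_density_weighted_svm(X, y, k=7, C=1.0):
    """Sketch of the DB-DPCSVM idea: a cost-sensitive SVM whose per-sample
    penalties are scaled down for likely noise points, so their
    misclassification cost barely moves the separating hyperplane."""
    X, y = np.asarray(X), np.asarray(y)

    # Noise score: fraction of opposite-class points among the k nearest
    # neighbours (a stand-in for the local density / relative density ratio).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # column 0 is the point itself
    opposite = (y[idx[:, 1:]] != y[:, None]).mean(axis=1)
    sample_weight = 1.0 - 0.9 * opposite      # suspected noise -> small penalty

    # DEC-style class costs: minority-class errors are penalised in
    # proportion to the imbalance ratio.
    clf = SVC(kernel="rbf", C=C, class_weight="balanced")
    clf.fit(X, y, sample_weight=sample_weight)
    return clf

In a full pipeline mirroring the order described in the abstract, the synthetic minority samples from db_mcsmote would be appended to the training set before this fit.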
Keywords/Search Tags: Imbalanced data, Classification, Oversampling, Support vector machines, Cost-sensitive