Research On Imbalanced Data Classification Methods Based On Probabilistic Oversampling

Posted on: 2020-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: K Y Tian
Full Text: PDF
GTID: 2428330596487274
Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced data classification is an important research direction in data mining. In imbalanced classification problems, the minority class is the focus of attention. Because of the imbalance, however, a classifier has difficulty identifying minority samples, and the resulting model tends to under-fit the minority class. One way to handle the problem is oversampling: synthesizing new minority samples to balance the original data. Many oversampling methods generate new samples by simply copying original minority samples or by exploiting feature-space similarity between samples. Data generated this way ignores the probability distribution of the original data and can easily be false. Methods that synthesize new minority samples by approximating the data's probability distribution do take the original distribution into account, so the synthesized data both reflects the true regularities of the data and is highly representative. This thesis proposes two models based on probabilistic oversampling for imbalanced data classification, combining the data level and the algorithm level.

The first model is Probabilistic Oversampling based on k-means Clustering and a Majority Voting Strategy. First, the original majority data is clustered to reduce the imbalance ratio of the data set, and each cluster of majority data is merged with the original minority data to form an imbalanced data subset. Then a probabilistic oversampling method approximates the probability distribution of the minority data in each subset, and new minority samples are drawn from the approximated distribution, which yields balanced data subsets. Finally, a model is built on each balanced subset to obtain a decision matrix, and class labels are assigned by majority voting. Experiments were carried out with C4.5 and Bayes classifiers on 15 KEEL imbalanced data sets, comparing the proposed method with SMOTE, SMOTEBoost, RUSBoost and RACOG. The results show that the proposed method achieves the best average classification performance on the evaluation metrics Sensitivity, G-mean and AUC.

The second model is a filtering-based probabilistic oversampling method. A filtering method based on non-cooperative game theory identifies the most likely class label of each minority sample synthesized by probabilistic oversampling, and synthetic samples that do not belong to the minority class are filtered out. This produces higher-quality minority data for reducing data skew and balancing the original data set. Experiments were carried out with CART and SVM classifiers. The proposed filter-based methods PDFOS+F and RACOG+F were compared with the original probabilistic oversampling methods on 8 KEEL imbalanced data sets. The results show that the proposed methods obtain better classification performance on the evaluation metrics F-measure, G-mean and AUC.
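The cluster, oversample and vote pipeline of the first model can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the thesis implementation: the Gaussian kernel density sampler stands in for the probabilistic oversampler, a nearest-centroid rule stands in for the C4.5 and Bayes base classifiers, and all function names, the bandwidth value and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sq_dist(p, c) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Keep the previous center when a cluster ends up empty.
        centers = [tuple(map(mean, zip(*cl))) if cl else c
                   for cl, c in zip(clusters, centers)]
    return [cl for cl in clusters if cl]


def kde_oversample(minority, n_new, rng, bandwidth=0.3):
    """Assumed stand-in oversampler: draw n_new points from a Gaussian kernel
    density estimate of the minority data (pick a sample, add scaled noise)."""
    scales = [bandwidth * (pstdev(col) or 1.0) for col in zip(*minority)]
    return [tuple(rng.gauss(x, s) for x, s in zip(rng.choice(minority), scales))
            for _ in range(n_new)]


def fit_centroids(majority, minority):
    """Assumed stand-in base classifier: one centroid per class (0 = majority)."""
    return {0: tuple(map(mean, zip(*majority))),
            1: tuple(map(mean, zip(*minority)))}


def fit_ensemble(majority, minority, k=3, seed=0):
    """One base model per majority cluster, each trained on a balanced subset."""
    rng = random.Random(seed)
    models = []
    for cluster in kmeans(majority, k, rng):
        n_new = max(0, len(cluster) - len(minority))
        balanced_minority = minority + kde_oversample(minority, n_new, rng)
        models.append(fit_centroids(cluster, balanced_minority))
    return models


def predict(models, x):
    """Majority vote over the base models' labels for one sample."""
    votes = Counter(min(m, key=lambda label: sq_dist(x, m[label])) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: three majority blobs and one small minority blob.
offsets = (-0.5, 0.0, 0.5)
majority = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (0, 10), (10, 0)]
            for dx in offsets for dy in offsets]
minority = [(9.5, 9.5), (10.5, 9.5), (9.5, 10.5), (10.5, 10.5)]

models = fit_ensemble(majority, minority, k=3, seed=0)
print(predict(models, (10.0, 10.0)), predict(models, (0.0, 0.0)))
```

Clustering the majority class first means each base model sees a much milder imbalance than the full data set, so fewer synthetic samples are needed per subset. The filtering step of the second model is not shown here, since the abstract does not give enough detail to sketch it faithfully.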
Keywords/Search Tags: Imbalanced data classification, Probability approximation, Oversampling, Clustering, Majority voting, Filtering
Related items