
A Classification Method for Imbalanced Data Sets Combining Pruning and Grid Sampling

Posted on: 2013-03-08    Degree: Master    Type: Thesis
Country: China    Candidate: J Zhang    Full Text: PDF
GTID: 2230330371999688    Subject: Computational Mathematics
Abstract/Summary:
Classification of imbalanced data sets is a common problem, and a hot research topic, in pattern recognition, machine learning, and data mining. A data set is imbalanced when its class distribution is skewed, that is, when one class contains far more samples than the others. Traditional classifiers, in pursuit of high overall accuracy, concentrate on correctly classifying the majority-class samples; the minority class, however, usually carries a higher misclassification cost and more valuable information, and therefore also deserves attention. Research on processing imbalanced data is thus of great importance.

To date, scholars at home and abroad have made progress on this problem at two levels: data preprocessing and algorithm design. At the algorithm level, traditional algorithms are modified to improve their classification performance on imbalanced data sets. At the data preprocessing level, under-sampling methods generally remove noisy majority-class samples and samples far from the class boundary, while over-sampling methods add synthetic minority-class samples to restore balance. In short, the many existing methods differ mainly in how they reduce or add data.

Building on previous studies, this thesis considers new sampling methods for imbalanced data classification that prevent the loss of important data typical of general under-sampling. A grid sampling method with pruning is proposed: before grid sampling, the majority-class samples are partitioned into absolutely safe data, borderline data, and noise data, and the sampled data are then learned with the AdaBoost method. The method is verified on artificial data and typical UCI data sets, with the ROC curve as the evaluation criterion.
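The pruning-and-grid-sampling idea can be sketched as follows. The partition criterion used here (the class mix of each majority sample's k nearest neighbours) and all function names are illustrative assumptions, not the thesis's exact procedure:

```python
import math

def _knn_labels(point, others, k=3):
    """Labels of the k nearest neighbours of `point` among `others`,
    where `others` is a list of (point, label) pairs."""
    nearest = sorted(others, key=lambda q: math.dist(point, q[0]))
    return [label for _, label in nearest[:k]]

def prune_majority(majority, minority, k=3):
    """Split majority samples into safe / borderline / noise by how many
    of their k nearest neighbours are minority samples (assumed rule:
    none -> safe, some -> borderline, all -> noise)."""
    pool = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    safe, border, noise = [], [], []
    for p in majority:
        neighbours = _knn_labels(p, [(q, l) for q, l in pool if q != p], k)
        n_min = sum(neighbours)          # minority neighbours found
        if n_min == 0:
            safe.append(p)
        elif n_min < k:
            border.append(p)
        else:
            noise.append(p)              # surrounded by minority samples
    return safe, border, noise

def grid_sample(points, cell=1.0):
    """Under-sample 2-D points by keeping one representative per grid
    cell, so dense safe regions shrink without losing coverage."""
    cells = {}
    for x, y in points:
        key = (int(x // cell), int(y // cell))
        cells.setdefault(key, (x, y))    # first point claims the cell
    return list(cells.values())
```

After pruning, one plausible use is to discard the noise data, keep the borderline data intact (it defines the decision boundary), and grid-sample only the safe data before training AdaBoost on the rebalanced set.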
The experiments show that the AUC of the proposed method is higher than that of the comparison algorithms, indicating good classification performance. A second new method, a hybrid sampling approach based on Random-SMOTE, is also proposed and validated experimentally.
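The AUC criterion used in these experiments can be computed directly from classifier scores and, unlike plain accuracy, is insensitive to the class ratio, which is why it suits imbalanced data. A minimal sketch of the Wilcoxon/Mann-Whitney formulation (function name is illustrative):

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one
    (ties count as 0.5) -- the Wilcoxon/Mann-Whitney statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means the classifier ranks every minority sample above every majority sample; 0.5 is no better than chance.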
Keywords: Imbalanced data sets, Pruning, Grid sampling, AdaBoost, ROC curve