
A Classification Method for Imbalanced Data Sets Combining Pruning and Grid Sampling

Posted on: 2013-03-08    Degree: Master    Type: Thesis
Country: China    Candidate: J Zhang    Full Text: PDF
GTID: 2230330371999688    Subject: Computational Mathematics
Abstract/Summary:
Classification of imbalanced data sets is a common problem, and a hot research topic, in pattern recognition, machine learning, and data mining. A data set is imbalanced when its class distribution is skewed, that is, when one class contains far more samples than the others. Traditional classifiers, in pursuit of high overall accuracy, concentrate on correctly classifying the majority-class samples; the minority class, however, usually carries a higher misclassification cost and more valuable information, and therefore also deserves attention. Research on processing imbalanced data is thus of great importance.

To date, scholars at home and abroad have made progress on this problem at two levels: data preprocessing and algorithm design. At the algorithm level, traditional algorithms are modified to improve their classification performance on imbalanced data sets. At the data preprocessing level, under-sampling methods generally remove noisy majority-class samples and samples far from the class boundary, while over-sampling methods add synthetic minority-class samples to restore balance. In short, the many existing methods differ mainly in how they reduce or add data.

Building on previous studies, this thesis considers new sampling methods for imbalanced data classification that prevent the loss of important data typical of general under-sampling. A grid sampling method with pruning is proposed: before grid sampling, the majority-class samples are partitioned into absolutely safe data, borderline data, and noise data, and the sampled data are then learned with the AdaBoost method. The method is verified on artificial data and typical UCI data sets, with the ROC curve as the evaluation criterion.
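The pruning-and-grid-sampling idea can be sketched as follows. The partition criterion used here (the class mix of each majority sample's k nearest neighbours) and all function names are illustrative assumptions, not the thesis's exact procedure:

```python
import math

def _knn_labels(point, others, k=3):
    """Labels of the k nearest neighbours of `point` among `others`,
    where `others` is a list of (point, label) pairs."""
    nearest = sorted(others, key=lambda q: math.dist(point, q[0]))
    return [label for _, label in nearest[:k]]

def prune_majority(majority, minority, k=3):
    """Split majority samples into safe / borderline / noise by how many
    of their k nearest neighbours are minority samples (assumed rule:
    none -> safe, some -> borderline, all -> noise)."""
    pool = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    safe, border, noise = [], [], []
    for p in majority:
        neighbours = _knn_labels(p, [(q, l) for q, l in pool if q != p], k)
        n_min = sum(neighbours)          # minority neighbours found
        if n_min == 0:
            safe.append(p)
        elif n_min < k:
            border.append(p)
        else:
            noise.append(p)              # surrounded by minority samples
    return safe, border, noise

def grid_sample(points, cell=1.0):
    """Under-sample 2-D points by keeping one representative per grid
    cell, so dense safe regions shrink without losing coverage."""
    cells = {}
    for x, y in points:
        key = (int(x // cell), int(y // cell))
        cells.setdefault(key, (x, y))    # first point claims the cell
    return list(cells.values())
```

After pruning, one plausible use is to discard the noise data, keep the borderline data intact (it defines the decision boundary), and grid-sample only the safe data before training AdaBoost on the rebalanced set.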
The experiments show that the AUC of the proposed method is higher than that of the comparison algorithms, indicating good classification performance. A second new method, a hybrid sampling approach based on Random-SMOTE, is also proposed and validated experimentally.
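The AUC criterion used in these experiments can be computed directly from classifier scores and, unlike plain accuracy, is insensitive to the class ratio, which is why it suits imbalanced data. A minimal sketch of the Wilcoxon/Mann-Whitney formulation (function name is illustrative):

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one
    (ties count as 0.5) -- the Wilcoxon/Mann-Whitney statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means the classifier ranks every minority sample above every majority sample; 0.5 is no better than chance.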
Keywords: Imbalanced data sets, Pruning, Grid sampling, AdaBoost, ROC curve