Imbalanced Data Classification Based On Active Learning

Posted on:2013-10-12

Degree:Master

Type:Thesis

Country:China

Candidate:Z R Li

Full Text:PDF

GTID:2268330395479886

Subject:Computer application technology

Abstract/Summary:

Large volumes of data have accumulated across many fields over a long period. Moreover, with the rapid development of computer technology in recent years, collecting and storing data has become simpler and faster, so the amount of accumulated data keeps growing. How to extract useful information from this massive data has become an urgent problem. Data mining emerged as the data analysis technique for obtaining useful information from massive data. It has effectively improved the utilization of large amounts of idle data and offers valuable guidance for future development.

Classification, which assigns a sample to one of a set of predefined target classes, is the most common and important technique in data mining, and it has by now reached a relatively mature stage. Most traditional classification methods assume balanced data, in which the classes are roughly equal in size and distribution and the misclassification costs are roughly the same. However, most data in practice are imbalanced, for example in credit card fraud detection, medical diagnosis, information retrieval, and text classification. The number of samples in one class may far exceed that of the other classes. In these cases, a classifier usually tends to assign test samples to the majority class and to ignore the minority class, which makes the trained classifier perform very poorly. Understanding the characteristics of imbalanced datasets and the limitations of traditional classification algorithms is the key to classifying imbalanced datasets accurately and reliably.
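The bias toward the majority class described above can be illustrated with a small sketch. The class sizes (990 versus 10) and the always-predict-majority classifier are assumptions chosen for illustration, not figures from the thesis. The example shows why overall accuracy alone can hide a complete failure on the minority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of samples of class `positive` that were predicted as `positive`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


# Hypothetical dataset: 990 majority samples (label 0), 10 minority samples (label 1).
y_true = [0] * 990 + [1] * 10

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")            # 0.99
print(f"minority recall = {recall(y_true, y_pred, 1):.2f}")    # 0.00
```

High accuracy here says nothing about the minority class, which is usually the class of interest in tasks such as fraud detection.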
Therefore, the classification of imbalanced datasets has become a hot research topic in machine learning and pattern recognition. Given the importance of imbalanced data classification, this thesis proposes two solutions:

1) Imbalanced dataset classification based on active learning SMOTE: The Synthetic Minority Over-sampling Technique (SMOTE) is a typical over-sampling preprocessing method that can effectively balance imbalanced data. However, it can also introduce noise and other problems that reduce classification accuracy. To address this, the thesis presents a method for classifying imbalanced data based on active learning SMOTE, called ALSMOTE. It combines SMOTE with active learning strategies based on distance and support vector machines to overcome SMOTE's limitations. Experimental results show that the proposed method effectively improves classification accuracy on imbalanced data.

2) Imbalanced dataset active learning algorithm based on Boosting: Resampling is currently one of the popular approaches to imbalanced dataset classification. Its main forms are over-sampling and under-sampling, each of which has its own shortcomings. This thesis proposes a split-boost active learning algorithm called SBAL. The algorithm splits the majority class dataset into subsets according to the imbalance ratio, combines each subset with the minority class dataset, trains the classifiers with the AdaBoost algorithm, and then boosts them into an overall classifier. SBAL uses a query-by-committee (QBC) active learning algorithm to select the effective training samples that join the final round of training, and so avoids the shortcomings of both over-sampling and under-sampling. Experiments show that the proposed algorithm achieves higher classification accuracy on imbalanced datasets.
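Two of the building blocks named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's ALSMOTE or SBAL implementation. The function names `smote_like_samples` and `split_majority` are hypothetical. The first shows the basic SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. The second assumes that "splitting according to the imbalance ratio" means creating roughly (majority size / minority size) subsets, each paired with the full minority class. Active learning sample selection, AdaBoost training, and QBC are deliberately omitted.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


def split_majority(majority, minority, seed=0):
    """Split the majority class into about len(majority)/len(minority) subsets
    and pair each with the whole minority class (assumed reading of the split)."""
    rng = random.Random(seed)
    n_subsets = max(1, round(len(majority) / len(minority)))
    shuffled = majority[:]
    rng.shuffle(shuffled)
    subsets = []
    for i in range(n_subsets):
        part = shuffled[i::n_subsets]          # every n_subsets-th sample
        X = part + minority                    # balanced training set
        y = [0] * len(part) + [1] * len(minority)
        subsets.append((X, y))                 # one base classifier per subset
    return subsets


# Hypothetical 2-D data: 90 majority points and 10 minority points.
rng = random.Random(42)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
minority = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(10)]

synthetic = smote_like_samples(minority, n_new=20)
subsets = split_majority(majority, minority)
print(len(synthetic), len(subsets), len(subsets[0][0]))
```

Because each synthetic point is a convex combination of two minority samples, it stays inside the region spanned by the minority class. That is also why plain SMOTE can place points in noisy or overlapping regions, which is the limitation the abstract attributes to it.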

Keywords/Search Tags:

Imbalanced dataset, Active learning, Classification, Ensemble learning

Related items

1	Research On Imbalanced Dataset Classification Based On Ensemble Learning
2	Research On Imbalanced Dataset Classification Algorithm Based On Ensemble Learning
3	Imbalanced Data Classification And Its Application In The Prediction Of The Mobile Phone Replacement
4	Research On Imbalanced Classification And Multimodal Classification In Broad Learning System
5	Application Research Of Used-car Recommendation Based On Classification Method On Imbalanced Data Sets
6	Research And Application Of Imbalanced Data Classification Algorithm Based On Ensemble Learning
7	Hybrid Ensemble Learning For Imbalanced Data
8	Research On Ensemble Learning Approaches To Imbalanced Data Sets
9	Fault Classification Based On Modified Active Learning And Semi-Supervised Learning
10	Research On Imbalanced Data Classification Methods Based On Ensemble Learning