Imbalanced Data Classification Based On Active Learning

Posted on: 2013-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: Z R Li
GTID: 2268330395479886
Subject: Computer application technology
Abstract/Summary:
Large volumes of data have accumulated across many fields over a long period. With the rapid development of computer technology in recent years, collecting and storing data has become simpler and faster, and the volume of accumulated data has grown accordingly. How to extract useful information from such massive data has become an urgent problem. Data mining emerged as the data analysis technique for obtaining useful information from massive data. It has effectively improved the utilization of large amounts of idle data and offers useful guidance for future development.

Classification, which determines which predefined target class a sample belongs to, is the most common and important task in data mining, and the technology has by now reached a fairly mature stage. Most traditional classification methods assume balanced data, in which the classes are roughly equal in size and distribution and the misclassification costs are roughly the same. However, most real-world data is imbalanced, for example in credit card fraud detection, medical diagnosis, information retrieval, and text classification. One class may contain far more samples than the others. In these cases the classifier tends to assign test samples to the majority class and ignore the minority class, which severely degrades the performance of the trained classifier.

Understanding the characteristics of imbalanced datasets and the limitations of traditional classification algorithms is the key to classifying imbalanced data accurately and reliably.
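The bias toward the majority class described above can be illustrated with a small, purely hypothetical example. The sketch below assumes a made-up label set with 990 majority-class and 10 minority-class samples and a trivial classifier that always predicts the majority class. The numbers and function names are illustrative assumptions and are not taken from the thesis.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a classifier that always predicts the most frequent training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _sample: most_common_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly `positive` samples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical labels: 0 = majority class, 1 = minority class (e.g. fraud).
labels = [0] * 990 + [1] * 10
classify = majority_baseline(labels)
predictions = [classify(None) for _ in labels]

overall_accuracy = accuracy(labels, predictions)
minority_recall = recall(labels, predictions, positive=1)
print(f"accuracy={overall_accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.99, minority recall=0.00
```

Overall accuracy looks high even though no minority sample is ever detected, which is why accuracy alone can be misleading on imbalanced data.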
Therefore, the classification of imbalanced datasets has become an active research topic in machine learning and pattern recognition. Given the importance of imbalanced data classification, this thesis proposes two solutions:

1) Imbalanced dataset classification based on active learning SMOTE: The Synthetic Minority Over-sampling Technique (SMOTE) is a typical over-sampling preprocessing method that can effectively balance imbalanced data. However, it can also introduce noise and other problems that reduce classification accuracy. To address this, this thesis presents an approach to classifying imbalanced data based on active learning SMOTE, called ALSMOTE. The method combines a distance-based active learning strategy with a support vector machine to overcome the limitations of SMOTE. Experimental results show that the proposed method effectively improves classification accuracy on imbalanced data.

2) Imbalanced dataset active learning algorithm based on Boosting: Resampling, which includes over-sampling and under-sampling, is currently one of the most popular ways to handle imbalanced dataset classification. However, both over-sampling and under-sampling have their own shortcomings. This thesis proposes a split-boost active learning algorithm, called SBAL. The algorithm splits the majority class dataset into subsets according to the imbalance ratio, combines each subset with the minority class dataset, trains classifiers with the AdaBoost algorithm, and then boosts them into an overall classifier. SBAL uses the Query by Committee (QBC) active learning algorithm to select effective training samples to join the final training, so it fundamentally avoids the shortcomings of over-sampling and under-sampling. Experiments show that the proposed algorithm achieves higher classification accuracy on imbalanced datasets.
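For readers unfamiliar with the over-sampling step that solution 1) builds on, here is a minimal sketch of the basic SMOTE idea: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is plain SMOTE in simplified form (base samples are drawn at random rather than processed one by one), not the ALSMOTE method of the thesis. The function name `smote`, its parameters, and the sample data are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbours.

    minority: list of feature vectors (lists of floats), at least two samples.
    n_synthetic: number of synthetic samples to generate.
    k: number of nearest minority neighbours to choose from (assumed default).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [s for i, s in enumerate(minority) if i != base_idx]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-dimensional minority samples.
minority_samples = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_samples = smote(minority_samples, n_synthetic=6, k=3)
print(len(new_samples))  # prints: 6
```

Because every synthetic point is interpolated without checking its surroundings, points can fall in regions that overlap the majority class, which is one way the noise mentioned above can arise.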
Keywords: Imbalanced dataset, Active learning, Classification, Ensemble learning