Research On Imbalanced Dataset Classification In Semi-supervised Learning

Posted on:2016-03-22

Degree:Master

Type:Thesis

Country:China

Candidate:C Yu

Full Text:PDF

GTID:2308330461978536

Subject:Software engineering

Abstract/Summary:

With the development of modern science and technology, extracting hidden information and useful rules from massive data has attracted increasing attention. Classification is widely used in real applications as an important data mining technique. However, classification performance is limited by many factors: besides the classifier itself, these include sample complexity and sample distribution. Among them, the sample distribution has an important influence on classification methods. Most traditional classifiers are built on the assumption that every class in the dataset has the same number of samples, so when the class distribution is imbalanced, the classifier is skewed toward the majority class and misclassifies minority-class samples.

Imbalanced dataset classification troubles not only supervised learning methods but also semi-supervised learning methods. However, traditional methods for imbalanced dataset classification are mostly used in supervised learning, and there is little work on imbalanced dataset classification in semi-supervised learning. A semi-supervised dataset is characterized by a small number of labeled samples and a large number of unlabeled samples, and resampling methods are ill-suited to estimating the classification boundary from so few labeled samples. Therefore, this thesis focuses on imbalanced dataset classification in semi-supervised learning.

Considering the large number of unlabeled samples in semi-supervised learning, this thesis proposes an iterative nearest neighborhood oversampling algorithm combined with sample information (SI-INNO), which converts a small number of unlabeled samples into labeled samples according to sample similarity before the classification method is applied. By combining sample information with sample selection, SI-INNO improves the sample distribution of the dataset, and it can be applied to both binary and multi-class classification.

In the experiments, the thesis analyzes the relationship between the imbalance ratio of the whole dataset and that of the labeled dataset when SI-INNO is applied to imbalanced dataset classification. A large number of experiments have been conducted on real datasets. The results show that, once SI-INNO has been used to balance the labeled dataset, it helps semi-supervised classification methods shift their bias toward the minority class. Therefore, semi-supervised classification methods combined with SI-INNO are robust on imbalanced dataset classification.
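The general idea of balancing a labeled set by pulling in nearby unlabeled samples can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's SI-INNO algorithm, whose details the abstract does not give: the function names `imbalance_ratio` and `nn_oversample`, the use of Euclidean distance as the similarity measure, the one-sample-per-round rule, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def nn_oversample(labeled, unlabeled, max_rounds=10):
    """Illustrative sketch only, not the thesis's SI-INNO.

    labeled:   list of (point, label) pairs; each point is a tuple of floats
    unlabeled: list of points
    Each round moves the unlabeled point closest to any current
    minority-class sample into the labeled set with the minority label,
    until the classes are balanced, the pool is empty, or max_rounds is hit.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        counts = Counter(label for _, label in labeled)
        minority = min(counts, key=counts.get)
        if not pool or counts[minority] == max(counts.values()):
            break
        minority_points = [p for p, label in labeled if label == minority]
        # Assumption: similarity is plain Euclidean distance.
        best = min(
            pool,
            key=lambda u: min(math.dist(u, m) for m in minority_points),
        )
        pool.remove(best)
        labeled.append((best, minority))
    return labeled, pool


# Toy data (made up): three majority samples, one minority sample.
labeled = [
    ((0.0, 0.0), "maj"),
    ((0.1, 0.0), "maj"),
    ((0.0, 0.1), "maj"),
    ((5.0, 5.0), "min"),
]
unlabeled = [(5.1, 5.0), (4.9, 5.2), (0.2, 0.1), (0.1, 0.2)]

before = imbalance_ratio([label for _, label in labeled])
balanced, remaining = nn_oversample(labeled, unlabeled)
after = imbalance_ratio([label for _, label in balanced])
print(before, after, remaining)
```

On this toy data the two unlabeled points near the minority sample are absorbed into the labeled set, and the points near the majority cluster stay unlabeled for the downstream semi-supervised classifier. A real method would need a safeguard against mislabeling, since a nearest unlabeled point is not guaranteed to belong to the minority class.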

Keywords/Search Tags:

Imbalanced Datasets Classification, Semi-supervised Learning, Nearest Neighborhood Oversampling

Related items

1	Research On Imbalanced Datasets Classification Based On Machine Learning And Oversampling Methods
2	A Research On Imbalanced Learning Based On Semi-supervised SVM
3	Research On Imbalanced Data Classification Method Based On Generation Model And Its Application
4	Classification On Imbalanced Datasets
5	Research Of Imbalanced Datasets Preprocessing Combined With Clustering
6	Research On Sentiment Classification Based-upon Imbalanced Data
7	Research On Oversampling Method For Multi-class Imbalanced Learning
8	Research And Implementation Of Semi-supervised Machine Learning Algorithms For Classifying The Imbalanced Protocol Flows
9	Selection And Classification Of Unbalanced Data Based On Semi-supervised And Integrated Learning
10	Feature Selection And Semi-supervised Classification For Imbalanced Data