Research On Imbalanced Dataset Classification In Semi-supervised Learning

Posted on: 2016-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Yu
Full Text: PDF
GTID: 2308330461978536
Subject: Software engineering
Abstract/Summary:
With the development of modern science and technology, extracting hidden information and useful rules from massive data has attracted increasing attention. Classification, an important data mining technique, is widely used in real applications. However, classification performance is limited by many factors: not only the classifier itself but also sample complexity, sample distribution, and so on. Among them, sample distribution has an important influence on classification methods. Most traditional classifiers are built on the assumption that every class in the dataset has the same number of samples, so when the class distribution is imbalanced, the classifier is skewed toward the majority class and misclassifies minority-class samples.
Imbalanced dataset classification troubles not only supervised learning methods but also semi-supervised learning methods. However, traditional methods for imbalanced dataset classification are mostly used in supervised learning, and there is little work on the problem in semi-supervised learning. A semi-supervised dataset is characterized by a small number of labeled samples and a large number of unlabeled samples. Moreover, resampling methods are poorly suited to this setting, because too few labeled samples are available to judge the classification boundary. Therefore, this thesis focuses on imbalanced dataset classification in semi-supervised learning.
Considering the large number of unlabeled samples in semi-supervised learning, this thesis proposes an iterative nearest neighborhood oversampling algorithm combined with sample information (SI-INNO), which converts a small number of unlabeled samples into labeled samples according to sample similarity before classification is performed.
By combining sample information with sample selection, the SI-INNO algorithm improves the sample distribution of the dataset in a reasonable way, and it can be applied not only to binary classification but also to multi-class classification.
In the experiments, the thesis analyzes the relationship between the imbalance ratio of the total dataset and that of the labeled dataset when SI-INNO is applied to imbalanced dataset classification. A large number of experiments have been conducted on real datasets. The results show that, once SI-INNO has been used to balance the labeled dataset, the proposed algorithm helps semi-supervised classification methods shift their bias toward the minority class. Therefore, semi-supervised classification methods combined with SI-INNO are robust on imbalanced dataset classification.
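The abstract does not give the details of SI-INNO, so the following is only a minimal sketch of the general idea it describes: pseudo-labeling unlabeled samples that lie close to the minority class until the labeled set is balanced. Everything specific here is an assumption made for illustration and not the thesis's method: the function names `imbalance_ratio` and `nn_oversample`, the parameters `per_round` and `max_rounds`, the use of Euclidean distance as the similarity measure, the rule that a sample qualifies when its nearest labeled neighbor is a minority sample, and the definition of the imbalance ratio as majority count divided by minority count.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def nn_oversample(labeled, unlabeled, minority, per_round=1, max_rounds=10):
    """Illustrative sketch only, not the SI-INNO algorithm from the thesis.

    labeled   -- list of (point, label) pairs; each point is a tuple of floats
    unlabeled -- list of points
    Each round, an unlabeled point qualifies if its nearest labeled sample
    (Euclidean distance) belongs to the minority class. The closest qualifying
    points are moved into the labeled set with the minority label, so they
    can act as neighbors in later rounds.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _round in range(max_rounds):
        counts = Counter(y for _, y in labeled)
        if counts[minority] >= max(counts.values()):
            break  # the minority class has caught up with the largest class
        candidates = []
        for u in pool:
            # nearest labeled sample and its label
            dist, label = min((math.dist(u, x), y) for x, y in labeled)
            if label == minority:
                candidates.append((dist, u))
        if not candidates:
            break  # no unlabeled point is closest to the minority class
        candidates.sort()
        for dist, u in candidates[:per_round]:
            pool.remove(u)
            labeled.append((u, minority))
    return labeled, pool


# Toy data: three majority samples (class 0) and one minority sample (class 1).
labeled = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.1, 0.3), 0), ((5.0, 5.0), 1)]
unlabeled = [(4.8, 5.1), (5.3, 4.9), (0.1, 0.1), (2.0, 2.0)]

balanced, rest = nn_oversample(labeled, unlabeled, minority=1)
print(imbalance_ratio([y for _, y in labeled]))   # ratio of the original labeled set
print(imbalance_ratio([y for _, y in balanced]))  # ratio after oversampling
```

On this toy data the two unlabeled points near (5.0, 5.0) are pseudo-labeled as the minority class, while the points nearer the majority cluster stay in the unlabeled pool. A real implementation would need a more careful selection rule, since every wrong pseudo-label is fed back into later rounds.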
Keywords/Search Tags: Imbalanced Datasets Classification, Semi-supervised Learning, Nearest Neighborhood Oversampling