Research On Imbalanced Data Classification Method Based On Generation Model And Its Application

Posted on: 2020-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: C Shen
GTID: 2428330596985197
Subject: Software engineering
Abstract/Summary:
In machine learning and data mining, traditional classification algorithms are tailored to balanced data. However, the data to be processed in many practical applications are imbalanced, for example the data used for spam filtering, credit card fraud detection, and software defect prediction. When traditional classification algorithms are applied to imbalanced data, their generalization performance decreases significantly. How to solve the problem of imbalanced data classification is therefore a challenging research direction with important theoretical and practical value. The problem can be divided into two categories: binary imbalanced data classification and multi-class imbalanced data classification. This paper investigates the problem of imbalanced data classification. Based on generative models, two methods addressing the problem are proposed. Specifically, the main work of this paper covers the following four aspects:

1. An approach based on the extreme learning machine autoencoder (ELMAE) is proposed for binary imbalanced data classification. The method has three steps: (1) the minority instances (also called positive instances; the majority instances are called negative) are used as seeds, and ELMAE generates new samples to increase the number of positive instances, where the generated samples are similar to the positive instances but not identical to them; (2) step (1) is repeated several times to obtain a balanced data set; (3) a classifier is trained on the balanced data set and used to classify unseen samples.

2. For highly imbalanced data, oversampling the positive instances many times with ELMAE results in dense overlapping of the sampled examples, in which case the performance of the classifier cannot be effectively improved. To deal with this problem, a bagging-based method for binary imbalanced data classification is proposed. The method also has three steps: (1) oversample the positive instances several times with ELMAE; (2) randomly sample negative instances, in the same number as the oversampled positive instances, by the bagging method, and construct multiple balanced training sets; (3) train multiple classifiers on the balanced training sets and classify unseen data by majority voting.

3. An approach based on the generative adversarial network (GAN) is proposed for binary imbalanced data classification. The method has three steps: (1) train a GAN on the positive instances; (2) generate positive instances with the trained GAN to modify the distribution of the imbalanced data, and construct a balanced training set; (3) train a classifier on the balanced data set and classify unseen instances.

4. The proposed methods are applied to software defect prediction and five liver function detection. The proposed approaches are experimentally investigated on seven software defect prediction data sets and compared with some related methods. The experimental results show that the proposed methods are feasible and effective.
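
Steps (2) and (3) of the bagging-based method in aspect 2 can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes the ELMAE oversampling of step (1) has already produced the list `pos_oversampled`, it draws negatives with replacement (the usual bootstrap reading of "bagging"), and it uses a nearest-centroid rule as a stand-in base classifier. All function names, the toy data, and the choice of five ensemble members are hypothetical.

```python
import random
from collections import Counter


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train_nearest_centroid(pos, neg):
    """Stand-in base classifier (an assumption): one centroid per class."""
    return centroid(pos), centroid(neg)


def predict_one(model, x):
    """Return 1 (positive) if x is closer to the positive centroid, else 0."""
    c_pos, c_neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
    return 1 if d_pos < d_neg else 0


def balanced_bagging(pos_oversampled, neg, n_models=5, seed=0):
    """Build n_models balanced training sets and train one model on each.

    Each set pairs all oversampled positives with an equally sized random
    draw of negatives (with replacement, which is an assumption here).
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        neg_subset = rng.choices(neg, k=len(pos_oversampled))
        models.append(train_nearest_centroid(pos_oversampled, neg_subset))
    return models


def majority_vote(models, x):
    """Classify x with the label predicted by most ensemble members."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): positives near (5, 5), negatives near the origin.
pos_oversampled = [[5.0, 5.0], [5.5, 4.5], [4.5, 5.5], [5.2, 5.1]]
neg = [[0.1 * i, -0.1 * i] for i in range(20)]

models = balanced_bagging(pos_oversampled, neg, n_models=5)
print(majority_vote(models, [5.0, 5.0]))
print(majority_vote(models, [0.5, -0.5]))
```

An odd number of ensemble members avoids ties in the binary vote. Any base classifier could replace the nearest-centroid rule, since the abstract does not name one.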
Keywords/Search Tags: Imbalanced Datasets, Oversampling, Autoencoder, Generative Adversarial Networks, Extreme Learning Machine