
Research And Application Of Imbalanced Data Classification

Posted on: 2020-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Wang
Full Text: PDF
GTID: 2428330578978036
Subject: Electronic and communication engineering
Abstract:
Classification is an important topic in machine learning. Traditional classification algorithms assume balanced data; however, many real-world data sets are imbalanced. On imbalanced problems, classifiers are biased toward the majority class, which leads to poor predictive accuracy on the minority class. This thesis studies imbalanced data classification methods in depth. Since different data sets have different feature distributions, no single existing solution is valid for all data sets.

First, a preprocessing scheme for sample selection based on Euclidean distance (ED-SESS) is proposed to optimize traditional oversampling methods. ED-SESS uses the Euclidean distance to measure the correlation between samples: samples closer to the center of the minority class are selected as suitable samples. Before oversampling, the best samples are selected from the original sample set to form a new training set; oversampling is then performed on the new training set to obtain a balanced data set. ED-SESS has low algorithmic complexity and can be combined with most traditional oversampling methods.

Second, the SMOTE algorithm has drawbacks, including over-generalization. Most existing SMOTE variants focus only on borderline minority samples, ignoring non-boundary minority samples that contain important information. To address this problem, an improved SMOTE algorithm based on a local adaptive distance (LAD-SMOTE) is proposed. LAD-SMOTE uses the local adaptive distance to find these informative non-boundary minority samples and then oversamples them using linear interpolation. Compared with other algorithms, LAD-SMOTE attends not only to hard-to-learn minority samples but also to non-boundary minority samples with important information, which significantly improves classifier performance.

Third, existing oversampling methods synthesize samples at the boundary between classes, so the synthesized minority samples can shift the decision surface. To address this decision surface shift caused by synthetic samples, a new oversampling method based on the feature space (FSOTE) is proposed. FSOTE finds clusters of minority samples in the feature space and synthesizes new samples only in the interior of these clusters. In this way, synthetic minority samples are kept out of the majority region, the generation of noise samples is avoided, and the decision surface shift problem is resolved.

Finally, due to the particularity of the imbalanced classification problem, traditional classification metrics are insufficient to evaluate the performance of different algorithms, so standardized evaluation metrics such as the F-measure, G-mean and Youden's index are adopted. A large number of data sets from the UCI repository are used to verify the effectiveness of the proposed algorithms. Experimental results demonstrate that ED-SESS improves the classification performance of existing oversampling methods, and that both LAD-SMOTE and FSOTE can effectively solve the imbalanced classification problem.
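The ED-SESS selection step described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function name `ed_sess_select` and the `keep_ratio` parameter are assumptions made for the example; the thesis may determine the number of retained samples differently.

```python
import numpy as np

def ed_sess_select(X_min, keep_ratio=0.75):
    """Keep the minority samples nearest the minority-class centroid.

    ED-SESS uses Euclidean distance to judge how 'suitable' a sample is:
    samples closer to the center of the minority class are retained for
    the subsequent oversampling step.
    """
    centroid = X_min.mean(axis=0)
    dists = np.linalg.norm(X_min - centroid, axis=1)
    n_keep = max(1, int(len(X_min) * keep_ratio))
    keep_idx = np.argsort(dists)[:n_keep]   # closest samples first
    return X_min[keep_idx]
```

The selected subset then replaces the original minority samples as the training set on which any traditional oversampling method (SMOTE or its variants) is run.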
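The linear-interpolation synthesis step that LAD-SMOTE reuses is the classic SMOTE rule. The sketch below shows only that shared step, applied to whatever samples the selection stage (e.g. the local adaptive distance in LAD-SMOTE) has chosen; the local adaptive distance itself is thesis-specific and is not reproduced here.

```python
import numpy as np

def smote_interpolate(X_min, n_new=5, k=3, seed=0):
    """Classic SMOTE synthesis: for each new sample, pick a minority
    point, pick one of its k nearest minority neighbours, and place the
    synthetic sample at a random position on the segment between them."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # uniform in [0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)
```

Because every synthetic point lies on a segment between two existing minority samples, restricting which samples are eligible (as LAD-SMOTE does) directly controls where new samples can appear.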
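The FSOTE idea of synthesizing inside minority clusters can be sketched as below. This is an assumption-laden illustration: a tiny Lloyd's k-means stands in for whatever clustering the thesis uses (the keywords mention K-means), and "inner space" is approximated by random convex combinations of a cluster's members, which by construction stay inside the cluster's convex hull.

```python
import numpy as np

def fsote_style_synthesize(X_min, n_clusters=2, n_new=4, seed=0):
    """Cluster the minority class, then synthesize each new sample as a
    random convex combination of one cluster's members, so it lies in
    the cluster interior, away from the majority region."""
    rng = np.random.default_rng(seed)
    # initialise centers from random minority points (plain Lloyd's k-means)
    centers = X_min[rng.choice(len(X_min), n_clusters, replace=False)].astype(float)
    for _ in range(10):
        dists = np.linalg.norm(X_min[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = X_min[labels == c].mean(axis=0)
    nonempty = [c for c in range(n_clusters) if (labels == c).any()]
    new = []
    for _ in range(n_new):
        c = rng.choice(nonempty)
        members = X_min[labels == c]
        w = rng.random(len(members))
        w /= w.sum()                 # convex weights summing to 1
        new.append(w @ members)      # interior point of the cluster
    return np.array(new)
```

Because the weights are convex, each synthetic sample is guaranteed to fall inside the hull of its cluster's members, which is how FSOTE avoids generating noise samples in the majority region.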
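The three evaluation metrics named above have standard definitions from the binary confusion matrix (minority class taken as positive), sketched here for reference:

```python
def imbalance_metrics(tp, fp, fn, tn):
    """F-measure, G-mean and Youden's index from confusion-matrix counts.

    recall (sensitivity) measures minority-class coverage; specificity
    measures majority-class coverage; the three metrics combine them so
    that a classifier cannot score well by ignoring the minority class.
    """
    recall = tp / (tp + fn)               # true positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)          # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5
    youden = recall + specificity - 1
    return f_measure, g_mean, youden
```

Unlike overall accuracy, all three drop sharply when the minority class is misclassified, which is why they are preferred for imbalanced benchmarks.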
Keywords/Search Tags: Imbalanced Classification, K-means, Local Adaptive Distance, SMOTE, Feature Space