
Research On The Classification Algorithm Of Imbalanced Data Based On Boosting

Posted on: 2019-05-24    Degree: Master    Type: Thesis
Country: China    Candidate: J Ma    Full Text: PDF
GTID: 2428330566977500    Subject: Engineering
Abstract/Summary:
The rapid development of the Internet has led to the birth of massive data, which provides some convenience for people seeking valuable information but also leaves them drowning in an ocean of information. To improve the efficiency of retrieving valuable information, automatic classification of massive data is a good choice. However, class imbalance occurs very frequently in massive data; that is, the number of samples in one class is significantly smaller than the number of samples in the other classes. Traditional classification algorithms are often unsatisfactory in solving the data imbalance problem, so it is very necessary to study the classification of imbalanced data.

The existing mainstream methods for imbalanced data classification combine sampling algorithms with ensemble learning algorithms, such as SMOTEBoost (Synthetic Minority Over-sampling Technique and AdaBoost.M2), EE (Easy Ensemble), RUSBoost (Random Under-sampling Technique and AdaBoost.M2), and EUSBoost (Evolutionary Under-sampling Technique and AdaBoost.M2). These algorithms assign the same weight to every sample in the training set before iterative learning begins. However, in the presence of data imbalance the importance of individual samples differs significantly, so these algorithms ignore the prior information contained in the sample distribution; that is, they have a notable defect.

To address this problem, this paper proposes a new weighting strategy for boosting algorithms, PKW (use Prior Knowledge to Weight samples). PKW first uses a clustering algorithm to find the clustering center of each class, then computes the Euclidean distance between each sample and its clustering center, and finally weights the samples with a Gaussian kernel function: the smaller the distance, the higher the importance of the sample. In this way, the method accurately captures the prior information of the sample distribution, namely the degree of importance of each sample. The weighting strategy PKW is then applied to the AdaBoost.M2, SMOTEBoost, EE, RUSBoost, and EUSBoost algorithms to generate the improved algorithms PKWA, PKWS, PKWEE, PKWR, and PKWE, respectively.

Comprehensive experiments on 30 publicly available data sets show that the improved algorithms outperform the state-of-the-art methods in terms of three popular metrics for imbalanced classification: AUC (Area Under the Curve), F-Measure, and G-Mean. To verify the statistical differences between the algorithms, this paper applies the Friedman test and the Nemenyi post-hoc test to the experimental results for each evaluation metric.
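The PKW idea described above (cluster centers, Euclidean distances, Gaussian-kernel weighting) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the class mean stands in for the unspecified clustering center, the kernel form `exp(-d^2 / (2*sigma^2))` and the bandwidth `sigma` are assumptions, and the weights are normalized into an initial boosting distribution.

```python
import numpy as np

def pkw_weights(X, y, sigma=1.0):
    """Sketch of PKW: weight each sample by a Gaussian kernel of its
    Euclidean distance to its class center (smaller distance -> higher
    weight), then normalize to a distribution for boosting."""
    weights = np.empty(len(X), dtype=float)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        center = X[idx].mean(axis=0)                  # stand-in clustering center
        d = np.linalg.norm(X[idx] - center, axis=1)   # Euclidean distances
        weights[idx] = np.exp(-d ** 2 / (2 * sigma ** 2))
    return weights / weights.sum()                    # initial sample distribution

# Toy data: two minority samples (class 0), three majority samples (class 1),
# with one majority outlier far from its class center.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2], [9.0, 9.0]])
y = np.array([0, 0, 1, 1, 1])
w = pkw_weights(X, y)
```

In a boosting algorithm such as AdaBoost.M2, `w` would replace the usual uniform initial distribution, so that samples near their class center start with more influence than outliers.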
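Two of the three evaluation metrics used in the experiments, F-Measure and G-Mean, can be computed directly from confusion counts; a hedged sketch follows, assuming a binary task with the minority class labelled 1. AUC requires ranking scores rather than hard predictions and is omitted here.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Compute F-Measure and G-Mean from binary predictions,
    treating class 1 as the (minority) positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = (recall * specificity) ** 0.5             # balances both classes
    return f_measure, g_mean
```

G-Mean is the geometric mean of the per-class accuracies, so a classifier that ignores the minority class scores 0 even when its overall accuracy is high, which is why these metrics are preferred over plain accuracy for imbalanced data.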
Keywords/Search Tags: classification, imbalanced data set, ensemble learning, PKW