Research On The Classification Algorithm Of Imbalanced Data Sets

Posted on: 2022-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Cui
GTID: 2518306527983059
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, the total amount of data produced across all industries has grown at an astonishing rate. Data mining emerged to extract useful information from these massive data sets, and it is now widely used in many fields, playing an important role in global competition and social life. However, much real-world data is imbalanced in its class distribution. Traditional classifiers are usually designed for balanced data; when faced with imbalanced data, they favor the majority class, and classification performance on the minority class is hard to guarantee. Handling the classification of imbalanced data is therefore an urgent problem. Sampling-based methods are a common and effective way to address it, so this thesis takes sampling as its starting point and investigates it further. Three methods are proposed for the imbalanced classification problem. The specific work is summarized as follows.

First, by combining the synthetic minority over-sampling technique (SMOTE) with a clustering algorithm, this thesis proposes an improved algorithm named CSMOTE. The algorithm discards SMOTE's linear interpolation between nearest neighbors and instead synthesizes new samples within the range of the clusters obtained by clustering the minority class. Samples are screened by their Euclidean distances, which reduces the chance that noisy samples take part in the synthesis. In comparisons with several state-of-the-art methods on multiple imbalanced data sets, CSMOTE achieves higher classification performance and handles the problem effectively.

Second, starting from the perspective of diversity, this thesis proposes a two-stage sampling method and uses it to build an imbalanced data ensemble classification algorithm based on sampling and feature selection (IDESF). While keeping the samples in the data reasonable, the two-stage sampling increases the differences among the data, which implicitly enhances the diversity of the base classifiers, and it also balances the data distribution. IDESF is compared with other imbalanced classification algorithms on multiple data sets. The results show that IDESF obtains higher AUC (area under the ROC curve) and G-mean values and achieves outstanding classification performance.

Third, this thesis proposes a new classification algorithm named CSMOTE-AdaBoost, which combines AdaBoost and CSMOTE. At the algorithm level, AdaBoost increases the weights of minority class samples that are difficult to distinguish, which improves recognition. At the data level, CSMOTE increases the number of minority class samples and reduces the imbalance, which also improves recognition. The combined algorithm therefore increases the classifiers' attention to the minority class at both the data and algorithm levels, further improving classification performance. The effectiveness of CSMOTE-AdaBoost is verified by comparison with other advanced methods on multiple imbalanced data sets.

In summary, the algorithms proposed in this thesis can effectively address the imbalanced classification problem and improve recognition of the minority class.
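The abstract gives only a high-level description of CSMOTE, so the following Python sketch is an illustration of the general idea, not the thesis's actual algorithm. It assumes k-means as the clustering step, screens each cluster by keeping the fraction of members closest to the cluster centroid (Euclidean distance), and creates each synthetic sample as a random convex combination of two screened members of the same cluster. The names `kmeans`, `cluster_oversample`, `keep_ratio`, and the default values are all hypothetical choices made for this example.

```python
import math
import random


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def kmeans(points, k, rng, iters=20):
    """Tiny Lloyd's k-means (assumed clustering step); returns non-empty clusters."""
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center; an empty cluster keeps its previous center.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def cluster_oversample(minority, n_new, k=2, keep_ratio=0.8, seed=0):
    """Synthesize n_new minority samples inside clusters of the minority class.

    Assumption: screening keeps the keep_ratio fraction of each cluster that
    lies closest to its centroid, so outlying (possibly noisy) points are less
    likely to take part in synthesis.
    """
    rng = random.Random(seed)
    screened = []
    for members in kmeans(minority, k, rng):
        center = centroid(members)
        by_distance = sorted(members, key=lambda p: math.dist(p, center))
        keep = max(1, math.ceil(keep_ratio * len(members)))
        screened.append(by_distance[:keep])

    synthetic = []
    while len(synthetic) < n_new:
        members = rng.choice(screened)            # pick a cluster
        a, b = rng.choice(members), rng.choice(members)
        t = rng.random()                          # position between a and b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Usage: six minority points in two groups, ten synthetic points requested.
minority = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4), (5.0, 5.0), (5.3, 4.8), (4.9, 5.4)]
new_points = cluster_oversample(minority, n_new=10)
print(len(new_points))
```

Because both endpoints of each combination come from the same cluster, a synthetic point stays inside that cluster's extent, whereas plain SMOTE interpolates between a sample and one of its nearest neighbors regardless of cluster structure.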
Keywords/Search Tags: data mining, imbalanced data, classification, sampling, ensemble learning