
Classification Algorithms For Class Imbalance Data

Posted on: 2024-02-05 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: J J Ren
GTID: 1528307340453794 | Subject: Computer software and theory
Abstract/Summary:
Classification of class-imbalanced data arises in many domains, such as network intrusion detection, medical diagnosis, and biomedicine. Traditional classification algorithms neglect the class imbalance in datasets and assume that both the number of samples and the misclassification cost are approximately the same for each class. Models learned by maximizing overall accuracy may therefore skew predictions toward the majority class, be less sensitive to minority class samples, and even misjudge minority class samples as noise. In many domains, correctly predicting the minority class samples (i.e., intrusion behavior, patients, etc.) is generally more important than correctly predicting the majority class samples (i.e., normal behavior, non-patients, etc.). Overcoming the shortcomings of traditional classification algorithms on class-imbalanced data has therefore become a very challenging issue in data mining and machine learning. To address this issue, we analyze the key difficulties of the imbalanced classification problem and propose several effective algorithms at different technical levels. The main work and innovations of this thesis are summarized as follows:

(1) Existing fuzzy membership functions are easily affected by class imbalance and cannot measure the importance of samples. To solve this problem, we design a new fuzzy membership function and combine it with cost-sensitive learning, yielding a new algorithm for noisy class imbalance problems named Slack-Factor-based FSVM (SFFSVM). In SFFSVM, the relative distances between samples and an estimated hyperplane, called slack factors, are used to define the fuzzy membership function. To eliminate the impact of class imbalance on the function and obtain more accurate sample importance, we rectify the importance according to the positional relationship between the estimated hyperplane and the optimal hyperplane of the problem, and according to the slack factors of the samples. Comprehensive experiments on artificial and real-world datasets demonstrate that SFFSVM outperforms the comparative methods on the F1, MCC, and AUC-PR metrics.

(2) When oversampling methods are applied to class-imbalanced classification, an improper selection of reference samples makes the classification task more complex. To address this problem, we first design a grouping scheme for the minority class samples, based on how likely each sample is to appear in the overlapping regions of the feature space, in order to identify the samples in those regions. We then propose a new oversampling method based on this grouping scheme that places the new samples far away from the overlapping region and rectifies the decision boundary properly. On this basis, a new effective classification algorithm for imbalanced data is developed. Extensive experiments show that the proposed algorithm is superior to eighteen benchmark algorithms in terms of three performance metrics, especially on data sets with a high imbalance ratio.

(3) Existing ensemble methods based on under-sampling tend to lose useful information about the majority class and to generalize poorly because of improper sampling strategies. To address this problem, we propose an equalization ensemble method (EASE) with two new schemes. First, we propose an equalization under-sampling scheme that generates a balanced data set for each base classifier, which reduces the impact of class imbalance on the base classifiers. Second, we design a weighted integration scheme in which the G-mean scores obtained by the base classifiers on the original imbalanced data set are used as the weights. These weights not only let the better-performing base classifiers dominate the final classification decision, but also adapt to a variety of imbalanced datasets of different scales while avoiding some extremely bad cases. Extensive experiments show that the proposed method is not only significantly better than the contending methods on 56 small-scale data sets with a low imbalance ratio (IR < 130), especially for F1 and MCC, but also superior to the methods using the under-sampling technique on larger-scale data sets with a high imbalance ratio (IR > 270). In addition, the base classifiers in EASE are found to be more diverse than those of the compared algorithms under the four metrics used.
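
The weighted integration idea in (3) can be illustrated with a small Python sketch. It is not the EASE implementation: plain random under-sampling stands in for the equalization under-sampling scheme, whose details the abstract does not give, and the nearest-centroid base learner, all function names, the majority-vote threshold, and the toy data are assumptions made only for illustration. What it does show is the step the abstract states explicitly: each base classifier is trained on a balanced subset, and its G-mean on the original imbalanced data becomes its voting weight.

```python
import math
import random


def g_mean(y_true, y_pred):
    """Geometric mean of minority recall (label 1) and majority recall (label 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return 0.0
    return math.sqrt((tp / pos) * (tn / neg))


def balanced_subset(X, y, rng):
    """Assumption: plain random under-sampling, not the thesis's equalization scheme.
    Keeps every minority sample and an equal-sized random draw of majority samples."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    picked = rng.sample(majority, min(len(minority), len(majority)))
    idx = minority + picked
    return [X[i] for i in idx], [y[i] for i in idx]


class CentroidClassifier:
    """Hypothetical toy base learner: predicts the class with the nearest mean vector."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [min((0, 1), key=lambda c: sq_dist(x, self.centroids[c])) for x in X]


def fit_weighted_ensemble(X, y, n_estimators=5, seed=0):
    """Train each member on a balanced subset; weight it by G-mean on the full data."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_estimators):
        Xb, yb = balanced_subset(X, y, rng)
        clf = CentroidClassifier().fit(Xb, yb)
        weight = g_mean(y, clf.predict(X))  # scored on the original imbalanced set
        members.append((clf, weight))
    return members


def predict_weighted(members, X):
    """Weighted vote: predict 1 when members voting 1 hold at least half the weight."""
    total = sum(w for _, w in members)
    preds = [clf.predict(X) for clf, _ in members]
    out = []
    for j in range(len(X)):
        score = sum(w * p[j] for (_, w), p in zip(members, preds))
        out.append(1 if total > 0 and score >= total / 2 else 0)
    return out


# Toy data (assumed): 40 majority points near the origin, 4 minority points far away.
X = [[0.01 * i, 0.0] for i in range(40)] + [[5.0 + 0.1 * i, 5.0] for i in range(4)]
y = [0] * 40 + [1] * 4
members = fit_weighted_ensemble(X, y)
predictions = predict_weighted(members, X)
```

Because the weights are computed on the original data rather than on the balanced subsets, a member that only looks good on its own resampled subset receives little influence in the final vote.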
Keywords/Search Tags: Class imbalance, Kernel method, Ensemble learning, Large-scale data, Cost-sensitive learning