| With the development of information technology and computer hardware, large amounts of data are generated and stored in many fields. Researchers seek to extract information valuable to society from these massive data sets, but they are hampered by the problem of imbalanced data. Imbalanced data refers to a data set in which the number of samples of one class differs greatly from the number of samples of the other classes. The problem is widespread in malicious traffic detection, fault detection, disease diagnosis, and financial fraud detection. Most existing machine learning methods perform poorly on imbalanced data. For example, when a traditional classification algorithm that aims to maximize overall accuracy is applied directly to imbalanced data, the trained model pays more attention to the majority class and neglects the minority class, which lowers classification accuracy on the minority class. Correctly classifying imbalanced data therefore remains a major challenge. This paper studies the imbalanced classification problem in depth, and its main contributions are as follows: (1) For imbalanced binary classification, this paper designs and develops OSEA, an oversampling ensemble algorithm based on sample classification difficulty. The paper introduces the concept of classification difficulty, which reflects the combined influence on the classifier of all factors that affect classification accuracy. OSEA combines an oversampling algorithm with an ensemble learning algorithm and uses classification difficulty as the sampling weight to guide classifier training. The algorithm was tested on synthetic and real-world datasets, and OSEA's AUCPRC reached 93.1%. Compared with many current general-purpose, high-performing imbalanced classification algorithms, OSEA improves on multiple evaluation metrics. (2) For imbalanced multi-class classification, this paper designs and develops MC-HSEA, a hybrid resampling ensemble algorithm based on a decomposition strategy. The paper introduces a spherical neighborhood cleaning technique, which alleviates sample overlap while preserving neighborhood sample information. The algorithm uses the one-vs-one (OVO) decomposition strategy to simplify the multi-class problem, then applies spherical neighborhood cleaning to the samples, and finally uses the oversampling ensemble algorithm to oversample within the spherical region around minority samples. Performance was tested on multiple imbalanced multi-class datasets. Compared with four general imbalanced multi-class classification algorithms, MC-HSEA improved mGM by 9.93% on average and AvAcc by 9.33% on average. (3) For imbalanced time series classification, this paper designs and develops SNN-DCUS, a density clustering undersampling algorithm based on shared nearest neighbor similarity. The paper uses shared nearest neighbor similarity to alleviate the curse of dimensionality in time series data, and uses the concept of core points from density clustering to handle clusters of different sizes and shapes. The algorithm was tested on multiple imbalanced time series datasets, where the F1-score of SNN-DCUS increased by 5.4% on average, G-mean by 7.45% on average, and AUCPRC by 7.55% on average. |
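
The shared nearest neighbor similarity mentioned in contribution (3) can be illustrated with a minimal sketch. It assumes the common simple definition, in which the similarity of two samples is the number of neighbors their k-nearest-neighbor lists have in common, and it assumes Euclidean distance between equal-length series. The function names `knn_sets` and `snn_similarity` are hypothetical and are not taken from the paper; the actual SNN-DCUS formulation may differ.

```python
import math


def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest neighbors.

    Assumption: Euclidean distance between equal-length sequences.
    """
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by its distance to p and keep the k closest.
        ranked = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(ranked[:k]))
    return neighbors


def snn_similarity(points, k):
    """Return a matrix whose entry [i][j] counts the k-nearest neighbors
    shared by point i and point j (simple SNN variant, assumed here)."""
    nn = knn_sets(points, k)
    n = len(points)
    return [[len(nn[i] & nn[j]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Two well-separated toy groups of short "series".
    series = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for row in snn_similarity(series, k=2):
        print(row)
```

Because the score depends on neighbor rankings rather than raw distances, it tends to stay informative when distances concentrate in high-dimensional data, which is the motivation the abstract gives for using it on time series.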