| In data mining and machine learning, imbalanced data distributions are common. In general, an imbalanced data distribution is one in which the number of samples differs markedly among categories. Traditional classification algorithms perform poorly on imbalanced classification problems. Over the past few decades, researchers have proposed many resampling methods that reduce the impact of imbalanced data on traditional classification algorithms and improve their classification performance to a certain extent. This thesis focuses on resampling methods for imbalanced data learning and studies them from three perspectives: oversampling, undersampling, and hybrid sampling. It addresses the following aspects of the problem and proposes a solution for each. 1. A method is proposed for analyzing the impact of imbalanced data on the performance of traditional classification algorithms. Specifically, augmentation algorithms based on resampling are first proposed and applied to the imbalanced data to obtain a group of imbalanced datasets with a gradually decreasing imbalance rate. To assess classification performance more objectively, a new evaluation metric for imbalanced data, AFG, is proposed by combining three existing metrics: the area under the ROC (Receiver Operating Characteristic) curve (AUC), the F-measure, and the G-mean (geometric mean). Together with the coefficient of variation (CV), AFG is used to analyze the stability of the classification performance of eight traditional classification algorithms on imbalanced data and the relationship between changes in classification performance and changes in the imbalance rate. 2. For the classification of absolutely imbalanced data, an oversampling method based on a conditional Wasserstein generative adversarial network with
gradient penalty is proposed. The method supplies additional auxiliary information to both the generator network and the discriminator network. The auxiliary information may be any related information; in this thesis, it is the class label of the data. On the one hand, because the method learns the original imbalanced data distribution, it can generate enough realistic minority class samples. On the other hand, the distance between the generated data distribution and the original imbalanced data distribution is measured with the Earth Mover's Distance (EMD) rather than the Kullback-Leibler (KL) divergence or Jensen-Shannon (JS) divergence used in the generative adversarial network and the conditional generative adversarial network, which also avoids the problems of unstable training, insufficient sample diversity, and mode collapse during model training. With this method, enough minority class samples are generated to bring the ratio of majority class samples to minority class samples in the original absolutely imbalanced data close to 1:1, which ultimately improves the classification performance of traditional classification algorithms on absolutely imbalanced data. 3. For the classification of relatively imbalanced data, an undersampling method that combines denoising, fuzzy c-means clustering, and representative sample selection is proposed. The goal of the method is to select representative majority class samples. By eliminating the unrepresentative and unimportant majority class samples from the original relatively imbalanced data, the method brings the ratio of majority class samples to minority class samples close to 1:1, which ultimately improves the classification performance of traditional classification algorithms on relatively imbalanced data. The method has three simple stages. In stage 1, the noisy, redundant, and boundary majority class samples that would degrade the clustering in stage 2 are eliminated by
denoising, which also effectively reduces the time consumed by the subsequent stages. In stage 2, the fuzzy c-means method and the Xie-Beni index are used to obtain a better clustering. In stage 3, representative majority class samples are selected based on the max-min distance idea, which avoids the near-duplicate selections that arise when the chosen representatives are too densely packed. 4. A parallel hybrid sampling framework for imbalanced data classification is proposed. Unlike the current mainstream serial hybrid sampling frameworks, this framework performs the oversampling and undersampling steps of hybrid sampling in parallel. Theoretical analysis and experimental results show that, on the one hand, the parallel hybrid sampling framework reduces the time consumed by oversampling and undersampling. On the other hand, it provides more information to the traditional classification algorithm than the current mainstream serial hybrid sampling frameworks, thereby improving the classification performance of the traditional classification algorithm on imbalanced data. 5. In current resampling methods, the sampling rate is determined randomly, so the resulting improvement in the classification performance of traditional classification algorithms on imbalanced data is also subject to randomness. A method based on a genetic algorithm is proposed to determine the sampling rate of a resampling method automatically. In this method, the AUC of the traditional classification algorithm on imbalanced data is taken as the objective function value of the genetic algorithm. Through encoding and the selection, crossover, and mutation operators, some individuals are retained in each iteration, and the population evolves over several generations. The sampling rate corresponding to the optimal objective function value is then obtained, which ensures
better and more stable classification performance when that sampling rate is used. |
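
As a rough illustration of two of the component metrics named in contribution 1, the following minimal sketch computes the F-measure and G-mean from confusion-matrix counts, and the coefficient of variation of a list of scores. It uses only the standard textbook definitions. The function names and the choice of the population standard deviation are assumptions made here for illustration. The abstract does not state how AUC, F-measure, and G-mean are combined into AFG, so that combination is deliberately not reproduced.

```python
import math
from statistics import mean, pstdev


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure of the positive (minority) class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def coefficient_of_variation(scores):
    """Standard deviation divided by the mean.

    Assumption: the population standard deviation is used; the source
    does not say whether the population or sample form is intended.
    """
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(scores) / m
```

A lower coefficient of variation across the datasets with gradually decreasing imbalance rate would indicate that a classifier's score is less sensitive to the imbalance rate, which is the kind of stability the analysis in contribution 1 examines.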