With the advent of the big data era and continuous scientific and technological progress, artificial intelligence (AI) technology has gradually become an indispensable part of daily life. Seemingly mature AI technologies are widely applied across many aspects of life and have brought great convenience to the information industry. However, for AI technology to be applied more broadly, it still faces many challenges. In classification tasks, conventional basic models cannot adapt well to complex, imbalanced data, so developing robust learning algorithms is of great research significance.

The essence of classification is to mine high-dimensional semantic information from data and to assign samples with the same attributes to the same category. Existing classification models can achieve relatively high accuracy. When faced with imbalanced data, however, conventional learning models based on empirical risk minimization are easily affected by the prior distribution of the samples. In other words, a biased data distribution leads to a biased decision space, which in turn harms the robustness and generalization of the model. Most current research on imbalanced data focuses on adjusting the learner through methods such as balanced sampling before training, cost-sensitive learning during training, and decision compensation after training. Starting from the prior distribution of the data, this thesis studies methods for learning from imbalanced data at both the algorithm level and the sample level. The contents are as follows:

(1) Algorithm level: we propose an improved Probability Density Machine (PDM) algorithm based on shared nearest neighbor clustering. PDM is a recently proposed algorithm for class imbalance learning; it captures prior data distribution information well and demonstrates robust performance in various Class Imbalance Learning (CIL) applications. However, we also observe that PDM is sensitive to CIL data with varying density and/or small class separations. To address this problem, we introduce the non-parametric Shared Nearest Neighbor (SNN) clustering technique into the PDM procedure and propose a new SNN-PDM algorithm. In particular, SNN adapts well to varying densities and captures small separations. We evaluated the proposed algorithm on a large number of CIL datasets, and the results show that SNN-PDM significantly outperforms PDM and several previous methods.

(2) Sample level: we propose a feature-level interpolation generation (FIG) method to address class imbalance. Many studies have shown that deep models also face the challenge of class imbalance, and effectively augmenting minority-class data in the image domain has long been a problem. The traditional SMOTE oversampling method generates new samples by interpolating directly in the input space, which alleviates the overfitting caused by random oversampling; in image data, however, this augmentation cannot effectively improve the performance of deep models. To solve this problem, FIG transfers SMOTE interpolation from the input space to the encoding space of an autoencoder, in the hope that the interpolated encodings provide better guidance for generating diverse images.
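The SNN idea mentioned in (1) can be illustrated with a minimal sketch: two samples are similar to the extent that their k-nearest-neighbor lists overlap, which makes the measure insensitive to absolute density. The function name `snn_similarity` and the toy data below are illustrative, not part of the thesis's implementation.

```python
import numpy as np

def snn_similarity(X, k=3):
    """Shared Nearest Neighbor similarity: for each pair of samples,
    count how many points appear in both k-nearest-neighbor lists."""
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude each point from its own list
    knn = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest neighbors
    n = len(X)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(set(knn[i]) & set(knn[j]))
            sim[i, j] = sim[j, i] = shared
    return sim

# Two 1-D groups at very different scales: SNN similarity is positive
# within a group and zero across groups, regardless of local density.
X = np.array([[0.0], [0.1], [0.2], [10.0], [11.0], [12.0]])
S = snn_similarity(X, k=2)
```

Because only neighbor-list overlap matters, the same threshold on `S` separates both the dense and the sparse group, which is the property exploited when SNN replaces density-based estimation inside PDM.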
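The core step of the FIG method in (2) can be sketched as ordinary SMOTE interpolation applied to latent codes rather than pixels. The sketch below assumes the minority images have already been mapped to encodings `Z_min` by an autoencoder (encoder and decoder are not shown); the function name `smote_in_latent_space` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_in_latent_space(Z_min, n_new, k=3):
    """SMOTE-style interpolation on minority-class encodings Z_min
    (one row per sample).  Each synthetic code lies on the segment
    between a sample and one of its k nearest minority neighbors;
    decoding such codes back to images is the FIG idea."""
    n = len(Z_min)
    d = np.linalg.norm(Z_min[:, None] - Z_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]
    out = []
    for _ in range(n_new):
        i = rng.integers(n)             # pick a minority encoding at random
        j = knn[i][rng.integers(k)]     # one of its k nearest minority neighbors
        lam = rng.random()              # interpolation coefficient in [0, 1]
        out.append(Z_min[i] + lam * (Z_min[j] - Z_min[i]))
    return np.stack(out)

Z = rng.normal(size=(10, 4))   # stand-in for autoencoder codes of minority images
Z_new = smote_in_latent_space(Z, n_new=5)
```

Interpolating in the encoding space keeps each synthetic code inside the span of real minority codes, so the decoder sees inputs close to its training distribution, unlike pixel-space interpolation, which blends two images linearly.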