In recent years, with the explosion of knowledge brought about by the development of information technology, imbalanced data processing has become a hot research topic in both academia and industry. Class imbalance in machine learning refers to the situation in which one class has far more instances than another, and it arises in many real-world applications, such as customer credit risk prediction, product fault diagnosis, medical data analysis, and fraud detection. In these applications the minority classes are the interesting and important ones, and misclassifying them carries a high cost. Traditional methods, however, tend to bias the classifier on imbalanced data sets, leading to poor classification performance on the minority classes. Maximizing the classification accuracy of minority-class samples has therefore become the goal of many researchers. This thesis starts from data-level oversampling algorithms and aims to build classification models with good generalization performance.

In class-imbalance learning, SMOTE is the classic method for alleviating data imbalance: it interpolates between two existing minority samples to create a new minority sample. Although SMOTE improves the prediction accuracy of the minority classes, it does not select samples carefully when synthesizing new ones. When minority samples are few and contain considerable noise, SMOTE is disturbed by the randomness of its nearest-neighbor selection and easily combines redundant and noisy samples, so the quality of the generated samples is low. Two general solutions to these limitations of SMOTE are proposed. The specific research contents are as follows:

1. A weighted oversampling method that optimizes the distribution of synthetic samples. An improved SMOTE method called OWOSD (Optimized Weighted Oversampling of Synthetic Sample Distribution) is proposed, in which the samples used by SMOTE are selected according to weights. Specifically, the method first applies the feature-weighted clustering algorithm WKMeans to preprocess the original data set. The minority class in the processed data set is then divided into safe samples and boundary samples based on the K-nearest-neighbor algorithm, and the two minority sample groups are assigned different weights. According to these group weights, the numbers of new samples to be synthesized are allocated to the safe-sample region and the boundary-sample region; within each of the two sample sets, weights are then distributed according to the Euclidean distance of each minority sample from the majority class, so that a different number of new samples is generated for each sample. SMOTE oversampling is then applied according to the number of new samples to be synthesized for each minority sample, ameliorating the imbalance of the data set. Finally, the effectiveness of the proposed method is verified experimentally.

2. An improved oversampling method based on a boundary factor. Another oversampling method, BFOM (Boundary Factor improved Oversampling Method), introduces a boundary factor into the SMOTE algorithm so that more samples are generated at the boundary of the minority class, improving the overall classification accuracy. Minority samples are divided into boundary samples and non-boundary samples according to how many majority samples appear among each minority sample's nearest neighbors, and only the boundary samples are weighted. Weights are assigned according to the distribution of the boundary samples: the closer a sample point is to the boundary, the greater its weight and the more new samples are generated from it. In this way the minority-class boundary is reinforced, which benefits the classification of minority samples and alleviates the difficulty of classifying samples located at the minority-class boundary. Secondly, a grid search algorithm is introduced to optimize the parameters of the random forest when the classification model is constructed with the random forest algorithm. Finally, experimental comparisons with different sampling methods and different classification algorithms verify that the model combining the BFOM sampling method with the random forest classifier improves classification performance to a certain extent over the comparison models, confirming the effectiveness of the model.

Finally, sandstorm data for some regions of Gansu were extracted from the Series of Strong Sandstorms in China and its Supporting Data Set and the Daily Value Data Set of Surface Climatic Data in China, and, combined with the improved boundary-factor-based oversampling method proposed in this thesis, a classification prediction model was built for sandstorm data in these regions of Gansu.
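The core SMOTE step that both proposed methods build on, creating a synthetic minority sample on the line segment between a minority sample and one of its minority-class nearest neighbors, can be sketched as follows. This is a minimal illustration, not the thesis implementation; the function name and the fixed random seed are assumptions of this sketch.

```python
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Return one synthetic sample on the segment between a minority
    sample x and one of its minority-class nearest neighbors."""
    gap = rng.random()                  # uniform in [0, 1)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)
a = np.array([1.0, 1.0])                # a minority sample
b = np.array([3.0, 5.0])                # one of its minority neighbors
synthetic = smote_interpolate(a, b, rng)
# synthetic lies somewhere on the segment between a and b
```

Because the interpolation coefficient is drawn uniformly from [0, 1), the synthetic point always stays inside the convex hull of the two parent samples, which is precisely why noisy or redundant parents yield low-quality synthetic samples.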
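The division of minority samples into safe and boundary samples, which both OWOSD and BFOM rely on, hinges on counting majority-class members among each minority sample's k nearest neighbors. A small sketch of that computation using brute-force Euclidean distances is given below; the function name is an assumption, and the thesis may handle ties and noise samples differently.

```python
import numpy as np

def boundary_factors(X, y, k=2, minority_label=1):
    """For each minority sample, the fraction of its k nearest
    neighbours that belong to the majority class: 0 marks a safe
    sample, larger values mark samples nearer the class boundary."""
    factors = {}
    for i in np.where(y == minority_label)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                   # exclude the sample itself
        nearest = np.argsort(d)[:k]
        factors[i] = float(np.mean(y[nearest] != minority_label))
    return factors

# Toy data: three minority points (label 1), three majority points (label 0).
X = np.array([[0.0, 0.0], [0.1, 1.0], [1.0, 0.4],
              [1.8, 0.5], [3.0, 0.0], [3.2, 1.0]])
y = np.array([1, 1, 1, 0, 0, 0])
factors = boundary_factors(X, y, k=2)
# factors[2] > 0: the point [1.0, 0.4] sits near the majority class
```

In OWOSD terms, samples with a zero factor form the safe group and the rest the boundary group; in BFOM, only the samples with a nonzero factor are weighted, with larger factors (closer to the boundary) receiving more synthetic samples.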
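Both methods then allocate a total budget of synthetic samples across minority samples in proportion to their weights. One way to make the per-sample counts integral while preserving the total is largest-remainder rounding; the thesis does not specify its rounding scheme, so the sketch below is one plausible choice under assumed names.

```python
import numpy as np

def allocate_counts(weights, n_new):
    """Split n_new synthetic samples across minority samples in
    proportion to weights, using largest-remainder rounding so the
    counts are integers that sum exactly to n_new."""
    w = np.asarray(weights, dtype=float)
    raw = w / w.sum() * n_new
    counts = np.floor(raw).astype(int)
    # hand the leftover samples to the largest fractional parts
    leftover = n_new - counts.sum()
    order = np.argsort(-(raw - counts))
    counts[order[:leftover]] += 1
    return counts

counts = allocate_counts([0.5, 0.3, 0.2], 10)
# counts is proportional to the weights and sums exactly to 10
```

Each minority sample then receives its allocated number of SMOTE interpolations, so heavily weighted (boundary-near) samples contribute more synthetic points.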
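The grid search over random-forest parameters used in the second contribution can be carried out with scikit-learn's GridSearchCV. The parameter grid and the synthetic imbalanced data set below are illustrative stand-ins, not the settings or data used in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small imbalanced toy problem standing in for the thesis data:
# roughly 90% majority, 10% minority.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1],
                           random_state=0)

param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)
# search.best_params_ holds the best combination found by cross-validation
```

Scoring with F1 rather than accuracy matters here: on imbalanced data, accuracy can be maximized by a classifier that ignores the minority class entirely, which is exactly the bias the thesis sets out to correct.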