| With the emergence of the big data era, imbalanced data classification has become an active research direction in data mining. Real-world tasks such as natural disaster prediction, financial risk assessment, and network intrusion detection are all imbalanced classification problems. Imbalanced data sets pose significant challenges to traditional classification algorithms: class imbalance biases the trained model toward the majority class, so it performs poorly on the minority class. This study addresses the imbalance problem from two aspects: data preprocessing and optimization of the classification algorithm. The specific research contents are as follows. In terms of data preprocessing, to address the shortcomings of the traditional SMOTE and ADASYN oversampling algorithms, the WSA (Weighted SMOTE-ADASYN) oversampling algorithm is proposed, which combines the advantages of SMOTE and ADASYN to oversample the minority class and thereby balance the data set. First, a minority-class sample point is selected, its K nearest neighbors are computed, and the numbers of majority-class and minority-class sample points around it are counted. Second, the degree of imbalance of the data set is determined, the number of sample points to be synthesized is calculated, and the position of the minority-class sample is determined by the ratio method. Finally, either the weighted SMOTE method or the ADASYN method is called to synthesize sample points, depending on the position of the minority-class sample point. In terms of classification algorithms, to improve the classification accuracy of common ensemble algorithms, a weighted random forest algorithm (RRF) based on
hierarchical sampling of Relief features is proposed. The algorithm first uses the Relief algorithm to calculate the feature weights of each data set and then stratifies the features according to these weights. Next, when the random forest performs Bootstrap sampling, samples are drawn uniformly from the stratified features, which reduces the interference of low-relevance features with the classification results. Then each decision tree is assigned a weight based on its individual classification performance, further improving the classification effect. Finally, the oversampling algorithm and the classification algorithm proposed in this thesis are combined into an imbalanced data classification framework, which is verified experimentally on UCI imbalanced data sets. Measured by F-measure, AUC, G-mean, and other indicators, the experiments show that the proposed algorithms outperform traditional oversampling algorithms and ensemble learning classification algorithms on imbalanced data. |
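
The abstract evaluates with F-measure and G-mean rather than plain accuracy. A minimal sketch of why is given below, using the standard textbook definitions of both metrics. The function names, the toy labels, and the choice of label `1` for the minority class are illustrative assumptions and are not taken from the thesis.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class (assumption)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def f_measure(y_true, y_pred, positive=1, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def g_mean(y_true, y_pred, positive=1):
    """G-mean: geometric mean of minority-class recall and majority-class recall."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy data: 2 minority samples (label 1) among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10  # a biased model that never predicts the minority class

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(accuracy)                            # 0.8, looks acceptable
print(f_measure(y_true, always_majority))  # 0.0, the minority class is never found
print(g_mean(y_true, always_majority))     # 0.0
```

In this toy case a model that ignores the minority class still reaches 80% accuracy, while both F-measure and G-mean fall to zero, which is the behavior that makes them suitable for comparing methods on imbalanced data.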