In recent years, with the explosion of knowledge brought about by the development of information technology, imbalanced data processing has become a hot research topic in both academia and industry. Class imbalance in machine learning refers to the situation in which one class has far more instances than another, and it arises in many real-world applications, such as customer credit risk prediction, product fault diagnosis, medical data analysis, and fraud detection. In these applications the minority classes are the interesting and important ones, and misclassifying them carries a high cost. Traditional methods, however, tend to bias the classifier on imbalanced data sets, leading to poor classification performance on the minority classes. Maximizing the classification accuracy of minority-class samples has therefore become the goal of many researchers. This thesis starts from data-level oversampling algorithms and aims to build classification models with good generalization performance.

In class-imbalance learning, SMOTE is the classic method for alleviating data imbalance: it interpolates between two existing minority samples to create a new minority sample. Although SMOTE improves the prediction accuracy of the minority classes, it does not select samples carefully when synthesizing new ones. When minority samples are few and contain considerable noise, SMOTE is disturbed by the randomness of its nearest-neighbor selection and easily combines redundant and noisy samples, so the quality of the generated samples is low. Two general solutions to these limitations of SMOTE are proposed. The specific research contents are as follows:

1. A weighted oversampling method that optimizes the distribution of synthetic samples. An improved SMOTE method called OWOSD (Optimized Weighted Oversampling of Synthetic Sample Distribution) is proposed, in which the samples used by SMOTE are selected according to weights. Specifically, the method first applies the feature-weighted clustering algorithm WKMeans to preprocess the original data set. The minority class in the processed data set is then divided into safe samples and boundary samples based on the K-nearest-neighbor algorithm, and the two minority sample groups are assigned different weights. According to these group weights, the numbers of new samples to be synthesized are allocated to the safe-sample region and the boundary-sample region; within each of the two sample sets, weights are then distributed according to the Euclidean distance of each minority sample from the majority class, so that a different number of new samples is generated for each sample. SMOTE oversampling is then applied according to the number of new samples to be synthesized for each minority sample, ameliorating the imbalance of the data set. Finally, the effectiveness of the proposed method is verified experimentally.

2. An improved oversampling method based on a boundary factor. Another oversampling method, BFOM (Boundary Factor improved Oversampling Method), introduces a boundary factor into the SMOTE algorithm so that more samples are generated at the boundary of the minority class, improving the overall classification accuracy. Minority samples are divided into boundary samples and non-boundary samples according to how many majority samples appear among each minority sample's nearest neighbors, and only the boundary samples are weighted. Weights are assigned according to the distribution of the boundary samples: the closer a sample point is to the boundary, the greater its weight and the more new samples are generated from it. In this way the minority-class boundary is reinforced, which benefits the classification of minority samples and alleviates the difficulty of classifying samples located at the minority-class boundary. Secondly, a grid search algorithm is introduced to optimize the parameters of the random forest when the classification model is constructed with the random forest algorithm. Finally, experimental comparisons with different sampling methods and different classification algorithms verify that the model combining the BFOM sampling method with the random forest classifier improves classification performance to a certain extent over the comparison models, confirming the effectiveness of the model.

Finally, sandstorm data for some regions of Gansu were extracted from the Series of Strong Sandstorms in China and its Supporting Data Set and the Daily Value Data Set of Surface Climatic Data in China, and, combined with the improved boundary-factor-based oversampling method proposed in this thesis, a classification prediction model was built for sandstorm data in these regions of Gansu.
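The core SMOTE step that both proposed methods build on, creating a synthetic minority sample on the line segment between a minority sample and one of its minority-class nearest neighbors, can be sketched as follows. This is a minimal illustration, not the thesis implementation; the function name and the fixed random seed are assumptions of this sketch.

```python
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Return one synthetic sample on the segment between a minority
    sample x and one of its minority-class nearest neighbors."""
    gap = rng.random()                  # uniform in [0, 1)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)
a = np.array([1.0, 1.0])                # a minority sample
b = np.array([3.0, 5.0])                # one of its minority neighbors
synthetic = smote_interpolate(a, b, rng)
# synthetic lies somewhere on the segment between a and b
```

Because the interpolation coefficient is drawn uniformly from [0, 1), the synthetic point always stays inside the convex hull of the two parent samples, which is precisely why noisy or redundant parents yield low-quality synthetic samples.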
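The division of minority samples into safe and boundary samples, which both OWOSD and BFOM rely on, hinges on counting majority-class members among each minority sample's k nearest neighbors. A small sketch of that computation using brute-force Euclidean distances is given below; the function name is an assumption, and the thesis may handle ties and noise samples differently.

```python
import numpy as np

def boundary_factors(X, y, k=2, minority_label=1):
    """For each minority sample, the fraction of its k nearest
    neighbours that belong to the majority class: 0 marks a safe
    sample, larger values mark samples nearer the class boundary."""
    factors = {}
    for i in np.where(y == minority_label)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                   # exclude the sample itself
        nearest = np.argsort(d)[:k]
        factors[i] = float(np.mean(y[nearest] != minority_label))
    return factors

# Toy data: three minority points (label 1), three majority points (label 0).
X = np.array([[0.0, 0.0], [0.1, 1.0], [1.0, 0.4],
              [1.8, 0.5], [3.0, 0.0], [3.2, 1.0]])
y = np.array([1, 1, 1, 0, 0, 0])
factors = boundary_factors(X, y, k=2)
# factors[2] > 0: the point [1.0, 0.4] sits near the majority class
```

In OWOSD terms, samples with a zero factor form the safe group and the rest the boundary group; in BFOM, only the samples with a nonzero factor are weighted, with larger factors (closer to the boundary) receiving more synthetic samples.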
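Both methods then allocate a total budget of synthetic samples across minority samples in proportion to their weights. One way to make the per-sample counts integral while preserving the total is largest-remainder rounding; the thesis does not specify its rounding scheme, so the sketch below is one plausible choice under assumed names.

```python
import numpy as np

def allocate_counts(weights, n_new):
    """Split n_new synthetic samples across minority samples in
    proportion to weights, using largest-remainder rounding so the
    counts are integers that sum exactly to n_new."""
    w = np.asarray(weights, dtype=float)
    raw = w / w.sum() * n_new
    counts = np.floor(raw).astype(int)
    # hand the leftover samples to the largest fractional parts
    leftover = n_new - counts.sum()
    order = np.argsort(-(raw - counts))
    counts[order[:leftover]] += 1
    return counts

counts = allocate_counts([0.5, 0.3, 0.2], 10)
# counts is proportional to the weights and sums exactly to 10
```

Each minority sample then receives its allocated number of SMOTE interpolations, so heavily weighted (boundary-near) samples contribute more synthetic points.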
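The grid search over random-forest parameters used in the second contribution can be carried out with scikit-learn's GridSearchCV. The parameter grid and the synthetic imbalanced data set below are illustrative stand-ins, not the settings or data used in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small imbalanced toy problem standing in for the thesis data:
# roughly 90% majority, 10% minority.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1],
                           random_state=0)

param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)
# search.best_params_ holds the best combination found by cross-validation
```

Scoring with F1 rather than accuracy matters here: on imbalanced data, accuracy can be maximized by a classifier that ignores the minority class entirely, which is exactly the bias the thesis sets out to correct.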