
Research On Imbalanced Data Classification Based On Interval Oversampling

Posted on: 2023-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
GTID: 2568307022497814
Subject: Software engineering
Abstract/Summary:
Traditional classification algorithms perform well on balanced datasets but tend to misclassify minority-class samples when the data are imbalanced. In imbalanced problems such as spam filtering, medical diagnosis, and credit card fraud detection, the recognition accuracy of the minority class is of greater significance, so classification algorithms for imbalanced datasets merit in-depth study. The main approach to imbalanced classification is to rebalance the samples by oversampling. Traditional oversampling synthesizes samples based on the distribution of the minority-class boundary samples. However, it ignores differences in the amount of information carried by individual features and fails to highlight the contribution of important features, which leads to poor classification results. Building on an analysis of traditional oversampling, this thesis proposes interval-based oversampling, which improves the sample generation process while still accounting for differences in sample distribution. Interval oversampling selects boundary samples using the idea of nearest neighbors, takes into account the difference in the amount of information contained in each feature, and oversamples according to the important features. First, the boundary samples of the minority class are obtained with the nearest neighbor method, and the oversampling ratio and the number of important features are determined. Then, the important features are identified through feature selection based on information gain and are oversampled to obtain a balanced dataset. A classification model is then trained on the balanced data. Finally, the model is applied to personal credit evaluation, and a personal credit evaluation system is presented. The experimental results verify that interval oversampling performs in line with theoretical expectations when combined with support vector machine and naive Bayes classification algorithms. Compared with other common oversampling methods, it shows advantages in F-measure and G-mean. However, interval oversampling introduces additional parameters whose values are currently chosen and adjusted by experience, so how to set these parameters according to the characteristics of the data merits further research.
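The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, whose details the abstract does not give. It assumes numeric features, an information gain computed from a single split near each feature's median, boundary samples defined as minority samples with at least one majority-class sample among their k nearest neighbors, and SMOTE-style interpolation applied only to the selected features. All function names and parameters (`interval_style_oversample`, `n_new`, `n_features`, `k`) are hypothetical. The last function computes the two reported metrics using their standard definitions: F-measure as the harmonic mean of precision and recall, and G-mean as the square root of recall times specificity.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one numeric feature, split near its median (an assumption)."""
    threshold = sorted(values)[len(values) // 2]
    parts = ([y for v, y in zip(values, labels) if v < threshold],
             [y for v, y in zip(values, labels) if v >= threshold])
    return entropy(labels) - sum(len(p) / len(labels) * entropy(p) for p in parts if p)


def boundary_minority(X, y, minority, k):
    """Indices of minority samples with a majority-class sample among their k nearest neighbors."""
    found = []
    for i, xi in enumerate(X):
        if y[i] != minority:
            continue
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(xi, X[j]))[:k]
        if any(y[j] != minority for j in nearest):
            found.append(i)
    return found


def interval_style_oversample(X, y, minority, n_new, n_features, k=3, seed=0):
    """Create n_new synthetic minority samples, varying only the top-gain features."""
    rng = random.Random(seed)
    dims = range(len(X[0]))
    gains = [information_gain([row[d] for row in X], y) for d in dims]
    important = sorted(dims, key=lambda d: gains[d], reverse=True)[:n_features]
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    # Fall back to all minority samples if no boundary sample is found.
    seeds = boundary_minority(X, y, minority, k) or minority_idx
    synthetic = []
    for _ in range(n_new):
        base = X[rng.choice(seeds)]
        mate = X[rng.choice(minority_idx)]
        new = list(base)
        for d in important:
            # Interpolate inside the interval spanned by the two samples on feature d.
            new[d] = base[d] + rng.random() * (mate[d] - base[d])
        synthetic.append(new)
    return synthetic, important


def f_measure_and_g_mean(y_true, y_pred, positive):
    """F-measure (F1) and G-mean for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return f_measure, math.sqrt(recall * specificity)


# Toy data (made up for illustration): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.0], [0.1, 5.0], [0.2, 1.0], [1.0, 3.0],
     [1.1, 0.5], [0.9, 4.0], [1.2, 2.0], [1.0, 1.0]]
y = [1, 1, 1, 0, 0, 0, 0, 0]
synthetic, important = interval_style_oversample(X, y, minority=1, n_new=2, n_features=1)
```

In this sketch the synthetic rows would be appended to the training set with the minority label before a classifier such as a support vector machine or naive Bayes is fitted. The oversampling ratio (`n_new`), the number of important features (`n_features`), and `k` are the kind of extra parameters the abstract notes are currently set by experience.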
Keywords/Search Tags: Data mining, Oversampling, Support Vector Machines, Uncertain interval data