Research And Application Of Unbalanced Dataset Classification Problem

Posted on: 2023-04-01 | Degree: Master | Type: Thesis
Country: China | Candidate: W Q Xu | Full Text: PDF
GTID: 2568307100470934 | Subject: Management Science and Engineering
Abstract/Summary:
With the advance of informatization, the volume of data is growing explosively. How to analyze and process this data and extract the valid information it contains has been a focus of ongoing research, and the effective classification of datasets is one of its hotspots. Classification methods generally train on available data to acquire prior knowledge and then identify the class of each sample to be classified. Research results on classification have been widely applied in practical scenarios, but the classification of unbalanced datasets remains a problem worth studying. With the progress of deep learning, such problems can be addressed further by combining unbalanced-dataset classification methods with deep learning models. Based on a review, summary, and analysis of related research, and targeting the poor recognition rate of minority classes when classifying unbalanced datasets, this thesis proposes an ensemble classification algorithm based on a cost-sensitive convolutional neural network and an adaptive synthetic oversampling method with outlier detection and Mahalanobis distance. Experiments on datasets from the UCI database and the Kaggle platform are used for comparative study and evaluation, and they demonstrate the validity of the algorithms. The research is as follows:

(1) A classification algorithm based on a cost-sensitive convolutional neural network and AdaBoost is proposed (AdaBoost-CSCNN). To fully explore the classification performance of convolutional neural networks on unbalanced datasets, cost-sensitive learning is combined with a convolutional neural network to construct a cost weighting mechanism, forming a cost-sensitive convolutional neural network. Building on AdaBoost ensemble learning theory, an ensemble classification algorithm based on this cost-sensitive convolutional neural network is then constructed. Classification accuracy and robustness are thereby enhanced by avoiding the misclassification of minority classes and the loss of key feature attributes while maintaining overall classification accuracy. The empirical results indicate that the AdaBoost-CSCNN algorithm can effectively cope with the loss of classification accuracy caused by class imbalance and improve the classification results.

(2) An adaptive synthetic oversampling method based on outlier detection and Mahalanobis distance is proposed (ADASYN-OD-MD). To overcome the problems of the adaptive synthetic oversampling method when generating minority-class samples, an adaptive synthetic oversampling method based on outlier detection and Mahalanobis distance is proposed. First, outlier sample points in the minority classes are eliminated to avoid generating class overlap. Then the number of samples to generate for each minority-class sample is determined adaptively according to the sample distribution. Mahalanobis distance is used instead of Euclidean distance to account for the correlation among feature variables. Thus, while the number of minority-class samples is increased, the validity of the generated samples is strengthened. Finally, the cost-sensitive convolutional neural network ensemble classification algorithm proposed earlier is applied to build a deep learning ensemble model based on the improved oversampling method for simulation experiments. Compared with other methods, classification performance is further improved.

(3) A prototype system is implemented. To verify the effectiveness of the model in practice, an evaluation index system is established for unbalanced credit risk evaluation datasets. On the basis of the algorithm study, a prototype credit risk evaluation system is implemented, which provides more accurate and satisfactory evaluation results for banks and other financial institutions.
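The cost weighting mechanism in contribution (1) is not specified in the abstract. As a minimal sketch of one common way to make a network's loss cost sensitive, the snippet below scales each sample's cross-entropy by a cost attached to its true class. The inverse-frequency cost rule and the function names `class_costs` and `cost_weighted_cross_entropy` are assumptions for illustration, not the thesis's actual formulation.

```python
import numpy as np

def class_costs(y, n_classes):
    """Assumed cost rule: inverse class frequency, N / (K * count_k).

    When every class is present, the average cost over samples is 1.
    """
    counts = np.bincount(y, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1.0))

def cost_weighted_cross_entropy(probs, y, costs):
    """Mean cross-entropy with each sample's loss scaled by the cost of its true class."""
    eps = 1e-12  # guards against log(0)
    picked = probs[np.arange(len(y)), y]  # predicted probability of the true class
    return float(np.mean(costs[y] * -np.log(picked + eps)))

# Toy data: 90 majority samples (class 0) and 10 minority samples (class 1)
y = np.array([0] * 90 + [1] * 10)
costs = class_costs(y, n_classes=2)
probs = np.full((100, 2), 0.5)  # an uninformative classifier
loss = cost_weighted_cross_entropy(probs, y, costs)
```

With this rule a misclassified minority sample contributes more to the loss than a majority sample, which is the general effect a cost-sensitive network aims for. In an AdaBoost-style ensemble, a loss of this kind would be used when training each base learner.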
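The three steps of contribution (2) (remove minority outliers, allocate generation counts adaptively, measure neighbourhoods with Mahalanobis distance) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's ADASYN-OD-MD: the outlier rule (drop a minority sample whose k nearest neighbours are all majority), the pooled covariance estimate, and the names `mahalanobis_matrix` and `adasyn_od_md_sketch` are hypothetical. It assumes at least two features and a binary label array.

```python
import numpy as np

def mahalanobis_matrix(A, B, VI):
    """Pairwise Mahalanobis distances between rows of A and rows of B.

    VI is the inverse covariance matrix of the features.
    """
    diff = A[:, None, :] - B[None, :, :]
    sq = np.einsum("ijk,kl,ijl->ij", diff, VI, diff)
    return np.sqrt(np.maximum(sq, 0.0))

def adasyn_od_md_sketch(X, y, minority=1, k=5, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: one covariance matrix estimated from all samples
    VI = np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))
    X_min = X[y == minority]
    n_new = int(np.sum(y != minority)) - len(X_min)  # samples needed to balance

    # ADASYN ratio: share of majority points among each minority sample's
    # k nearest neighbours (index 0 is skipped: normally the point itself)
    nn_all = np.argsort(mahalanobis_matrix(X_min, X, VI), axis=1)[:, 1:k + 1]
    r = np.mean(y[nn_all] != minority, axis=1)

    # Assumed outlier rule: all k neighbours belong to the majority class
    keep = r < 1.0
    X_keep, r_keep = X_min[keep], r[keep]
    if len(X_keep) < 2 or n_new <= 0:
        return np.empty((0, X.shape[1])), keep

    # Harder-to-learn samples (higher ratio) receive more synthetic points
    if r_keep.sum() > 0:
        w = r_keep / r_keep.sum()
    else:
        w = np.full(len(X_keep), 1.0 / len(X_keep))
    counts = np.round(w * n_new).astype(int)

    # Interpolate towards a random minority neighbour chosen by Mahalanobis distance
    nn_min = np.argsort(mahalanobis_matrix(X_keep, X_keep, VI), axis=1)[:, 1:k + 1]
    synthetic = []
    for i, g in enumerate(counts):
        for _ in range(g):
            j = rng.choice(nn_min[i])
            lam = rng.random()
            synthetic.append(X_keep[i] + lam * (X_keep[j] - X_keep[i]))
    return np.array(synthetic).reshape(-1, X.shape[1]), keep

# Toy data: 200 majority points and 20 minority points in two dimensions
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
X_syn, kept = adasyn_od_md_sketch(X, y)
```

Because counts are rounded per sample, the number of synthetic points only approximately equals the class gap. Every synthetic point is a convex combination of two retained minority samples, so it stays inside their bounding box.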
Keywords/Search Tags:unbalanced dataset, convolutional neural network, cost weighting mechanism, oversampling, prototype system