Research on Credit Scoring Models with Multiple Sampling Methods Based on Unbalanced Data

Posted on: 2023-08-28 | Degree: Master | Type: Thesis
Country: China | Candidate: Y T Wang | Full Text: PDF
GTID: 2568306806969969 | Subject: Applied Statistics
Abstract/Summary:
With the advent of the big data era, China's credit market is developing vigorously. This growth promotes the country's economic development, but it also brings risks and challenges to the financial market. Customers' credit data can therefore be used to build an effective credit scoring model, which helps financial credit institutions identify non-performing loans and reduce losses, and also supports the sound development of the financial market. A credit scoring model needs to be stable and reliable, and because credit scoring data are typically imbalanced, the imbalance must be handled in order to obtain a model with better predictive performance. This thesis studies the problem of data imbalance in credit scoring and uses the German credit data set to compare the classification performance of six classification models (LR, KNN, NB, SVM, RF and XGBoost) under 11 sampling algorithms. As a baseline, blank experiments were run for the six classification models, that is, each model was evaluated with its default parameters, and the model parameters were then tuned using grid search. Based on the data set studied in this thesis, the following conclusions are drawn:

(1) The classification performance of a model tuned with grid search is significantly better than that of the same model using only default parameters. The thesis lists some of the optimal parameters of the six classification models. The experiments found that all three evaluation metrics (AUC, F-score and G-mean) improved after tuning, indicating that parameter tuning is of great help in improving classification performance when modeling credit scoring data. In addition, on the data set used in this thesis, the three models with the highest AUC after parameter tuning are SVM, LR and XGBoost.

(2) The classification performance of a model trained on data processed by a sampling method is significantly better than that of the model trained on the original data with default parameters. The experimental results show that, for most sampling methods, the AUC of the model trained on the sampled data is higher than that of the default-parameter model. This is because sampling reduces the number of majority-class samples, increases the number of minority-class samples, or does both, which reduces the imbalance of the data set and improves the predictive performance of the classification model. Given the imbalanced nature of data in the credit scoring field, it is therefore worth first applying a suitable sampling method to reduce the degree of imbalance and thereby improve the classification performance of the model.

(3) On the data set used in this thesis, most models trained on under-sampled data classify better than models trained on over-sampled data. The experimental results show that, for each of the six classification models, the combination with a sampling method that achieves the highest AUC is RUS-LR, RUS-KNN, ENN-NB, ENN-SVM, Tomek Links-RF and RUS-XGBoost, respectively. RUS, ENN and Tomek Links are all under-sampling methods. Therefore, for the German credit data set, under-sampling can be considered as the method for balancing the data.
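
The under-sampling idea and the G-mean metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `random_under_sample` and `g_mean` are hypothetical, the labels are toy data, and G-mean is assumed to follow its usual definition as the square root of sensitivity times specificity. The sketch uses only the Python standard library.

```python
import math
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Random under-sampling (RUS): randomly keep, for every class,
    only as many rows as the smallest class has."""
    rng = random.Random(seed)
    rows_by_class = {}
    for idx, label in enumerate(y):
        rows_by_class.setdefault(label, []).append(idx)
    target = min(len(rows) for rows in rows_by_class.values())
    keep = []
    for rows in rows_by_class.values():
        keep.extend(rng.sample(rows, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy, illustrative data: 7 majority-class rows (0) and 3 minority-class rows (1).
X = [[i] for i in range(10)]
y = [0] * 7 + [1] * 3

X_res, y_res = random_under_sample(X, y, seed=42)
print(Counter(y_res))  # each class now has 3 rows

# A model that always predicts the majority class never finds a minority
# row, so its sensitivity, and therefore its G-mean, is zero.
print(g_mean(y, [0] * len(y)))
```

The always-majority predictor in the last lines would still be correct on 7 of the 10 toy rows, which is why metrics such as G-mean and AUC, rather than plain accuracy, are the natural choice on imbalanced data.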
Keywords/Search Tags:Personal credit score, Unbalanced data, Sampling algorithm, Classification model