Personal Credit Risk Assessment With Unbalanced Data

Posted on: 2024-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: X L Pan
GTID: 2557307136497994
Subject: Applied statistics
Abstract/Summary:
At present, China's systemic financial risk is generally controllable, but personal credit risk remains a problem. Using financial technology to avoid credit risk effectively and to build a sound financial supervision system has become key to improving the financial market. In the context of big data, intelligently identifying defaulting customers in massive personal credit datasets through data mining can efficiently provide early-warning information to financial institutions. This thesis uses the personal credit dataset from the Home Credit Default Risk competition published on Kaggle, which comprises 7 data files with 202 features and 307,511 samples. The ratio of non-defaulting to defaulting customers is about 11:1, which makes this a typical imbalanced classification problem. Personal credit risk is assessed at three levels: feature engineering, data resampling, and ensemble (integration) algorithms. The results offer reference suggestions for research on imbalanced data and for financial institutions.

In feature engineering, feature derivation based on statistical methods is first summarized, covering three derivation approaches based on statistical indicators, algebraic operations, and real-world meaning. In addition, a feature derivation method based on the correlation between continuous variables and the positive and negative classes is proposed. Combining the two approaches yields a total of 535 features. The features are then filtered using variance filtering, correlation filtering, and the attribute bias method, leaving 314 features. Next, 5-fold cross-validation is used to compare the XGBoost, LightGBM, and CatBoost models. With only one-hot encoding applied to the original data, LightGBM achieves the highest AUC, 0.7599, and is selected as the baseline single model. After feature engineering, the AUC rises to 0.7884, an improvement rate of 3.75%. Of the 20 most important features, 10 were derived with the proposed method based on the correlation between continuous variables and the positive and negative classes, which illustrates the effectiveness of that method.

In data resampling, majority-class anomalies in the training set are first identified and removed with the Isolation Forest algorithm. Second, to clarify the class boundary, boundary noise samples in the training set are identified with the boundary noise factor algorithm; among the boundary samples, minority-class samples are first replicated by random oversampling, and majority-class samples are then removed. Third, to further reduce the imbalance, minority-class samples are synthesized with the ADASYN algorithm at a ratio of 1:3. After feature engineering and this improved hybrid sampling, the LightGBM model reaches an AUC of 0.7997, an improvement rate of 5.24%.

In the ensemble stage, several combinations of base classifiers, meta-models, and ensemble methods were compared. LightGBM and XGBoost models trained on the full sample and the full feature set were selected as base classifiers 1 and 2, and LightGBM and XGBoost models trained on the full sample and 70% of the feature set were selected as base classifiers 3 and 4. LightGBM serves as the meta-model of the resulting Stacking ensemble. After feature engineering, improved hybrid sampling, and Stacking, the AUC is 0.8056, an improvement rate of 6.01%, which reaches the middle-to-upper range of the competition's gold-medal level. This indicates that combining the three levels of processing effectively improves personal credit risk assessment.
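The abstract does not define "improvement rate". The reported figures are consistent with a relative AUC gain over the 0.7599 baseline, so the sketch below makes that assumption explicit and recomputes the three rates. The function and variable names (`improvement_rate`, `BASELINE_AUC`, `STAGES`) are hypothetical and chosen for illustration; only the AUC values come from the abstract.

```python
def improvement_rate(baseline_auc: float, new_auc: float) -> float:
    """Relative AUC gain over the baseline, in percent (assumed definition)."""
    return (new_auc - baseline_auc) / baseline_auc * 100


# Baseline: LightGBM with one-hot encoding only (AUC taken from the abstract).
BASELINE_AUC = 0.7599

# Cumulative processing stages and the AUC values reported in the abstract.
STAGES = {
    "feature engineering": 0.7884,
    "feature engineering + hybrid sampling": 0.7997,
    "feature engineering + hybrid sampling + Stacking": 0.8056,
}

# Round to two decimals to match the precision used in the abstract.
rates = {
    name: round(improvement_rate(BASELINE_AUC, auc), 2)
    for name, auc in STAGES.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```

Under this assumption the three stages round to 3.75%, 5.24%, and 6.01%, matching the abstract. Each rate is therefore cumulative against the original baseline rather than a gain over the preceding stage.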
Keywords/Search Tags:Personal Credit Risk, Feature Engineering, Hybrid Sampling, Stacking Integration, Lightgbm