
Credit Risk Prediction Research Based On Feature Engineering And Improved SMOTE Algorithm

Posted on: 2024-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Xu
Full Text: PDF
GTID: 2557307136998049
Subject: Applied statistics
Abstract/Summary:
The rapid development of the Internet industry and the integration of Internet technology with the traditional financial industry have made online consumer credit at commercial banks in China increasingly mature, and the scale of commercial banks' consumer loans is expanding. However, the non-performing loan rate of commercial banks is rising year by year, and in identifying credit-risk users the banks face high-dimensional data, timeliness requirements, and imbalanced sample ratios. In this context, quickly and accurately mining effective features, solving the sample imbalance problem, and building accurate credit risk prediction models are key to improving the risk control capability of commercial banks.

Based on consumer credit data for more than 30,000 online loan users provided by a bank, covering December 2019 to September 2020, this paper analyzes a data set containing 29,510 positive samples and 2,784 negative samples at three levels: features, data resampling, and model fusion.

At the level of feature engineering, this paper first constructs an 893-dimensional derived feature set and performs feature binning and feature encoding. It then proposes a combined embedding method, based on the idea of combination, and performs feature selection with the combined embedding method and with the single embedding method respectively. Finally, it builds credit risk prediction models on the different feature sets using the same base model. Comparing the results shows that, relative to the feature set selected by the single embedding method, the evaluation metrics of the prediction model built on the feature set selected by the combined embedding method increase by 5.2%, 2.6% and 1.1% on average, indicating that feature selection based on the combined embedding method is more effective.

At the level of data resampling, this paper proposes an improved SMOTE algorithm that resamples based on the relative density of neighborhoods, compares it with the sampling results of various undersampling, oversampling, and integrated sampling algorithms, and builds credit risk prediction models using the same base model. Comparing the results shows that, relative to the unsampled data set and the conventionally resampled data sets, the average evaluation metric of the prediction model built on the data set resampled with the improved SMOTE algorithm increases by 4.3% and 3.6%, respectively. This indicates that the improved SMOTE algorithm reduces the impact of sample imbalance on model accuracy and performs better than conventional resampling algorithms.

At the level of model construction, this paper builds credit risk prediction models based on different model combinations and model fusion frameworks, and compares their prediction performance. Comparing the results shows that, relative to the single model, the tree ensemble combination, and the non-tree ensemble combination, the average evaluation metric of the fusion model based on the heterogeneous model combination increases by 1.9%, 0.5% and 1.4%, respectively. Compared with the fusion model built with the Blending framework, the evaluation metric of the fusion model built with the Stacking framework increases by 1.0% on average, indicating that the Stacking fusion model based on heterogeneous model combinations performs better.

Overall, the Stacking fusion model based on a combination of heterogeneous models, with features screened by the combined embedding method and samples resampled by the improved SMOTE algorithm, is optimal for credit risk prediction.
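The resampling idea in the abstract can be illustrated with a minimal sketch. The abstract does not define its "relative density of neighborhoods" rule, so the weighting below is an assumption: an ADASYN-like stand-in that picks seed points more often when more majority samples sit among their nearest neighbours. The function name `density_weighted_smote` and its parameters are hypothetical and not taken from the thesis. Only the interpolation step is standard SMOTE.

```python
import math
import random


def density_weighted_smote(minority, majority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by SMOTE-style interpolation.

    Assumption (not from the thesis): a minority point is chosen as a seed
    with probability proportional to 1 + the number of majority points among
    its k nearest neighbours, so sparse or borderline regions get more samples.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    # Label every point: 0 = minority, 1 = majority.
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    weights, partners = [], []
    for i, p in enumerate(minority):
        # Neighbourhood over all points except p itself (index i).
        others = [labelled[j] for j in range(len(labelled)) if j != i]
        others.sort(key=lambda item: math.dist(p, item[0]))
        weights.append(1 + sum(label for _, label in others[:k]))
        # Interpolation partners: k nearest points of the same class.
        same = [q for j, q in enumerate(minority) if j != i]
        same.sort(key=lambda q: math.dist(p, q))
        partners.append(same[:k])
    synthetic = []
    for _ in range(n_new):
        i = rng.choices(range(len(minority)), weights=weights, k=1)[0]
        q = rng.choice(partners[i])
        gap = rng.random()  # position along the segment from seed to partner
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(minority[i], q)))
    return synthetic


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    majority = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (1.5, 1.5)]
    for point in density_weighted_smote(minority, majority, n_new=5):
        print(point)
```

With the class counts quoted in the abstract, fully balancing the classes would require 29,510 − 2,784 = 26,726 synthetic samples. The abstract does not say whether the thesis balances the classes fully, and a brute-force neighbour search like this one would be slow on 893-dimensional data of that size.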
Keywords/Search Tags:Feature Engineering, Sample Imbalance, Improved SMOTE Algorithm, Credit Risk Prediction