| With today's rapid socioeconomic development, people's willingness to consume on credit is growing steadily. With the development of "Internet+" and big data technology, credit services now take a variety of online and offline forms. The growing demand for personal credit has brought financial institutions such as banks considerable profits and a large accumulation of customer data. At the same time, owing to factors such as lax regulatory review and incomplete information, the non-performing loan rate has gradually increased. How to make full use of the useful information in customer data for effective credit default prediction has therefore become an urgent problem. Based on a real credit data set, this thesis predicts borrower default by building an interpretable ensemble model and visualizes the main factors that affect the predictions. First, data preprocessing is carried out through univariate analysis of variance, feature transformation, feature encoding, and feature normalization. On this basis, personal credit default prediction models are built with logistic regression, XGBoost, and random forest, and their hyperparameters are tuned by grid search. Building on the existing random forest model, gradient boosting is used to improve the predictive ability of each individual base learner, and the base learners are combined by weighted averaging to form an optimized random forest model. Model performance is evaluated by jointly considering AUC, recall, F1 score, and accuracy. Comparative analysis shows that, for the data set used in this thesis, the ensemble models generally outperform the traditional model; the XGBoost model and the improved random forest model have relatively good classification and generalization performance and are suitable for credit default prediction. On the same test set and with the same decision threshold, the improved random forest model achieves a higher AUC, a higher recall for defaulters, and higher values on the other main evaluation metrics than the other three models, which indicates that it not only classifies the sample data well but also identifies defaulters better. Finally, the SHAP interpretation framework is introduced to visually analyze the ensemble tree model, and the important variables and their effects are examined both for individual samples and for the model as a whole, which to some extent addresses the weak interpretability of complex ensemble models. In real credit business scenarios, SHAP-based explanatory charts can help business staff intuitively analyze the important factors behind a user's credit evaluation result and support manual intervention and judgment, so the approach has practical value. |
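
The four evaluation indicators named in the abstract (AUC, recall for defaulters, F1, and accuracy, all compared at a common decision threshold) can be illustrated with a minimal, dependency-free sketch. This is not the thesis's code: the function names `auc` and `evaluate`, the default threshold of 0.5, and the toy labels and scores are all assumptions made for illustration, with label 1 standing for a defaulter.

```python
def auc(y_true, y_score):
    """AUC as the probability that a randomly chosen defaulter (label 1)
    receives a higher score than a randomly chosen non-defaulter (label 0).
    Ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, y_score, threshold=0.5):
    """Compute AUC, recall, F1 and accuracy at one decision threshold.
    The threshold value is an illustrative assumption, not from the thesis."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"auc": auc(y_true, y_score), "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy example (made-up values): 1 = defaulter, scores = predicted default probability.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.4, 0.6, 0.7, 0.3, 0.2]
metrics = evaluate(y_true, y_score, threshold=0.5)
```

Applying the same `threshold` to every model's scores on one shared test set keeps recall, F1, and accuracy comparable across models. AUC does not depend on the threshold at all, since it is computed from the ranking of the scores.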