| Credit scores are widely used across industries to assess the creditworthiness of customers. In banking, they determine whether customers qualify for loans and credit cards and set the corresponding credit limits. On e-commerce platforms, credit scores likewise affect the credit limits available to customers. Credit scores also play an important role in building national credit systems, such as China’s social credit system. Credit scoring can be framed as the prediction of a continuous variable or as a classification problem, for example binary or ternary classification. Multiple linear discriminant analysis was a common early method for solving such classification problems. As research progressed, more methods emerged, including logistic regression, early neural networks, support vector machines, decision trees, and tree ensembles built on decision trees, such as random forests. In recent years, significant advances in computing power and the explosive growth of data volume have produced many more approaches to the credit scoring classification problem, and various machine learning and deep learning algorithms have performed well on real datasets. This thesis takes the classical logistic regression model as the benchmark and compares it with the random forest model and the newer XGBoost and CatBoost algorithms. AUC, the K-S value, the PSI value, and balanced accuracy (BA) serve as the primary evaluation metrics. The thesis uses a publicly available dataset from Kaggle and implements the modeling process in R, with a 4:1 split between the training set and the test set. The modeling process and results are presented below. First, a random forest model was built to classify the credit score dataset. Because the random forest model is sensitive to its parameters, a grid search with 10-fold cross-validation was used to tune the parameters 
mtry and ntree. The search took 10088.3 seconds, and the optimal parameters were mtry=5 and ntree=1000. The model was then built with these parameters and evaluated on the test set, giving an AUC of 0.8785, a K-S value of 0.63, a PSI of 0.01, and a BA of 0.8308. The AUC was 11.08% higher than that of the logistic regression model. Second, an XGBoost model was built to classify the credit score dataset. A grid search with 5-fold cross-validation found the optimal parameters nrounds=200, max_depth=9, eta=0.05, min_child_weight=0.7, and subsample=0.8; the search took 13248.26 seconds. The model was then built with these parameters and evaluated on the test set, giving an AUC of 0.897, a K-S value of 0.67, a PSI of 0.03, and a BA of 0.83. The AUC was 13.43% higher than that of the logistic regression model. Finally, a CatBoost model was built to classify the credit score dataset. A grid search with 5-fold cross-validation found the optimal parameters depth=8, learning_rate=0.05, and iterations=400. The model was then built with these parameters and evaluated on the test set, giving an AUC of 0.8196, a K-S value of 0.61, a PSI of 0.02, and a BA of 0.8196. The AUC was 3.63% higher than that of the logistic regression model. The thesis concludes that, judged on classification accuracy alone, the XGBoost and random forest models performed similarly, CatBoost was slightly worse, and logistic regression was the worst. However, the XGBoost model overfit severely, so the random forest model performed best overall. |
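
The four evaluation metrics named in the abstract (AUC, K-S, PSI, BA) and the percentage improvement over the benchmark can be illustrated with a short sketch. The thesis itself was implemented in R, so the Python below is not the author's code. The function names, the toy data, the equal-width PSI binning on [0, 1], the `eps` floor for empty bins, and the reading of "AUC improvement" as a relative change are all assumptions made for illustration.

```python
import math


def auc(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Tied scores share the average of the ranks they span.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def ks_statistic(y_true, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pairs = sorted(zip(scores, y_true))
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every observation tied at this score before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between two score samples in [0, 1].

    Equal-width bins and the eps floor are illustrative assumptions.
    """
    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[min(int(v * n_bins), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (tp / n_pos + tn / n_neg) / 2


def relative_improvement(model_auc, baseline_auc):
    """Relative change of a model's AUC over a baseline AUC (assumed definition)."""
    return (model_auc - baseline_auc) / baseline_auc


# Toy data, not taken from the thesis.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
labels = [1 if s >= 0.5 else 0 for s in p]  # 0.5 cutoff is an arbitrary choice
print(auc(y, p))                     # 0.75
print(ks_statistic(y, p))            # 0.5
print(balanced_accuracy(y, labels))  # 0.75
print(psi(p, p))                     # 0.0 for identical distributions
```

In practice PSI is usually computed between training-set and test-set scores, so a small value such as the 0.01 to 0.03 range reported above indicates that the score distribution is stable across the two samples.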