| With the rapid development of consumer finance in recent years, personal credit business has grown quickly: online lending platforms have multiplied and loan products have diversified to cover almost every aspect of personal production and daily life. At the same time, credit risk poses an increasingly serious challenge, which makes risk assessment based on applicants' credit scores especially important. Many credit scoring models exist, each with its own advantages and disadvantages. Previous studies have found that a single model is fast to train but has relatively poor prediction accuracy and stability; that combining suitably chosen base classifiers into an ensemble can reduce prediction error and improve accuracy to some extent; and that, in practice, credit scoring datasets are highly imbalanced between positive and negative samples, so the handling of class imbalance also has an important impact on model performance. This paper addresses these issues as follows. First, random forest (RF) is used for feature selection. After fitting the data it measures the importance of every feature and, compared with the information value (IV) method commonly used in financial risk control, it avoids binning each feature and yields the importance ranking directly, which makes it simpler to implement and more efficient. Based on the importance ranking and business logic, the features with importance greater than 0.1 are retained, giving a total of 27 features as model input variables. Second, to test how different types of models perform in practice, four single models that perform well and are widely used in credit score classification, namely logistic regression (LR), decision tree (DT), naive Bayes (NB) and support vector machine (SVM), are selected for experiments. Each of the four is then used as the base classifier of a Bagging ensemble. A new heterogeneous ensemble model is also constructed for comparison: data subsets are built by bootstrap sampling and, among LR, NB, DT and SVM, the base classifier with the highest AUC is adaptively selected for voting. Third, to address the imbalance between positive and negative samples in the credit scoring dataset, an improved Balance Cascade method is proposed. It trains an AdaBoost classifier on a balanced dataset drawn from the positive and negative samples, keeping the classification error rate within a certain range so that positive samples are removed accurately. An adjustable parameter is then set according to the imbalance ratio of positive to negative samples, and a certain proportion of positive samples is removed repeatedly until the ratio of the remaining positive to negative samples is close to this parameter. Experiments on datasets with different positive-to-negative ratios, combined with a new two-layer model, are used to find the optimal ratio parameter. Because RF and XGBoost have a clear accuracy advantage in credit scoring, they are chosen as the first-layer base classifiers; because an overly complex second layer may overfit the training set and generalize poorly, the more stable single model, logistic regression, is chosen for the second layer. Experimental results on a credit dataset from the Ali Tianchi competition show that, when the positive-to-negative sample ratio is set to 2, the credit scoring ensemble model based on the improved Balance Cascade method reaches an accuracy of 0.80, a precision of 0.90, a recall of 0.84, an F1 value of 0.88 and an AUC of 0.74. Compared with the single classification models, the Bagging ensemble models and the heterogeneous ensemble model with adaptive AUC-based selection, it performs better and is more stable. |
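
The sample-removal schedule of the improved Balance Cascade method can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `keep_fraction`, `cascade_undersample` and `distance_scorer` are hypothetical, the geometric per-round schedule is an assumption borrowed from the standard Balance Cascade formulation, and the AdaBoost classifier is replaced by a pluggable scoring function so that the example runs with the standard library only. Here "majority" stands for the positive samples that are removed and "minority" for the negative samples.

```python
import random


def keep_fraction(n_majority, n_minority, target_ratio, n_rounds):
    """Fraction of majority samples to keep per round (assumed geometric
    schedule) so that after n_rounds - 1 removals the majority/minority
    ratio is close to target_ratio."""
    if n_rounds < 2:
        raise ValueError("n_rounds must be at least 2")
    # Never ask for more majority samples than are available.
    target = min(n_majority, target_ratio * n_minority)
    return (target / n_majority) ** (1.0 / (n_rounds - 1))


def cascade_undersample(majority, minority, score_fn,
                        target_ratio=2.0, n_rounds=5, seed=0):
    """Repeatedly drop the majority samples that the current scorer
    recognises most confidently, until the target ratio is approached.

    score_fn(balanced_majority, minority) is a hypothetical stand-in for
    training a classifier (AdaBoost in the paper) on a balanced subset; it
    must return a function mapping a sample to a confidence that the sample
    belongs to the majority class."""
    rng = random.Random(seed)
    frac = keep_fraction(len(majority), len(minority), target_ratio, n_rounds)
    remaining = list(majority)
    for _ in range(n_rounds - 1):
        # Balanced training subset: all minority samples plus an
        # equally sized random draw from the remaining majority samples.
        subset = rng.sample(remaining, min(len(minority), len(remaining)))
        confidence = score_fn(subset, minority)
        # Least confidently classified samples first; these are kept.
        remaining.sort(key=confidence)
        n_keep = max(1, round(len(remaining) * frac))
        remaining = remaining[:n_keep]
    return remaining


def distance_scorer(balanced_majority, minority):
    """Toy scorer for 1-D samples: the further a sample lies from the
    minority mean, the more confidently it is treated as majority."""
    centre = sum(minority) / len(minority)
    return lambda x: abs(x - centre)
```

With 1000 majority and 100 minority samples, `target_ratio=2.0` and `n_rounds=5`, each round keeps roughly two thirds of the remaining majority samples, so the final set is close to twice the size of the minority class. In a real experiment `score_fn` would wrap the trained classifier's predicted probability, and `target_ratio` is the parameter the abstract reports tuning.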