| An efficient and scientific credit risk assessment system is an effective way to address the credit default risk caused by information asymmetry between banks and users. Current bank user data suffer from class imbalance, and the feature attributes of the original data set are limited. Conventional credit risk assessment models are trained directly on the original data set, so the deeper information in the data is not used effectively, and the generalization ability of the models and the reliability of their predictions often fall short of the needs of commercial banks. This thesis therefore proposes a bank user credit risk assessment system based on data mining theory. The system uses feature construction to expand the feature dimension of the original data set and mine its deeper information, and it balances the data set through combined sampling, which improves both the generalization ability of the system and the reliability of its predictions. Firstly, the original data set is preprocessed through missing value handling, outlier handling, and data transformation. To address the class imbalance, SMOTE (Synthetic Minority Oversampling Technique) and Tomek links are combined to balance the sample data. This combination avoids the risk that an under-sampling algorithm removes information-rich samples, and it also avoids the misclassification of boundary samples caused by overlapping synthetic samples generated during SMOTE sampling. As a result, the model's ability to recognize the classes with fewer samples is improved. Secondly, to address the limited feature attributes of the original data set, this thesis uses expert knowledge from the field of financial risk control together with data mining techniques such as feature binning and feature crossing to derive new features, filters them by IV (Information Value), and retains 25 feature fields that are conducive to model classification. Finally, the balanced, feature-engineered, and conventionally preprocessed data sets are used as model inputs. Single and ensemble classification algorithms are compared on the user credit data set published by Lending Club: logistic regression, naive Bayes, decision tree, random forest, XGBoost, and Stacking. The experimental results show that the AUC values improved for models built on the balanced data set. The best performance was obtained by the ST-Stacking model, whose AUC and accuracy reached 91.77% and 88.63%, respectively. |
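
The IV-based feature filtering mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard definition of Information Value for a binned feature, the sum over bins of (share of non-default users minus share of default users) multiplied by the log of their ratio. The function name `information_value`, the additive `smoothing` constant, and the toy data are assumptions made for illustration; they are not taken from the thesis, which does not state its binning scheme or IV threshold here.

```python
import math
from collections import Counter


def information_value(bins, labels, smoothing=0.5):
    """Information Value of one already-binned feature.

    bins   -- bin identifier for each sample (any hashable value)
    labels -- 1 for a default ("bad") user, 0 for a non-default ("good") user
    smoothing -- hypothetical additive constant that keeps empty cells
                 from producing log(0) or a division by zero
    """
    good, bad = Counter(), Counter()
    for b, y in zip(bins, labels):
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1

    all_bins = set(good) | set(bad)
    # Totals include the smoothing mass so the shares still sum to 1.
    total_good = sum(good.values()) + smoothing * len(all_bins)
    total_bad = sum(bad.values()) + smoothing * len(all_bins)

    iv = 0.0
    for b in all_bins:
        p_good = (good[b] + smoothing) / total_good
        p_bad = (bad[b] + smoothing) / total_bad
        # Each term is non-negative: the difference and the log share a sign.
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv


# Toy data (hypothetical): a feature that separates the classes
# versus one that carries no information about the label.
labels = [0, 0, 0, 1, 1, 1]
separating = ["low", "low", "low", "high", "high", "high"]
uninformative = ["a", "b", "c", "a", "b", "c"]

iv_separating = information_value(separating, labels)
iv_uninformative = information_value(uninformative, labels)
```

A feature whose bins have the same share of default and non-default users contributes an IV of zero, while a feature that separates the two groups yields a larger value, which is why ranking derived features by IV can serve as a filter before modelling.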