| Compared with traditional banks, peer-to-peer lending platforms offer high investment returns, low borrowing thresholds, and simple operation. However, their capacity to control and bear risk is weak. Overdue repayments or defaults by borrowers can seriously harm a platform's business and may even cause the platform to fail. Improving a platform's ability to evaluate borrowers' credit and avoiding bad debts are therefore important measures for ensuring the stable development of peer-to-peer lending platforms. Accordingly, this thesis studies how peer-to-peer lending platforms can use historical data to accurately predict the credit risk of new loan customers before lending. Building on a review of related research on credit evaluation, the main research contents of this thesis are as follows. Firstly, to address the outliers, missing data, and class imbalance in the historical data of the lending platform, data preprocessing methods are designed. Abnormal records in the historical data are identified through data analysis and deleted. Different imputation methods are adopted according to the severity of the missingness; for features with a large share of missing values, a random forest model is trained to impute them. To address the shortcomings of the SMOTE algorithm in handling class imbalance, the k-means algorithm is used to improve SMOTE. Logistic regression, support vector machine, k-nearest neighbor, and decision tree models are applied to datasets with 150,000 samples, and comparative experiments verify the effectiveness of the improved SMOTE algorithm. Secondly, because the original features are few and manual exploration of the original features by experts is tedious, this thesis designs a GBDT+GA+LR Stacking ensemble learning method to improve the
model’s prediction ability. The first layer of the Stacking ensemble uses the GBDT algorithm to combine the original features. To select the best subset of the combined features for training the second-layer logistic regression model, this thesis uses a genetic algorithm to search for the optimal feature combination among the GBDT-combined features, guided by a between-class and within-class dispersion measure. The second layer then trains the logistic regression model on the features combined by the first-layer GBDT and filtered by the genetic algorithm. Comparative experiments verify the validity of combining the original features with the GBDT algorithm and the effectiveness of the GBDT+GA+LR Stacking ensemble learning method designed in this thesis. Thirdly, to address the weak generalization ability of a single machine learning model, a Bagging-based selective ensemble learning method is designed for customer credit risk assessment. To increase the diversity among the base learners, the training set is perturbed to generate multiple pairs of training and validation sets. Logistic regression and decision tree algorithms, which train quickly and predict well in credit evaluation problems, are used as base learners; using different machine learning algorithms further increases the diversity among the base learners. The better-performing models are then selected for the ensemble according to their performance on the validation set. Comparative experiments verify the effectiveness of the selective ensemble learning method designed in this thesis. |
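
The Bagging-based selective ensemble described in the third contribution can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes decision stumps as stand-in base learners (the thesis uses logistic regression and decision trees), treats the out-of-bag rows of each bootstrap sample as that learner's validation set, and combines the selected learners by majority vote. The names `fit_stump`, `selective_bagging`, `n_learners`, and `keep`, as well as the synthetic two-feature data, are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-feature threshold classifier by exhaustive search (stand-in base learner)."""
    best = None  # (training errors, feature index, threshold, label predicted above threshold)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for above in (0, 1):
                errors = sum((above if row[f] > t else 1 - above) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    _, f, t, above = best
    return lambda row: above if row[f] > t else 1 - above

def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def selective_bagging(X, y, n_learners=15, keep=5, seed=0):
    """Train learners on bootstrap samples and keep the `keep` best by validation accuracy."""
    rng = random.Random(seed)
    n = len(X)
    scored = []
    for _ in range(n_learners):
        # Perturb the training set: draw a bootstrap sample with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        # Assumption: rows left out of the sample serve as this learner's validation set.
        held_out = [i for i in range(n) if i not in chosen]
        if not held_out:
            continue
        model = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        score = accuracy(model, [X[i] for i in held_out], [y[i] for i in held_out])
        scored.append((score, model))
    # Selective step: only the best-scoring learners join the ensemble.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected = [model for _, model in scored[:keep]]

    def predict(row):
        # Majority vote over the selected learners.
        votes = Counter(model(row) for model in selected)
        return votes.most_common(1)[0][0]

    return predict, selected

# Synthetic illustration: label 1 ("default") whenever the first feature exceeds 0.5.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]
predict, selected = selective_bagging(X, y)
```

Calling `predict([0.9, 0.1])` returns the ensemble's vote for a new row. In practice the stumps would be replaced by the heterogeneous base learners the abstract names, and the selection criterion could be any validation metric suited to imbalanced credit data.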