| With the spread of the internet and the development of a new generation of information technology, we have entered the era of big data. Faced with huge volumes of data, every industry must work out how to extract the value behind them. For financial institutions that provide loans, it is of great significance to use these data to predict a customer's risk level, so that a corresponding interest rate level can be matched and the risk of the loan business reduced as far as possible. Based on machine learning algorithms and resampling methods, this thesis builds models to predict interest rate grades, identifies the important factors affecting interest rate grades, and compares the influence of the Borderline-SMOTE method and the Near Miss method on model accuracy. First, the thesis introduces the background and significance of the topic, summarizes the state of research at home and abroad on the factors influencing interest rate levels, credit scoring, and data resampling methods, and briefly introduces the related theory. Second, descriptive analysis and preprocessing are carried out on the data set. Data preprocessing comprises six parts: missing value handling, feature derivation, feature encoding, feature normalization, feature selection based on the Relief F algorithm, and resampling. Two resampling methods are used: Borderline-SMOTE oversampling and Near Miss undersampling. Preprocessing yields three training sets: the original training set, the oversampled training set, and the undersampled training set. Then four models (logistic regression, extremely randomized trees, LightGBM, and Stacking) are built on each training set, giving 12 models in total. Finally, a variety of evaluation metrics are used to assess predictive performance, and the base model with the best predictive performance is used to determine the factors influencing interest rate grades. The results show that the Stacking model performs best and that the LightGBM model performs best among the three base models. Resampling increases the model's attention to minority-class samples, and the same resampling method affects different models differently. According to the LightGBM model, the number of loan applications in the last six months, the proportion of unpaid accounts, the loan amount, verification status, the loan-to-income ratio, and loan purpose can be regarded as significant factors affecting interest rate grades and deserve particular attention. |