| Nobel Prize winner Muhammad Yunus pioneered a person-to-person unsecured micro-loan model, which has thrived in the Internet age in the form of P2P online lending. P2P online lending is an important part of Internet finance: it fills the gap that traditional lending leaves in small and medium-sized loans. The state encourages financial innovation, and a well-developed P2P online lending platform is needed to improve the utilization of idle funds. Judging from the development history of our country's P2P online lending industry, an unregulated P2P industry inevitably becomes a breeding ground for financial fraud, while strict supervision will cut off the growth path of platforms that attract investors with high interest rates. What is needed is the sound development of P2P online lending platforms, and sound development necessarily requires an intelligent risk control model. Building the most appropriate risk control model means matching the most suitable data processing method with the most suitable machine learning algorithm. Against this background, this article uses loan data from 2007 to 2018 from Lending Club, a US platform that is the world's largest P2P online lending platform and whose data are transparent and can be downloaded from its official website. After the data are obtained, exploratory data analysis is performed first, followed by a series of data cleaning steps based on that analysis. The address data are extracted separately, and models built without the address data are compared with models that add the address data under different encoding methods, in order to measure the impact on generalization ability and analyze the causes of these effects. The encoding methods used include the unsupervised methods of one-hot encoding, dummy variables, and label encoding, and the supervised methods of mean encoding (also called average encoding) and improved mean encoding. The model used is LightGBM, launched by Microsoft Research Asia in 2016, and in the comparison with traditional support vector machines and 
random forests, the AUC metric is used to evaluate the generalization ability of each model. The results show that in most cases supervised encoding improves generalization more than unsupervised encoding, but no single method should be trusted blindly. In fact, it is difficult to find one encoding method that performs well on all models; for the data set in this paper, the optimal combination is the LightGBM framework with the address data encoded by improved mean (average) encoding. It is equally difficult to identify a worst encoding method: even label encoding, which seems wrong for unordered categorical features, can work well with tree models, which are insensitive to numeric scale. Structurally, this article first introduces the research background and significance and summarizes the academic research on P2P online lending platforms, then gives a theoretical overview of P2P online lending and the machine learning algorithms used. Exploratory analysis is then performed on the data source, followed by data cleaning based on that analysis. Finally, experiments are performed and LightGBM is compared with the two other machine learning methods to draw conclusions. |
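
The difference between the unsupervised and supervised encodings named in the abstract can be illustrated with a minimal sketch. The toy data, the function names, and the `weight` smoothing parameter below are all hypothetical, and the smoothed mean encoding shown is a generic textbook variant, not necessarily the improved mean encoding used in this article.

```python
from collections import defaultdict


def label_encode(values):
    """Map each distinct category to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 column per distinct category (sorted for a stable order)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def smoothed_mean_encode(values, targets, weight=10.0):
    """Replace each category by a blend of its target mean and the global mean.

    encoded = (category_sum + weight * global_mean) / (category_count + weight)
    A larger `weight` pulls rare categories more strongly toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    table = {
        v: (sums[v] + weight * global_mean) / (counts[v] + weight)
        for v in counts
    }
    return [table[v] for v in values], table, global_mean


# Hypothetical toy data: a state-level address field and a default flag.
states = ["CA", "NY", "CA", "TX", "NY", "CA"]
defaulted = [1, 0, 0, 1, 0, 1]

labels, label_map = label_encode(states)
one_hot, columns = one_hot_encode(states)
encoded, table, prior = smoothed_mean_encode(states, defaulted, weight=2.0)
```

Label and one-hot encoding look only at the category values, while the mean encoding also uses the target, which is why it is called supervised. In a real pipeline the encoding table would be fitted on training folds only, since computing it on the rows being scored leaks the target into the features.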