| The P2P (peer-to-peer) online lending platform, an important mode of Internet finance, provides loan-related services to both borrowers and lenders. As the volume of lending transactions on P2P platforms has grown, the mining and analysis of P2P transaction data has attracted much attention. Large amounts of high-dimensional historical transaction data have been produced, and advanced machine learning and data mining techniques have been applied to them; important research topics include the factors influencing the success rate of online loans, fraud identification, default risk prediction, and risk factors. This thesis studies online loan data, the factors influencing the loan success rate, and the risk factors of loans, and mainly comprises the following three aspects: (1) Regarding the factors influencing the success rate of online loans, the existing linear regression method neither accounts for multicollinearity between variables nor uses the optimal variable subset to build the regression model. This thesis combines the Lasso regression method with a regression model built on the optimal variable subset to analyze the factors affecting the success rate of online borrowing, which avoids the interference of multicollinearity with the model and improves the accuracy with which the model fits the data. An empirical analysis of lending data from the Lending Club platform shows that the proposed method is significantly superior to the compared approach in both fitting precision and avoidance of multicollinearity. (2) Describing a qualitative attribute in a regression model usually requires introducing dummy variables. This thesis proposes a method for describing the differing degrees of importance of the dummy variables in a regression equation. The method decomposes the regression sum of squares of an equation with dummy variables into the sum for the dummy-variable part and the sum for the non-dummy-variable part, calculates the proportion of each part in the regression equation, and takes this proportion as the index of the relative importance of each dummy variable. On lending data sets from Lending Club and Prosper, experiments on the influence of loan purpose on the borrowing success rate and of credit grade on the borrowing rate show that, compared with the traditional regression equation, which only provides a coefficient for each dummy variable and cannot show its importance, the method can show the importance of different dummy variables and provides an important means of quantitatively analyzing the degree to which qualitative independent variables influence the dependent variable. (3) Regarding the factors influencing default risk in online loans, default samples account for only a minority of all samples, so the data are imbalanced. Most existing studies do not take this imbalance in default data into consideration. This thesis handles the imbalanced data with SMOTE (Synthetic Minority Oversampling Technique) oversampling, combined with the classic binary logistic regression method, to mine the risk factors of default. An empirical analysis of nearly 890,000 historical lending records from Lending Club leads to the conclusion that loan amount and interest rate are positively correlated with default risk, while credit level and working years are negatively correlated with default risk, among other findings. |
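
The SMOTE oversampling step mentioned in the third aspect can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `smote`, the parameter `k`, and the toy two-feature default samples are assumptions made for illustration, and the sketch picks base samples at random rather than looping over every minority sample. Each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples
    and their k nearest minority neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical, already-scaled default samples: (loan amount, interest rate)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
n_majority = 10  # assumed size of the non-default class

# Oversample the default class until both classes have the same size
synthetic = smote(minority, n_majority - len(minority), k=2)
balanced_minority = minority + synthetic
```

The balanced sample would then be passed to a binary logistic regression classifier. In practice a tested library implementation of SMOTE is preferable to a hand-written one.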