
A Study On Risk Identification Of P2P Lending Platform Based On Semi-supervised Learning

Posted on: 2021-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
GTID: 2427330623965677
Subject: Applied statistics
Abstract/Summary:
In recent years, China's P2P lending industry has shown explosive growth. As of December 2018, the cumulative turnover of the industry had reached RMB 8,028.742 billion. By April 2019, more than 6,600 P2P lending platforms had been set up in China, involving 2.1509 million investors in that month. P2P lending has become a part of China's wealth management market that cannot be ignored. At the same time, however, problems such as platform self-financing, false financing, difficulties in withdrawing funds, and platforms absconding with investors' money have emerged. A large number of platforms have closed down or transformed, and the number of platforms still in operation is decreasing. This not only seriously threatens the personal interests of investors but also hinders the healthy development of "Internet + Finance" in China. How to identify the credit risk of P2P lending platforms has therefore become a focus for investors. For individual investors, however, it is difficult to obtain detailed information about a large number of P2P lending platforms: doing so requires mastering certain web data acquisition techniques and being able to pre-process the collected data, which limits the ability of individual investors to identify a platform's credit risk.

Based on this situation, the author takes the risk identification of P2P lending platforms as the research theme. The question is which kind of model, given the information that can be obtained, better reveals the potential association between the attributes X related to platform risk and the platform credit status Y, so that the credit status of currently operating platforms can be predicted to help investors invest rationally, avoid high-risk platforms as far as possible, and reduce losses.

The author downloaded recent operating data for some platforms from the CSMAR database. For WDZJ and P2PEYE, two P2P lending industry consulting websites that use AJAX technology, the author combined Selenium Server with related functions in the rvest package to crawl data from these dynamic web pages. Data for some platforms were still missing after this step, so a large amount of the required data was finally collected manually from the official websites of the P2P lending platforms. After data collection, the author processed strings by constructing a large number of regular expressions, corrected obvious errors, deleted invalid attributes, integrated data from different sources, deleted duplicate information, and ensured data consistency. Because missing values remained in the data set, the author used missForest to fill them in. The final data set contains 6,522 platform records with a total of 82 attributes, covering basic platform information, operation status, investor impressions, and credit status. By analyzing the relationship between the operating status and the credit status of the platforms, the author labeled platforms in the various operating statuses as "trusted", "untrusted", or "unknown". In the end, there were 84 "trusted" platforms, 2,859 "untrusted" platforms, and 3,579 "unknown" platforms. The author then used random forest for feature selection and applied min-max normalization uniformly to all features.

Because the data set is severely imbalanced and contains a large amount of unlabeled data, the author used the trusted platform recall rate, the untrusted platform recall rate, and the cost-sensitive error rate as evaluation indicators, in line with the characteristics of the research question. A comparison experiment was designed that, on the one hand, addresses the problem caused by sample imbalance and, on the other hand, trains the supervised and semi-supervised learning models separately on the same labeled data set and calculates the evaluation indicators for each model. Finally, the best-performing model was selected to predict the credit status of the platforms in operation. The author trained 7 supervised learning models: CART decision tree, Bagging, random forest, BP neural network, Naive Bayes, SVM, and kNN. For semi-supervised learning, TSVM, a graph-based semi-supervised learning model, and collaborative training models based on the above seven supervised learning models were trained. During training, grid search was used to determine the optimal parameter combination for each model. The parameter settings of TSVM and SVM were kept consistent, and the base learners in collaborative training used their default parameters.

The experimental results show that none of the models used by the author achieves the highest recall rate for trusted platforms and the highest recall rate for untrusted platforms at the same time. TSVM often performs better than SVM under the same parameter combination. The best-performing supervised learning model is kNN (k = 7), denoted Model 1. The best-performing semi-supervised learning model is the collaborative training model (SVM and kNN collaboration), denoted Model 2. Although Model 1 obtains the highest recall rate for trusted platforms (94.77%), its cost-sensitive error rate is 314.93% higher than that of Model 2. Model 2 makes full use of the large amount of unlabeled data to obtain the highest recall rate for untrusted platforms (96.89%) and the lowest cost-sensitive error rate. Therefore, semi-supervised learning models can be effectively applied to the risk identification of P2P lending platforms.
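The three evaluation indicators named in the abstract can be illustrated with a short sketch. The abstract does not state the exact formula or the misclassification costs, so the code below assumes the common definition of the cost-sensitive error rate (total misclassification cost divided by the number of labeled samples). The label encoding, the function names, and the cost values 5.0 and 1.0 are hypothetical choices made only for illustration, not values from the thesis.

```python
from typing import Sequence

# Hypothetical label encoding for the two labeled classes.
TRUSTED = 1
UNTRUSTED = 0


def recall(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """Share of platforms whose true class is `label` that were predicted as `label`."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(1 for t in y_true if t == label)
    if total == 0:
        raise ValueError("no samples with the requested label")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total


def cost_sensitive_error_rate(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    cost_untrusted_as_trusted: float,
    cost_trusted_as_untrusted: float,
) -> float:
    """Total misclassification cost divided by the number of samples.

    Assumed definition: each error is weighted by the cost of its type,
    and correct predictions cost nothing.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("empty input")
    total_cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == UNTRUSTED and p == TRUSTED:
            total_cost += cost_untrusted_as_trusted
        elif t == TRUSTED and p == UNTRUSTED:
            total_cost += cost_trusted_as_untrusted
    return total_cost / len(y_true)


if __name__ == "__main__":
    # Toy data: 2 trusted and 6 untrusted platforms (made up for illustration).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 1, 0]
    print(recall(y_true, y_pred, TRUSTED))
    print(recall(y_true, y_pred, UNTRUSTED))
    # Assumed costs: trusting an untrusted platform is 5x as costly.
    print(cost_sensitive_error_rate(y_true, y_pred, 5.0, 1.0))
```

Reporting per-class recall together with a cost-weighted error is one way to avoid the misleading picture that plain accuracy gives when one class (here, 84 trusted versus 2,859 untrusted platforms) is far rarer than the other.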
Keywords/Search Tags: P2P Lending, Risk Identification, Semi-Supervised Learning