
A Loan Probability Forecasting System Based On FGBDT Algorithm

Posted on: 2018-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: H X Hu
Full Text: PDF
GTID: 2359330566955729
Subject: Computer technology
Abstract/Summary:
In recent years, data mining has become increasingly common in the financial industry, for example purchasing power analysis based on decision trees, customer segmentation based on k-means, and credit evaluation based on the ID3 algorithm. Financial service providers focus on how to make good use of limited financial data, extract the information, knowledge, and rules hidden in it, and apply them to create more profit and improve profitability.

This work mines registered-user information from a loan app platform to predict the probability that users who have not previously obtained a loan will apply for one again. Users are ranked by this probability, and the enterprise applies different marketing strategies to users with different scores in order to reduce marketing costs and improve work efficiency. The data were modeled with various machine learning algorithms from the open source library Scikit-Learn and from Pulsar, Baidu's large-scale machine learning platform. Experiments showed that the best-performing algorithm on this type of data is FGBDT (Fully-Corrective Gradient Boosting Decision Tree, Pulsar platform).

In addition, to support long-term cooperation with the loan platform and to optimize Pulsar's existing FGBDT model training system, this work developed an FGBDT hyperparameter learning system on top of Pulsar's FGBDT algorithm. The system automatically learns the optimal FGBDT hyperparameters for the given training data. It consists mainly of data preprocessing, model training (FGBDT model), cross-validation, and optimal FGBDT hyperparameter analysis modules. An automated process replaces a large amount of manual record keeping and tuning, which greatly reduces manual intervention and improves work efficiency. Because the system makes many improvements to the Pulsar FGBDT training system, the FGBDT hyperparameter learning system has been deployed on Pulsar, Baidu's large-scale machine learning platform. Building on the learning system, only a few additional modules are needed to predict the probability that a user is willing to take a loan; combined with the FGBDT learning system, they form a complete loan probability forecasting system. The system can greatly reduce the technical threshold and labor cost of financial data mining, and has good practical value and economic benefit.

The sample data are processed and modeled to predict, correctly and effectively, the probability that each user whose earlier loan was unsuccessful will borrow again. This identifies the potential borrowers, so that targeted precision marketing can be directed at user groups with a high loan probability, reducing operating costs and increasing profit.

A further question is how to find the best-matching algorithm and model for the specific data and task. Predicting users' loan probability correctly and efficiently with data mining methods involves the performance of the various algorithms, the scenarios they apply to, their optimization, and their scalability to large data sets, so the one or more algorithms best suited to the problem must be identified. Algorithm selection is based on evaluation metrics computed on the same test set, such as MAE (mean absolute error), accuracy, and AUC. In this work, models were built with the machine learning algorithms of the open source library Scikit-Learn and of Baidu's Pulsar platform. Each model was evaluated by five-fold cross-validation on the same test set, and the models were compared by the mean and variance of their AUC, which led to the selection of FGBDT. In addition, to further improve prediction accuracy, the features are divided into five groups according to their distributions; each group is processed according to its characteristics and modeled separately, and the resulting models are combined into a single combined model.

The work also develops a complete and efficient user loan probability system that standardizes and automates a prediction process that previously required a lot of manual operation, and turns it into a fixed pipeline. The system is built on the FGBDT hyperparameter learning system by adding a few modules, such as training on the full data set, extraction of the negative sample data, and prediction of the loan probability. For given input data, the FGBDT hyperparameter learning system can learn several groups of optimal FGBDT hyperparameters. It comprises back-end modules for data preprocessing, model training, cross-validation, and results, and a front-end Web interface for parameter input, run monitoring, log display, and result display. To score a new batch of users, one only needs to upload the data to the platform and set the basic parameters; the system then completes data preprocessing, model training, cross-validation, and loan probability prediction automatically, without manual intervention.

The final problem is how to make the system support large data sets while keeping model training and prediction time as short as possible. Algorithms and systems that work on small data sets often become impractical on large ones: training takes too long, efficiency is low, or the system fails outright with memory overflow errors. The system therefore has to be robust to the order of magnitude of the data. In addition, users' demand for loans is time-sensitive: once a user obtains a loan on another platform, the possibility of lending to that user in the short term is very small. Training the model quickly and effectively is therefore a key problem. The Baidu Pulsar platform addresses big data storage by keeping input and output data in the HDFS distributed file system, and parallelizes model training and prediction on Hadoop and MPI clusters, thereby greatly reducing the processing time of the whole system. The loan probability prediction system developed in this work was built by improving, wrapping, and further developing the original FGBDT training system of the Baidu Pulsar platform, so it naturally supports large data sets and parallelizes well.
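
The model comparison described in the abstract (cross-validate every candidate on the same data, then compare mean AUC and AUC variance) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the thesis's implementation: the abstract does not describe Pulsar's FGBDT interface, so the two candidates here are toy stand-ins, and every function name, the default `k=5`, and the synthetic data are assumptions made for the example.

```python
import random
from statistics import mean, pvariance


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def cross_validated_auc(fit, X, y, k=5, seed=0):
    """Train on k-1 folds, score the held-out fold, and return the k AUC values."""
    results = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores = [predict(X[i]) for i in held_out]
        results.append(auc(scores, [y[i] for i in held_out]))
    return results


def compare_models(candidates, X, y, k=5, seed=0):
    """Rank candidates by mean AUC (higher first), then AUC variance (lower first)."""
    summary = []
    for name, fit in candidates.items():
        aucs = cross_validated_auc(fit, X, y, k, seed)
        summary.append((name, mean(aucs), pvariance(aucs)))
    return sorted(summary, key=lambda row: (-row[1], row[2]))


# Hypothetical stand-ins for real learners: each "fit" returns a scoring function.
def fit_first_feature(X_train, y_train):
    return lambda x: x[0]          # score a user by the first feature only


def fit_constant(X_train, y_train):
    return lambda x: 0.0           # uninformative baseline


# Synthetic data (an assumption for the demo): the label follows the first feature.
X_demo = [[i / 20.0] for i in range(20)]
y_demo = [1 if i >= 10 else 0 for i in range(20)]
demo_candidates = {"first_feature": fit_first_feature, "constant": fit_constant}

for name, mean_auc, var_auc in compare_models(demo_candidates, X_demo, y_demo):
    print(name, mean_auc, var_auc)
```

Using the variance only as a tie-breaker after the mean is one possible reading of "comparing by the average AUC and AUC variance"; the abstract does not say how the two criteria were weighed against each other.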
Keywords/Search Tags:data mining, machine learning, FGBDT, loan probability, financial data