| Data mining is an application-oriented field that can be regarded as the intersection of machine learning and databases. Machine learning is commonly used in data mining as a powerful analysis tool. In particular, machine learning classification algorithms for structured data play an important role in applications, and many scenarios can be cast as classification problems, such as risk prediction, disease diagnosis, and recommendation systems. With the spread of big data, cloud computing, and related technologies, the capacity to collect and manage data keeps growing, and the scale of data in the field of data mining is steadily expanding. This raises many problems in applying classification algorithms, which fall into two categories: 1) sparse-data learning problems caused by missing values and discrete features; 2) problems of learning and representing low-order and high-order feature-interaction information. Traditional machine learning classification algorithms, however, are often ill-suited to large-scale, high-dimensional sparse data and require manual extraction of feature-interaction information, which makes development slow and inefficient. Deep learning is a powerful learning paradigm that allows large-scale, task-driven feature learning from big data. It can automatically combine low-order features to extract and represent high-order features. Taking structured-data classification as its application background, this thesis studies classification algorithms based on neural network architectures, designs and proposes a classification model based on DeepFM, and applies it to risk prediction, in order to handle practical classification scenarios that involve data sparsity and the learning and representation of feature-interaction information. The main work of the thesis is as follows: (1) A classification algorithm based on
DeepFM and GBDT is proposed, which can automatically extract several different types of interaction features: the low-order interaction features of the factorization machine, the implicit high-order interaction features of the neural network, and explicit high-order cross features based on GBDT leaf-node encodings. This helps the model fully learn the low-order and high-order information hidden in the data and improves its ability to represent nonlinear classification scenarios. (2) The learning and representation of sparse data by means of entity embedding is studied. With embedding vectors, discrete high-dimensional sparse features are mapped into continuous dense vectors in a low-dimensional space; experiments show that using embeddings effectively reduces the time cost, so embeddings are adopted in building the model. (3) The factors that influence the model's overall prediction performance are studied, including the embedding vector dimension, the number of GBDT trees, the optimization algorithm, and other important factors, in order to determine the optimal parameter set and to provide reference and guidance for applying the model. (4) To reduce the memory overhead of model training, this thesis studies feature selection in the context of big data, proposes a feature selection method that fuses mutual information with the maximal information coefficient, and verifies it experimentally. The results show that the fusion of filter methods performs better than any single feature selection method. (5) Machine learning classification algorithms are applied to the risk prediction problem of user default. Based on traditional classification algorithms and deep neural network algorithms, different data processing schemes are designed and different risk prediction models are constructed. Experiments show that the use of machine
learning algorithms for risk prediction achieves high accuracy, which can help loan platforms avoid risk effectively. (6) Finally, this thesis combines the theoretical and experimental parts to compare the advantages and disadvantages of eight deep models, as well as their time and space complexity. A comparison of the experimental metrics shows that the proposed model has an advantage in complexity and in the types of cross features it captures, and that introducing GBDT leaf-node encodings does effectively improve the classification performance of DeepFM. |
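
The low-order interaction features attributed to the factorization machine in item (1) can be illustrated with a small sketch. It assumes the standard second-order factorization machine term, in which each feature i has a latent vector v_i and every pair of features contributes <v_i, v_j> x_i x_j; the abstract does not give the model's exact equations, so this is an assumption rather than the thesis's implementation. The function names `fm_pairwise_naive` and `fm_pairwise_fast` and all numeric values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def fm_pairwise_naive(x, v):
    """Sum over all pairs i < j of <v_i, v_j> * x_i * x_j, pair by pair."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(v[i], v[j]))
        total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x, v):
    """Same quantity, computed per latent dimension f as
    0.5 * sum_f [(sum_i v_if * x_i)^2 - sum_i (v_if * x_i)^2].
    This needs time linear in the number of features."""
    n, k = len(x), len(v[0])
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        total += s * s - s_sq
    return 0.5 * total


# Illustrative values only (not taken from the thesis):
# four input features, latent vectors of dimension 2.
x = [1.0, 0.0, 1.0, 1.0]  # sparse, one-hot-style input
v = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.0]]

naive = fm_pairwise_naive(x, v)
fast = fm_pairwise_fast(x, v)
print(abs(naive - fast) < 1e-9)
```

Both functions return the same value; the second form avoids enumerating every feature pair, which matters for the high-dimensional sparse inputs the abstract describes.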