
The User's Purchase Behavior Prediction Based On Unbalanced Data

Posted on: 2021-11-29
Degree: Master
Type: Thesis
Country: China
Candidate: X J Ge
GTID: 2480306248955729
Subject: Applied Statistics
Abstract/Summary:
With the rise of e-commerce, online shopping has become popular. A huge number of products are available online, but only a small fraction of them can be browsed at a time, so users often spend a lot of time and energy finding products that meet their requirements. Historical behavior data, such as browsing, collecting, adding to the cart, and buying, can be used to predict users' future purchase intention, which can greatly improve the shopping experience. Precision marketing is therefore an important competitive advantage for a shopping platform. In view of this, this paper studies how to build a model that predicts users' future purchase behavior from their historical behavior data, which is drawn from Ali Tianchi.

In this paper, the prediction of purchase behavior is treated as a binary classification problem, with purchase and non-purchase as the labels of a supervised learning model. The data set contains only implicit user behavior data, and the number of features is small. After visual analysis of the data, four types of features are added: user features, commodity features, category features, and crossover features.

The purchased class has far fewer samples than the unpurchased class, so the data is unbalanced. This paper addresses the problem from three aspects: sample distribution, features, and algorithm. For sample distribution, two methods are adopted: the first is SMOTE sampling, an oversampling method, for which the optimal sampling ratio is found on the training set; the second is clustering-based sampling, an undersampling method. For features, two groups are adopted: the first is the original features; the second is a set of features selected manually according to the correlation between features and the feature importance calculated by the Decision Tree algorithm. For the algorithm, two types of models are adopted: a single-learner model, such as Logistic Regression, and Ensemble Learning models, such as Random Forest and Gradient Boosted Decision Tree.

Because Tianchi provides real user purchase data for the prediction date, the final evaluation metric is the F score on the online test set. The Logistic Regression algorithm is very sensitive to unbalanced data, and its online F score improves significantly after SMOTE sampling and feature selection. The Random Forest algorithm has randomness and a feature scoring mechanism built in, so feature selection brings no obvious improvement to the model; although sampling is also random, SMOTE sampling does improve its online F score. For Gradient Boosted Decision Tree, the online F score with cluster-based balanced sampling is better than that of Random Forest, but it is still lower than the score obtained with SMOTE sampling, and the sampling process is complex. Ensemble Learning can handle a large number of samples, and oversampling allows the model to learn more information.
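
The SMOTE oversampling step and the F score evaluation mentioned above can be illustrated with a minimal, standard-library sketch. It is not the thesis's implementation: the function names `smote` and `f1_score`, the parameters `k` and `seed`, and the toy feature vectors are all assumptions made for illustration, and the F1 form of the F score is assumed because the abstract does not state which variant is used.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Candidate neighbours: every other minority sample, nearest first.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical feature vectors for the rare "purchase" class.
purchases = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(purchases, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling this way adds new purchase-class examples without simply duplicating existing rows. The sampling ratio (here, `n_new` relative to the minority size) is the quantity the abstract describes tuning on the training set.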
Keywords/Search Tags: Behavioral Prediction, Unbalanced Data, Feature Engineering, Ensemble Learning