| Information asymmetry in online shopping, increasingly lenient return policies, and impulsive or fraudulent consumption have led to high return rates in online retail. High return rates seriously affect merchants’ reverse logistics costs and operating costs. It is therefore very important to predict e-commerce return rates accurately, so that preventive measures such as personalized information strategies and inventory planning and management can be formulated in advance. Against the background of big data, and drawing on data mining and machine learning theory, this article mines, learns from, and analyzes an online retail order data set from the Kaggle platform, and proposes a three-dimensional research framework for the factors affecting return behavior, covering users, products, and orders. It proposes and verifies an integrated method for improving return prediction performance based on feature space selection, model parameter tuning, and algorithm improvement, and verifies the superiority of an M-FCM-based under-sampling method in dealing with imbalanced data sets. Specifically: First, the factors affecting returns are identified along the user, product, and order dimensions. Based on the results of the data set analysis and prior knowledge from return research, feature variables in the user, product, and order dimensions that may be related to return behavior are constructed. After the features are standardized, a Logistic-regression-based filter method and a decision-tree-based embedded method are used to select features. Comparative experiments show that the classifiers achieve their best average prediction performance on the feature subset selected by the embedded method. Finally, the factors in the optimal feature subset are analyzed using the results of the Logistic regression and the decision tree. The feature variables of the user dimension have the highest average importance: the user’s personal marketing sensitivity, platform stickiness, and return tendency have a significant impact on the return rate. In the product dimension, product price is the most important factor. In the order dimension, the importance of discount marketing and the number of products ordered cannot be ignored. Second, classification models are built and their return prediction performance is optimized through parameter tuning. The data set is divided into a training set and a test set so that parameter learning and performance verification are independent. One part of the training set is used to fit the initial Logistic regression, support vector machine, decision tree, random forest, and XGBoost models and obtain their parameters; another part is used for a grid search over the hyperparameters of each model; finally, the test set is used to evaluate each model’s predictions. Experimental comparison shows that the ensemble models outperform the single models overall and that XGBoost performs best, but the prediction accuracy of all models is below 90%. Finally, to address the degraded classification performance caused by the imbalance between returned and non-returned samples in real e-commerce user behavior data, this paper proposes a fuzzy C-means under-sampling method based on Mahalanobis distance (M-FCM). Comparative experiments show that this algorithm handles imbalanced samples better than the classic FCM-based under-sampling method. This paper is the first to verify this superiority across different classification models and data sets with different imbalance ratios. |
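
The abstract does not spell out the M-FCM procedure, so the following is only a minimal sketch of how fuzzy C-means under-sampling with a Mahalanobis distance might look, not the paper's implementation. Several choices are assumptions made for illustration: the function name `mfcm_undersample`, the use of a single covariance matrix estimated from the whole majority class, the fuzzifier `m = 2`, and the selection rule (keep the majority samples nearest to each cluster center, with a per-cluster quota proportional to cluster size). The data are synthetic.

```python
import numpy as np


def mfcm_undersample(X_major, n_keep, n_clusters=3, m=2.0, n_iter=50, seed=0):
    """Return indices of roughly n_keep majority samples chosen via fuzzy C-means.

    Assumption: distances use one shared covariance matrix estimated from the
    majority class; the paper may define the Mahalanobis metric differently.
    """
    rng = np.random.default_rng(seed)
    n, d = X_major.shape

    # Inverse covariance for the Mahalanobis distance (small ridge for stability).
    cov = np.atleast_2d(np.cov(X_major, rowvar=False)) + 1e-6 * np.eye(d)
    cov_inv = np.linalg.inv(cov)

    # Random initial membership matrix; each row sums to 1.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Cluster centers: membership-weighted means of the samples.
        W = U ** m
        centers = (W.T @ X_major) / W.sum(axis=0)[:, None]

        # Squared Mahalanobis distance from every sample to every center.
        diff = X_major[:, None, :] - centers[None, :, :]
        d2 = np.einsum("ncd,de,nce->nc", diff, cov_inv, diff)
        d2 = np.maximum(d2, 1e-12)

        # Standard FCM membership update, written for squared distances.
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)

    # Assumed selection rule: per cluster, keep the samples nearest the center.
    labels = U.argmax(axis=1)
    keep = []
    for j in range(n_clusters):
        idx = np.flatnonzero(labels == j)
        quota = round(n_keep * len(idx) / n)
        nearest_first = idx[np.argsort(d2[idx, j])]
        keep.extend(nearest_first[:quota])
    return np.array(sorted(keep), dtype=int)


# Synthetic example: 300 non-returned (majority) vs 60 returned (minority) orders.
rng = np.random.default_rng(42)
X_major = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.5], [0.0, 1.0]])
X_minor = rng.normal(loc=3.0, size=(60, 2))

idx = mfcm_undersample(X_major, n_keep=len(X_minor))
X_balanced = np.vstack([X_major[idx], X_minor])
y_balanced = np.concatenate([np.zeros(len(idx)), np.ones(len(X_minor))])
```

Because the per-cluster quotas are rounded, the number of retained majority samples can differ from `n_keep` by one or two. Replacing `cov_inv` with the identity matrix reduces the sketch to Euclidean FCM under-sampling, which is the kind of baseline the abstract compares against.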