Identifying Potential Mobile 5G Customers Based On Data Minin

Posted on: 2023-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Zhou
GTID: 2568306800994859
Subject: Applied statistics
Abstract/Summary:
5G service has been launched on a large scale in China, and operators are competing fiercely for 5G users. At present, however, the 5G activation rate is not high, and most users are taking a wait-and-see attitude. According to the survey, the main reasons are: (1) 5G base station coverage is low in some regions, and the utilization rate of 5G traffic is not high; (2) 5G packages are too expensive compared with 4G packages; (3) users have little demand for data traffic; (4) users do not pursue network speed. How to efficiently identify potential 5G users among a large number of subscribers, and to promote 5G services accordingly, is a focus for operators. In this context, this paper uses data mining techniques to analyze and process user data and iteratively explores models with better classification performance, in order to save operators promotion cost and time and to improve the efficiency of 5G business transactions.

The data come from the public data of Chongqing Mobile's big data platform. To protect user privacy, the public information has been desensitized and can be used for academic research. The dataset covers 140,000 users with a total of 45 feature variables, so the volume of data is large. Given this volume, a basic inspection is performed first to check data types and missing values. Based on an understanding of 5G services, the feature variables are divided into categorical and numerical variables and checked for unreasonable outliers. Once the general state of the data is understood, preprocessing begins. Preprocessing has three main steps: removing unique and duplicate attributes, handling missing values, and handling class imbalance. Different variables receive different missing-value treatments. This paper mainly uses three: deleting variables with a large number of missing values, deleting samples with a large number of missing values, and imputation with the KNN algorithm. The dataset is imbalanced, with 5G users making up a proportion of only 0.2, so the data must be balanced before modeling. This paper uses three methods: SMOTE sampling, random undersampling, and random oversampling. The data produced by each sampling method are modeled separately, and the combination of model and sampling method with the best prediction performance is retained.

Before the models are built, descriptive statistical analysis is also performed on the data. On the one hand, it provides a basis for the missing-value treatment described above; on the other hand, it gives a systematic view of the distribution of the many variables. The Pearson correlation coefficient is used to describe the correlation between variables. It shows the degree of correlation between each variable and 5G user status, as well as the degree of correlation among the variables themselves, and it can be visualized with a correlation heat map.

Formal model building uses five models: logistic regression, decision tree classification, random forest, XGBoost, and LightGBM. The performance of each model is compared and analyzed on three evaluation metrics: accuracy, recall, and AUC. The first two are single machine learning models, and the last three are ensemble learning models. Comparing the evaluation metrics shows that the ensemble learning models classify significantly better than the single models. Comparing the feature importance rankings of the three ensemble models shows that they assign high importance to different feature variables. Therefore, to improve generalization and classification ability, the five models (logistic regression, decision tree, random forest, XGBoost, and LightGBM) are combined using the voting method and the stacking algorithm from ensemble learning, and the final classification predictions of the fused models are compared. The weighted soft voting strategy is found to give the best prediction performance, with an accuracy of 94.2%, a recall of 90.3%, and an AUC of 0.893, which is a good prediction result.
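The two random resampling strategies named in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library; the function names, the toy rows, and the fixed seed are assumptions made for illustration and are not taken from the thesis. SMOTE, which synthesizes new minority samples by interpolation, is not shown here.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (hypothetical): 2 of 10 users are 5G users, i.e. a proportion of 0.2.
rows = [[i, i * 2] for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
```

In practice, resampling of this kind is usually applied to the training split only, so that the evaluation data keep the original class proportions.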
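As a rough illustration of weighted soft voting and the three evaluation metrics, the sketch below averages the positive-class probabilities of several models using per-model weights and then scores the fused output. The probabilities, the weights, and the 0.5 decision threshold are hypothetical values chosen for the example; the abstract does not state which weights or threshold were used.

```python
def weighted_soft_vote(prob_lists, weights):
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights)
    n = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n)
    ]


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of true positives among all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities from three models.
y_true = [1, 0, 1, 0, 0, 1]
model_probs = [
    [0.9, 0.2, 0.4, 0.1, 0.6, 0.7],
    [0.8, 0.3, 0.6, 0.2, 0.5, 0.4],
    [0.7, 0.1, 0.7, 0.4, 0.6, 0.8],
]
weights = [1, 2, 2]  # assumed weights, e.g. favoring the stronger models

fused = weighted_soft_vote(model_probs, weights)
y_pred = [1 if p >= 0.5 else 0 for p in fused]

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
auc_value = auc(y_true, fused)
```

Soft voting averages probabilities rather than hard labels, so a confident model can outweigh several uncertain ones; the weights control how much each model contributes to the average.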
Keywords/Search Tags:Latent user identification, Random forest, XGBoost, LightGBM, Model fusion