Research On Detection Of Abnormal Mobile Communication Users Based On Improved Random Forest

Posted on: 2022-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Huang
GTID: 2518306533972819
Subject: Control Engineering
Abstract/Summary:
With the development of communication technology, mobile communications bring convenience to people but also enable harmful and illegal activities such as telemarketing, the spread of reactionary information, harassment, and fraud, which seriously disrupt people's normal lives and even harm national stability and social harmony. To help build a "green communication network" environment, this thesis uses the random forest algorithm, which is easy to train in parallel, to detect abnormal mobile communication users on the Spark distributed platform. In real big-data scenarios, the traditional random forest algorithm does not achieve satisfactory results when the data are large in scale, high-dimensional, contain high-cardinality categorical features, and include redundant features. To address these issues, this thesis improves the random forest algorithm and optimizes it for parallel execution. The main work is divided into the following aspects:

(1) Feature selection on the raw data is implemented using the sparsity of L1 regularization. Because massive mobile communication user data contain a large number of redundant features, this thesis uses an L1-regularized Logistic Regression model to score the features of the raw data. The sparsity induced by L1 regularization drives the scores of redundant features close to 0; features with scores below 1e-4 are discarded, which reduces the feature set, and the retained features are divided by score into two correlation intervals, a high-correlation interval and a second interval of lower-scoring features. When each decision tree of the random forest is trained, all high-correlation features and a randomly drawn portion of the remaining features are used to build the tree. This preserves diversity among the decision trees and prevents redundant features from degrading the accuracy and stability of the algorithm. Experimental results show that, compared with the traditional random forest algorithm, the random forest algorithm based on L1 regularization raises the AUC from 82.43% to 92.57%, so the algorithm has better feature selection capability and prediction accuracy in an abnormal mobile communication user detection system.

(2) Encoding of high-cardinality categorical features is implemented using Entity Embeddings. Massive mobile communication user data contain many high-cardinality categorical features, and the L1-regularization-based random forest algorithm does not support optimal splitting on such features. This thesis uses an embedding matrix to map large sparse one-hot vectors linearly into a low-dimensional space that preserves semantic relationships, reducing the feature dimension from the original 5089 to 559 and thereby compressing the information without losing semantics. Experimental results show that the random forest algorithm that fuses Entity Embeddings with L1 regularization raises the AUC from 82.43% to 98.76% and reduces the training time (the reported figures are 3.71 hours and 3.85 hours). The algorithm therefore has higher training efficiency and prediction accuracy, can fully resolve the problem of the many sparse matrices generated by one-hot encoding, effectively avoids overfitting of the model, and achieves lightweight detection of abnormal mobile communication users.

(3) Parallel optimization of the improved random forest algorithm is implemented on the Spark distributed platform. To address the high feature dimensionality of real mobile communication data and the high cost of training models in stand-alone mode, this thesis implements three parallel optimization strategies, namely broadcast variables, sampling-based discretization of continuous features, and histogram statistics, and parallelizes the random forest algorithm that fuses Entity Embeddings and L1 regularization. The experimental results show that training time falls from the 102.6 minutes of the stand-alone improved random forest algorithm to 15.2 minutes, and that on 2 million samples the distributed platform runs 5.53 times as fast as single-machine mode, so the parallel random forest algorithm has a shorter training time and higher operating efficiency than the stand-alone algorithm. The experimental results also compare the AUC of the parallelized algorithm that uses the three optimization strategies with that of the parallelized algorithm that does not use them, reporting a value of 99.13%. The parallel random forest algorithm that uses the optimization strategies is therefore more resistant to overfitting.
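The feature-selection step described in item (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes that a feature's "score" is the absolute value of its coefficient in an L1-regularized logistic regression, fits that model with plain proximal gradient descent on a four-row toy dataset, and uses a hypothetical `high_cutoff` parameter for the boundary between the two correlation intervals, which the abstract does not specify. Only the 1e-4 discard threshold comes from the abstract.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, amount):
    """Proximal step for the L1 penalty: shrink value toward 0 by amount."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def l1_logistic_scores(rows, labels, l1_strength=0.05, step=0.5, epochs=500):
    """Fit L1-regularized logistic regression (no intercept) by proximal
    gradient descent and return |weight| per feature as its score."""
    n, d = len(rows), len(rows[0])
    weights = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j in range(d):
                grad[j] += (p - y) * x[j] / n
        weights = [soft_threshold(w - step * g, step * l1_strength)
                   for w, g in zip(weights, grad)]
    return [abs(w) for w in weights]


def split_by_score(scores, drop_below=1e-4, high_cutoff=0.5):
    """Discard features scoring below drop_below; split the rest in two groups.
    high_cutoff is a hypothetical boundary, not a value from the source."""
    high = [j for j, s in enumerate(scores) if s >= high_cutoff]
    rest = [j for j, s in enumerate(scores) if drop_below <= s < high_cutoff]
    return high, rest


def features_for_tree(high, rest, fraction, rng):
    """All high-score features plus a random fraction of the remaining ones."""
    k = min(len(rest), max(1, round(fraction * len(rest)))) if rest else 0
    return sorted(high + rng.sample(rest, k))


# Toy data: feature 0 determines the label, feature 1 carries no information.
rows = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
labels = [1, 1, 0, 0]
scores = l1_logistic_scores(rows, labels)
high, rest = split_by_score(scores)
print(scores, high, rest)
```

Calling `features_for_tree(high, rest, fraction, random.Random(seed))` once per tree, with a different seed each time, gives every tree the full high-score group plus its own random draw from the second group, which is the source of tree-to-tree diversity that the abstract describes.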
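The linear mapping in item (2), from a sparse one-hot vector to a dense low-dimensional vector, amounts to selecting one row of the embedding matrix. The sketch below shows only that equivalence. The category names, the 6-to-2 dimension reduction, and the randomly initialized matrix are all illustrative assumptions; entity embedding matrices are typically learned by a neural network on the prediction task, and the abstract does not describe how the thesis trains its matrix.

```python
import random


def one_hot(index, size):
    """Sparse indicator vector with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def row_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[c] for v, row in zip(vector, matrix))
            for c in range(len(matrix[0]))]


class EntityEmbedding:
    """Maps each category of one feature to a dense low-dimensional vector.

    The matrix is randomly initialized here as a stand-in for a learned one."""

    def __init__(self, categories, dim, rng):
        self.index = {c: i for i, c in enumerate(categories)}
        self.matrix = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                       for _ in categories]

    def encode(self, category):
        # Row lookup: equivalent to one_hot(category) times the matrix.
        return list(self.matrix[self.index[category]])


# Hypothetical categorical feature with six levels, embedded in 2 dimensions.
cell_ids = ["c01", "c02", "c03", "c04", "c05", "c06"]
embedding = EntityEmbedding(cell_ids, dim=2, rng=random.Random(42))
dense = embedding.encode("c04")
sparse = one_hot(embedding.index["c04"], len(cell_ids))
print(len(sparse), "->", len(dense))
```

Because the lookup replaces a column per category with a fixed number of dense columns per feature, the downstream random forest splits on a few continuous values instead of on a wide sparse matrix.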
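The discretization and histogram-statistics strategies in item (3) can be sketched without a cluster. The example below is a single-process simulation under stated assumptions, not the thesis's Spark code: plain Python lists stand in for data partitions, the shared `thresholds` list plays the role of a broadcast variable, Gini impurity is assumed as the split criterion, and the function names are made up for illustration. Each partition builds a small per-bin label histogram, the histograms are merged by element-wise addition, and the split is chosen from the merged counts alone.

```python
from bisect import bisect_right
from functools import reduce


def bin_thresholds(sample, num_bins):
    """Equal-frequency split points estimated from a sample of one feature."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * k // num_bins] for k in range(1, num_bins)]


def partition_histogram(partition, thresholds):
    """Per-bin [negative, positive] label counts for one data partition."""
    hist = [[0, 0] for _ in range(len(thresholds) + 1)]
    for value, label in partition:
        hist[bisect_right(thresholds, value)][label] += 1
    return hist


def merge_histograms(a, b):
    """Element-wise sum, so partial histograms can be combined in any order."""
    return [[x0 + y0, x1 + y1] for (x0, x1), (y0, y1) in zip(a, b)]


def gini(neg, pos):
    """Gini impurity of a node with the given binary label counts."""
    total = neg + pos
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def best_split(hist, thresholds):
    """Scan bin boundaries; return (threshold, weighted Gini) with the lowest
    impurity. Rows with value < threshold go to the left child."""
    total_neg = sum(h[0] for h in hist)
    total_pos = sum(h[1] for h in hist)
    total = total_neg + total_pos
    best = None
    left_neg = left_pos = 0
    for i, threshold in enumerate(thresholds):
        left_neg += hist[i][0]
        left_pos += hist[i][1]
        left_n = left_neg + left_pos
        right_n = total - left_n
        score = (left_n * gini(left_neg, left_pos)
                 + right_n * gini(total_neg - left_neg, total_pos - left_pos)) / total
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Two simulated partitions of (feature value, label) pairs.
partitions = [
    [(1, 0), (5, 1), (3, 0), (7, 1)],
    [(2, 0), (6, 1), (4, 0), (8, 1)],
]
# The toy data is small enough to use every value as the "sample".
sample = [value for part in partitions for value, _ in part]
thresholds = bin_thresholds(sample, num_bins=4)
local = [partition_histogram(part, thresholds) for part in partitions]
merged = reduce(merge_histograms, local)
split = best_split(merged, thresholds)
print(thresholds, merged, split)
```

The design point is that only the fixed-size histograms cross partition boundaries, never the raw rows, so the amount of data exchanged per candidate feature depends on the number of bins rather than on the number of samples.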
Keywords/Search Tags:random forest algorithm, feature selection, feature coding, distributed computing, histogram statistical algorithm