
Research On Large-scale Traffic Classification Technology Based On Spark Performance Optimization

Posted on: 2021-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: K Yang
Full Text: PDF
GTID: 2428330611450310
Subject: Software engineering
Abstract/Summary:
In recent years, increasingly mature Internet technology has driven the continuous development of the information society. It has brought great convenience to people's work and study, but it has also caused explosive growth in network traffic. Faced with this volume of traffic, traditional network traffic classification techniques that run on a single machine are severely challenged in both storage capacity and computing efficiency. How to classify network traffic accurately and quickly has become a pressing research problem. Spark, a popular big data analysis platform, offers an effective way to address this problem because it supports distributed storage, provides in-memory computing, and achieves very high execution efficiency. Meanwhile, random forest is a classification algorithm that performs well and is easy to parallelize. The research in this thesis is therefore divided into the following two parts.

First, this thesis studies the application of the random forest classification algorithm on the Spark platform. When classifying network traffic, a standard random forest does not distinguish between decision trees of different classification ability. This thesis implements a weighted random forest algorithm on the Spark platform, which gives full play to decision trees with strong classification performance and reduces the influence of decision trees with poor classification ability. Experimental results show that the proposed algorithm achieves higher classification accuracy and good scalability.

Second, this thesis studies performance optimization techniques for Spark. Shuffle operations triggered during the execution of a Spark job seriously degrade Spark performance. To address this, the thesis uses crail-spark-io, a Spark Shuffle acceleration plugin, to optimize the Spark Shuffle. The plugin is built on RDMA (remote direct memory access) technology. Because the plugin cannot handle aggregation operators in a multi-partition environment, this thesis optimizes the plugin's Shuffle logic so that it takes full advantage of cluster resources to improve Spark performance.
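The weighted voting idea behind the weighted random forest described above can be illustrated with a minimal sketch. The abstract does not say how the tree weights are computed, so the scheme here (each tree weighted by its accuracy on a held-out labeled set) is an assumption for illustration only, not the thesis's method. The function names `tree_weights` and `weighted_vote` and the traffic class labels are hypothetical, and the sketch is plain Python rather than Spark code.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def tree_weights(tree_preds: Sequence[Sequence[Hashable]],
                 labels: Sequence[Hashable]) -> List[float]:
    """Weight each tree by its accuracy on a held-out labeled set.

    ASSUMPTION: accuracy-based weights are one common choice; the
    abstract does not specify the weighting scheme actually used.
    """
    weights = []
    for preds in tree_preds:
        correct = sum(1 for p, y in zip(preds, labels) if p == y)
        weights.append(correct / len(labels))
    return weights


def weighted_vote(votes: Sequence[Hashable],
                  weights: Sequence[float]) -> Hashable:
    """Return the class whose supporting trees have the largest total weight."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for vote, w in zip(votes, weights):
        totals[vote] += w
    return max(totals, key=totals.get)


# Toy held-out set: true labels and three trees' predictions on it.
labels = ["web", "p2p", "web", "mail"]
held_out_preds = [
    ["web", "p2p", "web", "mail"],  # tree 0: 4 of 4 correct
    ["p2p", "p2p", "mail", "web"],  # tree 1: 1 of 4 correct
    ["p2p", "web", "web", "web"],   # tree 2: 1 of 4 correct
]
weights = tree_weights(held_out_preds, labels)

# New flow: the strong tree votes "web", the two weak trees vote "p2p".
votes = ["web", "p2p", "p2p"]
majority = max(set(votes), key=votes.count)  # unweighted majority vote
weighted = weighted_vote(votes, weights)     # accuracy-weighted vote

print("weights:", weights)
print("majority vote:", majority)
print("weighted vote:", weighted)
```

In this toy case the unweighted majority follows the two weak trees, while the weighted vote follows the single strong tree, which is the effect the abstract attributes to weighting. In a distributed setting, per-tree predictions and weights would be computed across the cluster, but the aggregation step would follow the same logic.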
Keywords/Search Tags:network traffic classification, Spark, random forest algorithm, RDMA, Crail-Spark-IO