
Spark Performance Prediction And Optimization Tool Based On Container Environment

Posted on: 2023-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Ding
Full Text: PDF
GTID: 2568307022498554
Subject: Software engineering
Abstract/Summary:
The era of big data has given rise to computing platforms such as Hadoop and Spark. Spark, widely adopted in enterprise business development, is currently the most commonly used distributed computing framework in industry, and deploying it in container environments is an increasingly common trend among cloud vendors. Spark offers fast execution and a convenient programming interface to big data engineers, but running a Spark program involves more than one hundred configuration parameters. Different parameter settings have a significant impact on the runtime performance of different Spark programs, and there are subtle interactions among the parameters. Performance is usually improved through expert manual tuning, but such experts are scarce and manual tuning consumes a great deal of testing time, so a tool that automatically optimizes Spark configurations is necessary.

This thesis designs and implements a Spark program tuning tool. The tool first collects the running data and execution time of a Spark program, then uses machine learning methods to build a performance model that predicts the program's execution time, and finally searches the Spark configuration parameter space with a search algorithm guided by the performance model to obtain optimized parameters and thereby tune the program's performance.

The tuning tool comprises four modules: load management, execution-time prediction model selection, parameter optimization, and optimization history. Load management covers two kinds of workloads: built-in test loads and Spark programs submitted by the user. The built-in test loads are based on the big data benchmark suite HiBench and support different data sizes and different load types. The model selection module builds models with machine learning methods in an initial phase and, in later use, retrieves models for similar loads from a history database through similarity matching. The parameter optimization module searches the corresponding parameter space for a load's optimal configuration with the search algorithm, and the optimization history module lets users view the optimal configuration and the search iteration graph.

The experiments run on a three-node cluster configured as a container environment: the Docker engine is installed on each node, the Kubernetes container scheduling platform is installed on the master node, and the optimization tool is deployed on the master node. In the experiments, the accuracy of the model built on a 100 GB WordCount load reaches 80%, and the optimal configuration found by the search performs more than five times better than the default configuration.
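The model-then-search loop described above can be sketched in a few lines. This is an illustrative assumption, not the thesis's actual implementation: the three parameter names, the synthetic runtime function standing in for real Spark job timings, and the choice of a random forest with plain random search are all placeholders for whatever models and search algorithm the tool actually uses.

```python
import random
from sklearn.ensemble import RandomForestRegressor

rng = random.Random(42)

# Three illustrative Spark parameters (names are assumptions for this sketch).
PARAM_SPACE = {
    "executor_memory_gb": [2, 4, 8, 16],
    "executor_cores": [1, 2, 4],
    "shuffle_partitions": [50, 100, 200, 400],
}

def sample_config():
    """Draw one random configuration from the parameter space."""
    return [rng.choice(values) for values in PARAM_SPACE.values()]

def measured_runtime(cfg):
    """Stand-in for actually running the Spark job and timing it (seconds)."""
    mem, cores, parts = cfg
    return 600.0 / (mem * cores) + 0.05 * parts + rng.gauss(0, 1)

# Step 1: collect (configuration, execution time) samples from real runs.
X = [sample_config() for _ in range(200)]
y = [measured_runtime(cfg) for cfg in X]

# Step 2: fit a performance model that predicts execution time from a config.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Step 3: search the configuration space using only the cheap surrogate model,
# instead of re-running the Spark job for every candidate configuration.
candidates = [sample_config() for _ in range(1000)]
best_cfg = min(candidates, key=lambda cfg: model.predict([cfg])[0])
print(dict(zip(PARAM_SPACE, best_cfg)))
```

The point of the design is that step 3 queries the model rather than the cluster, so thousands of candidate configurations can be evaluated in seconds; only the final winner needs to be validated with a real Spark run.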
Keywords/Search Tags:Big data, Spark, Container, Machine learning, Search algorithm