
Spark Performance Prediction And Optimization Tool Based On Container Environment

Posted on: 2023-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Ding
Full Text: PDF
GTID: 2568307022498554
Subject: Software engineering
Abstract/Summary:
The era of big data has given rise to computing platforms such as Hadoop and Spark. Spark, widely adopted in enterprise business development, is currently the most commonly used distributed computing framework in industry, and deploying it in container environments is an increasingly common trend among cloud vendors. Spark offers fast execution and a convenient programming interface to big data engineers, but running a Spark program involves more than one hundred configuration parameters. Different parameter settings have a significant impact on the runtime performance of different Spark programs, and there are subtle interactions among the parameters. Performance is usually improved through expert manual tuning, but such experts are scarce and manual tuning consumes a great deal of testing time, so a tool that automatically optimizes Spark configurations is necessary.

This thesis designs and implements a Spark program tuning tool. The tool first collects the running data and execution time of a Spark program, then uses machine learning methods to build a performance model that predicts the program's execution time, and finally searches the Spark configuration parameter space with a search algorithm guided by the performance model to obtain optimized parameters and thereby tune the program's performance.

The tuning tool comprises four modules: load management, execution-time prediction model selection, parameter optimization, and optimization history. Load management covers two kinds of workloads: built-in test loads and Spark programs submitted by the user. The built-in test loads are based on the big data benchmark suite HiBench and support different data sizes and different load types. The model selection module builds models with machine learning methods in an initial phase and, in later use, retrieves models for similar loads from a history database through similarity matching. The parameter optimization module searches the corresponding parameter space for a load's optimal configuration with the search algorithm, and the optimization history module lets users view the optimal configuration and the search iteration graph.

The experiments run on a three-node cluster configured as a container environment: the Docker engine is installed on each node, the Kubernetes container scheduling platform is installed on the master node, and the optimization tool is deployed on the master node. In the experiments, the accuracy of the model built on a 100 GB WordCount load reaches 80%, and the optimal configuration found by the search performs more than five times better than the default configuration.
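The model-then-search loop described above can be sketched in a few lines. This is an illustrative assumption, not the thesis's actual implementation: the three parameter names, the synthetic runtime function standing in for real Spark job timings, and the choice of a random forest with plain random search are all placeholders for whatever models and search algorithm the tool actually uses.

```python
import random
from sklearn.ensemble import RandomForestRegressor

rng = random.Random(42)

# Three illustrative Spark parameters (names are assumptions for this sketch).
PARAM_SPACE = {
    "executor_memory_gb": [2, 4, 8, 16],
    "executor_cores": [1, 2, 4],
    "shuffle_partitions": [50, 100, 200, 400],
}

def sample_config():
    """Draw one random configuration from the parameter space."""
    return [rng.choice(values) for values in PARAM_SPACE.values()]

def measured_runtime(cfg):
    """Stand-in for actually running the Spark job and timing it (seconds)."""
    mem, cores, parts = cfg
    return 600.0 / (mem * cores) + 0.05 * parts + rng.gauss(0, 1)

# Step 1: collect (configuration, execution time) samples from real runs.
X = [sample_config() for _ in range(200)]
y = [measured_runtime(cfg) for cfg in X]

# Step 2: fit a performance model that predicts execution time from a config.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Step 3: search the configuration space using only the cheap surrogate model,
# instead of re-running the Spark job for every candidate configuration.
candidates = [sample_config() for _ in range(1000)]
best_cfg = min(candidates, key=lambda cfg: model.predict([cfg])[0])
print(dict(zip(PARAM_SPACE, best_cfg)))
```

The point of the design is that step 3 queries the model rather than the cluster, so thousands of candidate configurations can be evaluated in seconds; only the final winner needs to be validated with a real Spark run.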
Keywords/Search Tags:Big data, Spark, Container, Machine learning, Search algorithm