
Automatic Optimization Method For Spark Job Configuration Parameters Based On Bayesian Optimization

Posted on: 2021-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: H C Huang
Full Text: PDF
GTID: 2568306905475484
Subject: Computer technology
Abstract/Summary:
In the field of big data computing, Spark has become one of the most popular computing platforms. Its range of application is very wide, covering offline batch processing, SQL processing, streaming/real-time computing, machine learning, graph computing, and other types of computing jobs. In practice, however, the configuration parameters of Spark jobs are numerous and complex, and performance tuning depends heavily on the practical experience of technical personnel, making it almost impossible for ordinary users to obtain optimal configuration parameters in a short time. Without reasonable performance tuning, a job is likely to take longer to execute and to waste more resources, so the advantage of Spark as a fast big data computing engine cannot be fully realized.

To address this problem, this thesis proposes a method for automatically optimizing job configuration parameters based on Bayesian optimization. Building on an in-depth analysis of how Spark jobs execute, the thesis first identifies the main features that affect job performance by analyzing historical log information and remotely monitoring the resource consumption of the computing cluster. Secondly, a performance prediction model is established, taking the job's data size, its configuration parameters, the cluster's resource usage, and other features as inputs, and the job's execution time or resource waste rate, reflecting different user scenarios, as the output. Because Spark jobs come in many types, and different types of jobs differ in resource utilization and performance characteristics, the K-means algorithm is used to cluster jobs according to their data size and DAG structure. After comparing the evaluation metrics and computational complexity of commonly used machine learning algorithms, the LightGBM regression algorithm is selected to build a separate performance prediction model for each job type. Finally, the thesis applies the Bayesian optimization algorithm to search for a job's optimal configuration parameters by iteratively querying and correcting the performance prediction model, thereby achieving automatic optimization of Spark job configuration parameters.

Untrained jobs were then tested separately with the system's default configuration parameters, with parameters obtained by a previously published genetic-algorithm search, and with the parameters produced by the method described in this thesis. The results show that the proposed method effectively reduces both the execution time and the resource waste rate of jobs, verifying the effectiveness of the method and the novelty of this work.
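The thesis does not reproduce its clustering code, but the job-grouping step it describes can be illustrated with a minimal, self-contained K-means sketch. The job feature vectors below (input data size in GB, number of DAG stages) are invented for illustration; the real features are derived from the job's data size and DAG graph.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Minimal K-means with deterministic farthest-point initialization.

    Returns one cluster label per input point.
    """
    # Farthest-point init: start from the first point, then repeatedly add
    # the point farthest from all chosen centers (a k-means++-style choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(_dist2(p, c) for c in centers)))

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each job goes to its nearest centroid.
        labels = [min(range(k), key=lambda i: _dist2(p, centers[i]))
                  for p in points]
        # Update step: recompute each centroid from its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(len(points[0])))
    return labels


# Hypothetical job features: (input size in GB, number of DAG stages).
jobs = [(1.0, 2), (1.2, 2), (0.9, 3),      # small batch-style jobs
        (50.0, 12), (52.0, 11), (49.5, 12)]  # large multi-stage jobs
labels = kmeans(jobs, k=2)
```

In the thesis pipeline, each resulting cluster would then receive its own LightGBM performance prediction model.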
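The search step can be sketched as follows. This is not the author's implementation: the thesis drives a Bayesian optimization algorithm against a LightGBM surrogate, whereas the toy below replaces the surrogate with a made-up quadratic runtime model and the acquisition step with a crude mix of random exploration and perturbation of the best configuration found so far. The parameter names and ranges (executor memory in GB, executor cores) are invented for illustration.

```python
import math
import random


def predicted_runtime(memory_gb, cores):
    """Toy stand-in for the performance prediction model.

    Maps two hypothetical Spark configuration parameters to a predicted
    execution time; the quadratic shape is purely illustrative.
    """
    return (memory_gb - 8.0) ** 2 + (cores - 4) ** 2 + 30.0


def optimize(n_iter=200, seed=0):
    """Simplified sequential search over the configuration space.

    Alternates uniform random exploration with perturbation of the
    incumbent, a rough proxy for maximizing an acquisition function
    (e.g. expected improvement) in real Bayesian optimization.
    """
    rng = random.Random(seed)
    best_cfg, best_val = None, math.inf
    for _ in range(n_iter):
        if best_cfg is None or rng.random() < 0.3:
            # Exploration: sample a configuration uniformly from the ranges.
            cand = (rng.uniform(1.0, 16.0), rng.randint(1, 8))
        else:
            # Exploitation: perturb the best configuration found so far.
            mem = min(16.0, max(1.0, best_cfg[0] + rng.gauss(0.0, 1.0)))
            cores = min(8, max(1, best_cfg[1] + rng.choice([-1, 0, 1])))
            cand = (mem, cores)
        val = predicted_runtime(*cand)
        if val < best_val:
            best_cfg, best_val = cand, val
    return best_cfg, best_val


best_cfg, best_val = optimize()
```

In the real pipeline the quadratic would be replaced by the per-cluster LightGBM model, and the loop by a proper Bayesian optimization step that fits and queries a probabilistic surrogate at each iteration.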
Keywords/Search Tags: Spark, configuration parameter, performance prediction, Bayesian optimization, machine learning