| In recent years, the information industry has developed rapidly, computer hardware has been continuously upgraded, and the Internet industry's capacity to store and process massive data has steadily improved. On this basis, the industry has produced a series of distributed frameworks designed for massive data, such as MapReduce, Spark, and HDFS, and Spark has become widely popular because of its in-memory computing. However, Spark has more than 200 configuration parameters. They are often confusing, and an unreasonable configuration can lead to slow job execution and wasted cluster resources. To address this, this paper proposes a Spark job performance prediction system and a configuration recommendation system. The purpose is to accurately estimate the running time and resource occupancy rate of a job before it runs, and then to recommend the optimal cluster configuration. The main work of this paper consists of three parts: a job performance monitoring system, a job performance prediction system, and a configuration recommendation system. The job performance monitoring system is based on Ganglia and records job performance data through real-time tracking of cluster nodes. The job performance prediction system uses an encoder-decoder model based on an improved local attention mechanism to predict the performance curve of a job on the real data set from a simulated run of the job on a sampled data set. The configuration recommendation system considers the delay of the job, its resource occupation, and the cost of requesting and releasing resources; it searches the job configuration space and uses the performance prediction system to give the optimal configuration parameters under the given constraints. In addition, this paper builds a set of benchmark programs and data sets and tests the implemented system in detail. The test results show that the performance prediction system has high accuracy and meets the performance prediction requirements well, and that the configuration given by the recommendation system reduces time and resource overhead. This shows that the system designed and implemented in this paper has strong practical significance. |