As an open-source framework for processing and analyzing big data on distributed clusters, Spark is widely used in industry owing to its outstanding execution speed, strong versatility, and user-friendly APIs. However, Spark still exhibits several problems in real-world scenarios. First, Spark has a complex parameter configuration mechanism: it is difficult for inexperienced users to tune the parameters manually, so Spark applications often cannot achieve their best performance. Second, the implementation of the Spark SQL cache can be further optimized. On the one hand, Spark has no automatic caching mechanism, so repeatedly used data is not cached automatically to reduce computation cost; on the other hand, releasing executors under the dynamic resource allocation mechanism causes the loss of cached data, which may introduce recomputation overhead in multi-query scenarios. To further improve Spark's performance, this paper conducts in-depth research on the above problems. The main contributions are as follows:

(1) To solve the parameter configuration problem, this paper proposes an intelligent parameter tuning method that combines offline and online tuning. The offline module builds a performance prediction model using machine learning and, before the application runs, uses a heuristic algorithm to search for the parameter configuration with the lowest predicted cost. The online module tunes parameters in a lightweight, feedback-based way: it monitors the application in real time and dynamically adjusts parameters through a monitor and an adjuster integrated into Hadoop YARN. Building on the idea of combining offline and online tuning, this paper designs an intelligent parameter tuning method that integrates the two modules and dynamically adjusts the optimization plan according to the actual state of the system. It not only solves the "cold start" problem but also strengthens the system's monitoring and real-time tuning capabilities, and it has strong practical applicability.

(2) To optimize the reuse of cached data in Spark SQL, this paper proposes a dynamic cache optimization method based on a cost model and a Markov chain model. The paper first designs a cost model over the query plan tree and applies it to evaluate the execution cost of different plans, thereby realizing adaptive caching of potentially reusable data sets. It then constructs a Markov chain model to predict the trend of query idle-period durations; this model guides the decision of when it is reasonable to release executors. Based on these two models, the paper designs a dynamic cache optimization method for Spark SQL that dynamically tunes cache operations and executor release at runtime, optimizing the reuse of cached data both within a single query and across multiple queries to improve query performance from multiple perspectives. Experiments on a purpose-built cluster show that the proposed methods effectively improve application performance and reduce query response time, demonstrating their feasibility and effectiveness.
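The offline stage of contribution (1) — a learned performance model plus a heuristic search for the lowest-cost configuration — can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the search space, the toy cost function, and all parameter values are assumptions, and simple random search stands in for whichever heuristic optimizer the thesis uses.

```python
import random

# Hypothetical search space over a few Spark parameters
# (illustrative values only, not taken from the thesis).
SEARCH_SPACE = {
    "spark.executor.memory_gb": [2, 4, 8, 16],
    "spark.executor.cores": [1, 2, 4],
    "spark.sql.shuffle.partitions": [100, 200, 400, 800],
}

def predicted_cost(config):
    """Stand-in for the learned performance prediction model: maps a
    configuration to an estimated runtime. A real model would be trained
    on historical application runs."""
    mem = config["spark.executor.memory_gb"]
    cores = config["spark.executor.cores"]
    parts = config["spark.sql.shuffle.partitions"]
    # Toy cost: more memory/cores reduce runtime; excess partitions add
    # per-task scheduling overhead.
    return 1000.0 / (mem * cores) + 0.05 * parts

def offline_search(n_trials=200, seed=42):
    """Heuristic search (here: random search) for the configuration with
    the lowest predicted cost, run before the application starts."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(vals) for name, vals in SEARCH_SPACE.items()}
        cost = predicted_cost(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost
```

The chosen configuration seeds the application's initial launch; the online module can then refine it from runtime feedback, which avoids the "cold start" a purely online tuner would face.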
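The Markov chain idea in contribution (2) can be illustrated with a minimal sketch: idle-period durations are discretized into states, a first-order transition matrix is estimated from observed history, and executor release is gated on the predicted likelihood of a long idle period. The state names, the decision threshold, and both helper functions are hypothetical choices for illustration, not the thesis's actual model.

```python
from collections import Counter, defaultdict

# Assumed discretization of query idle-period durations into states.
STATES = ["short", "medium", "long"]

def fit_transition_matrix(history):
    """Estimate first-order Markov transition probabilities from a
    sequence of observed idle-period states."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(history, history[1:]):
        counts[prev][nxt] += 1
    matrix = {}
    for s in STATES:
        total = sum(counts[s].values())
        # Fall back to a uniform distribution for unseen states.
        matrix[s] = {t: (counts[s][t] / total if total else 1 / len(STATES))
                     for t in STATES}
    return matrix

def should_release_executors(matrix, current_state, threshold=0.5):
    """Release executors (accepting the loss of their cached data) only
    when a long idle period is predicted to follow; the threshold is an
    assumed tuning knob trading recomputation cost against idle resources."""
    return matrix[current_state]["long"] >= threshold
```

Under this scheme, executors holding cached data are kept alive through short inter-query gaps, so subsequent queries can reuse the cache instead of recomputing it, while genuinely long idle periods still trigger resource release.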