On The Low Overhead Configuration Optimization Of In-memory Big Data Query Engine

Posted on: 2022-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: J H Xin
Full Text: PDF
GTID: 2518306773971529
Subject: Automation Technology
Abstract/Summary:
Spark SQL is widely used in industry as an in-memory query engine for big data. However, tuning its performance is challenging. Recent research attempts to solve this problem with machine learning (ML), but ML-based approaches have two drawbacks. First, collecting training samples takes a long time, which results in a high time overhead. Second, the configuration that is optimal for one input data set of an application may not be optimal for another input data set of the same application, so the application has to be re-tuned. To address these issues, this thesis presents a novel approach that automatically tunes Spark SQL analytical query applications online. The approach relies on three key techniques. The first, Query Configuration Sensitivity Analysis (QCSA), eliminates configuration-insensitive SQL queries when collecting training samples. The second, Datasize-Aware Gaussian Process (DAGP), builds a query performance model that takes the data set size into account alongside the configuration parameters, so the model adapts automatically when the data set size changes. The third, Identifies Important Configuration Parameters (IICP), determines which configuration parameters are important with respect to performance (e.g., execution time or query latency) and tunes only those parameters. As a result, the proposed method tunes the configuration of Spark SQL applications with low overhead and adapts to changes in input data set size. We evaluate the proposed method with applications from the TPC-DS, TPC-H, and HiBench benchmark suites on four high-performance ARM server clusters and eight high-performance x86 server clusters. Experimental results show that, compared with state-of-the-art automatic tuning solutions, the proposed method reduces the optimization overhead by 9.7 times and 9.2 times and improves the optimized performance by 2.4 times and 2.8 times on the ARM and x86 clusters, respectively. In addition, the proposed method adapts automatically when the size of the input data set changes.
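The idea behind QCSA, dropping queries whose run time barely reacts to configuration changes before training samples are collected, can be illustrated with a minimal sketch. The abstract does not state the sensitivity criterion the thesis uses, so the coefficient of variation, the 0.1 threshold, the function names, and the timings below are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def coefficient_of_variation(times):
    """Return std/mean of a list of execution times (0.0 if the mean is 0)."""
    m = mean(times)
    return pstdev(times) / m if m else 0.0


def select_sensitive_queries(samples, threshold=0.1):
    """Keep queries whose run time varies noticeably across configurations.

    samples:   dict mapping query name -> list of execution times in seconds,
               one entry per sampled configuration.
    threshold: hypothetical cut-off on the coefficient of variation
               (an assumption, not a value from the thesis).
    """
    return sorted(
        q for q, times in samples.items()
        if coefficient_of_variation(times) >= threshold
    )


# Made-up timings for illustration only (not measurements from the thesis).
toy_samples = {
    "q1": [10.0, 10.1, 9.9, 10.0],   # barely reacts to configuration changes
    "q2": [10.0, 25.0, 14.0, 40.0],  # strongly configuration-sensitive
    "q3": [5.0, 5.0, 5.0, 5.0],      # completely insensitive
}

sensitive = select_sensitive_queries(toy_samples)
print(sensitive)
```

With these toy values only q2 passes the filter, so only q2 would be run under further configurations when gathering training data, which is where the sampling time would be saved.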
Keywords/Search Tags:Big data, In-memory computing engine, Spark, Spark SQL, Auto-tuning