On The Low Overhead Configuration Optimization Of In-memory Big Data Query Engine

Posted on: 2022-11-20

Degree: Master

Type: Thesis

Country: China

Candidate: J H Xin

GTID: 2518306773971529

Subject: Automation Technology

Abstract/Summary:

Spark SQL is widely used in industry as an in-memory query engine for big data. However, tuning its performance is challenging. Recent research attempts to solve this problem with machine learning (ML), but ML-based approaches have two drawbacks. First, collecting training samples takes a long time, which results in a high time overhead. Second, the configuration that is optimal for one input dataset of an application may not be optimal for another input dataset of the same application, so the application must be re-tuned. To address these issues, this thesis presents a novel approach that automatically tunes Spark SQL analytical query applications online. The approach introduces three key techniques. The first, Query Configuration Sensitivity Analysis (QCSA), eliminates configuration-insensitive SQL queries when collecting training samples. The second, Datasize-Aware Gaussian Process (DAGP), adapts automatically to changes in dataset size and builds a query performance model that takes the dataset size into account together with the configuration parameters. The third, Identifying Important Configuration Parameters (IICP), determines which configuration parameters are important with respect to performance (e.g., execution time or query latency) and tunes only those parameters. As a result, the proposed method tunes the configuration of Spark SQL applications with low overhead and adapts to changes in input dataset size. We evaluate the method using applications from the TPC-DS, TPC-H, and HiBench benchmark suites on four high-performance ARM server clusters and eight high-performance x86 server clusters. Experimental results show that, compared with state-of-the-art automatic tuning solutions, the optimization overhead is reduced by 9.7 times and 9.2 times, and the optimized performance is improved by 2.4 times and 2.8 times, on the ARM and x86 clusters, respectively. In addition, the proposed method adapts automatically to scenarios in which the size of the input dataset changes.
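The idea behind a datasize-aware performance model can be illustrated with a minimal sketch. This is not the thesis's DAGP implementation, whose details the abstract does not give. It assumes a plain Gaussian process regressor with a squared-exponential kernel, and it simply appends the (normalized) dataset size to the (normalized) configuration vector as one more input feature. The class name DatasizeAwareGP, the two hypothetical configuration parameters, and all numbers in the toy data are made up for illustration.

```python
import math

def rbf(a, b, length_scale):
    """Squared-exponential kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

class DatasizeAwareGP:
    """Hypothetical GP regressor whose inputs are (config..., data size)."""

    def __init__(self, length_scale=0.5, noise=1e-8):
        self.length_scale = length_scale
        self.noise = noise

    def fit(self, configs, sizes, times):
        # Assumption: the data size is treated as one extra input feature.
        self.X = [list(c) + [s] for c, s in zip(configs, sizes)]
        self.mean = sum(times) / len(times)
        K = [[rbf(a, b, self.length_scale) + (self.noise if i == j else 0.0)
              for j, b in enumerate(self.X)]
             for i, a in enumerate(self.X)]
        self.alpha = solve(K, [t - self.mean for t in times])
        return self

    def predict(self, config, size):
        """Posterior mean of the execution time for one (config, size)."""
        x = list(config) + [size]
        return self.mean + sum(
            a * rbf(x, xi, self.length_scale)
            for a, xi in zip(self.alpha, self.X))

# Toy samples (invented): two normalized configuration parameters,
# a normalized data size, and an execution time in seconds.
configs = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9), (0.5, 0.5), (0.5, 0.5)]
sizes = [0.1, 0.1, 0.9, 0.9, 0.1, 0.9]
times = [40.0, 25.0, 120.0, 70.0, 30.0, 90.0]

model = DatasizeAwareGP().fit(configs, sizes, times)
# The same configuration gets a different prediction at another data size,
# so no separate model has to be trained per input size.
print(model.predict((0.5, 0.5), 0.1), model.predict((0.5, 0.5), 0.9))
```

Because the data size is an ordinary input here, a tuner could query the model at the size of a new input dataset and search only over the configuration part of the vector. A real system would also need the predictive variance and fitted kernel hyperparameters, both of which this sketch omits.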

Keywords/Search Tags:

Big data, In-memory computing engine, Spark, Spark SQL, Auto-tuning

Related items

1	Research On Memory Optimization Technology Of Spark Computing Engine
2	Design And Implementation Of Telecom 4G Big Data Platform For Network Optimization Based On Spark
3	Research And Implementation Of Memory Optimization Based On Parallel Computing Engine Spark
4	Adaptive Memory Management Research Based On In-Memory Computing Characteristics In Spark
5	Research On Workload-specific Memory Configuration Of Spark Workloads
6	The Profiling And Memory Analysis On Typical In-memory Computing Big Data System
7	Scalable Big Data Analysis Platform Based On PostgreSQL And Spark
8	Research On Memory Data Management Technology In Spark
9	Research And Implementation Of Data Hybrid Computing Platform Based On Spark
10	Research On The In-Memory Data Management Technology On Spark Data Processing Framework