Query Optimization In SQL To Spark

Posted on:2017-09-25

Degree:Master

Type:Thesis

Country:China

Candidate:K Z Cai

Full Text:PDF

GTID:2348330491462674

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet and the spread of new Internet applications, enterprises and research institutes are encountering data at the scale of terabytes and even petabytes. As memory prices have fallen, researchers have paid increasing attention to memory-based storage and computing as a way to further improve data processing capability. Spark, built on resilient distributed datasets (RDDs), offers a lightweight, fast and scalable distributed in-memory computing framework. However, the existing high-level query tool, Spark SQL, does not optimize across multiple queries: processing queries in a batch produces separate Spark tasks that cannot share data with one another, which in turn limits Spark's in-memory computing. To address these issues, this paper investigates how to optimize the query process between Spark SQL and Spark.

More specifically, this paper analyzes the workflow of Spark SQL and optimizes the query process through the following two strategies:

1) Addressing data sharing between multiple queries by adding middle layer storage between the persistent file system and the Spark core, and optimizing the input of query data through reasonable allocation of memory resources, an effective data storage structure, and a low-cost fault-tolerance and failure-recovery design.

2) Introducing a query task data management module to manage the middle layer storage, and achieving efficient use of cluster resources by estimating the cost of each query task and selecting an appropriate data load node strategy based on the cost model.

We designed and implemented the SQL2Spark system to provide the functionality described above, and compared its query performance with that of Spark SQL using tests provided by TPC-H. The experimental results demonstrate that SQL2Spark has significant advantages in improving query speed, reducing redundant I/O cost and decreasing memory usage.
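The data-sharing idea behind the first strategy can be illustrated with a toy sketch in plain Python. This is not the SQL2Spark implementation, whose internals are not described in this abstract; the names `SharedTableCache` and `load_from_disk`, and the sample `orders` data, are hypothetical and exist only for illustration. The sketch shows two queries over the same table reaching persistent storage only once because an in-memory layer sits between them and the loader.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


class SharedTableCache:
    """Hypothetical stand-in for a storage layer between files and queries."""

    def __init__(self, loader: Callable[[str], List[Row]]) -> None:
        self._loader = loader
        self._tables: Dict[str, List[Row]] = {}
        self.loads = 0  # how many reads actually reached the underlying storage

    def get(self, name: str) -> List[Row]:
        # Load the table on first use; later queries reuse the in-memory copy.
        if name not in self._tables:
            self._tables[name] = self._loader(name)
            self.loads += 1
        return self._tables[name]


def load_from_disk(name: str) -> List[Row]:
    # Placeholder for an expensive scan of persistent storage (assumed data).
    return [{"id": i, "price": 10 * i} for i in range(1, 6)]


cache = SharedTableCache(load_from_disk)

# Two independent "queries" over the same table share one load.
total = sum(row["price"] for row in cache.get("orders"))
expensive = [row["id"] for row in cache.get("orders") if row["price"] > 20]
```

Without the shared layer, each query would call the loader itself and the table would be scanned twice; that repeated scan is the kind of redundant I/O the abstract refers to.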

Keywords/Search Tags:

Spark, Spark SQL, Query Optimization, Middle Layer Storage

Related items

1	The Query Execution Optimization In Spark SQL
2	Query Optimization In Spark SQL For Business Data Of 4G Industry Card Based On HDFS
3	An Ad-hoc Query Engine Based On Spark SQL
4	Research On Query Analysis And Optimization Based On Spark System
5	Dynamic Optimization Of Spark RDD Storage Solutions
6	Research Of Spark SQL Query Optimization Based On Runtime Statistics Collecting
7	Study And Implementation Of SPARK SQL Query Optimization
8	Temporal Query Analysis And Temporal Index Optimization Based On Apache Spark
9	Research On Cost-based Query Optimization For Spark SQL
10	Data Transmission And Storage Method Optimization Of Spark Shuffle