
Query Optimization In SQL To Spark

Posted on: 2017-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: K Z Cai
Full Text: PDF
GTID: 2348330491462674
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet and the spread of new Internet applications, enterprises and research institutes now face data at the scale of terabytes or even petabytes. As memory prices have fallen, researchers have paid increasing attention to memory-based storage and computing as a way to further improve data processing capability. Spark, built on resilient distributed datasets (RDDs), offers a lightweight, fast, and scalable distributed in-memory computing framework. However, its existing high-level query tool, Spark SQL, does not optimize across multiple queries: a batch of queries is executed as separate Spark tasks that cannot share data, which in turn limits the benefit of Spark's in-memory computing. To address these issues, this thesis investigates optimization of the query process between Spark SQL and Spark.

More specifically, this thesis analyzes the workflow of Spark SQL and optimizes the query process through the following two strategies:

1) Enabling data sharing among multiple queries by adding middle layer storage between the persistent file system and the Spark core, and optimizing query data input through reasonable allocation of memory resources, an effective data storage structure, and a low-cost fault-tolerance and failure-recovery design.
2) Introducing a query task data management module to manage the middle layer storage, and achieving efficient use of cluster resources by estimating the cost of each query task and selecting an appropriate data-loading node strategy based on the cost model.

We designed and implemented the SQL2Spark system to provide the functionality described above, and compared its query performance with Spark SQL using tests provided by TPC-H. The experimental results show that SQL2Spark has significant advantages in improving query speed, reducing redundant I/O cost, and decreasing memory usage.
Keywords/Search Tags: Spark, Spark SQL, Query Optimization, Middle Layer Storage
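
The data-sharing idea in the abstract (several queries in a batch reading the same input through a layer that sits between persistent storage and the compute engine) can be illustrated with a toy sketch. This is not the SQL2Spark implementation, which the abstract does not describe in detail; the class `MiddleLayerStore`, the function `fake_storage_read`, and the sample rows are all hypothetical names and data invented for illustration, and plain Python stands in for Spark.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


class MiddleLayerStore:
    """Hypothetical in-memory layer between persistent storage and query tasks."""

    def __init__(self, loader: Callable[[str], List[Row]]) -> None:
        self._loader = loader  # reads a table from persistent storage
        self._tables: Dict[str, List[Row]] = {}
        self.loads = 0  # number of reads that actually reached storage

    def table(self, name: str) -> List[Row]:
        # Load on first request only; later queries reuse the in-memory copy.
        if name not in self._tables:
            self._tables[name] = self._loader(name)
            self.loads += 1
        return self._tables[name]


def fake_storage_read(name: str) -> List[Row]:
    # Stand-in for an expensive file-system scan (illustrative data only).
    return [{"id": i, "qty": i * 10} for i in range(5)]


store = MiddleLayerStore(fake_storage_read)

# Two independent "queries" in one batch, both over the same table.
total_qty = sum(r["qty"] for r in store.table("lineitem"))
big_rows = [r["id"] for r in store.table("lineitem") if r["qty"] >= 30]

print(store.loads, total_qty, big_rows)
```

Both queries are answered from a single load of the table, which is the kind of redundant I/O the abstract says separate Spark tasks would otherwise incur. A real system would also need the memory allocation, fault tolerance, and cost-based load placement that the abstract mentions and this sketch omits.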