
Spark Task Scheduling With Immovable Data

Posted on: 2020-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: W Xu
Full Text: PDF
GTID: 2370330623959878
Subject: Computer Science and Technology
Abstract/Summary:
Due to the limited resources of a single data center, complex Spark workflow applications need to be executed across multiple data centers. When input data are too large to be transferred (immovable data), tasks that depend on these data can only be assigned to the data center that holds their input data. Moreover, the task scheduling methods provided by Spark are not suitable for heterogeneous environments. This thesis considers the problem of scheduling a Spark workflow application with partially immovable input data onto heterogeneous data centers so as to minimize the completion time of the application. This problem is of significant practical importance. Its challenges mainly include the following two aspects: (i) tasks that need immovable data can only be assigned to specific data centers, and the resource competition between these tasks and other tasks can affect the completion time of the Spark workflow application; (ii) the number of feasible stage scheduling sequences is huge, and the critical path cannot be obtained in a heterogeneous environment, so it is difficult to find a scheduling sequence that optimizes the completion time.

This thesis proposes a rule-based task scheduling algorithm (STSID) for the considered problem. The algorithm is divided into two phases: time parameter estimation and stage allocation. The former proposes two methods for estimating node processing rates and calculates all the time parameters of the stages. The latter comprises three parts: stage selection, resource allocation, and adding ready stages. The first part selects the stage with the highest priority from the set of ready stages; stages that depend on immovable data are given higher priority than the others. For the remaining stages, three priority rules are available: ESTF (earliest start time first), SFTF (shortest float time first), and RANDOM. The second part proposes three stage scheduling algorithms based on EATF (earliest available time first), EFTF (earliest finish time first), and SWRF (smallest waste of resource first), respectively. The third part adds newly ready stages to the ready stage set.

To verify the performance of the proposed algorithm, the multi-factor analysis of variance (ANOVA) technique is adopted to calibrate the involved parameters and algorithm components. The proposed algorithm with the calibrated parameters and components is compared against the two scheduling algorithms supported by the Spark framework for similar problems (FIFO and FAIR) on standard scientific workflow instances. Experimental results indicate that the proposed algorithm outperforms the compared algorithms under different numbers of jobs and different numbers of nodes in the data centers.
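To make the stage-selection and resource-allocation rules concrete, the following is a minimal Scala sketch of the stage-allocation phase. The Stage and DataCenter records, their field names, and the exact interpretation of each rule (in particular reading SWRF as "fewest idle slots left unused") are illustrative assumptions for this sketch, not the thesis's actual implementation.

    object StsidSketch {

      // A workflow stage. est/lft/duration are assumed to come from the
      // time-parameter-estimation phase; immovableDc is the data center
      // holding the stage's immovable input data, if any.
      case class Stage(
          id: Int,
          est: Double,       // earliest start time
          lft: Double,       // latest finish time
          duration: Double,  // estimated execution time
          immovableDc: Option[Int]
      ) {
        def floatTime: Double = lft - est - duration
      }

      // A heterogeneous data center: when it next becomes available, its
      // relative processing speed, and its remaining free slots.
      case class DataCenter(id: Int, availableAt: Double, speed: Double, freeSlots: Int)

      // Stage selection: stages pinned to immovable data always outrank
      // the rest; the chosen rule then orders the candidate pool.
      def selectStage(ready: Seq[Stage], rule: String): Stage = {
        val (pinned, free) = ready.partition(_.immovableDc.isDefined)
        val pool = if (pinned.nonEmpty) pinned else free
        rule match {
          case "ESTF"   => pool.minBy(_.est)        // earliest start time first
          case "SFTF"   => pool.minBy(_.floatTime)  // shortest float time first
          case "RANDOM" => pool(scala.util.Random.nextInt(pool.size))
        }
      }

      // Resource allocation: pick a data center for the selected stage.
      // A pinned stage may only go to the data center holding its input.
      def allocate(stage: Stage, dcs: Seq[DataCenter], rule: String): DataCenter = {
        val candidates = stage.immovableDc match {
          case Some(dcId) => dcs.filter(_.id == dcId)
          case None       => dcs
        }
        rule match {
          case "EATF" => candidates.minBy(_.availableAt)  // earliest available time first
          case "EFTF" => candidates.minBy(dc =>           // earliest finish time first
                           math.max(dc.availableAt, stage.est) + stage.duration / dc.speed)
          case "SWRF" => candidates.minBy(_.freeSlots)    // smallest waste of resource first
        }
      }

      def main(args: Array[String]): Unit = {
        val ready = Seq(
          Stage(1, est = 0.0, lft = 10.0, duration = 4.0, immovableDc = None),
          Stage(2, est = 2.0, lft = 9.0,  duration = 5.0, immovableDc = Some(0)),
          Stage(3, est = 1.0, lft = 12.0, duration = 3.0, immovableDc = None)
        )
        val dcs = Seq(
          DataCenter(0, availableAt = 1.0, speed = 1.0, freeSlots = 4),
          DataCenter(1, availableAt = 0.0, speed = 2.0, freeSlots = 8)
        )
        val next  = selectStage(ready, "SFTF")   // stage 2: pinned stages outrank the rest
        val where = allocate(next, dcs, "EFTF")  // must be DC 0 (immovable input)
        println(s"run stage ${next.id} on data center ${where.id}")
      }
    }

Note how the immovable-data constraint appears twice: once in selection (pinned stages are prioritized) and once in allocation (pinned stages have a candidate set of size one), which mirrors the two challenges stated above.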
Keywords/Search Tags: Spark, Heterogeneous nodes, Immovable data, Task scheduling