
Spark Task Scheduling With Immovable Data

Posted on: 2020-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: W Xu
Full Text: PDF
GTID: 2370330623959878
Subject: Computer Science and Technology
Abstract/Summary:
Due to the limited resources of a single data center, complex Spark workflow applications need to be executed across multiple data centers. When input data are too large to be transferred (immovable data), tasks that depend on these data can only be assigned to the data center that holds their input data. Moreover, the task scheduling methods provided by Spark are not suitable for heterogeneous environments. This thesis considers the problem of scheduling a Spark workflow application with partially immovable input data onto heterogeneous data centers so as to minimize the completion time of the application. This problem is of significant practical importance. Its challenges mainly include the following two aspects: (i) tasks that need immovable data can only be assigned to specific data centers, and the resource competition between these tasks and other tasks can affect the completion time of the Spark workflow application; (ii) the number of feasible stage scheduling sequences is huge, and the critical path cannot be obtained in a heterogeneous environment, so it is difficult to find a scheduling sequence that optimizes the completion time.

This thesis proposes a rule-based task scheduling algorithm (STSID) for the considered problem. The algorithm is divided into two phases: time parameter estimation and stage allocation. The former proposes two methods for estimating node processing rates and calculates all the time parameters of the stages. The latter comprises three parts: stage selection, resource allocation, and adding ready stages. The first part selects the stage with the highest priority from the set of ready stages; stages that depend on immovable data are given higher priority than the others. For the remaining stages, three priority rules are available: ESTF (earliest start time first), SFTF (shortest float time first), and RANDOM. The second part proposes three stage scheduling algorithms based on EATF (earliest available time first), EFTF (earliest finish time first), and SWRF (smallest waste of resource first), respectively. The third part adds newly ready stages to the ready stage set.

To verify the performance of the proposed algorithm, the multi-factor analysis of variance (ANOVA) technique is adopted to calibrate the involved parameters and algorithm components. The proposed algorithm with the calibrated parameters and components is compared against the two scheduling algorithms supported by the Spark framework for similar problems (FIFO and FAIR) on standard scientific workflow instances. Experimental results indicate that the proposed algorithm outperforms the compared algorithms under different numbers of jobs and different numbers of nodes in the data centers.
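To make the stage-selection and resource-allocation rules concrete, the following is a minimal Scala sketch of the stage-allocation phase. The Stage and DataCenter records, their field names, and the exact interpretation of each rule (in particular reading SWRF as "fewest idle slots left unused") are illustrative assumptions for this sketch, not the thesis's actual implementation.

    object StsidSketch {

      // A workflow stage. est/lft/duration are assumed to come from the
      // time-parameter-estimation phase; immovableDc is the data center
      // holding the stage's immovable input data, if any.
      case class Stage(
          id: Int,
          est: Double,       // earliest start time
          lft: Double,       // latest finish time
          duration: Double,  // estimated execution time
          immovableDc: Option[Int]
      ) {
        def floatTime: Double = lft - est - duration
      }

      // A heterogeneous data center: when it next becomes available, its
      // relative processing speed, and its remaining free slots.
      case class DataCenter(id: Int, availableAt: Double, speed: Double, freeSlots: Int)

      // Stage selection: stages pinned to immovable data always outrank
      // the rest; the chosen rule then orders the candidate pool.
      def selectStage(ready: Seq[Stage], rule: String): Stage = {
        val (pinned, free) = ready.partition(_.immovableDc.isDefined)
        val pool = if (pinned.nonEmpty) pinned else free
        rule match {
          case "ESTF"   => pool.minBy(_.est)        // earliest start time first
          case "SFTF"   => pool.minBy(_.floatTime)  // shortest float time first
          case "RANDOM" => pool(scala.util.Random.nextInt(pool.size))
        }
      }

      // Resource allocation: pick a data center for the selected stage.
      // A pinned stage may only go to the data center holding its input.
      def allocate(stage: Stage, dcs: Seq[DataCenter], rule: String): DataCenter = {
        val candidates = stage.immovableDc match {
          case Some(dcId) => dcs.filter(_.id == dcId)
          case None       => dcs
        }
        rule match {
          case "EATF" => candidates.minBy(_.availableAt)  // earliest available time first
          case "EFTF" => candidates.minBy(dc =>           // earliest finish time first
                           math.max(dc.availableAt, stage.est) + stage.duration / dc.speed)
          case "SWRF" => candidates.minBy(_.freeSlots)    // smallest waste of resource first
        }
      }

      def main(args: Array[String]): Unit = {
        val ready = Seq(
          Stage(1, est = 0.0, lft = 10.0, duration = 4.0, immovableDc = None),
          Stage(2, est = 2.0, lft = 9.0,  duration = 5.0, immovableDc = Some(0)),
          Stage(3, est = 1.0, lft = 12.0, duration = 3.0, immovableDc = None)
        )
        val dcs = Seq(
          DataCenter(0, availableAt = 1.0, speed = 1.0, freeSlots = 4),
          DataCenter(1, availableAt = 0.0, speed = 2.0, freeSlots = 8)
        )
        val next  = selectStage(ready, "SFTF")   // stage 2: pinned stages outrank the rest
        val where = allocate(next, dcs, "EFTF")  // must be DC 0 (immovable input)
        println(s"run stage ${next.id} on data center ${where.id}")
      }
    }

Note how the immovable-data constraint appears twice: once in selection (pinned stages are prioritized) and once in allocation (pinned stages have a candidate set of size one), which mirrors the two challenges stated above.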
Keywords/Search Tags: Spark, Heterogeneous nodes, Immovable data, Task scheduling