
Research And Improvement Of Job Scheduling Algorithm Based On Hadoop Cluster

Posted on: 2018-07-16  Degree: Master  Type: Thesis
Country: China  Candidate: Y F Wang
GTID: 2348330515981991  Subject: Computer application technology
Abstract/Summary:
With the advent of the big data era, cloud computing has attracted great attention from both industry and researchers. Hadoop is an open-source cloud computing platform developed by Apache. It consists mainly of two parts: the HDFS distributed file system, which forms its foundation, and the MapReduce computing framework, which forms its core. The main function of MapReduce is data processing, and the job scheduling technique within the framework plays a key role in allocating system resources. However, Hadoop's built-in scheduling algorithms each have shortcomings, so it is necessary to study those shortcomings and make targeted improvements.

The performance of the scheduling algorithm is an important factor in the performance of the system. In a Hadoop cluster environment, the main performance indicators are data locality and the average job completion time. The essence of the proposed algorithm is to improve the data locality of the Hadoop cluster, reduce network transmission cost, and avoid blocking. The thesis first proposes a local scheduling algorithm to improve data locality. The algorithm defines the node selection conditions for Map tasks and Reduce tasks, and it aims to process data where it is stored in HDFS as far as possible, so that tasks run on the nodes that hold their data. Because Map tasks under local scheduling complete at different times, Reduce tasks started under the Early Shuffle mechanism sit idle while they wait, which increases the average job completion time.

To solve this problem, the thesis proposes a new scheduling strategy that integrates preemptive scheduling with data locality. While a Reduce task is waiting, it is suspended and its resources are released to other Map tasks; once the Map tasks have completed to a certain extent, the Reduce task is rescheduled. This preserves data locality while also reducing the average job completion time. Finally, the thesis describes the implementation of the new scheduling algorithm on a Hadoop cluster platform and compares the performance of the local scheduling strategy with and without preemption. Experiments in the cluster environment show that the proposed algorithm improves data locality by 17% on average and that the scheduling strategy with preemption reduces the average completion time by 14.12%, effectively improving data locality, reducing network transmission, and shortening the average job completion time.
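
The two ideas in the abstract, preferring a node that already holds a Map task's input and holding back Reduce tasks until enough Map tasks have finished, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not state the actual node selection conditions or the completion threshold, so `MapTask`, `pick_map_task`, `reduce_should_run`, and the `resume_fraction` default of 0.8 are all hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MapTask:
    """Hypothetical pending Map task and the hosts storing its input block."""
    task_id: str
    block_hosts: set  # hosts holding an HDFS replica of the task's input


def pick_map_task(free_host, pending):
    """Pick a Map task for a host that has a free slot.

    Assumed rule: prefer a task whose input block is stored on free_host
    (data-local). If none is local, fall back to the first pending task.
    Returns (task, is_local); task is None when nothing is pending.
    """
    for task in pending:
        if free_host in task.block_hosts:
            return task, True
    if pending:
        return pending[0], False
    return None, False


def reduce_should_run(maps_done, maps_total, resume_fraction=0.8):
    """Decide whether a Reduce task should hold a slot right now.

    Assumed rule: a waiting Reduce task stays suspended, leaving its slot
    to Map tasks, until the completed share of Map tasks reaches
    resume_fraction. The value 0.8 is an illustrative placeholder.
    """
    if maps_total == 0:
        return True  # no Map phase to wait for
    return maps_done / maps_total >= resume_fraction


if __name__ == "__main__":
    pending = [
        MapTask("m1", {"nodeB"}),
        MapTask("m2", {"nodeA", "nodeC"}),
    ]
    task, is_local = pick_map_task("nodeA", pending)
    print(task.task_id, is_local)
    print(reduce_should_run(3, 10), reduce_should_run(9, 10))
```

In this sketch the locality check runs first and the fallback only avoids leaving a slot idle; a real scheduler would also have to handle rack-level locality, waiting delays, and the cost of suspending and resuming a Reduce task, none of which the abstract specifies.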
Keywords/Search Tags:Hadoop cluster, Job scheduling, Data localization, Average completion time, Preemption