| At present, cloud computing technology is developing rapidly and gradually changing people's lives. As a cloud computing platform with high reliability, high scalability and high fault tolerance, Hadoop has received extensive attention and use. As Hadoop is applied in more and more fields, its operating environment is becoming more complex; in heterogeneous cluster environments in particular, the amount of data keeps increasing while node resources remain limited. To address this problem, the Hadoop platform is being continuously optimized. An improved Hadoop job scheduling algorithm can use resources effectively in a heterogeneous cluster environment, raising cluster utilization and job execution efficiency, which is of great significance for the performance of the whole Hadoop platform. Hadoop's FIFO scheduler, capacity scheduler and fair scheduler all schedule on the premise of a homogeneous cluster. They schedule well for small clusters and a single job type. However, as Hadoop clusters grow in scale, cluster heterogeneity becomes more prominent and job types become more complex. Hadoop's built-in scheduling algorithms cannot allocate resources and schedule tasks effectively in this situation, which leads to low resource utilization and load imbalance. This paper analyzes and compares job scheduling algorithms for the Hadoop platform and proposes an improved job scheduling algorithm that adapts to heterogeneous clusters. The work mainly includes the following aspects. First, the concept of a load balancing index is put forward. The load balancing index is measured by four characteristics: CPU, memory, disk I/O and network bandwidth. The load is weighted according to node capacity, so that the load balancing index reflects both node capability and resource utilization. Second, a genetic algorithm is used to model the Hadoop job scheduling problem, and the resources and 
tasks involved in job scheduling are numbered and parameterized, so that the job scheduling problem is transformed into a search for the optimal solution by the genetic algorithm. The job completion time and the load balancing index proposed in this paper are used as the fitness function of the genetic algorithm, so that the two objectives of job completion time and load balancing are optimized at the same time. In this way, the optimal individual is found by the genetic algorithm and decoded into the best task allocation list. Following the obtained task list, the scheduler can effectively shorten the total job completion time and maintain load balance. Finally, a small Hadoop cluster is built to test the load balance and the job completion time of the improved scheduling algorithm. The results show that the new scheduling algorithm can guarantee load balance in the heterogeneous cluster environment and outperforms Hadoop's built-in scheduling algorithms in job completion time. |
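
The approach summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so everything concrete here is an assumption made for illustration. The load balancing index is taken to be the standard deviation of capacity-weighted node loads, the weights in `WEIGHTS` are hypothetical, each task is assumed to carry a `work` amount and each node a `speed`, and the two objectives are combined into a single cost (lower is better) through a hypothetical trade-off parameter `alpha`, without normalizing their scales. A chromosome is a list whose i-th gene is the node assigned to task i, so the best individual decodes directly into a task allocation list.

```python
import random
import statistics

RESOURCES = ("cpu", "mem", "disk_io", "net")
# Hypothetical weights for the four characteristics; the source gives no values.
WEIGHTS = {"cpu": 0.4, "mem": 0.3, "disk_io": 0.2, "net": 0.1}


def node_loads(assignment, tasks, nodes):
    """Capacity-weighted load of each node for a task-to-node assignment."""
    used = [{r: 0.0 for r in RESOURCES} for _ in nodes]
    for task_id, node_id in enumerate(assignment):
        for r in RESOURCES:
            used[node_id][r] += tasks[task_id][r]
    # Dividing by capacity makes a strong node "less loaded" by the same demand.
    return [
        sum(WEIGHTS[r] * used[n][r] / nodes[n][r] for r in RESOURCES)
        for n in range(len(nodes))
    ]


def load_balance_index(assignment, tasks, nodes):
    """Assumed definition: population std. dev. of node loads (0 = balanced)."""
    return statistics.pstdev(node_loads(assignment, tasks, nodes))


def completion_time(assignment, tasks, nodes):
    """Makespan: the largest per-node sum of work divided by node speed."""
    busy = [0.0] * len(nodes)
    for task_id, node_id in enumerate(assignment):
        busy[node_id] += tasks[task_id]["work"] / nodes[node_id]["speed"]
    return max(busy)


def cost(assignment, tasks, nodes, alpha=0.5):
    """Combined objective, lower is better; alpha is a hypothetical trade-off."""
    return (alpha * completion_time(assignment, tasks, nodes)
            + (1 - alpha) * load_balance_index(assignment, tasks, nodes))


def evolve(tasks, nodes, pop_size=30, generations=60, mutation_rate=0.1, seed=0):
    """Small elitist GA; needs at least 2 tasks and pop_size >= 4."""
    rng = random.Random(seed)
    n_tasks, n_nodes = len(tasks), len(nodes)
    pop = [[rng.randrange(n_nodes) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: cost(ind, tasks, nodes))
        survivors = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)  # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_tasks):  # per-gene random reassignment
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_nodes)
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda ind: cost(ind, tasks, nodes))


# Illustrative data: one fast node, one slow node, six identical tasks.
nodes = [
    {"cpu": 8, "mem": 16, "disk_io": 200, "net": 1000, "speed": 2.0},
    {"cpu": 4, "mem": 8, "disk_io": 100, "net": 1000, "speed": 1.0},
]
tasks = [{"cpu": 1, "mem": 2, "disk_io": 20, "net": 50, "work": 10.0}
         for _ in range(6)]
best = evolve(tasks, nodes)  # best[i] is the node index chosen for task i
```

Because the selection step always keeps the best individuals, the cost of the returned allocation never gets worse from one generation to the next. A real scheduler would also have to normalize the two objectives and respect data locality, both of which this sketch ignores.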