In recent years, cloud computing has been widely adopted for massive data processing because of its high processor performance, reliability, and scalability. Against the background of the network information explosion, massive data processing has become a new challenge in computer science. MapReduce is a distributed data processing programming model whose main feature is that it simplifies traditional distributed program development: developers need only focus on the business logic and do not have to consider the details of the distributed implementation. Hadoop, the open-source implementation of MapReduce, provides enterprises and research institutions with a foundational platform for massive data processing. The main purpose of Hadoop scheduling is to improve the utilization of cluster resources and to reduce the running time of users' jobs. Hadoop job scheduling in a cloud environment brings new challenges to academia and industry, and improving job scheduling is of great significance for the performance and resource utilization of Hadoop.

First, this paper introduces the concept and architecture of cloud computing. We examine the MapReduce programming model and the Hadoop Distributed File System (HDFS), and analyze the Hadoop job execution mechanism as well as the existing scheduling algorithms.

Second, because the Priority Based Weighted Round Robin algorithm does not consider the system's load level and cannot fully utilize the processing capacity of the compute nodes in heterogeneous clusters, this paper proposes an improved priority scheduling algorithm (Priority Based Multi Scale, PBMC). The PBMC scheduling algorithm classifies the cluster's nodes, which have distinct computing capacities, and sorts them according to computing ability.
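To make the MapReduce model described above concrete, the canonical word-count job can be sketched in plain Python. This is an illustration of the programming model only, not Hadoop's actual Java API; the function names here are chosen for exposition:

```python
from collections import defaultdict

def map_phase(document):
    # The user-supplied map function: emit (word, 1) pairs.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # The framework groups intermediate values by key; in Hadoop
    # this step is distributed and handled automatically.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # The user-supplied reduce function: sum the counts per word.
    return key, sum(values)

docs = ["big data big cluster", "data processing"]
intermediate = [pair for d in docs for pair in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
# counts == {"big": 2, "data": 2, "cluster": 1, "processing": 1}
```

In a real Hadoop job the developer writes only the map and reduce functions, while the framework runs them in parallel across the cluster and handles the shuffle, which is precisely the simplification highlighted above.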
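The node classification and sorting step could be sketched roughly as follows. The node attributes, the capacity score, and the class boundaries here are hypothetical illustrations, not the actual PBMC formulation:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_cores: int   # hypothetical capacity indicators; PBMC's
    memory_gb: int   # real metric is not specified in the abstract

def capacity_score(node):
    # Illustrative weighting of compute resources.
    return node.cpu_cores * 2 + node.memory_gb

def classify_and_sort(nodes, n_classes=3):
    """Sort heterogeneous nodes by capacity, strongest first,
    then bucket them into roughly equal-sized classes."""
    ranked = sorted(nodes, key=capacity_score, reverse=True)
    size = max(1, -(-len(ranked) // n_classes))  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

cluster = [Node("n1", 4, 8), Node("n2", 16, 64), Node("n3", 8, 16)]
classes = classify_and_sort(cluster)
# classes[0] holds the most capable nodes (here n2)
```

A scheduler can then walk the classes in order, so that the most demanding or highest-priority work is offered to the strongest class first.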
The PBMC scheduling algorithm also considers the overall system load level, assigning higher-priority tasks to nodes with greater computing power. Experimental results show that PBMC fully accounts for the performance differences among the nodes in the cluster, reducing job completion time and improving the utilization of cluster resources.

Finally, by studying the job scheduling mechanism of Hadoop, and in view of the randomness and convergence of service requests, the reliability of the cloud computing system, and the problem of cluster resource utilization, we use a queuing model to build a model of the cloud computing system and use the compute nodes' load values to classify the reliability of the nodes. Based on this classification of node reliability, we propose a new job scheduling algorithm (Job Scheduling Based on Node Reliability, JSBNR). JSBNR puts forward a reliability evaluation model for compute nodes and, on top of it, a method for matching nodes and tasks. Experiments show that JSBNR improves the reliability of the cluster and the utilization of resources, while also scaling well.
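A load-based reliability classification of the kind JSBNR builds on might look like the following sketch. The load thresholds and the greedy task-to-node matching are assumptions made for illustration; the thesis derives its evaluation model from a queuing model of the cloud system:

```python
def reliability_class(load):
    """Map a node's load value in [0, 1] to a reliability class.
    The thresholds below are illustrative only."""
    if load < 0.5:
        return "high"
    elif load < 0.8:
        return "medium"
    return "low"

def match_tasks(tasks, node_loads):
    """Greedily send the most critical tasks to the most reliable
    (least loaded) nodes -- a simplified stand-in for JSBNR's
    node-task matching method."""
    by_reliability = sorted(node_loads, key=node_loads.get)  # least loaded first
    critical_first = sorted(tasks, key=lambda t: t["priority"], reverse=True)
    return {t["id"]: by_reliability[i % len(by_reliability)]
            for i, t in enumerate(critical_first)}

loads = {"n1": 0.9, "n2": 0.3, "n3": 0.6}
tasks = [{"id": "t1", "priority": 1}, {"id": "t2", "priority": 5}]
assignment = match_tasks(tasks, loads)
# the high-priority task t2 goes to the lightly loaded node n2
```

Classifying nodes by load before matching means a heavily loaded (low-reliability) node is only chosen when the more reliable classes are exhausted, which is how such a policy can raise both cluster reliability and resource utilization.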