| With the continuous development of Internet technology, networks and enterprise production systems must handle ever-growing volumes of data, and cloud computing has become a popular computing model for big data processing. Hadoop, an open-source platform for cloud computing, quickly became a mainstream big data processing technology. As Hadoop clusters have come into wide use, their performance has attracted increasing attention. Load balancing plays an important role in cluster performance and is the focus of this thesis. We study and analyze the load balancing problem in MapReduce jobs with the aim of optimizing performance. In a heterogeneous environment, nodes differ in computing capacity. During MapReduce task scheduling, the task load is distributed unevenly, so individual nodes take too long to finish and lengthen the response time of the whole job. This thesis presents a task scheduling algorithm based on load balancing. By analyzing the characteristics of the tasks and the performance of the nodes in a heterogeneous cluster, the algorithm derives a load balance metric for task scheduling, which provides the basis for assigning tasks to nodes so that the load on each node matches its performance. During task execution, a node communication model enables dynamic adjustment of the load, which keeps task scheduling load-balanced. The default hash partitioning mechanism in MapReduce causes data skew when processing data-intensive workloads. This thesis proposes a partition cost model to evaluate the load balance of partitions, together with a new fine-grained partitioning algorithm that increases the number of partitions, reduces the skewed data within each partition, and uses the partition cost model to keep the data received by each node relatively balanced. Finally, an experimental environment is set up and corresponding experiments are designed to verify the task scheduling algorithm and the fine-grained partitioning algorithm, both of which aim to optimize cluster load balancing. |
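
The fine-grained partitioning idea summarized above can be illustrated with a minimal sketch. This is not the thesis's algorithm; it only shows the general technique under stated assumptions: the cost of a partition is taken to be its record count, keys are hashed with CRC32 into many more micro-partitions than there are reducers, and micro-partitions are assigned to reducers with a greedy longest-processing-time heuristic. The function names (`micro_partition`, `partition_costs`, `assign_to_reducers`) are hypothetical.

```python
import heapq
import zlib


def micro_partition(key: str, num_micro: int) -> int:
    """Map a key to one of num_micro fine-grained partitions (stable hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_micro


def partition_costs(keys, num_micro):
    """Assumed cost model: cost of a micro-partition = records it receives."""
    costs = {p: 0 for p in range(num_micro)}
    for key in keys:
        costs[micro_partition(key, num_micro)] += 1
    return costs


def assign_to_reducers(costs, num_reducers):
    """Greedy assignment: heaviest micro-partition goes to the least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # entries are (load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    # Visit micro-partitions from most to least costly (ties broken by id).
    for part, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[part] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    loads = [0] * num_reducers
    for part, reducer in assignment.items():
        loads[reducer] += costs[part]
    return assignment, loads


if __name__ == "__main__":
    # Skewed toy input: one hot key plus many distinct keys.
    keys = ["hot"] * 500 + [f"k{i}" for i in range(1500)]
    costs = partition_costs(keys, num_micro=64)
    _, loads = assign_to_reducers(costs, num_reducers=4)
    print("records per reducer:", loads)
```

With this greedy rule, the gap between the most and least loaded reducer never exceeds the cost of the single heaviest micro-partition, which is why splitting the key space more finely tends to tighten the balance. A single hot key still lands in one micro-partition, so this sketch does not address skew within a key.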