Font Size: a A A

Research Of Optimization Of Hadoop MapReduce Shuffle Phase

Posted on:2017-01-21Degree:MasterType:Thesis
Country:ChinaCandidate:K K HuangFull Text:PDF
GTID:2348330503972461Subject:Computer technology
Abstract/Summary:PDF Full Text Request
Hadoop MapReduce is a popular open-source programming framework for distributed computing which is based on the Google MapReduce programming model. After a detailed analysis of the data transmission between tasks during a Hadoop MapReduce job by reading the source code, we can see that every Map Task should create a connection to every Reduce Task during the MapReduce transmission process of the intermediate data. So a job which includes many Map Tasks and Reduce Tasks in it can create plenty of connections in the cluster when it is running. These connections may cause network congestion and slow down the job. Some of the key-value pairs of a single Map Task have so little data to transmit to Reduce Tasks after the establishment of the link that the task can not make good use of the link. In addition, Hadoop uses HTTP network for data transmission as default, this lowspeed network protocol is a limiting factor in system performance.In order to reduce the heavy network load caused by the large number of link connections created in MapReduce process and optimize the network I/O of Hadoop Shuffle phase, an improved scheme is to group the Map Tasks and merge the intermediate data of one group before being transmitted to Reduce Tasks. A leader Map Task is chosen in every task group to collect all the other tasks' output data in the group. Reduce Tasks can get intermediate data of the entire group directly from the leader task instead of fetching data from every Map Task in the group. Shuffle scheduler service is rewritten using specific strategies to achieve Map Task grouping. The amount of data transmission links is greatly reduced while the job is running in the cluster. Also grouping makes the data gathered and the amount of data of every single partition in the final output files is increased so the link utilization is improved. Moreover using high speed RDMA network protocol as data exchanging mode can further improve Hadoop system performance.System test results show that this scheme can make a certain upgrade to the system performance. In particular, when the cluster runs other kinds of computing frameworks in the meantime which makes the network very busy, however network congestion has little impact on the improved system. After accelertated by RDMA network Hadoop MapRedcue system shows a considerable improvement in performance.
Keywords/Search Tags:Hadoop, Map Reduce, Shuffle optimization, parallel programming model
PDF Full Text Request
Related items