The Source Code Analysis And Performance Improvement Of MapReduce

Posted on:2016-06-21

Degree:Master

Type:Thesis

Country:China

Candidate:M Guo

Full Text:PDF

GTID:2428330482981287

Subject:Systems analysis and integration

Abstract/Summary:

With the rapid development of the Internet and mobile terminals, and their increasingly wide use in daily life, the volume of data produced keeps growing, leading us into an age of "information explosion". Against this background, how to process this rapidly expanding data both efficiently and safely has become an urgent problem. Hadoop is an open-source distributed computing platform created in response to this need; it uses the computing and storage capacity of a cluster to process large data sets. The Hadoop Distributed File System (HDFS) and MapReduce (an open-source implementation of Google MapReduce) are the core of Hadoop. Hadoop provides users with a distributed infrastructure whose underlying details are transparent. As a core component of Hadoop, MapReduce is a programming model for parallel computation over large-scale data sets, and it determines the performance and efficiency of Hadoop when processing large data.

This thesis begins with the background against which Hadoop and MapReduce came into being and a description of the functions of MapReduce, followed by an analysis of the execution process and framework of MapReduce and the details involved. It then analyzes the execution process of MapReduce from the perspective of the source code, with priority given to the core function code involved.

When a certain proportion of the Map tasks have completed, the Reduce nodes start copying the data the Map tasks have generated, before beginning their own tasks. This process between Map and Reduce is called Shuffle. Shuffle, the link between Map and Reduce, is often described as "where the magic happens"; it contains many important details, some of which are among the most vital parts of a MapReduce run. A good understanding of Shuffle is therefore a great help in improving the performance of MapReduce. In a real cluster, Map and Reduce tasks usually run on different machines, and there are many Map nodes but only a few Reduce nodes, or even a single one, so the Map output can only be pulled over the network. In addition, most of the work of a MapReduce job is done on the Map side, which generates a relatively large amount of result data. The amount of Map-side result data copied over the network is therefore considerable. Network bandwidth is a valuable resource in large-scale clusters, and transmitting large amounts of result data over the network is time-consuming and error-prone. As a result, the step in which the Reduce side pulls a large amount of result data from the Map side has become the performance bottleneck of MapReduce execution.

The thesis analyzes the Shuffle process in detail. Based on the source code analysis in the earlier chapters, the author identifies this performance bottleneck and proposes an improvement: combine the large amount of temporary result data produced by the many Map tasks of the same job on a Map node, replacing the mechanism in the original MapReduce architecture that combines the result data of each Map task separately. This addresses the large volume of result data held on a Map node in the original design, and the time-consuming, failure-prone copying of that data by the Reduce side. With this improvement, the amount of output result data on each Map node decreases, so the amount of data transmitted across the whole cluster decreases to a great extent and the failure rate of data transmission is reduced. Meanwhile, the execution time of a MapReduce job is reduced to some extent, which improves the execution performance of MapReduce.
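
The difference between combining per Map task and combining per Map node can be illustrated with a small sketch. This is not the thesis's implementation and does not use Hadoop's API; it is a plain Python simulation that assumes a word-count style job, and the function names (`map_task`, `combine`, `per_task_combine`, `node_level_combine`, `shuffled_records`) are hypothetical names chosen for illustration. It only counts how many key/value records a Reduce side would have to pull from one node under each scheme.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def map_task(split: str) -> List[Pair]:
    """Word-count style map: emit (word, 1) for every word in the input split."""
    return [(word, 1) for word in split.split()]


def combine(pairs: Iterable[Pair]) -> Dict[str, int]:
    """Sum the values of each key, as a word-count combiner would."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def per_task_combine(splits: List[str]) -> List[Dict[str, int]]:
    """Baseline: each Map task combines only its own output."""
    return [combine(map_task(split)) for split in splits]


def node_level_combine(splits: List[str]) -> Dict[str, int]:
    """Assumed improvement: merge the combined outputs of all Map tasks
    of the same job that ran on this node into a single result."""
    merged: Counter = Counter()
    for task_output in per_task_combine(splits):
        merged.update(task_output)  # adds counts key by key
    return dict(merged)


def shuffled_records(outputs: List[Dict[str, int]]) -> int:
    """Number of key/value records the Reduce side must copy."""
    return sum(len(output) for output in outputs)


if __name__ == "__main__":
    # Three input splits processed by three Map tasks on the same node.
    splits = ["a b a", "b c", "a c c"]
    print(shuffled_records(per_task_combine(splits)))      # 6
    print(shuffled_records([node_level_combine(splits)]))  # 3
```

In this toy input the node-level merge halves the number of records to be copied, because keys that repeat across Map tasks on the same node are collapsed once more before the Shuffle. The actual saving in a real cluster depends on how often keys repeat across the tasks placed on one node.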

Keywords/Search Tags:

Hadoop, MapReduce, source code, Shuffle, performance

Related items

1	Research Of Optimization Of Hadoop MapReduce Shuffle Phase
2	The Research Of Improving Performance Of Hadoop Cluster
3	The Optimization Of High Performance MapReduce FairScheduler And The Implementation On Simulator Of Huge Scale Cluster
4	Task Scheduling And Shuffle Scheduling For MapReduce Jobs
5	MDE-Based Approach For Mapreduce Bigdata Transformation Software Development
6	Research On The Performance And Optimization Of MapReduce Model In Hadoop Platform
7	The Mapreduce Model In The Hadoop Implementation Of Performance Analysis And Optimization Improvements
8	Design Of Mapreduce Task Scheduling Algorithms In Heterogeneous Hadoop Cluster
9	The Research Of MapReduce Job Scheduling Algorithm Based On The Hadoop Platform
10	The Performance Optimization And Improvement Of MapReduce In Hadoop