Research on Performance Optimization of Data Access in MapReduce

Posted on: 2014-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2268330422464733
Subject: Computer application technology
Abstract/Summary:
The MapReduce programming model is a central data processing technique of the big data era and is widely used across industry. The MapReduce framework hides the complex low-level operations that parallel programming used to require and gives developers a hardware-transparent development environment, allowing them to focus on their specific cloud computing applications. Although MapReduce has many advantages, the complexity of distributed computing and the diversity of applications mean that it cannot keep up with the rapid development of cloud computing, so it is necessary to optimize MapReduce to improve its performance.

MapReduce assumes that the computing nodes in a cluster are homogeneous, but this assumption makes Map task scheduling inefficient in heterogeneous environments. In addition, a Map task accesses its input data once the metadata of its input split is available, which makes the data access overhead very high. To address these drawbacks, we design a preschedule and prefetch technique that assigns each task to the best node based on node capacity and data distribution. The task's data is also loaded into the node's memory before the task executes, which minimizes the data access overhead. Furthermore, the MapReduce implementation virtualizes all storage and computation resources in the cluster, and concurrently executing tasks communicate and share data over RPC. This hides many influencing factors and simplifies data transfer, but it increases task execution time. We design a preshuffle technique that merges intermediate results, buffers them in local memory, and transfers them to the Reduce tasks through a pipeline, which increases the efficiency of intermediate data transfer.

In summary, we integrate the optimized techniques with native MapReduce and test them with Hadoop benchmark applications. Experimental results show the validity of the techniques proposed in this dissertation.
Keywords/Search Tags: Data Prefetch, Tasks Schedule, Distributed Computing, Cloud Computing