Cloud computing, as a new service model, has become one of the most influential Internet technologies of the twenty-first century. In recent years, cloud computing has ranked highly among Internet technologies, and many well-known domestic and foreign IT companies regard it as a primary direction of their technology development strategy. Cloud computing has gradually begun to change the way people work and the traditional software process, providing the public with cheap and convenient IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service) offerings.

With the rapid development of the Internet, the amounts of stored and transmitted data have grown geometrically compared with the previous period. From 2006 to 2010, the total volume of global information grew more than 6 times; in 2010, 1.2 ZB of information was produced, with a growth rate of 50%, and by 2020 this figure was expected to reach 35 ZB. Traditional storage architectures cannot be expanded flexibly due to their structural limitations, so they cannot keep pace with data growth or effectively store, manage, and transmit unstructured data. Faced with petabyte-scale storage needs, traditional architectures run into capacity and performance bottlenecks when expanded, and previously dispersed, fragmented deployments easily form islands of information. The processing and analysis of huge amounts of data has therefore become an important issue.

MapReduce is widely used today as a tool for processing massive distributed data because it is easy to scale, fault tolerant, and cheap, and it has been applied in many fields. However, because its design distributes intermediate keys to Reduce tasks with a uniform partitioning algorithm, skewed input data leads to an imbalanced distribution and "straggler" reduce operations, which ultimately degrade the overall job performance.
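To illustrate the skew problem described above, the following sketch mimics Hadoop's default hash partitioner (`hash(key) % R`) on a skewed key set; the data and function names here are hypothetical, chosen only to show how one hot key overloads a single reducer.

```python
from collections import Counter

def hash_partition(keys, num_reducers):
    """Mimic the default MapReduce partitioner: reducer = hash(key) % R.
    Returns the number of records each reducer would receive."""
    load = Counter()
    for key in keys:
        load[hash(key) % num_reducers] += 1
    return load

# Hypothetical skewed intermediate keys: one "hot" key dominates.
keys = ["hot"] * 90 + ["a", "b", "c", "d", "e"] * 2
load = hash_partition(keys, num_reducers=4)
# All 90 "hot" records hash to the same reducer, so that reducer
# processes at least 90% of the data while the others sit nearly
# idle -- the "straggler" effect that delays the whole job.
```

Because every copy of a key hashes to the same reducer, no amount of extra reducers fixes this: the hot key's reducer remains the bottleneck.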
Most current solutions to this problem desynchronize the Map and Reduce phases, collect the distribution of key values in advance, and then produce a distribution plan, but this wastes a great deal of time. This paper examines how to distribute the intermediate keys efficiently so as to balance the data on the Reduce side. A dedicated sampling procedure estimates the overall frequency distribution of the keys and produces a distribution strategy in advance, which is then applied in the partitioning process of MapReduce. This design not only provides a balanced data distribution mode but can also improve the synchronization performance of MapReduce. The sampling stage offers two options: sub-portfolio optimization and sub-division optimization. The experimental results show that the first method is suitable for cases with relatively little data, while the second yields a more balanced overall running time when the data is seriously skewed.
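The idea of building a distribution plan from a sample can be sketched as follows. This is a simplified stand-in for the paper's sampling strategies, not the actual algorithm: it counts key frequencies in a sample and greedily assigns the heaviest keys first, each to the currently least-loaded reducer. The function name and sample data are illustrative assumptions.

```python
from collections import Counter
import heapq

def plan_partitions(sample_keys, num_reducers):
    """Build a key -> reducer plan from sampled keys: assign keys in
    decreasing frequency order, each to the least-loaded reducer so far
    (a greedy longest-processing-time heuristic)."""
    freq = Counter(sample_keys)
    heap = [(0, r) for r in range(num_reducers)]  # (estimated load, reducer id)
    heapq.heapify(heap)
    plan = {}
    for key, count in freq.most_common():
        load, r = heapq.heappop(heap)
        plan[key] = r
        heapq.heappush(heap, (load + count, r))
    return plan

# Hypothetical skewed sample: the plan spreads the heavy keys so that
# the two reducers receive roughly equal record counts (50 / 50 here),
# instead of hashing "x" and "y" onto the same overloaded reducer.
sample = ["x"] * 50 + ["y"] * 30 + ["z"] * 10 + ["w"] * 10
plan = plan_partitions(sample, num_reducers=2)
```

The resulting plan replaces the default hash partitioner at map time, so each emitted record is routed by a table lookup rather than `hash(key) % R`.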