Cloud computing, as a new service model, has become one of the most influential Internet technologies of the twenty-first century. In recent years, cloud computing has ranked highly among Internet technologies, and many well-known domestic and foreign IT companies regard it as a primary direction of their technology development strategy. Cloud computing has gradually begun to change the way people work and the traditional software process, providing the public with cheap and convenient IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service) offerings.

With the rapid development of the Internet, the amounts of stored and transmitted data have grown geometrically compared with the previous period. From 2006 to 2010, the total volume of global information grew more than 6 times; in 2010, 1.2 ZB of information was produced, with a growth rate of 50%, and by 2020 this figure was expected to reach 35 ZB. Traditional storage architectures cannot be expanded flexibly due to their structural limitations, so they cannot keep pace with data growth or effectively store, manage, and transmit unstructured data. Faced with petabyte-scale storage needs, traditional architectures run into capacity and performance bottlenecks when expanded, and previously dispersed, fragmented deployments easily form islands of information. The processing and analysis of huge amounts of data has therefore become an important issue.

MapReduce is widely used today as a tool for processing massive distributed data because it is easy to scale, fault tolerant, and cheap, and it has been applied in many fields. However, because its design distributes intermediate keys to Reduce tasks with a uniform partitioning algorithm, skewed input data leads to an imbalanced distribution and "straggler" reduce operations, which ultimately degrade the overall job performance.
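To illustrate the skew problem described above, the following sketch mimics Hadoop's default hash partitioner (`hash(key) % R`) on a skewed key set; the data and function names here are hypothetical, chosen only to show how one hot key overloads a single reducer.

```python
from collections import Counter

def hash_partition(keys, num_reducers):
    """Mimic the default MapReduce partitioner: reducer = hash(key) % R.
    Returns the number of records each reducer would receive."""
    load = Counter()
    for key in keys:
        load[hash(key) % num_reducers] += 1
    return load

# Hypothetical skewed intermediate keys: one "hot" key dominates.
keys = ["hot"] * 90 + ["a", "b", "c", "d", "e"] * 2
load = hash_partition(keys, num_reducers=4)
# All 90 "hot" records hash to the same reducer, so that reducer
# processes at least 90% of the data while the others sit nearly
# idle -- the "straggler" effect that delays the whole job.
```

Because every copy of a key hashes to the same reducer, no amount of extra reducers fixes this: the hot key's reducer remains the bottleneck.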
Most current solutions to this problem desynchronize the Map and Reduce phases, collect the distribution of key values in advance, and then produce a distribution plan, but this wastes a great deal of time. This paper examines how to distribute the intermediate keys efficiently so as to balance the data on the Reduce side. A dedicated sampling procedure estimates the overall frequency distribution of the keys and produces a distribution strategy in advance, which is then applied in the partitioning process of MapReduce. This design not only provides a balanced data distribution mode but can also improve the synchronization performance of MapReduce. The sampling stage offers two options: sub-portfolio optimization and sub-division optimization. The experimental results show that the first method is suitable for cases with relatively little data, while the second yields a more balanced overall running time when the data is seriously skewed.
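The idea of building a distribution plan from a sample can be sketched as follows. This is a simplified stand-in for the paper's sampling strategies, not the actual algorithm: it counts key frequencies in a sample and greedily assigns the heaviest keys first, each to the currently least-loaded reducer. The function name and sample data are illustrative assumptions.

```python
from collections import Counter
import heapq

def plan_partitions(sample_keys, num_reducers):
    """Build a key -> reducer plan from sampled keys: assign keys in
    decreasing frequency order, each to the least-loaded reducer so far
    (a greedy longest-processing-time heuristic)."""
    freq = Counter(sample_keys)
    heap = [(0, r) for r in range(num_reducers)]  # (estimated load, reducer id)
    heapq.heapify(heap)
    plan = {}
    for key, count in freq.most_common():
        load, r = heapq.heappop(heap)
        plan[key] = r
        heapq.heappush(heap, (load + count, r))
    return plan

# Hypothetical skewed sample: the plan spreads the heavy keys so that
# the two reducers receive roughly equal record counts (50 / 50 here),
# instead of hashing "x" and "y" onto the same overloaded reducer.
sample = ["x"] * 50 + ["y"] * 30 + ["z"] * 10 + ["w"] * 10
plan = plan_partitions(sample, num_reducers=2)
```

The resulting plan replaces the default hash partitioner at map time, so each emitted record is routed by a table lookup rather than `hash(key) % R`.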