With the advent of the era of data explosion, efficiently processing terabyte- and even petabyte-scale data has become an urgent problem. Driven by application requirements and advancing technology, cloud computing was proposed as a new computing model and has gradually become a main theme of the IT industry. Hadoop, a distributed computing platform, is an open-source implementation of cloud computing; its major components are HDFS (Hadoop Distributed File System) and the MapReduce computing model. As a framework for cloud computing and large-scale data processing, MapReduce is widely used in major enterprises. In practice, however, MapReduce still has much room for improvement, especially in its scheduling mechanism: tasks are distributed unevenly, and the original scheduling approach wastes considerable resources and I/O.

This paper focuses on the resource waste and low efficiency that arise when iterative IBM Platform MapReduce jobs repeatedly read the same data from the file system. By tracking customer reports and analyzing performance, it derives functional requirements, namely a splitting cache and cache-aware scheduling, as well as performance requirements, namely improving the efficiency of the K-means algorithm and ensuring scalability. Based on these requirements, it proposes a solution: between HDFS and the Map tasks, data shared across jobs is stored in a cache, the cache information is managed, and the cached information is reported to the master management node. This reduces reads from HDFS, reduces the occupation of local disk space, and shortens job runtime, addressing the resource waste and inefficiency of massive-data analysis.

The paper then presents the design and implementation of cache-aware scheduling for the MapReduce platform, comprising two subsystems: the splitting cache and cache-aware scheduling. The design and implementation of the splitting cache subsystem mainly covers the splitting-cache status judgment module, the splitting-cache registration module, and the splitting-cache expiration management module, so that data shared across jobs is kept in the in-memory cache instead of being read repeatedly from HDFS. The design and implementation of the cache-aware scheduling subsystem mainly covers the SSM-to-MRSS connection module, the MRSS storage and update module, the SSM storage and update module, the SSM scheduling module, and the fault-tolerance module, so that the master management node knows the splitting-cache information and the list of hosts holding cached splits. The management node can therefore assign tasks to the computing nodes rationally, optimizing resource usage and improving the efficiency of data processing.

The test results of this paper show that with the splitting cache and cache-aware scheduling features enabled, iterative operations on massive data run significantly faster and job runtime is significantly reduced. In addition, compared with standard Hadoop, Hadoop with the splitting cache improves performance by around 33% and increases the efficiency of the K-means algorithm. The test results indicate that the tests have passed and the requirements have been met.
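To make the cache-aware scheduling idea concrete, the following is a minimal sketch of how a master node might pick a host for a pending split: prefer a host that already holds the split in its in-memory cache, then an HDFS replica host, then any host with a free slot. All names (`assign_task`, `cached_hosts`, etc.) are hypothetical illustrations, not identifiers from the IBM Platform MapReduce implementation described above.

```python
# Hedged sketch of cache-aware task assignment; names are illustrative only.

def assign_task(split_id, cached_hosts, hdfs_replica_hosts, free_slots):
    """Return (host, locality) for one map task on the given split.

    cached_hosts:       split_id -> hosts holding the split in memory cache
    hdfs_replica_hosts: split_id -> hosts holding an HDFS replica on disk
    free_slots:         host -> number of free task slots
    """
    # 1. Cache-local: avoids both HDFS reads and local-disk spill.
    for host in cached_hosts.get(split_id, []):
        if free_slots.get(host, 0) > 0:
            return host, "cache-local"
    # 2. Data-local: read the split from a local HDFS replica.
    for host in hdfs_replica_hosts.get(split_id, []):
        if free_slots.get(host, 0) > 0:
            return host, "data-local"
    # 3. Remote: any free slot; the split is fetched over the network.
    for host, slots in free_slots.items():
        if slots > 0:
            return host, "remote"
    return None, "wait"  # no capacity; task stays queued


if __name__ == "__main__":
    cached = {"split-1": ["nodeA"]}
    replicas = {"split-1": ["nodeB", "nodeC"]}
    slots = {"nodeA": 1, "nodeB": 2, "nodeC": 0}
    print(assign_task("split-1", cached, replicas, slots))
    print(assign_task("split-2", cached, replicas, slots))
```

On a second or later iteration of a job such as K-means, the split is typically already cached, so the first branch fires and the HDFS read is skipped entirely, which is the source of the runtime improvement reported above.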