The wide use of smartphones and the Internet has brought new challenges to massive data analysis and processing. Traditional data processing methods need days or even months to handle massive data, whereas distributed computing improves efficiency by distributing subsets of a complex computation across multiple machines. The MapReduce framework based on Hadoop is widely used for batch processing. Owing to its high level of abstraction, MapReduce offers an extremely simple programming model, and this usability is one reason for its wide adoption. Hadoop is designed mainly for batch processing of massive data and achieves good throughput on large data sets, but it is unsuitable for real-time scenarios: task startup and disk reads in Hadoop take considerable time, causing delays on the order of seconds. The Storm framework developed by Twitter performs well in real-time processing; it is a real-time distributed computing system with high fault tolerance. Storm Trident adds batch processing to Storm, but its programming model is complex, its data transfer carries a risk of memory overflow, Checkpoint-based fault tolerance cannot be used with it, and there is a risk of inconsistent distributed state.

This paper studies distributed parallel processing technology and proposes a Storm-based distributed computing framework with a simple programming model, low latency, and efficient fault tolerance. The main work is as follows:

(1) The low-level primitives of Storm are abstracted into a distributed incremental computing framework named MapReduceMerge; migrating a program from MapReduce to MapReduceMerge is straightforward.

(2) The data transmission from Map to Reduce within a batch is designed to cope with overflow at the Reduce side.
Push-based transfer with independent coordination improves transmission performance, and a hash factor that balances task assignment across batches improves system throughput on skewed data.

(3) A Checkpoint-based fault-tolerance mechanism for the Merge state reduces the impact of fault tolerance on computing performance. Checkpointing at batch intervals improves computation performance at the cost of longer error recovery; asynchronous state storage keeps normal computation undisturbed when computing resources are sufficient; and multi-version data ensures consistency in the distributed environment.

(4) Snapshots of intermediate results keep the merging of state ordered and idempotent, avoiding repeated computation over the same data and reducing recovery time after a failure. A master-slave collaboration scheme guarantees the ordering of Checkpoints.

To verify the advantages of the proposed framework, a distributed computing environment is built and test cases are implemented on the proposed framework, on MapReduce, and on Storm. The results show that the proposed framework speeds up distributed incremental computing and achieves fault tolerance for incremental results without sacrificing performance.
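The map-reduce-merge pattern behind contribution (1) can be sketched with a word-count example: map emits key-value pairs per batch, reduce aggregates within the batch, and merge folds the batch result into long-lived state, which is what makes the computation incremental. This is an illustrative sketch only; the function names are assumptions, not the MapReduceMerge API.

```python
from collections import defaultdict

def map_phase(batch):
    """Map: emit (word, 1) pairs for each record in the batch."""
    for record in batch:
        for word in record.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: aggregate counts within a single batch."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

def merge_phase(state, batch_result):
    """Merge: fold the batch result into persistent state."""
    for key, value in batch_result.items():
        state[key] = state.get(key, 0) + value
    return state

state = {}
for batch in (["a b a"], ["b c"]):
    state = merge_phase(state, reduce_phase(map_phase(batch)))
# state now holds counts accumulated across both batches: {"a": 2, "b": 2, "c": 1}
```

An existing MapReduce word count already supplies the first two functions, which is why the abstract describes the transition to the merge-extended model as straightforward.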
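One way to read the hash-factor balancing of contribution (2): a salt derived from each record spreads a single hot key over a bounded number of reduce tasks, instead of sending all of its records to one task. The names and the salting scheme below are assumptions for illustration, not the thesis's actual mechanism.

```python
import hashlib

NUM_TASKS = 4
HASH_FACTOR = 2  # how many tasks a single hot key may be spread over

def task_for(key, record_seq):
    """Pick a reduce task for a record: the salt derived from the record
    sequence number spreads one key over at most HASH_FACTOR tasks."""
    salt = record_seq % HASH_FACTOR
    digest = hashlib.md5(f"{key}:{salt}".encode()).hexdigest()
    return int(digest, 16) % NUM_TASKS

# A skewed stream where one hot key dominates:
assignments = {task_for("hot", i) for i in range(100)}
# the hot key lands on at most HASH_FACTOR distinct tasks
```

The factor trades off skew resistance against merge cost: each hot key's partial results must later be combined from up to HASH_FACTOR tasks.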
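The ordered, idempotent merging of state in contributions (3) and (4) might look roughly like this: each batch carries a monotonically increasing id recorded with the snapshot, so a batch replayed after a failure is recognized and skipped. This is a simplified sketch under assumed semantics, not the framework's implementation.

```python
class VersionedState:
    """Merge state tagged with the id of the last merged batch."""

    def __init__(self):
        self.counts = {}
        self.last_batch = -1  # id of the last batch folded into state

    def merge(self, batch_id, batch_result):
        """Fold a batch result into state; ignore already-merged batches."""
        if batch_id <= self.last_batch:
            return False  # replay after failure: a no-op, so merging is idempotent
        for key, value in batch_result.items():
            self.counts[key] = self.counts.get(key, 0) + value
        self.last_batch = batch_id
        return True

state = VersionedState()
state.merge(0, {"a": 1})
state.merge(0, {"a": 1})  # duplicate delivery of batch 0: ignored
state.merge(1, {"a": 2})
# state.counts["a"] == 3 despite the duplicate
```

Persisting `(counts, last_batch)` atomically at Checkpoint time, and merging batches strictly in id order, is what gives the sequence and idempotence properties the abstract claims.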