| With the rapid development of the economy and of science, industries generate enormous volumes of data every day, much of it without any regular structure. Faced with data this large and complex, how should we use it, and how can we extract meaningful information from it in less time? The main purpose of this project is to build a general, flexible, and efficient engine for large-scale offline data processing. The engine is proposed because current big data processing engines lack generality. Its design uses a DAG (directed acyclic graph) model to define processing scenarios. The DAG model lets users flexibly change the execution order of the steps in a scenario to suit their own needs, and it makes it possible for users to plug in custom operators. It also helps the engine achieve high scalability, flexibility, and generality. To improve processing speed, the engine is built on the Spark computing framework. Spark can keep intermediate processing results in memory, which greatly reduces I/O cost during iterative data processing. Spark's internal design also makes it highly scalable, which meets the engine's requirements for scalability and flexibility. Finally, Spark is a distributed computing framework that supports DAG execution, so it is compatible with the DAG model chosen for the engine. Each operator in the engine represents one data processing function. The engine provides a number of operators and lets users define custom operators for their own processing requirements. The engine is a further encapsulation of Spark: users do not need to use the underlying Spark API when they define custom operators. 
The engine can connect to a variety of heterogeneous data sources, pull user-specified data from those sources into HDFS, and handle different file types. The engine has been put into use and is currently running well. It addresses the low efficiency and poor generality of existing big data processing systems. |
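
The operator-and-DAG idea described above can be illustrated with a minimal sketch. It is not the engine's implementation and does not use Spark: it is plain Python, and the `Dag` class, its `add` and `run` methods, and the operator names are all hypothetical, chosen only to show how user-defined operators might be wired into a directed acyclic graph and executed in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Sequence

# Hypothetical operator type: receives the outputs of its upstream
# operators (one list per upstream) and returns its own output list.
Operator = Callable[[List[list]], list]


class Dag:
    """Toy DAG of named operators (illustrative only, not the engine's API)."""

    def __init__(self) -> None:
        self.operators: Dict[str, Operator] = {}
        self.upstream: Dict[str, List[str]] = {}

    def add(self, name: str, op: Operator, after: Sequence[str] = ()) -> None:
        """Register an operator and the operators it depends on."""
        self.operators[name] = op
        self.upstream[name] = list(after)

    def run(self) -> Dict[str, list]:
        """Run every operator once, upstream operators first."""
        results: Dict[str, list] = {}
        # static_order() raises graphlib.CycleError if the graph has a cycle.
        for name in TopologicalSorter(self.upstream).static_order():
            inputs = [results[dep] for dep in self.upstream[name]]
            results[name] = self.operators[name](inputs)
        return results


# Usage: a scenario built from four user-defined operators.
dag = Dag()
dag.add("load", lambda _: [3, -1, 4, -1, 5])
dag.add("drop_negative", lambda ins: [x for x in ins[0] if x >= 0], after=["load"])
dag.add("double", lambda ins: [2 * x for x in ins[0]], after=["drop_negative"])
dag.add("total", lambda ins: [sum(ins[0])], after=["double"])
results = dag.run()
```

Changing the `after` arguments reorders the scenario without touching the operators themselves, which is the kind of flexibility the DAG model is meant to provide. In an engine built on Spark, each operator would presumably wrap a distributed transformation rather than a function over in-memory lists.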