Event tracking refers, in data collection, to capturing, processing, and reporting user actions at the "point of operation" where the data is generated. It is used to track how applications are used and provides operational data support, such as which pages are accessed and how many times they are visited. When a data-driven enterprise makes business decisions, it often needs a high-quality event tracking system to collect, process, store, and compute event data, so this thesis topic has practical application value.

Existing event tracking systems evolved from offline data warehouses and adopt the Lambda architecture to serve both offline and real-time demands. This is a general big data processing framework with the advantage of high stability, but it incurs high development and operations costs and is prone to inconsistencies between offline and real-time statistics. To address these disadvantages, this thesis designs and implements a system based on unified streaming and batch processing that can clean, process, store, and compute event data. Unified streaming and batch processing means using a single, unified method for both streaming and batch computation, guaranteeing that the processing logic and its results remain consistent. The system uses this approach to build an event warehouse that supports switching the execution mode of computing tasks within the same cluster, reducing development and operations costs. The system also defines a flexible and complete event format, which provides a solid foundation for subsequent data analysis. In the reporting process, the system reduces the volume of event data to be transmitted through batch reporting, extraction of common fields, and compression. In addition to integrating and scaling advanced big data storage and processing systems, the thesis also meets non-functional requirements by improving the system workflow and architecture, such
as using a publish-subscribe pattern between modules to improve reliability. Following software engineering methodology, the thesis carries out requirement analysis, overall architecture design, module design, and code implementation for the tracking system. The system uses Apache Flink as the unified streaming and batch processing engine, uses Apache Kafka to buffer event data as a data source for downstream consumers, and uses the table format Apache Hudi to support both streaming and batch reads and writes. The system has passed functional and non-functional tests, covers the whole pipeline of event data collection, processing, storage, and computation, and supports enterprises in product improvement, operations optimization, marketing analysis, and business decision-making, thereby helping the business grow.
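The reporting optimization mentioned above (batch reporting, extraction of common fields, and compression) can be sketched as follows. This is a minimal illustration, not the thesis's actual event format: the field names (`device_id`, `app_version`, `user_id`) and the JSON-plus-gzip wire format are assumptions chosen for the example.

```python
import gzip
import json

# Fields that tend to repeat across every event in a batch and are
# therefore worth lifting into a shared header (illustrative names).
COMMON_FIELDS = ("device_id", "app_version", "user_id")

def build_report(events):
    """Pack a batch of event dicts into one compressed payload.

    Fields whose value is identical across the whole batch are moved
    into a single "common" header, then the payload is gzip-compressed.
    """
    if not events:
        return gzip.compress(b"{}")
    common = {
        k: events[0][k]
        for k in COMMON_FIELDS
        if k in events[0] and all(e.get(k) == events[0][k] for e in events)
    }
    slim = [
        {k: v for k, v in e.items() if k not in common}
        for e in events
    ]
    body = json.dumps({"common": common, "events": slim},
                      separators=(",", ":"))
    return gzip.compress(body.encode("utf-8"))

def parse_report(payload):
    """Inverse of build_report: restore the full per-event records."""
    doc = json.loads(gzip.decompress(payload).decode("utf-8"))
    common = doc.get("common", {})
    return [{**common, **e} for e in doc.get("events", [])]
```

Batching amortizes per-request overhead, the common/per-event split removes redundant fields before they are ever serialized, and compression shrinks what remains; the server side reverses all three steps losslessly.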
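The core idea of unified streaming and batch processing can be shown with a toy sketch: the aggregation logic is written once and reused by both a batch runner (which sees the whole bounded dataset) and a streaming runner (which sees events one at a time), so the two modes cannot produce inconsistent results. This is only an illustration of the principle; in the actual system this role is played by Apache Flink, whose unified DataStream API can execute the same job in streaming or batch mode as a configuration choice rather than a code change.

```python
from collections import Counter
from typing import Iterable, Iterator

def aggregate(counts: Counter, event: dict) -> Counter:
    """Single processing definition: count page views per page."""
    if event.get("event") == "page_view":
        counts[event["page"]] += 1
    return counts

def run_batch(events: Iterable[dict]) -> Counter:
    """Batch mode: fold the whole bounded input at once."""
    counts = Counter()
    for e in events:
        aggregate(counts, e)
    return counts

def run_streaming(events: Iterable[dict]) -> Iterator[Counter]:
    """Streaming mode: emit an updated result after every event."""
    counts = Counter()
    for e in events:
        yield aggregate(counts, e).copy()
```

Because both runners delegate to the same `aggregate` function, the final streaming emission necessarily equals the batch result, which is exactly the consistency guarantee the thesis seeks from unified processing.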