With the development of Internet technology, the speed and scale of data generation in many real-world applications are increasing dramatically. Real-time stock trading systems, for example, produce thousands of transaction records every second and rely on data analysis systems for efficient data storage and analysis. At the same time, to detect and handle potential transaction threats, transaction data must be analyzed under tight time constraints, which imposes real-time requirements on data analysis systems. Meanwhile, advances in manufacturing have fundamentally changed computer architecture. In recent years, the prices of servers with large memory capacity and powerful multicore processors have been decreasing steadily, and the NUMA architecture has become the de facto standard for enterprise multicore servers. Over the last decade, many in-memory data analysis systems, such as MonetDB and SAP HANA, have emerged to meet the growing demand for data storage and analysis in real applications by taking advantage of this new hardware, and in-memory data processing has become a hot topic in both industry and academia. Although a large body of work focuses on optimizing data processing performance in centralized systems, such systems fail to meet the increasing demand for analysis of massive data sets, since their scalability is inherently limited by the memory capacity and the number of CPU cores of a single server.

Compared with centralized data processing systems, distributed in-memory data processing systems running on a cluster of servers have much richer computing and storage resources and are therefore better suited to real-time data analysis. However, in distributed in-memory databases, real-time data analysis must follow distributed transaction protocols, which causes severe performance degradation. To tackle this problem, this paper analyzes how data is generated in real applications and focuses on high-throughput distributed data processing techniques over an append-only data store. The goal of this paper is to improve processing performance for mixed workloads by proposing a new query processing engine that can fully leverage all kinds of hardware resources in the cluster. The major work and contributions of this paper are as follows.

1. A High-Throughput Parallel Data Processing Engine (HTPDPE) is proposed for queries with mixed workloads, i.e., running continuous queries while data is being imported simultaneously. The engine simplifies distributed transaction processing by decomposing distributed concurrency control into independent centralized concurrency control on each of the involved servers; a sketch of this decomposition appears at the end of this section.

2. A novel distributed in-memory database index, named ECSB-Trees, is proposed to replace traditional centralized tree-family indices. Many optimizations, such as cache-sensitive structures, carefully organized internal index nodes, index key compression, and better utilization of computation resources, have been made in ECSB-Trees to improve index performance and the throughput of the entire system (see the node-layout sketch at the end of this section).

3. Based on the characteristics of queries with mixed workloads, we propose a lightweight concurrency control method that combines Multi-Version Concurrency Control (MVCC) with a Copy-on-Write mechanism.
An index structure that maintains multiple versions is also proposed to guarantee the correctness and completeness of continuous queries, to support conflict-serializability analysis, and to reduce the probability of conflicts as much as possible (see the version-chain sketch at the end of this section).

4. We normalize the task input interfaces of HTPDPE so that they are compatible with existing database systems whose SQL parsers follow the SQL-92 standard. By using our engine, existing database systems can significantly improve their performance on queries with mixed workloads. In this paper, we integrate our engine into the CLAIMS system to enable real-time data importing and efficient continuous queries.
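To make the decomposition in contribution 1 concrete, the following is a minimal sketch, assuming hash partitioning by key over an append-only store; the names (Coordinator, Node, Append) are hypothetical and do not reflect the actual HTPDPE interface. Because every record is owned by exactly one server, each append is serialized by a purely local latch, and no cross-node commit protocol such as two-phase commit is needed.

    // Hypothetical sketch: distributed concurrency control degenerates into
    // independent single-node control once data ownership is disjoint.
    #include <cstdint>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <vector>

    struct Record { uint64_t key; std::string payload; };

    // Per-node state: a local append-only log guarded by a local latch.
    // Concurrency control never spans nodes because ownership is disjoint.
    struct Node {
      std::mutex latch;               // centralized (single-node) control
      std::vector<Record> appendLog;  // append-only store fragment
      void Append(const Record& r) {
        std::lock_guard<std::mutex> g(latch);
        appendLog.push_back(r);       // visible to local readers afterwards
      }
    };

    class Coordinator {
     public:
      explicit Coordinator(std::size_t n) : nodes_(n) {}
      // Route by hash of the key: a "distributed" append becomes exactly
      // one independent single-node append, with no cross-node coordination.
      void Append(const Record& r) {
        nodes_[std::hash<uint64_t>{}(r.key) % nodes_.size()].Append(r);
      }
     private:
      std::vector<Node> nodes_;
    };

    int main() {
      Coordinator c(4);
      for (uint64_t k = 0; k < 1000; ++k) c.Append({k, "tick"});
    }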
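The exact ECSB-Trees layout is defined by the paper itself; the sketch below only illustrates the general cache-sensitive node design that such indices build on (in the spirit of CSB+-Trees): keys are packed contiguously, the node is aligned to cache-line boundaries, and all children sit in one contiguous block so that a single base pointer replaces per-key child pointers. The fan-out and field widths are illustrative assumptions.

    // Illustrative cache-sensitive internal node (not the actual ECSB-Trees
    // definition). Packing 14 keys, a count, and one child base pointer keeps
    // the node at two cache lines, so a probe touches at most two lines.
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kCacheLine = 64;

    struct alignas(kCacheLine) InternalNode {
      static constexpr int kFanout = 14;
      uint32_t keys[kFanout];   // packed keys, scanned sequentially
      uint16_t numKeys;
      InternalNode* childBase;  // children are contiguous: child i = childBase + i

      // Sequential search over a packed key array is cache-friendly and,
      // at this fan-out, typically competitive with binary search.
      InternalNode* Child(uint32_t key) const {
        int i = 0;
        while (i < numKeys && keys[i] <= key) ++i;
        return childBase + i;
      }
    };

    static_assert(sizeof(InternalNode) % kCacheLine == 0,
                  "node occupies whole cache lines");

    int main() {
      InternalNode children[InternalNode::kFanout + 1] = {};
      InternalNode root = {};
      root.numKeys = 2;
      root.keys[0] = 10;
      root.keys[1] = 20;
      root.childBase = children;
      return root.Child(15) == children + 1 ? 0 : 1;  // 10 <= 15 < 20
    }

Storing children contiguously is the central design choice here: it removes per-key child pointers from the node, which lets more keys fit per cache line and improves scan locality.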
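For contribution 3, the following minimal sketch (again with hypothetical names, not the paper's implementation) shows MVCC combined with copy-on-write: a writer never modifies a version in place but prepends a freshly stamped copy to the version chain, so a continuous query reading at a fixed snapshot version is never disturbed by concurrent imports.

    // Hypothetical MVCC + copy-on-write version chain. Assumes a single
    // writer per record (one owner node, as in the partitioning sketch
    // above), so the load-then-store in Write is race-free.
    #include <cstdint>
    #include <memory>
    #include <string>

    struct Version {
      uint64_t begin;                       // commit timestamp of this version
      std::string value;
      std::shared_ptr<const Version> next;  // older version
    };

    class VersionedRecord {
     public:
      // Copy-on-write update: build a fresh version and link the old chain.
      void Write(std::string v, uint64_t commitTs) {
        auto head = std::atomic_load(&head_);
        auto nv = std::make_shared<const Version>(
            Version{commitTs, std::move(v), std::move(head)});
        std::atomic_store(&head_, nv);
      }
      // Snapshot read: newest version whose timestamp is <= readTs.
      const std::string* ReadAt(uint64_t readTs) const {
        for (auto v = std::atomic_load(&head_); v; v = v->next)
          if (v->begin <= readTs) return &v->value;
        return nullptr;  // record did not exist at readTs
      }
     private:
      std::shared_ptr<const Version> head_;
    };

    int main() {
      VersionedRecord r;
      r.Write("v1", 100);
      r.Write("v2", 200);
      const std::string* s = r.ReadAt(150);  // snapshot at ts 150 sees "v1"
      return (s && *s == "v1") ? 0 : 1;
    }

Because old versions remain reachable until no snapshot needs them, readers take no locks at all, which matches the lightweight concurrency control goal stated above.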