Tolerating Correlated Failures In Distributed Stream Processing Systems

Posted on:2020-06-11

Degree:Master

Type:Thesis

Country:China

Candidate:J J Zhan

GTID:2428330590958367

Subject:Computer application technology

Abstract/Summary:

With the rapid development of computer network technology and the growing variety of data acquisition methods, more and more fields need real-time processing of massive, high-speed data, which is characterized by high volume, strong timeliness, and fast arrival. Traditional processing models cannot handle such data within the time in which the results remain useful, and this led to the emergence of distributed stream processing systems (DSPSs). As the computing scale of large DSPSs expands, failures become the norm. In particular, serious correlated failures, in which a large number of nodes fail simultaneously because of a shared network component, power facility, or similar dependency, can cause long system downtime. Ensuring that the system recovers quickly from failures and remains available is therefore a key issue in a DSPS. Fault-tolerance techniques in existing stream processing systems fall into three types. The active standby approach runs a standby node alongside each working node so that a failed node can be replaced immediately, but it is expensive. The checkpoint approach periodically takes checkpoints and fetches the latest one when a failure occurs, but it incurs long recovery latency. The upstream backup approach stores all data at the upstream nodes and replays the stream data during failure recovery, which also incurs significant recovery delay. These techniques are usually applied to independent failures, and they cannot solve the latency problem caused by nodes waiting for recovery during correlated failures. In that case a large number of nodes must be restored and their state synchronized in a very short time, which poses great challenges for failure recovery in DSPSs. To address long recovery latency while tolerating correlated failures in DSPSs, we present Ares, a high-performance, fault-tolerant DSPS. Ares introduces a task allocation strategy that considers application topology, available resources, processing latency, and recovery latency to select the optimal allocation for each task. In the design of Ares, we use a game-theoretic approach to solve the FTS problem and propose a novel algorithm, Nirvana, based on best-response dynamics. We mathematically prove the existence of a Nash equilibrium in the game. We implement Ares atop Apache Storm and conduct comprehensive experiments to evaluate the design. The results show that, compared with existing designs, Ares achieves a 3.6x improvement in throughput and reduces processing latency and recovery time by 50.2% and 52.5%, respectively.
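
To make the idea of best-response dynamics concrete, the following is a minimal Python sketch on a toy allocation game. It is not the Nirvana algorithm and does not reproduce the cost model of Ares, neither of which is specified in this abstract. Everything in it is an assumption made for illustration: the names task_cost, best_response_dynamics and is_nash_equilibrium, the rack_of mapping that stands in for a shared failure domain, the penalty weight, and the cost itself (the number of other tasks on the same node, plus a penalty for every topology neighbour placed in the same rack). Because this toy cost is built from symmetric pairwise terms, every strict improvement by one task lowers a global potential, so the loop stops at a Nash equilibrium.

```python
def task_cost(task, node, assignment, edges, rack_of, penalty):
    """Hypothetical cost one task pays on `node`, given the other tasks' choices."""
    # Load term: other tasks on the same node compete for its resources.
    load = sum(1 for t, n in assignment.items() if t != task and n == node)
    # Correlation term: a topology neighbour in the same rack is assumed to
    # fail together with this task, which would lengthen recovery.
    neighbours = {b if a == task else a for a, b in edges if task in (a, b)}
    co_failed = sum(1 for t in neighbours if rack_of[assignment[t]] == rack_of[node])
    return load + penalty * co_failed


def best_response_dynamics(tasks, nodes, edges, rack_of, penalty=2.0, max_rounds=100):
    """Let each task in turn move to its cheapest node until nobody moves."""
    assignment = {t: nodes[0] for t in tasks}  # arbitrary starting allocation
    for _ in range(max_rounds):
        changed = False
        for t in tasks:
            def cost(n):
                return task_cost(t, n, assignment, edges, rack_of, penalty)
            best = min(nodes, key=cost)
            # Move only on a strict improvement, so the process cannot cycle.
            if cost(best) < cost(assignment[t]):
                assignment[t] = best
                changed = True
        if not changed:
            break
    return assignment


def is_nash_equilibrium(assignment, nodes, edges, rack_of, penalty=2.0):
    """True if no single task can lower its own cost by moving alone."""
    for t, current in assignment.items():
        here = task_cost(t, current, assignment, edges, rack_of, penalty)
        if any(task_cost(t, n, assignment, edges, rack_of, penalty) < here for n in nodes):
            return False
    return True


# Toy example: a four-operator chain placed on four nodes in two racks.
tasks = ["spout", "split", "count", "sink"]
edges = [("spout", "split"), ("split", "count"), ("count", "sink")]
nodes = ["n1", "n2", "n3", "n4"]
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}

placement = best_response_dynamics(tasks, nodes, edges, rack_of)
print(placement)
print(is_nash_equilibrium(placement, nodes, edges, rack_of))
```

In this sketch the penalty weight controls how strongly the allocation avoids putting adjacent operators in the same failure domain. A real system would also have to account for the processing latency added by cross-rack traffic, which the toy cost ignores.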

Keywords/Search Tags:

DSPS, Fault tolerance, Task allocation, Game theory

Related items

1	Design Of Dynamic Task Allocation Mechanism For NoC Manycore Fault-Tolerant System
2	Task Decomposition And Allocation Mechanism Of Ad Hoc Networks
3	Research On Efficient Fault Tolerance And Security Mechanisms Of Network Coding
4	Distributed Task Autonomous Allocation And Cooperative Control
5	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
6	Game Theory Based Resource Allocation Algorithm For D2D Communications In LTE-A Systems
7	Optimization Techniques Of Proactive Fault Tolerance For Large-scale High Performance Computing Systems
8	Research On The Allocation Mechanism Of IoT Edge Computing Resources Based On Game Theory
9	Research On Adaptive Task Allocation Algorithm In Wireless Sensor Network
10	Research On Dynamic Assignment And Coordination Of Emergency Task