
Tolerating Correlated Failures In Distributed Stream Processing Systems

Posted on: 2020-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: J J Zhan
GTID: 2428330590958367
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer network technology and the growing variety of data acquisition methods, the need for real-time processing of massive, high-speed data has emerged in more and more fields. Such data is characterized by high volume, strict timeliness requirements, and fast arrival. Because traditional processing models cannot finish processing data at this volume within the time in which the results remain useful, distributed stream processing systems (DSPSs) came into being. As the computing scale of large DSPSs expands, failures become the norm. In a DSPS, serious correlated failures, in which a large number of nodes fail simultaneously because of faults in network components, power facilities, etc., can result in long system downtime. How to ensure that the system recovers quickly from failures and remains available becomes a key issue in a DSPS.

Fault-tolerance techniques in existing stream processing systems can be classified into three types. The active standby approach runs a standby node alongside each worker node so that a failed node can be replaced immediately when a failure occurs, which is expensive. The checkpoint approach periodically takes checkpoints and fetches the latest one when a failure occurs, which incurs long recovery latency. The upstream backup approach stores all data at the upstream nodes and replays the stream data during failure recovery, which also incurs significant recovery delay. These techniques are usually designed for independent failures, and they cannot solve the latency problem caused by nodes waiting for recovery in correlated failures. A large number of nodes must be restored and their state synchronized in a very short time, which poses great challenges to failure recovery in DSPSs.

To solve the long recovery latency while tolerating correlated failures in DSPSs, we present Ares, a high-performance, fault-tolerant DSPS. Ares presents a novel task allocation strategy that considers application topology, available resources, processing latency, and recovery latency to select the optimal allocation for each task. In the design of Ares, we use a game-theoretic approach to solve the FTS problem and propose a novel Nirvana algorithm based on best-response dynamics. We mathematically prove the existence of a Nash equilibrium in the game. We implement Ares atop Apache Storm and conduct comprehensive experiments to evaluate this design. The results show that, compared to existing designs, Ares achieves a 3.6x improvement in throughput and reduces the processing latency and the recovery time by 50.2% and 52.5%, respectively.
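To make the idea of best-response dynamics concrete, the following is a minimal sketch of a generic task-placement game. It is not the Nirvana algorithm or the cost model of Ares, neither of which is given in this abstract. The cost function, the `weight` parameter, the notion of a failure domain (for example, a rack), and all function and variable names are assumptions chosen for illustration. Each task is a player that repeatedly moves to the node minimizing its own cost, where cost counts co-located tasks (a stand-in for processing contention) and penalizes topology neighbors placed in the same failure domain (a stand-in for recovery delay under correlated failures). Because every cost term here is a symmetric pairwise interaction, the game is a potential game, so the loop stops at a pure Nash equilibrium.

```python
from typing import Dict, FrozenSet, List, Set


def task_cost(task: str, node: str, assignment: Dict[str, str],
              edges: Set[FrozenSet[str]], domain_of: Dict[str, str],
              weight: float) -> float:
    """Hypothetical cost of running `task` on `node`, others held fixed."""
    cost = 0.0
    for other, other_node in assignment.items():
        if other == task:
            continue
        if other_node == node:
            cost += 1.0  # contention with a co-located task
        adjacent = frozenset((task, other)) in edges
        if adjacent and domain_of[other_node] == domain_of[node]:
            cost += weight  # neighbor shares the same failure domain
    return cost


def best_response_dynamics(tasks: List[str], nodes: List[str],
                           edges: Set[FrozenSet[str]],
                           domain_of: Dict[str, str],
                           weight: float = 2.0,
                           max_rounds: int = 100) -> Dict[str, str]:
    """Let each task move to its cheapest node until nobody wants to move."""
    assignment = {t: nodes[0] for t in tasks}  # arbitrary starting point
    for _ in range(max_rounds):
        changed = False
        for t in tasks:
            current = task_cost(t, assignment[t], assignment,
                                edges, domain_of, weight)
            best_node = min(nodes, key=lambda n: task_cost(
                t, n, assignment, edges, domain_of, weight))
            best = task_cost(t, best_node, assignment,
                             edges, domain_of, weight)
            if best < current:  # move only on strict improvement
                assignment[t] = best_node
                changed = True
        if not changed:
            break
    return assignment


def is_nash_equilibrium(assignment: Dict[str, str], nodes: List[str],
                        edges: Set[FrozenSet[str]],
                        domain_of: Dict[str, str], weight: float) -> bool:
    """True if no single task can lower its own cost by moving alone."""
    for t, node in assignment.items():
        current = task_cost(t, node, assignment, edges, domain_of, weight)
        for n in nodes:
            if task_cost(t, n, assignment, edges, domain_of, weight) < current:
                return False
    return True


# Toy topology: spout -> {bolt_a, bolt_b} -> sink, on two racks of two nodes.
tasks = ["spout", "bolt_a", "bolt_b", "sink"]
nodes = ["n1", "n2", "n3", "n4"]
domain_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
edges = {frozenset(p) for p in [("spout", "bolt_a"), ("spout", "bolt_b"),
                                ("bolt_a", "sink"), ("bolt_b", "sink")]}

placement = best_response_dynamics(tasks, nodes, edges, domain_of)
print(placement)
print(is_nash_equilibrium(placement, nodes, edges, domain_of, 2.0))
```

The strict-improvement test (`best < current`) is what guarantees termination in this sketch: each move lowers the shared potential, and there are only finitely many placements. A realistic cost model would need measured latencies and resource limits, which this toy version leaves out.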
Keywords/Search Tags: DSPS, Fault tolerance, Task allocation, Game theory