Heavy Tails and Instabilities in Large-Scale Systems with Failures

Posted on:2016-01-21

Degree:Ph.D.

Type:Dissertation

University:Columbia University

Candidate:Skiani, Evangelia

GTID:1478390017982177

Subject:Electrical engineering

Abstract/Summary:

Modern engineering systems, e.g., wireless communication networks and distributed computing systems, are characterized by high variability and susceptibility to failures. Failure recovery is required to guarantee the successful operation of these systems. One straightforward and widely used mechanism is to restart the interrupted jobs from the beginning after a failure occurs. In network design, retransmissions are the primary building blocks of the network architecture that guarantee data delivery in the presence of channel failures. Retransmissions have recently been identified as a new origin of power laws in modern information networks. In particular, it was discovered that retransmissions give rise to long tails (delays) and possibly zero throughput. To this end, we investigate the impact of the 'retransmission phenomenon' on the performance of failure-prone systems and propose adaptive solutions to address emerging instabilities.

The preceding finding of power law phenomena due to retransmissions holds under the assumption that data sizes have infinite support. In practice, however, data sizes are upper bounded, 0 ≤ L ≤ b; e.g., WaveLAN's maximum transfer unit is 1500 bytes, YouTube videos are of limited duration, and e-mail attachments cannot exceed 10MB. To this end, we first provide a uniform characterization of the entire body of the distribution of the number of retransmissions, which can be represented as a product of a power law and the Gamma distribution. This rigorous approximation clearly demonstrates the transition from power law distributions in the main body to exponential tails. Furthermore, the results highlight the importance of wisely determining the size of data fragments in order to accommodate the performance needs of these systems, and they provide the appropriate tools for this fragmentation.

Second, we extend the analysis to the practically important case of correlated channels using modulated processes, e.g., Markov modulated, to capture the underlying dependencies. Our study shows that the tails of the retransmission and delay distributions are asymptotically insensitive to the channel correlations and are determined by the state that generates the lightest tail in the independent channel case. This insight is beneficial both for capacity planning and channel modeling, since the independent model is sufficient and the correlation details do not matter. However, the preceding finding may be overly optimistic when the best state is atypical, since the effects of 'bad' states may still downgrade the performance.

Third, we examine the effects of scheduling policies in queueing systems with failures and restarts. Fair sharing, e.g., processor sharing (PS), is a widely accepted approach to resource allocation among multiple users. We revisit the well-studied M/G/1 PS queue with a new focus on server failures and restarts. Interestingly, we discover a new phenomenon showing that PS-based scheduling induces complete instability in the presence of retransmissions, regardless of how low the traffic load may be. This novel phenomenon occurs even when the job sizes are bounded/fragmented, e.g., deterministic. This work demonstrates that scheduling one job at a time, such as first-come-first-serve, achieves a larger stability region and should be preferred in these systems.

Last, we delve into the area of distributed computing and study the effects of commonly used mechanisms, i.e., restarts, fragmentation, and replication, especially in cloud computing services. We evaluate the efficiency of these techniques under different assumptions on the data streams and discuss the corresponding optimization problem. These findings are useful for optimal resource allocation and fault tolerance in rapidly developing computing networks.

In addition to networking and distributed computing systems, the aforementioned results improve our understanding of failure recovery management in large manufacturing and service systems, e.g., call centers. Scalable solutions to this problem increase in significance as these systems continuously grow in scale and complexity. The new phenomena and the techniques developed herein provide new insights into the areas of parallel computing, probability and statistics, as well as financial engineering.
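
The restart mechanism and the effect of bounding the data size, as described in the abstract, can be illustrated with a small simulation. The Python sketch below is not taken from the dissertation; it rests on simplifying assumptions chosen only for illustration: job sizes L are exponential with rate job_rate, channel availability periods A are i.i.d. exponential with rate failure_rate, and an attempt succeeds when A > L. Under these assumptions the number of attempts N satisfies P(N > n) = Γ(α+1)Γ(n+1)/Γ(n+α+1) with α = job_rate/failure_rate, which decays like the power law n^(-α). Capping sizes at a hypothetical bound b (here via min(L, b), one of several possible truncations) gives P(N > n) ≤ (1 - e^(-failure_rate·b))^n, an exponentially decaying tail. All function and parameter names are made up for this example.

```python
import math
import random


def attempts_until_success(job_size, failure_rate, rng, max_attempts):
    """Restart a job from scratch until one availability period covers it.

    Returns the attempt count, or max_attempts + 1 if the job still has not
    completed after max_attempts tries (the cap keeps the run time bounded).
    """
    for attempt in range(1, max_attempts + 1):
        # Assumed model: each availability period is Exp(failure_rate).
        if rng.expovariate(failure_rate) > job_size:
            return attempt
    return max_attempts + 1


def simulate(num_jobs, job_rate, failure_rate, max_attempts, bound=None, seed=0):
    """Sample attempt counts for Exp(job_rate) sizes, optionally capped at bound."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_jobs):
        size = rng.expovariate(job_rate)
        if bound is not None:
            size = min(size, bound)  # illustrative truncation: 0 <= L <= b
        counts.append(attempts_until_success(size, failure_rate, rng, max_attempts))
    return counts


def empirical_tail(counts, n):
    """Fraction of jobs needing more than n attempts (valid for n <= cap)."""
    return sum(1 for c in counts if c > n) / len(counts)


def exact_tail_unbounded(n, alpha):
    """P(N > n) under the assumed exponential model, alpha = job_rate / failure_rate."""
    return math.exp(
        math.lgamma(alpha + 1) + math.lgamma(n + 1) - math.lgamma(n + alpha + 1)
    )


if __name__ == "__main__":
    cap = 200
    unbounded = simulate(20000, 1.0, 1.0, cap, seed=1)
    bounded = simulate(20000, 1.0, 1.0, cap, bound=2.0, seed=1)
    print("n    exact(unbounded)  empirical(unbounded)  empirical(bounded)")
    for n in (1, 5, 10, 50, 100):
        print(
            f"{n:<4d} {exact_tail_unbounded(n, 1.0):<17.5f} "
            f"{empirical_tail(unbounded, n):<21.5f} {empirical_tail(bounded, n):.5f}"
        )
```

With job_rate equal to failure_rate (α = 1) the closed form reduces to P(N > n) = 1/(n+1), so the unbounded column should track that curve while the bounded column falls off much faster for large n. The sketch only shows the qualitative transition from a power law to an exponential tail; it does not reproduce the uniform power-law-times-Gamma characterization mentioned above.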

Keywords/Search Tags:

Systems, Failures, Computing, Tails, New

Related items

1	Adaptive control of nonlinear systems with actuator failures and uncertainties
2	Performance evaluation of wireless ad hoc networks and the presence of heavy-tails & LRD
3	Non-intrusive detection and diagnosis of failures in high throughput distributed systems
4	Fault recovery in discrete-event systems with intermittent and permanent failures
5	Research On Damage Mechanism Of Cascading Failures In Inter-domain Routing Systems
6	Coping with dependent failures in distributed systems
7	Stability And Feedback Control For Delay Systems With Actuator/Controller Failures
8	Fault tolerance in adaptive real-time computing systems
9	Sliding Mode Control Of Uncertain Systems In Case Of Actuator Failures
10	Research On Analysis And Preventive Control Of Cascading Failures In Complex Systems