Crash-only software and microreboot: A design and technique for achieving high availability in frequently-failing software systems

Posted on:2006-04-29

Degree:Ph.D

Type:Thesis

University:Stanford University

Candidate:Candea, George M

Full Text:PDF

GTID:2458390008970306

Subject:Computer Science

Abstract/Summary:

Application-level software failures are a dominant cause of outages in large-scale systems, such as e-commerce, banking, or Internet services. The root cause of these failures is often unknown and the only cure is to reboot, often at the cost of nontrivial service disruption or downtime, even when clusters and failover are employed. Our thesis is that, although large-scale systems are unreliable, structuring them for fast, minimally-disruptive recovery is a cost-effective way to make them highly available.

This dissertation defines the crash-only design, a set of principles for building large-scale programs that crash safely and recover fast. We describe the microreboot mechanism, by which fine-grained components of crash-only systems are recovered through restart at the first indication of failure. We applied the crash-only design and microreboot technique to a satellite ground station and an Internet auction system; without fixing any bugs, microrebooting recovered most of the same failures as process restarts, but did so more than an order of magnitude faster and with an order of magnitude savings in lost work, reducing overall unavailability by a factor of 50.

The fast, minimally-disruptive nature of microrebooting makes several failure management policies cost-effective, policies that would otherwise be prohibitively expensive in terms of incurred downtime. First, we show that failures can be avoided at low cost by preventively microrebooting components, thus rejuvenating applications with minimal downtime. Second, we show that microrebooting at the slightest hint of failure (without engaging in diagnosis) improves availability even when failure detection is prone to false positives. Finally, we demonstrate that microreboot-based recovery can be hidden from end users via transparent request retries, improving availability without change in end-user-perceived service quality.

The crash-only/microreboot approach is in keeping with a minimalist philosophy of system design, in which simpler recovery mechanisms are preferred to complex ones: by casting most failures as reboot-curable problems, we simplify recovery, making it more prompt and effective, while being less disruptive to end users. We conclude that the combination of crash-only software and microrebooting provides a better cost/dependability tradeoff compared to the traditional approach of aiming for correct code and supporting diverse recovery mechanisms.
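
As an illustration only, the idea of restarting a single component at the first sign of failure and then transparently retrying the request might be sketched as below. This is not code from the dissertation: the names `Component`, `microreboot`, and `serve` are hypothetical, the failure is simulated with a `corrupted` flag, and the sketch assumes that important state lives outside the component so that discarding its in-memory state is safe.

```python
class Component:
    """Hypothetical fine-grained component whose in-memory state can be discarded."""

    def __init__(self, name):
        self.name = name
        self.corrupted = False  # simulated reboot-curable fault
        self.reboots = 0        # how many times this component was microrebooted

    def handle(self, request):
        # A corrupted component fails every request until it is restarted.
        if self.corrupted:
            raise RuntimeError(f"{self.name} failed on {request!r}")
        return f"{self.name}:{request}"

    def microreboot(self):
        # Assumption: durable state is kept in a separate store, so
        # reinitializing the component loses nothing that matters.
        self.corrupted = False
        self.reboots += 1


def serve(component, request, max_retries=1):
    """Handle a request; on failure, microreboot the component and retry.

    No diagnosis is attempted: any failure triggers a restart of just this
    component, and the caller only sees an error if retries are exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return component.handle(request)
        except RuntimeError:
            if attempt == max_retries:
                raise
            component.microreboot()


if __name__ == "__main__":
    bids = Component("bid")
    bids.corrupted = True          # inject a fault
    print(serve(bids, "r1"))       # recovered by one microreboot, then served
    print(bids.reboots)
```

In this toy model the caller never observes the injected fault, which mirrors the abstract's point about hiding recovery behind request retries; a real system would also need to bound retry latency and ensure that retried requests are safe to repeat.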

Keywords/Search Tags:

Software, Systems, Crash-only, Failures, Recovery, Microreboot, Availability

Related items

1	Research On High Availability Of Distributed Mission-Critical System
2	Research On The Method Of Microreboot Oriented Parallel Computing Environment
3	Research On Microreboot Technologies Oriented To Self-Recovery
4	The Research Of Dynamic Adaptive Software Model Based On JMX
5	Fault recovery in discrete-event systems with intermittent and permanent failures
6	Research On Crash Availability Judgement Based On Taint Analysis
7	An improved crash recovery approach for distributed systems
8	Research On Software Self-recovery Technology For Real-time Embedded System
9	Transaction recovery in databases and beyond
10	Mobility-based route recovery from multiple node failures in movable sensor networks