| Application-level software failures are a dominant cause of outages in large-scale systems, such as e-commerce, banking, or Internet services. The root cause of these failures is often unknown and the only cure is to reboot, often at the cost of nontrivial service disruption or downtime, even when clusters and failover are employed. Our thesis is that, although large-scale systems are unreliable, structuring them for fast, minimally disruptive recovery is a cost-effective way to make them highly available. This dissertation defines the crash-only design, a set of principles for building large-scale programs that crash safely and recover fast. We describe the microreboot mechanism, by which fine-grained components of crash-only systems are recovered through restart at the first indication of failure. We applied the crash-only design and microreboot technique to a satellite ground station and an Internet auction system; without fixing any bugs, microrebooting recovered most of the same failures as process restarts, but did so more than an order of magnitude faster and with an order of magnitude savings in lost work, reducing overall unavailability by a factor of 50. The fast, minimally disruptive nature of microrebooting makes cost-effective several failure management policies that would otherwise be prohibitively expensive in terms of incurred downtime. First, we show that failures can be avoided at low cost by preventively microrebooting components, thus rejuvenating applications with minimal downtime. Second, we show that microrebooting at the slightest hint of failure (without engaging in diagnosis) improves availability even when failure detection is prone to false positives.
Finally, we demonstrate that microreboot-based recovery can be hidden from end users via transparent request retries, improving availability without any change in end-user-perceived service quality. The crash-only/microreboot approach is in keeping with a minimalist philosophy of system design, in which simpler recovery mechanisms are preferred to complex ones: by casting most failures as reboot-curable problems, we simplify recovery, making it more prompt and effective while being less disruptive to end users. We conclude that the combination of crash-only software and microrebooting provides a better cost/dependability tradeoff than the traditional approach of aiming for correct code and supporting diverse recovery mechanisms. |