Increasing processor dependability in distributed shared-memory servers

Posted on: 2010-04-23
Degree: Ph.D
Type: Thesis
University: Carnegie Mellon University
Candidate: Gold, Brian T
Full Text: PDF
GTID: 2448390002488709
Subject: Computer Science
Abstract/Summary:
Scalable shared-memory servers offer high performance and capacity within the familiar shared-memory programming model. However, reliability and availability have been significant shortcomings of previous shared-memory architectures, as a single error in one of the many processor or memory modules could bring down the entire system. The goal of this thesis is to eliminate the processor module as a single point of failure for shared-memory servers, without requiring changes to software and with minimal impact on commodity hardware designs.

The basic approach studied is distributed redundancy, where pairs of processor cores are grouped together logically but separated physically to increase availability of the system. We propose a design space based on fault-containment granularity, and argue that achieving our goals requires that processor cores and their private caches keep unchecked values from propagating into shared memory. We investigate two alternatives for exposing these updates to the outside system: forcing a check when external requests arrive or hiding the updates using a relaxed memory model.

We propose initial designs based on lockstep coordination that construct synchronous redundant processor pairs. We then leverage the hidden-update mechanisms to develop an asynchronous, distributed-redundant system. Our evaluations of common enterprise workloads show that asynchronous redundancy can achieve performance overheads averaging just 10% over a non-redundant system, while obviating the need for the extensive initialization and deterministic execution found in synchronous designs.

We observe that although asynchronous redundancy has numerous benefits for the designer, it complicates the system's ability to recover from chip failures. Our implementation of asynchronous redundancy relies on one of the replica cores in each pair being potentially incoherent with the rest of the system, leading to temporal regions where, if the coherent core failed, data could be lost. We propose simple extensions to the cache coherence protocol to close these windows of vulnerability. Using symbolic model checking, we formally verify an example distributed shared-memory coherence protocol and our proposed extensions for chip-failure tolerance.
Keywords/Search Tags: Shared-memory, Processor, Distributed, Model