Increasing processor dependability in distributed shared-memory servers

Posted on:2010-04-23

Degree:Ph.D

Type:Thesis

University:Carnegie Mellon University

Candidate:Gold, Brian T

Full Text:PDF

GTID:2448390002488709

Subject:Computer Science

Abstract/Summary:

Scalable shared-memory servers offer high performance and capacity within the familiar shared-memory programming model. However, reliability and availability have been significant shortcomings of previous shared-memory architectures, as a single error in one of the many processor or memory modules could bring down the entire system. The goal of this thesis is to eliminate the processor module as a single point of failure for shared-memory servers without requiring changes to software and while minimizing the impact on commodity hardware designs.

The basic approach studied is distributed redundancy, where pairs of processor cores are grouped together logically but separated physically to increase availability of the system. We propose a design space based on fault-containment granularity, and argue that achieving our goals requires that processor cores and their private caches keep unchecked values from propagating into shared memory. We investigate two alternatives for exposing these updates to the outside system: forcing a check when external requests arrive or hiding the updates using a relaxed memory model.

We propose initial designs based on lockstep coordination that construct synchronous redundant processor pairs. We then leverage the hidden-update mechanisms to develop an asynchronous, distributed-redundant system. Our evaluations of common enterprise workloads show that asynchronous redundancy can achieve performance overheads averaging just 10% over a non-redundant system, while obviating the need for the extensive initialization and deterministic execution found in synchronous designs.

We observe that although asynchronous redundancy has numerous benefits for the designer, it complicates the system's ability to recover from chip failures. Our implementation of asynchronous redundancy relies on one of the replica cores in each pair being potentially incoherent with the rest of the system, leading to temporal regions where, if the coherent core failed, data could be lost. We propose simple extensions to the cache coherence protocol to close these windows of vulnerability. Using symbolic model checking, we formally verify an example distributed shared-memory coherence protocol and our proposed extensions for chip-failure tolerance.

Keywords/Search Tags:

Shared-memory, Processor, Distributed, Model

Related items

1	Speculative distributed shared-memory multiprocessors organized as processor-and-memory hierarchies
2	Studies On Shared-Memory Management And Optimization Technologies In Parallel And Distributed Operating Systems
3	Research On Key Technology Of High-efficient Shared Memory System In Network Processor Based On MPSoC
4	Design And Implementation Of Distributed Memory Object System Based On RDMA
5	The Design And Implementation Of A Distributed Shared Memory System
6	Research On Key Technologies Of Scalable 64-core Processor-- Network-on-chip, memory Hierarchy And LTE Implementation
7	Fusion And Partition
8	Research On Key Technology Of Multi - Core Processor
9	Design And Implementation Of Avoiding False Sharing Distributed Shared Memory Protocol
10	The Solution Of Distributed Shared Memory Based On Domain Oriented Search Engine