| This dissertation presents a flexible technique that can be applied to commodity many-core architectures to exploit idle resources and ensure reliable system operation. The proposed system uses a hypervisor to interpose a dynamically adaptable fault tolerance layer between the hardware and the operating system. It avoids introducing a new single point of failure by incorporating the hypervisor into the sphere of replication. This approach greatly simplifies implementation compared with specialized hardware, operating system, or application-based techniques and offers significant flexibility in the type and degree of protection provided. The levels of protection considered range from duplex replication to arbitrary n-modular replication, limited only by the number of processors in the system. The feasibility of the approach is considered for both near- and long-term computing platforms. A prototype is developed as a proof of concept and used to estimate the performance overhead and gather empirical data on fault tolerance capabilities. A technique for reducing fault detection latency is also proposed and analyzed using the fault injection facilities provided by the prototype. |