Fault Tolerance through Invariant Checks in Applications Using Linear Algebraic Method

Posted on:2019-11-05

Degree:Ph.D

Type:Thesis

University:The University of Wisconsin - Madison

Candidate:Loh, Felix Da Yuan

GTID:2448390002482122

Subject:Computer Engineering

Abstract/Summary:

Graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra. As the minimum feature size of transistors decreases, GPUs are becoming more vulnerable to transient faults caused by events such as alpha particle strikes, power fluctuations and electronic noise. In addition, the likelihood of a fault increases as more GPU computing nodes are used in supercomputers to meet the increasingly demanding computational requirements of scientific applications. Consequently, there are concerns that GPU-based supercomputer systems will suffer from very high fault rates. In order to ensure reliability, it is necessary to use fault tolerance (FT) techniques.

This thesis presents low-overhead FT techniques for several commonly-used linear algebraic applications that run on GPUs, focusing mainly on applications that operate with sparse matrices. These FT techniques exploit the invariant properties of the algorithms used in these applications, and exploit the parallel execution model of GPUs to allow for low-overhead error detection.

This thesis introduces and studies efficient error checking schemes for three popular matrix factorization techniques: Householder QR factorization, left-looking Cholesky factorization, and right-looking LU factorization. It also explores lightweight invariant checking methods for the preconditioned conjugate gradient (PCG) and biconjugate gradient stabilized (BiCGSTAB) iterative solvers and introduces an efficient checking method for the Lanczos eigensolver, as well as fault injection mechanisms for NVIDIA GPUs that allow for the simulation of transient, non-instantaneous faults.

This thesis carefully evaluates these FT methods on a contemporary NVIDIA GPU platform, and the results show that the aforementioned error checking strategies have high error coverage and are significantly more efficient than prior FT techniques on a GPU system.
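As a rough illustration of what an invariant check in an iterative solver can look like, the sketch below runs plain (unpreconditioned) conjugate gradient on the CPU and verifies, at every iteration, that the recursively updated residual still agrees with b - Ax. This is a textbook invariant chosen for illustration; the abstract does not state which invariants the thesis uses, so the check itself, the function names, the tolerances and the `fault_at` fault-injection parameter are all assumptions. The thesis targets GPUs and sparse matrices, whereas this sketch uses small dense lists in pure Python.

```python
# Illustrative sketch only: the invariant, tolerances and fault model are
# assumptions, not the checks or injection mechanisms defined in the thesis.

def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cg_with_invariant_check(A, b, iters=50, tol=1e-10, check_tol=1e-8,
                            fault_at=None):
    """Conjugate gradient for a symmetric positive definite A.

    Returns (x, detected). `detected` is the iteration at which the
    invariant r == b - A x was found violated, or None if it always held.
    `fault_at` (hypothetical) corrupts x[0] at that iteration to emulate
    a transient fault.
    """
    scale = max(1.0, max(abs(bi) for bi in b))
    x = [0.0] * len(b)
    r = list(b)              # residual for the initial guess x = 0
    p = list(r)
    rs = dot(r, r)
    for k in range(iters):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]

        if fault_at == k:
            x[0] += 1.0      # injected error (assumed fault model)

        # Invariant check: the recursive residual must match b - A x.
        true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        drift = max(abs(t - ri) for t, ri in zip(true_r, r))
        if drift > check_tol * scale:
            return x, k      # error detected at iteration k

        rs_new = dot(r, r)
        if rs_new < tol * tol:
            return x, None   # converged with the invariant intact
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, None

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_clean, detected_clean = cg_with_invariant_check(A, b)
x_faulty, detected_faulty = cg_with_invariant_check(A, b, fault_at=0)
```

Checking every iteration costs one extra matrix-vector product per step, which roughly doubles the dominant cost of the solver. A lower-overhead variant would evaluate the invariant only every few iterations, trading detection latency for speed.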

Keywords/Search Tags:

Applications, FT techniques, GPU, Linear, Fault, Invariant, Error

Related items

1	Exploiting Parallelism in GPUs
2	Research On Software-based Fault Tolerance Techniques For Aerospace Applications At Source Code Level
3	Optimizing Throughput and Power Consumption of Graphics Processing Units (GPUs)
4	Automatic transformation and optimization of applications on GPUs and GPU clusters
5	Research On Fault Recovery Techniques For Soft Errors Of COTS DSP
6	LMI-Based Approaches To Fault Detection For Linear Systems
7	The Error Linear Complexity Spectra Of Binary Sequences With Period p^n
8	Analysis Of Hardware Fault Propagation In Programs And Research On Fault-tolerance Techniques
9	Research On LINC Transmitter Techniques