
Research On The Anti-SEU Technology Based On TMR-CUDA Fault Tolerant Architecture

Posted on: 2019-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: P Ding
Full Text: PDF
GTID: 2322330563954295
Subject: Navigation, guidance and control
Abstract/Summary:
The on-board computer is the core of a satellite's on-board system: it is responsible for system management, in-orbit information processing and satellite control. As space exploration tasks develop, the on-orbit processing capability of on-board systems must improve, and on-board computers will evolve toward high performance and low power consumption. Traditional radiation-hardened chips have a complex manufacturing process, low performance, high power consumption and high cost, which makes them unsuitable for building high-performance satellite-borne computers. Commercial GPUs (Graphics Processing Units), by contrast, have powerful data processing ability. Implementing transient hardware fault tolerance in software on a GPU can give the on-board system high performance, low cost and low power consumption. Space contains many high-energy charged particles, and SEUs (Single-Event Upsets) have the most serious impact on space computers, so the reliability of on-board computers is very important. This thesis mainly studies the transient hardware faults caused by SEUs on GPUs.

First, we study the mechanism of SEUs and the existing fault tolerance methods, and analyze the influence of SEU effects on the different structures of a GPU system. The thesis focuses on handling GPU soft errors through software fault tolerance, based on a study of the internal hardware architecture, communication mechanism, thread organization and instruction scheduling mode of the CUDA (Compute Unified Device Architecture) computing platform. Drawing on the architectural characteristics of GPGPU (General Purpose GPU) and on TMR (Triple Modular Redundancy), we propose a fault-tolerant architecture based on TMR-CUDA. We then optimize the program to reduce the performance overhead of the fault-tolerant program. Finally, the fault-tolerant scheme is tested by analyzing benchmark programs: the performance overhead of the fault-tolerant scheme based on computing-resource redundancy is reduced to about 60%, and the performance overhead of thread bundle (warp) redundancy is reduced to about 26%. A software reliability model is built according to the scheme, and the reliability of the fault-tolerant scheme is evaluated through fault injection experiments.

The purpose of this work is to analyze the application prospects of on-board GPUs, to improve on-orbit computing capability and to offer a new line of thinking for on-board systems. The reliability of the on-board system is improved by studying fault-tolerant schemes for GPUs, and the fault-tolerant scheme is verified by fault injection experiments. The GPU's advantages of high performance, low power consumption and low cost provide the basis for further research on the application of on-board GPUs. This thesis is of theoretical and practical significance for research on anti-SEU technology for on-board GPUs.
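
The abstract does not give implementation details, so the following is only a minimal sketch of the general TMR idea it builds on, not the thesis's TMR-CUDA code. It is written in Python rather than CUDA so it can be run anywhere. It computes the same result three times, injects a single bit flip into one copy to imitate an SEU, and recovers the correct value with a bitwise 2-of-3 majority vote. All names (`majority_vote`, `flip_bit`, `compute`, `tmr_run`) and the 32-bit integer workload are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # model results as 32-bit words (an assumption)


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value
    held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Imitate a single-event upset by inverting one bit of a word."""
    return (word ^ (1 << bit)) & MASK32


def compute(alpha: int, x: int, y: int) -> int:
    """Hypothetical workload standing in for one thread's result."""
    return (alpha * x + y) & MASK32


def tmr_run(alpha, x, y, faulty_replica=None, bit=0):
    """Run the workload three times, optionally corrupt one replica,
    and return the voted result."""
    replicas = [compute(alpha, x, y) for _ in range(3)]
    if faulty_replica is not None:
        replicas[faulty_replica] = flip_bit(replicas[faulty_replica], bit)
    return majority_vote(*replicas)


if __name__ == "__main__":
    golden = compute(3, 5, 7)
    # Inject every possible single-bit fault into every replica in turn.
    masked = sum(
        tmr_run(3, 5, 7, faulty_replica=r, bit=b) == golden
        for r in range(3)
        for b in range(32)
    )
    print(f"{masked} of {3 * 32} single-bit faults masked")
```

In this sketch a fault confined to one replica is always outvoted, while the same bit flipped in two replicas would win the vote. This is the usual limit of TMR: it masks single faults, not coincident ones.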
Keywords/Search Tags:SEU, Transient Hardware Fault, GPGPU, TMR, TMR-CUDA