
Design And Implementation Of Four Redundancy Fault-tolerant Algorithm For Spaceborne GPU

Posted on: 2019-02-04; Degree: Master; Type: Thesis
Country: China; Candidate: W H Zhu; Full Text: PDF
GTID: 2322330569495732; Subject: Engineering
Abstract/Summary:
The satellite-borne computer is to a satellite what the brain is to the human body. It is a critical component that controls the operation of the satellite and the execution of its on-board tasks. Once the satellite-borne computer fails, the mission may fail, with very serious or even disastrous consequences. However, because of the harsh space environment and the limitations of current software and hardware, the safety of on-board computers needs to be effectively guaranteed. At the same time, as satellites are applied more widely, satellite-borne computer hardware is required to offer high performance, low power consumption, small size, and light weight.

The rapid development of GPU hardware and software in recent years can address this need. The computing capability of GPUs can not only handle large-scale, compute-intensive tasks but also reduce power consumption and cost compared with other aerospace-grade chips. However, GPUs are more prone to transient faults because of increased chip integration and lower operating voltages. Therefore, when GPUs are used in aerospace applications with extremely high reliability requirements, suitable fault-tolerance techniques are needed to improve their reliability and reduce the failure rate.

This thesis studies and compares the applicable scenarios, advantages, and disadvantages of various fault-tolerance methods, focusing on hardware and software fault-tolerance techniques. To balance high system reliability against low design complexity, fault tolerance is designed using four redundancy. The NVIDIA Jetson TX2, running Linux, is selected as the on-board GPU. Based on the hardware features and software technology of the GPU, the four redundancy fault-tolerant design is implemented from two aspects: CUDA and redundant processes.

The core idea of the CUDA-based four redundancy scheme is redundant computation. It combines certain hardware and software fault-tolerant design concepts to make full use of the redundant resources in the hardware, and implements four redundancy fault tolerance at the kernel level, block level, or algorithm design level. The redundant-process scheme has two parts: fault detection and fault recovery. Fault detection is realized by an improved version of the PLR method proposed by Shye et al., and fault recovery is achieved through checkpointing and recovery techniques.

Experimental tests and data analysis of part of the fault-tolerant scheme, carried out on NVIDIA's CUDA parallel computing platform, show that the GPU greatly reduces the time spent in the computing part compared with the CPU through parallel computing, and that the speedup of this part is very significant. The performance of GPU fault-tolerant programs is mainly affected by factors such as the size of the computation, the time spent transferring data between the CPU and the GPU, and the time required for the comparisons used in error detection. The reliability analysis shows that the CUDA-based four redundancy fault-tolerant scheme designed in this thesis can greatly improve the reliability of the system and meet the reliability requirements of the on-board GPU.
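
As a rough illustration of the redundant-computation idea described above, the following Python sketch runs one computation four times and compares the results. It is not the thesis's CUDA implementation. The function names (`vote4`, `run_redundant`), the rule of accepting a value only when at least three of the four copies match, and the fault-injection hook are all assumptions made for this example. On a GPU, the four replicas could correspond to redundant kernels or blocks rather than a sequential loop.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

Result = Tuple[int, ...]


def vote4(results: Sequence[Result]) -> Tuple[Optional[Result], str]:
    """Compare four replica results (hypothetical 3-of-4 acceptance rule).

    Returns (value, status), where status is one of:
      "agree"         - all four replicas match
      "corrected"     - three match, so one faulty replica is outvoted
      "uncorrectable" - no value has three matching copies
    """
    if len(results) != 4:
        raise ValueError("expected exactly four replica results")
    value, count = Counter(results).most_common(1)[0]
    if count == 4:
        return value, "agree"
    if count == 3:
        return value, "corrected"
    return None, "uncorrectable"


def run_redundant(
    compute: Callable[[Sequence[int]], Sequence[int]],
    data: Sequence[int],
    inject: Optional[Callable[[int, Result], Result]] = None,
) -> Tuple[Optional[Result], str]:
    """Run `compute` four times on the same input and vote on the outputs.

    `inject` is an illustrative hook that may corrupt the output of a
    given replica, standing in for a transient fault.
    """
    replicas = []
    for replica_id in range(4):
        out = tuple(compute(data))
        if inject is not None:
            out = inject(replica_id, out)
        replicas.append(out)
    return vote4(replicas)
```

When no value reaches three matching copies, the sketch only reports the fault as uncorrectable. Handling that case with a recovery step, such as rolling back to a checkpoint and recomputing, would be one option in the spirit of the recovery techniques mentioned in the abstract.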
Keywords/Search Tags: GPU, CUDA, Four Redundancy, Fault Tolerant