| With the development of VLSI fabrication technologies, the performance of modern microprocessors has been increasing exponentially, while their sensitivity to soft errors has also increased dramatically. Soft errors are intermittent faults caused by external events, such as radiation from energetic particles, voltage disturbances, and electromagnetic interference. Soft errors do not cause permanent damage, but they may result in incorrect program execution by altering signal transfers or stored values, and thus have a serious impact on system reliability.

To improve system reliability, fault-tolerant technologies have been proposed. According to their implementation, fault-tolerant technologies can be classified into hardware-implemented and software-implemented technologies. Compared with hardware-implemented technology, software-implemented technology does not need to alter or redesign the hardware architecture, and it offers low cost, a short development life cycle, and flexible configuration. Software-implemented technology has therefore become an efficient solution for dealing with soft errors. According to the stage of fault processing, software-implemented technologies include soft error analysis and assessment, error detection, error recovery, fault-tolerant optimal configuration, and fault-tolerant verification. Earlier research focuses mainly on soft error impact analysis and error detection, so our study concentrates on error recovery and fault-tolerant verification. The main contributions are as follows:

1. We present a pure-software method based on encoded signatures to recover from control-flow errors (CFEs). After an error is detected, both the control flow and the data flow of the program are restored to a correct state from before the error occurred, ensuring that the program continues executing and produces correct output.
In this study, the assembly program code is first partitioned into storeless basic blocks, and a static encoded signature is assigned to each storeless basic block. Then, checking instructions and recovery instructions are inserted into each storeless basic block based on the assigned signatures. Checking instructions are designed to detect CFEs, while recovery instructions are designed to recover from the data errors caused by CFE propagation. Finally, CFE handlers are defined to handle CFEs. To the best of our knowledge, this is the first work to address the detection and recovery of inter-function CFEs. Moreover, all inter-block CFEs and most intra-block CFEs can be detected and recovered. Compared with current CFE detection techniques, our method achieves error recovery on top of error detection at a relatively low performance overhead.

2. We propose a fault-tolerant technique at the source level to deal with data flow errors caused by soft errors. It consists of three parts: (1) A definition of the containment block, based on the concept of the basic block. The proposed technique deals with data flow errors at the granularity of the containment block, so that data flow errors within a containment block do not propagate to other blocks. (2) An error detection mechanism based on data diversity transformation and redundant computing. The basic principle of error detection is that a redundant fault-tolerant program with the same function as the original program is generated from a set of diversity transformation rules, and comparison statements are inserted at certain positions to check whether an error occurs during execution. (3) An error recovery mechanism based on application-level checkpoints. Data flow analysis is used to obtain the variables for each checkpoint, and the statements for error recovery are added as well. A source-to-source transformation tool is implemented to generate the fault-tolerant program automatically.
Fault injection and performance overhead experiments show that most control flow errors can be recovered with relatively low performance overhead.

3. We present a general approach, based on the model checking principle, to evaluate the effectiveness of signature-monitoring mechanisms. First, we give an abstract summary of signature-monitoring mechanisms. Then the fault-tolerant program is modeled as a control-flow machine, and its syntax and semantics are defined using a step operational semantics. The control-flow machine is refined into a state transition system, which is translated into the input program of the model checker so that the verification is performed automatically. Finally, our approach is applied to two representative techniques, DSM and CFCSS, to demonstrate its practicability. The verification reveals the undetected errors of the DSM algorithm and, for the first time, some undetected errors of the CFCSS algorithm that are due to signature association.

4. We propose a formal verification technique based on a typed assembly system to verify the correctness of data flow error tolerance techniques. The basic principle of a typed assembly system is to add static type properties to the assembly language, so that the target property can be proved by verifying the soundness of the type system. We take a representative data flow error recovery technique, SWIFT_R, as an example to illustrate the verification process. First, the syntax of TFAL is defined, and the operational semantics is given step by step by modeling the execution of each instruction as a step transition. Based on the definitions of syntax and semantics, all the instructions of TFAL are type-checked, and the undetected errors of SWIFT_R are obtained. Assuming that all the undetected errors are excluded, the type safety of TFAL, including progress and preservation, is verified. Then, the similarity relation is defined.
Based on the similarity relation, the property of fault tolerance is proved. A program is fault-tolerant when the output of the original program in a fault-free environment is the same as the output of the fault-tolerant program in a faulty environment. Moreover, the state transitions of the original program and the fault-tolerant program are similar. |
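As a minimal illustration of the fault-tolerance property stated above, the following Python sketch triplicates a small computation and repairs a single bit flip by majority voting. It assumes triple redundancy with voting, which is how SWIFT-R is commonly described. The function names, the sum-of-squares workload, and the `fault` injection hook are hypothetical and are not taken from this work.

```python
from collections import Counter


def majority(a, b, c):
    """Return the value held by at least two of the three copies."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one copy is corrupted")
    return value


def original(xs):
    """Original program (sum of squares), run in a fault-free environment."""
    total = 0
    for x in xs:
        total += x * x
    return total


def fault_tolerant(xs, fault=None):
    """Triplicated version of `original`.

    `fault` is a hypothetical injection hook (step, copy, bit): after the
    given loop step, one bit of one redundant accumulator is flipped.
    """
    copies = [0, 0, 0]
    for step, x in enumerate(xs):
        # Redundant computation: every copy performs the same update.
        for i in range(3):
            copies[i] += x * x
        # Simulated soft error in exactly one copy (assumption: single fault).
        if fault is not None and fault[0] == step:
            _, copy, bit = fault
            copies[copy] ^= 1 << bit
        # Vote and overwrite all copies, which repairs the corrupted one.
        voted = majority(*copies)
        copies = [voted, voted, voted]
    return majority(*copies)


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(original(data))
    print(fault_tolerant(data, fault=(2, 1, 5)))
```

Under the single-fault assumption, `fault_tolerant` returns the same value as `original` for every injected bit flip, which is the output-equivalence half of the property. The sketch does not model the similarity of state transitions, and it votes once per loop iteration only for simplicity.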