| With the rapid development of High Performance Computing (HPC) systems, system scale is increasing quickly. As a result, the Mean Time Between Failures (MTBF) of HPC systems has shortened and is now much shorter than the run time of large-scale scientific computing applications, which reduces system availability. Fault-tolerant technology can improve availability. At present, system-level checkpointing is widely used for fault tolerance, but its overhead is usually large. Application-level checkpointing reduces the overhead, but the application must still be reloaded after a failure, which can be costly when the application is large. MPI is widely used as the parallel programming environment in HPC, and NR-MPI is a new and efficient fault-tolerant MPI, so research on the design methodology of NR-MPI-based fault-tolerant parallel programs is significant. Because MPI applications are complex and diverse, it is hard to design a single fault-tolerant technology that is highly efficient and applicable to every application. Based on the characteristics of widely used MPI applications, we mainly study fault-tolerant technology based on data redundancy and node redundancy. Our work is as follows: First, to evaluate fault-tolerant technologies, we propose the requirements that an ideal fault-tolerant technology should satisfy and define four indicators. To estimate whether a fault-tolerant technology suits a given application on a given HPC system, we define the time factor of fault tolerance. This work is the foundation for designing fault-tolerant parallel applications. Second, we analyze in depth the design methodology for fault-tolerant parallel applications using data redundancy, in particular the key design problems: the backup strategy, consistency, the backup cycle, and the key variables. 
Based on this analysis, we propose the Data Redundancy based Fault Tolerant Framework (DRFTF). DRFTF builds on the application's original algorithm, is easy to implement, and adds little overhead for applications whose proportion of key data is small. Third, we analyze the algorithms of NPB and Sweep3D and implement fault-tolerant versions of both using DRFTF. The experimental results verify the fault tolerance and low overhead of DRFTF. Fourth, for applications whose proportion of key data is large, we propose the Node Redundancy based Fault Tolerant Framework (NRFTF). NRFTF builds a checksum of the data computed by the original algorithm and saves it to redundant nodes. For algorithms that can maintain the checksum after every loop step, the checksum is updated as the original algorithm executes, so the overhead is small. Finally, we analyze the parallel Gaussian elimination algorithm and use NRFTF to design a fault-tolerant parallel Gaussian elimination algorithm that updates the redundancy without pausing the original algorithm and incurs little overhead. We apply the fault-tolerant algorithm to HPL (the benchmark used for the TOP500 list). The experimental results verify the fault tolerance and low overhead of NRFTF. |
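
The checksum idea behind node redundancy can be illustrated with a small single-process sketch. It is not the NRFTF implementation and does not use NR-MPI or MPI at all: the function names (`add_checksum_column`, `eliminate`, `recover_column`) are hypothetical, each matrix column stands in for data owned by one compute node, and the appended checksum column stands in for data held by a redundant node. The sketch also assumes elimination without pivoting (HPL uses partial pivoting) and exact rational arithmetic, so that round-off does not obscure the point. Because every row update in Gaussian elimination is linear, applying the same update to the checksum column keeps it valid after each loop step, and a lost column can be rebuilt from the checksum and the surviving columns.

```python
from fractions import Fraction


def add_checksum_column(matrix):
    """Append to each row the sum of its entries.

    The extra column plays the role of the data kept on a redundant node.
    """
    return [list(row) + [sum(row)] for row in matrix]


def eliminate(ext):
    """Forward elimination without pivoting on the extended matrix.

    Each row update is applied to the checksum entry as well, so the
    checksum stays consistent after every step (assumes nonzero pivots).
    """
    n = len(ext)
    for k in range(n):
        for i in range(k + 1, n):
            factor = ext[i][k] / ext[k][k]
            ext[i] = [a - factor * b for a, b in zip(ext[i], ext[k])]
    return ext


def recover_column(ext, lost):
    """Rebuild column `lost` from the checksum and the surviving columns."""
    for row in ext:
        survivors = sum(v for j, v in enumerate(row[:-1]) if j != lost)
        row[lost] = row[-1] - survivors
    return ext


# Illustrative data only: a 3x3 matrix with exact rational entries.
a = [[Fraction(x) for x in row] for row in [[2, 1, 1], [4, 3, 3], [8, 7, 9]]]
reduced = eliminate(add_checksum_column(a))

# Simulate the loss of the "node" that owns column 1, then recover it.
damaged = [list(row) for row in reduced]
for row in damaged:
    row[1] = None
recovered = recover_column(damaged, lost=1)
```

In this sketch the checksum is updated inside the same loop as the data, which mirrors the abstract's point that the redundancy can be maintained without pausing the original algorithm. A real distributed version would additionally need communication with the redundant nodes and failure detection, which are outside the scope of this illustration.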