Research On Design Methodology Of NR-MPI Based Fault Tolerant Parallel Programs

Posted on:2013-06-27

Degree:Master

Type:Thesis

Country:China

Candidate:K Tian

Full Text:PDF

GTID:2268330422974107

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of High Performance Computing (HPC) systems, system scale is growing quickly. As a result, the Mean Time Between Failures (MTBF) of HPC systems has shortened and is now far shorter than the run time of large-scale scientific computing applications, which reduces HPC availability. Fault tolerance technology can improve system availability. At present, system-level checkpointing is widely used for fault tolerance, but its overhead is usually large. Application-level checkpointing reduces the overhead, but it still requires reloading the application after a failure, which can be costly when the application runs at large scale. MPI is widely used as the parallel programming environment in HPC, and NR-MPI is a new and efficient fault tolerant MPI, so research on the design methodology of NR-MPI based fault tolerant parallel programs is significant.

Because MPI applications are complex and diverse, it is hard to design a single fault tolerance technology that is both highly efficient and applicable to every application. Based on the characteristics of widely used MPI applications, we mainly study fault tolerance technology based on data redundancy and node redundancy. Our work is as follows:

First, to evaluate fault tolerance technologies, we propose the requirements that an ideal fault tolerance technology should satisfy and define four indicators. To estimate whether a given fault tolerance technology suits a particular application on a particular HPC system, we define the time factor of fault tolerance. This work is the foundation for designing fault tolerant parallel applications.

Second, we analyze in depth the design methodology for fault tolerant parallel applications that use data redundancy, in particular the key design problems: the backup strategy, consistency, the backup cycle, and the key variables. Based on this analysis, we propose the Data Redundancy based Fault Tolerant Framework (DRFTF). DRFTF builds on the application's original algorithm, is easy to implement, and adds little overhead for applications whose proportion of key data is small.

Third, we analyze the algorithms of NPB and Sweep3D and implement fault tolerant versions of both using DRFTF. The experimental results verify the fault tolerance capability and low overhead of DRFTF.

Fourth, for applications whose proportion of key data is large, we propose the Node Redundancy based Fault Tolerant Framework (NRFTF). NRFTF builds a checksum of the data computed by the original algorithm and saves it to redundant nodes. For algorithms that can maintain the checksum after every loop step, the checksum can be updated as the original algorithm executes, so the overhead is small.

Finally, we analyze the parallel Gaussian elimination algorithm and use NRFTF to design a fault tolerant parallel Gaussian elimination algorithm that updates the redundancy without pausing the original algorithm and achieves low overhead. We apply the fault tolerant algorithm to HPL (the benchmark of the TOP500). The experimental results verify the fault tolerance capability and low overhead of NRFTF.
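
The checksum idea mentioned for NRFTF can be illustrated with a small single-process sketch. This is not the thesis's implementation and uses no MPI: it assumes each matrix column stands in for the data held by one node, and that an extra checksum column (the row sums) stands in for a redundant node. The function names `add_checksum_column`, `eliminate`, and `recover_column` are hypothetical. Because Gaussian elimination applies only linear row operations, applying them to the checksum column as well keeps it equal to the sum of the data columns after every step, so a lost column can be rebuilt.

```python
from fractions import Fraction


def add_checksum_column(matrix):
    """Append to each row the sum of its entries (the redundant column)."""
    return [row + [sum(row)] for row in matrix]


def eliminate(aug):
    """Forward Gaussian elimination without pivoting (assumes nonzero pivots).

    Each row operation is applied to the whole row, checksum entry included,
    so the last column stays equal to the sum of the data columns.
    """
    n = len(aug)
    for k in range(n):
        for i in range(k + 1, n):
            factor = aug[i][k] / aug[k][k]
            aug[i] = [a - factor * b for a, b in zip(aug[i], aug[k])]
    return aug


def recover_column(aug, lost):
    """Rebuild data column `lost` as checksum minus the surviving columns."""
    for row in aug:
        survivors = sum(v for j, v in enumerate(row[:-1]) if j != lost)
        row[lost] = row[-1] - survivors
    return aug


# Exact rational arithmetic keeps the example free of rounding error.
A = [[Fraction(v) for v in row] for row in [[2, 1, 1], [4, 3, 3], [8, 7, 9]]]
aug = eliminate(add_checksum_column(A))
reference = [row[:] for row in aug]

# Simulate the failure of the "node" that holds column 1.
for row in aug:
    row[1] = None

recover_column(aug, 1)
assert aug == reference
```

With floating-point data the recovered values would match only up to rounding error, and a single checksum column tolerates the loss of only one data column at a time.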

Keywords/Search Tags:

High Performance Computing, MPI parallel application, fault tolerance, data redundancy, node redundancy, DRFTF, NRFTF

Related items

1	Analysis On Redundancy Of User Data In WLAN Traffic
2	Research On High Performance Redundancy Elimination Techniques For Data Backup Systems
3	Synchronization Control System Redundancy Program Based On The Gsm-r Implementation Locomotive
4	Research On Multi-Replica Fault Tolerant Technology In MPI Environment
5	Research On The Application Of IEEE1588 in High-availability Seamless Redundancy
6	High Altitude Long Endurance Uav Flight Control Computer Simulator System Redundancy Management Technology Research
7	The Research Of Redundancy And Fault-Tolerant Technology Based On Real-Time Operation System
8	Research On Node Optimization For Indoor Passive Localization
9	A Design And Implementation Of High Available 3G Core Network Equipment
10	The Optimum Design And Dynamics Study Of Spherical 2-DOF Parallel Manipulator With Actuation Redundancy