Checkpoint/restart is one of the most important fault-tolerance techniques for HPC systems and applications. The traditional checkpoint/restart approach regularly saves the state of an application (i.e., it "checkpoints") and, if an error is detected, loads the most recently saved state in an attempt to continue execution. Hence this technique "rolls back" the application to the last saved correct program state. We explore solutions that, once an error is detected, use heuristics to repair the application state, continue execution, and avoid failure (thus they "roll forward"). This approach can increase the theoretical checkpointing interval, because applications need to roll back much less often.
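
To illustrate why fewer rollbacks allow a longer checkpointing interval, the following is a minimal sketch based on Young's first-order approximation, in which the optimal interval is sqrt(2 × checkpoint cost × MTBF). This model is not taken from this page. It assumes that a fraction `repair_fraction` of failures is repaired by rolling forward, so that only the remaining failures force a rollback. The function names and the example numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def effective_mtbf(mtbf_s: float, repair_fraction: float) -> float:
    """Mean time between failures that still require a rollback.

    Assumption: a fraction `repair_fraction` of failures is repaired in place
    (roll forward), so only (1 - repair_fraction) of them trigger a rollback.
    """
    if not 0.0 <= repair_fraction < 1.0:
        raise ValueError("repair_fraction must be in [0, 1)")
    return mtbf_s / (1.0 - repair_fraction)


def interval_with_roll_forward(
    checkpoint_cost_s: float, mtbf_s: float, repair_fraction: float
) -> float:
    """Checkpoint interval when some failures are handled by rolling forward."""
    return young_interval(checkpoint_cost_s, effective_mtbf(mtbf_s, repair_fraction))


if __name__ == "__main__":
    # Hypothetical numbers: a 50 s checkpoint and a 10,000 s MTBF.
    baseline = young_interval(50.0, 10_000.0)
    improved = interval_with_roll_forward(50.0, 10_000.0, 0.75)
    print(f"baseline interval: {baseline:.0f} s")
    print(f"with roll-forward: {improved:.0f} s")
```

Under these assumptions the interval grows by a factor of 1/sqrt(1 - `repair_fraction`), so repairing three out of four failures doubles it. The sketch ignores the cost of the repair itself and the possibility that a repaired run produces an incorrect result.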


Bo Fang
Qiang Guan (LANL)
Nathan DeBardeleben (LANL)
Karthik Pattabiraman
Matei Ripeanu


LetGo: the debugger-based failure-forwarding tool (GitHub)


[1] LetGo: A Lightweight Continuous Framework for HPC Applications under Failures, Bo Fang, Qiang Guan, Nathan DeBardeleben, Karthik Pattabiraman, Matei Ripeanu, 26th IEEE/ACM International Symposium on High Performance Distributed Computing (HPDC), June 2017, Washington, DC (acceptance rate: 19/102 = 18.9%) (pdf, slides)