

Checkpoint/restart is one of the most important fault-tolerance techniques for HPC systems and applications. Traditional checkpoint/restart takes a checkpoint at every checkpointing interval and, upon a failure, loads the most recent checkpoint for recovery, thereby "rolling back" the application to a previous, correct program state. What we look for is a methodology that allows the checkpoint/restart system to "roll forward" the application under failures while remaining agnostic to the application. LetGo can potentially increase the theoretical checkpointing interval, because an application running with LetGo would fail much less often than without it.
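
As a rough illustration of why fewer failures permit a longer checkpointing interval, the sketch below uses Young's first-order approximation, interval = sqrt(2 * checkpoint cost * MTBF). This is a standard textbook model and not necessarily the one used in the LetGo evaluation. The function names, the 10-minute checkpoint cost, the 24-hour MTBF, and the 50% fraction of crashes that are "rolled forward" are all illustrative assumptions, not figures from this project.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def effective_mtbf(mtbf_s: float, continued_fraction: float) -> float:
    """MTBF seen by the application if a fraction of crashes is survived.

    Assumption: a survived ("rolled forward") crash no longer counts as a
    failure, so only (1 - continued_fraction) of the crashes remain.
    """
    if not 0.0 <= continued_fraction < 1.0:
        raise ValueError("continued_fraction must be in [0, 1)")
    return mtbf_s / (1.0 - continued_fraction)


# Hypothetical numbers, chosen only for illustration.
CHECKPOINT_COST_S = 600.0      # 10 minutes to write one checkpoint
MTBF_S = 24 * 3600.0           # one crash per day without roll-forward
CONTINUED_FRACTION = 0.5       # assumed share of crashes that are survived

baseline = young_interval(CHECKPOINT_COST_S, MTBF_S)
rolled_forward = young_interval(
    CHECKPOINT_COST_S, effective_mtbf(MTBF_S, CONTINUED_FRACTION)
)

print(f"baseline interval:     {baseline:.0f} s")
print(f"roll-forward interval: {rolled_forward:.0f} s")
```

Under this model, the interval grows by a factor of 1 / sqrt(1 - continued_fraction), so the benefit depends entirely on how many crashes can actually be continued through with acceptable results.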


Bo Fang
Qiang Guan (LANL)
Nathan DeBardeleben (LANL)
Karthik Pattabiraman
Matei Ripeanu


LetGo: the debugger-based failure-forwarding tool (GitHub)


[1] LetGo: A Lightweight Continuous Framework for HPC Applications under Failures, Bo Fang, Qiang Guan, Nathan DeBardeleben, Karthik Pattabiraman, Matei Ripeanu, 26th IEEE/ACM International Symposium on High Performance Distributed Computing (HPDC), June 2017, Washington, DC (acceptance rate: 19/102 = 18.9%) pdf slides