Colloquium 2025 Checkpoints: why, when and how
Checkpointing is a technique that lets a program save its current state and later resume execution from that saved state. It is useful for long-running jobs, which may be interrupted by unpredictable events such as system failures (hardware or software), bugs in the running program, or timeouts.
We have a wiki page about checkpoints, but it gives only general guidelines. In this webinar, we will introduce checkpointing through a few concrete examples that illustrate what the state of a program is and how its state at different points of execution is saved and restored. We will also discuss related topics, such as how often to save, checkpoint file types, and how to implement checkpointing for different categories of computational jobs: serial, threaded, and MPI.
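As a rough illustration of the idea for the serial case, here is a minimal sketch in Python. It is not taken from the webinar material: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the toy computation are all assumptions chosen for the example. The "state" here is just a loop counter and a running total, saved every `every` steps.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the target.

    The rename step means an interruption during the write cannot leave a
    half-written checkpoint in place of the previous good one.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, every=100, stop_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `every` steps.

    `stop_at` is a hypothetical knob that simulates an interruption by
    returning early at the given step.
    """
    # Resume from the checkpoint if there is one, otherwise start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if stop_at is not None and state["step"] == stop_at:
            return None  # simulated interruption: work since last save is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run("ckpt.json", 1000, stop_at=250)` and then `run("ckpt.json", 1000)` resumes from the last saved step rather than from zero. The saving frequency is a trade-off: a smaller `every` loses less work after an interruption but spends more time writing files.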