Checkpointing

What is Checkpointing?

Checkpointing is the ability to "snapshot" a running job at a scheduled interval. If the machine the job is running on powers off unexpectedly, or your job is suspended for any reason, the job can be resumed from the most recent "snapshot." Checkpointing is turned off by default, so you must explicitly request a checkpointing method when you submit your job.

DMTCP

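DMTCP (Distributed MultiThreaded CheckPointing) is a new method of checkpointing on Beocat. It snapshots a running program transparently, so the application itself does not need to be modified.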

Will Checkpointing work for me?

Well, it depends.

  1. Does your application use InfiniBand? If so, none of our current checkpointing software will work.
  2. Does your application use OpenMPI? If so, is it a multi-node job? If you answered yes to both of these questions, DMTCP checkpointing may work for you.
  3. Are you using more than one application? If so, are they running at the same time? If you answered yes to both of these questions, checkpointing MAY work for you.
  4. For all other use cases, DMTCP checkpointing should work.

If you still have problems with checkpointing, e-mail the administrator (beocat@cis.ksu.edu), who should be able to help you.

To enable DMTCP checkpointing for your application, add the -ckpt dmtcp option when you submit your job with qsub:

mozes@loki ~ $ qsub -ckpt dmtcp checkpoint.sh

In this submission, -ckpt dmtcp indicates that you want to use DMTCP checkpointing.
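
Grid Engine's qsub normally also reads options from lines beginning with #$ inside the job script, so the same request can probably be kept with the script instead of being typed on every submission. The following is a minimal sketch of what a checkpoint.sh could look like under that assumption. The counting loop is a hypothetical stand-in for your real application, and whether Beocat honors the embedded directive is an assumption to verify on your own job.

```shell
#!/bin/sh
# checkpoint.sh: hypothetical job script, not an official Beocat template.
#
# Embedded Grid Engine option. Assumed to be equivalent to passing
# "-ckpt dmtcp" on the qsub command line.
#$ -ckpt dmtcp

# Stand-in workload (hypothetical): replace this loop with your own program.
# The script contains no save or restore logic of its own; with DMTCP the
# snapshot is taken from outside the running process.
STEPS=5
total=0
i=1
while [ "$i" -le "$STEPS" ]; do
    total=$((total + i))
    echo "step $i of $STEPS, running total $total"
    i=$((i + 1))
done
echo "finished: total=$total"
```

With the directive in the script, the submission would reduce to "qsub checkpoint.sh". When run outside the scheduler, the #$ line is treated as an ordinary shell comment, so the script can still be tested by hand.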