Checkpointing
What is Checkpointing?
Checkpointing lets a running job be "snapshotted" at a scheduled interval. If the machine the job is running on powers off unexpectedly, or your job is suspended for any reason, the job can be resumed from the most recent "snapshot." Checkpointing is turned off by default, so you must explicitly enable a checkpointing method to use it.
DMTCP
DMTCP is a new method of checkpointing on Beocat.
MPI checkpointing is currently broken; we are working to resolve the issue. Once it is fixed, your mpirun command should look like this: mpirun --mca btl tcp,self <app>
- Unlike the old BLCR method, there are no compilation requirements.
- Infiniband is still not supported. The DMTCP developers are working on this for an upcoming release.
Checkpoints are taken once every 12 hours by default. Because the checkpoint is blocking (the job pauses while the snapshot is written), jobs that use a lot of memory (>=64GB) may benefit from a longer checkpoint interval. Set the interval with the qsub option -c <interval>, for example: qsub -ckpt dmtcp -c 36:00:00 script.qsub
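As a rough illustration of the interval option, the sketch below builds the submission command from a number of hours. The helper name ckpt_qsub_cmd is hypothetical, and it assumes the interval is written as HH:MM:SS, as in the example above. It only prints the command, so it can be tried without a scheduler.

```shell
# Hypothetical helper: print (not run) a qsub command with a custom
# checkpoint interval. Assumes the HH:MM:SS interval format used above.
ckpt_qsub_cmd() {
    hours=$1      # checkpoint interval in whole hours, e.g. 36
    script=$2     # job script to submit
    printf 'qsub -ckpt dmtcp -c %02d:00:00 %s\n' "$hours" "$script"
}

# Prints: qsub -ckpt dmtcp -c 36:00:00 script.qsub
ckpt_qsub_cmd 36 script.qsub
```

Printing the command first makes it easy to check the interval before actually submitting the job.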
Will Checkpointing work for me?
Well, it depends.
- Does your application use Infiniband? If so, none of our current checkpointing software will work.
- Does your application use OpenMPI? If so, is it a multi-node job? If you answered yes to both of these questions, DMTCP checkpointing may work for you.
- Are you using more than one application? If so, are they running at the same time? If you answered yes to both of these questions, checkpointing MAY work for you.
- For all other use cases, checkpointing should work.
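The checklist above can be summarized as a small decision helper. This is only a sketch: the function name ckpt_outlook and its yes/no arguments are hypothetical, and nothing is detected automatically, so the answers have to come from you.

```shell
# Hypothetical helper mirroring the checklist above.
# Each argument is "yes" or "no".
ckpt_outlook() {
    uses_infiniband=$1     # does the application use Infiniband?
    multinode_openmpi=$2   # OpenMPI and running on multiple nodes?
    concurrent_apps=$3     # more than one application at the same time?

    if [ "$uses_infiniband" = yes ]; then
        echo "will not work"
    elif [ "$multinode_openmpi" = yes ] || [ "$concurrent_apps" = yes ]; then
        echo "may work"
    else
        echo "should work"
    fi
}

ckpt_outlook no yes no   # prints "may work"
```

Infiniband is checked first because it rules out checkpointing regardless of the other answers.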
If you still have problems with checkpointing, e-mail the administrator (beocat@cis.ksu.edu), who should be able to help you.
To enable DMTCP checkpointing for your application, pass the -ckpt dmtcp option to qsub when you submit your job:
mozes@loki ~ $ qsub -ckpt dmtcp checkpoint.sh
In this submission, -ckpt dmtcp indicates that you want to use DMTCP checkpointing.
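The contents of checkpoint.sh are not shown here. As a minimal sketch, a job script might look like the following. The workload is a hypothetical placeholder, and the `#$` line assumes the scheduler reads Grid Engine style embedded options, in which case the flag would not need to be repeated on the command line.

```shell
#!/bin/bash
# Assumption: Grid Engine style embedded option, equivalent to
# passing -ckpt dmtcp on the qsub command line.
#$ -ckpt dmtcp

# Hypothetical placeholder workload; replace with your real application.
run_workload() {
    total=0
    for i in 1 2 3 4 5; do
        total=$((total + i))
    done
    echo "total=$total"
}

run_workload   # prints "total=15"
```

To the shell, the `#$` line is an ordinary comment, so the script also runs unchanged outside the scheduler.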