DMTCP

Creating a checkpoint means saving the current state of a large calculation or process so that it can be restarted later from the same point, potentially saving a great deal of time. We advise creating checkpoints for all computational work so that, if the software terminates for any reason, the tasks or calculations already completed can still be used afterward. If the software you are using cannot create its own checkpoints, you may be able to use the Distributed MultiThreaded Checkpointing (DMTCP) program for this task. As of Dec 21, 2018, the Argo cluster provides DMTCP version 2.5.2, the latest release at that time. DMTCP can provide checkpointing capabilities for Matlab, R, Java, Python, Perl, Ruby, PHP, Ocaml, GCL (GNU Common Lisp), emacs, vi/cscope, Open MPI, MPICH-2, MVAPICH2, Intel® MPI, OpenMP, and Cilk. However, not all of the supported languages and programs have been tested on the Argo cluster yet. To gain access to DMTCP, run module load dmtcp in interactive sessions; for batch jobs, load the module in your SLURM job-submission script.

Running a program with checkpointing

To create checkpoints at regular intervals (measured in seconds), launch the program via DMTCP as follows:

$ dmtcp_launch -i <interval_time_in_seconds> ./<executable_program> 
This saves a checkpoint every interval_time_in_seconds seconds in the current directory. DMTCP creates the checkpoint files with the ".dmtcp" extension, along with bash scripts named dmtcp_restart_script_xxxx*.sh, which can be used to restart your computation from the point in time identified by the xxxx* in the filename.
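For batch jobs, the same launch might look like the sketch below. The job name, time limit, 60-second interval, and ./my_program are placeholder assumptions, not values taken from the Argo scripts. The guards only exist so that the script can be dry-run on a machine without environment modules or DMTCP.

```shell
#!/bin/bash
#SBATCH --job-name=dmtcp_demo      # hypothetical job name
#SBATCH --time=00:30:00            # hypothetical time limit

INTERVAL=60                        # seconds between checkpoints (example value)
PROGRAM=./my_program               # hypothetical executable

# Load DMTCP where the environment-modules command is available
if type module >/dev/null 2>&1; then
    module load dmtcp
fi

echo "Launching: dmtcp_launch -i $INTERVAL $PROGRAM"

# Run under DMTCP only if dmtcp_launch is actually on the PATH
if command -v dmtcp_launch >/dev/null 2>&1; then
    dmtcp_launch -i "$INTERVAL" "$PROGRAM"
else
    echo "dmtcp_launch not found; is the dmtcp module loaded?" >&2
fi
```

Choose an interval that balances lost work against the time and disk space each checkpoint costs for your program.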

Creating checkpoints in another directory

If the checkpoint files (with ".dmtcp" extensions) become overwhelming, they can be saved in a different directory by specifying the --ckptdir= command line option. Make sure that the path is valid (e.g. the directory has already been created) before you execute the dmtcp_launch command. The directory must also be writable, so you probably want to use a directory in /scratch/$USER.
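A minimal sketch of preparing the directory before launching. The helper name prepare_ckpt_dir, the dmtcp_ckpts subdirectory, the 60-second interval, and ./my_program are all assumptions made for illustration. The option is written here in space-separated form; check dmtcp_launch --help for the exact syntax your version accepts.

```shell
# Hypothetical helper: create the checkpoint directory if needed and
# confirm that it is writable. Prints the directory on success.
prepare_ckpt_dir() {
    dir="$1"
    mkdir -p "$dir" || return 1
    [ -w "$dir" ] || return 1
    printf '%s\n' "$dir"
}

# Assumed location under scratch; override by setting CKPT_DIR beforehand
CKPT_DIR="${CKPT_DIR:-/scratch/$USER/dmtcp_ckpts}"

# Launch only when DMTCP is available and the directory is usable
if command -v dmtcp_launch >/dev/null 2>&1; then
    if prepare_ckpt_dir "$CKPT_DIR" >/dev/null; then
        dmtcp_launch --ckptdir "$CKPT_DIR" -i 60 ./my_program
    else
        echo "Cannot create or write to $CKPT_DIR" >&2
    fi
fi
```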

Checkpointing Java programs

Note that when you use the dmtcp_launch command to run a Java program, you may sometimes see the following error message:

[40000] NOTE at siginfo.cpp:55 in setupCkptSigHandler; REASON='Your chosen SIGCKPT is not a valid signal, and cannot be used. Default signal will be used instead.'
    STOPSIGNAL = 7779
    12 = 12
[40000] WARNING at signalwrappers.cpp:141 in sigaction; REASON='JWARNING(false) failed'
    "Application trying to use DMTCP's signal for it's own use.\n" "  You should employ a different signal by setting the\n" "  environment variable DMTCP_SIGCKPT to the number\n" "  of the signal that DMTCP should use for checkpointing." Application trying to use DMTCP's signal for it's own use.
 You should employ a different signal by setting the
 environment variable DMTCP_SIGCKPT to the number
 of the signal that DMTCP should use for checkpointing.
    stopSignal = 12
You can eliminate this message by issuing the following command before you run dmtcp_launch (you may want to put this in your SLURM submission script as well):

export DMTCP_SIGCKPT=10
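A minimal sketch of a Java launch with the alternative signal, assuming a compiled DmtcpJavaDemo class in the current directory and an example interval of 60 seconds. The kill -l call only prints the name of the chosen signal (typically USR1 for 10 on Linux x86-64), so you can confirm it is not one your application needs.

```shell
# Use signal 10 instead of the signal DMTCP reported in the warning (12)
export DMTCP_SIGCKPT=10

# Show which signal name this number maps to on the current machine
kill -l "$DMTCP_SIGCKPT"

# Launch the Java demo under DMTCP only if dmtcp_launch is on the PATH.
# The 60-second interval is an example value, not a recommendation.
if command -v dmtcp_launch >/dev/null 2>&1; then
    dmtcp_launch -i 60 java DmtcpJavaDemo
fi
```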

Restarting a checkpointed program

In addition to creating the checkpoint files, dmtcp_launch will create a bash script named dmtcp_restart_script.sh whenever it is interrupted. This script can be used to restart the program from the last checkpoint.

$ ./dmtcp_restart_script.sh
This script works fine in an interactive environment, but unfortunately it can sometimes fail when placed in a SLURM script. There is, however, another way to restart the program from a particular checkpoint:
$ dmtcp_restart ckpt_programName_****.dmtcp 

Two example programs that can be tested in an interactive session or as batch jobs are DmtcpCppDemo.cpp and DmtcpJavaDemo.java. In the following, an example launch and restart of the DmtcpJavaDemo program under DMTCP is shown in an interactive session.

SLURM submission and resubmission scripts

We offer the following Submit.slurm and Resubmit.slurm scripts to demonstrate how one might use checkpoints for jobs submitted to SLURM and then restart them. Both scripts contain code for either of the example programs. Just uncomment the section for one of the programs (in both scripts), and make sure the section for the other program is commented out.

To test out the SLURM scripts, try the following commands:

$ sbatch Submit.slurm
Submitted batch job 644859
Now wait for at least 15 seconds, and continue:

$ scancel <jobID>   # In this case 644859
Here is where you would find the name of the checkpoint file that you care about. You would then edit the "Resubmit.slurm" script and pass that filename as the parameter to the dmtcp_restart command. By default, we have set up the resubmit script to restart from the checkpoint file with the most recent timestamp, but this may not choose the file you care about in all circumstances. However, for demonstration purposes you should be able to just execute the following command:

$ sbatch Resubmit.slurm
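Selecting the checkpoint with the most recent timestamp could be done along the lines of the sketch below. This is not the contents of Resubmit.slurm. The function name latest_checkpoint is hypothetical, and the ckpt_*.dmtcp pattern is assumed from the filename form shown earlier.

```shell
# Hypothetical helper: print the most recently modified checkpoint file
# in the given directory (default: current directory), or nothing if
# no checkpoint files are present.
latest_checkpoint() {
    dir="${1:-.}"
    # ls -t sorts by modification time, newest first
    ls -t "$dir"/ckpt_*.dmtcp 2>/dev/null | head -n 1
}

CKPT="$(latest_checkpoint .)"

# Restart only if a checkpoint was found and dmtcp_restart is on the PATH
if [ -n "$CKPT" ] && command -v dmtcp_restart >/dev/null 2>&1; then
    dmtcp_restart "$CKPT"
fi
```

If the newest file is not the one you want, pass the desired filename to dmtcp_restart directly instead.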
Note: So far, MATLAB (without the Parallel Computing Toolbox), R (without Rmpi), Python, OpenMP-based threaded programs, and Java programs have been found to work properly under DMTCP on Argo.

External link: For more on DMTCP, please refer to the official documentation.