DMTCP
Creating a checkpoint involves saving the current state of a large
calculation or process so that it can be restarted later from the same
point, potentially saving a great deal of time. We advise creating
checkpoints for all computational work so that, if the software
terminates for any reason, already completed tasks or calculations can
still be used afterward. If the software you are using cannot create its
own checkpoints, you may be able to use the Distributed MultiThreaded
Checkpointing (dmtcp) program for this task. As of Dec 21, 2018, the
Argo cluster provides dmtcp version 2.5.2, the latest release at that
time. DMTCP can provide checkpointing capabilities for Matlab, R, Java,
Python, Perl, Ruby, PHP, Ocaml, GCL (GNU Common Lisp), emacs, vi/cscope,
Open MPI, MPICH-2, MVAPICH2, Intel® MPI, OpenMP, and Cilk. However, not
all supported languages and programs have been tested yet on the Argo
cluster. To gain access to DMTCP, run module load dmtcp in interactive
sessions; for batch jobs, load the module in your SLURM job-submission
script.
Running a program with checkpointing
To create checkpoints at regular intervals (measured in seconds), launch the program via DMTCP as follows:
$ dmtcp_launch -i <interval_time_in_seconds> ./<executable_program>
DMTCP then writes a checkpoint every interval_time_in_seconds seconds in
the current directory. The checkpoint files have the ".dmtcp" extension
and are accompanied by bash scripts named dmtcp_restart_script_xxxx*.sh,
which can be used to restart your computation from the time step
indicated by the xxxx* in the filename.
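As a minimal sketch of the launch step above, the following shell fragment picks a checkpoint interval and starts a program under DMTCP. The interval value and the program name ./my_program are placeholders, not values taken from this page. The guard around dmtcp_launch is only there so that the fragment fails gracefully when the module has not been loaded.

```shell
# Assumed values for illustration only: adjust both for your own job.
CKPT_INTERVAL=$((10 * 60))   # checkpoint every 10 minutes, expressed in seconds
PROGRAM=./my_program         # hypothetical executable name

if command -v dmtcp_launch >/dev/null 2>&1; then
    # -i sets the interval between automatic checkpoints, in seconds.
    dmtcp_launch -i "$CKPT_INTERVAL" "$PROGRAM"
else
    echo "dmtcp_launch not found; run 'module load dmtcp' first" >&2
fi
```

A shorter interval loses less work after a failure but spends more time and disk space writing checkpoints, so the right value depends on the job.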
Creating checkpoints in another directory
If the checkpoint files (with ".dmtcp" extensions) become
overwhelming, they can be saved in a different directory by specifying
the --ckptdir= command line option. Make sure that the path is valid
(e.g. the directory already exists) before you execute the dmtcp_launch
command. The directory must also be writable, so you probably want to
use a directory in /scratch/$USER.
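The directory checks described above can be sketched as follows. The helper name prepare_ckpt_dir, the subdirectory dmtcp_ckpt, the interval, and the program name are all hypothetical. The option is written here in the space-separated form; check dmtcp_launch --help for the exact syntax accepted by the installed version.

```shell
# prepare_ckpt_dir DIR: create DIR if it does not exist yet and
# succeed only if it ends up as a writable directory.
prepare_ckpt_dir() {
    mkdir -p "$1" || return 1
    [ -d "$1" ] && [ -w "$1" ]
}

# Assumed location and program name, for illustration only.
CKPT_DIR="/scratch/${USER:-$(id -un)}/dmtcp_ckpt"
PROGRAM=./my_program

if ! command -v dmtcp_launch >/dev/null 2>&1; then
    echo "dmtcp_launch not found; run 'module load dmtcp' first" >&2
elif prepare_ckpt_dir "$CKPT_DIR"; then
    # Checkpoint files go to CKPT_DIR instead of the current directory.
    dmtcp_launch --ckptdir "$CKPT_DIR" -i 600 "$PROGRAM"
else
    echo "Cannot use $CKPT_DIR as a checkpoint directory" >&2
fi
```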
Checkpointing Java programs
Note that when you use the dmtcp_launch command to run a Java program,
you may sometimes see the following error message:
[40000] NOTE at siginfo.cpp:55 in setupCkptSigHandler; REASON='Your chosen SIGCKPT is not a valid signal, and cannot be used. Default signal will be used instead.'
STOPSIGNAL = 7779
12 = 12
[40000] WARNING at signalwrappers.cpp:141 in sigaction; REASON='JWARNING(false) failed'
"Application trying to use DMTCP's signal for it's own use.\n" " You should employ a different signal by setting the\n" " environment variable DMTCP_SIGCKPT to the number\n" " of the signal that DMTCP should use for checkpointing." Application trying to use DMTCP's signal for it's own use.
You should employ a different signal by setting the
environment variable DMTCP_SIGCKPT to the number
of the signal that DMTCP should use for checkpointing.
stopSignal = 12
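To avoid this, set the DMTCP_SIGCKPT environment variable to a different
signal number before running dmtcp_launch (you may want to put this in
your SLURM submission script as well):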
export DMTCP_SIGCKPT=10
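A small sketch of how this might look in a job script: set the variable, then print the names of the signals involved so the choice can be checked before launching. Signal numbers map to different names on different platforms, so the fragment prints them rather than assuming them.

```shell
# Use signal 10 for DMTCP checkpointing, as suggested above.
export DMTCP_SIGCKPT=10

# Show which signal names the numbers correspond to on this system.
# 12 is the signal number reported in the warning above.
for sig in "$DMTCP_SIGCKPT" 12; do
    echo "signal $sig is SIG$(kill -l "$sig")"
done
```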
In addition to creating the checkpoint files, dmtcp_launch will create
a bash script named dmtcp_restart_script.sh whenever it is interrupted.
This script can be used to restart the program from the last checkpoint.
$ ./dmtcp_restart_script.sh
Alternatively, a specific checkpoint file can be passed directly to dmtcp_restart:
$ dmtcp_restart ckpt_programName_****.dmtcp
Two example programs that can be tested in an interactive session or as batch jobs are DmtcpCppDemo.cpp and DmtcpJavaDemo.java. In the following, an example launch and restart of the DmtcpJavaDemo program under DMTCP is shown in an interactive session.
SLURM submission and resubmission scripts
We offer the following Submit.slurm and Resubmit.slurm scripts to demonstrate how one might use checkpoints for jobs submitted to SLURM and then restart them. Both scripts contain code to restart either of the example programs. Uncomment the section for one of the programs (in both scripts), and make sure the section for the other program is commented out.
To test out the SLURM scripts, try the following commands:
$ sbatch Submit.slurm
Submitted batch job 644859
$ scancel <jobID> # In this case 644859
Resubmit.slurm restarts the job with the dmtcp_restart command. By
default we have set up the resubmit script to restart from the
checkpoint file that has the most recent timestamp, but this may not
choose the file you care about in all circumstances. For demonstration
purposes, however, you should be able to just execute the following
command:
$ sbatch Resubmit.slurm
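Selecting the most recent checkpoint, as described above, can be sketched as follows. This is not the contents of Resubmit.slurm: the helper name latest_ckpt and the CKPT_DIR variable are hypothetical, and the ckpt_*.dmtcp pattern simply follows the file naming shown earlier.

```shell
# latest_ckpt DIR: print the checkpoint file in DIR with the newest
# modification time, or nothing if DIR holds no checkpoint files.
# Assumes file names without newlines, since the output of ls is parsed.
latest_ckpt() {
    ls -t "$1"/ckpt_*.dmtcp 2>/dev/null | head -n 1
}

# Assumed: checkpoints are in CKPT_DIR, or the current directory if unset.
CKPT_FILE="$(latest_ckpt "${CKPT_DIR:-.}")"

if [ -z "$CKPT_FILE" ]; then
    echo "No checkpoint file found in ${CKPT_DIR:-.}" >&2
elif command -v dmtcp_restart >/dev/null 2>&1; then
    dmtcp_restart "$CKPT_FILE"
else
    echo "dmtcp_restart not found; run 'module load dmtcp' first" >&2
fi
```

If the newest file is not the one you want, pass the desired checkpoint file to dmtcp_restart explicitly instead.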
MATLAB (without the parallel computing toolbox), R (without Rmpi),
Python, OpenMP-based threaded programs, and Java programs have been
found to work properly under dmtcp on ARGO.
External Link
For more on DMTCP, please refer to the official documentation.