Running CAFFE on Argo

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. For more details regarding Caffe see the caffe homepage.

Loading Caffe

You can check to see what versions of Caffe are available using the following command:

module avail caffe
To load your preferred version, use the module load command. For example:
module load caffe/1.0.0-rc3
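
Once the module is loaded, it is worth confirming that the caffe binary is on your PATH and that the module's environment variables are set. This is a quick sanity check; the exact paths printed will depend on the version you loaded.

module list
which caffe
echo $CAFFE_ROOT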

Setting Up an Example

There are a number of examples in the $CAFFE_ROOT/examples directory. We will use the "cifar10" example to demonstrate how to run a Caffe program on ARGO. Unfortunately, these examples intertwine parameter files, data files, and results files in one directory structure. We generally recommend that you keep data and parameter files in your /home directory and put results in your /scratch directory, since the /home file system is not writable from the compute nodes during a run. The /scratch file system is intended for short-term usage, so if you keep important input files there for too long, we may erase them.

Nonetheless, we will store the entire example on the /scratch file system for now, because otherwise this example would become too burdensome. To do a proper setup, you would copy it to your home directory and then go through the ".prototxt" files, changing the paths for files that will be created or written to during the run so that they reference something in your /scratch directory. A good example of parameters that should be changed are the lines that begin with "snapshot_prefix:", which define where checkpoint files will be written. Other parameters in the ".prototxt" files reference parameter files that will remain constant during the run. For example, lines that begin with "net:" refer to other parameter files that define the neural network, and should probably reference a file on /home.

cd /scratch/$USER
mkdir -p examples
cp -r $CAFFE_ROOT/examples/cifar10 examples
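
For reference, if you later do the proper setup described above (parameter files in /home, results in /scratch), the edited lines in cifar10_quick_solver.prototxt might look like the sketch below. The "net:" and "snapshot_prefix:" field names come from the standard Caffe cifar10 example; the paths themselves are only illustrative and must be adapted to your own directories.

# In /home/<userID>/examples/cifar10/cifar10_quick_solver.prototxt (illustrative paths)
net: "/home/<userID>/examples/cifar10/cifar10_quick_train_test.prototxt"
snapshot_prefix: "/scratch/<userID>/examples/cifar10/cifar10_quick"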

Running the Example

The version of Caffe that is installed on ARGO is compiled with CUDA, hence users must run Caffe on nodes that have access to GPUs, even if it is run in CPU-only mode. Node 40 has 8 GPUs (4 K80 cards) and Nodes 50, 55, and 56 have 4 GPUs each (2 K80 cards). In order to run Caffe, users have to specify the number of GPUs requested in the SLURM job script, just like for any other GPU job:

#!/bin/sh

## Specify the name for your job, this is the job name by which Slurm will
## refer to your job.  This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name CaffeJob

#SBATCH --qos normal  # normal,cdsqos,phyqos,csqos,statsqos,hhqos,gaqos,esqos
#SBATCH -p gpuq       # partition (queue): all-LoPri, all-HiPri,
                      #   bigmem-LoPri, bigmem-HiPri, gpuq, CS_q, CDS_q, ...

## Deal with output and errors.  Separate into 2 files (not the default).
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH -o /scratch/%u/%x-%N-%j.out    # Output file
#SBATCH -e /scratch/%u/%x-%N-%j.err    # Error file
#SBATCH --mail-type=BEGIN,END,FAIL     # NONE,BEGIN,END,FAIL,REQUEUE,ALL,...
#SBATCH --mail-user=<userID>@gmu.edu   # Put your GMU email address here

## Specifying an upper limit on needed resources will improve your scheduling
## priority, but if you exceed these values, your job will be terminated.
## Check your "Job Ended" emails for actual resource usage info.
#SBATCH --mem=5G          # Total memory needed for your job (suffixes: K,M,G,T)
#SBATCH --time=0-00:10    # Total time needed for your job: Days-Hours:Minutes

## These options are more useful when running parallel and array jobs
#SBATCH --nodes 1         # Number of nodes (computers) to reserve
#SBATCH --ntasks 1        # Number of independent processes per job
#SBATCH --gres=gpu:<G>    # Number of GPUs to reserve

## Load the relevant modules needed for the job
module load caffe/1.0.0-rc3

## Start the job.  Run from the directory where the example was set up,
## since the solver paths below are relative to it.
cd /scratch/$USER
echo $CUDA_VISIBLE_DEVICES
caffe train \
  --solver=examples/cifar10/cifar10_quick_solver.prototxt \
  --gpu 0,1,...,<G-1>

# reduce learning rate by factor of 10 after 8 epochs
caffe train \
  --solver=examples/cifar10/cifar10_quick_solver_lr1.prototxt \
  --snapshot=examples/cifar10/cifar10_quick_iter_4000.solverstate.h5 \
  --gpu 0,1,...,<G-1>

In the --gres option, <G> is the number of GPUs requested. Also, when the caffe program is called, you have to tell it how many GPUs to use. The important thing to note here is the --gpu 0,1,...,<G-1> option at the end of the caffe command. This parameter tells the caffe engine to use GPUs 0 through <G-1> from the GPUs that are allocated to the job. If you are requesting all of the GPUs available on the node (8 for Node 40 and 4 for Nodes 50, 55 and 56), you can use the shortcut --gpu all instead.

NOTE: The GPU IDs you specify in the caffe command are not the actual system GPU IDs but IDs relative to the number of GPUs allocated. For example, if <G> in the job script above were 4, then when the caffe program is called the --gpu option should list 0,1,2,3 and not the actual IDs of the GPUs allocated to the job.
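
For instance, with --gres=gpu:4 the caffe invocation always uses the relative IDs 0 through 3, regardless of which physical GPUs SLURM happened to assign (the device numbers below are purely hypothetical):

echo $CUDA_VISIBLE_DEVICES    # may show the physical devices, e.g. 4,5,6,7
caffe train \
  --solver=examples/cifar10/cifar10_quick_solver.prototxt \
  --gpu 0,1,2,3               # relative IDs, not the physical device numbers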

The GPU IDs specified in the caffe command's --gpu option must be between 0 and <G-1>, where <G> is the total number of GPUs requested for the job via the --gres=gpu:<G> option; otherwise you will get the error shown below and your job will fail to run:

Check failed: error == cudaSuccess (10 vs. 0)  invalid device ordinal
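
Once the <userID> and <G> placeholders have been filled in, save the job script (the file name below is only an example) and submit it with sbatch; squeue can then be used to check on its progress:

cd /scratch/$USER
sbatch caffe_cifar10.slurm
squeue -u $USER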