Running CAFFE on ARGO
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. For more details regarding Caffe, see the Caffe homepage.
You can check to see what versions of Caffe are available using the following command:
module avail caffe
You can load Caffe into your environment using the module load command. For example:
module load caffe/1.0.0-rc3
Setting Up an Example
There are a number of examples in the $CAFFE_HOME/examples directory. We will use the "cifar10" example to demonstrate how to run a Caffe program on ARGO. Unfortunately, these examples intertwine parameter files, data files, and results files in one directory structure. We generally recommend that you keep data and parameter files in your /home directory and put results in your /scratch directory, since the /home file system is not writable from the compute nodes during a run. The /scratch file system is intended for short-term usage, so if you keep important input files there for too long, we may erase them.
Nonetheless, we will store the entire example on the /scratch file system for now, because otherwise this example would become too burdensome. To do a proper setup, you would copy it to your home directory and then go through the ".prototxt" files, changing the paths for files that will be created or written to during the run so that they reference something in your /scratch directory. A good example of parameters that should be changed are the lines that begin with "snapshot_prefix:", which define where checkpoint files will be written. Other parameters in the ".prototxt" files reference parameter files that will remain constant during the run. For example, lines that begin with "net:" refer to other parameter files that define the neural network, and should probably reference a file on /home.
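One way to make that path change is with sed rather than editing each file by hand. The sketch below is illustrative only: it creates a two-line stand-in for the cifar10 solver file (snapshot_prefix and net are real Caffe solver parameters, but the file contents and the user name "jdoe" are made up for the example) and rewrites just the snapshot path:

```shell
# Illustrative only: create a tiny stand-in solver file, then point its
# snapshot_prefix at a directory under /scratch (hypothetical user jdoe).
cat > cifar10_quick_solver.prototxt <<'EOF'
net: "examples/cifar10/cifar10_quick_train_test.prototxt"
snapshot_prefix: "examples/cifar10/cifar10_quick"
EOF

# Rewrite only the snapshot_prefix line; the net: line keeps its /home-side path.
sed -i 's|snapshot_prefix: "examples/cifar10|snapshot_prefix: "/scratch/jdoe/examples/cifar10|' \
    cifar10_quick_solver.prototxt

grep snapshot_prefix cifar10_quick_solver.prototxt
# -> snapshot_prefix: "/scratch/jdoe/examples/cifar10/cifar10_quick"
```

Because the sed pattern is anchored on the literal text snapshot_prefix, the "net:" line (which also mentions examples/cifar10) is left untouched.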
cd /scratch/$USER
mkdir -p examples
cp -r $CAFFE_ROOT/examples/cifar10 examples
Running the Example
The version of Caffe that is installed on ARGO is compiled with CUDA, so users must run Caffe on nodes that have access to GPUs, even when it is run in CPU-only mode. Node 40 has 8 GPUs (2 K80s), and Nodes 50, 55 and 56 have 4 GPUs (1 K80). In order to run caffe, users have to specify the number of GPUs requested in the SLURM job script, just as with any other GPU job:
#!/bin/sh

## Specify the name for your job, this is the job name by which Slurm will
## refer to your job. This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name CaffeJob
#SBATCH --qos normal            # normal,cdsqos,phyqos,csqos,statsqos,hhqos,gaqos,esqos
#SBATCH -p gpuq                 # partition (queue): all-LoPri, all-HiPri,
                                #   bigmem-LoPri, bigmem-HiPri, gpuq, CS_q, CDS_q, ...

## Deal with output and errors. Separate into 2 files (not the default).
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH -o /scratch/%u/%x-%N-%j.out   # Output file
#SBATCH -e /scratch/%u/%x-%N-%j.err   # Error file
#SBATCH --mail-type=BEGIN,END,FAIL    # NONE,BEGIN,END,FAIL,REQUEUE,ALL,...
#SBATCH --mail-user=<userID>@gmu.edu  # Put your GMU email address here

## Specifying an upper limit on needed resources will improve your scheduling
## priority, but if you exceed these values, your job will be terminated.
## Check your "Job Ended" emails for actual resource usage info.
#SBATCH --mem=5G         # Total memory needed for your job (suffixes: K,M,G,T)
#SBATCH --time=0-00:10   # Total time needed for your job: Days-Hours:Minutes

## These options are more useful when running parallel and array jobs
#SBATCH --nodes 1        # Number of nodes (computers) to reserve
#SBATCH --tasks 1        # Number of independent processes per job
#SBATCH --gres=gpu:<G>   # Number of GPUs to reserve

## Load the relevant modules needed for the job
module load caffe/1.0.0-rc3

## Start the job
echo $CUDA_VISIBLE_DEVICES

caffe train \
    --solver=examples/cifar10/cifar10_quick_solver.prototxt \
    --gpu 0,1,...,<G-1>

# reduce learning rate by factor of 10 after 8 epochs
caffe train \
    --solver=examples/cifar10/cifar10_quick_solver_lr1.prototxt \
    --snapshot=examples/cifar10/cifar10_quick_iter_4000.solverstate.h5 \
    --gpu 0,1,...,<G-1>
When the caffe program is called, you have to instruct it on the number of GPUs to use with the --gpu option. If you want to use all of the GPUs allocated to your job, you can pass --gpu all instead.

NOTE: The GPU IDs you specify in the caffe command are not the actual system GPU IDs, but IDs relative to the number of GPUs allocated. For example, if in the job script above you request four GPUs, the valid IDs for the --gpu option will be 0, 1, 2, and 3, and not the actual IDs of the GPUs allocated to the job.

The GPU IDs passed to the caffe command in --gpu must therefore be between 0 and <G-1>, where <G> is the number requested with the --gres=gpu:<G> option; otherwise you will get an error as shown below and your job will fail to run:
Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal
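Because the IDs are relative, a job script can build the --gpu list from $CUDA_VISIBLE_DEVICES instead of hard-coding it, so the caffe command always matches whatever --gres=gpu:<G> requested. A minimal sketch (the variable names are illustrative, not part of the ARGO setup; the default value stands in for what Slurm would set inside a real job):

```shell
# Count the GPUs Slurm handed to this job and build the matching
# relative ID list "0,1,...,N-1" for caffe's --gpu flag.
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1,2,3}   # set by Slurm in a real job

# Number of comma-separated entries = number of allocated GPUs.
NGPUS=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')

# Relative IDs always start at 0, whatever the system IDs were.
GPU_LIST=$(seq -s, 0 $((NGPUS - 1)))
echo "$GPU_LIST"   # with 4 allocated GPUs this prints 0,1,2,3

# caffe train --solver=examples/cifar10/cifar10_quick_solver.prototxt --gpu "$GPU_LIST"
```

This works even when Slurm allocates, say, system GPUs 5 and 6: CUDA_VISIBLE_DEVICES would contain two entries, so the script passes --gpu 0,1, which is exactly what caffe expects.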