Keras Jobs on Argo

TensorFlow Setup on Argo

The Argo cluster has 4 GPU compute nodes (nodes 40, 50, 55 and 56) with 2 to 4 K80 graphics cards. These systems have 24 core CPUs with RAM size varying from 128GB to 512GB.

Choosing a TensorFlow Environment

Use "module avail tensorflow" command to find the available versions on Argo. Appropriate modules have to be loaded before running the required program/application. Some of the modules are designed for GPU use and will only run on the 4 GPU compute nodes, while others are for general CPU use, and will run on any node. In some cases the Python version is explicitly defined as part of the package (e.g. tensorflow/gpu/1.8.0-py36 uses Python 3.6.4). Once this module is loaded, attempts to load a different version of Python could cause errors.

For our example, we will use this TensorFlow module:

module load tensorflow/cpu/1.8.0-py36
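
Once the module is loaded, a quick sanity check confirms that the expected build is on your path (a minimal sketch; the version printed should match the module you loaded):

module list                                                   # confirm the TensorFlow module is loaded
python3 -c "import tensorflow as tf; print(tf.__version__)"   # should print 1.8.0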

Preparing the MNIST Example Program

As a demonstration, we will run one of the examples made available by the TensorFlow developers. Specifically, we will use their MNIST example to train a deep neural network to identify images of handwritten digits.

Clone the "models" Git repository that contains the MNIST example and check out the r1.8.0 release:

git clone https://github.com/tensorflow/models.git
cd models
export PYTHONPATH=$PWD:$PYTHONPATH  # Add the models dir to the PYTHONPATH
git checkout r1.8.0
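
Before going further you can confirm that the checkout and the PYTHONPATH change took effect (a quick check; depending on whether r1.8.0 is a branch or a tag, git may report a detached HEAD):

git status            # should report r1.8.0 (or a detached HEAD at that revision)
echo $PYTHONPATH      # the models directory should appear at the front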

Now we install the supporting Python packages that the example needs. Ideally we would use a Python virtual environment for this task, but as we describe in the section on the Limitations of Python Virtual Environments, that is not possible on Argo, so we use the alternative approach described there: installing the packages into a separate directory.

mkdir -p $HOME/python/models-r1.8.0-packages
pip3 install -r official/requirements.txt -t $HOME/python/models-r1.8.0-packages
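
To confirm that Python will pick up packages from this directory, you can inspect the module search path after exporting PYTHONPATH the same way the job script does below (a minimal sketch; the packages directory should appear near the front of the list):

export PYTHONPATH=$HOME/python/models-r1.8.0-packages:$PYTHONPATH
python3 -c "import sys; print('\n'.join(sys.path))"   # the packages directory should be listed early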

Submitting the MNIST Example to SLURM for Training

The job script described below is used to run the MNIST example for training. GPUs are treated as generic resources in SLURM, so you must request the number of GPUs you need in your submission script with the --gres option; even if you need only one GPU, you must include this option to request that one card. You also need to request the GPU partition (gpuq) in the SLURM script. NOTE: You cannot write files to the /home directory, so point all output paths to the /scratch directory.

Below is a sample submission script that requests 1 GPU and runs the MNIST program utilizing TensorFlow.

#!/bin/sh

## Specify the name for your job, this is the job name by which Slurm will
## refer to your job.  This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name tf-MNIST

#SBATCH --qos normal  # normal,cdsqos,phyqos,csqos,statsqos,hhqos,gaqos,esqos
#SBATCH -p gpuq       # partition (queue): all-LoPri, all-HiPri,
                      #   bigmem-LoPri, bigmem-HiPri, gpuq, CS_q, CDS_q, ...

## Deal with output and errors.  Separate into 2 files (not the default).
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH -o /scratch/%u/%x-%N-%j.out    # Output file
#SBATCH -e /scratch/%u/%x-%N-%j.err    # Error file
#SBATCH --mail-type=BEGIN,END,FAIL     # NONE,BEGIN,END,FAIL,REQUEUE,ALL,...
#SBATCH --mail-user=<userID>@gmu.edu   # Put your GMU email address here

## Specifying an upper limit on needed resources will improve your scheduling
## priority, but if you exceed these values, your job will be terminated.
## Check your "Job Ended" emails for actual resource usage info.
#SBATCH --mem=5G          # Total memory needed for your job (suffixes: K,M,G,T)
#SBATCH --time=0-00:15    # Total time needed for your job: Days-Hours:Minutes

## These options are more useful when running parallel and array jobs
#SBATCH --nodes 1         # Number of nodes (computers) to reserve
#SBATCH --tasks 1         # Number of independent processes per job
#SBATCH --gres=gpu:1      # Reserve 1 GPU

## Load the relevant modules needed for the job
module load tensorflow/gpu/1.8.0-py36

## Setup the environment
export PYTHONPATH=$HOME/python/models-r1.8.0-packages:$PYTHONPATH
export PYTHONPATH=<path/to>/models:$PYTHONPATH

## Start the job
python3 mnist.py

Change to the MNIST example directory:

cd official/mnist

In the official/mnist directory, cut and paste the above script into a file named mnist.slurm.

Edit the file, replacing all the <...> placeholders (such as <userID> and <path/to>) with appropriate values. Then submit your job using the sbatch command as discussed in Getting Started with Slurm.

sbatch mnist.slurm
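
Once the job is submitted, standard Slurm commands let you track it and locate its output (the file names below follow the -o/-e patterns used in the script):

squeue -u $USER                      # PD = pending, R = running
ls /scratch/$USER/tf-MNIST-*.out     # standard output appears here
ls /scratch/$USER/tf-MNIST-*.err     # errors and warnings appear here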

Keras Setup on Argo

Keras provides a high-level, easy-to-use API that runs on top of one of three supported backend libraries: TensorFlow, CNTK, or Theano. Argo provides several versions of Keras, but all of them use TensorFlow as the backend and are GPU-enabled, so a GPU node must be requested to use Keras.
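
You can list the Keras modules installed on Argo the same way as for TensorFlow (the versions shown will vary over time):

module avail keras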

*Important* Currently Keras attempts to store configuration information in $HOME/.keras, and if this fails it will try to use /tmp/.keras. If neither location is available, Keras will error out. On the Argo cluster /home is not writable, so Keras will always fall back to /tmp/.keras. The consequence is that the first person to run a Keras job on a node will succeed, while subsequent attempts by any other user will fail because /tmp/.keras will already exist and will not be writable to them. A workaround is to pre-create the .keras directory in your $SCRATCH directory and create a symbolic link to it from your $HOME directory:

> mkdir $SCRATCH/.keras
> ln -s $SCRATCH/.keras $HOME
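
With the symlink in place, Keras writes its configuration under $SCRATCH instead of /tmp. You can verify the setup and confirm the backend in one go (a minimal sketch; run the Python check on a GPU node, e.g. in an interactive session, since the Keras modules are GPU-enabled):

ls -ld $HOME/.keras                                         # should be a symlink to $SCRATCH/.keras
module load keras/2.2.0-py36
python3 -c "import keras; print(keras.backend.backend())"   # should print: tensorflow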

A Keras Test Program

Here is a sample python program you can use to test Keras:

# mnist_mlp.py
# From https://github.com/keras-team/keras/blob/master/examples/mnist_mlp.py

# Trains a simple deep NN on the MNIST dataset.
# Gets to 98.40% test accuracy after 20 epochs
# (there is *a lot* of margin for parameter tuning).
# 2 seconds per epoch on a K520 GPU.

from __future__ import print_function

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop

batch_size = 128
num_classes = 10
epochs = 20

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

A Keras SLURM Script

And here is a corresponding SLURM submission script:

#!/bin/sh

## Specify the name for your job, this is the job name by which Slurm will
## refer to your job.  This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name keras-MNIST

#SBATCH --qos normal  # normal,cdsqos,phyqos,csqos,statsqos,hhqos,gaqos,esqos
#SBATCH -p gpuq       # partition (queue): all-LoPri, all-HiPri,
                      #   bigmem-LoPri, bigmem-HiPri, gpuq, CS_q, CDS_q, ...

## Deal with output and errors.  Separate into 2 files (not the default).
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH -o /scratch/%u/%x-%N-%j.out    # Output file
#SBATCH -e /scratch/%u/%x-%N-%j.err    # Error file
#SBATCH --mail-type=BEGIN,END,FAIL     # NONE,BEGIN,END,FAIL,REQUEUE,ALL,...
#SBATCH --mail-user=<userID>@gmu.edu   # Put your GMU email address here

## Specifying an upper limit on needed resources will improve your scheduling
## priority, but if you exceed these values, your job will be terminated.
## Check your "Job Ended" emails for actual resource usage info.
#SBATCH --mem=5G          # Total memory needed for your job (suffixes: K,M,G,T)
#SBATCH --time=0-00:15    # Total time needed for your job: Days-Hours:Minutes

## These options are more useful when running parallel and array jobs
#SBATCH --nodes 1         # Number of nodes (computers) to reserve
#SBATCH --tasks 1         # Number of independent processes per job
#SBATCH --gres=gpu:1      # Reserve 1 GPU

## Load the relevant modules needed for the job
module load keras/2.2.0-py36

## Start the job
python3 mnist_mlp.py
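
Save the script above in the same directory as mnist_mlp.py (for example as keras-mnist.slurm; the file name is arbitrary) and submit it just like the TensorFlow example:

sbatch keras-mnist.slurm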

Current TF Issues