Keras Jobs on Argo

TensorFlow Setup on Argo

The Argo cluster has 10 GPU compute nodes. There are 4 nodes with Nvidia K80 graphics cards (nodes 40, 50, 55 and 56), each with 24 cores and RAM varying from 128GB to 512GB. Another 6 nodes (nodes 76-81) contain Nvidia V100 graphics cards, each with 28 cores and RAM varying from 770GB to 1.5TB.
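
You can query Slurm directly to see these nodes and their resources. A minimal check, assuming the GPU partition is named gpuq (the partition used in the job scripts below):

sinfo -p gpuq -o "%N %G %c %m"   # node list, GPUs (GRES), CPU count, memory (MB)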

Installing TensorFlow in a Python Virtual Environment

There are currently several modules named "tensorflow". We DO NOT recommend using these. Instead, we suggest that you install TensorFlow in a Python virtual environment. This will make it much easier to install additional Python packages that you might need, and will allow you to use the most up-to-date version of TensorFlow.

TensorFlow CPU (i.e. non-GPU) version

[jdoe@ARGO-1 ~]$ module load python/3.7.4
[jdoe@ARGO-1 ~]$ python -m virtualenv tf-cpu-VE
Using base prefix '/cm/shared/apps/python/3.7.4'
New python executable in /home/jdoe/tf-cpu-VE/bin/python
Installing setuptools, pip, wheel... done.
[jdoe@ARGO-1 ~]$ source ~/tf-cpu-VE/bin/activate
(tf-cpu-VE) [jdoe@ARGO-1 ~]$ pip install tensorflow==1.13.2
Collecting tensorflow==1.13.2
...
Installing collected packages: numpy, six, keras-preprocessing, gast, astor, protobuf, markdown, grpcio, werkzeug, absl-py, tensorboard, mock, tensorflow-estimator, termcolor, h5py, keras-applications, tensorflow
Successfully installed absl-py-0.9.0 astor-0.8.1 gast-0.3.3 grpcio-1.26.0 h5py-2.10.0 keras-applications-1.0.8 keras-preprocessing-1.1.0 markdown-3.1.1 mock-3.0.5 numpy-1.18.1 protobuf-3.11.2 six-1.14.0 tensorboard-1.13.1 tensorflow-1.13.2 tensorflow-estimator-1.13.0 termcolor-1.1.0 werkzeug-0.16.0
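
Before going any further, it is worth checking that TensorFlow imports cleanly inside the virtual environment; this is the same import that triggers the CXXABI error described below, so it doubles as a quick diagnostic:

(tf-cpu-VE) [jdoe@ARGO-1 ~]$ python -c "import tensorflow as tf; print(tf.__version__)"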

Preparing the MNIST Example Program

As a demonstration, we will run one of the examples made available by the TensorFlow developers. Specifically, we will use their MNIST example to train a deep neural network to identify images of handwritten digits.

Clone and check out the "models" Git repository that contains the MNIST example:

cd $SCRATCH
git clone https://github.com/tensorflow/models.git
cd models
export PYTHONPATH=$PWD:$PYTHONPATH  # Add the models dir to the PYTHONPATH
git checkout r1.13.0
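
To confirm that the repository is on your PYTHONPATH, you can try importing the official package (the directory that contains the MNIST example). This assumes official/ is an importable package in this release of the repository:

python -c "import official; print(official.__file__)"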

Now we will install the supporting Python packages into the same virtual environment:

pip install -r official/requirements.txt

In theory we should be able to test our installation with the following commands:

module load gcc/8.3.1
cd official/mnist
python mnist.py

However, at this point I get the following error:

ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/jbasset1/tf-cpu-VE/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)

/lib64/libstdc++.so.6 is the wrong library. The import should be picking up the libstdc++ from the gcc/8.3.1 module, but it cannot find it: $LD_LIBRARY_PATH is not set up properly for that version of GCC. In fact, I tried nearly every version from gcc/7.1.0 up with the same result, and all of the earlier versions are too old and do not provide CXXABI_1.3.8.
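
One workaround worth trying is to prepend the gcc module's lib64 directory to $LD_LIBRARY_PATH by hand. This is only a sketch: the install prefix below is an assumption (based on where the python module lives, under /cm/shared/apps), so check the real path with "module show" first:

module show gcc/8.3.1    # Note the module's install prefix
# Hypothetical prefix; adjust to whatever "module show" reports
export LD_LIBRARY_PATH=/cm/shared/apps/gcc/8.3.1/lib64:$LD_LIBRARY_PATH
python -c "import tensorflow"    # Re-test the failing import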

So this is where I'm leaving off.

NOTE: All text beyond this point will need to be updated.

Submitting the MNIST Example to Slurm for Training

The job script described below is used to run the MNIST example for training. GPUs are treated as generic resources (GRES) in Slurm, so your submission script must request the number of GPUs it needs; even if you need only one GPU, you must request that one GPU card explicitly. You also need to request the GPU partition in the Slurm script. NOTE: Jobs cannot write files to the /home directory, so point all output paths to the /scratch directory.
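
Before writing a full batch script, you can sanity-check that a GPU is actually allocated to you. A minimal test, assuming interactive srun jobs are permitted on the gpuq partition:

srun -p gpuq --gres=gpu:1 nvidia-smi   # Should report exactly one allocated GPU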

Below is a sample submission script that requests 1 GPU and runs the MNIST program utilizing TensorFlow.

#!/bin/sh

## Specify the name for your job, this is the job name by which Slurm will
## refer to your job.  This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name tf-MNIST

#SBATCH --qos normal  # normal,cdsqos,phyqos,csqos,statsqos,hhqos,gaqos,esqos
#SBATCH -p gpuq       # partition (queue): all-LoPri, all-HiPri,
                      #   bigmem-LoPri, bigmem-HiPri, gpuq, CS_q, CDS_q, ...

## Deal with output and errors.  Separate into 2 files (not the default).
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH -o /scratch/%u/%x-%N-%j.out    # Output file
#SBATCH -e /scratch/%u/%x-%N-%j.err    # Error file
#SBATCH --mail-type=BEGIN,END,FAIL     # NONE,BEGIN,END,FAIL,REQUEUE,ALL,...
#SBATCH --mail-user=<userID>@gmu.edu   # Put your GMU email address here

## Specifying an upper limit on needed resources will improve your scheduling
## priority, but if you exceed these values, your job will be terminated.
## Check your "Job Ended" emails for actual resource usage info.
#SBATCH --mem=5G          # Total memory needed for your job (suffixes: K,M,G,T)
#SBATCH --time=0-00:15    # Total time needed for your job: Days-Hours:Minutes

## These options are more useful when running parallel and array jobs
#SBATCH --nodes 1         # Number of nodes (computers) to reserve
#SBATCH --tasks 1         # Number of independent processes per job
#SBATCH --gres=gpu:1      # Reserve 1 GPU

## Load the relevant modules needed for the job
module load tensorflow/gpu/1.8.0-py36

## Setup the environment
export PYTHONPATH=$HOME/python/models-r1.8.0-packages:$PYTHONPATH
export PYTHONPATH=$SCRATCH/models:$PYTHONPATH   # The "models" repository cloned above

## Start the job
python3 mnist.py

Change to the MNIST example directory:

cd official/mnist

In the official/mnist directory, cut and paste the above script into a file named mnist.slurm.

Edit the file, replacing all the <userID> placeholders with appropriate values. Then submit your job using the sbatch command as discussed in Getting Started with Slurm.

sbatch mnist.slurm
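
Once the job is submitted, you can watch it and inspect its output with standard Slurm tools; the output file name below follows the -o pattern set in the script:

squeue -u $USER                     # Is the job pending or running?
cat /scratch/$USER/tf-MNIST-*.out   # Program output, once the job has started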

Keras Setup on Argo

Keras provides a high-level, easy-to-use API that runs on top of one of three supported backend libraries: TensorFlow, CNTK, and Theano. Argo provides several versions of Keras, but all of them use TensorFlow as the backend and are GPU-enabled. So, to use Keras, a GPU node must be requested.
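
A quick way to confirm that the TensorFlow backend can actually see the allocated GPU is to list the visible devices from inside a GPU job (this uses the TensorFlow 1.x device-listing API; run it on a GPU node, not the login node):

python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"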

A Keras Test Program

Here is a sample Python program you can use to test Keras:

# mnist_mlp.py
# From https://github.com/keras-team/keras/blob/master/examples/mnist_mlp.py

# Trains a simple deep NN on the MNIST dataset.
# Gets to 98.40% test accuracy after 20 epochs
# (there is *a lot* of margin for parameter tuning).
# 2 seconds per epoch on a K520 GPU.

from __future__ import print_function

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop

batch_size = 128
num_classes = 10
epochs = 20

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

A Keras Slurm Script

And here is a corresponding Slurm submission script:

#!/bin/sh

## Specify the name for your job, this is the job name by which Slurm will
## refer to your job.  This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name keras-MNIST

#SBATCH --qos normal  # normal,cdsqos,phyqos,csqos,statsqos,hhqos,gaqos,esqos
#SBATCH -p gpuq       # partition (queue): all-LoPri, all-HiPri,
                      #   bigmem-LoPri, bigmem-HiPri, gpuq, CS_q, CDS_q, ...

## Deal with output and errors.  Separate into 2 files (not the default).
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH -o /scratch/%u/%x-%N-%j.out    # Output file
#SBATCH -e /scratch/%u/%x-%N-%j.err    # Error file
#SBATCH --mail-type=BEGIN,END,FAIL     # NONE,BEGIN,END,FAIL,REQUEUE,ALL,...
#SBATCH --mail-user=<userID>@gmu.edu   # Put your GMU email address here

## Specifying an upper limit on needed resources will improve your scheduling
## priority, but if you exceed these values, your job will be terminated.
## Check your "Job Ended" emails for actual resource usage info.
#SBATCH --mem=5G          # Total memory needed for your job (suffixes: K,M,G,T)
#SBATCH --time=0-00:15    # Total time needed for your job: Days-Hours:Minutes

## These options are more useful when running parallel and array jobs
#SBATCH --nodes 1         # Number of nodes (computers) to reserve
#SBATCH --tasks 1         # Number of independent processes per job
#SBATCH --gres=gpu:1      # Reserve 1 GPU

## Load the relevant modules needed for the job
module load keras/2.2.0-py36

## Start the job
python3 mnist_mlp.py
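
As with the TensorFlow example, save this script in the directory containing mnist_mlp.py (the file name keras_mnist.slurm below is just a suggestion) and submit it:

sbatch keras_mnist.slurm
squeue -u $USER                        # Monitor the job
cat /scratch/$USER/keras-MNIST-*.out   # Inspect the results when it finishes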