Running TensorFlow on GPUs

The recommended way to run TensorFlow on the ORC clusters is inside a Python Virtual Environment.

Building Python Virtual Environments

Before installing the TensorFlow libraries, follow the instructions in these pages to build the Python Virtual Environment. To make sure that the environment persists, it is preferable to run the steps from your home directory. To summarize:

1-Load the needed compiler and python modules

module load gnu10
module load python

2-Build the Python Virtual Environment

python -m venv tf-env

3-Unload the python module so that the next steps use the python installed in the Python Virtual Environment.

module unload python

Installing TensorFlow in your Python Virtual Environment

Once the Python Virtual Environment is built, activate it and upgrade pip.

$ source tf-env/bin/activate

$ python -m pip install --upgrade pip
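Before installing anything, it can help to confirm that `python` really comes from the virtual environment and not from a loaded module. The sketch below is one way to check this; `in_venv` is a hypothetical helper name, and it relies on the `VIRTUAL_ENV` variable that the venv activate script exports.

```shell
# Return 0 only when a virtual environment is active and `python`
# resolves to a file inside it. `in_venv` is a hypothetical helper name.
in_venv() {
    # VIRTUAL_ENV is exported by the venv activate script.
    [ -n "${VIRTUAL_ENV:-}" ] || return 1
    case "$(command -v python 2>/dev/null)" in
        "$VIRTUAL_ENV"/*) return 0 ;;
        *) return 1 ;;
    esac
}

if in_venv; then
    echo "python comes from $VIRTUAL_ENV"
else
    echo "no active virtual environment; run the activate script first"
fi
```

If the check fails after activation, a python module that is still loaded may be taking precedence in your PATH.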

You can now install the needed tensorflow version.

$ pip install tensorflow==<version>

Loading the Correct CUDA Version

The TensorFlow version installed in your Python Virtual Environment determines which versions of the CUDA and cuDNN libraries you need. The table below lists the combinations of python version, cuDNN and CUDA required for your TensorFlow code to run correctly on the GPU nodes. A more exhaustive list is provided on the TensorFlow webpages.

TensorFlow version	Python version	Compiler	cuDNN	CUDA
tensorflow-2.6.0	3.6-3.9	GCC 7.3.1	8.1	11.2
tensorflow-2.5.0	3.6-3.9	GCC 7.3.1	8.1	11.2
tensorflow-2.4.0	3.6-3.8	GCC 7.3.1	8.0	11.0
tensorflow-2.3.0	3.5-3.8	GCC 7.3.1	7.6	10.1
tensorflow-2.2.0	3.5-3.8	GCC 7.3.1	7.6	10.1
tensorflow-2.1.0	2.7, 3.5-3.7	GCC 7.3.1	7.6	10.1
tensorflow-2.0.0	2.7, 3.3-3.7	GCC 7.3.1	7.4	10.0
tensorflow_gpu-1.15.0	2.7, 3.3-3.7	GCC 7.3.1	7.4	10.0
tensorflow_gpu-1.14.0	2.7, 3.3-3.7	GCC 4.8	7.4	10.0
tensorflow_gpu-1.13.1	2.7, 3.3-3.7	GCC 4.8	7.4	10.0
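
The lookup in the table can be sketched as a small shell function. `tf_cuda_versions` is a hypothetical helper, not something provided on the cluster, and matching on the major.minor part of the version only is an assumption, since the table lists specific releases.

```shell
# Print "<CUDA> <cuDNN>" for a TensorFlow version, following the table above.
# `tf_cuda_versions` is a hypothetical helper; matching on major.minor only
# is an assumption, because the table lists specific releases.
tf_cuda_versions() {
    case "$1" in
        2.6.*|2.5.*)                 echo "11.2 8.1" ;;
        2.4.*)                       echo "11.0 8.0" ;;
        2.3.*|2.2.*|2.1.*)           echo "10.1 7.6" ;;
        2.0.*|1.15.*|1.14.*|1.13.*)  echo "10.0 7.4" ;;
        *) echo "unknown TensorFlow version: $1" >&2; return 1 ;;
    esac
}

tf_cuda_versions 2.4.0   # prints: 11.0 8.0
```

The first field of the output is the CUDA version to look for in the module list, and the second is the matching cuDNN version.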

Check for the available versions of CUDA already installed on the cluster:

$ module avail cuda

--- /cm/shared/modulefiles ----------------------------------------------------
cuda/10.0   cuda/10.1   cuda/10.2   
cuda/11.0   cuda/11.1   cuda/11.1.0 cuda/11.1.1 cuda/11.2   cuda/11.2.2 
cuda/9.0    cuda/9.1    cuda/9.2

and load the needed version based on the tensorflow version.

module load cuda/<version>

On Hopper, loading a cuDNN module also loads the corresponding CUDA module.

module load cudnn

Testing your TensorFlow Code

There are GPUs on the Hopper AMD headnodes (hopper-amd), so you can test your TensorFlow code while logged on to the headnode. For larger, more intensive tests, start an interactive session directly on a GPU node. The following command requests 1 GPU of type 1g.10gb and 50GB of memory on the node (the --mem option sets node memory, not GPU memory). It also sets the number of nodes to 1 and the tasks per node to 8.

salloc -p gpuq -q gpu --gres=gpu:1g.10gb:1 --nodes=1 --ntasks-per-node=8 --mem=50G

Once you're connected to the GPU node, you'll need to activate the python virtual environment

$ source tf-env/bin/activate

and make sure the correct cuda module is loaded before you can test your script. The interactive session will persist until you type

exit

and go back to the headnode.
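
A quick first test inside the interactive session is to ask TensorFlow which GPUs it can see. The sketch below wraps that in a hypothetical helper, `check_tf_gpu`. It assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available; older releases would need a different call.

```shell
# Print the GPUs visible to TensorFlow. `check_tf_gpu` is a hypothetical helper.
# Assumes TensorFlow 2.1 or later is installed in the active virtual environment.
check_tf_gpu() {
    python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
}

# Run it on the GPU node, with the environment active and the CUDA module loaded:
#   check_tf_gpu
```

An empty list usually means that TensorFlow could not find a matching CUDA or cuDNN library, so check the loaded modules against the version table.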

Writing a SLURM Script and Submitting a TensorFlow Job

After testing, you can submit the production run to the queue using a SLURM script, here named tf-run.slurm.

#!/bin/bash
#SBATCH   --partition=gpuq                     # submit to the gpu partition
#SBATCH   --qos=gpu
#SBATCH   --gres=gpu:1g.10gb:1                 # request 1 GPU of type 1g.10gb
#SBATCH   --job-name=tf-example                # name the job
#SBATCH   --output=tf-example-%N-%j.out        # write stdout to this file
#SBATCH   --error=tf-example-%N-%j.err         # write stderr to this file
#SBATCH   --time=0-00:30:00                    # run for max of 0 days 00 hrs, 30 mins, 00 secs

#SBATCH   --nodes=1                            # request 1 node
#SBATCH   --ntasks-per-node=8                  # request 8 tasks (cores) on the node
#SBATCH   --mem-per-cpu=2GB                    # request 2GB RAM per core


## Load the needed modules
module load gnu10
module load cudnn

## Activate your python virtual environment,
## using the correct path for the virtual environment
source ~/tf-env/bin/activate

python tf-example.py

You can then submit the script to the queue

sbatch tf-run.slurm

and your job will start running once the resources become available.
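
To follow the job after submission, one option is to capture the job ID at submit time. The sketch below uses the `--parsable` flag of sbatch, which prints the job ID, optionally followed by a semicolon and the cluster name. `job_id_from_parsable` is a hypothetical helper name, and the submission only runs where sbatch is available.

```shell
# Extract the job ID from `sbatch --parsable` output, which has the form
# "jobid" or "jobid;cluster". `job_id_from_parsable` is a hypothetical helper.
job_id_from_parsable() {
    echo "${1%%;*}"
}

# Submit and check the queue only where SLURM is available.
if command -v sbatch >/dev/null 2>&1; then
    jobid=$(job_id_from_parsable "$(sbatch --parsable tf-run.slurm)")
    if [ -n "$jobid" ]; then
        squeue -j "$jobid"   # show the state of the submitted job
    fi
fi
```

With the script above, output for the job would then appear in the tf-example-%N-%j.out and tf-example-%N-%j.err files, where %j is replaced by the same job ID.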

Using Containerized TensorFlow

Containerized versions of TensorFlow are also available on the cluster. Although they are stored on Hopper, they can also be run on the ARGO GPU nodes. Details and examples for using the containerized versions can be found in these pages.

NOTE: SLURM scripts written for Hopper have to be updated before they can be run on ARGO. The main differences between ARGO and Hopper are detailed in these pages.