Running TensorFlow on GPUs
The best way to run TensorFlow on the ORC clusters is through the use of Python Virtual Environments.
Building Python Virtual Environments
As a first step, before installing the TensorFlow libraries, follow the instructions described in these pages to build the Python virtual environment. To make sure that the virtual environment persists, it is preferable to carry out these steps from your home directory. To summarize:
1- Load the needed Python modules:
module load gnu10
module load python
2- Create the virtual environment:
python -m venv tf-env
Installing TensorFlow in your Python Virtual Environment
Once the Python virtual environment is built, activate it, upgrade pip, and install TensorFlow:
$ source tf-env/bin/activate
$ python -m pip install --upgrade pip
$ pip install tensorflow==<version>
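To confirm that the install succeeded, a quick check like the one below can be run from inside the activated environment (a minimal sketch; the file name check-install.py is just an example). Note that the GPU itself will not be usable until the CUDA modules described in the next section are loaded.
# check-install.py -- minimal sanity check of the TensorFlow install
import tensorflow as tf

print("TensorFlow version:", tf.__version__)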
Loading the Correct CUDA Version
The version of TensorFlow installed in your Python virtual environment determines the versions of the CUDA and cuDNN libraries that you will need. The table below summarizes the combinations of Python, cuDNN, and CUDA versions required for your TensorFlow code to run correctly on the GPU nodes. A more exhaustive list is provided on the TensorFlow webpages.
TensorFlow version | Python version | Compiler | cuDNN | CUDA |
---|---|---|---|---|
tensorflow-2.6.0 | 3.6-3.9 | GCC 7.3.1 | 8.1 | 11.2 |
tensorflow-2.5.0 | 3.6-3.9 | GCC 7.3.1 | 8.1 | 11.2 |
tensorflow-2.4.0 | 3.6-3.8 | GCC 7.3.1 | 8.0 | 11.0 |
tensorflow-2.3.0 | 3.5-3.8 | GCC 7.3.1 | 7.6 | 10.1 |
tensorflow-2.2.0 | 3.5-3.8 | GCC 7.3.1 | 7.6 | 10.1 |
tensorflow-2.1.0 | 2.7, 3.5-3.7 | GCC 7.3.1 | 7.6 | 10.1 |
tensorflow-2.0.0 | 2.7, 3.3-3.7 | GCC 7.3.1 | 7.4 | 10.0 |
tensorflow_gpu-1.15.0 | 2.7, 3.3-3.7 | GCC 7.3.1 | 7.4 | 10.0 |
tensorflow_gpu-1.14.0 | 2.7, 3.3-3.7 | GCC 4.8 | 7.4 | 10.0 |
tensorflow_gpu-1.13.1 | 2.7, 3.3-3.7 | GCC 4.8 | 7.4 | 10.0 |
Check for the available versions of CUDA already installed on the cluster:
$ module avail cuda
--- /cm/shared/modulefiles ----------------------------------------------------
cuda/10.0 cuda/10.1 cuda/10.2
cuda/11.0 cuda/11.1 cuda/11.1.0 cuda/11.1.1 cuda/11.2 cuda/11.2.2
cuda/9.0 cuda/9.1 cuda/9.2
and load the needed version based on the TensorFlow version, together with cuDNN:
module load cuda/<version>
module load cudnn
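To double-check that the loaded modules match what your TensorFlow build expects, a short script along these lines can help (a sketch only; tf.sysconfig.get_build_info() is available in recent TensorFlow 2.x releases, and the file name cuda-check.py is arbitrary):
# cuda-check.py -- report the CUDA/cuDNN versions TensorFlow was built against
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("Built against CUDA :", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))
print("Visible GPU devices:", tf.config.list_physical_devices("GPU"))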
Testing your TensorFlow Code
There are GPUs on the Hopper AMD headnodes (hopper-amd), so it is possible to test your TensorFlow code while logged onto the headnode. For larger, more intensive tests, you need to start an interactive session directly on a GPU node. The following command asks for an allocation of one 1g.10gb GPU slice and 50GB of system memory. It also sets the number of nodes to 1 and the tasks per node to 8.
salloc -p gpuq -q gpu --gres=gpu:1g.10gb:1 --nodes=1 --ntasks-per-node=8 --mem=50G
$ source tf-env/bin/activate
Run your test script on the allocated GPU (a quick check is sketched below), then type exit to end the interactive session and release the allocation.
exit
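A small script such as the following (purely illustrative; the file name tf-gpu-test.py is arbitrary) confirms that TensorFlow can actually place work on the allocated GPU:
# tf-gpu-test.py -- run a small matrix multiplication on the GPU
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

if gpus:
    # Explicitly place a small computation on the first GPU.
    with tf.device("/GPU:0"):
        a = tf.random.normal((1000, 1000))
        b = tf.random.normal((1000, 1000))
        c = tf.matmul(a, b)
    print("Matrix multiply ran on the GPU, result shape:", c.shape)
else:
    print("No GPU visible -- check the loaded cuda/cudnn modules.")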
Writing a SLURM Script and Submitting a TensorFlow Job
After testing, you can submit the production run to the queue using a SLURM script, e.g. tf-run.slurm:
#!/bin/bash
#SBATCH --partition=gpuq # submit to the gpu partition
#SBATCH --qos=gpu
#SBATCH --gres=gpu:1g.10gb:1          # request one 1g.10gb GPU slice
#SBATCH --job-name=tf-example # name the job
#SBATCH --output=tf-example-%N-%j.out # write stdout to the named file
#SBATCH --error=tf-example-%N-%j.err  # write stderr to the named file
#SBATCH --time=0-00:30:00 # Run for max of 0 days 00 hrs, 30 mins, 00 secs
#SBATCH --nodes=1 # Request N nodes
#SBATCH --ntasks-per-node=8 # Request n cores per node
#SBATCH --mem-per-cpu=2GB # Request nGB RAM per core
## Load the required modules (use the CUDA version matching your TensorFlow install)
module load gnu10
module load cuda/<version>
module load cudnn
## Activate your Python virtual environment
## using the correct path for the virtual environment
source ~/tf-env/bin/activate
python tf-example.py
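For reference, tf-example.py stands in for whatever TensorFlow script you want to run; a minimal, self-contained stand-in that trains a tiny Keras model on random data (illustrative only, not the actual example used on the cluster) could look like this:
# tf-example.py -- minimal stand-in training script (illustrative only)
import numpy as np
import tensorflow as tf

# Report which GPUs the job can see.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))

# Random data in place of a real dataset.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")

# A small fully connected binary classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(x, y, epochs=3, batch_size=64)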
Submit the job to the queue with:
sbatch tf-run.slurm
Using Containerized TensorFlow
Containerized versions of TensorFlow are also available on the cluster. While they are stored on Hopper, they can also be run on the ARGO GPU nodes. Details and examples for using the containerized versions can be found in these pages.
NOTE: When running on ARGO, the SLURM scripts have to be updated accordingly. The main differences between ARGO and Hopper are detailed in these pages.