Running PyTorch Jobs on Hopper

PyTorch jobs can be run directly with the system-installed Python, in a Python virtual environment, or in any of the available Singularity containers.

Running with the System Python in Batch Mode

To run with the system Python, log in to the cluster's AMD head node, which has a GPU card for testing GPU code.

ssh NetID@hopper-amd.orc.gmu.edu

On the hopper-amd head node, load the GNU 10 module and the default Python module (version 3.9.9):

module load gnu10
module load python

Create a directory and change into it:

mkdir cs678_example && cd  cs678_example

Download the necessary files into the directory with the following commands:

wget https://wiki.orc.gmu.edu/mkdocs/cs678_example_files/main.py

wget https://wiki.orc.gmu.edu/mkdocs/cs678_example_files/run.slurm

Your directory contents should now look like:

cs678_example/
├── main.py
└── run.slurm

You can now test the Python script directly on the head node:

python main.py

Preferably, start an interactive session on a GPU compute node to test your script.
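Before a full run of main.py, a short check along the following lines can confirm whether PyTorch sees a GPU on the node you are on. This is a minimal sketch, not part of the downloaded example files: the function name gpu_report is made up for illustration, and the script assumes nothing about main.py. It imports torch only if the package is present, so it also runs in an environment where PyTorch is missing.

```python
import importlib.util


def gpu_report() -> str:
    """Return a one-line summary of what PyTorch can see on this node."""
    # Hypothetical helper: checks for torch before importing it.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this Python environment"

    import torch  # imported lazily so the script also runs without torch

    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: no CUDA device visible"

    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return f"torch {torch.__version__}: {len(names)} CUDA device(s): {', '.join(names)}"


print(gpu_report())
```

Running the same check on the head node and again inside an interactive GPU session is a quick way to compare what each environment exposes.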

Running interactively on a GPU node

To work directly on a GPU node, start an interactive session with

salloc --account cs678fl22 -p gpuq -q gpu --nodes=1 --ntasks-per-node=4 --gres=gpu:1g.10gb:1 --mem=50GB --time=0-2:00:00

The --gres value and other parameters can be adjusted to match the required resources. Once on the GPU node, follow the same steps outlined above to set up your environment and run your code. The interactive session continues until the time limit is reached or until you end it with

exit

This will take you back to the head node.

Once you're ready, use a Slurm script to run your code in batch mode. When setting the Slurm parameters in your script, pay attention to the following:

1 - The account information:

#SBATCH --account=cs678fl22

2 - The partition and QOS:

#SBATCH --partition=gpuq
#SBATCH --qos=gpu

3 - Time limits:

The time limit on the gpuq partition defaults to 3 days:

#SBATCH --time=3-00:00:00

but can be set to a maximum of 5 days.

4 - GPU node options:

Type of GPU      Slurm setting                  No. of GPUs on Node   No. of CPUs   RAM
1g 10GB          --gres=gpu:1g.10gb:nGPUs       4                     64            500GB
2g 20GB          --gres=gpu:2g.20gb:nGPUs       4                     64            500GB
3g 40GB          --gres=gpu:3g.40gb:nGPUs       4                     64            500GB
A100 80GB        --gres=gpu:A100.80gb:nGPUs     4                     64            500GB
DGX A100 40GB    --gres=gpu:A100.40gb:nGPUs     8                     128           1TB
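The checks implied by items 3 and 4 can be sketched in a few lines of Python. The sketch below copies the GPU table and the 5-day gpuq maximum from this page into a dictionary and a constant. The helper names (gres_option, parse_time, time_option) are hypothetical, and the parser assumes the D-HH:MM:SS time form used on this page rather than every format Slurm accepts.

```python
from datetime import timedelta

# GPU type -> (GPUs per node, CPUs per node, RAM), copied from the table above.
GPU_NODES = {
    "1g.10gb": (4, 64, "500GB"),
    "2g.20gb": (4, 64, "500GB"),
    "3g.40gb": (4, 64, "500GB"),
    "A100.80gb": (4, 64, "500GB"),
    "A100.40gb": (8, 128, "1TB"),
}

MAX_TIME = timedelta(days=5)  # gpuq maximum stated above


def gres_option(gpu_type: str, n_gpus: int) -> str:
    """Build a --gres option, rejecting requests larger than one node offers."""
    if gpu_type not in GPU_NODES:
        raise ValueError(f"unknown GPU type: {gpu_type}")
    gpus_per_node = GPU_NODES[gpu_type][0]
    if not 1 <= n_gpus <= gpus_per_node:
        raise ValueError(f"{gpu_type} nodes have {gpus_per_node} GPUs, asked for {n_gpus}")
    return f"--gres=gpu:{gpu_type}:{n_gpus}"


def parse_time(limit: str) -> timedelta:
    """Parse a time limit written as D-HH:MM:SS (assumed format)."""
    days, clock = limit.split("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


def time_option(limit: str) -> str:
    """Build a --time option, rejecting limits above the gpuq maximum."""
    if parse_time(limit) > MAX_TIME:
        raise ValueError(f"{limit} exceeds the 5-day maximum")
    return f"--time={limit}"


print(gres_option("1g.10gb", 1))
print(time_option("3-00:00:00"))
```

A request such as gres_option("A100.80gb", 5) raises ValueError here, because the table lists only 4 GPUs on those nodes.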

Below is a sample Slurm submission script, which is also in the downloaded run.slurm. Update the timing information and the <N_CPU_CORES>, <nGPUs>, and <MEMORY> placeholders to reflect the number of CPU cores, GPUs, and memory you need (referring to the table above), then submit it by entering

sbatch run.slurm

Sample script, run.slurm:

#!/bin/bash

#SBATCH --account=cs678fl22
#SBATCH --job-name=mnist
#SBATCH --partition=gpuq
#SBATCH --qos=gpu

#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=<N_CPU_CORES> 
#SBATCH --gres=gpu:1g.10gb:<nGPUs>
#SBATCH --mem=<MEMORY>  
#SBATCH --export=ALL 
#SBATCH --time=0-01:00:00 

#SBATCH --output=mnist.%j.out
#SBATCH --error=mnist.%j.err

# to see ID and state of GPUs assigned
nvidia-smi 

## Load the necessary modules
module load gnu10
module load python

## Execute your script
python main.py
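The angle-bracket placeholders in the sample script can be filled by hand or with a small helper. The following is a minimal sketch under stated assumptions: fill_placeholders is a hypothetical function, and the values (4 cores, 1 GPU, 50GB) are taken from the interactive salloc example earlier on this page. It raises an error if a placeholder has no value, so a half-edited script is not submitted by accident.

```python
import re

# Matches placeholders such as <N_CPU_CORES>, <nGPUs>, and <MEMORY>.
PLACEHOLDER = re.compile(r"<([A-Za-z_]+)>")


def fill_placeholders(template: str, values: dict) -> str:
    """Replace every <NAME> in the template with values[NAME]."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for <{name}>")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)


# Excerpt of the resource lines from the sample script above.
excerpt = """#SBATCH --ntasks-per-node=<N_CPU_CORES>
#SBATCH --gres=gpu:1g.10gb:<nGPUs>
#SBATCH --mem=<MEMORY>
"""

print(fill_placeholders(excerpt, {"N_CPU_CORES": 4, "nGPUs": 1, "MEMORY": "50GB"}))
```

The same function could be applied to the full text of run.slurm after reading it from disk.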

After submitting the Slurm script, you can monitor it and make updates to it using additional Slurm commands as detailed in this page on Monitoring and Controlling Slurm Jobs.
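When submission is scripted, the job ID that the monitoring commands need can be read from the confirmation line that sbatch prints ("Submitted batch job" followed by the ID). The sketch below is an illustration rather than a site-provided tool: the function names are hypothetical, and output_files simply mirrors the mnist.%j.out and mnist.%j.err patterns from the sample script, where %j is the job ID.

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from sbatch's confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def submit(script: str = "run.slurm") -> int:
    """Submit a script with sbatch and return its job ID (needs Slurm installed)."""
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


def output_files(job_id: int, job_name: str = "mnist") -> tuple:
    """Names of the output and error files the sample script will write."""
    return f"{job_name}.{job_id}.out", f"{job_name}.{job_id}.err"


print(output_files(parse_job_id("Submitted batch job 123456")))
```

Only submit() touches Slurm, so the parsing helpers can be tried on any machine.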

Managing PyTorch and other packages with Python virtual environments

To use additional libraries, or versions of PyTorch and other libraries that differ from those provided by the python modules, use Python Virtual Environments. The following steps summarize the process.

Info

To make sure your Python Virtual Environment runs across all nodes, it is important to create it with the GNU 10 builds of the modules. Before creating the Python Virtual Environment:

1 - First switch modules to GNU 10 compilations:

module load gnu10/10.3.0-ya

2 - Check for and load the python module:

module avail python
module load python/3.9.9-jh

3 - Now you can create the python virtual environment. After it is created, activate it and upgrade pip

python -m venv pytorch-env

source pytorch-env/bin/activate

python -m pip install --upgrade pip

4 - Unload the system python module and install the torch packages (refer to this page for updated instructions on installing PyTorch):

module unload python
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

5 - Install additional packages as needed, test your scripts, and deactivate the Python Virtual Environment once you're done:

deactivate
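While the environment is active, a short script along these lines can confirm that the interpreter really is the one inside the Python Virtual Environment and report which torch packages pip installed. It is a sketch that uses only the standard library. The function name environment_summary is hypothetical, and the package list simply repeats the three packages from the pip3 command above.

```python
import sys
from importlib import metadata


def environment_summary(packages=("torch", "torchvision", "torchaudio")) -> dict:
    """Report whether a virtual environment is active and which versions are installed."""
    summary = {
        # Inside a venv, sys.prefix points at the environment, not the base install.
        "in_venv": sys.prefix != sys.base_prefix,
        "python": sys.executable,
    }
    for name in packages:
        try:
            summary[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            summary[name] = None  # package is not installed in this environment
    return summary


for key, value in environment_summary().items():
    print(f"{key}: {value}")
```

If in_venv prints False, the environment was not activated in the current shell.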

To run with the Python Virtual Environment in your Slurm script, activate the environment instead of loading the python module.

Sample script, run.slurm, for python virtual environments:

#!/bin/bash

#SBATCH --account=cs678fl22
#SBATCH --job-name=mnist
#SBATCH --partition=gpuq
#SBATCH --qos=gpu

#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=<N_CPU_CORES> 
#SBATCH --gres=gpu:1g.10gb:<nGPUs>
#SBATCH --mem=<MEMORY>  
#SBATCH --export=ALL 
#SBATCH --time=0-01:00:00

#SBATCH --output=mnist.%j.out
#SBATCH --error=mnist.%j.err

# To see ID and state of GPUs assigned
nvidia-smi 

## Activate the python virtual environment
source ~/pytorch-env/bin/activate

## Execute your script
python main.py

Running PyTorch with Singularity Containers

Containers and examples for running PyTorch are available to all users and can be found on the cluster at

/containers/dgx/Containers/pytorch

and

/containers/dgx/Examples/Pytorch

You can copy the setup directly to your working directory on Hopper with

cp -r /containers/dgx/Examples/Pytorch/misc/* .

You can then modify the scripts to run in your directories. In the available example scripts, the environment variable $SINGULARITY_BASE points to /containers/dgx/Containers. Check these pages for examples of running a containerized version of PyTorch.
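As a rough illustration of what such a script does, the sketch below assembles a Singularity command line in Python. It is not taken from the example scripts. The function name is hypothetical, and IMAGE_NAME.sif is a placeholder for whichever image you pick from the pytorch container directory. The --nv flag is Singularity's standard option for exposing NVIDIA GPUs inside the container.

```python
import os
import shlex


def singularity_command(image: str, script: str = "main.py") -> list:
    """Build the argument list to run a Python script inside a container with GPU access."""
    return ["singularity", "exec", "--nv", image, "python", script]


# Fall back to the location given above if SINGULARITY_BASE is not set.
base = os.environ.get("SINGULARITY_BASE", "/containers/dgx/Containers")

# IMAGE_NAME.sif is a placeholder, not a real file name; substitute an actual image.
image = os.path.join(base, "pytorch", "IMAGE_NAME.sif")

print(shlex.join(singularity_command(image)))
```

The printed line can be pasted into a Slurm script in place of the plain python main.py call once the image name is filled in.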

Common PyTorch Issues

GPU not recognized - In your Python Virtual Environment, you probably need to use an earlier version of PyTorch.
CUDA Memory Error - Refer to this page for documentation on dealing with this error: PyTorch FAQs