Running PyTorch Jobs on Hopper
Jobs requiring PyTorch can be run directly using the system-installed Python, a Python virtual environment, or one of the available Singularity containers.
Running with the System Python in Batch Mode
To run with the system Python, log in to the cluster's AMD head node, which has a GPU card that allows for testing GPU codes.
ssh NetID@hopper-amd.orc.gmu.edu
module load gnu10
module load python
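Optionally, you can confirm that the modules are loaded and which Python interpreter they provide before proceeding:
module list
which python
python --version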
Create a directory and change into it:
mkdir cs678_example && cd cs678_example
Download the necessary files into the directory with the following commands:
wget https://wiki.orc.gmu.edu/mkdocs/cs678_example_files/main.py
wget https://wiki.orc.gmu.edu/mkdocs/cs678_example_files/run.slurm
Your directory contents should now look like:
cs678_example/
├── main.py
└── run.slurm
You can now test the Python script directly on the head node:
python main.py
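Before running the full script, a quick check that the system PyTorch installation can see a GPU may be useful (a minimal sanity check; it assumes the loaded python module provides the torch package):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"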
Running interactively on a GPU node
To work directly on a GPU node, start an interactive session with:
salloc --account cs678fl22 -p gpuq -q gpu --nodes=1 --ntasks-per-node=4 --gres=gpu:1g.10gb:1 --mem=50GB --time=0-2:00:00
The --gres and other parameters can be adjusted to match the required resources. Once on the GPU node, follow the same steps as outlined above to set up your environment and run your code. The interactive session will continue until the set time limit is reached or until it is ended with:
exit
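Once you are on the GPU node, you can confirm the resources actually assigned to the session before running anything, for example:
hostname                     # name of the allocated GPU node
nvidia-smi                   # GPU slice(s) visible to the job
echo $CUDA_VISIBLE_DEVICES   # device IDs assigned by Slurm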
Once you're ready, use a Slurm script to run your code in batch mode. When setting the Slurm parameters in your script, pay attention to the following:
1 - The account information:
#SBATCH --account=cs678fl22
2 - The partition and QOS:
#SBATCH --partition=gpuq
#SBATCH --qos=gpu
3 - The time limit, which on the gpuq partition defaults to 3 days:
#SBATCH --time=3-00:00:00
4 - GPU node options:
| Type of GPU | Slurm setting | No. of GPUs on Node | No. of CPUs | RAM |
|---|---|---|---|---|
| 1g 10GB | --gres=gpu:1g.10gb:nGPUs | 4 | 64 | 500GB |
| 2g 20GB | --gres=gpu:2g.20gb:nGPUs | 4 | 64 | 500GB |
| 3g 40GB | --gres=gpu:3g.40gb:nGPUs | 4 | 64 | 500GB |
| A100 80GB | --gres=gpu:A100.80gb:nGPUs | 4 | 64 | 500GB |
| DGX A100 40GB | --gres=gpu:A100.40gb:nGPUs | 8 | 128 | 1TB |
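For example, to request two of the 3g 40GB slices in a batch script, the corresponding lines would look like the following (an illustration based on the table above; adjust the GPU type and count to your needs):
#SBATCH --partition=gpuq
#SBATCH --qos=gpu
#SBATCH --gres=gpu:3g.40gb:2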
Below is a sample Slurm submission script; the same content is in the downloaded run.slurm. Update the timing information and the placeholder values for the number of CPU cores, the GPU request, and the memory, then submit the job with:
sbatch run.slurm
Sample script, run.slurm:
#!/bin/bash
#SBATCH --account=cs678fl22
#SBATCH --job-name=mnist
#SBATCH --partition=gpuq
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<N_CPU_CORES>
#SBATCH --gres=gpu:1g.10gb:<nGPUs>
#SBATCH --mem=<MEMORY>
#SBATCH --export=ALL
#SBATCH --time=0-01:00:00
#SBATCH --output=mnist.%j.out
#SBATCH --error=mnist.%j.err
# to see ID and state of GPUs assigned
nvidia-smi
## Load the necessary modules
module load gnu10
module load python
## Execute your script
python main.py
After submitting the Slurm script, you can monitor it and make updates to it using additional Slurm commands as detailed in this page on Monitoring and Controlling Slurm Jobs.
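For example, the following standard Slurm commands list your jobs, show details for one job, and cancel it if necessary (replace <jobID> with the ID printed by sbatch):
squeue -u $USER              # list your pending and running jobs
scontrol show job <jobID>    # detailed information for one job
scancel <jobID>              # cancel the job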
Managing PyTorch and other packages with Python virtual environments
To use additional libraries, or different versions of PyTorch and other packages than those available in the Python modules, use a Python virtual environment. The following steps summarize the process.
Info
To make sure your Python virtual environment runs across all nodes, it is important to use the versions of the modules built for this purpose. Load them as shown below before creating the virtual environment.
1 - First switch modules to GNU 10 compilations:
module load gnu10/10.3.0-ya
2 - Check the available Python modules and load one:
module avail python
module load python/3.9.9-jh
3 - Now you can create the Python virtual environment. After it is created, activate it and upgrade pip:
python -m venv pytorch-env
source pytorch-env/bin/activate
python -m pip install --upgrade pip
4 - Unload the system Python module and install the torch packages (refer to this page for updated instructions on installing PyTorch):
module unload python
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
deactivate
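To verify the installation before submitting a batch job, you can reactivate the environment and check the installed version (a quick sanity check; the version numbers shown in step 4 are only an example and may differ from what you install):
source pytorch-env/bin/activate
python -c "import torch; print(torch.__version__, torch.version.cuda)"
deactivate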
Sample script, run.slurm, for Python virtual environments:
#!/bin/bash
#SBATCH --account=cs678fl22
#SBATCH --job-name=mnist
#SBATCH --partition=gpuq
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<N_CPU_CORES>
#SBATCH --gres=gpu:1g.10gb:<nGPUs>
#SBATCH --mem=<MEMORY>
#SBATCH --export=ALL
#SBATCH --time=0-01:00:00
#SBATCH --output=mnist.%j.out
#SBATCH --error=mnist.%j.err
# To see ID and state of GPUs assigned
nvidia-smi
## Activate the python virtual environment
source ~/pytorch-env/bin/activate
## Execute your script
python main.py
Running PyTorch with Singularity Containers
Containers and examples for running PyTorch are available to all users on the cluster at:
/containers/dgx/Containers/pytorch
/containers/dgx/Examples/Pytorch
Copy the example files into your working directory with:
cp -r /containers/dgx/Examples/Pytorch/misc/* .
You can then modify the scripts to run in your directories. In the available example scripts, the environment variable $SINGULARITY_BASE points to /containers/dgx/Containers. Check these pages for examples on running containerized versions of PyTorch.
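As a rough sketch, a containerized run typically wraps the Python command in singularity exec with the --nv flag so the container can access the host GPUs. The image name below is a placeholder; check /containers/dgx/Containers/pytorch for the images actually available:
# <image>.sif is a placeholder for one of the PyTorch images under $SINGULARITY_BASE/pytorch
singularity exec --nv $SINGULARITY_BASE/pytorch/<image>.sif python main.py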
Common PyTorch Issues
- GPU not recognized - In your Python virtual environment, you probably need to use an earlier version of PyTorch; see the diagnostic sketch below.
- CUDA Memory Error - Refer to this page for documentation on dealing with this error: PyTorch FAQs
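If PyTorch does not detect the GPU, comparing the CUDA version PyTorch was built with against the driver reported on the node often narrows the problem down (a minimal diagnostic sketch):
nvidia-smi    # driver and CUDA version on the node
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"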