Running GPU Jobs on the ORC Clusters
Available GPU Resources on Hopper
The GPU resources on Hopper currently include 2 DGX A100 40GB nodes, each with 8 A100.40GB GPUs, and 24 A100 80GB nodes, each with 4 A100.80GB GPUs. 8 of the A100.80GB GPUs are further partitioned into MIG slices to increase the number of GPU instances that can be started on Hopper.
GPU partitioning
The A100.80GB nodes have 4 NVIDIA Tesla A100 GPUs which can be further partitioned into smaller slices to optimize access and utilization. For example, each GPU can be sliced into as many as 7 instances when operating in MIG (Multi-Instance GPU) mode. MIG mode allows a single GPU to be partitioned into multiple instances, each with its own dedicated resources. This enables multiple users or applications to share a single GPU, improving overall utilization and efficiency.
The following table outlines the MIG partition types currently configured on Hopper and the resources allocated to each:
GPU Instance Profiles on A100
Profile Name | Fraction of Memory | Fraction of SMs | Hardware Units | L2 Cache Size | Number of Nodes | Total Available |
---|---|---|---|---|---|---|
MIG 1g.10gb | 1/8 | 1/7 | 0 NVDECs | 1/8 | 8 | 64 |
MIG 2g.20gb | 2/8 | 2/7 | 1 NVDEC | 2/8 | 4 | 32 |
MIG 3g.40gb | 4/8 | 3/7 | 2 NVDECs | 4/8 | 4 | 32 |
To make the most of the GPUs on Hopper, it is essential to evaluate your job's requirements and select the appropriate GPU slice based on availability and suitability. For instance, if your simulation demands minimal GPU memory, a MIG 1g.10gb slice (providing 10GB of GPU memory) would be more suitable, reserving larger slices for jobs with higher memory needs. In the context of machine learning, training tasks generally require more computation and memory, making a full GPU node or a larger slice like MIG 3g.40gb ideal, while inference tasks can be efficiently executed on smaller slices like MIG 1g.10gb or MIG 2g.20gb.
Our cluster currently offers 32 MIG 3g.40gb partitions, 32 MIG 2g.20gb partitions, and 64 MIG 1g.10gb partitions. This configuration ensures the most efficient use of our limited GPU resources. MIG technology enables better resource allocation and allows for more diverse workloads to be executed simultaneously, enhancing the overall performance and productivity of the cluster. The partitioning of GPU nodes is expected to evolve over time, optimizing resource utilization.
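To see which GPU types and MIG slices are currently configured, you can query SLURM from a login node. The command below is a quick sketch using the standard sinfo utility and assumes the gpuq partition described later in this document:
sinfo -p gpuq -o "%N %G"   # list nodes in the gpuq partition and the GPU (gres) types they offer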
Running GPU Jobs
GPU jobs can be run either from a shell session or from the Open OnDemand web dashboard.
GPU Jobs from Open OnDemand
After logging into Open OnDemand, select the app you want to run and complete the resource configuration table. To run your job on any of the available GPU resources, you need to select 'GPU' or 'Contrib GPU' for the partition:
You also need to set the correct GPU size depending on your job's needs:
After setting the additional options, your app will start on the selected GPU once you launch it.
GPU Jobs with SLURM
To run on the GPUs with SLURM, you need to set the correct PARTITION, QOS and GRES options when defining your SLURM parameters.
The partition and QOS are set as follows:
- Partition:
#SBATCH --partition=gpuq
or
#SBATCH --partition=contrib-gpuq
The contrib-gpuq partition can be used by all, but jobs from accounts that are not Hopper GPU node contributors will be open to preemption.
- QOS:
#SBATCH --qos=gpu
You need to combine the partition and QOS settings to run on the GPU nodes.
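For example, a job targeting the general GPU partition would include both of the following lines in its submission script:
#SBATCH --partition=gpuq
#SBATCH --qos=gpu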
You also need to set the type and number of GPUs with the gres parameter. The available GPU GRES options are shown in the following table:
Type of GPU | SLURM setting | No. of GPUs on Node | No. of CPUs | RAM |
---|---|---|---|---|
1g 10GB | --gres=gpu:1g.10gb:nGPUs | 4 | 64 | 500GB |
2g 20GB | --gres=gpu:2g.20gb:nGPUs | 4 | 64 | 500GB |
3g 40GB | --gres=gpu:3g.40gb:nGPUs | 4 | 64 | 500GB |
A100 80GB | --gres=gpu:A100.80gb:nGPUs | 4 | 64 | 500GB |
DGX A100 40GB | --gres=gpu:A100.40gb:nGPUs | 8 | 128 | 1TB |
Modify your SLURM options to make sure that you are requesting a suitable GPU slice.
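For example, based on the table above, a job that needs a single MIG 3g.40gb slice and a job that needs two full A100.80GB GPUs would request, respectively:
#SBATCH --gres=gpu:3g.40gb:1
#SBATCH --gres=gpu:A100.80gb:2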
GPU runs with SLURM can be made either interactively, directly on the GPU node, or in batch mode with a SLURM script.
Working interactively on a GPU
You start an interactive session on a GPU node with the salloc command:
salloc -p gpuq -q gpu --nodes=1 --ntasks-per-node=12 --gres=gpu:1g.10gb:1 --mem=15gb -t 0-02:00:00
This command will allocate the specified GPU resources (a 1g.10gb MIG instance), 12 cores, and 15GB of memory for 2 hours on a GPU node. Once the resources become available, your prompt will show that you're on one of the Hopper nodes.
salloc: Granted job allocation
salloc: Waiting for resource configuration
salloc: Nodes amd021 are ready for job
[user@amd021 ~]$
Once allocated, this will give you direct access to the GPU instance, where you can then work interactively from the command line. Modules you loaded while on the head nodes are exported onto the node as well. If you had not already loaded any modules, you should be able to load them now. To check the modules available on the node, use the command shown below:
$ module avail
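You can also confirm which GPU instance was assigned to your session by listing the devices visible to your job with nvidia-smi (the same tool used in the batch script example below), for example:
$ nvidia-smi -L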
The interactive session will persist until you type the 'exit' command as shown below:
$ exit
exit
salloc: Relinquishing job allocation
Using a SLURM Submission Script
Once your tests are done and you're ready to run longer jobs, you should switch to batch submission with SLURM. To do this, you write a SLURM script that sets the different parameters for your job, loads the necessary modules, and executes your Python script; the script is then submitted to the selected queue, from where it will run your job. Below is an example SLURM script (run.slurm) for a Python job on the GPU nodes. In the script, the partition is set to gpuq and the number of GPU nodes needed is set to 1:
#!/bin/bash
#SBATCH --partition=gpuq # need to set 'gpuq' or 'contrib-gpuq' partition
#SBATCH --qos=gpu # need to select 'gpu' QOS or other relevant QOS
#SBATCH --job-name=python-gpu
#SBATCH --output=/scratch/%u/%x-%N-%j.out # Output file
#SBATCH --error=/scratch/%u/%x-%N-%j.err # Error file
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 # number of cores needed
#SBATCH --gres=gpu:1g.10gb:1 # up to 8; only request what you need
#SBATCH --mem-per-cpu=3500M # memory per CORE; total memory is 1 TB (1,000,000 MB)
#SBATCH --export=ALL
#SBATCH --time=0-02:00:00 # set to 2hr; please choose carefully
set -x # echo each command as it is executed
umask 0027
# to see ID and state of GPUs assigned
nvidia-smi
module load gnu10
module load python
python myscript.py
Preferably, use the scratch space to submit your job's SLURM script. Use the
cd /scratch/UserID
command to change directories (replace 'UserID' with your GMU NetID), then submit your job with:
sbatch run.slurm
Please note that scratch directories have no space limit and data in /scratch gets purged 90 days from the date of creation, so make sure to move your files to a safe place before the purge.
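After submitting, you can monitor your job from the login node with standard SLURM commands, for example:
squeue -u $USER   # list your pending and running jobs
scancel JobID     # cancel a job, replacing 'JobID' with the ID reported by squeue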
To copy files directly from home or scratch to your projects or other space, you can use the cp command, which creates a copy of the contents of the source file or directory in the target file or directory. The cp command also copies entire directories into other directories if you specify the -r or -R flag. The command below copies all files from the scratch space to your project space (/projects/orctest in this example):
[UserId@hopper2 ~]$ cd /scratch/UserId
[UserId@hopper2 UserId]$ cp -p -r * /projects/orctest