Running Slurm Jobs With Multiple MIG Devices

IMPORTANT: Processes running on separate MIG devices are not able to communicate via CUDA. This document only shows how to run multiple independent CUDA processes in a single Slurm job. Distributed training with PyTorch, TensorFlow, or any other common ML/DL framework is not currently possible across MIG devices.

As mentioned in the Running GPU Jobs article, the most significant constraint on using MIG devices is the current restriction imposed by CUDA that limits MIG device enumeration to a single device.

It is still possible to use multiple MIG devices in a single Slurm job; however, some adjustments need to be made in the job environment. When a job is allocated GPU resources, Slurm sets an environment variable called CUDA_VISIBLE_DEVICES. For example, a job that requested two 2g.20gb MIG devices would have this variable set to something like this:

CUDA_VISIBLE_DEVICES=MIG-92b6c26a-bcfc-5603-b6fe-faa085154d31,MIG-2a3e9acd-9cc5-5e7c-b513-cfef043bb574

This environment variable is what CUDA uses to enumerate available GPU devices. In the case of MIG, however, CUDA will ignore any devices after the first one in the list. To use both devices we must parse the names of the MIG devices from the CUDA_VISIBLE_DEVICES variable and then pass each device name individually to the CUDA processes when they are run.
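
In isolation, that parsing step is just a matter of splitting the comma-separated list. Below is a minimal sketch of the idea, using a placeholder program name (./my-cuda-program) in place of a real single-device CUDA executable:

# Split the comma-separated MIG UUIDs that Slurm placed in CUDA_VISIBLE_DEVICES
IFS=',' read -ra mig_devices <<< "$CUDA_VISIBLE_DEVICES"

# Launch one background process per MIG device, each seeing only its own device
for dev in "${mig_devices[@]}"; do
    CUDA_VISIBLE_DEVICES="${dev}" ./my-cuda-program &
done
wait

The sample job script below does the same split with tr instead of a bash array; either form works.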

As a proof of concept, below is a sample Slurm script that requests two MIG devices and 9 cores. The script runs two copies of a CUDA Python program in the background, passing a single MIG device to each process (along with four of the physical cores by setting the OMP_NUM_THREADS variable). The script runs nvidia-smi to verify that each process is started on a distinct GPU, then waits in a final loop for the Python processes to complete.

#!/bin/bash
#SBATCH --job-name=multi-mig-test
#SBATCH --output=%j-%N-out.txt
#SBATCH --error=%j-%N-err.txt
#SBATCH --partition=gpuq
#SBATCH --ntasks=9
#SBATCH --mail-type=all
#SBATCH --mem-per-cpu=4G
#SBATCH --gres=gpu:2g.20gb:2
#SBATCH --qos=gpu

module load gnu10
module load python/3.9.9-jh

j=0
# Split CUDA_VISIBLE_DEVICES on commas and start one background process per MIG device
for i in $(echo $CUDA_VISIBLE_DEVICES | tr ',' ' '); do
    OMP_NUM_THREADS=4 CUDA_VISIBLE_DEVICES=${i} ./gpu-stress.py &
    pids[${j}]=$!
    j=$((j + 1))
done

sleep 20
nvidia-smi

# wait for all pids
for pid in ${pids[*]}; do
    wait $pid
done

We can see from the output of the job that each Python process gets its appropriate GPU device:

Fri May 17 14:58:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:81:00.0 Off |                   On |
| N/A   32C    P0             118W / 500W |   3161MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:C1:00.0 Off |                   On |
| N/A   35C    P0             190W / 500W |  21852MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    5   0   0  |            2219MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   0  |            2219MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    5    0    2962875      C   python3                                    2186MiB |
|    1    3    0    2962877      C   python3                                    2186MiB |
+---------------------------------------------------------------------------------------+