DGX User Guide for Hopper
Hardware Specs
You can learn more about NVIDIA DGX A100 systems here:
https://www.nvidia.com/en-us/data-center/dgx-a100/
GPUs | 8x NVIDIA A100 Tensor Core GPUs |
320 GB total memory | |
Performance | 5 petaFLOPS AI |
10 petaFLOPS INT8 | |
CPU | Dual AMD Rome 7742, |
128/256 cores/threads total, 2.25 GHz (base), 3.4 GHz (max boost) | |
System Memory | 1TB |
NVIDIA NVSwitches | 6 |
Networking | 8x Single-Port Mellanox ConnectX-6 VPI 200Gb/s HDR InfiniBand |
1x Dual-Port Mellanox ConnectX-6 VPI | |
10/25/50/100/200Gb/s Ethernet | |
Storage | OS: 2x 1.92TB M.2 NVME drives |
Internal Storage: 15TB | |
(4x 3.84TB) U.2 NVME drives | |
Base OS | Ubuntu 20.04 LTS |
Getting Access
The DGX server is a part of the new Hopper cluster. Users would need to log into Hopper to submit jobs to the DGX.
Log into the Hopper cluster with:
$ ssh <username>@hopper.orc.gmu.edu.
You can log into the DGX only if you only have an active job on the DGX:
- you have submitted a SLURM batch job (using
sbatch
) and it is actively running on the DGX, or - you have an active SLURM interactive session (using
salloc
) on the DGX
Depending on why you want access to the DGX, you can take these two approaches.
For Quick Compiling and Testing
Because the DGX has a different OS (Ubuntu 20.04 LTS) and CPU architecture (AMD EPYC Zen2), you would likely need to recompile your code on the DGX itself and run quick tests before submitting any production runs. For that purpose, you can request a small interactive session via SLURM:
- If you don't need a GPU, you can request 1 core :
$ salloc -p gpuq -q gpu --ntasks-per-node=1 -time 0-01:00:00
- If you need a GPU, you can request a GPU along with CPU cores
$ salloc -p gpuq -q gpu --ntasks-per-node=1 --gres=gpu:A100.40gb:1 -t 0-01:00:00
This will log you into the DGX as soon as the requested resource is available:
$ salloc -p gpuq -q gpu --ntasks-per-node=1 -t 0-01:00:00
salloc: Granted job allocation 5562
salloc: Waiting for resource configuration
salloc: Nodes dgx-a100-01 are ready for job
$user@dgx-a100-01:~$
For Production Calculations
For production runs, you can submit your job batch or interactive job through SLURM and ssh
into the DGX if necessary.
$ ssh dgx-a100-01
Otherwise, your connection attempt will be declined with a message like this:
_Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by server on port
Running Jobs
The DGX runs Ubuntu 20.04 LTS. You can run calculations on it by submitting jobs via SLURM in batch or interactive mode from Hopper.
Both containerized and native applications are supported. You can run
- containerized applications using Singularity containers you build or ones we provide
- native applications you have compiled or those we provision using Lmod modules
These two approaches are described below.
Running Containerized Applications
We provide a growing list of Singularity containers in a shared location. You are also welcome to pull and run your own Singularity containers.
Using Shared Containers
Containers and examples available for all users can be found on at /containers/dgx/Containers
and /containers/dgx/Examples
. The environmental variable $SINGULARITY_BASE
points to /containers/dgx/Containers
Currently available containers can be viewed with:
$ tree /containers/dgx/Containers
/containers/dgx/Containers
├── autodock
│ └── autodock_2020.06.sif
├── caffe
│ └── caffe_20.03-py3.sif
├── digits
│ └── digits_21.02-tensorflow-py3.sif
├── gamess
│ └── gamess_17.09-r2-libcchem.sif
├── gromacs
│ └── gromacs-2020_2.sif
├── lammps
│ ├── lammps_10Feb2021.sif
│ └── lammps_29Oct2020.sif
├── namd
│ ├── namd_2.13-multinode.sif
│ ├── namd_2.13-singlenode.sif
│ └── namd_3.0-alpha3-singlenode.sif
├── ngc-preflightcheck
│ └── ngc-preflightcheck_20.11.sif
├── nvidia-hpc-benchmarks
│ └── hpc-benchmarks_20.10-hpl.sif
├── pytorch
│ └── pytorch_21.02-py3.sif
├── quantum_espresso
│ └── quantum_espresso_v6.7.sif
└── tensorflow
├── tensorflow_21.02-tf1-py3.sif
├── tensorflow_21.02-tf2-py3.sif
├── tensorflow_21.04-tf1-py3.sif
└── tensorflow_21.04-tf2-py3.sif
We encourage using these shared containers because they are optimized by NVIDIA to run well on the DGX. Sharing containers also saves storage space. Please let us know if you want us to add particular containers.
Building your Own Containers
Modern containers come from many registries (Dockerhub, NGC, SingularityHub, Biocontainers, ... etc ) and in different formats (Docker, Singularity, OCI) and runtimes (Docker, Singularity, CharlieCloud, ...).
Warning
Please keep in mind that you can not build or run Docker containers directly on Hopper or the DGX. You would need to pull and convert Docker containers to Singularity format and run the Singularity containers.
We use Docker containers pulled from NVIDIA GPU Cloud (NGC) catalog in the examples below, but the same steps apply to containers from other sources. The NVIDIA GPU Cloud (NGC) provides simple access to GPU-optimized software for deep learning, data science and high-performance computing (HPC). An NGC account grants you access to these tools as well as the ability to set up a private registry to manage your customized software. However, it is not absolutely necessary that you have an NGC account. Please see the link below for more:
NGC commands:
This example below demonstrates how to search and pull down a GROMACS image using the NGC CLI:
$ ngc registry image list
$ ngc registry image list | grep -i <container_name>
$ ngc registry image info nvcr.io/<container_name>:<containter_tag>
$ ngc registry image list|grep -i gromacs
| GROMACS | hpc/gromac | 2020.2 | 275.47 MB | Sep 24, | unlocked|
$ ngc registry image info nvcr.io:hpc/gromacs
--------------------------------------------------
Image Repository Information
Name: GROMACS
Short Description: GROMACS is a popular molecular dynamics application used to simulate proteins and lipids.
Built By: KTH Royal Institute of Technology
Publisher: KTH Royal Institute of Technology
Multinode Support: False
Multi-Arch Support: True
Logo: https://assets.nvidiagrid.net/ngc/logos/ISV-OSS-Non-Nvidia-Publishing-Gromacs.png
Labels: Covid-19, HPC, Healthcare, High Performance Computing, Supercomputing, arm64, x86_64
Public: Yes
Last Updated: Sep 24, 2020
Latest Image Size: 275.47 MB
Latest Tag: 2020.2
Tags:
2020.2
2020
2020.2-arm64
2020.2-x86_64
2018.2
2016.4
$ ngc registry image info nvcr.io/hpc/gromacs:2020.2
--------------------------------------------------
Image Information
Name: hpc/gromacs:2020.2
Architecture: amd64
Schema Version: 1
Image Size: 275.47 MB
Last Updated: Jun 22, 2020
--------------------------------------------------
Pulling Docker containers and building Singularity containers:
Once you select a Docker container to use, you need to pull it down and convert it to a Singularity image format with the following command. You would need to load singularity
module first.
$ module load singularity
$ singularity build <container_name>_<container_version/tag>.sif docker://nvcr.io/<hpc>/<container_name><container_version/tag>
Here is an example for preparing a GROMACS Singularity container:
$ module load singularity
$ singularity build gromacs-2020_2.sif docker://nvcr.io/hpc/gromacs:2020.2
Please note that we have adapted the following convention on naming Singularity image files.
- we use SIF instead of SIMG for the file extension
- we name containers as
<container_name>_<container_version/tag>.sif
Also note that you can pull the containers from NGC, DockerHub or any other source, but we encourage using ones from the NGC registry if one is available because they are optimized for NVIDIA GPUs.
Running Native Applications
If you want to run native GPU-capable applications, you can run them much like you have on Argo.
- load up the module for the GPU-capable application/version
- run the application
We currently have a limited set of native applications that have been tested on the DGX. That will increase over time.
Warning
The DGX is very different from Argo and Hopper in terms of OS, CPU and GPU architecture as well as the software stack running on it. Therefore, you would generally need to recompile your code on the DGX itself using the software stack built for the DGX. Please email orchelp@gmu.edu if you need help.
System | Argo | Hopper | DGX |
---|---|---|---|
OS | CentOS 7.8 | CentOS 8.3 | Ubuntu 20.04 |
CPU | Intel | Intel | AMD |
GPUs | K80, V100 | - | A100 |
NVIDIA driver version | 440.x-455.y | - | 450.x |
To access modules built for the DGX, first load into the DGX by creating a short interactive session:
$ salloc -p gpuq -q gpu --ntasks-per-node=1 -t 0-01:00:00
salloc: Granted job allocation 5562
salloc: Waiting for resource configuration
salloc: Nodes dgx-a100-01 are ready for job
$
You should see a hosts/dgx
module loaded and other modules that are available to you:
$ module avail
...
----- GNU-9.3.0 ---------
openmpi/4.0.4-ev python/3.7.6-tf python/3.8.6-mf (L,D)
----- Independent ---------
cuda/10.2.89 cuda/11.2.1 (D) gnu9/9.3.0 intel/2020.2 orca/4.2.1 singularity/3.7.1
----- Core ---------
hosts/dgx (L) lmod settarg use.own
...
Scheduling SLURM Jobs
You can run a native or containerized application through SLURM either interactively or using batch submission scripts. Both approaches are discussed below. To run jobs on the DGX, you would need
- to have a SLURM account on Hopper AND
- be eligible to use the 'gpu' Quality-of-Service (QoS)
The DGX is part of the ‘gpuq’ partition.
$ sinfo -o "%12P %5D %14F %8z %10m %.11l %15N %G"
PARTITION NODES NODES(A/I/O/T) S:C:T MEMORY TIMELIMIT NODELIST GRES
debug 3 0/3/0/3 2:24:1 180000 1:00:00 hop[043-045] (null)
interactive 3 0/3/0/3 2:24:1 180000 12:00:00 hop[043-045] (null)
contrib 42 6/36/0/42 2:24:1 180000 6-00:00:00 hop[001-042] (null)
normal* 25 21/4/0/25 2:24:1 180000 3-00:00:00 hop[046-070] (null)
gpuq 1 0/1/0/1 8:16:1 1024000 2-00:00:00 dgx-a100-01 gpu:A100.40gb:6,gpu:1g.5gb:9,gpu:2g.10gb:1,gpu:3g.20gb:1
orc-test 70 27/43/0/70 2:24:1 180000 1-00:00:00 hop[001-070] (null)
A100
GPU resource in Slurm has been renamed as A100.40gb
. Therefore, you should request a full A100 GPU using --gres=gpu:A100.40gb
.
You can request smaller slices of a GPU using --gres=gpu:1g.5gb
, --gres=gpu:2g.10gb
or --gres=gpu:3g.20gb
. Please see below for details
The GPU list shows 6x A100.40gb GPUs as well as 9x 1g.5gb, 1x 2g.10gb and 1x 3g.20gb resources. The latter three types of resources are a product of a partitioning scheme called Multi-Instance GPU (MIG).
GPU partitioning
The DGX A100 has 8 NVIDIA Tesla A100 GPUs which can be further partitioned into smaller slices to optimize access and utilization. For example, each GPU can be sliced into as many as 7 instances when enabled to operate in MIG (Multi-Instance GPU) mode.
GPU Instance Profiles on A100 Profile
Name | Fraction of Memory | Fraction of SMs | Hardware Units | L2 Cache Size | Number of Instances Available |
---|---|---|---|---|---|
MIG 1g.5gb | 1/8 | 1/7 | 0 NVDECs | 1/8 | 7 |
MIG 2g.10gb | 2/8 | 2/7 | 1 NVDECs | 2/8 | 3 |
MIG 3g.20gb | 4/8 | 3/7 | 2 NVDECs | 4/8 | 2 |
MIG 4g.20gb | 4/8 | 4/7 | 2 NVDECs | 4/8 | 1 |
MIG 7g.40gb | Full | 7/7 | 5 NVDECs | Full | 1 |
Our DGX is currently partitioned such that six of the 8 A100 GPUs (GPU ID 0-5) are not partitioned while the last two (GPU ID 6-7) are partitioned into slices of different sizes.
GPU ID | Size | GRES name |
---|---|---|
0 | Full A100 | A100.40gb |
1 | Full A100 | A100.40gb |
2 | Full A100 | A100.40gb |
3 | Full A100 | A100.40gb |
4 | Full A100 | A100.40gb |
5 | Full A100 | A100.40gb |
6 | 7x 1/7 A100 | 1g.5gb |
7 | 2x 1/7 A100 | 1g.5gb |
1x 2/7 A100 | 2g.10gb | |
1x 4/7 A100 | 3g.20gb |
The way the GPUs are partitioned will likely change over time to optimize utilization.
The best way to take advantage of MIG operation is to analyze the demands of your job and determine which GPU size is available and suitable for it. For example, if your simulation uses very small memory, you would be better off using a 1g.5gb slice and leaving the bigger partitions to jobs that need more GPU memory. Another consideration for machine learning jobs is the difference in demands of training and inference tasks. Training tasks are more compute and memory intensive, this they are a better for for a full GPU or large partition while inference tasks would run sufficiently on smaller slices.
Interactive Mode
You can request an interactive access the DGX A100 server through SLLURM as follows:
$ salloc -p gpuq -q gpu --ntasks-per-node=1 --gres=gpu:A100.40gb:1 -t 0-01:00:00
salloc: Granted job allocation 2185
salloc: Waiting for resource configuration
salloc: Nodes dgx-a100-01 are ready for job
$
Once your reservation is available, you will be logged into the DGX automatically:
$ hostname -s
dgx-a100-01
To run the container while connected:
$ singularity run [ --nv] [other_options] <container_name>_<container_version/tag>.sif <command>
As an example, the following command runs a Python script using Tensorflow container
$ singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd /containers/dgx/Containers/tensorflow/tensorflow_21.02-tf1-py3.sif python test_single_gpu.py
You can run on any one or more GPUs. The GPUs are indexed 0-7. Since this is a shared resource, we encourage you to monitor the GPUs usage and selectively submit to idle GPU(s) when running jobs interactively. For example, the output of nvidia-smi
command suggests that there GPUs indexed 0,1,2 are being actively used, and you should run your jobs on one of the other GPUs.
$ nvidia-smi
Thu Mar 15 10:58:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 29C P0 52W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
.
.
.
| 6 A100-SXM4-40GB On | 00000000:B7:00.0 Off | On |
| N/A 31C P0 46W / 400W | 25MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB On | 00000000:BD:00.0 Off | On |
| N/A 31C P0 42W / 400W | 25MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 6 7 0 0 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
.
.
.
| 6 13 0 6 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 1 0 0 | 11MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 5 0 1 | 7MiB / 9984MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 13 0 2 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 14 0 3 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| running processes found |
| 0 App1 1% |
| 1 App2 12% |
| 2 App3 90% |
+-----------------------------------------------------------------------------+
To select particular GPU(s), you can use the SINGULARITYENV_CUDA_VISIBLE_DEVICES
environmental variable. For example, you can select the 1st and 3rd GPU by setting
$ SINGULARITYENV_CUDA_VISIBLE_DEVICES=0,2
SLURM specifies the GPU indices assigned to your job to the SLURM_JOB_GPUS
environmental variable. So you can set
$ SINGULARITYENV_CUDA_VISIBLE_DEVICES=${SLURM_JOB_GPUS}
For example, the following commands will run on any number of GPU assigned to you:
$ SINGULARITYENV_CUDA_VISIBLE_DEVICES=${SLURM_JOB_GPUS}
$ singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd /containers/dgx/Containers/tensorflow/tensorflow_21.02-tf1-py3.sif python test_single_gpu.pyUseful tools for monitoring the GPU usage
While you are on the server, you can use these tools to monitor the GPU usage:
nvitop -m
nvtop
nvidia-smi
Please remember to log out of the DGX A100 server when you finish running your interactive job.
Batch Mode
Below is a sample SLURM batch submission file you can use as an example to submit your jobs. Save the information into a file (say run.slurm
), and submit it by entering sbatch run.slurm
. Please update <N_CPU_CORES>
, <MEM_PER_CORE>
and <N_GPUs>
to reflect the number of CPU cores and GPUs you need. Please note that the DGX has 128 CPU cores, 8 GPUs and 1TB of system memory.
#!/bin/bash
#SBATCH --partition=gpuq
#SBATCH --qos=gpu
#SBATCH --job-name=jmultigpu_basics
#SBATCH --output=jmultigpu_basics.%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<N_CPU_CORES>
#SBATCH --gres=gpu:A100.40gb:<N_GPUs>
#SBATCH --mem-per-cpu=<MEM_PER_CORE>
#SBATCH --export=ALL
#SBATCH -time=0-01:00:00
set echo
umask 0022
nvidia-smi
env|grep -i slurm
SINGULARITY_BASE=/containers/dgx/Containers
CONTAINER=${SINGULARITY_BASE}/tensorflow/tensorflow_21.02-tf1-py3.sif
SINGULARITY_RUN="singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd"
SCRIPT=multigpu_basics.py
${SINGULARITY_RUN} ${CONTAINER} python ${SCRIPT} | tee ${SCRIPT}.log
We encourage the use of environmental variables to make the job submission file cleaner and easily reusable.
The syntax for running different containers varies depending on the application. Please check the NGC page for more instructions on running these containers using Singularity.
Storage Locations
Currently, these locations have been designated for storing shared and user-specific containers.
- Containers
- Shared:
/containers/dgx/Containers
- User-specific:
/containers/dgx/UserContainers/$USER
- Examples
- Native and Containerized applications:
/groups/ORC-VAST/app-tests
Sample Runs
We provide some sample calculations to facilitate setting up and running calculations:
- examples on running native and containerized applications is available here:
/groups/ORC-VAST/app-tests
- The examples at https://gitlab.com/NVHPC/ngc-examples are helpful. For many applications, there are no instructions on running the containers using Singularity, but you should be able to build one from the Docker image and run it.