LAMMPS on the DGX A100
Users can run LAMMPS on the DGX A100 as a native or containerized application. The two methods are described below.
Running as a container
One can use the highly-optimized LAMMPS containers from NGC (https://ngc.nvidia.com/catalog/containers/nvidia:lammps) for single- or multi-GPU runs as follows.
The containers can currently be found at ${SINGULARITY_BASE}/lammps, where SINGULARITY_BASE=/containers/dgx/Containers (as set in the batch script below).
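If you prefer to pull your own copy of the image from NGC rather than using the pre-downloaded files, a pull along the following lines should work (the registry path and tag shown are assumptions; check the NGC catalog page above for the current tags):
$ singularity pull lammps_10Feb2021.sif docker://nvcr.io/hpc/lammps:10Feb2021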
The sample tests are all located at /opt/sw/app-tests/lammps:
$ tree -rf /opt/sw/app-tests/lammps/dgx/containerized
├── /opt/sw/app-tests/lammps/dgx/containerized/10Feb2021
└── /opt/sw/app-tests/lammps/dgx/containerized/29Oct2020
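To try one of these sample tests, copy it to a directory you own and work from there (the destination path below is just an example):
$ cp -r /opt/sw/app-tests/lammps/dgx/containerized/10Feb2021 ~/lammps-test
$ cd ~/lammps-test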
Batch submission file
A typical batch submission file (named run.slurm) would look like this:
#!/bin/bash
#SBATCH --partition=gpuq # the DGX nodes belong to the 'gpuq' partition
#SBATCH --qos=gpu # need to select 'gpu' QoS
#SBATCH --job-name=jlammps
#SBATCH --output=%x.%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 # up to 128, but make sure ntasks x cpus-per-task <= 128
#SBATCH --cpus-per-task=1 # up to 128, but make sure ntasks x cpus-per-task <= 128
#SBATCH --gres=gpu:A100.40gb:1 # up to 8; only request what you need
#SBATCH --mem-per-cpu=35000M # memory per CORE; total memory is 1 TB (1,000,000 MB)
# SBATCH --mail-user=user@inst.edu
# SBATCH --mail-type=ALL
#SBATCH --export=ALL
#SBATCH --time=0-04:00:00 # set to 4 hrs; please choose carefully
set -x # echo each command as it executes
#-----------------------------------------------------
# Example run from NVIDIA NGC
# https://ngc.nvidia.com/catalog/containers/hpc:lammps
# Please feel free to download and run it as follows
#-----------------------------------------------------
#-----------------------------------------------------
# Determine GPU and CPU resources to use
#-----------------------------------------------------
# parse out number of GPUs and CPU cores reserved for your job
env | grep -i slurm
GPU_COUNT=`echo $SLURM_JOB_GPUS | tr "," " " | wc -w`
N_CORES=${SLURM_NTASKS}
# Set OMP_NUM_THREADS
# please note that ntasks x cpus-per-task <= 128
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
    OMP_THREADS=$SLURM_CPUS_PER_TASK
else
    OMP_THREADS=1
fi
export OMP_NUM_THREADS=$OMP_THREADS
#-----------------------------------------------------
# Set up MPI launching
#-----------------------------------------------------
# If parallel, launch with MPI
if [[ ${GPU_COUNT} -gt 1 ]] || [[ ${SLURM_NTASKS} -gt 1 ]]; then
    MPI_LAUNCH="prun"
else
    MPI_LAUNCH=""
fi
#-----------------------------------------------------
# Set up container
#-----------------------------------------------------
SINGULARITY_BASE=/containers/dgx/Containers
CONTAINER=${SINGULARITY_BASE}/lammps/lammps_10Feb2021.sif
# Singularity will mount the host PWD to /host_pwd in the container
SINGULARITY_RUN="singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd"
#-----------------------------------------------------
# Run container
#-----------------------------------------------------
# Define input file and run
LMP_INPUT=in.lj.txt
LMP_OUTPUT=log-${GPU_COUNT}gpus-${SLURM_NTASKS}cores-${OMP_NUM_THREADS}thr_percore.lammps
echo "Running Lennard Jones 16x8x16 example on ${GPU_COUNT} GPUS..."
${MPI_LAUNCH} ${SINGULARITY_RUN} ${CONTAINER} lmp \
-k on g ${GPU_COUNT} \
-sf kk \
-pk kokkos cuda/aware on neigh full comm device binsize 2.8 \
-var x 16 \
-var y 8 \
-var z 16 \
-in ${LMP_INPUT} \
-log ${LMP_OUTPUT}
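Once run.slurm and the input file below are in place, submit the job and, after it finishes, pull the performance line out of the LAMMPS log, for example:
$ sbatch run.slurm
$ grep -i performance log-*gpus-*.lammps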
Input files
You will need the following input file (named in.lj.txt) to run this test; you can copy it from the sample-test directories listed above or paste the contents below:
# 3d Lennard-Jones melt
variable x index 1
variable y index 1
variable z index 1
variable xx equal 20*$x
variable yy equal 20*$y
variable zz equal 20*$z
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 0 every 20 check no
fix 1 all nve
run 100
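As a point of reference for the benchmark sizes below: the region is defined in lattice units, so the box spans 20*x by 20*y by 20*z fcc unit cells, and each fcc cell holds 4 atoms. The 16x8x16 run in the batch script therefore contains 4 x 320 x 160 x 320 = 65,536,000 atoms. A quick shell check of that arithmetic:
$ echo $((4 * (20*16) * (20*8) * (20*16)))
65536000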
Benchmarks
For this particular example, the benchmarks indicate that the code scales well up to 4 GPUs. Also, the performance of the latest container (10Feb2021) is marginally better than that of the previous one (29Oct2020).
29Oct2020
$ grep -i performance *lammps
log-1gpus-1cores.lammps:Performance: 4304.617 tau/day, 9.964 timesteps/s
log-2gpus-2cores.lammps:Performance: 6647.065 tau/day, 15.387 timesteps/s
log-4gpus-4cores.lammps:Performance: 10345.111 tau/day, 23.947 timesteps/s
log-8gpus-8cores.lammps:Performance: 10983.890 tau/day, 25.426 timesteps/s
10Feb2021
$ grep -i performance *lammps
log-1gpus-1cores.lammps:Performance: 4484.309 tau/day, 10.380 timesteps/s
log-2gpus-2cores.lammps:Performance: 6857.607 tau/day, 15.874 timesteps/s
log-4gpus-4cores.lammps:Performance: 10636.647 tau/day, 24.622 timesteps/s
log-8gpus-8cores.lammps:Performance: 12251.451 tau/day, 28.360 timesteps/s
Comparison with CPU-only Runs
It is always informative to see how much GPU acceleration speeds up a calculation. For that purpose, we compared the benchmarks above with runs of the same example on CPU-only nodes.
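For reference, the CPU-only jobs run the same in.lj.txt input with a native lmp binary instead of the container. A minimal sketch of such a launch inside a batch job, assuming the native build is provided through the site's module system (the module name and log file name are placeholders), looks like:
$ module load lammps
$ prun lmp -var x 16 -var y 8 -var z 16 -in in.lj.txt -log log-cpu.lammps
The exact modules and launch lines for the GNU9+OpenMPI4 and Intel20+IMPI20 builds are in the native examples under /opt/sw/app-tests/lammps.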
Intel20-IMPI20 version
log-1nodes-48cores-1thr_per_core.lammps:Performance: 499.450 tau/day, 1.156 timesteps/s
log-2nodes-192cores-1thr_per_core.lammps:Performance: 1908.361 tau/day, 4.418 timesteps/s
log-4nodes-432cores-1thr_per_core.lammps:Performance: 4327.999 tau/day, 10.019 timesteps/s
GNU9-OpenMPI4
$ grep -i performance log-*
log-1node-1thr_per_core.lammps:Performance: 479.698 tau/day, 1.110 timesteps/s
log-4nodes-1thr_per_core.lammps:Performance: 1824.944 tau/day, 4.224 timesteps/s
log-10nodes-480cores-1thr_per_core.lammps:Performance: 4622.468 tau/day, 10.700 timesteps/s
Conclusions
- The GPU-optimized LAMMPS containers run very well on our DGX A100
- The GPU code scales well with the number of GPUs used, although the scaling depends heavily on the size of the simulation
- The two GPU-accelerated containers we tested perform about the same
- One NVIDIA A100 GPU performs about as well as 9-10 CPU-only nodes combined (dual-socket Intel Cascade Lake servers, 48 cores per node)
- Native applications built with GNU9+OpenMPI4 and Intel20+IMPI20 perform equally well. However, the jobs are launched slightly differently, so users are encouraged to review the examples at /opt/sw/app-tests/lammps:
$ tree -d /opt/sw/app-tests/lammps
/opt/sw/app-tests/lammps
├── dgx
│   └── containerized
│       ├── 10Feb2021
│       └── 29Oct2020
└── hopper
    ├── containerized
    │   └── 29Oct2020
    │       ├── multinode
    │       │   └── other
    │       ├── single-node-hybrid
    │       ├── single-node-mpi
    │       └── single-node-omp
    └── native
        └── 21Jul2020
            ├── gnu9-openmpi4
            │   ├── large-example
            │   ├── multi-node
            │   └── single-node-mpi
            └── intel20-impi20
                ├── large-example
                ├── multi-node
                └── single-node-mpi