Slurm has many built-in features that allow users to run many different types of parallel code, leveraging the full capabilities of the cluster. Here we shall briefly go over some common parallel environments.

Running Distributed Jobs

MPI stands for Message Passing Interface. The MPI specification defines an API of routines for communication between processes, which may run on different nodes. The specification was first published as MPI-1 and was later expanded into MPI-2. MPI is the de facto standard for message passing in distributed computing, and the standard is maintained by the MPI Forum. All MPI-2 implementations can run code originally written for MPI-1, but the reverse is not true.

This section provides guidelines on how to run MPI jobs on ARGO using Slurm. First we briefly go over the MPI implementations available on ARGO. These are provided as modules and can be loaded as required. The modules are updated frequently; you can check the currently installed versions by typing:

module avail #this lists all available modules

Currently, we have the following MPI implementations:

intel-mpi (included in the intel/ps_xe module)
mpich
mvapich2
openmpi

Note: We highly recommend the intel-mpi module, as it gives the best performance of the four.

Each MPI type comes in multiple flavors, depending on the compiler and architecture it was built for. For example, mvapich2 has the following flavors:

 mvapich2/gcc/64/2.0b
 mvapich2/gcc/64/2.2b
 mvapich2/intel/64/2.2b
 mvapich2/open64/64/2.0b

The first two were compiled with gcc, the third with intel/ps_xe, and the last one with open64. The MPI modules are named using the format below:

mpitype{/interface/compiler/arch/}version   #interface=communication type, arch=architecture

where some of the options within the braces may be omitted. Always check the list of available modules and load the correct MPI module before compiling and running your MPI code.

Compiling Your Program

Depending on which MPI environment you want to use, you will need to load the corresponding compiler module. Each MPI compiler wrapper depends on the underlying C/C++/Fortran compiler it was built with. For example, to compile your MPI program with the open64 version of mpich, you must load the open64 module before compiling. The table below shows which base compiler each MPI module needs:

Compiler-Module Dependency
MPI-module	Base Compiler
mpich/ge/gcc/..., mvapich2/gcc/..., openmpi/gcc/...	Use gcc
mpich/ge/open64/..., mvapich2/open64/..., openmpi/open64/...	Use open64
intel-mpi/64/...	Use intel/ps_xe

It is recommended that you use x64 architecture whenever possible.
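
The dependency table above can be expressed as a small lookup. The following is only a sketch: the function name base_compiler_for is made up for illustration, and the patterns simply mirror the module naming shown in this section.

```shell
# Map an MPI module name to the base compiler module it depends on,
# following the Compiler-Module Dependency table.
# base_compiler_for is a hypothetical helper, not a command on the cluster.
base_compiler_for() {
    case "$1" in
        intel-mpi/*|*/intel/*) echo "intel/ps_xe" ;;
        */open64/*)            echo "open64" ;;
        */gcc/*)               echo "gcc" ;;
        *)                     echo "unknown" ;;
    esac
}

base_compiler_for "mvapich2/open64/64/2.0b"   # prints: open64
```

The printed name could then be passed to "module load" before loading the MPI module itself; without a version suffix, the module system would pick its default version.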

The MPI compiler wrappers associated with the MPI modules above are given in the table below:

MPI Compiler
MPI-module	Language	MPI Compiler Wrapper
mpich, openmpi, mvapich2	C	mpicc
mpich, openmpi, mvapich2	C++	mpiCC
mpich, openmpi, mvapich2	Fortran77	mpif77
mpich, openmpi, mvapich2	Fortran90	mpif90
intel-mpi	C	mpiicc
intel-mpi	C++	mpiicpc
intel-mpi	Fortran77, Fortran90	mpiifort

We will show how to compile an MPI program using two different MPI libraries and give a corresponding job script to submit for each implementation. The purpose of this demonstration is to get you started on running MPI jobs on ARGO. Here is a sample program called MpiHello.c which we will use for compiling and running.
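
The original MpiHello.c is not reproduced in this text. The commands below write a minimal stand-in into the current directory. It is an assumption about what the program looks like, written so that its output matches the sample output shown in the testing step; note that it overwrites any existing file of the same name.

```shell
# Write a minimal stand-in for MpiHello.c (NOT the original file).
cat > MpiHello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from process %d out of %d processors\n", rank, size);

    MPI_Finalize();                         /* shut the runtime down */
    return 0;
}
EOF
```

Each process prints one line, so a run with two processes should produce two lines, in no guaranteed order.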

Setting the environment

To compile a program with an MPI library, start by loading the appropriate module. You can use any version available on the cluster. Here we load the gcc version of the mvapich2 library. The gcc module is loaded by default on the head nodes, so you do not need to load it separately; the open64 version of mvapich2, by contrast, requires loading the open64 module before the MPI module. To load the gcc version:

module load mvapich2/gcc/64/...

To use the open64 version:

module load open64/...
module load mvapich2/open64/...

Loading the environment for mpich and openmpi is similar.

Compiling the code

To compile MpiHello.c with gcc and Open MPI, load the gcc and openmpi modules and compile using mpicc:

module load gcc/7.1.0
module load openmpi/gcc/64/1.10.1
mpicc MpiHello.c -o MpiHello

To see the version of the underlying compiler that the MPI wrapper uses:

mpicc -v

And to see the compilation and linking parameters (including paths used for the include files and libraries):

mpicc -show

Once your application is compiled, you can check if the right libraries are associated with your executable by typing the following command:

ldd MpiHello

To test the program, type the following command and you should see the corresponding output:

mpirun -np 2 ./MpiHello
Hello from process 0 out of 2 processors
Hello from process 1 out of 2 processors

Running short jobs with a small number of processes on the head nodes is fine for testing, but anything requiring more time or more processes should be submitted to SLURM as a real job (see below).

Linking against a dynamic library

If the program requires linking against a dynamic library, specify the library path at compile time by passing the "-L/path/to/library/directory -l<library>" options to the compiler. For example, to link against the fftw3 library, first determine the library path from the fftw3 module: issuing a 'module show' command on the library module shows the LD_LIBRARY_PATH entry for the dynamic library, and this path should be used during compilation. When preparing your SLURM submission script to run the program, be sure to execute the same module load command so that the correct dynamic libraries are available at run time.

[user@ARGO-1 user]$ module show fftw3/openmpi/gcc/64/3.3.4
-------------------------------------------------------------------
/cm/shared/modulefiles/fftw3/openmpi/gcc/64/3.3.4:
module-whatis    Adds FFTW library for 64 bits to your environment
prepend-path     LD_RUN_PATH /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/lib/
prepend-path     LD_LIBRARY_PATH /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/lib/
prepend-path     MANPATH /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/share/man/
setenv           FFTWDIR /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/lib
setenv           FFTWINCLUDE /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/include
setenv           FFTWLIB fftw3
-------------------------------------------------------------------
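
The module output above also defines FFTWDIR, FFTWINCLUDE and FFTWLIB, which can be used to build the compile line instead of typing the paths by hand. The sketch below only assembles and prints the command. The source file name fft_demo.c is hypothetical, and the fallback values are copied from the 'module show' output in case the module has not been loaded.

```shell
# Assumes 'module load fftw3/openmpi/gcc/64/3.3.4' has set these variables;
# the fallbacks repeat the values printed by 'module show' above.
: "${FFTWDIR:=/cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/lib}"
: "${FFTWINCLUDE:=/cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/include}"
: "${FFTWLIB:=fftw3}"

# fft_demo.c is a hypothetical source file name used for illustration.
link_cmd="mpicc fft_demo.c -I$FFTWINCLUDE -L$FFTWDIR -l$FFTWLIB -o fft_demo"

# Print the command so it can be inspected before being run by hand.
echo "$link_cmd"
```

Depending on which FFTW interfaces the program calls, additional "-l" options may be needed; check the contents of the library directory if the link step reports missing symbols.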

Running Your MPI Job with "sbatch"

Below is a sample SLURM submission script which sets up appropriate resources and calls the MpiHello program.

#!/bin/bash

## Specify the name for your job, this is the job name by which Slurm will
## refer to your job.  This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name MPI_job

#SBATCH --qos normal  # normal,cdsqos,phyqos,csqos,statsqos,hhqos,gaqos,esqos
#SBATCH -p all-HiPri  # partition (queue): all-LoPri, all-HiPri,
                      #   bigmem-LoPri, bigmem-HiPri, gpuq, CS_q, CDS_q, ...

## Deal with output and errors.  Separate into 2 files (not the default).
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH -o /scratch/%u/%x-%N-%j.out    # Output file
#SBATCH -e /scratch/%u/%x-%N-%j.err    # Error file
#SBATCH --mail-type=BEGIN,END,FAIL     # NONE,BEGIN,END,FAIL,REQUEUE,ALL,...
#SBATCH --mail-user=<userID>@gmu.edu   # Put your GMU email address here

## -------MPI Specific Options---------- 
#SBATCH --nodes <N>            # Number of computers to run MPI processes on
#SBATCH --ntasks-per-node <n>  # Number of tasks (processes) per node --
                               #   -- must not exceed the node core count!

## Enable one of the following modules corresponding to the MPI compiler used.
## These may not be the newest version. Use "module avail" to find the best.
#module load mpich/ge/gcc/64/3.2
module load openmpi/gcc/64/1.10.1
#module load mvapich2/gcc/64/2.2b
#module load intel-mpi/64   # This will load the default version

## Run your program
mpirun -np <Nxn> ./MpiHello  # Here <Nxn> is the total number of processes used

If you do not specify the "-np" option (the total number of tasks), the correct number is determined automatically from the total number of processes (N x n) requested in the Slurm script.
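
If you prefer to pass "-np" explicitly without hard-coding it, the total can be computed inside the job from Slurm's environment. This is a sketch under the assumption that the job was submitted with both "--nodes" and "--ntasks-per-node", in which case Slurm exports SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE. The helper name total_ranks is made up for illustration.

```shell
# Total MPI ranks = nodes x tasks per node.
# Slurm exports these variables inside a job; the :-1 fallbacks only
# matter when the snippet is tried outside of one.
total_ranks() {
    echo $(( ${SLURM_JOB_NUM_NODES:-1} * ${SLURM_NTASKS_PER_NODE:-1} ))
}

# Print the launch line instead of running it, so this is safe to try anywhere.
echo "would launch: mpirun -np $(total_ranks) ./MpiHello"
```

Inside a real job script the echo would be replaced by the mpirun call itself.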

When Using Intel-MPI

In case you are typing these steps in as you go, let's start from a clean slate to make sure there's no confusion.

module purge

To compile and run your MPI program using Intel's optimized MPI library, you need to change which modules you load and which compiler wrapper you use. First, load the Intel suite and its MPI module:

module load intel/ps_xe
module load intel-mpi/64

Compile your code using mpiicc instead of mpicc:

mpiicc MpiHello.c -o MpiHello

mpiicc accepts the same options as mpicc for showing the version and the compilation and linking parameters. You can then use the sample job script above to run your test program: comment out the line that loads the gcc MPI module, and uncomment the line that loads the intel-mpi module.

...
## Enable one of the following modules corresponding to the MPI compiler used.
## These may not be the newest version. Use "module avail" to find the best.
#module load mpich/ge/gcc/64/3.2
#module load openmpi/gcc/64/1.10.1
#module load mvapich2/gcc/64/2.2b
module load intel-mpi/64   # This will load the default version
...

Running Your MPI Job with "srun"

When using the Intel-MPI compiler and libraries, there is another approach you can use instead of running a SLURM submission script. You can use the srun command to launch MPI programs directly. Before you do this though, you must execute the following command or you will receive errors when using srun:

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

Now for example, the following command will first cause Slurm to allocate 2 nodes with 3 processes each, and then run the program MpiHello.

srun --nodes=2 --ntasks-per-node=3 ./MpiHello

Note that as of January 25, 2019, tests using srun with other MPI library and compiler combinations did not work. Also note that setting the I_MPI_PMI_LIBRARY environment variable (as above) will cause mpirun to stop working. To get mpirun working again, use the following command:

unset I_MPI_PMI_LIBRARY
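
One way to avoid switching the variable on and off is to set it for the srun command only, so the surrounding shell never sees it. The wrapper below is a sketch: run_with_pmi is a made-up name, and it has not been tested on ARGO. Since srun normally passes its environment on to the tasks it launches, this should behave like the export above for that one command.

```shell
# Set I_MPI_PMI_LIBRARY for a single command only, leaving the shell's own
# environment (and therefore later mpirun calls) untouched.
# run_with_pmi is a hypothetical helper name.
run_with_pmi() {
    env I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so "$@"
}

# Intended use on the cluster (not executed here):
#   run_with_pmi srun --nodes=2 --ntasks-per-node=3 ./MpiHello

# Stand-in command that shows the variable is visible to the child process:
run_with_pmi sh -c 'echo "inside the command: $I_MPI_PMI_LIBRARY"'
```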

Running Multi-threaded/Shared Jobs

It is straightforward to run threaded jobs in Slurm. You only need to specify the number of threads per task using the "--cpus-per-task" option, as shown in the job script below.

#!/bin/bash
...
#SBATCH --ntasks=1
#SBATCH --cpus-per-task <nThreads>   # replace <nThreads> with the thread count
...
#run your threaded application here
my_application

You can try this out using the Multi-threaded Python example.
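
For applications that read their thread count from the environment, the value requested with "--cpus-per-task" can be forwarded inside the job script. The sketch below assumes an OpenMP-style application that honors OMP_NUM_THREADS; other threading libraries may use a different variable or a command-line argument.

```shell
# Inside a job, Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# Fall back to 1 thread when the variable is absent (e.g. on a login node).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "running with $OMP_NUM_THREADS thread(s)"

# The application would be started after this point, for example:
#   ./my_application
```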

Hybrid Parallelism

It is likewise straightforward to combine threading with MPI in Slurm. In this case both "--ntasks-per-node" and "--cpus-per-task" must be specified, as shown below:

#!/bin/bash
#SBATCH --job-name Hybrid
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 4
module load intel/ps_xe/18.0.1.163
#Set the OMP_NUM_THREADS environment variable before calling your application
#if the application does not take the number of threads as an input argument
mpirun ./hybrid 4

The above script will run on two nodes, with 2 MPI processes (tasks) per node and 4 threads per process. You can use the MPI Hybrid Example (hybrid.c) to test the above script. When compiling, make sure to include the "-fopenmp" option:

mpiicc -fopenmp  hybrid.c -o hybrid
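
As a quick sanity check on a hybrid request, the number of MPI ranks and the total core count follow directly from the three values. The arithmetic below uses the layout described above (two nodes, 2 tasks per node, 4 threads per task); the variable names are illustrative only.

```shell
nodes=2            # value given to --nodes
tasks_per_node=2   # MPI processes per node
cpus_per_task=4    # threads per MPI process

ranks=$(( nodes * tasks_per_node ))   # total MPI processes
cores=$(( ranks * cpus_per_task ))    # total cores the job occupies

echo "MPI ranks: $ranks, cores requested: $cores"   # MPI ranks: 4, cores requested: 16
```

Tasks per node times threads per task (here 8) is the figure that has to fit within the core count of a single node.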

GPU Parallelism

GPUs are treated as generic resources in Slurm. To run your CUDA or other GPU-based application, you must first allocate the desired number of GPUs to your job using the "--gres=gpu:" parameter and also request the GPU partition, as shown below:

#!/bin/bash
#SBATCH --job-name poke_cuda
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:2
#SBATCH --output=/scratch/%u/sample_cuda.%N.%j.out
## If you need to run on a particular node, you can request it with the
## --nodelist option.  For example, uncomment (i.e. remove one '#') the line
## below and SLURM will run your job on node40 when it is available.
##SBATCH --nodelist=NODE040
module load cuda/9.2
./know_gpus

In the above example, 2 GPUs are allocated on a single node, which runs 1 task. If you are using MPI along with CUDA (for example, each MPI process controls a set of GPUs), specify the desired number of MPI processes with the "--ntasks-per-node" option as before. You also have to load the relevant CUDA module(s). Note that the number of GPUs requested must not exceed the number available on the node.
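
To confirm from inside a job how many GPUs were actually granted, one option is to inspect CUDA_VISIBLE_DEVICES, which Slurm typically sets for jobs that request "--gres=gpu:N". Whether it is set depends on the cluster's GPU configuration, so treat the following as a sketch; visible_gpu_count is a made-up helper name.

```shell
# Count the GPU ids listed in CUDA_VISIBLE_DEVICES (e.g. "0,1" gives 2).
# Prints 0 when the variable is unset or empty.
visible_gpu_count() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        # Put one id per line, count the lines, strip any padding from wc.
        printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

echo "GPUs visible to this shell: $(visible_gpu_count)"
```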

Note: In a multi-GPU programming context, peer-to-peer communication between GPUs on a single node may be necessary. In that case, GPU resources should be selected carefully from the following groups. In the table below, GPUs capable of peer-to-peer communication with each other are grouped in square brackets.


GPU-Node Number	GPU id
40	[0,1,2,3], [4,5,6,7]
50	[0,1], [2,3]
55	[0,1], [2,3]
56	[0,1], [2,3]
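
The grouping in the table can be checked programmatically. The sketch below encodes only what the table states (groups of four consecutive ids on node 40, groups of two on nodes 50, 55 and 56); the function name same_p2p_group is hypothetical, and the function does not validate that the ids exist on the node.

```shell
# Report whether two GPU ids on a node fall in the same peer-to-peer group.
# Group sizes are taken from the table above.
same_p2p_group() {
    node=$1; a=$2; b=$3
    case "$node" in
        40)       size=4 ;;
        50|55|56) size=2 ;;
        *)        echo "unknown node"; return 1 ;;
    esac
    # Ids in the same group share the same integer quotient id / size.
    if [ $(( a / size )) -eq $(( b / size )) ]; then
        echo yes
    else
        echo no
    fi
}

same_p2p_group 40 1 3   # yes: both ids are in [0,1,2,3]
same_p2p_group 50 1 2   # no:  id 1 is in [0,1], id 2 is in [2,3]
```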

The listed program 'know_gpus.cu' can be compiled as shown below on any login node and can be run on gpu-nodes using the provided sample Slurm job submission script.

 module load cuda/10.0
 nvcc know_gpus.cu -o know_gpus

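//know_gpus.cu: list each visible GPU and its memory characteristics

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    int nDevices = 0;
    // Ask the CUDA runtime how many devices this process can see
    cudaError_t err = cudaGetDeviceCount(&nDevices);
    if (err != cudaSuccess)
    {
        std::cerr << "cudaGetDeviceCount failed: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }
    for (int i = 0; i < nDevices; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::cout << "Device Number: " << i << std::endl;
        std::cout << "Device name: " << prop.name << std::endl;
        std::cout << "Memory Clock Rate (kHz): " << prop.memoryClockRate << std::endl;
        std::cout << "Memory Bus Width (bits): " << prop.memoryBusWidth << std::endl;
        // Peak bandwidth: 2 transfers per clock x clock (kHz) x bus width (bytes), scaled to GB/s
        std::cout << "Peak Memory Bandwidth (GB/s): " <<
            2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6 << std::endl;
        std::cout << std::endl;
    }
    return 0;
}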