Running R on Hopper
Running Serial R Jobs
To use the optimized version of R compiled with OpenBLAS on the Hopper cluster, you need to load the R module. Several versions of R are installed; the examples below use r/3.6.3.
To check the R modules available on the cluster, use the command module spider r, which produces output like:
--------------------------------------------------------------------------------------------------------------------
r:
--------------------------------------------------------------------------------------------------------------------
Versions:
r/3.6.3
r/4.0.3-hb
r/4.0.3-hx
r/4.0.3-pn
r/4.0.3-ta
r/4.1.2-dx
r/4.1.2-zx
Other possible modules matches:
.charliecloud .compiler-rt32 .compiler32 advisor amber aria2 armadillo arpack-ng biocontainers/bowtie2/v2.4.1_cv1 ...
--------------------------------------------------------------------------------------------------------------------
To find other possible module matches execute:
$ module -r spider '.*r.*'
--------------------------------------------------------------------------------------------------------------------
For detailed information about a specific "r" package (including how to load the modules) use the module's full name.
Note that names that have a trailing (E) are extensions provided by other modules.
For example:
$ module spider r/4.1.2-zx
--------------------------------------------------------------------------------------------------------------------
Load the GNU Compiler Collection (GCC) module by running the command:
$ module load gnu10
Load the R module with the specific version you want (e.g. r/3.6.3) by running the command:
$ module load r/3.6.3
After executing these two commands, you will have successfully loaded the R module and can proceed with running your R scripts or submitting R jobs using Slurm on the Hopper cluster.
You can submit batch R jobs with a Slurm submission script. At the end of your Slurm script, run your R script with the command Rscript [options] <script_name>.R. To see which options can be passed to Rscript, type R --help after you have loaded the R module. You need to load the R module explicitly inside your Slurm job submission file.
NOTE: R uses the ".RData" file in your current directory to load/save the workspace every time it starts/finishes a session. This can significantly slow down execution of your job depending on the size of the ".RData" file. It is advisable to use the options "--no-restore --quiet --no-save" when starting a job. The option "--quiet" suppresses the startup messages, "--no-restore" directs R not to restore anything from ".RData", and "--no-save" ensures that the workspace is not saved to ".RData" at the end of the R session.
Given below is a sample Slurm job submission script for an R job:
#!/bin/sh
## Specify the name for your job, this is the job name by which Slurm will
## refer to your job. This can be different from the name of your executable
## or the name of your script file
#SBATCH --job-name My_R_Job
#SBATCH --partition=normal
## Deal with output and errors. Separate into 2 files (not the default).
## May help to put your result files in a directory: e.g. /scratch/%u/logs/...
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayID, %a=arrayTaskID
#SBATCH --output=/scratch/%u/%x-%N-%j.out # Output file
#SBATCH --error=/scratch/%u/%x-%N-%j.err # Error file
#SBATCH --mail-type=BEGIN,END,FAIL # ALL,NONE,BEGIN,END,FAIL,REQUEUE,..
#SBATCH --mail-user=<GMUnetID>@gmu.edu # Put your GMU email address here
## Specifying an upper limit on needed resources will improve your scheduling
## priority, but if you exceed these values, your job will be terminated.
## Check your "Job Ended" emails for actual resource usage info.
#SBATCH --mem=1M # Total memory needed for your job (suffixes: K,M,G,T)
#SBATCH --time=0-00:02 # Total time needed for your job: Days-Hours:Minutes
## Load the relevant modules needed for the job
module load r/3.6.3
## Start the job
Rscript --no-restore --quiet --no-save RHello.R
Be sure to replace <GMUnetID> with your own GMU NetID so that the notification emails go to your GMU email address.
The module load r/3.6.3 command loads the R module with version 3.6.3. Adjust the memory (--mem) and time (--time) values according to your specific job requirements; note that the values above are set extremely low, and you will want to increase them when you run your own programs.
The Rscript command runs the R script RHello.R with the options --no-restore, --quiet, and --no-save. You can modify the RHello.R script to perform your desired computations.
Here's an example content for the RHello.R script that you can use:
#!/usr/bin/env Rscript
## Just output the text "Hello, world!"
cat("Hello, world!\n")
Save the Slurm submission script as my_r_job.sh, and the R script as RHello.R in the same directory. Then, submit the job to the cluster using the following command:
$ sbatch my_r_job.sh
Running Multi-threaded R Jobs
On the Hopper cluster, in addition to requesting extra CPUs for multi-threaded R jobs, you can also utilize GPUs for parallel computation. To use GPUs, change the partition to gpuq in your Slurm submission script:
#SBATCH --partition=gpuq # Request the GPU partition
Then request extra CPUs for your threads, replacing <C> with the number of CPU cores you want, and set the OPENBLAS_NUM_THREADS environment variable to a value less than or equal to the number of CPU cores requested:
#SBATCH --cpus-per-task <C> # Request extra CPUs for threads
export OPENBLAS_NUM_THREADS=<C>
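Threaded BLAS speedups show up mainly in linear-algebra-heavy code. The script below is a minimal sketch (the file name blas_test.R and the matrix size are only illustrative, not part of the cluster documentation) that exercises OpenBLAS with a large matrix multiplication, which is spread across the threads set by OPENBLAS_NUM_THREADS:
#!/usr/bin/env Rscript
## blas_test.R -- illustrative example only: a large matrix multiplication
## that OpenBLAS runs across OPENBLAS_NUM_THREADS threads.
n <- 4000
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)
elapsed <- system.time(p <- a %*% b)["elapsed"]
cat("Multiplying two", n, "x", n, "matrices took", elapsed, "seconds\n")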
Be aware that some R packages take advantage of the R parallel library to make use of additional CPU cores, e.g. parallelDist, Rdsm and RcppThread.
A common R programming practice with the parallel library is to set the number of cores for the parallel cluster using the parallel::detectCores() routine, e.g.:
mc.cores = parallel::detectCores()
This can cause problems when running with Slurm, as Slurm will restrict your job to the cores requested, but detectCores() will return the total number of cores on the node. Unless you are requesting a full node, this will overload the cores in your job and may severely impact performance. The best practice for using the parallel library is to set mc.cores to the number of cores assigned by Slurm. This can be done by adding the code below to your R script:
# Set the number of parallel workers according to what Slurm assigned
nworkers <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))
message("I have ", nworkers, " cores available")
Running Parallel R Jobs Using Rmpi
Rmpi works with OpenMPI as a wrapper to spawn slave processes from your R script file, so OpenMPI needs to be loaded before using Rmpi. Detailed information about Rmpi can be found in the Rmpi documentation.
To load the OpenMPI module, use the following commands:
$ module load gnu10
$ module load openmpi
Below is a sample job submission script that shows how to submit Rmpi jobs:
#!/bin/sh
## Specify the name for your job, this is the job name by which Slurm will
## refer to your job. This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name=RmpiHello
#SBATCH --partition=normal
## Deal with output and errors. Separate into 2 files (not the default).
## May help to put your result files in a directory: e.g. /scratch/%u/logs/...
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayID, %a=arrayTaskID
#SBATCH --output=/scratch/%u/%x-%N-%j.out # Output file
#SBATCH --error=/scratch/%u/%x-%N-%j.err # Error file
#SBATCH --mail-type=BEGIN,END,FAIL # ALL,NONE,BEGIN,END,FAIL,REQUEUE,..
#SBATCH --mail-user=<GMUnetID>@gmu.edu # Put your GMU email address here
## You can improve your scheduling priority by specifying upper limits on
## needed resources, but jobs that exceed these values will be terminated.
## Check your "Job Ended" emails for actual resource usage info as a guide.
#SBATCH --mem=1M # Total memory needed per task (units: K,M,G,T)
#SBATCH --time=0-00:02 # Total time needed for job: Days-Hours:Minutes
## ----- Parallel Processes -----
## Some libraries (MPI) implement parallelism using processes that communicate.
## This allows tasks to run on any set of cores in the cluster. Programs can
## use this approach in combination with threads (if designed to).
#SBATCH --ntasks <T> # Number of processes you plan to launch
# Optional parameters. Uncomment (remove one leading '#') to use.
##SBATCH --nodes <N> # If you want some control over how tasks are
# distributed on nodes. <T> >= <N>
##SBATCH --ntasks-per-node <Z> # If you want more control over how tasks are
# distributed on nodes. <T> = <N> * <Z>
## Load the R module which also loads the OpenBLAS module
module load r/3.6.3
## To use Rmpi, you need to load the openmpi module
module load openmpi
# R still wants to write files to our current directory, despite using
# "--no-restore --quiet --no-save" below, so move someplace writable.
ORIG_DIR=$PWD
cd $SCRATCH
echo "Calling mpirun now!!!"
## Use "--no-restore --quiet --no-save" to be as quiet as possible.
## Note: You do not spawn the parallel processes directly through mpirun, but
## instead from inside your R script, hence parameter -np is set to 1.
mpirun -np 1 Rscript --no-restore --quiet --no-save $ORIG_DIR/RmpiHello.R
Below is a parallel hello world Rmpi program that can be used to test the above script. Be sure to replace <T-1> in the call to mpi.spawn.Rslaves() with one less than the number of tasks requested in your Slurm script.
## RmpiHello.R
## Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize"))
{
library("Rmpi")
}
## Specify how many slave processes will be spawned.
## This must be 1 less than the number of tasks requested (master uses 1).
mpi.spawn.Rslaves(nslaves=<T-1>) # Change this to match your Slurm script
## In case R exits unexpectedly, automatically clean up
## resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function()
{
if (is.loaded("mpi_initialize"))
{
if (mpi.comm.size(1) > 0)
{
print("Please use mpi.close.Rslaves() to close slaves.")
mpi.close.Rslaves()
}
print("Please use mpi.quit() to quit R")
.Call("mpi_finalize")
}
}
## Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("Hello, World from process ",mpi.comm.rank(),"of",mpi.comm.size()))
## Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()
Note: the number of slaves requested must be one less than the number of tasks requested, as shown in the scripts above.
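If you prefer not to hard-code the slave count, one possible approach (a sketch, not part of the original example) is to derive it from the SLURM_NTASKS environment variable inside the R script:
## Derive the number of slaves from the Slurm task count; one task is
## reserved for the master process. The fallback of 2 is only for testing.
ntasks <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "2"))
mpi.spawn.Rslaves(nslaves = ntasks - 1)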
Installing Additional R-Packages
If you want to add additional packages to the R module (user packages are installed such that they are only available to you when you load R), run the following commands from inside the R terminal. By default, R installs these packages to the user package directory ~/R/x86_64-unknown-linux-gnu-library/VERSION/, where VERSION is the R version you are using. The first time you install a package, you will be asked whether to use this default location; select yes to proceed.
$ module load r/<version>
$ R
> install.packages("package_name")
You may see a prompt asking which server (CRAN mirror) to download the package from. Select a server of your choice and proceed.
To load the installed package inside your R script, use:
library("package_name")
To install an R package to a specific location, you need to specify the location in the install.packages command:
> install.packages("package_name", lib="/custom/path/to/R-packages/")
If you don't specify a lib= parameter, then R will ask you if you want to use a default path in your home directory. This is probably the better choice, so that you don't have to remember where you put your packages a month or a year from now.
> install.packages("package_name")
If you receive compile errors while installing a package, then you may need to load a newer version of the gcc compiler. First exit R, then type:
$ module avail gcc
---------------------------------------------------- Global Aliases ----------------------------------------------------
compiler/gnu/10.3.0 -> gnu10/10.3.0-ya math/openblas/0.3.7 -> openblas/0.3.7
compiler/gnu/9.3.0 -> gnu9/9.3.0 mpi/intel-mpi/2020.2 -> impi/2020.2
compiler/intel/2020.2 -> intel/2020.2 mpi/intel-mpi/2021.5.1 -> mpi/2021.5.1
compiler/intel/2022.0.2 -> compiler/2022.0.2 mpi/openmpi/4.0.4 -> openmpi4/4.0.4
math/intel-mkl/2020.2 -> mkl/2020.2 mpi/openmpi/4.1.2 -> openmpi/4.1.2-4a
math/intel-mkl/2022.0.2 -> mkl/2022.0.2 openmpi4/4.1.2 -> openmpi/4.1.2-4a
math/openblas/0.3.20 -> openblas/0.3.20-iq
------------------------------------------------------ GNU-9.3.0 -------------------------------------------------------
gcc/10.3.0-xr-xr gcc/10.3.0-xr gcc/10.3.0-ya (D)
----------------------------------------------------- Independent ------------------------------------------------------
vasp/6.3.2-gcc
Where:
D: Default Module
$ module load gnu10
The latest compiler is not necessarily the best, so you may need to go through this process a couple of times, selecting different compilers until you get it to work. Each time you will need to restart R, and try installing the package again. If you still cannot get it to work, send a request for help.
To load a package installed in a custom location, pass its path with lib.loc; a package installed in the default user library can be loaded with a plain library() call:
> library("package_name", lib.loc="/custom/path/to/R-packages/")
> library("package_name")