Running R on Hopper
Running Serial R Jobs
To use the optimized version of R compiled with OpenBLAS on the Hopper cluster, you need to load the R module. Several versions of R are installed; the examples below use r/3.6.3.
To check the R modules available on the cluster, use the command module spider r, which produces output like:
--------------------------------------------------------------------------------------------------------------------
r:
--------------------------------------------------------------------------------------------------------------------
Versions:
r/3.6.3
r/4.0.3-hb
r/4.0.3-hx
r/4.0.3-pn
r/4.0.3-ta
r/4.1.2-dx
r/4.1.2-zx
Other possible modules matches:
.charliecloud .compiler-rt32 .compiler32 advisor amber aria2 armadillo arpack-ng biocontainers/bowtie2/v2.4.1_cv1 ...
--------------------------------------------------------------------------------------------------------------------
To find other possible module matches execute:
$ module -r spider '.*r.*'
--------------------------------------------------------------------------------------------------------------------
For detailed information about a specific "r" package (including how to load the modules) use the module's full name.
Note that names that have a trailing (E) are extensions provided by other modules.
For example:
$ module spider r/4.1.2-zx
--------------------------------------------------------------------------------------------------------------------
Load the GNU Compiler Collection (GCC) module by running the command:
$ module load gnu10
Load the R module with the specific version you want (e.g. r/3.6.3) by running the command:
$ module load r/3.6.3
After executing these two commands, you will have successfully loaded the R module and can proceed with running your R scripts or submitting R jobs using Slurm on the Hopper cluster.
You can submit batch R jobs with a Slurm submission script. At the end of your Slurm script, run your R script with the command Rscript [options] <script_name>.R. To see which options can be passed to Rscript, type R --help after you have loaded the R module. You need to load the R module explicitly inside your Slurm job submission file.
NOTE: R uses the ".RData" file in your current directory to load/save the workspace every time it starts/finishes a session. This can significantly slow down execution of your job depending on the size of the ".RData" file. It is advisable to use the options "--no-restore --quiet --no-save" when starting a job. The option "--quiet" suppresses the startup messages, "--no-restore" directs R not to restore anything from ".RData", and "--no-save" ensures that the workspace is not saved to ".RData" at the end of the R session.
Given below is a sample Slurm job submission script for an R job:
#!/bin/sh
## Specify the name for your job, this is the job name by which Slurm will
## refer to your job. This can be different from the name of your executable
## or the name of your script file
#SBATCH --job-name My_R_Job
#SBATCH --partition=normal
## Deal with output and errors. Separate into 2 files (not the default).
## May help to put your result files in a directory: e.g. /scratch/%u/logs/...
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayID, %a=arrayTaskID
#SBATCH --output=/scratch/%u/%x-%N-%j.out # Output file
#SBATCH --error=/scratch/%u/%x-%N-%j.err # Error file
#SBATCH --mail-type=BEGIN,END,FAIL # ALL,NONE,BEGIN,END,FAIL,REQUEUE,..
#SBATCH --mail-user=<GMUnetID>@gmu.edu # Put your GMU email address here
## Specifying an upper limit on needed resources will improve your scheduling
## priority, but if you exceed these values, your job will be terminated.
## Check your "Job Ended" emails for actual resource usage info.
#SBATCH --mem=1M # Total memory needed for your job (suffixes: K,M,G,T)
#SBATCH --time=0-00:02 # Total time needed for your job: Days-Hours:Minutes
## Load the relevant modules needed for the job
module load r/3.6.3
## Start the job
Rscript --no-restore --quiet --no-save RHello.R
Be sure to replace <GMUnetID> with your own GMU NetID so that the notification emails go to your GMU email address.
The module load r/3.6.3 command loads the R module with version 3.6.3. Adjust the memory (--mem) and time (--time) values according to your specific job requirements; note that the values above are set extremely low, and you will want to increase them when you run your own programs.
The Rscript command runs the R script RHello.R with the options --no-restore, --quiet, and --no-save. You can modify the RHello.R script to perform your desired computations.
Here's an example content for the RHello.R script that you can use:
#!/usr/bin/env Rscript
## Just output the text "Hello, world!"
cat("Hello, world!\n")
Save the Slurm submission script as my_r_job.sh, and the R script as RHello.R in the same directory. Then, submit the job to the cluster using the following command:
$ sbatch my_r_job.sh
Running Multi-threaded R Jobs
On the Hopper cluster, in addition to requesting extra CPUs for multi-threaded R jobs, you can also utilize GPUs for parallel computation. To use GPUs, change the partition to gpuq in your Slurm submission script:
#SBATCH --partition=gpuq # Request the GPU partition
Then request extra CPUs for your threads, replacing <C> with the number of CPU cores you want, and set the OPENBLAS_NUM_THREADS environment variable to a value less than or equal to the number of CPU cores requested:
#SBATCH --cpus-per-task <C> # Request extra CPUs for threads
export OPENBLAS_NUM_THREADS=<C>
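Threaded BLAS speedups show up mainly in linear-algebra-heavy code. The script below is a minimal sketch (the file name blas_test.R and the matrix size are only illustrative, not part of the cluster documentation) that exercises OpenBLAS with a large matrix multiplication, which is spread across the threads set by OPENBLAS_NUM_THREADS:
#!/usr/bin/env Rscript
## blas_test.R -- illustrative example only: a large matrix multiplication
## that OpenBLAS runs across OPENBLAS_NUM_THREADS threads.
n <- 4000
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)
elapsed <- system.time(p <- a %*% b)["elapsed"]
cat("Multiplying two", n, "x", n, "matrices took", elapsed, "seconds\n")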
Be aware that some R packages take advantage of the R parallel library to make use of additional CPU cores, e.g. parallelDist, Rdsm and RcppThread.
A common R programming practice with the parallel library is to set the number of cores for the parallel cluster using the parallel::detectCores() routine, e.g.:
mc.cores = parallel::detectCores()
This can cause problems when running with Slurm, as Slurm will restrict your job to the cores requested, but detectCores() will return the total number of cores on the node. Unless you are requesting a full node, this will overload the cores in your job and may severely impact performance. The best practice for using the parallel library is to set mc.cores to the number of cores assigned by Slurm. This can be done by adding the code below to your R script:
# Set the number of parallel workers according to what Slurm assigned
nworkers <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))
message("I have ", nworkers, " cores available")
Running Parallel R Jobs Using Rmpi
Rmpi works with OpenMPI as a wrapper to spawn slave processes from your R script file, so OpenMPI needs to be loaded before using Rmpi. Detailed information about Rmpi can be found in the Rmpi documentation.
To load the OpenMPI module, use the following commands:
$ module load gnu10
$ module load openmpi
Below is a sample job submission script that shows how to submit Rmpi jobs:
#!/bin/sh
## Specify the name for your job, this is the job name by which Slurm will
## refer to your job. This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name=RmpiHello
#SBATCH --partition=normal
## Deal with output and errors. Separate into 2 files (not the default).
## May help to put your result files in a directory: e.g. /scratch/%u/logs/...
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayID, %a=arrayTaskID
#SBATCH --output=/scratch/%u/%x-%N-%j.out # Output file
#SBATCH --error=/scratch/%u/%x-%N-%j.err # Error file
#SBATCH --mail-type=BEGIN,END,FAIL # ALL,NONE,BEGIN,END,FAIL,REQUEUE,..
#SBATCH --mail-user=<GMUnetID>@gmu.edu # Put your GMU email address here
## You can improve your scheduling priority by specifying upper limits on
## needed resources, but jobs that exceed these values will be terminated.
## Check your "Job Ended" emails for actual resource usage info as a guide.
#SBATCH --mem=1M # Total memory needed per task (units: K,M,G,T)
#SBATCH --time=0-00:02 # Total time needed for job: Days-Hours:Minutes
## ----- Parallel Processes -----
## Some libraries (MPI) implement parallelism using processes that communicate.
## This allows tasks to run on any set of cores in the cluster. Programs can
## use this approach in combination with threads (if designed to).
#SBATCH --ntasks <T> # Number of processes you plan to launch
# Optional parameters. Uncomment (remove one leading '#') to use.
##SBATCH --nodes <N> # If you want some control over how tasks are
# distributed on nodes. <T> >= <N>
##SBATCH --ntasks-per-node <Z> # If you want more control over how tasks are
# distributed on nodes. <T> = <N> * <Z>
## Load the R module which also loads the OpenBLAS module
module load r/3.6.3
## To use Rmpi, you need to load the openmpi module
module load openmpi
# R still wants to write files to our current directory, despite using
# "--no-restore --quiet --no-save" below, so move someplace writable.
ORIG_DIR=$PWD
cd $SCRATCH
echo "Calling mpirun now!!!"
## Use "--no-restore --quiet --no-save" to be as quiet as possible.
## Note: You do not spawn the parallel processes directly through mpirun, but
## instead from inside your R script, hence parameter -np is set to 1.
mpirun -np 1 Rscript --no-restore --quiet --no-save $ORIG_DIR/RmpiHello.R
Below is a parallel hello world Rmpi program that can be used to test the above script. Be sure to replace <T-1> in the call to mpi.spawn.Rslaves() with one less than the number of tasks requested in your Slurm script.
## RmpiHello.R
## Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize"))
{
library("Rmpi")
}
## Specify how many slave processes will be spawned.
## This must be 1 less than the number of tasks requested (master uses 1).
mpi.spawn.Rslaves(nslaves=<T-1>) # Change this to match your Slurm script
## In case R exits unexpectedly, automatically clean up
## resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function()
{
if (is.loaded("mpi_initialize"))
{
if (mpi.comm.size(1) > 0)
{
print("Please use mpi.close.Rslaves() to close slaves.")
mpi.close.Rslaves()
}
print("Please use mpi.quit() to quit R")
.Call("mpi_finalize")
}
}
## Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("Hello, World from process ",mpi.comm.rank(),"of",mpi.comm.size()))
## Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()
Note: the number of slaves requested must be one less than the number of tasks requested, as shown in the scripts above.
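If you prefer not to hard-code the slave count, one possible approach (a sketch, not part of the original example) is to derive it from the SLURM_NTASKS environment variable inside the R script:
## Derive the number of slaves from the Slurm task count; one task is
## reserved for the master process. The fallback of 2 is only for testing.
ntasks <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "2"))
mpi.spawn.Rslaves(nslaves = ntasks - 1)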
Installing Additional R-Packages
If you want to add additional packages to the R module (user packages are installed such that they are only available to you when you load R), run the following commands from inside the R terminal. By default, R installs these packages to the user package directory ~/R/x86_64-unknown-linux-gnu-library/VERSION/, where VERSION is the R version you are using. The first time you install a package, you will be asked whether to use this default location; select yes to proceed.
$ module load r/<version>
$ R
> install.packages("package_name")
You may see a prompt asking which server (CRAN mirror) to download the package from. Select a server of your choice and proceed.
To load the installed package inside your R script, use:
library("package_name")
To install an R package to a specific location, you need to specify the location in the install.packages command:
> install.packages("package_name", lib="/custom/path/to/R-packages/")
If you don't specify a lib= parameter, then R will ask you if you want to use a default path in your home directory. This is probably the better choice, so that you don't have to remember where you put your packages a month or a year from now.
> install.packages("package_name")
If you receive compile errors while installing a package, then you may need to load a newer version of the gcc compiler. First exit R, then type:
$ module avail gcc
---------------------------------------------------- Global Aliases ----------------------------------------------------
compiler/gnu/10.3.0 -> gnu10/10.3.0-ya math/openblas/0.3.7 -> openblas/0.3.7
compiler/gnu/9.3.0 -> gnu9/9.3.0 mpi/intel-mpi/2020.2 -> impi/2020.2
compiler/intel/2020.2 -> intel/2020.2 mpi/intel-mpi/2021.5.1 -> mpi/2021.5.1
compiler/intel/2022.0.2 -> compiler/2022.0.2 mpi/openmpi/4.0.4 -> openmpi4/4.0.4
math/intel-mkl/2020.2 -> mkl/2020.2 mpi/openmpi/4.1.2 -> openmpi/4.1.2-4a
math/intel-mkl/2022.0.2 -> mkl/2022.0.2 openmpi4/4.1.2 -> openmpi/4.1.2-4a
math/openblas/0.3.20 -> openblas/0.3.20-iq
------------------------------------------------------ GNU-9.3.0 -------------------------------------------------------
gcc/10.3.0-xr-xr gcc/10.3.0-xr gcc/10.3.0-ya (D)
----------------------------------------------------- Independent ------------------------------------------------------
vasp/6.3.2-gcc
Where:
D: Default Module
$ module load gnu10
The latest compiler is not necessarily the best, so you may need to go through this process a couple of times, selecting different compilers until you get it to work. Each time you will need to restart R, and try installing the package again. If you still cannot get it to work, send a request for help.
To load a package installed in a custom location, pass its path with lib.loc; a package installed in the default user library can be loaded with a plain library() call:
> library("package_name", lib.loc="/custom/path/to/R-packages/")
> library("package_name")