
General

1. Connect via a GMU Network/VPN

Before using cluster resources, make sure you are either connected to a GMU campus network or using the GMU Cisco Secure Client VPN. This is not only good practice for security reasons, but can also affect how the cluster runs certain jobs.

2. How to make sure your jobs get scheduled quickly.

Don't request more resources than you need to run your job. For example, if you know your job requires a certain amount of memory, specify it in the job script so that the resource manager can schedule your job onto an appropriate node; this may also reduce your queue wait time. Memory requirements can be specified using the --mem=VALUE option in the job script, as in the sketch below.
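
As a minimal sketch, a job script that caps its memory request at 8 GB might look like the following; the job name, time limit, and executable are placeholders to replace with your own:

#!/bin/bash
#SBATCH --job-name=myjob        # placeholder job name
#SBATCH --mem=8G                # request only the memory the job actually needs
#SBATCH --time=01:00:00         # wall-time limit for the job
#SBATCH --output=myjob-%j.out   # SLURM log file (%j expands to the job ID)

./my_program                    # placeholder for your actual executable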

3. Always run jobs in the most appropriate queue.

For example, if you want to run a job that requires a lot of memory (> 10 GB), it is best to run it in one of the big-memory queues. This can be done by requesting one of the "bigmem" partitions in your job submission script (e.g. --partition=bigmem-HiPri), as shown below. Otherwise your job may be killed, other users' jobs may be culled, or you could even cause a kernel panic on a compute node.
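
For instance, adding lines like these to the job script directs a large-memory job to the bigmem-HiPri partition (the 64 GB figure is only an illustration; request what your job actually needs):

#SBATCH --partition=bigmem-HiPri   # one of the big-memory partitions
#SBATCH --mem=64G                  # example request well above 10 GB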

4. Close Interactive/OnDemand Sessions when you're done with them

Sometimes you may finish all the work you need to do within an interactive session or Open OnDemand app session before your time allocation runs out. It is prudent to close the session when this happens in order to free up resources for other users.
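
For example, typing exit at the shell prompt of an interactive job ends the session and releases its resources, and a session can also be cancelled by job ID (12345 below is a placeholder):

exit             # leave an interactive (salloc/srun) shell and release the allocation
scancel 12345    # or cancel a specific job by its SLURM job ID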

Resource Utilization

Using Multiple Nodes/Slots

1. Which MPI to use?

Unless your code requires a specific MPI library, it is highly recommended that you use the latest Intel MPI installed on the Hopper cluster. Also note that MPICH does not use the high-speed InfiniBand interconnect, so using it for a communication-heavy program can create a bottleneck due to slow communications.
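
On a module-based system you can check which MPI builds are installed and load the Intel one; the exact module names below are assumptions, so confirm them with module avail on Hopper:

module avail mpi          # list the MPI modules installed on the cluster
module load intel-mpi     # assumed module name; load the latest Intel MPI version listed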

2. Number of nodes to use

Although the cluster has more than 700 nodes, it is recommended that you restrict your MPI program to at most 128 nodes. Even at 128 nodes, your job may take a while to be scheduled depending on the cluster load.
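
As a sketch, a multi-node MPI job that stays within the suggested limit might request resources like this (the task count per node and the executable name are placeholders):

#SBATCH --nodes=128              # stay at or below the recommended 128-node limit
#SBATCH --ntasks-per-node=64     # one MPI rank per core on a 64-core node; adjust to your hardware

srun ./my_mpi_program            # launch the MPI executable under SLURM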

GPU Based Jobs

Hopper has a limited number of GPU nodes, and they may not all be available at any given time. For the sake of other users, please be prudent about how many GPU nodes you request per job. A minimal GPU request example follows the table below.

Node Type          Count   CPU (Total Cores)    Cores/Node   Memory/Node   GPUs
GPU Nodes (A100)   31      AMD (1984 cores)     64           512 GB        124 x Nvidia A100 (80 GB)
GPU Nodes (DGX)    2       AMD (256 cores)      128          1024 GB       16 x Nvidia A100 (40 GB)
GPU Nodes (H100)   1       Intel (112 cores)    112          2048 GB       4 x Nvidia H100 (80 GB)
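
As an example, a single GPU can be requested with the standard SLURM --gres option; the partition name gpuq and the GPU-type string are assumptions, so verify the cluster's partition and GRES names (e.g. with sinfo) before using them:

#SBATCH --partition=gpuq    # assumed name of a GPU partition; check sinfo for the real one
#SBATCH --gres=gpu:1        # request a single GPU; a type may be added if defined, e.g. gpu:A100.80gb:1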

Troubleshooting

1. Before contacting the system administrators, check the ORC Wiki to see if it already has the solution to your problem.

2. Follow the breadcrumbs. To get information about a prematurely killed job, check the log file created by SLURM. To get a list of failed (F), completed (CD), and cancelled (CA) jobs along with their exit codes, the following command can be used:

sacct -s F,CD,CA --starttime yyyy-mm-dd -u $USER | less

where the --starttime option sets the date from which job records onward are printed.
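
For a single job, the SLURM log file and a job-specific sacct query are usually enough to locate the failure; 12345 below is a placeholder job ID, and the log file name assumes SLURM's default output pattern:

less slurm-12345.out                                    # default SLURM log file for job 12345
sacct -j 12345 --format=JobID,State,ExitCode,MaxRSS     # state, exit code, and peak memory for that job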