General
1. Connect via a GMU Network/VPN
Before using cluster resources, make sure you are either connected to a GMU campus network or using the GMU Cisco Secure Client VPN. This is not only good security practice, but can also affect how the cluster runs certain jobs.
2. How to make sure my jobs get scheduled quickly.
Don't request more resources than you need to run your job. For example, if you know how much memory your job requires, specify it in the job script so that the resource manager can schedule your job onto an appropriate node. This may also reduce queue wait time. Memory requirements can be specified using the --mem=VALUE option in the job script.
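As a sketch, a job script with an explicit memory request might look like the following (the job name, time limit, partition, and executable are placeholders, not cluster-specific values):

```shell
#!/bin/bash
#SBATCH --job-name=mem-demo    # placeholder job name
#SBATCH --ntasks=1
#SBATCH --mem=4G               # request only the memory the job actually needs
#SBATCH --time=00:30:00        # placeholder time limit

./my_program                   # placeholder for your executable
```

Submit it with `sbatch`; with an explicit --mem value, SLURM can pack the job onto any node with at least that much memory free.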
3. Always run jobs in the most appropriate queue.
For example, if you want to run a job that requires a lot of memory (> 10 GB), it is best to use one of the big-memory queues. This can be done by requesting one of the "bigmem" partitions in your job submission script (e.g. --partition=bigmem-HiPri). Otherwise your job may get killed, other users' jobs may be culled, or you could even cause a kernel panic on a compute node.
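For instance, the relevant directives in a submission script might look like this (the memory value is illustrative; the bigmem-HiPri partition name comes from the text above):

```shell
#SBATCH --partition=bigmem-HiPri   # big-memory queue
#SBATCH --mem=50G                  # illustrative large-memory request
```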
4. Close Interactive/OnDemand Sessions when you're done with them
Sometimes you may finish all the work you need to do within an interactive session or Open OnDemand app session before your time allocation runs out. When this happens, it is prudent to close the session in order to free up resources for other users.
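A quick way to wrap up, assuming a standard SLURM setup (the job ID below is hypothetical):

```shell
exit                # leave the interactive shell, ending the session
squeue -u $USER     # check for any sessions still holding resources
scancel 123456      # cancel a leftover session by its (hypothetical) job ID
```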
Resource Utilization
Using Multiple Nodes/Slots
1. Which MPI to use?
Unless your code requires a specific MPI library, it is highly recommended that you use the latest Intel MPI installed on the Hopper cluster. Also note that mpich does not use the high-speed InfiniBand interconnect, so using it for a communication-heavy job can create a bottleneck due to slow communications.
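Exact module names vary by cluster, so the names below are assumptions; `module avail` lists what is actually installed. A sketch of loading an Intel MPI module and launching a job:

```shell
module avail                  # see which MPI modules exist on your cluster
module load intel-mpi         # hypothetical module name; adjust to match
mpirun -np 64 ./my_mpi_app    # placeholder MPI executable
```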
2. Number of nodes to use
Although the cluster has > 700 nodes, it is recommended that you restrict your MPI program to 128 nodes. Even with 128 nodes, it can take a while for your job to be scheduled, depending on cluster load.
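A multi-node MPI request might be sketched as follows (the node and task counts are illustrative, well under the suggested 128-node ceiling):

```shell
#SBATCH --nodes=16             # illustrative; keep at or below 128
#SBATCH --ntasks-per-node=64   # one task per core on a 64-core node

srun ./my_mpi_app              # placeholder MPI executable
```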
GPU Based Jobs
Hopper has a limited number of GPU nodes, and they may not all be available at any given time. For the sake of other users, please be prudent about how many GPU nodes you request per job.
| Node Type | Count | CPU | Cores | Memory | GPUs |
| --- | --- | --- | --- | --- | --- |
| GPU Nodes (A100) | 31 | AMD - 1984 cores | 64 / node | 512 GB | 124 Nvidia A100 - 80GB GPUs |
| GPU Nodes (DGX) | 2 | AMD - 256 cores | 128 / node | 1024 GB | 16 Nvidia A100 - 40GB GPUs |
| GPU Nodes (H100) | 1 | Intel - 112 cores | 112 / node | 2048 GB | 4 Nvidia H100 - 80GB GPUs |
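A minimal GPU request might look like this (the partition name is a placeholder; check `sinfo` for the real one). Requesting a single GPU rather than a whole node leaves the remaining GPUs for other users:

```shell
#SBATCH --partition=gpuq   # hypothetical GPU partition name
#SBATCH --gres=gpu:1       # request a single GPU, not a whole node
```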
Troubleshooting
1. Before contacting the system admins, check the ORC Wiki to see if it already has a solution to your problem.
2. Follow the breadcrumbs. To get information about a prematurely killed job, check the log file created by SLURM. To list failed (F), completed (CD), and cancelled (CA) jobs with their exit codes, use the following command:
sacct -s F,CD,CA --starttime yyyy-mm-dd -u $USER | less
where the --starttime option sets the earliest date from which job records are printed.
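To dig further into a single job, assuming SLURM's default output-file naming (the job ID below is hypothetical):

```shell
less slurm-123456.out   # SLURM writes stdout/stderr here by default
sacct -j 123456 --format=JobID,JobName,State,ExitCode,MaxRSS,Elapsed
```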