FAQ

List of FAQs based on review of OS tickets

1. I need to use GPUs for my computations. Which GPU nodes or partitions on the Hopper cluster are appropriate for my jobs?

We recommend submitting jobs to the partitions that contain the A100:80GB nodes, unless your job needs on the order of 1 TB of memory or more.

There are 24 A100:80GB nodes with 512 GB of memory each, but only 2 A100:40GB nodes with 1 TB or more of memory each. If your job does not need on the order of 1 TB of memory, using the A100:80GB nodes will result in shorter wait times for your job to start.

Additionally, there are plans to partition the A100:80GB nodes into smaller slices, which should increase their availability for jobs and reduce wait times further.

2. I need to use GPUs for my computations. What are the 2-3 most important criteria I should consider in deciding which GPU nodes are most appropriate for my jobs?

You should have a good estimate of at least the following 2 items: amount of memory (RAM) needed for the job and time needed to complete the job. These determine the appropriate partition(s) for the job.

3. I am getting an out-of-memory error for my job. How do I resubmit the job to avoid this error?

In your Slurm script, increase the amount of memory requested with the appropriate directive. For example, #SBATCH --mem=50G requests 50 GB per node; --mem-per-cpu and --mem-per-gpu are the per-core and per-GPU alternatives.

4. How do I determine the amount of memory my job needs before submitting a time-intensive batch job? How do I use this information to select the appropriate node(s)?

To determine the amount of memory needed for your jobs, we suggest that you examine your code to determine the size of the arrays, the number of iterations, and anything else that must fit in memory while the job runs. This is good general practice for any program that you write or use for your work. Then choose a partition whose nodes have at least that much memory; the next question shows how to check the memory available on a node.
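
As a rough illustration of the array-size estimate described above, the sketch below computes the footprint of a dense two-dimensional array. The function name estimate_array_gib and the 50000 x 50000 double-precision example are hypothetical. Real jobs also need headroom for temporary copies, libraries, and I/O buffers, so treat the result as a lower bound.

```shell
# Estimate the memory footprint (in GiB) of a dense rows x cols array.
# estimate_array_gib is a hypothetical helper, not a cluster-provided tool.
estimate_array_gib() {
    local rows=$1 cols=$2 bytes_per_element=${3:-8}   # 8 bytes = double precision
    awk -v r="$rows" -v c="$cols" -v b="$bytes_per_element" \
        'BEGIN { printf "%.1f\n", r * c * b / (1024 ^ 3) }'
}

# A 50000 x 50000 array of doubles; prints 18.6
estimate_array_gib 50000 50000
```

After a short test run on the cluster, the command sacct -j <jobid> --format=JobID,MaxRSS,Elapsed,State reports the peak memory that Slurm recorded for the job, which you can compare against the estimate.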

5. My job requires a large amount of memory (>500GB). Which partition(s) or node(s) should I use?

For jobs requiring a large amount of memory, we suggest using nodes of the 'bigmem' partition.

To list all the partitions available and their corresponding nodes, you can use the command:

$ sinfo

To determine the maximum amount of memory available on a specific node, for example, the amd069 node on the bigmem partition, use the following command:

$ scontrol show node amd069 | grep mem | tail -n 1 | tr "," "\n" | sed -n '2 p'
mem=185G

Note: The value is printed in the same format that the Slurm memory directive accepts (for example, #SBATCH --mem=185G).
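
To compare all nodes of a partition at once, sinfo can print each node's configured memory. The sketch below is a convenience under stated assumptions, not an official tool: to_gb is a hypothetical helper, and the sample line fed to it is made up so that the snippet runs without Slurm.

```shell
# Convert "nodename memory_in_MB" lines into "nodename <N>G".
# to_gb is a hypothetical helper; sinfo's %m field reports memory in megabytes.
to_gb() {
    awk '{ printf "%s %dG\n", $1, int($2 / 1024) }'
}

# On the cluster (needs Slurm): one line per node in the bigmem partition.
#   sinfo -p bigmem -N -h -o "%N %m" | to_gb

# Offline demonstration with a made-up node name and size; prints "node001 1006G"
echo "node001 1031000" | to_gb
```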

6. My job has been pending (PD) for almost 2 days. I do not know how long the job will take. How should I specify the time option in the Slurm script to avoid long wait times?

The time limit is set in the Slurm script with the following directive, in the format days-hours:minutes:seconds:

#SBATCH --time=0-00:30:00

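Most partitions on the Hopper cluster have a default time limit of 3 or 5 days. If no time is specified in the Slurm script, the job is assigned the partition's default time limit. Because the scheduler can generally find a slot for a short job sooner than for a long one, we recommend specifying the time you estimate your job will take, especially if it is significantly less than the maximum. If you do not know how long the job will take, time a short test run (for example, a few iterations or a reduced input) and scale up from that measurement.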
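
If you have timed a short test run, one way to turn that measurement into a --time value is to add a safety margin and format the result. In the sketch below, slurm_time is a hypothetical helper and the 20% default margin is an assumption, not a site recommendation.

```shell
# Format a runtime estimate (in seconds) plus a safety margin (in percent)
# as D-HH:MM:SS. slurm_time is a hypothetical helper.
slurm_time() {
    local seconds=$1 margin=${2:-20}
    local total=$(( seconds + seconds * margin / 100 ))
    printf '%d-%02d:%02d:%02d\n' \
        $(( total / 86400 )) $(( total % 86400 / 3600 )) \
        $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A job estimated at 90 minutes with a 20% margin; prints 0-01:48:00
slurm_time 5400 20
```

The printed value can be used directly in the #SBATCH --time= directive. Slurm terminates a job that reaches its time limit, so the margin guards against an estimate that is slightly too low.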

7. Is Python installed on the Hopper cluster?

Yes, Python is installed. Only Python 3 versions are available. To list the available versions, use the command:

$ module avail python

Then use the following command to load Python for your use:

$ module load python/<version>

NOTE: The versions available for the gnu9 and the gnu10 compilers are different. For the versions available for the gnu10 compiler, first load the gnu10 module and then load the Python version of your choice:

$ module load gnu10
$ module avail python
$ module load python/<version>

8. Is R installed on the Hopper cluster?

Yes, R is installed. To list the available versions, use the command:

$ module avail r

Use the following command to load R for your use:

$ module load r/<version>

RStudio Server is also available, both as a module from the command-line interface (CLI) and as a GUI-based application on Open OnDemand (OOD). To access the OOD web server, point your browser to: https://ondemand.orc.gmu.edu

You will have to authenticate with your GMU NetID and password.

9. Is Matlab installed on the Hopper cluster?

Yes, Matlab is installed. To list the available versions, use the command:

$ module avail matlab

Use the following command to load Matlab for your use:

$ module load matlab/<version>

10. Do you have a quota for each user? How can I check my quota usage?

Yes. Each user's $HOME directory has a quota of 60 GB. You can check your current usage with the following command:

$ du -sh $HOME

PhD students or their advisors can request additional space on the '/projects' filesystem. Usage here should not exceed 1 TB per student.

A '/scratch/$USER' directory is available to each user for temporary storage, such as job results. We will perform occasional sweeps of this filesystem, removing any files that are older than 90 days (about 3 months).

11. How do I submit jobs?

Jobs are submitted through Slurm. Slurm is a workload manager for Linux that manages job submission, deletion, and monitoring.

The command for submitting a batch job is:

$ sbatch <slurm script>
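
For orientation, a minimal batch script might look like the sketch below. The file name hello.slurm, the job name, and all resource values are placeholders chosen for illustration; no partition is named, so the job would go to whatever the default partition is.

```shell
#!/bin/bash
#SBATCH --job-name=hello            # hypothetical job name
#SBATCH --output=hello-%j.out       # %j is replaced by the job ID
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G                    # placeholder value
#SBATCH --time=0-00:05:00           # placeholder value

# SLURM_JOB_ID is set by Slurm inside a job; the fallback keeps the
# script runnable outside the scheduler for a quick syntax check.
message="Job ${SLURM_JOB_ID:-none} running on $(uname -n)"
echo "$message"
```

Saved as hello.slurm, the script would be submitted with sbatch hello.slurm. Afterwards, squeue -u $USER shows the job's state and scancel <jobid> cancels it.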

12. Why do my jobs have low CPU efficiency?

Common reasons for low CPU efficiency include:

Running a serial code using multiple CPU-cores. Make sure that your code is written to run in parallel before using multiple CPU-cores.
Using too many CPU-cores for parallel jobs. You can find the optimal number of CPU-cores by performing a scaling analysis.
Writing job output to the /groups or /projects storage systems. Actively running jobs should write output files to /scratch/.
Using "mpirun" instead of "srun" for parallel codes. Please use "srun".

Consult the documentation or write to the mailing list of the software that you are using for additional reasons for low CPU efficiency and for potential solutions.
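
The scaling analysis mentioned above amounts to running the same problem on different core counts and comparing wall-clock times. The sketch below computes speedup and parallel efficiency from such timings. The helper name parallel_efficiency and all timings are made up for illustration.

```shell
# Report speedup and parallel efficiency from measured wall-clock times.
# parallel_efficiency is a hypothetical helper; the timings below are made up.
# Arguments: serial_seconds parallel_seconds cores
parallel_efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" \
        'BEGIN { printf "cores=%d speedup=%.2f efficiency=%.0f%%\n", n, t1 / tn, 100 * t1 / (tn * n) }'
}

parallel_efficiency 1200 320 4     # cores=4 speedup=3.75 efficiency=94%
parallel_efficiency 1200 200 8     # cores=8 speedup=6.00 efficiency=75%
parallel_efficiency 1200 170 16    # cores=16 speedup=7.06 efficiency=44%
```

In this made-up series, going from 8 to 16 cores barely reduces the run time while efficiency drops sharply, so 8 cores or fewer would be the more reasonable request.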

13. My files were deleted from '/scratch', but they weren't older than 90 days. What gives?

The most common cause of this issue is that the files in question were extracted from a tarball or zip archive. Extracted files can retain the modification date they had when they were originally archived. Since the purge policy is based on the modification date, those files may be marked for deletion during the next purge.
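
To see which files carry an old modification date, you can list them with find. The sketch below uses a hypothetical helper, purge_candidates, and demonstrates it in a throwaway directory. It is not the cluster's purge tool, and because find counts whole 24-hour periods, borderline files may be classified differently by the real purge.

```shell
# List regular files whose modification time is more than N days old.
# purge_candidates is a hypothetical helper, not the cluster's purge tool.
purge_candidates() {
    local dir=${1:-/scratch/$USER} days=${2:-90}
    find "$dir" -type f -mtime +"$days"
}

# Demonstration in a throwaway directory: one file back-dated the way an
# extracted archive member might be, and one freshly written file.
demo=$(mktemp -d)
touch -d '200 days ago' "$demo/from_archive.dat"
touch "$demo/new_result.dat"
purge_candidates "$demo" 90    # lists only from_archive.dat
rm -r "$demo"
```

On the cluster, purge_candidates /scratch/$USER would list the files in your scratch directory whose modification date is more than 90 days old, including files whose dates came from an archive.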

14. But I need those files to do my research. What do I do?

The purge process starts on the 1st of every month and deletes files whose modification date is more than 90 days before that date. Depending on the number of files to be purged, the process may extend over multiple days, but it continues to work from the original list of files generated on the 1st.

In other words, if you need those files for less than a month, start your work early in the month (on the 2nd, to be safe).

If you need those files for longer than a month, then you can (in order of preference):

1. Move them to your home directory if they will fit.
2. Move them to your group's storage location (if you have one). Make sure there is enough space first.
3. Request a timed exclusion for the files. Submit a ticket, specifying the full path(s) you'd like to be excluded, the end date, and any other relevant information. Once reviewed by support, the files will be ignored by the purge until after the end date, at which point they will be marked for deletion as usual. The path you specify can be a directory, in which case anything under that path will be excluded.

If the files represent a publicly-available dataset, you can also request to have them added to '/datasets' where they will persist and be available to all cluster users. Submit a ticket to request this, and include as much information about the dataset as is relevant. Since storage available for the /datasets filesystem is finite and there are limits on the types of data we can share in this way, support will need to review the dataset to make sure it's a good fit.

If your case falls outside the bounds of the options listed above, please submit a ticket and we will try to accommodate you as best we can or help find another solution.

15. I am not able to start a JupyterLab session, or not able to access Open OnDemand, or I see a "Disk quota exceeded" message.

You are approaching or exceeding your 60 GB (55.88 GiB) /home quota limit. You will not be able to save anything to your /home directory after you exceed your quota. You can quickly check your disk usage with the following options, listed in order of decreasing speed:

1. gdu (fast)

$ gdu --si -s $HOME

2. ncdu

$ ncdu --si $HOME

3. du (slow)

$ du --si -s $HOME
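
The commands above report a total. To see where the space is going, one option is to list the largest first-level entries, including hidden directories such as caches. The helper name largest_entries is hypothetical, and the sketch assumes the GNU versions of du and sort.

```shell
# List the total and the N largest first-level entries under a directory
# (default: $HOME, N=10). Hidden files and directories are included.
# largest_entries is a hypothetical helper; it assumes GNU du and sort.
largest_entries() {
    local dir=${1:-$HOME} count=${2:-10}
    # The first output line is the total for the directory itself.
    du --si -a -d 1 "$dir" 2>/dev/null | sort -rh | head -n "$(( count + 1 ))"
}

largest_entries "$HOME" 10
```

Like plain du, this walks the whole directory tree, so it can be slow on a large home directory.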

You can make space by:

1. Removing unnecessary files from '/home'.

2. Compressing files in '/home'.

3. Moving files to your '/projects' directory. Faculty members can request a '/projects' allocation of 1 TB to start, and can get additional space based on the number of active graduate students and postdoctoral associates they advise. If you are not a faculty member, your faculty advisor or supervisor can request the '/projects' space and add you as a group member so that you can use it.

4. Moving files to your '/groups' directory. Faculty can buy '/groups' space at a cost of $60/TB/year by emailing orchelp@gmu.edu.

5. Moving files to your '/scratch' directory. Note that '/scratch' is subject to a 90-day purge policy: files last modified more than 90 days ago are deleted permanently.

6. Clearing the cache, as follows. Log in to Hopper using SSH:

$ ssh <NetID>@hopper.orc.gmu.edu

Once you have logged in to Hopper, run the following command to delete the cache files in your home directory:

$ rm -r ~/.cache