
Running Python on the ORC Clusters

!!! NOTE

 The DGX nodes on HOPPER have been migrated to the RedHat Linux 8.5 OS; previously the DGX nodes ran Ubuntu.
 Because of this OS change, the modules installed under the old Ubuntu system are no longer available, including
 the hosts/dgx directory. Python virtual environments **need** to be rebuilt if you want to run them on the DGX nodes.

 To run Python jobs or build Python virtual environments, use the python module **python/3.8.6-ff**, which can
 run on both the DGX nodes and the CPU nodes.

 Python jobs that were running only on the CPU nodes or using containers should not need any changes.

Different versions of Python are available on both ARGO and HOPPER. Since the two clusters are set up in very similar ways, the information below on running Python jobs applies to both clusters.

The main differences between ARGO and HOPPER are outlined on this page: Difference between ARGO and HOPPER

The examples below, however, are based on the HOPPER cluster; slight modifications to the SLURM scripts will be necessary to run them on ARGO.

Python Versions

To see the available versions of Python, run the command

ml spider python

This will list all the available versions of Python installed on the cluster, including all the different builds.

------------------------------------------------------------------------------------------------
  python:
------------------------------------------------------------------------------------------------
     Versions:
        python/2.7.18-z2
        python/2.7.18-z4
        python/3.7.6-iu
        python/3.7.6-ks
        python/3.7.7-intel
        python/3.8.6-vw
        python/3.8.6-ye
     Other possible modules matches:
        intel-python

------------------------------------------------------------------------------------------------
  To find other possible module matches execute:

      $ module -r spider '.*python.*'

------------------------------------------------------------------------------------------------
You can also run

module load gnu10
module avail python
which will show you only the GCC builds or the Intel builds, depending on which compiler you're working with. Going with the recommended option (the GNU-10.3.0 build), you should see

-------------------------------------------- GNU-10.3.0 ---------------------------------------------
 python/3.8.6-pi    python/3.9.9-jh (D)

  Where:
   D:  Default Module

Running

module load python
will load the default version. You can also load a specific version with

module load python/<version>
With the python module now in your path, you should be able to execute python commands and run python scripts.
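To confirm which interpreter is now active, you can, for example, have Python report its own version (the output will reflect whichever module you loaded):

python -c "import sys; print(sys.version)"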

Running a Python Job

Interactively on a CPU

It is not advised to run Python jobs directly on the head nodes. The preferred method, even if you're testing a small job, is to start an interactive session directly on a compute node and then test your script or, for short jobs, run it directly from the node.

To connect directly to a compute node, use the salloc command together with additional SLURM parameters

salloc -p normal  -n 1  --cpus-per-task=12 --mem=15GB -t 0-01:00:00
The above command will allocate you a single node with 12 cores and 15GB of memory for 1 hour on the normal partition. Once the resources become available, your prompt will change to show that you're on one of the HOPPER nodes.

salloc: Granted job allocation 
salloc: Waiting for resource configuration
salloc: Nodes hop065 are ready for job
[user@hop065 ~]$
Modules you loaded while on the head node are exported to the compute node as well. If you had not already loaded any modules, you can load them now. You can then start Python and use it interactively:

[user@hop065 ~]$ python

Python 3.8.6 (default, Apr 19 2021, 10:56:01)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
You could also run your python script directly

$ python myscript.py
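Here myscript.py stands for whatever Python script you want to run; the examples on this page keep using that name. As a minimal, purely illustrative placeholder it could be something like:

# myscript.py - a minimal placeholder script used in the examples on this page
import platform
import sys

print(f"Running Python {platform.python_version()} on {platform.node()}")
print(f"Arguments passed to the script: {sys.argv[1:]}")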

The interactive session will persist until you type

$ exit

exit
salloc: Relinquishing job allocation 

Interactively on a GPU

In a similar manner, you can start an interactive session on a GPU node with

salloc -p gpuq -q gpu --ntasks-per-node=1 --gres=gpu:A100.40gb:1 -t 0-01:00:00 
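Once the allocation is granted, you can confirm from Python which GPU device(s) SLURM has made visible to your session. This is just an illustrative check, assuming SLURM exports the CUDA_VISIBLE_DEVICES variable for the allocated GPU (the usual behaviour when requesting --gres=gpu; on a MIG slice the value may be a MIG device identifier rather than a plain index):

# illustrative check of the GPU(s) SLURM assigned to this session
import os

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "not set"))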

Using a SLURM Submission Script

Once your tests are done and you're ready to run longer Python jobs, you should switch to batch submission with SLURM. To do this, you write a SLURM script that sets the parameters for your job, loads the necessary modules, and executes your Python script; the script is then submitted to the selected queue, from where it will run. Below is an example SLURM script (run.slurm):

#!/bin/bash
#SBATCH --partition=normal                 # will run on any cpus in the 'normal' partition
#SBATCH --job-name=python-cpu
#SBATCH --output=python-cpu.%j.out
#SBATCH --error=python-cpu.%j.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1                   # up to 48 per node
#SBATCH --mem-per-cpu=3500M                 # memory per CORE; maximum is 180GB per node
#SBATCH --export=ALL
#SBATCH --time=0-01:00:00                   # set to 1hr; please choose carefully

set -x                                      # echo each command as it is executed
umask 0027

module load gnu10
module load python                          # load the recommended python version

python myscript.py                          # execute your python script

On a GPU node, you would change the partition to gpuq, select the gpu QoS, and set the number of GPUs needed:

#!/bin/bash
#SBATCH --partition=gpuq                    # the DGX nodes are only in the 'gpuq' partition
#SBATCH --qos=gpu                           # need to select 'gpu' QoS
#SBATCH --job-name=python-gpu
#SBATCH --output=python-gpu.%j.out
#SBATCH --error=python-gpu.%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1                 # up to 128; 
#SBATCH --gres=gpu:A100.40gb:1              # up to 8; only request what you need
#SBATCH --mem-per-cpu=3500M                 # memory per CORE; total memory is 1 TB (1,000,000 MB)
#SBATCH --export=ALL 
#SBATCH --time=0-01:00:00                   # set to 1hr; please choose carefully

set -x                                      # echo each command as it is executed
umask 0027

# to see ID and state of GPUs assigned
nvidia-smi

module load gnu10                           
module load python

python myscript.py

You then submit your SLURM script with

sbatch run.slurm

Optimizing your GPU runs

The DGX (Hopper's GPU node) is currently set up such that six of the eight A100 GPUs (GPU IDs 0-5) are kept whole, while the last two (GPU IDs 6-7) are partitioned into slices of different sizes.

GPU ID   Size           GRES name
0        Full A100      A100.40gb
1        Full A100      A100.40gb
2        Full A100      A100.40gb
3        Full A100      A100.40gb
4        Full A100      A100.40gb
5        Full A100      A100.40gb
6        7x 1/7 A100    1g.5gb
7        2x 1/7 A100    1g.5gb
         1x 2/7 A100    2g.10gb
         1x 4/7 A100    3g.20gb

The way the GPUs are partitioned will likely change over time to optimize utilization.

The best way to take advantage of this Multi-Instance GPU (MIG) mode of operation is to analyze the demands of your job and determine which GPU size is available and suitable for it. For example, if your simulation uses very little memory, you would be better off using a 1g.5gb slice and leaving the bigger partitions to jobs that need more GPU memory. Another consideration for machine learning jobs is the difference in demands between training and inference tasks. Training tasks are more compute and memory intensive, so they are a better fit for a full GPU or a large partition, while inference tasks run sufficiently well on smaller slices.
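If you are unsure how much GPU memory your job actually sees, a quick check from inside the job can help you choose an appropriate slice. The sketch below simply calls nvidia-smi (already used in the example scripts on this page) from Python; the query flags are standard nvidia-smi options, but depending on how MIG devices are exposed to your job the output may describe the parent GPU rather than the slice, so treat it as a rough guide:

# illustrative sketch: report the GPU name(s) and memory visible to this job
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())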

You would modify your SLURM script so that you are now requesting a suitable GPU slice:

#!/bin/bash
#SBATCH --partition=gpuq                    # the DGX nodes are only in the 'gpuq' partition
#SBATCH --qos=gpu                           # need to select 'gpu' QoS
#SBATCH --job-name=python-gpu
#SBATCH --output=python-gpu.%j.out
#SBATCH --error=python-gpu.%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1                 # up to 128; 
#SBATCH --gres=gpu:1g.5gb:1                 # request a slice of the GPU
#SBATCH --mem-per-cpu=3500M                 # memory per CORE; total memory is 1 TB (1,000,000 MB)
#SBATCH --export=ALL 
#SBATCH --time=0-01:00:00                   # set to 1hr; please choose carefully

set -x                                      # echo each command as it is executed
umask 0027

# to see ID and state of GPUs assigned
nvidia-smi

module load gnu10                            
module load python

python myscript.py

Read more about the Hopper GPUs and find other examples in the DGX USER GUIDE.

Using External Python Packages

To install and run your Python code with external Python packages, first load the python module, then create a directory for storing those packages (e.g. ~/python-packages/projectX):

mkdir ~/python-packages

mkdir ~/python-packages/projectX

Then install the appropriate packages in there:

pip install <package1> -t ~/python-packages/projectX

To run your code with these extra packages, you need to add the export command to your SLURM submission script, so that the last part now becomes

module load gnu10
module load python
export PYTHONPATH=~/python-packages/projectX:$PYTHONPATH


python myscript.py

If you are running interactively instead, run the export command from the terminal:

$ export PYTHONPATH=~/python-packages/projectX:$PYTHONPATH
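Either way, you can verify from Python that the extra directory is on the import path before importing your packages; this is just an illustrative check:

# illustrative check: confirm the projectX directory is on the import path
import sys

print([p for p in sys.path if "python-packages" in p])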

Running with Python Virtual Environments

To have better control over the Python packages and libraries you need on the cluster, the best way to run Python is through Python virtual environments. This is especially useful for codes that use TensorFlow, Keras, or PyTorch. Read our instructions on building Python virtual environments here and on running TensorFlow here.

Remember

When running on ARGO, the SLURM scripts have to be updated accordingly. The main differences between ARGO and HOPPER are detailed in these pages.

Running with Jupyter Notebooks

You also have the option of using Jupyter Notebooks (on Hopper) to run Python code. The steps for doing this are outlined in these pages.