Introduction to SLURM and MPI
This section covers basic usage of the SLURM infrastructure, particularly for launching MPI applications.
Inspecting the state of the cluster
There are two main commands that can be used to display the state of the cluster: sinfo, for showing node information, and squeue, for showing job information.
Node information
sinfo displays the state of nodes, structured by partition.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch-short up 2-00:00:00 3 alloc hpc[004-005,007]
batch-short up 2-00:00:00 7 idle hpc[006,008-013]
batch-long up infinite 1 down* hpc170
batch-long up infinite 43 alloc hpc[014-019,021-030,051-070,151,156,176,179-182]
batch-long up infinite 23 idle hpc[031-033,071-086,172,183,185,187]
batch-long up infinite 24 down hpc[020,152-154,157-167,169,171,173-175,177-178,184,186]
inf-short up 2-00:00:00 1 drain* hpc-be014
inf-short up 2-00:00:00 4 drain hpc-be[099,116-117,128]
inf-short up 2-00:00:00 64 idle hpc-be[001-002,006-008,011-013,016-017,019-023,025-026,030-032,035,040-042,045,047,049,051,053-056,059,064,067,072,080,085,087-088,094-096,102-104,106-109,112-113,120,123-124,126,129,134-136,139-142]
inf-long up infinite 2 down* hpc-be[060,081]
inf-long up infinite 1 drain hpc-be048
inf-long up infinite 32 alloc hpc-be[004-005,009,015,018,027,029,034,038,044,046,050,052,057-058,061,068-069,074,077,079,083,086,091-092,101,114,118,121-122,125,127]
inf-long up infinite 36 idle hpc-be[003,024,028,033,036,039,043,062-063,065-066,070-071,073,075-076,078,082,084,089-090,093,097-098,100,105,110-111,115,119,130-133,137-138]
inf-long up infinite 1 down hpc-be010
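If you are only interested in a particular partition or node, sinfo can filter its output. The examples below use a partition and a node name taken from the listing above:
$ sinfo -p inf-short           # show only the inf-short partition
$ sinfo -N -l -n hpc-be001     # node-oriented, long output for a single node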
The most relevant is probably the STATE column, which groups nodes with the same state together, per partition.
- Nodes in the idle state are currently not in use, and therefore a job requesting these nodes should be scheduled for running immediately.
- Nodes in the alloc or mix states are currently in use. alloc means that the whole node is in use. mix (not displayed above) indicates that a user is using only part of the resources of this node. However, we disable the sharing of nodes between users. The only way you may end up sharing a node is with yourself: if you are only partly using a node, you submit a new job which also does not fully use a node, and there are enough resources on a node for both of your jobs, SLURM may run both your jobs on the same node(s).
- drain means that this node is currently unavailable for a technical reason. The system may have detected a malfunction and disabled it, or we may be performing maintenance on this node.
- drng (not displayed above) means that the node is being drained but is still running a user job. We may schedule nodes for maintenance while a job is running, which means that the node will be marked as unavailable right after the user job is finished. So do not worry if you have a job running on a node with this state.
- down means that the node is unavailable.
The asterisk (*) after a state is irrelevant for the user; it only indicates that SLURM is unable to contact the node, which is why it is usually seen together with the drain or down states.
Job information
squeue displays one line per job, showing the state of the job (typically either running, "R", or pending, "PD"). Most of the columns are self-explanatory, as shown in the example below.
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
15002 batch-sho gs ken R 8:51:31 1 hpc004
14907 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[029,034]
14908 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[044,057]
14909 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[068,083]
14910 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[091-092]
14911 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[105,110]
14912 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[125,127]
14913 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[038,050]
14914 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[052,069]
14915 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[074,077]
14996 inf-long longFCC_ theresa R 15:56:18 5 hpc-be[004,018,046,058,061]
14940 inf-long triMagne louis R 22:11:24 7 hpc-be[079,086,101,114,118,121-122]
14502 batch-lon triMagne charlie R 3-02:00:39 10 hpc[061-070]
14501 batch-lon 1_3Ghz charlie R 3-02:01:29 10 hpc[051-060]
14500 batch-lon Simu charlie R 3-02:02:15 8 hpc[023-030]
14499 batch-lon Simu charlie R 3-02:02:27 8 hpc[014-019,021-022]
14497 batch-lon Simu charlie R 3-02:09:15 7 hpc[151,156,176,179-182]
13903 inf-long reverseK willy R 4-23:00:20 4 hpc-be[005,009,015,027]
Other useful options include displaying only your own jobs, using squeue -u $USER, or only jobs from a specific partition, using squeue -p batch-long.
For more information, please see the man page (man squeue).
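As a quick reference, the filters mentioned above can be used on their own or combined, optionally with the long output format (-l):
$ squeue -u $USER                    # only your own jobs
$ squeue -p batch-long               # only jobs in the batch-long partition
$ squeue -u $USER -p batch-long -l   # both filters, long output format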
Available MPI distributions
Linux HPC currently supports several MPI distributions, mainly MPICH, MVAPICH, and OpenMPI. We recommend using the latest stable version of either MVAPICH or OpenMPI; specifically, we recommend running openmpi4 over openmpi3. Mileage may vary depending on the application being run. The supported MPI distributions and versions are listed by module avail:
$ module avail
------------------------------------------------------------------------------------------------------- /usr/share/Modules/modulefiles -------------------------------------------------------------------------------------------------------
dot module-git module-info modules null use.own
-------------------------------------------------------------------------------------------------------------- /etc/modulefiles --------------------------------------------------------------------------------------------------------------
mpi/mvapich2/2.3 mpi/openmpi/3.1.6 mpi/openmpi/4.1.1
To use mvapich version 2.3, you would issue the command module load mpi/mvapich2/2.3.
Imagine you then would like to switch to OpenMPI 4.1.1. Before running module load mpi/openmpi/4.1.1, you would have to unload the previously loaded environments. To achieve this, you may use module unload mpi/mvapich2/2.3 to remove a specific module. Alternatively, you may wish to unload everything that was loaded and return to a "clean" state using module purge.
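Putting this together, a typical switch from MVAPICH to OpenMPI could look as follows (module names taken from the module avail listing above):
$ module load mpi/mvapich2/2.3     # work with MVAPICH 2.3
$ module list                      # check which modules are currently loaded
$ module purge                     # unload everything, returning to a clean state
$ module load mpi/openmpi/4.1.1    # switch to OpenMPI 4.1.1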
In addition to loading MPI distributions, you may also select alternative compilers. The default is gcc version 4, but more modern gcc versions are also supported. These may change as we update the operating system. If you have a hard requirement for a specific compiler or MPI distribution/version, please open a Request.
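Loading a compiler follows the same pattern as the MPI modules. The module name below is only a hypothetical example, so please check module avail for the exact compiler versions installed on the cluster:
$ module avail            # list the compiler and MPI modules actually installed
$ module load gcc/9       # hypothetical module name for a newer gcc toolchain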
Running jobs
There are three ways to run jobs:
- srun for interactive runs. This is good for quick tests (see the examples after this list), but it will block your terminal until the job is completed.
- sbatch for batch submissions. This is the main use case, as it allows you to create a job submission script where you may put all the arguments, commands, and comments for a particular job submission. It is also useful for recording or sharing how a particular job is run.
- salloc simply allocates resources (typically a set of nodes), allowing you to decouple the allocation from the execution. This is typically used in the rare case where you need to use the mpirun/mpiexec commands. With salloc you would allocate a set of nodes, and mpirun/mpiexec will typically pick up the allocation parameters from the environment, making it unnecessary to use hostfiles.
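As a sketch, a quick interactive test with srun and the salloc/mpirun pattern could look like the following; the program name, partition, and task counts are placeholders to adapt to your own job:
$ srun -p inf-short -t 0:30:00 -n 4 ./mpi_program parameters

$ salloc -p inf-short -N 2 -n 64 -t 1:00:00
$ mpirun ./mpi_program parameters    # picks up the allocation from the environment
$ exit                               # release the allocation when done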
We will focus on batch submission next.
Batch submissions
A batch submission consists of a batch submission file, which is essentially just a script telling SLURM the amount of resources that are needed (e.g. partition, number of tasks/nodes), how these resources will be used (e.g. tasks per node), and one or more job steps (i.e. program runs). This file is then submitted using the sbatch command. Consider for instance the following batch submission file:
#!/usr/bin/bash
#SBATCH -p inf-short
#SBATCH -t 1:00:00
#SBATCH -n 64
srun ./mpi_program parameters
The first line indicates the kind of shell that will be run, typically bash. Next, SLURM submission parameters are defined by lines starting with #SBATCH, followed by the submission parameters documented in the sbatch manpage (man sbatch). You may also read the sbatch documentation online. These parameters are almost a 1:1 mirror of the options available to srun.
In the case of the above script, it is requesting the inf-short partition (argument -p or --partition), setting a maximum time limit of 1 hour (argument -t or --time), and requesting 64 tasks (-n or --ntasks).
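For comparison, here is a sketch of the same request written with long-form options and an explicit node layout; the 2 x 32 split assumes 32 cores per node, which may not match the actual hardware:
#!/usr/bin/bash
#SBATCH --partition=inf-short
#SBATCH --time=1:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
srun ./mpi_program parameters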
To do the actual submission, one would run the following, assuming the submission file was called file (the name does not matter and could be anything):
$ sbatch file
Submitted batch job 15276
$
By default, both standard output and standard error are directed to a file named "slurm-%j.out", where "%j" is replaced with the job allocation number. Other than the batch script itself, SLURM does no movement of user files.
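If you prefer different file names, standard output and standard error can be redirected with the -o/--output and -e/--error options (see man sbatch for the available filename patterns), for example:
#SBATCH -o mpi_program-%j.out    # standard output, %j is replaced by the job ID
#SBATCH -e mpi_program-%j.err    # standard error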
For more details on different submission patterns and parameters, please see Submission patterns.
Querying finished jobs
Querying past jobs is also possible using sacct, which will output the job name, the number of allocated nodes, CPUs, memory, the job state, and so forth. You will be able to recover the exit code and final job state, but not the full output or log.
$ sacct -j 15374
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
15374 helloworld batch-sho+ hpc-batch+ 1 COMPLETED 0:0
15374.batch batch hpc-batch+ 1 COMPLETED 0:0
15374.0 echo hpc-batch+ 1 COMPLETED 0:0
In this case, you will see three different rows. These details are largely irrelevant, but in case you were wondering, this is just the way SLURM reports job information. The first row contains the (global) job information. The second row just reflects the fact that this is a batch job; if we had used srun directly, we would only see the first row. Finally, there is a row entry for every job step, one for every srun that was launched from within the batch job submission script. This job just ran srun echo once, and therefore created a single job step (15374.0), whose JobName will be the program name by default, in this case "echo".
You may also list the nodes that were involved, as follows:
$ sacct -j 15370 --format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,NodeList
JobID JobName Partition Account AllocCPUS State ExitCode NodeList
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------------
15370 triMagnet+ be-long te-vsc-scc 280 FAILED 127:0 hpc-be[010,079+
15370.batch batch te-vsc-scc 40 FAILED 127:0 hpc-be010
15370.0 hydra_pmi+ te-vsc-scc 28 FAILED 7:0 hpc-be[010,079+
You may see all available fields to use with --format in the sacct man page, or by using sacct -e. In case a field is too big to be displayed, the sacct output will end with a +. In that case, you just need to explicitly tell sacct how many characters to display, as follows:
$ sacct -j 15370 --format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,NodeList%40
JobID JobName Partition Account AllocCPUS State ExitCode NodeList
------------ ---------- ---------- ---------- ---------- ---------- -------- ----------------------------------------
15370 triMagnet+ be-long te-vsc-scc 280 FAILED 127:0 hpc-be[010,079,086,101,114,118,121]
15370.batch batch te-vsc-scc 40 FAILED 127:0 hpc-be010
15370.0 hydra_pmi+ te-vsc-scc 28 FAILED 7:0 hpc-be[010,079,086,101,114,118,121]
Note that by default you will only be able to query jobs which you own.
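If you want to list several of your own jobs instead of querying a single job ID, you can combine a user filter with a start time; the date below is just a placeholder:
$ sacct -u $USER -S 2024-01-01 --format JobID,JobName,Partition,State,ExitCode,NodeList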
Reporting issues
Before reporting an issue, please check the SSB in case there is a known issue preventing jobs from running.
Issues/Incidents should be reported using the Service Portal, and should always indicate which application was being used and how it was being launched (srun command or sbatch submission script). Requests can also be issued via the Service Portal.
In some cases, e.g. application crashes, we may require a way to reproduce the issue. The easiest way is to prepare a tarball, zip file, or directory in your home folder containing everything necessary for us to launch the job, and to tell us where it is located so we may use it to launch your application.