Introduction to SLURM and MPI
This section covers basic usage of the SLURM infrastructure, particularly for launching MPI applications.
Inspecting the state of the cluster
There are two main commands that can be used to display the state of the cluster: sinfo, for showing node information, and squeue, for showing job information.
Node information
sinfo displays the state of nodes, structured by partition.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch-short up 2-00:00:00 3 alloc hpc[004-005,007]
batch-short up 2-00:00:00 7 idle hpc[006,008-013]
batch-long up infinite 1 down* hpc170
batch-long up infinite 43 alloc hpc[014-019,021-030,051-070,151,156,176,179-182]
batch-long up infinite 23 idle hpc[031-033,071-086,172,183,185,187]
batch-long up infinite 24 down hpc[020,152-154,157-167,169,171,173-175,177-178,184,186]
inf-short up 2-00:00:00 1 drain* hpc-be014
inf-short up 2-00:00:00 4 drain hpc-be[099,116-117,128]
inf-short up 2-00:00:00 64 idle hpc-be[001-002,006-008,011-013,016-017,019-023,025-026,030-032,035,040-042,045,047,049,051,053-056,059,064,067,072,080,085,087-088,094-096,102-104,106-109,112-113,120,123-124,126,129,134-136,139-142]
inf-long up infinite 2 down* hpc-be[060,081]
inf-long up infinite 1 drain hpc-be048
inf-long up infinite 32 alloc hpc-be[004-005,009,015,018,027,029,034,038,044,046,050,052,057-058,061,068-069,074,077,079,083,086,091-092,101,114,118,121-122,125,127]
inf-long up infinite 36 idle hpc-be[003,024,028,033,036,039,043,062-063,065-066,070-071,073,075-076,078,082,084,089-090,093,097-098,100,105,110-111,115,119,130-133,137-138]
inf-long up infinite 1 down hpc-be010
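If you are only interested in a particular partition or node, sinfo can filter its output. The examples below use a partition and a node name taken from the listing above:
$ sinfo -p inf-short           # show only the inf-short partition
$ sinfo -N -l -n hpc-be001     # node-oriented, long output for a single node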
The most relevant is probably the STATE column, which groups nodes with the same state together, per partition.
- Nodes in the idle state are currently not in use, and therefore a job requesting these nodes should be scheduled for running immediately.
- Nodes in the alloc or mix states are currently in use. alloc means that the whole node is in use. mix (not displayed above) indicates that a user is using only part of the resources of this node. However, we disable the sharing of nodes between users. The only way you may end up sharing a node is with yourself: if you are only partly using a node, you submit a new job which also does not fully use a node, and there are enough resources on a node for both of your jobs, SLURM may run both your jobs on the same node(s).
- drain means that this node is currently unavailable for a technical reason. The system may have detected a malfunction and disabled it, or we may be performing maintenance on this node.
- drng (not displayed above) means that the node is being drained but is still running a user job. We may schedule nodes for maintenance while a job is running, which means that the node will be marked as unavailable right after the user job is finished. So do not worry if you have a job running on a node with this state.
- down means that the node is unavailable.
The asterisk (*) after a state is irrelevant for the user; it only indicates that SLURM is unable to contact the node, which is why it is usually seen together with the drain or down states.
Job information
squeue displays one line per job, showing the state of the job (typically either running, "R", or pending, "PD"). Most of the columns are self-explanatory, as shown in the example below.
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
15002 batch-sho gs ken R 8:51:31 1 hpc004
14907 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[029,034]
14908 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[044,057]
14909 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[068,083]
14910 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[091-092]
14911 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[105,110]
14912 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[125,127]
14913 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[038,050]
14914 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[052,069]
14915 inf-long PyHT_ben theresa R 1-17:40:40 2 hpc-be[074,077]
14996 inf-long longFCC_ theresa R 15:56:18 5 hpc-be[004,018,046,058,061]
14940 inf-long triMagne louis R 22:11:24 7 hpc-be[079,086,101,114,118,121-122]
14502 batch-lon triMagne charlie R 3-02:00:39 10 hpc[061-070]
14501 batch-lon 1_3Ghz charlie R 3-02:01:29 10 hpc[051-060]
14500 batch-lon Simu charlie R 3-02:02:15 8 hpc[023-030]
14499 batch-lon Simu charlie R 3-02:02:27 8 hpc[014-019,021-022]
14497 batch-lon Simu charlie R 3-02:09:15 7 hpc[151,156,176,179-182]
13903 inf-long reverseK willy R 4-23:00:20 4 hpc-be[005,009,015,027]
Other useful options include displaying only your own jobs, using squeue -u $USER, or only jobs from a specific partition, using squeue -p batch-long.
For more information, please see the man page (man squeue).
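As a quick reference, the filters mentioned above can be used on their own or combined, optionally with the long output format (-l):
$ squeue -u $USER                    # only your own jobs
$ squeue -p batch-long               # only jobs in the batch-long partition
$ squeue -u $USER -p batch-long -l   # both filters, long output format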
Available MPI distributions
Linux HPC currently supports several MPI distributions, mainly MPICH, MVAPICH, and OpenMPI. We recommend using the latest stable version of either MVAPICH or OpenMPI; specifically, we recommend running openmpi4 over openmpi3. Mileage may vary depending on the application being run. The supported MPI distributions and versions are listed by module avail:
$ module avail
------------------------------------------------------------------------------------------------------- /usr/share/Modules/modulefiles -------------------------------------------------------------------------------------------------------
dot module-git module-info modules null use.own
-------------------------------------------------------------------------------------------------------------- /etc/modulefiles --------------------------------------------------------------------------------------------------------------
mpi/mvapich2/2.3 mpi/openmpi/3.1.6 mpi/openmpi/4.1.1
To use mvapich version 2.3, you would issue the command module load mpi/mvapich2/2.3.
Imagine you then would like to switch to OpenMPI 4.1.1. Before running module load mpi/openmpi/4.1.1, you would have to unload the previously loaded environments. To achieve this, you may use module unload mpi/mvapich2/2.3 to remove a specific module. Alternatively, you may wish to unload everything that was loaded and return to a "clean" state using module purge.
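Putting this together, a typical switch from MVAPICH to OpenMPI could look as follows (module names taken from the module avail listing above):
$ module load mpi/mvapich2/2.3     # work with MVAPICH 2.3
$ module list                      # check which modules are currently loaded
$ module purge                     # unload everything, returning to a clean state
$ module load mpi/openmpi/4.1.1    # switch to OpenMPI 4.1.1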
In addition to loading MPI distributions, you may also select alternative compilers. The default is gcc version 4, but more modern gcc versions are also supported. These may change as we update the operating system. If you have a hard requirement for a specific compiler or MPI distribution/version, please open a Request.
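Loading a compiler follows the same pattern as the MPI modules. The module name below is only a hypothetical example, so please check module avail for the exact compiler versions installed on the cluster:
$ module avail            # list the compiler and MPI modules actually installed
$ module load gcc/9       # hypothetical module name for a newer gcc toolchain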
Running jobs
There are three ways to run jobs:
- srun for interactive runs. This is good for quick tests (see the examples after this list), but it will block your terminal until the job is completed.
- sbatch for batch submissions. This is the main use case, as it allows you to create a job submission script where you may put all the arguments, commands, and comments for a particular job submission. It is also useful for recording or sharing how a particular job is run.
- salloc simply allocates resources (typically a set of nodes), allowing you to decouple the allocation from the execution. This is typically used in the rare case where you need to use the mpirun/mpiexec commands. With salloc you would allocate a set of nodes, and mpirun/mpiexec will typically pick up the allocation parameters from the environment, making it unnecessary to use hostfiles.
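As a sketch, a quick interactive test with srun and the salloc/mpirun pattern could look like the following; the program name, partition, and task counts are placeholders to adapt to your own job:
$ srun -p inf-short -t 0:30:00 -n 4 ./mpi_program parameters

$ salloc -p inf-short -N 2 -n 64 -t 1:00:00
$ mpirun ./mpi_program parameters    # picks up the allocation from the environment
$ exit                               # release the allocation when done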
We will focus on batch submission next.
Batch submissions
A batch submission consists of a batch submission file, which is essentially just a script telling SLURM the amount of resources that are needed (e.g. partition, number of tasks/nodes), how these resources will be used (e.g. tasks per node), and one or more job steps (i.e. program runs). This file is then submitted using the sbatch command. Consider for instance the following batch submission file:
#!/usr/bin/bash
#SBATCH -p inf-short
#SBATCH -t 1:00:00
#SBATCH -n 64
srun ./mpi_program parameters
The first line indicates the kind of shell that will be run, typically bash. Next, SLURM submission parameters are defined by lines starting with #SBATCH, followed by the submission parameters documented in the sbatch manpage (man sbatch). You may also read the sbatch documentation online. These parameters are almost a 1:1 mirror of the options available to srun.
In the case of the above script, it is requesting the inf-short partition (argument -p or --partition), setting a maximum time limit of 1 hour (argument -t or --time), and requesting 64 tasks (-n or --ntasks).
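For comparison, here is a sketch of the same request written with long-form options and an explicit node layout; the 2 x 32 split assumes 32 cores per node, which may not match the actual hardware:
#!/usr/bin/bash
#SBATCH --partition=inf-short
#SBATCH --time=1:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
srun ./mpi_program parameters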
To do the actual submission, one would run the following, assuming the submission file was called file (the name does not matter and could be anything):
$ sbatch file
Submitted batch job 15276
$
By default, both standard output and standard error are directed to a file named "slurm-%j.out", where "%j" is replaced with the job allocation number. Other than the batch script itself, SLURM does no movement of user files.
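If you prefer different file names, standard output and standard error can be redirected with the -o/--output and -e/--error options (see man sbatch for the available filename patterns), for example:
#SBATCH -o mpi_program-%j.out    # standard output, %j is replaced by the job ID
#SBATCH -e mpi_program-%j.err    # standard error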
For more details on different submission patterns and parameters, please see Submission patterns.
Querying finished jobs
Querying past jobs is also possible using sacct, which will output the job name, the number of allocated nodes, CPUs, memory, the job state, and so forth. You will be able to recover the exit code and final job state, but not the full output or log.
$ sacct -j 15374
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
15374 helloworld batch-sho+ hpc-batch+ 1 COMPLETED 0:0
15374.batch batch hpc-batch+ 1 COMPLETED 0:0
15374.0 echo hpc-batch+ 1 COMPLETED 0:0
In this case, you will see three different rows. These details are largely irrelevant, but in case you were wondering, this is just the way SLURM reports job information. The first row contains the (global) job information. The second row just reflects the fact that this is a batch job; if we had used srun directly, we would only see the first row. Finally, there is a row entry for every job step, one for every srun that was launched from within the batch job submission script. This job just ran srun echo once, and therefore created a single job step (15374.0), whose JobName will be the program name by default, in this case "echo".
You may also list the nodes that were involved, as follows:
$ sacct -j 15370 --format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,NodeList
JobID JobName Partition Account AllocCPUS State ExitCode NodeList
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------------
15370 triMagnet+ be-long te-vsc-scc 280 FAILED 127:0 hpc-be[010,079+
15370.batch batch te-vsc-scc 40 FAILED 127:0 hpc-be010
15370.0 hydra_pmi+ te-vsc-scc 28 FAILED 7:0 hpc-be[010,079+
You may see all available fields to use with --format in the sacct man page, or by using sacct -e. In case a field is too big to be displayed, the sacct output will end with a +. In that case, you just need to explicitly tell sacct how many characters to display, as follows:
$ sacct -j 15370 --format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,NodeList%40
JobID JobName Partition Account AllocCPUS State ExitCode NodeList
------------ ---------- ---------- ---------- ---------- ---------- -------- ----------------------------------------
15370 triMagnet+ be-long te-vsc-scc 280 FAILED 127:0 hpc-be[010,079,086,101,114,118,121]
15370.batch batch te-vsc-scc 40 FAILED 127:0 hpc-be010
15370.0 hydra_pmi+ te-vsc-scc 28 FAILED 7:0 hpc-be[010,079,086,101,114,118,121]
Note that by default you will only be able to query jobs which you own.
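If you want to list several of your own jobs instead of querying a single job ID, you can combine a user filter with a start time; the date below is just a placeholder:
$ sacct -u $USER -S 2024-01-01 --format JobID,JobName,Partition,State,ExitCode,NodeList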
Reporting issues
Before reporting an issue, please check the SSB in case there is a known issue preventing jobs from running.
Issues/Incidents should be reported using the Service Portal, and should always indicate which application was being used and how it was being launched (srun command or sbatch submission script). Requests can also be issued via the Service Portal.
In some cases, e.g. application crashes, we may require a way to reproduce the issue. The easiest way is to prepare a tarball, zip file, or directory in your home folder containing everything necessary for us to launch the job, and to tell us where it is located so we may use it to launch your application.