
Common submission patterns and parameters

The submission parameters will greatly depend on your application characteristics, including scalability.

In its most basic form, you just select the number of tasks (i.e. the number of processes) to run. Each task runs on a CPU core, and if you request more cores than a single node provides (which is perfectly reasonable), your job's tasks will be spread across multiple nodes and will typically communicate over MPI.

Task placement

Task placement can greatly affect performance, and here the CPU topology and the task-per-node placement density play a big role. Again, the outcome will greatly depend on each application, so some trial and error (or, ideally, deep knowledge of the application internals, possibly aided by profiling) will help determine the best way to run an MPI application.

Mostly, you will want to experiment using the following combinations of options:

  • --ntasks. Using this option alone is sufficient in many cases, and works well when the task count is a multiple of the number of cores available per node.
  • --nodes and --ntasks-per-node. You may also set the number of nodes you want to use and how many tasks should run on each node. If you do not set --ntasks, Slurm will compute it as nodes * ntasks-per-node. If you do set it, Slurm will check that the equation ntasks = nodes * ntasks-per-node holds.
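As a sketch, the two styles above could look like this in a submit file (the 40-core node size is an assumption for illustration, not a value stated here):

```shell
#!/bin/bash
# Style 1: tasks only -- Slurm decides the node layout.
# (Assuming 40-core nodes, 80 tasks fill two nodes.)
#SBATCH --ntasks 80

# Style 2: explicit layout -- equivalent to the above on 40-core nodes.
# #SBATCH --nodes 2
# #SBATCH --ntasks-per-node 40

srun ./a.out
```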

Other options

Memory per CPU. By default, Slurm gives you a share of memory that is proportional to the number of CPU cores you request. Therefore, if you request all CPU cores of a node, you will get all of its memory. You may override this default by using either:

  • --mem, to set the required memory per node.
  • --mem-per-cpu, to override the default memory per CPU.
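For example, the two memory options might be used like this (the sizes are purely illustrative):

```shell
# Request 16 GB for the whole node:
#SBATCH --mem 16G

# ...or, alternatively, 2 GB for each requested CPU core:
# #SBATCH --mem-per-cpu 2G
```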

Setting the job's name, using --job-name. This is the name that will be displayed in squeue.

Change the default email address to notify in case of failure, using --mail-user.
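Put together, a sketch of these housekeeping options (the email address is a placeholder, and --mail-type FAIL is an assumption matching the "in case of failure" behavior described above):

```shell
#SBATCH --job-name my-simulation        # name shown in squeue
#SBATCH --mail-user user@example.org    # placeholder address
#SBATCH --mail-type FAIL                # assumed: notify only on job failure
```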

See the sbatch man page for the full list of options and their explanations.

Hybrid OpenMP + MPI applications

You do not have to do anything special to run programs compiled with OpenMP on the Linux HPC infrastructure. Just take into account that Slurm does not know about the number of OpenMP threads that you will run, but you can give the relevant hint by using --cpus-per-task.

However, each compiler and MPI framework will handle affinity in a different way, and the default settings may result in bad performance.

Specifically, when using MVAPICH2, the framework will set the affinity of each process to a single CPU core. As a result, all threads of the same process will run on a single CPU core.

To prevent this from happening and to utilize all CPU cores in a multithreaded environment, one must set the environment variable MV2_ENABLE_AFFINITY to 0. The easiest way to achieve this is to just include the following line in the submit file.

export MV2_ENABLE_AFFINITY=0

For more details, please see the official documentation on this matter in the MVAPICH2 user guide.

An example for using OpenMP (with, for instance, OpenMPI) follows:

#!/bin/bash

#SBATCH --job-name openmp-test
#SBATCH --partition inf-short
#SBATCH --time 00:01:00
#SBATCH --ntasks 12
#SBATCH --cpus-per-task 2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --ntasks $SLURM_NTASKS ./a.out

Hyperthreading

Modern processors provide a form of parallelism called hyperthreading, where a single physical core appears as two cores to the operating system. While some functional execution units are "mirrored" to provide parallel execution, not all of the hardware resources are, so applications will usually be unable to achieve a 2x speedup. In fact, some applications may see degraded performance as a result of hyperthreading, for instance because the increased number of tasks creates a memory bottleneck.

Linux HPC resources have hyperthreading enabled, but whether your application will benefit from it depends greatly on the application itself. This is something each user has to evaluate on their own, or by following recommendations from the application provider.

Using Hyperthreading

Users do not have to do anything to use hyperthreading. Since it is already enabled in the processor, applications that use all available cores will effectively be using hyperthreading.

Disabling Hyperthreading

While it is not possible for users to disable hyperthreading at the hardware level, it is possible to launch jobs so that each task uses a physical core exclusively, instead of sharing it with other tasks (generally 2 per core). As a result, half of the cores will appear unused from the operating system's perspective, which achieves the intended effect of not using hyperthreading.

To achieve this, use the following option in your srun command: --hint=nomultithread. Please note that the name is misleading: this is essentially a no-hyperthreading option; it does not mean that multithreading is disabled at the processor level. (Benchmarks have shown that for HPC applications, e.g. CFD codes like Fluent, the effect of this setting is the same as, or slightly better than, running on the same hardware with hyperthreading disabled.)

The following is an example for an application that wants to run 40 tasks without hyperthreading:

#!/bin/bash

#SBATCH --job-name nohyperthreading-test
#SBATCH --partition inf-short
#SBATCH --time 01:00:00
#SBATCH --ntasks 40
#SBATCH --ntasks-per-node 20

srun -n $SLURM_NTASKS --hint=nomultithread ./a.out

In this example, Slurm automatically calculates how many nodes are needed based on the number of tasks requested, which may be easier to reason about when launching jobs. Note that for resource allocation, Slurm only considers the #SBATCH lines and ignores all other lines, including the srun options, so we need to supply enough information to indicate how many CPU cores and how many nodes the following srun will need. This is achieved here by using both --ntasks and --ntasks-per-node, but you may also use --ntasks together with --nodes to be explicit.

If we left out the --ntasks-per-node 20 option, Slurm would not take --hint=nomultithread into account during resource allocation (the hint sometimes behaves inconsistently when used in an #SBATCH line), and would assume that a single 40-core node could fit the 40 tasks, so it would allocate only one node. This would be really bad: in the following srun, the nomultithread hint would be honored, but it would be too late to get 2 nodes since resource allocation had already happened, so the 40 tasks would run on only 20 physical cores, which is obviously bad for performance and surely not what we intended. The --ntasks-per-node 20 option is therefore needed as a hint in the #SBATCH lines so that Slurm allocates the 2 nodes we need.

Furthermore, imagine we wanted to run an application with 12 tasks, each task running 2 OpenMP threads, while still benefiting from disabling hyperthreading. The following example shows how to achieve this (for inf- partitions).

#!/bin/bash

#SBATCH --job-name nohyperthreading-openmp-test
#SBATCH --partition inf-short
#SBATCH --time 01:00:00
#SBATCH --exclusive
#SBATCH --ntasks 12
#SBATCH --ntasks-per-node 10
#SBATCH --cpus-per-task 2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun -n $SLURM_NTASKS --hint=nomultithread ./a.out

Please take into account that we have to use --exclusive so that the allocation takes up entire nodes. The remaining options under --exclusive will be "inherited" by srun.

The --ntasks-per-node=10 option ensures that we get enough nodes to run all 12 tasks * 2 OpenMP threads, each thread on its own physical core (no use of hyperthreading). To avoid hyperthreading, only 20 physical cores can be used per node, and in addition each task takes up 2 physical cores. Hence the value of 10: (number_of_hyperthreaded_cores_per_node / 2) / cpus_per_task, which evaluates to (40 / 2) / 2 = 10. You may also do the math yourself and supply the appropriate value for --nodes instead of using --ntasks-per-node.
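The arithmetic above can be sketched in plain shell (the per-node core count of 40 is taken from the inf- example; adjust it for your cluster):

```shell
#!/bin/bash
# Compute --ntasks-per-node for a hybrid job that avoids hyperthreading.
hyperthreaded_cores_per_node=40   # logical cores on an inf- node (from the example)
cpus_per_task=2                   # OpenMP threads per MPI task

# Half the logical cores are physical cores; each task claims cpus_per_task of them.
physical_cores=$(( hyperthreaded_cores_per_node / 2 ))
ntasks_per_node=$(( physical_cores / cpus_per_task ))

echo "$ntasks_per_node"   # prints 10
```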

For nodes in the batch* partitions, the same reasoning applies, except that those nodes have 32 hyperthreaded cores, so for this example application we would end up with --ntasks-per-node=8.