Cloud partition

Instructions about the Cloud partition

The cloud partition runs jobs, as its name indicates, in the (Azure) cloud. Most applications should be able to run in this partition, but there are some differences from the on-site clusters, notably with respect to MPI versions and storage.

In many cases, the only change required in your submit file is to use the cloud partition:

--partition cloud

You may also want to set the number of tasks per node to match the 64 cores of each cloud node with --ntasks-per-node=64, and to take the caveats described below into account.
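As an illustration, a minimal submit file for the cloud partition could look like the sketch below. Only the partition name and --ntasks-per-node=64 come from this page; the node count, the time limit and the ./my_app placeholder are assumptions to be replaced with your own values.

```shell
#!/bin/bash
#SBATCH --partition=cloud
#SBATCH --ntasks-per-node=64
#SBATCH --nodes=2              # assumption: adjust to your job
#SBATCH --time=00:30:00        # assumption: adjust to your job

# Slurm sets these variables inside a job; outside a job they fall back to "unset".
layout="nodes=${SLURM_JOB_NUM_NODES:-unset} tasks_per_node=${SLURM_NTASKS_PER_NODE:-unset}"
echo "Partition ${SLURM_JOB_PARTITION:-unset}, ${layout}"

# Launch your application here, for example:
#   srun ./my_app              # ./my_app is a placeholder
```

Printing the layout at the top of the job output makes it easy to confirm that the job received the node and task counts you asked for.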

Cloud node characteristics

Each cloud node has 64 AMD EPYC™ 7543 cores (base frequency 2.8 GHz) and around 448 GB of memory. The nodes are interconnected with a low-latency 200 Gb/s InfiniBand network.

We have plans to either extend this partition, or add a second partition with different CPUs.

Supported MPI modules

This point is of no concern when running software that provides its own MPI, such as most Ansys products (but see the Ansys note below). However, if you are building your own MPI software and rely on module load commands in your job submit scripts, please note that the following MPI modules are supported in the cloud partition:

  • mpi/mvapich2/2.3
  • mpi/openmpi/4.1.1
  • For Intel MPI/OneAPI, please use: source /hpcscratch/project/intel/intelhpc21.sh

If you were using openmpi v3: Even though openmpi v3 and v4 are binary-compatible, we recommend recompiling your application using module load mpi/openmpi/4.1.1 before running in the cloud partition.

Please note that the module names on the Azure cloud nodes differ from those on the CERN HPC cluster. For example, if you compiled with mpi/mvapich2/2.3 on the local cluster, use module load mpi/mvapich2 in your job script for the cloud partition. To avoid conflicts, unload any previously loaded MPI module first.
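A job script that is submitted to both local and cloud partitions could select the module name at run time. The sketch below follows the mvapich2 example from the paragraph above; the helper name mpi_module_for is hypothetical, and the mapping for other MPI flavours is left to you.

```shell
# Hypothetical helper: map a partition name to the MPI module name used there.
# The mapping follows the mvapich2 example in the text and is an assumption otherwise.
mpi_module_for() {
  case "$1" in
    cloud) echo "mpi/mvapich2" ;;
    *)     echo "mpi/mvapich2/2.3" ;;
  esac
}

# Only touch modules when the module command is available in this shell.
if type module >/dev/null 2>&1; then
  # Drop the module inherited from the submit host, if any, to avoid conflicts.
  module unload mpi/mvapich2/2.3 2>/dev/null || true
  # SLURM_JOB_PARTITION is set by Slurm inside a job.
  module load "$(mpi_module_for "${SLURM_JOB_PARTITION:-}")"
fi
```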

Special Considerations for Cloud nodes

In most cases, you can submit to the cloud partition as if it were a local partition: just change the partition name to cloud. Nodes shown as idle~ by sinfo -p cloud are powered down and must be booted before the job can run. Booting can take several minutes, so your job may start with a delay.
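To get a rough idea of whether a job will have to wait for nodes to boot, you can count the nodes whose state carries the ~ marker. This is a minimal sketch: the helper name count_nodes_to_boot is hypothetical, and it matches a ~ anywhere in the state field so that it does not depend on where sinfo places the marker.

```shell
# Hypothetical helper: sum the node counts of states that carry the "~" marker.
# Expects lines of "<state> <node count>" on standard input.
count_nodes_to_boot() {
  awk '$1 ~ /~/ { n += $2 } END { print n + 0 }'
}

# Query the cloud partition only when sinfo is available on this machine.
# -h suppresses the header, %t is the compact node state, %D the node count.
if command -v sinfo >/dev/null 2>&1; then
  sinfo -p cloud -h -o "%t %D" | count_nodes_to_boot
fi
```

If the printed number is at least as large as the number of nodes your job requests, expect the boot delay described above.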

AFS, EOS, and CVMFS filesystems at CERN

Cloud jobs currently do not have access to AFS or EOS files that require authentication. It is therefore not yet possible to automatically copy data into or out of your private AFS or EOS directories.

Software installed in locations that are readable without authentication (as is the case for Ansys products, for instance) should run without issues.

CVMFS is also fully accessible from the cloud nodes. We recommend using CVMFS software installations where possible.
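Since private AFS and EOS directories are not reachable from cloud jobs, it can help to check input paths before submitting. The sketch below is only a heuristic: the helper name needs_staging and the example paths are hypothetical, and a path under /afs or /eos that is readable without authentication would be flagged even though it works.

```shell
# Hypothetical helper: succeed when a path lies under AFS or EOS.
needs_staging() {
  case "$1" in
    /afs/*|/eos/*) return 0 ;;
    *)             return 1 ;;
  esac
}

# The paths below are placeholders; substitute your own input files.
for path in /eos/user/j/jdoe/input.dat /hpcscratch/user/jdoe/input.dat; do
  if needs_staging "$path"; then
    echo "copy to /hpcscratch first: $path"
  else
    echo "ok: $path"
  fi
done
```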

Scratch space

All your data is accessible via /hpcscratch, but I/O may be slower than on the on-site clusters. We may add a dedicated cloud scratch space in the future.

Job appears as running in Slurm, but is not actually running

When nodes need to be booted, your job may appear as running even though it is not really running yet.

This is because Slurm starts your job on nodes/VMs that first need to be created. Once the job actually starts running, Slurm automatically resets the job start time to the correct value.

If the nodes are already booted, for instance because your job runs right after another job has finished, they do not need to be re-created, and your job should start running immediately.
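One simple way to see when a job really began executing is to print a timestamp as the first step of the job script. This is a sketch, not a documented procedure; the helper name log_actual_start is hypothetical.

```shell
# Hypothetical helper: print one line recording when and where the script body began.
# SLURM_JOB_ID is set by Slurm inside a job; outside a job it falls back to "unknown".
log_actual_start() {
  echo "job ${SLURM_JOB_ID:-unknown} started on $(uname -n) at $(date '+%Y-%m-%dT%H:%M:%S')"
}

# Call this first in the job script, before loading modules or launching the application.
log_actual_start
```

As long as this line has not appeared in the job output file, the nodes are most likely still booting. For Slurm's own view of the job, squeue -j followed by the job ID and -o "%i %T %S" prints the job ID, state and start time.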

Ansys note

Ansys needs SSH keys in your home directory in order to start up correctly, as already documented for the cluster in KB0006086:

[ -e ~/.ssh/id_rsa.pub ] || ssh-keygen -q -t rsa -f ~/.ssh/id_rsa -N ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Other issues

If you encounter other issues, please open a ticket as usual.