
GPUs

GPU resources are available and can be accessed through HTCondor (see the basics of how to submit jobs under Local).

There are a few examples available for common use cases.

For notebook and Machine Learning use cases, there are also GPUs available under SWAN and the ML service (based on Kubeflow).

Available GPU resources for HTCondor:

To view what resources are available, please run the following command:

condor_status -constraint  '!isUndefined(DetectedGPUs)' -compact  -af Machine GPUs_DeviceName TotalGPUs

Requesting GPU resources requires a single addition to the submit file.

request_gpus            = 1

To request more than one GPU in HTCondor, specify a number n, where 0 < n < 5 (i.e. at most 4 GPUs per job):

request_gpus            = n
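
Putting this into context, a minimal single-GPU submit file might look like the sketch below; the executable, file names and flavour are illustrative placeholders:

# gpu_job.sub -- illustrative single-GPU submit file
executable              = hello_gpu.sh
output                  = hello_gpu.$(ClusterId).$(ProcId).out
error                   = hello_gpu.$(ClusterId).$(ProcId).err
log                     = hello_gpu.$(ClusterId).log
request_gpus            = 1
+JobFlavour             = "espresso"
queue

It is submitted like any other job, for example with condor_submit gpu_job.sub.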

The following examples show how to submit a Hello World GPU job, a simple TensorFlow job, a job in Docker and a job in Singularity.
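
As a rough illustration of the container case (not the linked example itself), a Docker-universe GPU job can be described with a submit file along these lines; the image and script names are placeholders:

# Docker-universe sketch: image and script names are placeholders
# (the executable is transferred with the job and run inside the container)
universe                = docker
docker_image            = tensorflow/tensorflow:latest-gpu
executable              = training_script.sh
request_gpus            = 1
+JobFlavour             = "longlunch"
queue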

Running on specific platforms

The Batch Service offers a variety of GPU platforms in HTCondor. Depending on your use case, you may want to run on specific models with the right capabilities for your jobs.

HTCondor automatically publishes some GPU attributes on our GPU machines that you can use in the requirements attribute of your submit file. The following examples show some of the possible options:

  • Requirements based on device: you can use the Machine ClassAd attribute GPUs_DeviceName to match the GPU type you need:

    • requirements = regexp("V100", TARGET.GPUs_DeviceName): this expression will make your job able to run on our V100 or V100S cards.
    • requirements = TARGET.GPUs_DeviceName =?= "Tesla T4": this expression will make your job run only on Tesla T4 cards.
    • requirements = regexp("A100", TARGET.GPUs_DeviceName): this expression will make your job able to run on our A100 cards.
  • Requirements based on compute capabilities: the compute capability version of the device is also published by HTCondor in the machine ClassAd attribute GPUs_Capability and can be used in your submit file like this:

    • requirements = TARGET.GPUs_Capability =?= 8.0: this expression will make your job run only on devices whose compute capability is exactly 8.0; use a comparison such as TARGET.GPUs_Capability >= 8.0 to accept that capability or newer.

While device names are more human-readable, compute capabilities can be more flexible in the long term, as you won't have to update your jobs if we add more hardware that matches the desired capability version.
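
Putting the pieces together, a submit file that asks for one GPU and restricts the match to V100-class cards could combine the options above like this (illustrative sketch):

request_gpus            = 1
requirements            = regexp("V100", TARGET.GPUs_DeviceName)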

Interactive Jobs

Interactive jobs can be run to gain access to the resource for testing and development. There are two ways to gain interactive access: request it when the job is submitted, or attach once the job has already started. If interactive access is required from the moment the user is first allocated to a machine, the -interactive parameter must be specified when the job is submitted:

condor_submit -interactive gpu_job.sub

The user will then be presented with the following statement whilst the job is waiting to be assigned a machine:

Waiting for job to start...

Once the machine has been assigned, the user will have access to the terminal:

Welcome to slot1_1@b7g47n0004.cern.ch!

[jfenech@b7g47n0004 dir_21999]$ nvidia-smi
Mon Dec  9 16:39:16 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   35C    P0    28W / 250W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Alternatively, if you would rather submit a batch job and access the machine whilst it is running, for debugging or general monitoring, you can use condor_ssh_to_job to gain access to a terminal on the machine that is currently running the job. The -auto-retry flag will periodically check whether the job has been assigned to a machine and set up the connection once it has.

condor_ssh_to_job -auto-retry jobid

For example:

[jfenech@lxplus752 Int_Hel_Mul]$ condor_ssh_to_job -auto-retry 2257625.0 
Waiting for job to start...
Welcome to slot1_1@b7g47n0003.cern.ch!
Your condor job is running with pid(s) 12564.
[jfenech@b7g47n0003 dir_12469]$ nvidia-smi
Mon Dec  9 16:59:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   37C    P0    36W / 250W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Monitoring

You can monitor your jobs via condor_q in the normal way, and you can see how many other jobs are currently in the queue requesting GPUs:

condor_q -global -all -const '(requestGPUs > 0)'
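
If you only want a quick summary of such jobs, condor_q's autoformat output can be combined with the same constraint; the attribute selection below is just one possibility:

condor_q -const 'requestGPUs > 0' -af ClusterId ProcId JobStatus RequestGPUs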

There is a custom Grafana dashboard depicting information on the GPUs here. There are two time-series heatmaps dedicated to either short or long time periods: changing the time period in the drop-down menu at the top right determines which of the two graphs has data. Another drop-down menu at the top left selects the metric displayed in the heatmaps.

Notes

Please note that benchmark jobs are currently not compatible with GPU nodes: all GPU VMs are identical, so benchmark jobs are redundant. Jobs marked as both Benchmark and GPU jobs will not be scheduled.

Machines are currently separated into subhostgroups dedicated to different job lengths. This prevents long-running jobs from blocking the resources, so that less demanding jobs have a chance to be scheduled. The flavours available are 'espresso', 'longlunch' and 'nextweek' (see the appropriate section of the tutorial for more info on job flavours). Short jobs can run on long-job nodes, but long jobs cannot run on short-job nodes, so you will maximise the chance of your job being scheduled by using the shortest job flavour you can.

For example, include in your submit file:

+JobFlavour = "espresso"
