Retry jobs in case of failure

Worker nodes sometimes cannot execute HTCondor jobs correctly (e.g. because of CVMFS problems). Although there are mechanisms that check the health of these nodes and prevent unhealthy ones from accepting new jobs, in certain cases such nodes still accept jobs and cause them to fail.

In that case, we suggest retrying the job by adding the following lines to the submit file:

on_exit_remove   = (ExitBySignal == False) && (ExitCode == 0)
max_retries      = 3
requirements     = Machine =!= LastRemoteHost
queue
on_exit_remove: HTCondor removes the job from the queue only if the job was not killed by a signal and exited with code 0, i.e. it completed successfully. Otherwise the job stays in the queue and can be run again.
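
The decision encoded by this expression can be sketched in plain bash. This is only an illustration of the logic, not how HTCondor evaluates ClassAd expressions internally; the function name should_remove and its arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper mirroring (ExitBySignal == False) && (ExitCode == 0).
# $1: "true" if the job was killed by a signal, otherwise "false"
# $2: the exit code reported by the job
should_remove() {
    local exit_by_signal="$1"
    local exit_code="$2"
    if [ "$exit_by_signal" = "false" ] && [ "$exit_code" -eq 0 ]; then
        echo "remove"   # success: the job leaves the queue
    else
        echo "keep"     # failure: the job stays in the queue for a retry
    fi
}

should_remove false 0   # prints "remove"
should_remove false 1   # prints "keep"
should_remove true 0    # prints "keep"
```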

An alternative that retries the job only when it fails with a specific exit code is:

on_exit_remove = (ExitBySignal == False) && (ExitCode != 11)
HTCondor removes the job from the queue when the job was not killed by a signal and exited with a code other than 11. Only jobs that exit with code 11, or that are killed by a signal, stay in the queue.
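
For this variant to be useful, the job script has to return 11 only for the failures that are worth retrying. The following is a minimal sketch of such a wrapper, assuming the problem can be detected by checking that a required directory is present. The function names, the directory /cvmfs/example.repo and the payload ./payload.sh are placeholders, not part of the original setup.

```shell
#!/bin/bash
# Hypothetical job wrapper: return 11 only for a node-level problem, so that
# on_exit_remove = (ExitBySignal == False) && (ExitCode != 11) keeps the job queued.

RETRY_EXIT_CODE=11

# Succeeds if the directory given as $1 exists and is readable.
node_is_healthy() {
    [ -d "$1" ] && [ -r "$1" ]
}

# $1 is the directory to check; the remaining arguments are the payload command.
# Returns 11 if the directory is missing, otherwise the payload's own exit code.
run_job() {
    local required_dir="$1"
    shift
    if ! node_is_healthy "$required_dir"; then
        echo "required directory $required_dir is not available" >&2
        return "$RETRY_EXIT_CODE"
    fi
    "$@"
}

# Example call (placeholder names), to be placed at the end of the job script:
# run_job /cvmfs/example.repo ./payload.sh; exit $?
```

Note that a payload which itself exits with code 11 would also be retried, so the chosen code should not clash with the payload's own exit codes.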

max_retries: With the value 3, HTCondor retries a failed job at most three times.

requirements = Machine =!= LastRemoteHost: The job will not be rerun on the worker node that ran its previous attempt.

Example:

executable     = hello_world.sh
arguments      = $(ClusterId) $(ProcId)
output         = output/hello.$(ClusterId).$(ProcId).out
error          = error/hello.$(ClusterId).$(ProcId).err
log            = log/hello.$(ClusterId).log
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
max_retries    = 3
requirements   = Machine =!= LastRemoteHost
queue

The hello_world.sh script contains a single command that fails (ls: cannot access /home/condor-test: No such file or directory), so the script exits with code 1.

In that case, HTCondor will retry the job up to three times, each time on a different worker node from the previous attempt.

#!/bin/bash
ls /home/condor-test || exit 1
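
Before submitting, it can help to confirm locally that the script returns the exit code the retry expression reacts to. The sketch below recreates the example script in a temporary directory and runs it. It assumes that /home/condor-test does not exist on the local machine; the variable names are illustrative.

```shell
#!/bin/bash
# Recreate the example job script in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/hello_world.sh" <<'EOF'
#!/bin/bash
ls /home/condor-test || exit 1
EOF

# Run it and capture the exit code without aborting this shell on failure.
status=0
bash "$workdir/hello_world.sh" 2>/dev/null || status=$?
echo "exit code: $status"

rm -rf "$workdir"
```

An exit code of 1 here means on_exit_remove evaluates to false for this job, so HTCondor would keep it in the queue and retry it.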

Last update: April 11, 2023