Retry jobs in case of failure
There are cases where worker nodes cannot execute Condor jobs correctly (e.g. CVMFS problems). Although there are mechanisms that check the health of these nodes and prevent them from accepting new jobs, in certain cases they do accept jobs and cause them to fail.
In such cases we suggest retrying the job by adding the following lines to the submit file:
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
max_retries = 3
requirements = Machine =!= LastRemoteHost
queue
Alternatively, the job can be retried only on a specific exit code. With the expression below the job leaves the queue unless it exits with code 11, so only exit code 11 triggers a retry:
on_exit_remove = (ExitBySignal == False) && (ExitCode != 11)
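Put together with the retry settings, a submit file that retries only on exit code 11 could look like the following sketch (myjob.sh is a placeholder executable name, not part of the original example):

```
executable = myjob.sh
on_exit_remove = (ExitBySignal == False) && (ExitCode != 11)
max_retries = 3
requirements = Machine =!= LastRemoteHost
queue
```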
max_retries = 3: HTCondor will retry the job up to three times if it fails.
requirements = Machine =!= LastRemoteHost: a retried job will not run on the same worker node as the previous attempt.
Example:
executable = hello_world.sh
arguments = $(ClusterId) $(ProcId)
output = output/hello.$(ClusterId).$(ProcId).out
error = error/hello.$(ClusterId).$(ProcId).err
log = log/hello.$(ClusterId).log
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
max_retries = 3
requirements = Machine =!= LastRemoteHost
queue
The hello_world.sh script contains one command that will fail (ls: cannot access /home/condor-test: No such file or directory), which makes the script exit with code 1. HTCondor will therefore retry the job up to three times, on a different worker node each time.
#!/bin/bash
ls /home/condor-test || exit 1
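The combined effect of on_exit_remove, max_retries, and the LastRemoteHost requirement can be illustrated with a small bash sketch. This is only a toy model: the host names and the hard-coded exit code are invented, and in reality the scheduling is done by HTCondor itself.

```shell
#!/bin/bash
# Toy model of the retry policy: run the job, and if it fails,
# retry up to max_retries times, never on the same "host" twice in a row.
hosts=(wn1 wn2 wn3)   # invented worker-node names
max_retries=3
last_host=""
attempts=0

for ((i = 0; i <= max_retries; i++)); do
  # requirements = Machine =!= LastRemoteHost:
  # pick the first host that differs from the previous attempt's host
  for h in "${hosts[@]}"; do
    if [ "$h" != "$last_host" ]; then
      host=$h
      break
    fi
  done

  # Stand-in for running the job: always fails with exit code 1,
  # just like hello_world.sh above.
  exit_code=1
  attempts=$((attempts + 1))
  echo "attempt $attempts on $host exited with code $exit_code"

  # on_exit_remove = (ExitBySignal == False) && (ExitCode == 0):
  # leave the "queue" only on a clean exit
  if [ "$exit_code" -eq 0 ]; then
    break
  fi
  last_host=$host
done

echo "total attempts: $attempts"
```

Since the stand-in job never succeeds, the sketch performs the initial run plus three retries (four attempts in total), alternating between worker nodes.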