Exercise 8c: Retry

If a node fails for any reason, DAGMan can retry it a given number of times before the whole DAG fails and a rescue file is created. Retries are requested per node in the DAG file with the following command:

RETRY <name of job> <number of retries>  [UNLESS-EXIT value]

For example, the command Retry B 3 means that if node B fails (its job exits with a return value other than 0), DAGMan will retry the node up to 3 times. If the optional UNLESS-EXIT value is given and the node exits with that value, the node is not retried. It is important to understand that a retry is triggered by the failure of the executable within the node. DAGMan will not resubmit the job if it goes on hold (for example, because of an error during submission).
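
The counting behind RETRY can be sketched in a few lines of shell. This is only an illustration of the semantics described above, not how DAGMan is implemented: run_with_retry is a hypothetical helper that runs a command once, reruns it up to the given number of times while it keeps failing, and stops early when the command exits with the UNLESS-EXIT value.

```shell
#!/bin/bash
# Illustration only: mimics the counting of "RETRY <node> <n> [UNLESS-EXIT <v>]".
# run_with_retry is a hypothetical helper, not an HTCondor command.
# Usage: run_with_retry <retries> <unless_exit or ""> <command> [args...]
run_with_retry() {
    local retries="$1" unless_exit="$2"
    shift 2
    local attempts=0 status=0
    while :; do
        attempts=$((attempts + 1))
        status=0
        "$@" || status=$?
        # Success: nothing to retry.
        if [ "$status" -eq 0 ]; then break; fi
        # The UNLESS-EXIT value was returned: give up immediately.
        if [ -n "$unless_exit" ] && [ "$status" -eq "$unless_exit" ]; then break; fi
        # First run plus all allowed retries have been used.
        if [ "$attempts" -gt "$retries" ]; then break; fi
    done
    echo "attempts=$attempts status=$status"
}

run_with_retry 3 "" false                  # prints: attempts=4 status=1
run_with_retry 3 127 sh -c 'exit 127'      # prints: attempts=1 status=127
```

With 3 retries, a command that always fails is run 4 times in total, while a command that exits with the UNLESS-EXIT value is run only once.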

The submit files A.sub, C.sub and D.sub are simple and all contain:

executable              = welcome.sh
arguments               = $(ClusterID)$(ProcId)
output                  = output/welcome.$(ClusterId).$(ProcId).out
error                   = error/welcome.$(ClusterId).$(ProcId).err
log                     = log/welcome.$(ClusterId).log
queue

The script welcome.sh contains the following:

#!/bin/bash

echo "welcome to HTCondor Tutorial"

The submit file B.sub contains:

executable              = nodeB.sh
arguments               = $(ClusterID)$(ProcId)
output                  = output/nodeB.$(ClusterId).$(ProcId).out
error                   = error/nodeB.$(ClusterId).$(ProcId).err
log                     = log/nodeB.$(ClusterId).log
queue

The script nodeB.sh contains the following:

#!/bin/bash

echo "welcome to node B"
./noScript.sh

The script noScript.sh does not exist, so nodeB.sh exits with a non-zero return value and node B fails.
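
The failure can be reproduced locally, outside HTCondor. The sketch below recreates nodeB.sh in a scratch directory and prints the exit status the script ends with. The scratch directory and the workdir and status variables are illustrative choices, not part of the exercise.

```shell
#!/bin/bash
# Scratch directory so the demo does not touch the tutorial files.
workdir=$(mktemp -d)

# Same content as nodeB.sh in this exercise.
cat > "$workdir/nodeB.sh" <<'EOF'
#!/bin/bash

echo "welcome to node B"
./noScript.sh
EOF

# Run the script from inside the scratch directory and capture its exit status.
status=0
(cd "$workdir" && bash nodeB.sh) 2>/dev/null || status=$?
echo "nodeB.sh exit status: $status"      # prints: nodeB.sh exit status: 127

rm -rf "$workdir"
```

Because ./noScript.sh is the last command in nodeB.sh, its exit status becomes the exit status of the whole script, and that is the return value recorded for the job of node B.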

In exercise8b.dag, a new line with the Retry command is added:

JOB A A.sub
JOB B B.sub
JOB C C.sub
JOB D D.sub
PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D
Retry B 3

Submit the DAG with the command condor_submit_dag -force exercise8b.dag. Because the files created by a previous submission of the same DAG file (exercise8b.dag) already exist, the -force argument is needed so that condor_submit_dag overwrites them instead of refusing to submit.

Since node B fails every time, B.sub is resubmitted three more times after the first attempt. The total number of submissions is 4.

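If node B still fails after all retries, the DAG fails and DAGMan creates the rescue file exercise8b.dag.rescue001. When the same DAG file is submitted again without -force, DAGMan uses this rescue file, so it runs only the nodes that did not complete and does not resubmit the nodes that already finished successfully the first time.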

The exercise8b.dag.nodes.log file shows how many times node B has been submitted, together with the return value of each job. Node B returns 127, the exit status the shell reports when a command cannot be found, because noScript.sh does not exist.
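
Both pieces of information can be pulled out of the log with a short awk script. This is a minimal sketch that assumes the event layout of the excerpt below (a 000 submit event followed by a "DAG Node:" line, and a 005 termination event followed by a "return value" line). summarize_node is a hypothetical helper name.

```shell
#!/bin/bash
# summarize_node is a hypothetical helper: for one DAG node it prints the
# return value of each terminated job and the number of submissions.
# Usage: summarize_node <node name> <nodes.log file>
summarize_node() {
    local node="$1" log="$2"
    awk -v node="$node" '
        # Remember the job id "(cluster.proc.subproc)" of the latest submit event.
        $1 == "000" { last_submit = $2 }
        # The line after a submit event names the DAG node the job belongs to.
        $1 == "DAG" && $2 == "Node:" && $3 == node { mine[last_submit] = 1; submits++ }
        # Remember the job id of the latest termination event.
        $1 == "005" { last_term = $2 }
        # Report the return value only for jobs that belong to the chosen node.
        /return value/ && (last_term in mine) {
            value = $NF
            sub(/\)$/, "", value)
            print last_term, "return value", value
        }
        END { print "submissions of node " node ": " (submits + 0) }
    ' "$log"
}
```

Running summarize_node B exercise8b.dag.nodes.log in the directory of the DAG prints one line per terminated job of node B, followed by the submission count. The relevant part of the log looks as follows.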

000 (23205.000.000) 04/03 11:54:00 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
    DAG Node: B
...
001 (23205.000.000) 04/03 11:54:01 Job executing on host: <188.185.178.17:9618?addrs=188.185.178.17-9618+[--1]-9618&noUDP&sock=31625_0319_3>
...
005 (23205.000.000) 04/03 11:54:01 Job terminated.
        (1) Normal termination (return value 127)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        106  -  Run Bytes Sent By Job
        75  -  Run Bytes Received By Job
        106  -  Total Bytes Sent By Job
        75  -  Total Bytes Received By Job
        Partitionable Resources :    Usage  Request Allocated
           Cpus                 :                 1         1
           Disk (KB)            :       16        1    568962
           Memory (MB)          :        0     2000      2000
...
000 (23212.000.000) 04/03 11:54:48 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
    DAG Node: B
...
001 (23212.000.000) 04/03 11:54:49 Job executing on host: <188.185.178.17:9618?addrs=188.185.178.17-9618+[--1]-9618&noUDP&sock=31625_0319_3>
...
005 (23212.000.000) 04/03 11:54:50 Job terminated.
        (1) Normal termination (return value 127)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        106  -  Run Bytes Sent By Job
        75  -  Run Bytes Received By Job
        106  -  Total Bytes Sent By Job
        75  -  Total Bytes Received By Job
        Partitionable Resources :    Usage  Request Allocated
           Cpus                 :                 1         1
           Disk (KB)            :       16        1    568962
           Memory (MB)          :        0     2000      2000
...
000 (23213.000.000) 04/03 11:54:58 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
    DAG Node: B
...
001 (23213.000.000) 04/03 11:54:59 Job executing on host: <188.185.178.17:9618?addrs=188.185.178.17-9618+[--1]-9618&noUDP&sock=31625_0319_3>
...
005 (23213.000.000) 04/03 11:55:00 Job terminated.
        (1) Normal termination (return value 127)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        106  -  Run Bytes Sent By Job
        75  -  Run Bytes Received By Job
        106  -  Total Bytes Sent By Job
        75  -  Total Bytes Received By Job