Exercise 8c: Retry
If a node fails for any reason, DAGMan can retry the node before the whole DAG fails and a Rescue file is created. The retry is specified in the DAG file with the following syntax:
RETRY <name of job> <number of retries> [UNLESS-EXIT value]
The command Retry B 3 means that if node B fails (the job's return value is different from 0), DAGMan will retry this node up to 3 times. With the optional UNLESS-EXIT value, DAGMan stops retrying as soon as the node exits with that value. It is important to understand that a retry is triggered by a failure of the executable within the node. DAGMan will not resubmit the job if it goes on hold (for example, because of an error during submission).
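The retry rule described above can be illustrated with a small bash loop. This is only a sketch of the semantics, not how DAGMan is implemented: run_with_retry, failing_node and the attempts variable are hypothetical names introduced for this example.

```shell
#!/bin/bash
# Illustration only: mimics the effect of "RETRY <node> <retries> [UNLESS-EXIT <value>]".
# run_with_retry is a hypothetical helper, not part of HTCondor.
# Usage: run_with_retry <retries> <unless_exit or ""> <command> [args...]
# The number of attempts made is left in the global variable "attempts".
run_with_retry() {
    local retries=$1 unless_exit=$2
    shift 2
    local rc
    attempts=0
    while :; do
        attempts=$((attempts + 1))
        rc=0
        "$@" || rc=$?
        # Success: nothing to retry.
        if [ "$rc" -eq 0 ]; then return 0; fi
        # UNLESS-EXIT: this exit value means "do not retry".
        if [ -n "$unless_exit" ] && [ "$rc" -eq "$unless_exit" ]; then return "$rc"; fi
        # First run plus <retries> retries have been used up.
        if [ "$attempts" -gt "$retries" ]; then return "$rc"; fi
    done
}

# A stand-in for a node whose executable always fails with status 127.
failing_node() { return 127; }

run_with_retry 3 "" failing_node || true
echo "attempts with Retry 3: $attempts"                   # prints 4

run_with_retry 3 127 failing_node || true
echo "attempts with Retry 3 UNLESS-EXIT 127: $attempts"   # prints 1
```

The first call shows why a retry count of 3 can lead to 4 runs in total: the initial run is not counted as a retry.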
The submit files A.sub, C.sub and D.sub are simple and all contain:
executable = welcome.sh
arguments = $(ClusterID)$(ProcId)
output = output/welcome.$(ClusterId).$(ProcId).out
error = error/welcome.$(ClusterId).$(ProcId).err
log = log/welcome.$(ClusterId).log
queue
The welcome.sh script contains:
#!/bin/bash
echo "welcome to HTCondor Tutorial"
The B.sub submit file runs a different executable, nodeB.sh:
executable = nodeB.sh
arguments = $(ClusterID)$(ProcId)
output = output/nodeB.$(ClusterId).$(ProcId).out
error = error/nodeB.$(ClusterId).$(ProcId).err
log = log/nodeB.$(ClusterId).log
queue
The nodeB.sh script calls a script that does not exist, so it always fails:
#!/bin/bash
echo "welcome to node B"
./noScript.sh
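To see which return value this script hands back, it can be run locally before submitting the DAG. The sketch below recreates nodeB.sh in a scratch directory (the directory and the node_b_status variable are assumptions made for this example) and prints its exit status. Because noScript.sh is missing, bash reports status 127, and since that is the last command in the script, it becomes the script's own return value.

```shell
#!/bin/bash
# Recreate nodeB.sh in a scratch directory where noScript.sh does not exist.
workdir=$(mktemp -d)
cat > "$workdir/nodeB.sh" <<'EOF'
#!/bin/bash
echo "welcome to node B"
./noScript.sh
EOF

# Run the script and capture its exit status (the error message is discarded).
node_b_status=0
(cd "$workdir" && bash ./nodeB.sh) 2>/dev/null || node_b_status=$?
echo "nodeB.sh exit status: $node_b_status"   # prints 127

rm -rf "$workdir"
```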
In exercise8b.dag, a new line with the Retry command is added at the end:
JOB A A.sub
JOB B B.sub
JOB C C.sub
JOB D D.sub
PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D
Retry B 3
Because node B always fails in this exercise, B.sub will be resubmitted three more times. The total number of submissions is 4.
If node B still fails after all retries, the DAG fails and the exercise8b.dag.rescue001 file is created. When the DAG is submitted again, DAGMan reads this rescue file, so it does not rerun the nodes that already completed successfully and picks up from the failed node.
The exercise8b.dag.nodes.log file shows how many times node B has been submitted, as well as the return value of each job. Node B returns the value 127 because noScript.sh does not exist.
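The submission count and return values for one node can also be pulled out of the nodes log with a short awk filter. This is a rough sketch, assuming the event layout of the excerpt in this exercise (a "000" submit event followed by a "DAG Node:" line, and a "005" termination event followed by a "return value" line). summarize_node is a hypothetical helper, not an HTCondor tool.

```shell
#!/bin/bash
# Hypothetical helper: summarise the submissions of one DAG node.
# Usage: summarize_node <node name> <nodes.log file>
summarize_node() {
    local node=$1 logfile=$2
    awk -v node="$node" '
        # Submit event: remember the cluster id, e.g. "(23205.000.000)" -> 23205
        /^000 \(/    { split($2, id, "."); submitted = substr(id[1], 2) }
        # The line after a submit event names the DAG node it belongs to
        /DAG Node: / { if ($NF == node) { mine[submitted] = 1; n++ } }
        # Termination event: remember which cluster just finished
        /^005 \(/    { split($2, id, "."); finished = substr(id[1], 2) }
        # Report the return value if the finished cluster belongs to our node
        /return value/ {
            if (finished in mine) {
                value = $NF
                sub(/\)/, "", value)
                print "cluster " finished " returned " value
            }
        }
        END { printf "%d submission(s) of node %s\n", n, node }
    ' "$logfile"
}
```

It would be called as summarize_node B exercise8b.dag.nodes.log from the directory where the DAG was submitted.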
000 (23205.000.000) 04/03 11:54:00 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
DAG Node: B
...
001 (23205.000.000) 04/03 11:54:01 Job executing on host: <188.185.178.17:9618?addrs=188.185.178.17-9618+[--1]-9618&noUDP&sock=31625_0319_3>
...
005 (23205.000.000) 04/03 11:54:01 Job terminated.
(1) Normal termination (return value 127)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
106 - Run Bytes Sent By Job
75 - Run Bytes Received By Job
106 - Total Bytes Sent By Job
75 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 16 1 568962
Memory (MB) : 0 2000 2000
...
000 (23212.000.000) 04/03 11:54:48 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
DAG Node: B
...
001 (23212.000.000) 04/03 11:54:49 Job executing on host: <188.185.178.17:9618?addrs=188.185.178.17-9618+[--1]-9618&noUDP&sock=31625_0319_3>
...
005 (23212.000.000) 04/03 11:54:50 Job terminated.
(1) Normal termination (return value 127)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
106 - Run Bytes Sent By Job
75 - Run Bytes Received By Job
106 - Total Bytes Sent By Job
75 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 16 1 568962
Memory (MB) : 0 2000 2000
...
000 (23213.000.000) 04/03 11:54:58 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
DAG Node: B
...
001 (23213.000.000) 04/03 11:54:59 Job executing on host: <188.185.178.17:9618?addrs=188.185.178.17-9618+[--1]-9618&noUDP&sock=31625_0319_3>
...
005 (23213.000.000) 04/03 11:55:00 Job terminated.
(1) Normal termination (return value 127)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
106 - Run Bytes Sent By Job
75 - Run Bytes Received By Job
106 - Total Bytes Sent By Job
75 - Total Bytes Received By Job