Exercise 8d: ABORT-DAG-ON
If a node fails, the DAG keeps running until no more jobs can be submitted because of unmet dependencies.
If ABORT-DAG-ON NameOfJob ExitValue [RETURN DAG-Return-Value] is defined in the submit file, the DAG is aborted immediately when NameOfJob exits with ExitValue. If the optional RETURN part is given, the DAG exits with DAG-Return-Value.
In exercise8b.sub, a new command is added as the last line:
JOB A A.sub
JOB B B.sub
JOB C C.sub
JOB D D.sub
PARENT A CHILD B C
PARENT C CHILD D
PARENT B CHILD D
ABORT-DAG-ON B "number of exit code" RETURN 1
In this case, when Job B returns the specified exit code and Job C has completed, all remaining jobs are removed from the queue, even if they have no dependency on Job B. Job D is therefore never run, and the DAG returns the value 1.
The DAG return value is recorded in exercise8b.dag.dagman.log.
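The difference between the default behaviour and ABORT-DAG-ON can be sketched with a toy model in Python. This is not HTCondor code: the function run_dag, the wave-by-wave scheduling, and the extra node E (a child of C only) are assumptions added for illustration, and the exit code 127 for Job B is taken from the log excerpt in this exercise.

```python
# Toy model of the DAG above; this is NOT HTCondor code.
# Node E (child of C only) is hypothetical, added to make the difference visible.
PARENTS = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}


def run_dag(exit_codes, abort_on=None):
    """Return (submitted_nodes, abort_return_value or None).

    exit_codes: node name -> exit code (nodes not listed exit with 0).
    abort_on:   optional (node, exit_value, dag_return), mirroring
                ABORT-DAG-ON node exit_value RETURN dag_return.
    """
    succeeded, submitted = set(), []
    while True:
        # A node is ready once all of its parents have succeeded.
        wave = [n for n in PARENTS
                if n not in submitted
                and all(p in succeeded for p in PARENTS[n])]
        if not wave:
            return submitted, None  # nothing left that can be submitted
        submitted.extend(wave)
        aborted = False
        for node in wave:
            code = exit_codes.get(node, 0)
            if code == 0:
                succeeded.add(node)
            if abort_on and (node, code) == abort_on[:2]:
                aborted = True
        if aborted:
            # Assumption: nothing further is submitted after the abort.
            return submitted, abort_on[2]


# Without ABORT-DAG-ON: B fails, D is blocked, but E can still run.
print(run_dag({"B": 127}))
# With ABORT-DAG-ON B 127 RETURN 1: neither D nor E is submitted.
print(run_dag({"B": 127}, abort_on=("B", 127, 1)))
```

The model lets every job of a wave finish before checking the abort condition, which matches the run shown in this exercise, where Job C completes. It does not model the removal of jobs that are still running at the moment of the abort.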
If the ABORT-DAG-ON
000 (17220.000.000) 03/17 10:21:51 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
DAG Node: A
005 (17220.000.000) 03/17 10:26:11 Job terminated.
(1) Normal termination (return value 0)
...
...
...
000 (17221.000.000) 03/17 10:26:16 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
DAG Node: B
...
000 (17222.000.000) 03/17 10:26:16 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
DAG Node: C
...
001 (17221.000.000) 03/17 10:26:18 Job executing on host: <188.185.141.99:9618?addrs=188.185.141.99-9618+[--1]-9618&noUDP&sock=27144_6b37_3>
...
005 (17221.000.000) 03/17 10:26:18 Job terminated.
(1) Normal termination (return value 127)
001 (17222.000.000) 03/17 10:26:19 Job executing on host: <188.185.141.99:9618?addrs=188.185.141.99-9618+[--1]-9618&noUDP&sock=27144_6b37_3>
...
005 (17222.000.000) 03/17 10:26:20 Job terminated.
(1) Normal termination (return value 0)
.....
.....
...
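To see which node triggered the abort, the return value of each node can be pulled out of a log like the one above. The following is a minimal sketch, assuming the two-line layout seen in the excerpt (an event header starting with a three-digit code and the job ID, followed by a detail line). The function name node_return_values is hypothetical, and the host addresses in the sample are shortened to <...>.

```python
import re

EVENT = re.compile(r"^(\d{3}) \((\d+)\.\d+\.\d+\)")   # e.g. "005 (17221.000.000)"
NODE = re.compile(r"DAG Node: (\S+)")
RETVAL = re.compile(r"\(return value (\d+)\)")


def node_return_values(log_text):
    """Map DAG node name -> return value found in the log text."""
    node_of = {}   # cluster id -> node name (from 000 submit events)
    results = {}   # node name -> return value (from 005 terminate events)
    event = cluster = None
    for line in log_text.splitlines():
        line = line.strip()
        m = EVENT.match(line)
        if m:
            event, cluster = m.group(1), m.group(2)
            continue
        if event == "000" and (m := NODE.search(line)):
            node_of[cluster] = m.group(1)
        elif event == "005" and (m := RETVAL.search(line)):
            results[node_of.get(cluster, cluster)] = int(m.group(1))
    return results


# Shortened copy of the excerpt above (addresses replaced by <...>).
SAMPLE = """\
000 (17220.000.000) 03/17 10:21:51 Job submitted from host: <...>
    DAG Node: A
005 (17220.000.000) 03/17 10:26:11 Job terminated.
    (1) Normal termination (return value 0)
...
000 (17221.000.000) 03/17 10:26:16 Job submitted from host: <...>
    DAG Node: B
...
000 (17222.000.000) 03/17 10:26:16 Job submitted from host: <...>
    DAG Node: C
...
005 (17221.000.000) 03/17 10:26:18 Job terminated.
    (1) Normal termination (return value 127)
...
005 (17222.000.000) 03/17 10:26:20 Job terminated.
    (1) Normal termination (return value 0)
...
"""

print(node_return_values(SAMPLE))  # {'A': 0, 'B': 127, 'C': 0}
```

Here Job B is the only node with a non-zero return value (127), and no submit event for Job D appears in the log.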