Exercise 8d: ABORT-DAG-ON

If a node fails, DAGMan keeps running until no more jobs can be submitted because of unmet dependencies.

If ABORT-DAG-ON NameOfJob ExitValue [RETURN DAG-Return-Value] is defined in the DAG input file, the DAG is aborted immediately when that job exits with the given value.

In exercise8b.dag a new command is added:

JOB A A.sub
JOB B B.sub
JOB C C.sub
JOB D D.sub
PARENT A CHILD B C
PARENT C CHILD D
PARENT B CHILD D
ABORT-DAG-ON B "number of exit code" RETURN 1

Execute condor_submit_dag -force exercise8b.dag to submit the DAG.

In this case, as soon as Job B returns the specified exit code, the DAG is aborted and all of its jobs are removed from the queue, even those that do not depend on Job B. Job D is therefore never run, and the DAG returns the value 1.
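The difference between the default behaviour and ABORT-DAG-ON can be sketched with a small simulation of the DAG above. This is only an illustration, not how DAGMan is implemented: the function `simulate`, its arguments, and the "wave" model (all ready nodes are submitted together and their exit codes are then checked in order) are assumptions made for this example. A return value of `None` means no abort rule fired; the sketch does not model DAGMan's normal exit status.

```python
from collections import defaultdict

# Nodes and parent -> child edges copied from the DAG file above.
NODES = ["A", "B", "C", "D"]
EDGES = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "D")]


def simulate(exit_codes, abort_on=None):
    """Return (submitted nodes, DAG return value from an abort rule or None).

    exit_codes: hypothetical exit code per node (missing nodes exit with 0).
    abort_on:   optional (node, exit_value, dag_return_value) tuple that
                mimics an ABORT-DAG-ON line.
    """
    parents = defaultdict(set)
    for parent, child in EDGES:
        parents[child].add(parent)

    succeeded, submitted = set(), []
    while True:
        # A node is ready once every one of its parents has succeeded.
        ready = [n for n in NODES
                 if n not in submitted and parents[n] <= succeeded]
        if not ready:
            return submitted, None
        submitted.extend(ready)
        for node in ready:
            code = exit_codes.get(node, 0)
            if abort_on is not None and (node, code) == abort_on[:2]:
                # Abort: nothing further is submitted.
                return submitted, abort_on[2]
            if code == 0:
                succeeded.add(node)


# Without an abort rule, C still runs although B failed; D is never submitted.
print(simulate({"B": 127}))
# With an abort rule on B, the DAG stops and reports the RETURN value.
print(simulate({"B": 127}, abort_on=("B", 127, 1)))
```

In both runs D is never submitted. What changes with the abort rule is that the DAG stops at once and reports the configured return value, instead of continuing with whatever work is still independent of B.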

The DAG return value is recorded in exercise8b.dag.dagman.log.

If ABORT-DAG-ON is not defined, DAGMan keeps executing the nodes that do not depend on the failed one, as the following exercise8b.dag.nodes.log file shows:

000 (17220.000.000) 03/17 10:21:51 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
    DAG Node: A

005 (17220.000.000) 03/17 10:26:11 Job terminated.
        (1) Normal termination (return value 0)
        ...
        ...
...
000 (17221.000.000) 03/17 10:26:16 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
    DAG Node: B
...
000 (17222.000.000) 03/17 10:26:16 Job submitted from host: <128.142.194.115:9618?addrs=128.142.194.115-9618+[2001-1458-301-e1--100-6d]-9618&noUDP&sock=2796832_b331_37>
    DAG Node: C
...
001 (17221.000.000) 03/17 10:26:18 Job executing on host: <188.185.141.99:9618?addrs=188.185.141.99-9618+[--1]-9618&noUDP&sock=27144_6b37_3>
...
005 (17221.000.000) 03/17 10:26:18 Job terminated.
        (1) Normal termination (return value 127)

001 (17222.000.000) 03/17 10:26:19 Job executing on host: <188.185.141.99:9618?addrs=188.185.141.99-9618+[--1]-9618&noUDP&sock=27144_6b37_3>
...
005 (17222.000.000) 03/17 10:26:20 Job terminated.
        (1) Normal termination (return value 0)
        .....
        .....
...
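To see at a glance which node returned which value, a log like the one above can be summarised with a few lines of Python. This is a minimal sketch that assumes the text layout shown in the excerpt (event code 000 for submission followed by a "DAG Node:" line, event code 005 for termination followed by a "return value" line). The function name `node_return_values` and the shortened `SAMPLE` text, in which the host addresses are replaced by `<...>`, are made up for this illustration.

```python
import re

# Event header, e.g. "005 (17221.000.000) 03/17 10:26:18 Job terminated."
EVENT = re.compile(r"^(\d{3}) \((\d+)\.\d+\.\d+\)")
NODE = re.compile(r"DAG Node: (\S+)")
RETURN = re.compile(r"return value (\d+)")


def node_return_values(log_text):
    """Map each DAG node name to the return value found in the log text."""
    cluster_to_node = {}
    results = {}
    current = None  # (event code, cluster id) of the event being read
    for line in log_text.splitlines():
        header = EVENT.match(line)
        if header:
            current = (header.group(1), header.group(2))
            continue
        if current is None:
            continue
        code, cluster = current
        node = NODE.search(line)
        if code == "000" and node:
            cluster_to_node[cluster] = node.group(1)
        value = RETURN.search(line)
        if code == "005" and value and cluster in cluster_to_node:
            results[cluster_to_node[cluster]] = int(value.group(1))
    return results


# Shortened copy of the excerpt above; host addresses are replaced by <...>.
SAMPLE = """\
000 (17220.000.000) 03/17 10:21:51 Job submitted from host: <...>
    DAG Node: A
005 (17220.000.000) 03/17 10:26:11 Job terminated.
        (1) Normal termination (return value 0)
000 (17221.000.000) 03/17 10:26:16 Job submitted from host: <...>
    DAG Node: B
000 (17222.000.000) 03/17 10:26:16 Job submitted from host: <...>
    DAG Node: C
005 (17221.000.000) 03/17 10:26:18 Job terminated.
        (1) Normal termination (return value 127)
005 (17222.000.000) 03/17 10:26:20 Job terminated.
        (1) Normal termination (return value 0)
"""

print(node_return_values(SAMPLE))  # {'A': 0, 'B': 127, 'C': 0}
```

The summary makes the point of the log explicit: B ended with return value 127, yet C was still executed and ended with 0.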