Exercise 8 : DAGMan - Directed Acyclic Graph Manager

HTCondor uses DAGMan to manage dependencies between HTCondor jobs. DAGMan submits jobs to HTCondor and is responsible for scheduling them and reporting on their progress.
Every DAG has a number of nodes. Each node represents an HTCondor cluster and can have dependencies on other nodes in the same DAG (expressed as parent-child relationships).

Simple DAG

  • Create a *.dag file.

DAGMan reads an input file (the DAG file) to decide which jobs to submit to HTCondor. This file describes the dependencies between nodes and must contain at least one JOB entry.

Each node in the DAG needs its own submit file.

JOB A filename01.sub
JOB B filename02.sub
....
JOB N filenameN.sub
PARENT A CHILD B
PARENT B CHILD N-1 N
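
As a rough illustration of the file layout above, the following Python sketch renders DAG-file text from two dictionaries. The helper name `render_dag`, the node names C and D, and the submit-file names filename03.sub and filename04.sub are hypothetical and chosen only for the example. It assumes that several children of one parent are listed on the same line, separated by whitespace.

```python
def render_dag(submit_files, children):
    """Return DAG-file text: one JOB line per node, one PARENT/CHILD line per parent."""
    lines = [f"JOB {name} {sub}" for name, sub in submit_files.items()]
    for parent, kids in children.items():
        # Children of the same parent are separated by spaces.
        lines.append(f"PARENT {parent} CHILD {' '.join(kids)}")
    return "\n".join(lines) + "\n"


# Hypothetical node names and submit-file names, for illustration only.
submit_files = {
    "A": "filename01.sub",
    "B": "filename02.sub",
    "C": "filename03.sub",
    "D": "filename04.sub",
}
children = {"A": ["B"], "B": ["C", "D"]}

dag_text = render_dag(submit_files, children)
print(dag_text)
```

The resulting text could be written to a file such as `example.dag` (a hypothetical name) and passed to DAGMan.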

Each HTCondor cluster is a node within the graph. The dependencies between the nodes define the submission order. Specifically, because Job B is a child of Job A, A is submitted and run first, and B is submitted only once A has completed.

If a node has more than one child, the child jobs are submitted concurrently.
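
The ordering rule can be sketched in a few lines of Python. This is only a simplified model, not DAGMan's actual scheduler: it groups nodes into "waves", where every node in a wave has all of its parents in earlier waves. Real DAGMan submits a child as soon as its own parents have completed rather than waiting for a whole wave. The function name `submission_waves` and the node names C and D are hypothetical.

```python
from collections import defaultdict


def submission_waves(jobs, edges):
    """Group nodes into waves; a node is ready once all of its parents are done."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)

    done, waves = set(), []
    remaining = list(jobs)
    while remaining:
        ready = [node for node in remaining if parents[node] <= done]
        if not ready:
            # No node can start: the dependencies contain a cycle.
            raise ValueError("cycle detected: not a DAG")
        waves.append(ready)
        done.update(ready)
        remaining = [node for node in remaining if node not in done]
    return waves


# A is the parent of B; B is the parent of both C and D.
jobs = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("B", "D")]
waves = submission_waves(jobs, edges)
print(waves)  # [['A'], ['B'], ['C', 'D']]
```

C and D land in the same wave because both depend only on B, which matches the concurrent submission of siblings described above.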
- To submit the DAG, use the following command:

condor_submit_dag <DAG filename>
This submits DAGMan itself as a scheduler universe job.

-force: Overwrites the existing DAGMan-generated files (the output files listed below) if they are left over from a previous submission.

condor_submit_dag -force <DAG filename>  

Other useful arguments for condor_submit_dag

  _-maxjobs_ : Specifies the upper limit on the number of node jobs (clusters) that condor_dagman may have submitted at any one time.

  _-maxidle_ : Specifies the upper limit on the number of idle processes. DAGMan stops submitting new node jobs once this limit is reached and resumes when the number of idle processes drops below it.
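
To show how these arguments combine on one command line, here is a small Python sketch that assembles the argument list. The helper name `dag_submit_command` and the limits 10 and 50 are arbitrary choices for illustration, not recommended values. The sketch only builds and prints the command, because actually running it requires an HTCondor installation.

```python
def dag_submit_command(dag_file, force=False, maxjobs=None, maxidle=None):
    """Build the condor_submit_dag argument list; options are added only when set."""
    cmd = ["condor_submit_dag"]
    if force:
        cmd.append("-force")
    if maxjobs is not None:
        cmd += ["-maxjobs", str(maxjobs)]
    if maxidle is not None:
        cmd += ["-maxidle", str(maxidle)]
    cmd.append(dag_file)
    return cmd


# Arbitrary example limits; tune them for the pool in use.
cmd = dag_submit_command("testCase01.dag", force=True, maxjobs=10, maxidle=50)
print(" ".join(cmd))
# On a machine with HTCondor installed, the list could be passed to
# subprocess.run(cmd, check=True) to perform the submission.
```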

Running condor_q -dag -nobatch shows the DAGMan job together with its node jobs:

-- Schedd: bigbird04.cern.ch : <128.142.194.115:9618?... @ 03/23/17 14:36:55
 ID       OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
20047.0   fprotops        3/23 14:35   0+00:01:31 R  0    0.3 condor_dagman -p 0 -f -l . -Lockfile testCase01.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag testCase01.dag -Suppress
20048.0   fprotops        3/23 14:35   0+00:00:00 I  0    0.0 array1.1.sh 200480

The DAGMan job (20047.0) keeps running until all the nodes have exited.

The following files are generated:

  1. *.dag.condor.sub file

The *.dag.condor.sub file is the submit file for DAGMan itself; HTCondor submits it as an ordinary job.

This submit file produces the following output files: *.dag.lib.err, *.dag.lib.out, *.dag.dagman.log.

  2. *.dag.dagman.log file

The *.dag.dagman.log file records the status of the DAGMan job (20047.0).

  3. *.dag.dagman.out file

The *.dag.dagman.out file records what DAGMan does. It includes the nodes' execution order, their successes or failures, etc.

  4. *.dag.lib.err file

  5. *.dag.lib.out file

  6. *.dag.metrics file

  7. *.dag.nodes.log file

The *.dag.nodes.log file records the details of the nodes' execution, for example the submission progress of the jobs in each node and the order in which the nodes ran.
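
A quick way to check which of these files exist after a submission is sketched below in Python. The helper names `expected_dagman_files` and `missing_files` are hypothetical. The suffix list simply mirrors the file names listed above and assumes that `*` stands for the DAG file's base name.

```python
from pathlib import Path

# Suffixes appended to "<name>.dag", taken from the list of generated files.
SUFFIXES = [
    ".condor.sub",
    ".dagman.log",
    ".dagman.out",
    ".lib.err",
    ".lib.out",
    ".metrics",
    ".nodes.log",
]


def expected_dagman_files(dag_file):
    """Return the file names DAGMan is expected to create for dag_file."""
    return [dag_file + suffix for suffix in SUFFIXES]


def missing_files(dag_file, directory="."):
    """Return the expected files that do not exist yet in directory."""
    base = Path(directory)
    return [name for name in expected_dagman_files(dag_file)
            if not (base / name).exists()]


for name in expected_dagman_files("testCase01.dag"):
    print(name)
```

Some of these files may only appear while the DAG is running or once it has finished, so a non-empty result from `missing_files` is not necessarily an error.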

If a node of the DAG fails for any reason, a Rescue file is created in the same directory from which the DAG file was submitted. The Rescue file records the current state of the DAG, including all the information needed to resume it.

When the DAG is submitted again, DAGMan uses the Rescue DAG file to restore the previous state of the DAG, so the run can continue from where it stopped.
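
The Python sketch below looks for the most recent Rescue file next to a DAG file. It assumes the naming pattern `<DAG filename>.rescue001`, `<DAG filename>.rescue002`, and so on, which is not stated in the text above and should be checked against the HTCondor version in use. The helper name `latest_rescue_file` is hypothetical, and the demonstration creates empty placeholder files in a temporary directory rather than real Rescue DAGs.

```python
import re
import tempfile
from pathlib import Path


def latest_rescue_file(dag_file, directory="."):
    """Return the highest-numbered rescue file name for dag_file, or None."""
    # Assumed naming pattern: <dag_file>.rescueNNN with a three-digit counter.
    pattern = re.compile(re.escape(dag_file) + r"\.rescue(\d{3})")
    found = []
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path.name))
    return max(found)[1] if found else None


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for number in (1, 2):
        (Path(tmp) / f"testCase01.dag.rescue{number:03d}").touch()
    latest = latest_rescue_file("testCase01.dag", tmp)
    print(latest)  # testCase01.dag.rescue002
```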