Exercise 8 : DAGMan - Directed Acyclic Graph Manager
HTCondor uses DAGMan to manage dependencies between HTCondor jobs. DAGMan submits jobs to HTCondor and is responsible for scheduling them and reporting on them.
Every DAG has a number of nodes. Each node represents an HTCondor cluster and can depend on other nodes in the same DAG through parent-child relationships.
- Create a *.dag file.
DAGMan uses an input file in order to submit jobs to HTCondor. This file contains the dependencies between nodes and there must be at least one JOB in it.
Each node in the DAG needs its own submit file.
    JOB A filename01.sub
    JOB B filename02.sub
    ....
    JOB N filenameN.sub
    PARENT A CHILD B
    PARENT B CHILD N-1 N
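As a concrete illustration of this syntax, the following Python sketch renders the text of a small DAG file. The node names (A to D), the submit file names and the diamond shape are hypothetical; only the JOB and PARENT ... CHILD keywords come from the DAG file format.

```python
def render_dag(jobs, edges):
    """Return the text of a DAG file.

    jobs:  dict mapping node name -> submit file name
    edges: list of (parent, [children]) pairs
    """
    lines = [f"JOB {name} {submit}" for name, submit in jobs.items()]
    for parent, children in edges:
        # Several children on one line are separated by spaces.
        lines.append(f"PARENT {parent} CHILD {' '.join(children)}")
    return "\n".join(lines) + "\n"


# Hypothetical diamond-shaped DAG: A -> (B, C) -> D
jobs = {"A": "a.sub", "B": "b.sub", "C": "c.sub", "D": "d.sub"}
edges = [("A", ["B", "C"]), ("B", ["D"]), ("C", ["D"])]

dag_text = render_dag(jobs, edges)
print(dag_text)
```

Saving the printed text to a file such as diamond.dag (an arbitrary name) would give an input file that can be passed to condor_submit_dag, provided the four submit files exist.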
Each HTCondor cluster is a node within the graph, and the dependencies between the nodes define the submission order. For example, if Job B is a child of Job A, then A is submitted and run first, and B is submitted only after A has completed.
If a node has more than one child, the children are submitted concurrently.
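The ordering rule can be sketched in a few lines of Python. This is a simplified model, not DAGMan's actual scheduler: it assumes that all nodes that are ready at the same moment also finish together, whereas the real DAGMan submits each node as soon as its own parents have completed. The node names are hypothetical.

```python
def submission_rounds(parents):
    """Group nodes into rounds; a node is ready once all its parents are done.

    parents: dict mapping node name -> set of parent node names
    """
    done, rounds = set(), []
    while len(done) < len(parents):
        # A node is ready when it is not done yet and every parent is done.
        ready = sorted(n for n, p in parents.items() if n not in done and p <= done)
        if not ready:
            raise ValueError("cycle detected: not a valid DAG")
        rounds.append(ready)
        done.update(ready)
    return rounds


# Hypothetical DAG: A is the parent of B and C, which are both parents of D.
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
rounds = submission_rounds(parents)
print(rounds)  # [['A'], ['B', 'C'], ['D']]
```

B and C land in the same round because both depend only on A, which matches the concurrent submission of children described above.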
- To submit the DAG the following command is used:
    condor_submit_dag <DAG filename>
This submits DAGMan itself to HTCondor as a scheduler universe job.
-force: Overwrites any existing DAGMan output files left by a previous submission of the same DAG.
    condor_submit_dag -force <DAG filename>
Other useful arguments for condor_submit_dag:
_-maxjobs_ : Sets the maximum number of nodes that condor_dagman may have submitted at any one time.
_-maxidle_ : Sets the maximum number of idle processes. DAGMan stops submitting further nodes until the number of idle processes drops below this value.
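When the submission is scripted, the options above can be assembled programmatically. The sketch below only builds and prints the command line; the helper name dag_submit_command and the limits 10 and 5 are arbitrary choices for illustration, and nothing is executed.

```python
import shlex


def dag_submit_command(dag_file, force=False, maxjobs=None, maxidle=None):
    """Build the argument list for condor_submit_dag (not executed here)."""
    cmd = ["condor_submit_dag"]
    if force:
        cmd.append("-force")
    if maxjobs is not None:
        cmd += ["-maxjobs", str(maxjobs)]
    if maxidle is not None:
        cmd += ["-maxidle", str(maxidle)]
    cmd.append(dag_file)
    return cmd


cmd = dag_submit_command("testCase01.dag", force=True, maxjobs=10, maxidle=5)
print(shlex.join(cmd))
# To submit for real on a machine with HTCondor installed:
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the arguments as a list rather than a single string avoids quoting problems if the DAG file name contains spaces.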
Running condor_q -dag -nobatch shows the DAGMan job together with the node jobs it has submitted:
    -- Schedd: bigbird04.cern.ch : <220.127.116.11:9618?... @ 03/23/17 14:36:55
    ID       OWNER     SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
    20047.0  fprotops  3/23 14:35   0+00:01:31  R   0    0.3   condor_dagman -p 0 -f -l . -Lockfile testCase01.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag testCase01.dag -Suppress
    20048.0  fprotops  3/23 14:35   0+00:00:00  I   0    0.0   array1.1.sh 200480
The DAGMan job (20047.0) keeps running until all the nodes have exited.
The following files are generated:
- *.dag.condor.sub file
The *.dag.condor.sub file is the submit file for DAGMan itself, which is submitted to HTCondor as an ordinary job.
This submit file will create the following output files: *.dag.lib.err, *.dag.lib.out, *.dag.dagman.log.
- *.dag.dagman.log file
The *.dag.dagman.log file displays information about the DAG job's (20047.0) status.
- *.dag.dagman.out file
The *.dag.dagman.out file displays what DAGMan does. It includes information about the nodes' execution order, their successes or failures, etc.
- *.dag.nodes.log file
The *.dag.nodes.log file records the details of the nodes' execution, for example the submission progress of the jobs in each node and the order in which the nodes were executed.
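A quick way to see which of these files exist after a submission is to check for each expected name. The following is a minimal sketch; the suffix list is taken from the files named in this section, and the helper name dagman_files is hypothetical.

```python
from pathlib import Path

# Suffixes of the files described in this section.
DAGMAN_SUFFIXES = [
    ".condor.sub",
    ".lib.out",
    ".lib.err",
    ".dagman.log",
    ".dagman.out",
    ".nodes.log",
]


def dagman_files(dag_file):
    """Map each expected DAGMan file name to whether it currently exists."""
    return {dag_file + s: Path(dag_file + s).exists() for s in DAGMAN_SUFFIXES}


for name, present in dagman_files("testCase01.dag").items():
    print(f"{name:32} {'present' if present else 'missing'}")
```

Run in the directory of the DAG file, this prints one line per file and marks the ones that have not been created yet.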
If a node of the DAG fails for any reason, a Rescue file is created in the same directory from which the DAG file was submitted. It records the state of the DAG at the time of the failure and contains all the information needed to resume it.
When the DAG is submitted again, DAGMan picks up the Rescue DAG file and restores the previous state of the DAG.
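To check whether a Rescue file exists for a given DAG, the directory can be scanned for it. The sketch below assumes the usual HTCondor naming scheme, in which the DAG file name is followed by .rescue and a three-digit counter (for example testCase01.dag.rescue001); this pattern is not stated in the text above, so treat it as an assumption. The helper name latest_rescue_file is hypothetical.

```python
import re
import tempfile
from pathlib import Path


def latest_rescue_file(dag_file):
    """Return the highest-numbered <dag_file>.rescueNNN next to the DAG file, or None."""
    dag_path = Path(dag_file)
    # Assumed naming scheme: <DAG file name>.rescue<three digits>
    pattern = re.compile(re.escape(dag_path.name) + r"\.rescue(\d{3})$")
    candidates = []
    for p in dag_path.parent.iterdir():
        m = pattern.match(p.name)
        if m:
            candidates.append((int(m.group(1)), p))
    return max(candidates)[1] if candidates else None


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2):
        (Path(tmp) / f"testCase01.dag.rescue{n:03d}").touch()
    found_name = latest_rescue_file(str(Path(tmp) / "testCase01.dag")).name

print(found_name)  # testCase01.dag.rescue002
```

If the function returns None, no Rescue file has been written for that DAG in its directory.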