
Exercise 1a: Job Submission

The aim of this exercise is to submit a simple job. To achieve this it is important to understand the submit description file, as it describes the requirements and characteristics of the job. It is here that aspects of job submission can be controlled, such as placing restrictions on machine characteristics or specifying the number of times an executable should be run.

The basic commands in a simple submit file are:

  • executable: The fully qualified name of the executable to be run.
  • arguments: Any arguments that are to be passed to the executable.
  • output: Where the STDOUT of the executable is written. This can be a relative or absolute path. HTCondor will not create the directory and hence an error will occur if the specified directory does not exist.
  • error: Where the STDERR of the executable is written. The same rules apply as for output.
  • log: Where HTCondor writes logging information regarding the job lifecycle (not the output of the job itself). It records the submission time, the execution machine and times, and, on termination, some statistics.
  • queue: This command submits the job.

The commands and attributes are case-insensitive, hence OUTPUT and output, or Queue and queue, are equivalent. Comments can be added to the submit file with #. Interpolated values can be used in the submit file; two useful values are ClusterId and ProcId. ClusterId is unique to each submission. ProcId is incremented by one for each instance of the executable in that submission. When submitting a single job, the value of ProcId is 0.
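To make the interpolation concrete, here is a hypothetical variant of this exercise's submit file using `queue 3`: all three jobs share one ClusterId, and ProcId takes the values 0, 1, and 2, so each job writes to its own output file.

```
# Comments start with '#'; command names are case-insensitive.
executable = welcome.sh
output     = output/welcome.$(ClusterId).$(ProcId).out
error      = error/welcome.$(ClusterId).$(ProcId).err
log        = log/welcome.$(ClusterId).log
queue 3
```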

The following is a submit description file for a simple job that executes the welcome.sh script. It will add only one job to the queue.

The script welcome.sh contains a simple command:

#!/bin/bash

echo "welcome to HTCondor tutorial"
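Note that the submit file below passes $(ClusterId)$(ProcId) as an argument, which this script ignores. A hypothetical variant (here named welcome_args.sh) that also prints the argument it receives:

```shell
# Write a variant of welcome.sh that echoes the job identifier
# passed in via the submit file's "arguments" command.
cat > welcome_args.sh <<'EOF'
#!/bin/bash
echo "welcome to HTCondor tutorial"
echo "job id: $1"
EOF
chmod +x welcome_args.sh
./welcome_args.sh 2464.0   # second line prints: job id: 2464.0
```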

The exercise01.sub submit description file contains the following:

    executable              = welcome.sh
    arguments               = $(ClusterId)$(ProcId)
    output                  = output/welcome.$(ClusterId).$(ProcId).out
    error                   = error/welcome.$(ClusterId).$(ProcId).err
    log                     = log/welcome.$(ClusterId).log
    queue
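As noted above, HTCondor will not create the output, error, and log directories itself, so they must exist before submission. For the submit file above, something like:

```shell
# Create the directories referenced by the submit description file;
# without them the job's output, error, and log files cannot be written.
mkdir -p output error log
```

mkdir -p is idempotent, so it is safe to run again before each submission.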

Submitting the Job

On the submission machine, create the welcome.sh script and the exercise01.sub submission file, and then run the following command:

    condor_submit exercise01.sub

This should produce the following output, which shows that the job has been submitted with ClusterId 2464:

    Submitting job(s).
    1 job(s) submitted to cluster 2464.

Monitoring the Job

The command condor_q can be used to see the current status of the jobs in the queue:

    -- Schedd: bigbird04.cern.ch : <128.142.194.115:9618?... @ 12/07/16 15:10:09
    OWNER    BATCH_NAME         SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
    fprotops CMD: welcome.sh  12/6  15:08      _      _      1      1 2464.0

    1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended


The condor_q command provides information regarding the current state of the jobs, the name of the schedd, the name of the owner, etc.

The progress of a job can be followed by executing:
    watch condor_q
The -nobatch option can be used to show the status of each individual job rather than the batch summary:
    condor_q -nobatch

    -- Schedd: bigbird04.cern.ch : <128.142.194.115:9618?... @ 03/28/17 17:13:42
     ID       OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
    21847.0   fprotops        3/28 17:13   0+00:00:00 I  0    0.0 welcome.sh

More information regarding the lifecycle of the job can be found in the log file. It records, in chronological order, information about the submitted jobs, such as which machine executed the job, whether the job terminated correctly, whether it was aborted, etc.

       001 (2465.000.000) 12/07 15:18:17 Job executing on host: <188.185.177.87:9618?addrs=188.185.177.87-9618+[--1]-9618&noUDP&sock=2869_c9b5_3>
       ...
       006 (2465.000.000) 12/07 15:18:18 Image size of job updated: 1
       0  -  MemoryUsage of job (MB)
       0  -  ResidentSetSize of job (KB)
       ...
       005 (2465.000.000) 12/07 15:18:18 Job terminated.
         (1) Normal termination (return value 0)
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
       28  -  Run Bytes Sent By Job
       47  -  Run Bytes Received By Job
       28  -  Total Bytes Sent By Job
       47  -  Total Bytes Received By Job
       Partitionable Resources :    Usage  Request Allocated
          Cpus                 :                 1         1
          Disk (KB)            :       15        1    501507
          Memory (MB)          :        0     2000      2000
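Each record in the log begins with a three-digit event code (001 for job executing, 005 for job terminated, 006 for image size updated, as seen above; 000 marks job submission). This makes the log easy to query with standard tools. A sketch, using an inline sample log rather than a real one:

```shell
# Event codes at the start of each log record identify the event:
# 000 = submitted, 001 = executing, 005 = terminated.
# A sample log is created inline here purely for illustration.
cat > sample.log <<'EOF'
000 (2465.000.000) 12/07 15:18:10 Job submitted from host: <128.142.194.115:9618>
001 (2465.000.000) 12/07 15:18:17 Job executing on host: <188.185.177.87:9618>
005 (2465.000.000) 12/07 15:18:18 Job terminated.
EOF
# Count how many jobs in this log have terminated.
grep -c '^005' sample.log   # prints 1
```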


The command condor_wait displays the progress of a job by watching the log file written for that submission:

    condor_wait <path_to_log_file> <JobId>