Exercise 1a: Job Submission
The aim of this exercise is to submit a simple job. To do this, it is important to understand the submit description file, which describes the requirements and characteristics of the job. This file controls aspects of job submission such as restrictions on machine characteristics or the number of times an executable should be run.
The basic commands in a simple submit file are:
- executable: The name of the executable to be run. A relative path is resolved against the directory from which the job is submitted.
- arguments: Any arguments that are to be passed to the executable.
- output: Where the STDOUT of the executable is written. This can be a relative or absolute path. HTCondor will not create the directory and hence an error will occur if the specified directory does not exist.
- error: Where the STDERR of the executable is written. The same rules apply as for output.
- log: Where HTCondor writes logging information about the job lifecycle (not the job itself). It shows the submission time, the execution machine and times, and, on termination, some usage statistics.
- queue: This command submits the job.
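Because HTCondor does not create missing directories for output, error and log, it can help to create them before submitting. The following Python sketch is not part of HTCondor: the helper name `ensure_output_dirs` is hypothetical, and the parsing is deliberately simplified (one `key = value` command per line, with the directory taken as everything before the last `/`).

```python
import os


def ensure_output_dirs(submit_text, base_dir="."):
    """Create the directories used by the output, error and log commands.

    Hypothetical helper: it only understands simple `key = value` lines.
    Returns the relative directories it made sure exist.
    """
    ensured = []
    for line in submit_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and commands without a value (e.g. queue).
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.lower() in ("output", "error", "log"):
            directory = os.path.dirname(value)
            if directory:
                os.makedirs(os.path.join(base_dir, directory), exist_ok=True)
                ensured.append(directory)
    return ensured
```

Calling `ensure_output_dirs(open("exercise01.sub").read())` from the submission directory would create any missing directories named in the file before the job is submitted.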
The commands and attributes are case insensitive, so OUTPUT and output, or Queue and queue, are equivalent. Comments can be added to the submit file with #. Interpolated values can be used in the submit file, written as $(Name). Two useful values are ClusterId and ProcId. ClusterId is unique to each submission. ProcId starts at 0 and is incremented by one for each instance of the executable in that submission, so a single-job submission has ProcId 0.
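To make the effect of ClusterId and ProcId concrete, the Python sketch below imitates the substitution on a file name pattern. It is an illustration only, not HTCondor's own expansion code: `expand_macros` is a hypothetical helper, and the three-job submission with ClusterId 2464 is an assumed example.

```python
import re


def expand_macros(template, values):
    """Replace each $(Name) in template with the matching entry of values."""

    def substitute(match):
        name = match.group(1)
        # Unknown names are left untouched rather than guessed.
        return str(values.get(name, match.group(0)))

    return re.sub(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)", substitute, template)


template = "output/welcome.$(ClusterId).$(ProcId).out"
# Assumed example: one submission (ClusterId 2464) containing three jobs.
for proc_id in range(3):
    print(expand_macros(template, {"ClusterId": 2464, "ProcId": proc_id}))
```

The loop prints `output/welcome.2464.0.out`, `output/welcome.2464.1.out` and `output/welcome.2464.2.out`, which shows why including both values in a file name keeps the output of each job separate.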
The following is a submit description file for a simple job. This job executes the welcome.sh script. It will add only one job to the queue.
The script welcome.sh contains a simple command:
```bash
#!/bin/bash
echo "welcome to HTCondor tutorial"
```
The exercise01.sub submit description file contains the following:
```Ini
executable = welcome.sh
arguments = $(ClusterId)$(ProcId)
output = output/welcome.$(ClusterId).$(ProcId).out
error = error/welcome.$(ClusterId).$(ProcId).err
log = log/welcome.$(ClusterId).log
queue
```
Submitting the job
On the submission machine, create the welcome.sh script and the exercise01.sub submission file, then submit the job with condor_submit exercise01.sub.
This should produce the following output which shows that the job has been submitted with ClusterId 2464.
```Ini
Submitting job(s).
1 job(s) submitted to cluster 2464.
```
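When submission is scripted, the ClusterId is usually needed for later monitoring. A minimal Python sketch for reading it out of the message shown above follows; `parse_submit_output` is a hypothetical helper and assumes the output has exactly the wording quoted in this exercise.

```python
import re


def parse_submit_output(submit_output):
    """Return (number_of_jobs, cluster_id) from condor_submit's message.

    Assumes the "N job(s) submitted to cluster M." wording shown above.
    """
    match = re.search(r"(\d+) job\(s\) submitted to cluster (\d+)", submit_output)
    if match is None:
        raise ValueError("no cluster id found in the submit output")
    return int(match.group(1)), int(match.group(2))
```

For the output above, `parse_submit_output(...)` returns `(1, 2464)`; the single job of that submission is then job 2464.0.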
Monitoring the job
The command condor_q can be used to see the current status of the jobs in the queue:
```Ini
-- Schedd: bigbird04.cern.ch : <22.214.171.124:9618?... @ 12/07/16 15:10:09
OWNER    BATCH_NAME      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
fprotops CMD: welcome.sh 12/6  15:08      _      _      1      1 2464.0

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```

The condor_q command provides information about the current state of the jobs, the name of the schedd, the name of the owner, etc. The progress of a job can be followed by executing:

```Ini
watch condor_q
```

The -nobatch option can be used to show the status of each individual job rather than the cluster summary.

```Ini
condor_q -nobatch
```

```Ini
-- Schedd: bigbird04.cern.ch : <126.96.36.199:9618?... @ 03/28/17 17:13:42
 ID       OWNER       SUBMITTED     RUN_TIME ST PRI SIZE CMD
21847.0   fprotops   3/28 17:13   0+00:00:00 I  0    0.0 welcome.sh
```
More information regarding the lifecycle of the job can be found in the log file. It contains information about the submitted jobs in chronological order such as which machine executed the job, if the job terminated correctly, if the job aborted, etc.
```Ini
001 (2465.000.000) 12/07 15:18:17 Job executing on host: <188.8.131.52:9618?addrs=184.108.40.206-9618+[--1]-9618&noUDP&sock=2869_c9b5_3>
...
006 (2465.000.000) 12/07 15:18:18 Image size of job updated: 1
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (2465.000.000) 12/07 15:18:18 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    28  -  Run Bytes Sent By Job
    47  -  Run Bytes Received By Job
    28  -  Total Bytes Sent By Job
    47  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       15        1    501507
       Memory (MB)          :        0     2000      2000
```
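The log is plain text, so it can be summarised with a short script. The Python sketch below lists the event header lines and picks out the return value of a normal termination. It assumes the layout shown in the excerpt above (three-digit event code, job identifier in parentheses, MM/DD date); other HTCondor versions may write dates differently. `summarize_log` is a hypothetical helper name.

```python
import re

# Header line of an event, e.g. "005 (2465.000.000) 12/07 15:18:18 Job terminated."
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<message>.*)$"
)
RETURN_VALUE = re.compile(r"Normal termination \(return value (\d+)\)")


def summarize_log(log_text):
    """Return (events, return_value) for a job log in the layout above.

    events is a list of (event_code, "cluster.proc", message) tuples;
    return_value is None unless a normal termination was recorded.
    """
    events = []
    return_value = None
    for line in log_text.splitlines():
        header = EVENT_HEADER.match(line)
        if header:
            job_id = "%d.%d" % (int(header["cluster"]), int(header["proc"]))
            events.append((header["code"], job_id, header["message"]))
            continue
        found = RETURN_VALUE.search(line)
        if found:
            return_value = int(found.group(1))
    return events, return_value
```

Applied to the excerpt above, the events list contains the codes 001, 006 and 005 for job 2465.0, and the return value is 0.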
The command condor_wait follows the progress of a job by watching the submission's log file, and returns once the job has finished.
```Ini
condor_wait <path_to_log_file> <JobId>
```
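The idea of waiting on the log file can be approximated in a few lines of Python. This is a rough sketch of the concept, not how condor_wait is implemented: `wait_for_termination` is a hypothetical helper, it only recognises the 005 "Job terminated" event shown in the log excerpt, and it does not handle jobs that are removed or aborted.

```python
import re
import time


def wait_for_termination(log_path, job_id, timeout=60.0, poll_interval=1.0):
    """Poll the job log until a 005 event for job_id ("cluster.proc") appears.

    Returns True if the event was seen, False if the timeout expired first.
    """
    cluster, proc = (int(part) for part in job_id.split("."))
    # Identifiers are zero-padded in the log, e.g. (2465.000.000).
    pattern = re.compile(r"^005 \(0*%d\.0*%d\.\d+\)" % (cluster, proc), re.MULTILINE)
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(log_path) as handle:
                if pattern.search(handle.read()):
                    return True
        except FileNotFoundError:
            pass  # The log may not have been created yet.
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

For example, `wait_for_termination("log/welcome.2465.log", "2465.0")` would return True once the termination event for that job has been written, assuming the log path follows the pattern used in exercise01.sub.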