Best practices and common pitfalls
Foreword
Below is a list of common errors that should in general be avoided in order to protect both your jobs and the Batch infrastructure (i.e. to avoid impact on other users).
In most cases, the indicated patterns are not really harmful when used with a few jobs, but they quickly become problematic when many jobs are submitted or, for example, when large amounts of data are read or produced.
Test jobs thoroughly before submitting many
HTCondor makes it easy to submit lots of jobs, but this is a big responsibility. If thousands of broken jobs are submitted and, for example, killed for exceeding their allowed runtime, resources and your time are wasted. Defective jobs may sometimes also cause problems on the infrastructure, affecting other users.
Be sure to test a single job before submitting many. In particular, make sure that your code completes without failures in a test HTCondor job, that the job does not need more memory than requested (see resources and limits), does not take longer than requested (see job flavours), and has correct data input and output settings, i.e. it does not try to read from non-existing files or write to directories where you don't have permission to write (see also the section on data flow).
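As an illustration, such a single test job could be described with a minimal submit file like the following sketch (the executable, file names, memory request and flavour are placeholders; `+JobFlavour` assumes the CERN Batch flavour convention):

```
# test.sub -- a single test job with explicit resource requests (placeholder names)
executable            = run_analysis.sh
output                = test.$(ClusterId).$(ProcId).out
error                 = test.$(ClusterId).$(ProcId).err
log                   = test.$(ClusterId).log
request_memory        = 2000 MB
+JobFlavour           = "longlunch"
# Only bring back the files you actually need (see the input/output section below)
transfer_output_files = output.dat
queue 1
```

Submit it with `condor_submit test.sub` and follow it with `condor_q`; only increase the `queue` count once this single job completes cleanly.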
Do not use `getenv=True`
Either the submit file `environment` directive or the main executable script (by exporting variables or sourcing a setup file) can be used to set the necessary environment for your job. While HTCondor also supports the `getenv=True` directive to replicate the whole environment of the submit machine on the execute node, this often leads to problems:
- The characteristics and conditions of the execute machine may differ from those of the submit host, and may also vary over time. Replicating the submit-side environment therefore does not guarantee a working environment on the execute node.
- Even if things work, `getenv=True` often results in very large job descriptions, which causes problems for the schedds when many jobs are submitted.
Please only use `getenv=True` if it is really needed to debug a single test job (note that you may also use tools like `condor_ssh_to_job` for interactive debugging). Once the problem is understood, and in general when submitting regularly, please do not use it.
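For example, instead of `getenv=True`, only the variables the job actually needs can be passed explicitly in the submit file (a minimal sketch; the variable names and values are placeholders):

```
# Pass only the variables the job needs, instead of getenv=True
environment = "ANALYSIS_CONFIG=config.yaml OUTPUT_DIR=results"
```

Alternatively, export the variables or source a setup file at the beginning of the executable script itself. For interactive debugging, `condor_ssh_to_job <job_id>` gives you a shell in the sandbox of a running job on its execute node.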
Be careful with input/output data
If you are submitting many jobs or handling large amounts of data, you may face the following problems:
- AFS is often a point of failure: user/project quotas get filled, the limit on files per directory is reached, and too many jobs accessing a volume degrade performance dramatically.
- Accessing `/eos/...` paths directly makes use of FUSE-mounted filesystems, which are fragile.
- If you use spool submission with large executables or input data files, you may end up filling the disk of the schedd machine, rendering it inoperative.
The following are recommended practices when a significant number of jobs are submitted:
- Make sure `transfer_output_files` includes only what you absolutely require (or nothing at all!), otherwise HTCondor will transfer back all the new files generated by your job, even if they are useless temporary files.
- If possible, avoid AFS by reading/writing to EOS through one of the following options: using the EosSubmit schedds (preferred), using xrootd transfer URLs, or transferring files directly from your executable.
- If interacting with EOS from your job, make sure you use transfer commands like `xrdcp root://...` instead of reading/writing an `/eos/...` path directly, as in the sketch after this list.
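As a sketch of that last pattern (the EOS instance name and paths are placeholder examples), the job's executable would stage data in and out over xrootd instead of touching `/eos/...` through FUSE:

```
#!/bin/bash
# Stage the input in from EOS over xrootd, run, then copy the output back.
# eosuser.cern.ch and the /eos/user/... paths are placeholders.
xrdcp root://eosuser.cern.ch//eos/user/j/jdoe/data/input.dat .
./run_analysis input.dat output.dat
xrdcp output.dat root://eosuser.cern.ch//eos/user/j/jdoe/results/output.dat
```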
More information on all these concepts can be found in the data flow section.