Migration from LSF to HTCondor: practical
Design of your framework / job-layouts
In general, since the filesystems shared between lxplus and HTCondor are the same as those shared with LSF, you don't need to change your job frameworks or the way you lay out the job input data on the underlying filesystem. You should be able to write a Condor submit file that does the same work as your LSF submissions.
Since Condor offers a more flexible way to submit and control multiple jobs that belong to the same set (e.g. to glob over multiple input files), you may later wish to take advantage of those features to streamline your job submissions, but there is no requirement to do so in order to migrate.
Create a submit file
Unlike LSF, HTCondor jobs cannot be submitted directly on the command line. Each job (or set of jobs) requires a submit file. If you already use LSF submit files, the concept is the same.
The Quickstart guide describes the basic format. A skeleton submit file is copied here:
```Ini
executable = hello_world.sh
arguments = $(ClusterId) $(ProcId)
output = output/hello.$(ClusterId).$(ProcId).out
error = error/hello.$(ClusterId).$(ProcId).err
log = log/hello.$(ClusterId).log
queue
```
There is one directive (or ClassAd) per line. Any of the ClassAds described below that you need should be added on their own line before the `queue` line.
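To illustrate the one-directive-per-line layout, here is a minimal Python sketch that assembles a submit file from a dictionary and always places extra directives before the final `queue` line. The helper `build_submit` is hypothetical and not part of HTCondor; it only produces text, which you would save to a file and pass to `condor_submit`.

```python
def build_submit(base, extra=None, queue_stmt="queue"):
    """Return submit-file text: one 'key = value' directive per line,
    extra directives after the base ones, and the queue statement last."""
    directives = dict(base)          # dicts keep insertion order (Python 3.7+)
    directives.update(extra or {})
    lines = [f"{key} = {value}" for key, value in directives.items()]
    lines.append(queue_stmt)         # queue must follow every directive it uses
    return "\n".join(lines) + "\n"


# The skeleton submit file above, expressed as a dictionary.
skeleton = {
    "executable": "hello_world.sh",
    "arguments": "$(ClusterId) $(ProcId)",
    "output": "output/hello.$(ClusterId).$(ProcId).out",
    "error": "error/hello.$(ClusterId).$(ProcId).err",
    "log": "log/hello.$(ClusterId).log",
}

if __name__ == "__main__":
    print(build_submit(skeleton, {"+JobFlavour": '"longlunch"'}))
```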
Choose a time limit
This part replaces the `bsub -q queuename` specification.
As described, the time limit on HTCondor is real (wall-clock) time. Check the real time ("Run time") reported in the email from one of your LSF jobs, then either:
- Choose the next-largest job flavour (as described in the Submit section) and set it in your submit file, for example:

```Ini
+JobFlavour = "longlunch"
```

The possible options are:

```Ini
espresso = 20 minutes
microcentury = 1 hour
longlunch = 2 hours
workday = 8 hours
tomorrow = 1 day
testmatch = 3 days
nextweek = 1 week
```
- Alternatively, set the real run time manually. The advantage of specifying a smaller run time more precisely is that your job is more likely to fit into spare, opportunistic or draining resources, and so may start running sooner. Give the run time in seconds (NNN below):

```Ini
+MaxRuntime = NNN
```
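To automate the first option, the sketch below picks the smallest flavour whose limit covers a measured run time, using the limits from the table above. `choose_flavour` and `flavour_line` are hypothetical helper names; if your run time sits right at a limit, you may prefer the next flavour up for some headroom.

```python
# Job flavours and their wall-clock limits in seconds, from the table above.
FLAVOURS = [
    ("espresso", 20 * 60),
    ("microcentury", 60 * 60),
    ("longlunch", 2 * 3600),
    ("workday", 8 * 3600),
    ("tomorrow", 24 * 3600),
    ("testmatch", 3 * 24 * 3600),
    ("nextweek", 7 * 24 * 3600),
]


def choose_flavour(run_time_s):
    """Return the smallest flavour whose limit is at least run_time_s."""
    for name, limit_s in FLAVOURS:
        if run_time_s <= limit_s:
            return name
    raise ValueError(f"run time {run_time_s}s exceeds the longest flavour (nextweek)")


def flavour_line(run_time_s):
    """Return the submit-file line for the chosen flavour."""
    return f'+JobFlavour = "{choose_flavour(run_time_s)}"'


if __name__ == "__main__":
    print(flavour_line(5400))  # 1.5 hours -> +JobFlavour = "longlunch"
```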
We will soon provide a proper benchmarking setup that will allow you to benchmark your job run times more precisely.
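For the second option, here is a minimal sketch that reads the run time from an LSF job report and turns it into a `+MaxRuntime` line (in seconds). It assumes the report contains a line like `Run time :   1234 sec.` (check your own LSF emails), and the 20% safety margin is an arbitrary illustrative choice, not a Batch Service recommendation. The helper names are hypothetical.

```python
import re

# Assumed format of the LSF report line, e.g. "    Run time :     1234 sec."
RUN_TIME_RE = re.compile(r"Run time\s*:\s*(\d+)\s*sec")


def lsf_run_time_seconds(report_text):
    """Extract the 'Run time' value (in seconds) from an LSF job report."""
    match = RUN_TIME_RE.search(report_text)
    if match is None:
        raise ValueError("no 'Run time' line found in the LSF report")
    return int(match.group(1))


def max_runtime_line(run_time_s, headroom_pct=20):
    """Return a +MaxRuntime line, padding the run time by headroom_pct percent.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    padded = -(-run_time_s * (100 + headroom_pct) // 100)
    return f"+MaxRuntime = {padded}"


if __name__ == "__main__":
    report = "    Run time :                                   1000 sec.\n"
    print(max_runtime_line(lsf_run_time_seconds(report)))
```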
Choose the job limits (number of cores, etc.)
This part replaces the `bsub -n` option (number of cores) and the `-M` option (memory limit).
As with LSF, jobs default to 1 core with 2 GB of memory. You may request a different multiple by specifying the number of cores, as described in the Submit section. To respect the agreed WLCG ratios (which determine what we purchase for lxbatch), you can't choose the number of cores and memory independently.
In the HTCondor "vanilla" universe (normal jobs), all cores are allocated on the same node, so the `span` directive (used to ensure this on LSF) is not needed for HTCondor.
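Because memory comes with cores, a job that needs more than 2 GB has to ask for more cores. The sketch below does that arithmetic under the 2 GB-per-core default stated above. `request_cpus` is the standard HTCondor submit command for the core count, while `cores_for_memory` is a hypothetical helper; check the Submit section for the exact form the service expects.

```python
import math

GB_PER_CORE = 2  # default memory per core stated above (assumed fixed ratio)


def cores_for_memory(mem_gb):
    """Smallest number of cores whose combined memory allowance covers mem_gb."""
    if mem_gb <= 0:
        raise ValueError("memory must be positive")
    return math.ceil(mem_gb / GB_PER_CORE)


def request_cpus_line(mem_gb):
    """Submit-file line requesting enough cores for mem_gb of memory."""
    return f"request_cpus = {cores_for_memory(mem_gb)}"


if __name__ == "__main__":
    print(request_cpus_line(5))  # a 5 GB job needs 3 cores
```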
Choose the STDOUT, STDERR and state log locations
This part replaces the STDOUT and STDERR file handling of LSF and, if used, the `bsub -o` and `bsub -e` options.
The skeleton file above gives the locations of:
- `output`: standard output of your job script (as specified by `executable`)
- `error`: standard error of your job script
- `log`: the HTCondor state log, as described in the Submit guide. Note that this is purely the Condor state log and does not contain any output from your job.
If the specified paths are relative, they are relative to the filesystem directory from which the job is submitted.
Main differences from LSF:
- It was very rare for people to specify these files explicitly with LSF. For HTCondor, they must be specified.
- LSF's automatically created directory of the form `LSFJOB_789482027/` is not created. If a simple filename is specified, the logs are written to the submission directory itself rather than one level below.
- If a relative directory path (or a full file path) is specified, those directories must exist at job submit time (or the submit will fail).
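Since missing directories make the submit fail, a migration script can create them first. This Python sketch (the `ensure_log_dirs` helper is hypothetical) creates the parent directory of each `output`, `error` and `log` path relative to the submission directory. It assumes HTCondor macros such as `$(ClusterId)` appear only in file names, not in directory names.

```python
import os


def ensure_log_dirs(paths, submit_dir="."):
    """Create the parent directories of the given output/error/log paths.

    Relative paths are resolved against submit_dir, matching the rule that
    they are relative to the directory you submit from; absolute paths are
    used as-is by os.path.join. Returns the directories that were ensured.
    """
    created = []
    for path in paths:
        parent = os.path.dirname(path)
        if not parent:
            continue  # bare file name: written straight into submit_dir
        full = os.path.join(submit_dir, parent)
        os.makedirs(full, exist_ok=True)
        created.append(full)
    return created
```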
For now, if you have other `bsub` parameters that you need to migrate to a Condor submit file, please contact the Batch Service team to discuss them. We can document popular ones as they come up.
The general procedure is:
- Look at the `bsub` lines in your code and replace them with code that emits the equivalent submit file (`SUB-FILE`), using the sections above as a guide. Then replace the whole `bsub` command with `condor_submit SUB-FILE`.
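This procedure can be sketched in Python as one function that writes the submit file and another that calls `condor_submit` where the old code called `bsub`. Function names, file names and default values here are illustrative assumptions, and the `output`, `error` and `log` directories still have to exist before submitting.

```python
import subprocess
from pathlib import Path


def write_submit_file(path, executable, arguments, flavour="espresso", cores=1):
    """Write a minimal submit file standing in for a simple bsub call."""
    text = "\n".join([
        f"executable = {executable}",
        f"arguments = {arguments}",
        "output = output/job.$(ClusterId).$(ProcId).out",
        "error = error/job.$(ClusterId).$(ProcId).err",
        "log = log/job.$(ClusterId).log",
        f'+JobFlavour = "{flavour}"',   # replaces bsub -q
        f"request_cpus = {cores}",      # replaces bsub -n
        "queue",
    ]) + "\n"
    Path(path).write_text(text)
    return text


def submit(path):
    """Replace the old bsub invocation: hand the submit file to condor_submit."""
    subprocess.run(["condor_submit", str(path)], check=True)
```

Keeping file generation separate from submission makes it easy to inspect the generated file before sending anything to the batch system.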
Multiple job submits
The previous submit examples show a single `queue` command, which submits a single Condor "cluster" (of jobs) containing just one job; this is equivalent to a single `bsub` command. See [LSF migration: concepts] for the terminology differences ("cluster" vs. "job").
Condor also makes it easy to submit multiple jobs from a single submit file, parameterised by `$(ProcId)` as described in the Submit guide. This is the equivalent of LSF array jobs.
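As a sketch of the array-job equivalent, the hypothetical helper below emits a submit file whose `queue N` statement creates N jobs in one cluster, each receiving its own `$(ProcId)` starting at 0. If your scripts relied on LSF's array index (`$LSB_JOBINDEX`, often starting at 1), you may need to adjust the arithmetic.

```python
def array_submit(executable, n_jobs):
    """Return submit text for n_jobs jobs in one cluster, like an LSF array.

    Each job receives its own $(ProcId), numbered 0 .. n_jobs-1, here
    passed to the executable as its only argument.
    """
    if n_jobs < 1:
        raise ValueError("need at least one job")
    return "\n".join([
        f"executable = {executable}",
        "arguments = $(ProcId)",
        "output = output/job.$(ClusterId).$(ProcId).out",
        "error = error/job.$(ClusterId).$(ProcId).err",
        "log = log/job.$(ClusterId).log",
        f"queue {n_jobs}",
    ]) + "\n"


if __name__ == "__main__":
    print(array_submit("hello_world.sh", 10))
```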