How to benchmark jobs

Why would I need to benchmark my job?

Not all batch worker machines run at the same speed; there is some variation in performance between them, even if it is not usually dramatic. It can therefore be useful to run a job on one of the machines at the lower end of the performance range, to get a worst-case estimate of how long it takes. When you submit a job, you are asked to indicate its expected length, either via +JobFlavour or +MaxRuntime. When that time expires the job is removed, so it is important to understand when that is likely to happen.
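
For reference, the expected length is declared in the submit file with one of these two attributes; the flavour name and runtime value below are only illustrative:

# Either request a named job flavour ...
+JobFlavour = "workday"

# ... or set an explicit maximum runtime (here assumed to be in seconds)
+MaxRuntime = 14400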

Another important point is that when jobs "fail", either by exiting abnormally or by having their time expire, any files they may have created in the sandbox are thrown away. This behaviour can be controlled in order to benchmark or debug jobs.

Warning

It is worth bearing in mind, though, that CPU performance is just one factor affecting the running time of your jobs. Others, such as data transfer, may matter much more: if your job does a lot of I/O, any difference in CPU speed may well be moot compared to whatever affects the speed of reading or writing the data.

How to submit a job to run as a benchmark job

The following can be added to the submit file:

+BenchmarkJob = True
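
In context, this sits alongside the usual submit file contents, for example (the executable and flavour here are placeholders):

executable    = benchmark.sh
log           = bench.log
+JobFlavour   = "workday"
+BenchmarkJob = True
queue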

Once the job has been submitted, you can see the effect this has on the job requirements:

$ condor_q  2414427.0 -af MinDecile -af:r VanillaRequirements
7.84375 ( ( TARGET.HEPSPEC / TARGET.TotalCpus ) <= MY.MinDecile ) && PreBenchmarkRequirements

This targets machines whose HEPSPEC-per-core value is less than or equal to 7.84375, which represents the lowest 10% of machines. The threshold is calculated automatically on the HTCondor schedds.
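
If you want to see the spread of per-core values yourself, and assuming the worker nodes advertise the HEPSPEC and TotalCpus attributes used in the requirements expression above, something along these lines should list the slowest slots first:

$ condor_status -af Machine HEPSPEC TotalCpus | awk '$2 != "undefined" { printf "%s %.2f\n", $1, $2/$3 }' | sort -k2 -n | head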

How to preserve output data of failed / terminated jobs

As mentioned earlier, jobs that exit abnormally - including by running out of time - have their output sandbox thrown away. This can make it difficult to debug the cause of the problem. There are a couple of ways to work around this.

Firstly, as long as jobs are submitted using the shared filesystem (the default, unless you use the -spool option to condor_submit), you can add the following attributes to the submit file:

WHEN_TO_TRANSFER_OUTPUT = ON_EXIT_OR_EVICT
+SpoolOnEvict = False

This should mean that you get back stdout/stderr, plus anything else you specified in transfer_output_files.
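
Putting this together, a submit file for such a debugging run might look like the following sketch (the file names are placeholders, and transfer_output_files is only needed if the job writes extra files you want back):

executable              = benchmark.sh
output                  = bench.out
error                   = bench.err
log                     = bench.log
transfer_output_files   = results.txt
WHEN_TO_TRANSFER_OUTPUT = ON_EXIT_OR_EVICT
+SpoolOnEvict           = False
queue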

The second option is to stream output, so you would add either or both of the following:

stream_output = True
stream_error = True

With streaming enabled, these files are updated more or less as the job writes them. Note, though, that this has an impact on both the schedd and the filesystem, so it should not be done systematically for large numbers of jobs.
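
Because submission via the shared filesystem means the streamed files appear in the submit directory, you can follow them while the job runs, for example (using the placeholder file name from the sketch above):

$ tail -f bench.err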

