Batch Service Concepts
The CERN Batch Service is a fairly standard High Throughput Computing (HTC) Batch System with a fair-sharing mechanism.
Its purpose is to let users queue jobs and to maximise the utilisation of the batch farm (currently around 100k cores), while respecting the fair-share policies agreed by CERN and experiment management.
Users submit a job to the batch system. A job is a discrete unit of work, comprising:
- an executable and optionally some arguments
- some requirements (e.g. memory, operating system)
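As a rough illustration of the two parts listed above, a job can be modelled as a small record. This is only a sketch: the class name `Job`, its field names, and the example values are hypothetical and are not the batch system's actual submission format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """Illustrative job description; all names here are hypothetical."""
    executable: str                                          # program to run
    arguments: List[str] = field(default_factory=list)       # optional arguments
    requirements: Dict[str, str] = field(default_factory=dict)  # e.g. memory, OS


# Example values are invented for illustration only.
job = Job(
    executable="analyse.sh",
    arguments=["--events", "1000"],
    requirements={"memory": "2 GB", "operating_system": "any"},
)
```

Only the executable is mandatory in this sketch; arguments and requirements default to empty, mirroring the "optionally" in the list above.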
The job is accepted by the batch system and placed in a queue. A fair-share system, based on the shares agreed by CERN management and on current utilisation, determines whose job runs next. The larger your agreed share, the more jobs you and your experiment get to run. The more you use the batch service, the more of your share you use up, and the more likely it is that the next job to run will belong to someone else. The aim is to ensure that, on average, the shares set by CERN management for the various experiments are respected.
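The idea can be sketched in a few lines. The following is a deliberately simplified model, not the scheduler's actual algorithm: it assumes each group has a numeric share and a measure of recent usage, and picks the group with queued jobs whose usage is lowest relative to its share. The function name `next_group`, the group names, and all numbers are hypothetical.

```python
def next_group(shares, usage, queued):
    """Return the group whose recent usage is lowest relative to its share.

    shares: group -> agreed share (any positive number)
    usage:  group -> recent usage, in the same units for every group
    queued: group -> number of jobs waiting in the queue
    Returns None when no group has a queued job.
    """
    candidates = [g for g in shares if queued.get(g, 0) > 0]
    if not candidates:
        return None
    # A low usage/share ratio means the group is furthest below its share.
    return min(candidates, key=lambda g: usage.get(g, 0.0) / shares[g])


# Invented example: exp_a holds the larger share.
shares = {"exp_a": 60, "exp_b": 40}
queued = {"exp_a": 5, "exp_b": 5}

# Equal usage: exp_a is further below its share (30/60 < 30/40).
first = next_group(shares, {"exp_a": 30.0, "exp_b": 30.0}, queued)

# After exp_a has used more, exp_b goes next (50/60 > 30/40).
second = next_group(shares, {"exp_a": 50.0, "exp_b": 30.0}, queued)
```

Here `first` is `"exp_a"` and `second` is `"exp_b"`, which shows both effects described above: a larger share earns more turns, and heavy recent usage hands the next turn to someone else.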
Once a job is scheduled for execution, the batch system passes its definition to a spare slot on an execution host (also called a worker node). The job starts and normally runs until it is complete, at which point the user is notified (by default) and the job log, standard output and standard error are written.
At any time, users can query the batch service for the status of their jobs to see which have completed, which are running and which are still waiting in the queue.
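A status query of this kind amounts to grouping a user's jobs by state. The sketch below models the three states mentioned above and counts jobs in each; the names `JobState` and `summarise`, and the job identifiers, are hypothetical and do not correspond to the batch system's real commands or output.

```python
from collections import Counter
from enum import Enum


class JobState(Enum):
    """The three states described in the text (names are illustrative)."""
    WAITING = "waiting"
    RUNNING = "running"
    COMPLETED = "completed"


def summarise(jobs):
    """Count jobs per state; `jobs` maps a job id to its JobState."""
    counts = Counter(jobs.values())
    # Report every state, including those with no jobs.
    return {state.value: counts.get(state, 0) for state in JobState}


# Invented example: four jobs belonging to one user.
my_jobs = {
    1: JobState.COMPLETED,
    2: JobState.RUNNING,
    3: JobState.WAITING,
    4: JobState.WAITING,
}
summary = summarise(my_jobs)
```

For the example jobs, `summary` is `{"waiting": 2, "running": 1, "completed": 1}`.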