AFS

This section covers common issues and recommendations related to the usage of AFS in HTCondor.

Best practices for data access from batch machines

The storage team provides a KB article with a set of best practices for data access from batch machines: KB0003076.

Full AFS folder

A common issue that users can face while launching their jobs is filling up the AFS folder where either logs, output or error files are stored.

This can happen even if there is enough quota in the filesystem. As explained in KB0000042, AFS directories have a maximum number of entries between 16000 and 25000.
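To see how close a directory is to the entry limit described above, you can count its direct entries. The following is a minimal sketch: count_entries and check_dir are hypothetical helper names, and the default threshold of 16000 is simply the lower bound quoted above, used here as an assumption for an early warning.

```shell
#!/bin/sh
# Hypothetical helpers, not part of AFS or HTCondor.

# Count the direct entries (files and subdirectories) of a directory.
count_entries() {
    # -mindepth/-maxdepth restrict find to the immediate children only.
    # File names containing newlines would be over-counted.
    find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Print a warning and return 1 when the directory has reached the threshold.
# The default threshold (16000) is an assumption taken from the lower bound.
check_dir() {
    dir=$1
    limit=${2:-16000}
    n=$(count_entries "$dir")
    if [ "$n" -ge "$limit" ]; then
        echo "WARNING: $dir has $n entries (threshold $limit)"
        return 1
    fi
    echo "OK: $dir has $n entries"
}
```

For example, check_dir /afs/cern.ch/r/random/htcondor/job reports the current count for the folder used as an example in this section.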

When this limit is reached, write operations to the directory fail, which affects HTCondor. If this happens while a user is submitting thousands of jobs, HTCondor keeps trying to initialise the job environment and slows down because of the filesystem errors. This has a very negative impact on the scheduler and can affect other users running jobs on the same machine.

How to solve it

This issue cannot currently be mitigated on the service side, so it requires user intervention.

If you have spotted this issue, or have received an automated notification from the Batch Service, you can apply one of these solutions:

Remove the jobs. This is the recommended option, as in this state HTCondor won't be able to run them properly. To do so, run condor_rm <job_cluster_id>.
If the scheduler cannot handle your request, the quickest way is to move the affected folder and re-create it. For example, if the folder in question is /afs/cern.ch/r/random/htcondor/job:

mv /afs/cern.ch/r/random/htcondor/job /afs/cern.ch/r/random/htcondor/job_full
mkdir /afs/cern.ch/r/random/htcondor/job

Please note that, depending on how the jobs were submitted, this will break their execution. If files required for the job execution were located in the same directory, they should be copied into the re-created directory as well.
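The move, re-create and copy steps can be combined in a small script. This is only a sketch under stated assumptions: rotate_full_dir is a hypothetical helper name, the _full suffix follows the example above, and the list of files to copy back is something you have to supply yourself.

```shell
#!/bin/sh
# Hypothetical helper: move a full directory aside, re-create it empty,
# and copy back the files that the jobs still need.
# Usage: rotate_full_dir DIR [FILE...]   (FILE names are relative to DIR)
rotate_full_dir() {
    dir=${1%/}            # drop a trailing slash, if any
    shift
    full="${dir}_full"
    if [ -e "$full" ]; then
        echo "$full already exists, refusing to overwrite it" >&2
        return 1
    fi
    mv "$dir" "$full" || return 1
    mkdir "$dir" || return 1
    # Copy back only the files named on the command line
    for f in "$@"; do
        cp -p "$full/$f" "$dir/" || return 1
    done
}
```

For instance, rotate_full_dir /afs/cern.ch/r/random/htcondor/job run.sh would keep a (hypothetical) run.sh available in the new, empty folder.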

Recommendations to prevent it

The following recommendations help prevent this issue in HTCondor:

Don't use the same folder for log, output and error in your submit file.
If using macros to expand the log file name, use only $(ClusterId) rather than $(ClusterId).$(ProcId). This doesn't apply to output or error, where ProcId is needed to avoid overwriting files.
Script the submission to create new folders per submission.

These points are particularly useful if you plan to run thousands of jobs or to run the same set of jobs frequently.
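The three recommendations can be combined in a small submission wrapper. The sketch below is an illustration rather than a prescribed layout: prepare_submission is a hypothetical helper, and the sub_ prefix, the log/output/error subfolder names and the job.sub file name are assumptions. It creates a fresh folder per submission, separates log, output and error, and uses only $(ClusterId) in the log file name.

```shell
#!/bin/sh
# Hypothetical helper: create a new folder for each submission and write
# a submit file that keeps log, output and error in separate subfolders.
# Usage: prepare_submission BASE_DIR EXECUTABLE [NUMBER_OF_JOBS]
prepare_submission() {
    base=$1
    executable=$2
    njobs=${3:-1}
    # One directory per submission, named by timestamp and shell PID
    subdir="$base/sub_$(date +%Y%m%d_%H%M%S)_$$"
    mkdir -p "$subdir/log" "$subdir/output" "$subdir/error" || return 1
    # \$ keeps the HTCondor macros literal inside the here-document
    cat > "$subdir/job.sub" <<EOF
executable = $executable
log        = $subdir/log/job.\$(ClusterId).log
output     = $subdir/output/job.\$(ClusterId).\$(ProcId).out
error      = $subdir/error/job.\$(ClusterId).\$(ProcId).err
queue $njobs
EOF
    # Print the path of the generated submit file
    echo "$subdir/job.sub"
}
```

The generated file can then be passed to condor_submit, for example condor_submit "$(prepare_submission /afs/cern.ch/r/random/htcondor ./job.sh 1000)", where ./job.sh stands in for your own executable.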

Last update: June 25, 2020