This section covers common issues and recommendations related to the usage of AFS in HTCondor.
Best practices for data access from batch machines
The storage team provides a KB article with a set of best practices for data access from batch machines: KB0003076.
Full AFS folder
A common issue that users can face when launching their jobs is filling up the AFS folder where their log, output or error files are stored.
This can happen even if there is enough quota in the filesystem. As explained in KB0000042, AFS directories have a maximum number of entries, between 16000 and 25000.
When this limit is reached, write operations to the filesystem will fail, and this can have an impact on HTCondor. If this situation occurs while a user is submitting thousands of jobs, HTCondor will try to initialise each job's environment and will slow down due to the filesystem errors. This has a very negative impact on the scheduler, potentially affecting other users running jobs on the same machine.
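As a quick sanity check, you can count the entries in a directory from the command line before it approaches the limit. The snippet below is a sketch that builds a throwaway demo directory under /tmp; for a real check, point JOBDIR at your own AFS job folder instead.

```shell
# Hypothetical directory for demonstration; replace with your AFS job folder.
JOBDIR="${JOBDIR:-/tmp/afs_entry_demo}"

# Demo setup only: create a directory holding 100 files.
mkdir -p "$JOBDIR"
for i in $(seq 1 100); do touch "$JOBDIR/file_$i"; done

# Count the entries. AFS directories stop accepting new entries
# somewhere between 16000 and 25000 of them.
ENTRIES=$(ls -A "$JOBDIR" | wc -l)
echo "Entries in $JOBDIR: $ENTRIES"
```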
How to solve it
This is an issue that cannot be mitigated on the service side at the moment, so it requires user intervention.
If you have spotted this issue, or you have received an automated notification from the Batch Service, you can apply one of the following solutions:
Remove the jobs. This is the recommended option, as in this state HTCondor won't be able to run them properly. For that, you can use condor_rm with the ClusterId of the affected jobs.
If the scheduler cannot handle your request, the quickest way is to move the affected folder aside and re-create it. For example, if the folder in question is /afs/cern.ch/r/random/htcondor/job:
mv /afs/cern.ch/r/random/htcondor/job /afs/cern.ch/r/random/htcondor/job_full
mkdir /afs/cern.ch/r/random/htcondor/job
Please note that, depending on how the jobs were submitted, this may break their execution. If files required for the job execution were located in the same directory, they should be copied over as well.
Recommendations to prevent it
There are some recommendations to prevent this issue from happening in HTCondor. These are the main ones:
- Don't use the same folder for the log, output and error files defined in your submit file.
- If using macros to expand the log file name, use only $(ClusterId). This doesn't apply to the output and error files, where $(ProcId) is also needed to avoid overwriting files.
- Script the submission to create new folders per submission.
These points are particularly useful if you plan to run thousands of jobs or to run the same set of jobs frequently.
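As an illustration, these recommendations can be combined in a small submission wrapper. This is a sketch only: the executable name, folder layout and queue count are hypothetical examples, and the actual condor_submit call is left commented out.

```shell
#!/bin/sh
# Sketch: create a fresh folder per submission so that no single AFS
# directory accumulates entries across many submissions.
SUBMIT_DIR="htcondor/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$SUBMIT_DIR/log" "$SUBMIT_DIR/out" "$SUBMIT_DIR/err"

# Write a submit file using separate folders for log, output and error,
# $(ClusterId) alone for the shared log, and $(ClusterId).$(ProcId)
# for the per-job output and error files.
cat > "$SUBMIT_DIR/job.sub" <<EOF
executable = job.sh
log        = $SUBMIT_DIR/log/\$(ClusterId).log
output     = $SUBMIT_DIR/out/\$(ClusterId).\$(ProcId).out
error      = $SUBMIT_DIR/err/\$(ClusterId).\$(ProcId).err
queue 100
EOF

echo "Submit file written to $SUBMIT_DIR/job.sub"
# To submit the jobs:
# condor_submit "$SUBMIT_DIR/job.sub"
```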