Common problems

This page aims to provide simple instructions for the most common problems that might occur during job submission and execution.

Wrong shebang

Providing a valid shebang is mandatory when submitting jobs whose executable is a script.

If your script doesn't contain a valid shebang, you may experience the following symptoms:

  • Your job finishes successfully in a few seconds but there is no output or logs.
  • Your job is put on hold, with a message that the expected output files were not found.

Valid shebangs are: #!/bin/bash, #!/usr/bin/env python, #!/bin/sh, etc.

Invalid shebangs: #! /bin/bash (extra space), #!/bin/typo (wrong path).

Note that the shebang must be the very first line of your script.

A quick way to catch most of these issues is to run the file command and confirm that its output contains the word executable:

lxplus $ cat valid.sh
#!/bin/bash
echo "Hello"
lxplus $ file valid.sh
valid.sh: Bourne-Again shell script, ASCII text executable

lxplus $ cat invalid.sh

#!/bin/bash
echo "Hello"
lxplus $ file invalid.sh
invalid.sh: ASCII text
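Besides file, you can also check the first bytes of the script directly with head. A minimal sketch (the helper name and the filenames are just examples):

```shell
# has_shebang FILE -- succeed if FILE begins with the two characters "#!",
# i.e. the shebang is on the very first line with nothing before it
has_shebang() {
    [ "$(head -c 2 "$1")" = "#!" ]
}

# Example usage (assumes valid.sh and invalid.sh as shown above):
# has_shebang valid.sh   && echo "shebang present"
# has_shebang invalid.sh || echo "shebang missing or not on the first line"
```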

Incompatible end-of-line encoding

When submitting jobs whose executables are scripts, special attention must be paid to the encoding of the file. Scripts with CRLF (Windows-style) line terminators won't behave as expected: even though the job appears to complete successfully, the script itself won't actually run.

The symptoms of this problem are:

  • Your job finishes successfully in a few seconds but there is no output or logs.
  • Your job is put on hold, with a message that the expected output files were not found.
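The underlying failure can be reproduced locally: with a CRLF-terminated shebang, the kernel looks for an interpreter literally named "/bin/bash\r", which does not exist. A minimal sketch (the filename is just an example):

```shell
# Create a script whose lines end in CRLF (\r\n), as a Windows editor would
printf '#!/bin/bash\r\necho "Hello"\r\n' > crlf.sh
chmod +x crlf.sh

# Running it fails with a "bad interpreter" error instead of printing Hello
./crlf.sh || echo "script failed to run"
```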

The easiest way to validate your script depends on the system you're running on; the following snippets show both SLC6 and CentOS7 as examples.

On SLC6, you can use grep in the following way:

lxplus $ grep -IUlr $'\r' invalid.sh
invalid.sh
lxplus $

lxplus $ grep -IUlr $'\r' valid.sh
lxplus $

On CentOS7, you can run the file command and confirm that its output does not mention CRLF line terminators.

lxplus7 $ file invalid.sh
invalid.sh: Bourne-Again shell script, ASCII text executable, with CRLF line terminators

lxplus7 $ file valid.sh
valid.sh: Bourne-Again shell script, ASCII text executable

The solution is to use the tool dos2unix to sanitize the file:

lxplus7 $ file invalid.sh
invalid.sh: Bourne-Again shell script, ASCII text executable, with CRLF line terminators

lxplus7 $ dos2unix invalid.sh
dos2unix: converting file invalid.sh to Unix format ...

lxplus7 $ file invalid.sh
invalid.sh: Bourne-Again shell script, ASCII text executable
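If dos2unix happens to be unavailable, the same conversion can be done with standard tools such as GNU sed. A minimal sketch (the filename is just an example):

```shell
# Example file with Windows (CRLF) line endings
printf '#!/bin/bash\r\necho "Hello"\r\n' > crlf_example.sh

# Strip the trailing carriage return from every line, in place (GNU sed)
sed -i 's/\r$//' crlf_example.sh

# Verify: no carriage returns remain in the file
grep -q $'\r' crlf_example.sh || echo "clean"
```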

Can't find address of local schedd

Sometimes you might encounter the following issue when submitting a new job or querying the status of your current jobs (condor_q):

Error: Can't find address of local schedd

Extra Info: You probably saw this error because the condor_schedd is not
running on the machine you are trying to query. If the condor_schedd is not
running, the Condor system will not be able to find an address and port to
connect to and satisfy this request. Please make sure the Condor daemons are
running and try again.

Extra Info: If the condor_schedd is running on the machine you are trying to
query and you still see the error, the most likely cause is that you have
setup a personal Condor, you have not defined SCHEDD_NAME in your
condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
setting. You must define either or both of those settings in your config
file, or you must use the -name option to condor_q. Please see the Condor
manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.

This usually means that your current Kerberos credentials have expired and you need to obtain new Kerberos tickets on the submit machine. You can run some of the following commands to confirm the problem:

Normal output:

$ condor_status -schedd
Name               Machine            RunningJobs   IdleJobs   HeldJobs

bigbird01.cern.ch  bigbird01.cern.ch          960        371        785
bigbird02.cern.ch  bigbird02.cern.ch         2916         26       2501
bigbird03.cern.ch  bigbird03.cern.ch          434         45         99
bigbird04.cern.ch  bigbird04.cern.ch         4507        275         10
...

Problematic output:

$ condor_status -schedd
Error: communication error
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using KERBEROS

Normal output:

$ klist
Ticket cache: FILE:/tmp/krb5cc_...
Default principal: user@CERN.CH

Valid starting       Expires              Service principal
12/19/2017 11:04:55  12/20/2017 11:19:14  krbtgt/CERN.CH@CERN.CH
    renew until 12/24/2017 10:19:14
...

Problematic output:

$ klist
klist: No credentials cache found (filename: /tmp/krb5cc_...)

In either case, running the kinit command should resolve the issue. If, on the other hand, you do have valid Kerberos tickets but the problem persists, please let us know by opening a CERN Service Now ticket.

Busy Schedd hosts

On certain occasions, users might encounter an error like the following:

-- Failed to fetch ads from: <... : bigbirdxx.cern.ch
SECMAN:2007:Failed to end classad message.

This is usually due to heavy load, either on a specific schedd host or on the central manager. In exceptional cases it might be caused by a centralized outage delaying the whole system. If you encounter this error, please inform us by opening a CERN Service Now ticket.

If the cause of the problem is a busy schedd host, you may wish to move yourself to a different schedd. To see how to do this, please refer to the documentation on myschedd.

next_job_start_delay is not allowed

If you're submitting jobs containing the HTCondor submission option next_job_start_delay, or adding the job attribute NextJobStartDelay, the scheduler will reject them with an error like this:

ERROR: Failed to commit job submission into the queue.
ERROR: Setting next_job_start_delay or +NextJobStartDelay in submission is not allowed

This option can have a serious impact on the scheduler and other people's jobs, so its usage is not allowed. The purpose of this flag was to protect the scheduler, and this throttling is now configured by the administrators instead.

If you're unsure about whether you need this option or not, you can contact our support line.

In case you were using this option to control the number of jobs running simultaneously, we recommend using max_materialize in your submit file instead:

Example:

executable = test.sh

max_materialize = 10

queue 1000

The previous example will send 1000 jobs to the scheduler, but HTCondor will only materialize 10 of them in the queue at a time. This means that at most 10 jobs will run simultaneously. As soon as jobs complete, HTCondor automatically materializes new ones, always respecting the max_materialize limit.
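In a real submit file, max_materialize is typically combined with the usual output options. A sketch along those lines (the output, error, and log filenames are just examples):

```
executable      = test.sh
output          = out/$(ClusterId).$(ProcId).out
error           = err/$(ClusterId).$(ProcId).err
log             = log/$(ClusterId).log

# Keep at most 10 jobs materialized in the queue at any time
max_materialize = 10

queue 1000
```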