This page covers common job failures and ways to troubleshoot and correct them. It describes general troubleshooting techniques as well as specific common errors and their fixes.
General troubleshooting techniques
The condor_q command has several options that can be used to diagnose why jobs are not running or are in the held state. The first are the -analyze and -better-analyze options, which report which available job slots a given job matches. If the job had errors during execution, these options also give more detailed messages about the errors that occurred.
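As a quick sketch, these options are invoked on a job ID as follows (the job ID here is illustrative):

```shell
# Summarize why a given job is idle or held
condor_q -analyze 371156.0

# Produce a more detailed matching analysis, including
# which slot requirements the job fails to satisfy
condor_q -better-analyze 371156.0
```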
The condor_ssh_to_job command allows the user to ssh to the compute node that is running a specified job_id. Once the command is run, the user is placed in the job's working directory and can examine the job's environment and run commands. The _condor_stdout and _condor_stderr files contain the job's current stdout and stderr output. Note that this command requires that the site running the job allow users to ssh to it; most sites allow this, but some do not.
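A minimal sketch of this workflow, assuming the site permits ssh-to-job (the job ID is illustrative):

```shell
# Open an interactive shell in the running job's working directory
condor_ssh_to_job 371156.0

# Once inside, inspect the job's live stdout and stderr
tail _condor_stdout _condor_stderr
```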
Jobs not matching
If submitted jobs remain in the idle state and never start, there is usually an issue with the job requirements that prevents the job from being matched with an available resource. Users can troubleshoot this by running
condor_q -better-analyze jobid and then examining the output. For example:
The output clearly indicates that the job did not match any resources. Additionally, looking through the conditions listed makes it apparent that the job requires Scientific Linux 10 (target.OpSysMajorVer == 10), which won't be matched by available resources. Looking at the submit file for the job shows the following requirement:
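Based on the condition reported above, the offending line in the submit file likely resembles the following (the exact expression in a real submit file may differ):

```
requirements = (OpSysMajorVer == 10)
```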
This can be corrected in two ways. The entire job cluster can be removed using
condor_rm 371156, followed by editing the submit file and resubmitting. Alternatively, condor_qedit can be used to change the requirements of jobs that are already submitted:
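A sketch of both approaches; the cluster ID comes from the example above, and the replacement requirements expression is illustrative and should match what your pool actually offers:

```shell
# Option 1: remove the whole cluster, fix the submit file, then resubmit
condor_rm 371156

# Option 2: edit the Requirements expression of the queued jobs in place
condor_qedit 371156 Requirements '(OpSysMajorVer == 7)'
```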
Job output missing
If a job's submit file uses the transfer_output_files setting to indicate that HTCondor should transfer files back after the job completes, HTCondor will put the job in the held state if that output is missing. Running condor_q -analyze on the job shows this in the error message:
The important parts are the message indicating that HTCondor couldn't transfer the output back (
SHADOW failed to receive file(s)) and, just before it, the name of the file or directory that HTCondor couldn't find. This failure is usually due to your application encountering an error while executing and exiting before writing its output files. If you think the error is transient and won't recur, you can run
condor_release job_id to requeue the job. Alternatively, you can use
condor_ssh_to_job job_id to examine the job environment and investigate further.
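Both follow-up actions look like this in practice (the job ID is illustrative):

```shell
# Release the held job so HTCondor reschedules it
condor_release 371156.0

# Or inspect the job's sandbox and environment before deciding
condor_ssh_to_job 371156.0
```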