Job or queue errors are indicated by an uppercase E in the qstat output.
A job enters the error state when the grid engine system tries to run a job but fails for a reason that is specific to the job.
A queue enters the error state when the grid engine system tries to run a job but fails for a reason that is specific to the queue.
The grid engine system offers a set of possibilities for users and administrators to gather diagnosis information in case of job execution errors. Both the queue and the job error states result from a failed job execution. Therefore the diagnosis possibilities are applicable to both types of error states.
User abort mail. If jobs are submitted with the qsub -m a command, abort mail is sent to the address specified with the -M user[@host] option. The abort mail contains diagnosis information about job errors. Abort mail is the recommended source of information for users.
qacct accounting. If no abort mail is available, the user can run the qacct -j command. This command gets information about the job error from the grid engine system's job accounting function.
Administrator abort mail. An administrator can order administrator mails about job execution problems by specifying an appropriate email address. See under administrator_mail on the sge_conf(5) man page. Administrator mail contains more detailed diagnosis information than user abort mail. Administrator mail is the recommended method in case of frequent job execution errors.
Messages files. If no administrator mail is available, you should investigate the qmaster messages file first. You can find entries that are related to a certain job by searching for the appropriate job ID. In the default installation, the qmaster messages file is sge-root/cell/spool/qmaster/messages.
You can sometimes find additional information in the messages of the execd daemon from which the job was started. Use qacct -j job-id to discover the host from which the job was started, and search in sge-root/cell/spool/host/messages for the job ID.