The grid engine system offers several reporting methods to help you diagnose problems. The following sections outline their uses.
Sometimes a pending job is obviously capable of being run, but the job does not get dispatched. To diagnose the reason, the grid engine system offers a pair of utilities and options, qstat -j job-id and qalter-w v job-id.
qstat -j job-id
When enabled, qstat -j job-id provides a list of reasons why a certain job was not dispatched in the last scheduling run. This monitoring can be enabled or disabled. You might want to disable monitoring because it can cause undesired communication overhead between the sge_schedd daemon and sge_qmaster. See schedd_job_info in the sched_conf(5) man page. The following example shows output for a job with the ID 242059:
% qstat -j 242059 scheduling info: queue "fangorn.q" dropped because it is temporarily not available queue "lolek.q" dropped because it is temporarily not available queue "balrog.q" dropped because it is temporarily not available queue "saruman.q" dropped because it is full cannot run in queue "bilbur.q" because it is not contained in its hard queuelist (-q) cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q) has no permission for host "ori" |
This information is generated directly by the sge_schedd daemon. The generating of this information takes the current usage of the cluster into account. Sometimes this information does not provide what you are looking for. For example, if all queue slots are already occupied by jobs of other users, no detailed message is generated for the job you are interested in.
qalter -w v job-id
This command lists the reasons why a job is not dispatchable in principle. For this purpose, a dry scheduling run is performed. All consumable resources, as well as all slots, are considered to be fully available for this job. Similarly, all load values are ignored because these values vary.
Job or queue errors are indicated by an uppercase E in the qstat output.
A job enters the error state when the grid engine system tries to run a job but fails for a reason that is specific to the job.
A queue enters the error state when the grid engine system tries to run a job but fails for a reason that is specific to the queue.
The grid engine system offers a set of possibilities for users and administrators to gather diagnosis information in case of job execution errors. Both the queue and the job error states result from a failed job execution. Therefore the diagnosis possibilities are applicable to both types of error states.
User abort mail. If jobs are submitted with the qsub -m a command, abort mail is sent to the address specified with the -M user[@host] option. The abort mail contains diagnosis information about job errors. Abort mail is the recommended source of information for users.
qacct accounting. If no abort mail is available, the user can run the qacct -j command. This command gets information about the job error from the grid engine system's job accounting function.
Administrator abort mail. An administrator can order administrator mails about job execution problems by specifying an appropriate email address. See under administrator_mail on the sge_conf(5) man page. Administrator mail contains more detailed diagnosis information than user abort mail. Administrator mail is the recommended method in case of frequent job execution errors.
Messages files. If no administrator mail is available, you should investigate the qmaster messages file first. You can find entries that are related to a certain job by searching for the appropriate job ID. In the default installation, the sge_qmaster messages file is sge-root/cell/spool/qmaster/messages.
You can sometimes find additional information in the messages of the sge_execd daemon from which the job was started. Use qacct -j job-id to discover the host from which the job was started, and search in sge-root/cell/spool/host/messages for the job ID.