You can use the information on the Queue Alerts page to troubleshoot any queue problems. You access this page from the Alerts table on the Overview page. Queue alerts are generated when the Queue Resource Limit parameters defined using the queue_conf command are exceeded.
The three types of queue alerts are:
Warnings – When resource limits are exceeded, a warning can be generated before a queue is disabled.
Errors – Errors are generated when a queue makes an invalid request.
Disabled – After a queue receives a set number of warnings, it is aborted once the notification time defined by the notify parameter in the queue configuration has passed.
The Queue states are:
a (alarm) – At least one of the load thresholds defined in the load_thresholds list of the queue configuration is currently exceeded. This state prevents N1GE from scheduling further jobs to that queue. For more information, see the queue_conf man page.
A (Alarm) – At least one of the suspend thresholds of the queue is currently exceeded. This state causes jobs running in that queue to be successively suspended until no threshold is violated. For more information, see the queue_conf man page.
c (configuration ambiguous) – The queue instance configuration specified using sge_conf is ambiguous. The state resolves when the configuration becomes unambiguous again. While the state persists, no further jobs are scheduled to that queue instance, and the cluster queue's default settings are used for the ambiguous attribute. You can find detailed reasons why a queue instance entered this state in the sge_qmaster messages file, or by using the qstat command with the -explain option.
C (Calendar suspended) – The queue has been suspended automatically by the N1GE calendar facility. See the calendar_conf man page for more information.
d (disabled) – Assigned to queues and released using the qmod command. Disabling a queue prevents further jobs from being dispatched to it; jobs already running in that queue are allowed to finish.
D (Disabled) – The queue has been disabled automatically by the N1GE calendar facility. See the calendar_conf man page for more information.
E (Error) – This setting appears when the N1GE daemon (sge_execd) on that host was unable to locate the sge_shepherd executable on that host in order to start a job. Check that daemon's error log for information on how to resolve the problem. After resolving it, enable the queue again using the qmod command with the -c option.
o (orphaned) – The current cluster queue configuration and host group configuration no longer need this queue instance. The queue instance is kept because unfinished jobs are still associated with it. The orphaned state prevents further jobs from being scheduled to that queue instance, and the instance disappears from qstat output when these jobs finish. To clear an orphaned queue instance sooner, use the qdel command to remove the jobs associated with it. You can revive an orphaned queue instance by changing the cluster queue configuration so that the configuration covers that queue instance.
s (suspended) – Assigned to queues and released using the qmod command. Suspending a queue suspends all jobs executing in that queue.
S (Subordinate) – The queue has been suspended due to subordination to another queue. See the queue_conf man page for details. When a queue is suspended, regardless of the cause, all jobs executing in that queue are suspended too.
u (unknown) – The corresponding sge_execd(8) cannot be contacted.
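The state letters above can appear in combination in a queue's state column (for example, au for a queue that is in alarm and whose sge_execd cannot be contacted). The following is a minimal sketch, in Python, of how such a state string might be expanded into readable descriptions. The QUEUE_STATES table and the decode_queue_state function are hypothetical names introduced for illustration only; they are not part of N1GE, and the short descriptions are paraphrased from the list above.

```python
# Hypothetical helper for illustration; not part of N1GE.
# Maps each queue state letter described above to a short description.
QUEUE_STATES = {
    "a": "alarm (load threshold exceeded)",
    "A": "Alarm (suspend threshold exceeded)",
    "c": "configuration ambiguous",
    "C": "Calendar suspended",
    "d": "disabled (set with qmod)",
    "D": "Disabled (set by calendar)",
    "E": "Error",
    "o": "orphaned",
    "s": "suspended (set with qmod)",
    "S": "Subordinate (suspended by subordination)",
    "u": "unknown (sge_execd cannot be contacted)",
}


def decode_queue_state(state):
    """Return one description per letter in `state`, in the order given.

    Letters that are not in QUEUE_STATES are reported rather than dropped,
    so an unexpected state is still visible to the reader.
    """
    return [
        QUEUE_STATES.get(letter, "unrecognized state letter " + repr(letter))
        for letter in state
    ]


if __name__ == "__main__":
    # Example: a queue reported with the combined state "au".
    for description in decode_queue_state("au"):
        print(description)
```

Because the letters are case-sensitive (a and A, d and D, and so on have different meanings), the lookup deliberately does not normalize case.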