Chapter 8 Troubleshooting N1 Grid Engine

This chapter tells you how to use the various alerts and the N1 Grid Engine daemon logs to troubleshoot a grid.

Using N1 Grid Engine Daemon Logs

You use the N1 Grid Engine Daemon Logs page to see a historical view of all the messages logged by the various N1 Grid Engine daemons. To see the log file for a particular host, click its host name. To see the log files for the system hosting the queue, click on a name in the QMASTER column.

Figure 8–1 Daemon Logs List Page

This page shows you the list of available daemon logs.

The log file for a particular host contains fields for a Flag, a Time Stamp, and a Message. The flag tells you what kind of message was logged. Flags exist for the following message types:

N (notice) – for informational purposes
I (info) – for informational purposes
W (warning)
E (error) – An error condition has been detected
C (critical) – A condition that can lead to a program abort

Use the loglevel parameter in the cluster configuration to specify on a global basis or a local basis what message types you want to log.
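As an illustration of how the flag field can be used when reading a log outside the page, here is a minimal Python sketch that keeps only entries at or above a chosen flag. The pipe-delimited `timestamp|flag|message` layout, the severity ranking, and the names `parse_line` and `filter_log` are assumptions made for this example, not the documented file format. Adjust the parsing to match the actual messages files on your hosts.

```python
# Assumed ranking of the flags listed above. N and I are both described as
# informational, so they share the lowest rank in this sketch.
SEVERITY = {"N": 0, "I": 0, "W": 1, "E": 2, "C": 3}


def parse_line(line):
    """Split an assumed 'timestamp|flag|message' line into a dict.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3 or parts[1] not in SEVERITY:
        return None
    timestamp, flag, message = parts
    return {"time": timestamp, "flag": flag, "message": message}


def filter_log(lines, min_flag="W"):
    """Keep parsed entries whose flag is at least as severe as min_flag."""
    floor = SEVERITY[min_flag]
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and SEVERITY[e["flag"]] >= floor]


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "03/15/2006 10:00:01|I|hypothetical informational text",
        "03/15/2006 10:05:42|E|hypothetical error text",
        "03/15/2006 10:06:10|C|hypothetical critical text",
    ]
    for entry in filter_log(sample, "E"):
        print(entry["flag"], entry["time"], entry["message"])
```

Filtering after the fact, as in this sketch, is separate from the loglevel parameter, which controls what the daemons write in the first place.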

Troubleshooting Queues

You can use the information on the Queue Alerts page to troubleshoot queue problems. You access this page from the Alerts table on the Overview page. Queue alerts are generated when the Queue Resource Limit parameters defined in the queue configuration are exceeded. For more information, see the queue_conf man page.

Figure 8–2 Queue Alerts List Page

This page shows you the list of queue alerts.

The three types of queue alerts are:

Warnings – When resource limits are exceeded, a warning can be generated before a queue is disabled.
Errors – Errors are generated when a queue makes an invalid request.
Disabled – After a queue receives a set number of warnings, it is aborted once the notification time defined by the notify parameter in the queue configuration has passed.

The Queue states are:

a (alarm) – At least one of the load thresholds defined in the load_thresholds list of the queue configuration is currently exceeded. This state prevents N1GE from scheduling further jobs to that queue. For more information, see the queue_conf man page.
A (Alarm) – At least one of the suspend thresholds of the queue is currently exceeded. This state causes jobs running in that queue to be successively suspended until no threshold is violated. For more information, see the queue_conf man page.
c (configuration ambiguous) – The queue instance configuration specified using sge_conf is ambiguous. The state resolves when the configuration becomes unambiguous again. This state prevents you from scheduling further jobs to that queue instance. You can find detailed reasons why a queue instance entered this state in the sge_qmaster messages file. You can also see the reasons using the qstat command with -explain. For queue instances in this state, the cluster queue's default settings are used for the ambiguous attribute.
C (Calendar suspended) – The queue has been disabled or suspended automatically using the N1GE calendar facility. See the calendar_conf man page for more information.
d (disabled) – Assigned to queues and released using the qmod command. No further jobs are dispatched to a disabled queue, but jobs already executing in that queue are allowed to finish.
D (Disabled) – The queue has been disabled or suspended automatically using the N1GE calendar facility. See the calendar_conf man page for more information.
E (Error) – This state appears when the N1GE daemon (sge_execd) on that host was unable to locate the sge_shepherd executable on that host in order to start a job. Check that daemon's error log for information about how to resolve the problem. Enable the queue afterwards using the qmod command with the -c option.
o (orphaned) – The current cluster queue configuration and host group configuration no longer need this queue instance. The queue instance is kept because unfinished jobs are still associated with it. The orphaned state prevents you from scheduling further jobs to that queue instance. The queue instance disappears from qstat output when these jobs finish. To help resolve an orphaned queue instance associated with a job, use the qdel command. You can revive an orphaned queue instance by changing the cluster queue configuration so that the configuration covers that queue instance.
s (suspended) – Assigned to queues and released using the qmod command. Suspending a queue suspends all jobs executing in that queue.
S (Subordinate) – The queue has been suspended due to subordination to another queue. See the queue_conf man page for details. When a queue is suspended, regardless of the cause, all jobs executing in that queue are suspended too.
u (unknown) – The corresponding sge_execd daemon cannot be contacted.
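The state letters above lend themselves to a simple lookup table. The following Python sketch turns a state string into readable descriptions. It assumes that a queue can report several letters at once (for example "au"), and the names `QUEUE_STATES` and `describe_states` are hypothetical helpers written for this example, not part of N1 Grid Engine.

```python
# Short descriptions condensed from the list of queue states above.
QUEUE_STATES = {
    "a": "alarm: a load threshold is exceeded",
    "A": "Alarm: a suspend threshold is exceeded",
    "c": "configuration ambiguous",
    "C": "Calendar suspended",
    "d": "disabled (set and released with qmod)",
    "D": "Disabled (calendar facility)",
    "E": "Error",
    "o": "orphaned",
    "s": "suspended (set and released with qmod)",
    "S": "Subordinate: suspended by another queue",
    "u": "unknown: sge_execd cannot be contacted",
}


def describe_states(state_string):
    """Return one description per letter in an assumed multi-letter state string."""
    return [
        QUEUE_STATES.get(letter, f"unrecognized state letter {letter!r}")
        for letter in state_string
    ]


if __name__ == "__main__":
    for description in describe_states("au"):
        print(description)
```

The lookup is case sensitive on purpose, because lowercase and uppercase letters denote different states.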

Troubleshooting Hosts

You can see potential host problems from the Host Alerts page. This page is available from the Alerts table on the Overview page.

Figure 8–3 Hosts Alerts List Page

This page shows you the list of host alerts.

You can set an alarm on each of the following host alert parameters so that, if the parameter passes a specified threshold, an alert is generated and appears in the Alerts table on the Overview page.

Load Per CPU – Shows how efficiently the host's CPU is being used. This parameter can be any positive decimal number but is usually between 0 and 2 or 3. Ideally, this number should be close to 1. A smaller number could mean the host is underutilized, and a larger number could mean the host is overutilized. The ideal value depends on the workload that is being run. Only the local administrator can really know the implications of the workload.
Used Mem. – The percentage of total memory currently being used to execute jobs. If the used memory is too close to the total memory, then the host could be in trouble. However, if the workloads are tuned to fit in the server, then it could be perfectly fine that the used memory is just under the total memory. In fact, this is tunable. You can set the value at which the difference between these two parameters triggers an alarm. So, in one case, a difference of less than 100 MB triggers a warning, while in another case it could be at 25 MB.
Total Mem. – The total amount of memory on this host.
Swap Used – The amount of free swap space left on this host measured in MBs. In a well-architected grid, the free swap space should never drop very far below its initial value. It is possible that temporary drops in this value can be tolerated depending on how the grid is architected. If this value goes close to zero, then the host is in danger of failing completely.
Date/Time – The timestamp for when the alert was generated.
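The threshold idea behind these host alerts can be sketched in a few lines of Python. This is only an illustration: the function names, the use of megabytes for both memory values, and the default limits are assumptions chosen for the example, and the real thresholds are whatever you configure for your grid.

```python
def memory_alert(total_mb, used_mb, min_headroom_mb=100):
    """Return True when used memory is within min_headroom_mb of total memory.

    min_headroom_mb is the tunable difference described above; 100 MB and
    25 MB are the two example values mentioned in the text.
    """
    return (total_mb - used_mb) < min_headroom_mb


def load_alert(load_per_cpu, high_mark=2.0):
    """Return True when load per CPU exceeds an assumed upper mark.

    The text only says values near 1 are ideal and that the right value
    depends on the workload, so high_mark is a placeholder to tune.
    """
    return load_per_cpu > high_mark


if __name__ == "__main__":
    # A hypothetical host with 4096 MB total and 4050 MB in use.
    print(memory_alert(4096, 4050))        # 46 MB headroom, below 100 MB
    print(memory_alert(4096, 4050, 25))    # 46 MB headroom, above 25 MB
    print(load_alert(0.9))
```

Keeping the limits as parameters mirrors the point made above: the same reading can be acceptable on one host and alarming on another.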

Troubleshooting Jobs

You can view potential job problems from the Job Alerts page. This page is available from the Alerts table on the Overview page. The Pending Time and Deadline job alert parameters can be alarmed so that if the values pass a specified threshold, an alert will be generated and appear on the Overview Alerts table.

Figure 8–4 Job Alerts List Page

This page shows you the list of job alerts.

The Job Alerts page shows the following information:

Job ID – The unique identifier for the job. Clicking on the Job ID brings you to the Job Details page.
Task – The currently executing task. Some jobs consist of a single task, in which case the task ID is always 1. However, parallel jobs and array jobs each consist of more than one task. The tasks are usually numbered in ascending order starting with 1. Depending on how the job was submitted, the numbers might skip, as in 1, 3, 5. In a running job, each task runs separately and so has its own configuration information, environment, and trace. For details about a task, click the task number to display the Task Details page.
Job Name – The name assigned to the job.
Pending time – How long the job has been waiting to be assigned to a queue.
Deadline – The specified time by which a job must start. If the job has not started by then, an alarm is generated.

See the qstat man page for more information about alarms and thresholds.
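To make the Pending Time and Deadline checks concrete, here is a small Python sketch of the comparison involved. The function name `job_alerts`, its parameters, and the one-hour default are hypothetical. The sketch only mirrors the definitions above (time spent waiting, and a deadline by which the job must start) and does not reflect how the product computes its alerts.

```python
from datetime import datetime, timedelta


def job_alerts(submitted, now, started=None,
               max_pending=timedelta(hours=1), deadline=None):
    """Return alert labels for a job, based on assumed thresholds.

    A job that has already started raises no alert in this sketch, because
    both parameters concern the time before the job starts.
    """
    alerts = []
    if started is None:
        if now - submitted > max_pending:
            alerts.append("pending time exceeded")
        if deadline is not None and now > deadline:
            alerts.append("deadline passed")
    return alerts


if __name__ == "__main__":
    # Hypothetical submission time, checked two hours later.
    submitted = datetime(2006, 3, 15, 10, 0)
    now = submitted + timedelta(hours=2)
    print(job_alerts(submitted, now, deadline=submitted + timedelta(hours=1)))
```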
