Sun N1 System Manager 1.3 Grid Engine Provisioning and Monitoring Guide

Chapter 6 Working With N1 Grid Engine Queues

This chapter describes how to access information about a grid's queues. You can see a general picture of the performance health of all the queues and view details about a particular queue.

Monitoring Queues

Queue information is available from the Queue Summary tab. You use this page to see whether a queue is functioning and how efficiently it is performing. From this page you can also view extensive details on any queue.

A queue in the N1GE environment is a means of defining a job's execution environment. This context includes features like:

job runtime limits (memory, stack, and CPU time)
control action methods (how to suspend and resume the job)
virtual job container (Solaris, Linux, or MS–Windows resource pools)

A queue instance is the portion of the queue that exists on a single host.

The information in this tab is presented in a table of queue instances, that is, the portion of the queue that runs on a particular host. Every queue instance that exists in the grid is listed.

Figure 6–1 Queue Summary Page

This page shows you the list of available queues.

The Queue Summary page shows the following information:

Queue – The queue name. To see more detailed information on any queue, click the queue instance name.
Status – Describes whether this queue instance is running, suspended (manually or automatically in the case of an error), or waiting for a required resource to become available or a condition to be met. If a queue instance is suspended or waiting, you may want to view more queue details.
Used Slots – The number of slots this queue instance is currently using.
Total Slots – The number of slots defined for this queue instance. Slots are the maximum number of jobs that a queue can run simultaneously.
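As a rough illustration of how the Used Slots and Total Slots columns relate, the following sketch computes the fraction of slots in use for one queue instance. The function name and the sample figures are hypothetical; they are not part of the N1 System Manager or N1GE software.

```python
def slot_utilization(used_slots: int, total_slots: int) -> float:
    """Return the fraction of a queue instance's slots that are in use."""
    if used_slots < 0 or total_slots < 0:
        raise ValueError("slot counts must be non-negative")
    if used_slots > total_slots:
        raise ValueError("used slots cannot exceed total slots")
    if total_slots == 0:
        # A queue instance with no slots cannot run jobs; report it as idle.
        return 0.0
    return used_slots / total_slots


# Hypothetical queue instance with 3 of its 4 slots occupied.
print(f"{slot_utilization(3, 4):.0%}")  # prints 75%
```

A value close to 1.0 suggests that the queue instance cannot accept many more jobs until running jobs finish.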

Note –

You do not prioritize jobs using an N1GE queue. You define priorities using the extended policy system of the Sun N1 Grid Engine software. For information on job priorities, see the sge_priority(5) man page and Scheduler Policies for Job Prioritization in the Sun N1 Grid Engine 6 System (www.sun.com/blueprints/1005/819-4325.html).

For information on cluster queues, see the Monitoring and Controlling Queues section in the N1GE 6 User's Guide and the qmon man page. For more information on queue states, see the Queue Alerts page.

Viewing Complete Queue Information

The Queue Details page contains complete information for the queue instance that you selected on the Queue Summary page.

Figure 6–2 Queue Details Page

This page shows you the complete details for a particular queue.

The Queue Details page shows the following information:

Queue – The queue instance name.
Status – Describes whether this queue instance is running, suspended (manually or automatically in the case of an error), or waiting for a required resource to become available or a condition to be met. See the Queue Alerts page for more information.
Used Slots – The number of jobs concurrently executing in the queue instance. The type is number.
Total Slots – The maximum number of concurrently executing jobs allowed in the queue instance. The type is number.
Queue Type – The type of queue. Currently one of batch, interactive, parallel, or checkpointing, or any combination of these in a comma-separated list. The type is string; the default is batch interactive parallel.
Hostname – The fully qualified host name of the node (type string; template default: host.dom.dom.dom).
Calendar – Specifies the valid calendar for this queue instance or contains NONE (the default). A calendar defines the availability of a queue instance depending on time of day, week, and year. Refer to the calendar_conf man page for details on the N1 Grid Engine calendar facility.
Seq No – The sequence number. Combined with the host's load situation, this parameter specifies the queue's position within the scheduling order of suitable queues. Jobs are dispatched according to the queue_sort_method setting (see the sched_conf man page). Regardless of the queue_sort_method setting, qstat reports queue information in the order defined by the value of seq_no. Set this parameter to a monotonically increasing sequence. The type is number and the default is 0.
Rerun – Defines the default behavior for jobs that are aborted by a system crash or by a manual violent shutdown (using kill) of the complete Sun N1 Grid Engine system on the queue host (including the sge_shepherd of the jobs and their process hierarchy). As soon as the sge_execd daemon restarts and detects that a job has been aborted for such a reason, the job can be restarted if it is restartable. A job may not be restartable if, for example, it updates databases (first reads from and then writes to the same record of a database or file), because the cancellation of the job may have left the database in an inconsistent state. The type of this parameter is Boolean, so you can specify either TRUE or FALSE. The default is FALSE, that is, do not restart jobs automatically. To override the default behavior for the jobs in the queue, the owner of the job can use the -r option of the qsub command.
Min Cpu Interval – The time between two automatic checkpoints for transparently checkpointing jobs. The checkpoint interval is the maximum of the time requested by the user (using qsub) and the time defined by the queue configuration. Checkpoint files can be quite large, and writing them to the file system can be expensive, so users and administrators are advised to choose sufficiently large time intervals. The type of min_cpu_interval is time and the default is 5 minutes, which is usually suitable for test purposes only.
s_rt (soft real time) and h_rt (hard real time) resource limit parameters define the real time (also called elapsed or wall clock time) that has passed since the start of the job. If a job running in the queue exceeds h_rt, it is stopped using the SIGKILL signal (see the kill command). If the job exceeds s_rt, it is first warned by the SIGUSR1 signal, which the job can catch, and is finally stopped after the notification time defined in the queue configuration notify parameter has passed.
s_cpu (soft cpu) and h_cpu (hard cpu; the per-job CPU time limit in seconds) resource limit parameters impose a limit on the amount of combined CPU time consumed by all the processes in the job. If h_cpu is exceeded by a job running in the queue, it is stopped by a SIGKILL signal (see the kill command). If s_cpu is exceeded, the job is sent a SIGXCPU signal, which can be caught by the job. To warn a job so that it can exit gracefully before it is killed, set the s_cpu limit to a lower value than h_cpu. For parallel processes, the limit is applied per slot: it is multiplied by the number of slots being used by the job before being applied.
s_vmem (soft virtual memory; the same as s_data, and if both are set the minimum is used) and h_vmem (hard virtual memory; the same as h_data, and if both are set the minimum is used) resource limit parameters impose a limit on the amount of combined virtual memory consumed by all the processes in the job. If h_vmem is exceeded by a job running in the queue, it is stopped by a SIGKILL signal. If s_vmem is exceeded, the job is sent a SIGXCPU signal, which can be caught by the job. To warn a job so that it can exit gracefully before it is killed, set the s_vmem limit to a lower value than h_vmem. For parallel processes, the limit is applied per slot: it is multiplied by the number of slots being used by the job before being applied.
s_core (soft core) – The per-process maximum core file size in bytes.
s_data (soft data) – The per-process maximum memory limit in bytes.
h_data (hard data) – The per-job maximum memory limit in bytes.
h_fsize (hard file size) – The total number of disk blocks that this job can create.
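The soft limits described above warn a job with a signal that it can catch. The following is a minimal sketch of a job written in Python that catches those warnings so that it can shut down cleanly before the hard limit takes effect. The function and variable names are hypothetical, and the sketch assumes a UNIX host where SIGUSR1 and SIGXCPU are defined. SIGKILL, which enforces the hard limits, cannot be caught.

```python
import signal

# Warning signals received so far, recorded by the handler below.
received = []


def on_limit_warning(signum, frame):
    """Record the warning signal so the main loop can stop cleanly."""
    received.append(signum)


def install_limit_handlers():
    """Register handlers for the soft-limit warning signals."""
    # SIGUSR1 is sent when the s_rt limit is exceeded.
    signal.signal(signal.SIGUSR1, on_limit_warning)
    # SIGXCPU is sent when the s_cpu or s_vmem limit is exceeded.
    signal.signal(signal.SIGXCPU, on_limit_warning)


def should_stop():
    """Return True once any soft-limit warning has arrived."""
    return bool(received)


install_limit_handlers()
```

A job's main loop could call should_stop() between units of work and, when it returns True, save its state and exit.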

The parameters listed above specify per-job soft and hard resource limits as implemented by the setrlimit(2) system call. By default, each limit field is set to infinity, which means RLIM_INFINITY as described in the setrlimit(2) man page. The value type for the CPU-time limits s_cpu and h_cpu is time. The value type for the other limits is memory.
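To make the per-slot rule for parallel processes concrete, the following sketch scales a configured limit by the number of slots a job uses. It is an illustration only: the function name and sample values are hypothetical, and math.inf is used here as a stand-in for the infinity default.

```python
import math


def effective_limit(per_slot_limit: float, slots: int) -> float:
    """Scale a per-slot limit by the number of slots a job uses."""
    if slots < 1:
        raise ValueError("a job uses at least one slot")
    # An infinite limit stays infinite regardless of the slot count.
    return per_slot_limit * slots


# Hypothetical values: h_cpu of 3600 seconds, parallel job using 4 slots.
print(effective_limit(3600, 4))      # prints 14400
print(effective_limit(math.inf, 4))  # prints inf
```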

Note –

Not all systems support the setrlimit system call. Also, s_vmem and h_vmem are only available on systems supporting RLIMIT_VMEM (see the setrlimit(2) man page on the system hosting the queue).
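One way to check what a particular host supports is to query its limits directly. The sketch below uses Python's standard resource module, which wraps getrlimit and setrlimit on UNIX systems. It is not part of the N1GE software, and the helper names are hypothetical.

```python
import resource

# RLIMIT_VMEM is not defined on every platform, so look it up defensively.
RLIMIT_VMEM = getattr(resource, "RLIMIT_VMEM", None)


def describe(value: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'infinity'."""
    return "infinity" if value == resource.RLIM_INFINITY else str(value)


def current_limits(which: int) -> tuple[str, str]:
    """Return the (soft, hard) limits for one resource as strings."""
    soft, hard = resource.getrlimit(which)
    return describe(soft), describe(hard)


print("RLIMIT_VMEM supported:", RLIMIT_VMEM is not None)
print("CPU time (soft, hard):", current_limits(resource.RLIMIT_CPU))
```

The output depends on the host and on the limits inherited by the process, so no fixed result is shown here.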

For more information, see the complex man page.