Sun N1 System Manager 1.3 Grid Engine Provisioning and Monitoring Guide

Chapter 5 Working With N1 Grid Engine Jobs

Each application running on the grid is considered a job. The following sections describe how you can check a job's state as well as its utilization of resources and its scheduling policy. This information is displayed in different views of a job's data, including an overview, a utilization view, and an allocation view. You can also see fine-grained information about each job, including details about each job's component tasks.

Checking a Job's State

Use the Jobs Overview tab as a quick way to check a job's State and see some of the factors that might affect its performance. Clicking a job ID displays a Job Details page that provides very detailed information.

Figure 5–1 Jobs Overview Tab

This tab shows you an overview of all grid jobs.

The fields on the Jobs Overview tab include:

State – The job state is indicated by the following letters:
- d (deletion) – Indicates that a job has been deleted (using qdel(1)).
- r (running) – Indicates that a job is about to be executed or is already executing.
- R (restarted) – Indicates that the job was restarted. This state can be caused by a job migration or by one of the reasons described in the -r section of the qsub(1) man page.
- s (suspended) – Shows that an already running job has been suspended (using qmod(1)).
- S (suspended) – Shows that an already running job has been suspended because the queue that it belongs to has been suspended.
- t (transferring) – Indicates that a job is about to be executed or is already executing.
- T (threshold) – Shows that an already running job has been suspended because at least one suspend threshold of the corresponding queue was exceeded (for more information, see the queue_conf(5) man page).
- w (waiting) – Indicates that the job is suspended pending the availability of a critical resource or specified condition.
See the qstat(1) man page for a detailed explanation of these state conditions. For more information, you can also see Monitoring and Controlling Jobs and Queues in the N1 Grid Engine User manual.
ID – The job ID provides a unique identity for the job and also a method of accessing the Job Details page.
Name – The name of the job. Assigning names to jobs makes them more comprehensible and easier to track than just relying on job IDs.
User – The name of the user who submitted the job.
Project – The name of the project to which the job is assigned as specified in the qsub(1) -P option or by the default project of the submitting user.
Department – The name of the department to which the user belongs. Use the -sul and -su options of the qconf command to display the current department definitions.
Priority – The dispatch priority of the job determining its position in the pending jobs list. The dispatch priority is a decimal number with higher values denoting higher priority. The priority value is determined dynamically based on the ticket and urgency policy setup.
Running Time/Pending Time – The time that has elapsed since the job started running or, for jobs that are still in the queue, how long the job has been waiting to run.
Task – The currently executing task. Some jobs consist of a single task, in which case the task ID is always 1. However, parallel jobs and array jobs each consist of more than one task. The tasks are usually numbered in ascending order starting with 1. Depending upon how the job was submitted, the numbers might skip values, for example 1, 3, 5. On running jobs, each task runs separately and so has its own configuration information, environment, and trace. For details about the task, click the task number to display the Task Details page.
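
The State letters listed above can be turned into readable text with a small lookup. The following Python sketch is only an illustration: it assumes that the State column can show several letters at once (as qstat output does, for example "dr"), and the function name describe_state is hypothetical rather than part of any N1 Grid Engine interface.

```python
# Letter meanings taken from the State field list in this chapter.
STATE_LETTERS = {
    "d": "deletion",
    "r": "running",
    "R": "restarted",
    "s": "suspended (by qmod)",
    "S": "suspended (queue suspended)",
    "t": "transferring",
    "T": "suspended (queue suspend threshold exceeded)",
    "w": "waiting",
}


def describe_state(state):
    """Return one description per state letter.

    Letters that are not in the table are reported as unknown
    instead of being silently dropped.
    """
    return [STATE_LETTERS.get(ch, "unknown ({})".format(ch)) for ch in state]


# A job that is running and has been marked for deletion.
print(describe_state("dr"))
```

Letters that this chapter does not list are labeled as unknown, so the sketch does not guess at their meaning.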

The Job User, Project, and Department are elements that you can use in an Entitlement policy (also known as a Ticket policy) to affect a job's dispatch priority. For example, jobs from one Department can always be entitled to have a higher dispatch priority than those from another Department.

Dispatch Priority is computed from three top-level scheduling policies: Entitlement, Urgency, and Custom (also known as POSIX). For more detailed information on N1GE scheduling policies and dispatch priority, see the sge_priority man page and Scheduler Policies for Job Prioritization in the Sun N1 Grid Engine 6 System (www.sun.com/blueprints/1005/819-4325.html).

Checking Grid Resources

Use the Job Utilization View tab to display information that is relevant to a job's consumption of grid computing resources as well as other elements that factor into a job's dispatch priority. Unlike the Overview view, only running and suspended jobs appear. In the Utilization view, the columns are as follows:

Figure 5–2 Job Utilization View Tab

This tab shows you the job utilization view.

State – The job state is indicated by the following letters:
- d (deletion) – Indicates that a job has been deleted (using qdel(1)).
- r (running) – Indicates that a job is about to be executed or is already executing.
- R (restarted) – Indicates that the job was restarted. This state can be caused by a job migration or by one of the reasons described in the -r section of the qsub(1) man page.
- s (suspended) – Shows that an already running job has been suspended (using qmod(1)).
- S (suspended) – Shows that an already running job has been suspended because the queue that it belongs to has been suspended.
- t (transferring) – Indicates that a job is about to be executed or is already executing.
- T (threshold) – Shows that an already running job has been suspended because at least one suspend threshold of the corresponding queue was exceeded (see queue_conf(5)).
- w (waiting) – Indicates that the job is suspended pending the availability of a critical resource or specified condition.
See the qstat(1) man page for a detailed explanation of these state conditions. For more information, you can also see Monitoring and Controlling Jobs and Queues in the N1 Grid Engine User manual.
ID – The job ID provides a unique identity and also a method of accessing the Job Details page.
Name – The name of the job. Assigning names to jobs makes them more comprehensible and easier to track than just relying on job IDs.
Queue – The queue instance to which the job belongs.
CPU – The amount of CPU time that the job has consumed.
Memory – The amount of memory that the job is using.
Share – The calculated share of the total system to which the job is entitled currently.
Run time – The length of time the job has been running since it was dispatched.
NTickets – The normalized Ticket priority. You can use the Override component of the ticket policy to increase the entitlement of a specific User, Project, or Department. By assigning Override Tickets, you can modify the entitlement without affecting any prioritization assignments of the Urgency policy.
NUrgency – The normalized Urgency priority. Three factors contribute to this priority: the deadline contribution, the wait-time contribution, and the resource requirement contribution.
NPOSIX – The normalized POSIX priority. An administrator can use this value to arbitrarily increase the priority of certain jobs.
Task – The currently executing task. Some jobs consist of a single task, in which case the task ID is always 1. However, parallel jobs and array jobs each consist of more than one task. The tasks are usually numbered in ascending order starting with 1. Depending upon how the job was submitted, the numbers might skip values, for example 1, 3, 5. On running jobs, each task runs separately and so has its own configuration information, environment, and trace. For details about the task, click the task number to display the Task Details page.
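
Task numbers that skip values, as described for the Task field, typically come from an array job that was submitted with a stepped task range. The following Python sketch assumes that the range was given in the n-m:s form accepted by the -t option of qsub. The function name expand_task_range is hypothetical and is used only to show which task IDs such a range produces.

```python
def expand_task_range(spec):
    """Expand a task range of the form 'n', 'n-m', or 'n-m:s'.

    Returns the list of task IDs that the range covers.
    """
    range_part, _, step_part = spec.partition(":")
    first_part, _, last_part = range_part.partition("-")
    first = int(first_part)
    last = int(last_part) if last_part else first
    step = int(step_part) if step_part else 1
    if first < 1 or last < first or step < 1:
        raise ValueError("invalid task range: {!r}".format(spec))
    return list(range(first, last + 1, step))


# A range of 1 to 5 with a step of 2 yields the task IDs 1, 3, and 5.
print(expand_task_range("1-5:2"))
```

Each ID in the returned list corresponds to one task that can be selected in the Task column.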

Note –

If the CPU usage or memory usage values are blank, the usage information for that job has not yet been reported. Check back at a later time to see if the usage is then reported.

For more information on the meaning of each column, see the qmon man page.

Normalized Priorities

The normalized ticket, urgency, and POSIX priorities correspond to the three top-level policies used by the N1GE Scheduler to determine a job's dispatch priority. Each policy calculates a factor that contributes to the overall priority. So that these three policy contributions can be added together in a meaningful way, each one is normalized to a number between 0 and 1.
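
One way to picture how the three normalized values are added together is as a weighted sum. The following Python sketch is an illustration under stated assumptions, not the scheduler's actual code: the parameter names and the default weight values are assumptions chosen for the example, and the weights that apply on a real system come from the scheduler configuration described in the sge_priority man page.

```python
def dispatch_priority(ntickets, nurgency, nposix,
                      weight_ticket=0.01,
                      weight_urgency=0.1,
                      weight_priority=1.0):
    """Combine the three normalized policy values into one number.

    The weight defaults are assumed example values, not values
    read from a real scheduler configuration.
    """
    for name, value in (("ntickets", ntickets),
                        ("nurgency", nurgency),
                        ("nposix", nposix)):
        # Normalized values are expected to lie between 0 and 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError("{} must be between 0 and 1".format(name))
    return (weight_priority * nposix
            + weight_urgency * nurgency
            + weight_ticket * ntickets)


# Hypothetical NTickets, NUrgency, and NPOSIX values for one job.
print(dispatch_priority(ntickets=0.8, nurgency=0.2, nposix=0.5))
```

Because every input is on the same 0 to 1 scale, the weights alone decide how strongly each policy influences the result.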

Checking Scheduling Policies

With the Job Allocation View tab, you can see information about the factors that make up the scheduling policies that contribute to a job's dispatch priority. You can use this view to determine whether your priority policies are actually in effect and to troubleshoot the components that determine a job's overall priority in the queue.

A job's priority is determined based on three policies:

Ticket policy
Custom (or POSIX) policy
Urgency policy

The first part of the equation, Tickets, tells you the calculations that the scheduler is making in order to implement the entitlement-oriented scheduling policy that has been configured. Tickets provide a window into the inner logical workings of the scheduler. This feature helps you to verify that whatever policy you wanted is in fact being obeyed. It also provides you with a means for diagnosing any problems or unexpected behavior you might be seeing.

From a high level, the number of tickets assigned to a job is directly proportional to the job's entitlement. The higher the number, the greater the entitlement. Jobs with a large entitlement often have a high priority. However, the overall priority is affected by the other two policies as well, unless you have deliberately turned off the urgency and custom policies. In that case, only the entitlement ("tickets") policy is active.

The second part of the priority equation is Custom (also called POSIX) priority. An administrator can use this value to arbitrarily increase the priority of certain jobs.

The third part of the priority equation, Urgency, accounts for only the job's individual characteristics, not its owner. The urgency value is derived from the sum of three contributions: the deadline contribution, the wait-time contribution, and the resource requirement contribution.
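
The sum described above can be written out as a short worked example. The following Python sketch uses hypothetical contribution values, and the names UrgencyParts and raw_urgency are illustrative only. It shows how the three contributions add up to the raw urgency value before normalization.

```python
from typing import NamedTuple


class UrgencyParts(NamedTuple):
    resource: float  # resource requirement contribution
    wait: float      # wait-time contribution
    deadline: float  # deadline contribution


def raw_urgency(parts):
    """Return the raw (not yet normalized) urgency as the sum of its parts."""
    return parts.resource + parts.wait + parts.deadline


# Hypothetical values for a job that has no deadline set.
example = UrgencyParts(resource=1000.0, wait=25.0, deadline=0.0)
print(raw_urgency(example))
```

Comparing this sum with the displayed urgency of a job is one way to check which contribution dominates.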

For more detailed information on N1GE scheduling policies and dispatch priority, see the sge_priority man page and Scheduler Policies for Job Prioritization in the Sun N1 Grid Engine 6 System (www.sun.com/blueprints/1005/819-4325.html).

Figure 5–3 Job Allocation View Tab

This tab shows you the resources allocated for a job.

The Job Allocation View page displays the following information:

State – The job state is indicated by the following letters:
- d (deletion) – Indicates that a job has been deleted (using qdel(1)).
- r (running) – Indicates that a job is about to be executed or is already executing.
- R (restarted) – Indicates that the job was restarted. This state can be caused by a job migration or by one of the reasons described in the -r section of the qsub(1) man page.
- s (suspended) – Shows that an already running job has been suspended (using qmod(1)).
- S (suspended) – Shows that an already running job has been suspended because the queue that it belongs to has been suspended.
- t (transferring) – Indicates that a job is about to be executed or is already executing.
- T (threshold) – Shows that an already running job has been suspended because at least one suspend threshold of the corresponding queue was exceeded (see queue_conf(5)).
- w (waiting) – Indicates that the job is suspended pending the availability of a critical resource or specified condition.
See the qstat(1) man page for a detailed explanation of these state conditions. For more information, you can also see Monitoring and Controlling Jobs and Queues in the N1 Grid Engine User manual.
ID – The job ID provides a unique identity and also a method of accessing the Job Details page.
Name – The name of the job. Assigning names to jobs makes them more comprehensible and easier to track than just relying on job IDs.
Tickets – The total number of tickets for the job. The more tickets a job has assigned to it, the higher that job's priority. This value is the “raw” number before it is normalized.
Override – The number of Override tickets. By assigning Override tickets, you can modify the entitlement without affecting any prioritization assignments of the Urgency policy.
Func – The number of functional tickets.
Tree – The number of share tree tickets. The share tree defines the long-term resource entitlements of users/projects and of a hierarchy of arbitrary groups made up of them.
Posix – The POSIX priority. This feature provides a way to increase a job's priority. This is the “raw” number before it is normalized.
Urgency – The total urgency for the job made up of the deadline contribution, the wait-time contribution, and the resource requirement contribution. This is the “raw” number before it is normalized.
Res – The resource contribution to the urgency.
Wait – The waiting time contribution to the urgency.
Ddln – The deadline contribution to the urgency.
Task – The currently executing task. Some jobs consist of a single task, in which case the task ID is always 1. However, parallel jobs and array jobs each consist of more than one task. The tasks are usually numbered in ascending order starting with 1. Depending upon how the job was submitted, the numbers might skip values, for example 1, 3, 5. On running jobs, each task runs separately and so has its own configuration information, environment, and trace. For details about the task, click the task number to display the Task Details page.

Note –

You can see the normalized values for Tickets, POSIX, and Urgency using the Job Utilization View tab.

For more information on the meaning of each column, see the qmon man page.

Seeing Detailed Job Information

You can see complete details about a job by selecting the job ID on any of the job view tabs. The Job Details page that appears presents this information in three tables: General, Usage Details, and Schedule Details.

The General table provides details including various properties related to the job's environment, resource requests, submit options, and so forth.

Figure 5–4 Job Details Page

This page shows you the complete details for a particular job.

The Usage Details table shows the current resource utilization for that job. If this information is not available, for example, because the job started too recently or the job is still pending, then this table is empty. For jobs with multiple tasks, the usage of each task appears on a separate line.

The Schedule Details table shows the scheduling information for that job.

Most of the fields on this page are self-explanatory. For more information, see the qstat man page.

Seeing Detailed Task Information

The Task Details page contains four tables that provide detailed information about the selected task. The same details page is used for any task that you select from the three job view tabs. The information on this page is useful for diagnosing jobs that might be experiencing a problem.

Figure 5–5 Task Details Page

This page shows you the complete details for a particular job task.

Each table on the Task Details page corresponds to a different file in the job spool directory. For more information about the contents of the job spool directory, see the N1 Grid Engine 6 Administration manual. The tables are:

Task Summary
Configuration
Environment
Trace

Task Summary Table

The Task Summary table provides basic information about the job task.

Add Group ID — Contains one line with the additional group ID used to control and monitor the job.
PE Hostfile — A file describing the host setup of a parallel job which contains each involved host, the queues the job was spooled into, and the number of reserved slots (tasks) per host.
Error — Contains an error message in the case of severe errors during the startup of a job. For example, Execd cannot start shepherd.
Shepherd PID — The process ID of the shepherd.
Job PID — The process ID of the job (the shepherd's child process).
Exit Status — The numeric exit code of the job in a single line.
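
The PE Hostfile entry can be read programmatically when you want to total the slots reserved for a parallel job. The following Python sketch assumes that each line of the file lists the host name, the number of slots, and the queue, in that order, separated by white space, with any further fields ignored. The host names, queue names, and function name shown are hypothetical.

```python
from typing import NamedTuple


class HostEntry(NamedTuple):
    host: str
    slots: int
    queue: str


def parse_pe_hostfile(text):
    """Parse PE hostfile text into a list of HostEntry records.

    Assumed line layout: host, slot count, queue, then optional
    extra fields that this sketch ignores.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        entries.append(HostEntry(fields[0], int(fields[1]), fields[2]))
    return entries


# Hypothetical hostfile content for a job spread over two hosts.
sample = """\
node01 4 all.q@node01 UNDEFINED
node02 2 all.q@node02 UNDEFINED
"""
entries = parse_pe_hostfile(sample)
total_slots = sum(entry.slots for entry in entries)
print(total_slots)
```

The total can then be compared with the number of slots that the job requested.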