Sun N1 Grid Engine 6.1 Administration Guide

Chapter 5 Managing Policies and the Scheduler

This chapter contains information about grid engine system policies. Topics in this chapter include the following:

Scheduling
Policies

In addition to the background information, this chapter includes detailed instructions on how to accomplish the following tasks:

Administering the Scheduler

This section describes how the grid engine system schedules jobs for execution. The section describes different types of scheduling strategies and explains how to configure the scheduler.

About Scheduling

The grid engine system includes the following job-scheduling activities:

Predispatching decisions. Activities such as eliminating queues because they are full or overloaded, spooling jobs that are currently not eligible for execution, and reserving resources for higher-priority jobs
Dispatching. Deciding a job's importance with respect to other pending jobs and running jobs, sensing the load on all machines in the cluster, and sending the job to a queue on a machine selected according to configured selection criteria
Postdispatch monitoring. Adjusting a job's relative importance as it gets resources and as other jobs with their own relative importance enter or leave the system

The grid engine software schedules jobs across a heterogeneous cluster of computers, based on the following criteria:

The cluster's current load
The jobs' relative importance
The hosts' relative performance
The jobs' resource requirements, for example, CPU, memory, and I/O bandwidth

Decisions about scheduling are based on the strategy for the site and on the instantaneous load characteristics of each computer in the cluster. A site's scheduling strategy is expressed through the grid engine system's configuration parameters. Load characteristics are ascertained by collecting performance data as the system runs.

Scheduling Strategies

The administrator can set up strategies with respect to the following scheduling tasks:

Dynamic resource management. The grid engine system dynamically controls and adjusts the resource entitlements that are allocated to running jobs. In other words, the system modifies their CPU share.
Queue sorting. The software ranks the queues in the cluster according to the order in which the queues should be filled up.
Job sorting. Job sorting determines the order in which the grid engine system attempts to schedule jobs.
Resource reservation and backfilling. Resource reservation reserves resources for jobs, blocking their use by jobs of lower priority. Backfilling enables lower-priority jobs to use blocked resources when using those resources does not interfere with the reservation.

Dynamic Resource Management

The grid engine software uses a weighted combination of the following three ticket-based policies to implement automated job scheduling strategies:

Share-based
Functional (sometimes called Priority)
Override

You can set up the grid engine system to routinely use either a share-based policy, a functional policy, or both. You can combine these policies in any combination. For example, you could give zero weight to one policy and use only the second policy. Or you could give both policies equal weight.

Along with routine policies, administrators can also override share-based and functional scheduling temporarily or, for certain purposes such as express queues, permanently. You can apply an override to one job or to all jobs associated with a user, a department, a project, or a job class (that is, a queue).

In addition to the three policies for mediating among all jobs, the grid engine system sometimes lets users set priorities among the jobs they own. For example, a user might say that jobs one and two are equally important, but that job three is more important than either job one or job two. Users can set their own job priorities if the combination of policies includes the share-based policy, the functional policy, or both. Also, functional tickets must be granted to jobs.

Tickets

The share-based, functional, and override scheduling policies are implemented with tickets. Each policy has a pool of tickets. A policy allocates tickets to jobs as the jobs enter the multimachine grid engine system. Each routine policy that is in force allocates some tickets to each new job. The policy might also reallocate tickets to running jobs at each scheduling interval.

Tickets weight the three policies. For example, if no tickets are allocated to the functional policy, that policy is not used. If the functional ticket pool and the share-based ticket pool have an equal number of tickets, both policies have equal weight in determining a job's importance.

Tickets are allocated to the routine policies at system configuration by grid engine system managers. Managers and operators can change ticket allocations at any time with immediate effect. Additional tickets are injected into the system temporarily to indicate an override. Policies are combined by assignment of tickets. When tickets are allocated to multiple policies, a job gets a portion of each policy's tickets, which indicates the job's importance in each policy in force.

The grid engine system grants tickets to jobs that are entering the system to indicate their importance under each policy in force. At each scheduling interval, each running job can gain tickets, lose tickets, or keep the same number of tickets. For example, a job might gain tickets from an override. A job might lose tickets because it is getting more than its fair share of resources. The number of tickets that a job holds represent the resource share that the grid engine system tries to grant that job during each scheduling interval.

You configure a site's dynamic resource management strategy during installation. First, you allocate tickets to the share-based policy and to the functional policy. You then define the share tree and the functional shares. The share-based ticket allocation and the functional ticket allocation can change automatically at any time. The administrator manually assigns or removes tickets.

Queue Sorting

The following means are provided to determine the order in which the grid engine system attempts to fill up queues:

Load reporting. Administrators can select which load parameters are used to compare the load status of hosts and their queue instances. The wide range of standard load parameters that are available, and an interface for extending this set with site-specific load sensors, are described in Load Parameters.
Load scaling. Load reports from different hosts can be normalized to reflect a comparable situation. See Configuring Execution Hosts With QMON.
Load adjustment. The grid engine software can be configured to automatically correct the last reported load as jobs are dispatched to hosts. The corrected load represents the expected increase in the load situation caused by recently started jobs. This artificial increase of load can be automatically reduced as the load impact of these jobs takes effect.
Sequence number. Queues can be sorted following a strict sequence.

Job Sorting

Before the grid engine system starts to dispatch jobs, the jobs are brought into priority order, highest priority first. The system then attempts to find suitable resources for the jobs in priority sequence.

Without any administrator influence, the order is first-in-first-out (FIFO). The administrator has the following means to control the job order:

Ticket-based job priority. Jobs are always treated according to their relative importance as defined by the number of tickets that the jobs have. Pending jobs are sorted in ticket order. Any change that the administrator applies to the ticket policy also changes the sorting order.
Urgency-based job priority. Jobs can have an urgency value that determines their relative importance. Pending jobs are sorted according to their urgency value. Any change applied to the urgency policy also changes the sorting order.
POSIX priority. You can use the –p option to the qsub command to implement site-specific priority policies. The –p option specifies a range of priorities from –1023 to 1024. The higher the number, the higher the priority. The default priority for jobs is zero.
Maximum number of user or user group jobs. You can restrict the maximum number of jobs that a user or a UNIX user group can run concurrently. This restriction influences the sorting order of the pending job list, because the jobs of users who have not exceeded their limit are given preference.

For each priority type, a weighting factor can be specified. This weighting factor determines the degree to which each type of priority affects overall job priority. To make it easier to control the range of values for each priority type, normalized values are used instead of the raw ticket values, urgency values, and POSIX priority values.

The following formula expresses how a job's priority values are determined:

job_priority = weight_urgency * normalized_urgency_value + 
weight_ticket * normalized_ticket_value + 
weight_priority * normalized_POSIX_priority_value

You can use the qstat command to monitor job priorities:

Use qstat –pri to monitor job priorities overall, including POSIX priority.
Use qstat –ext to monitor job priorities based on the ticket policy.
Use qstat –urg to monitor job priorities based on the urgency policy.
Use qstat –prito diagnose job priority issues when urgency policy, ticket based policies and -p <priority> are used concurrently
Use qstat –explainto diagnose various queue instance based error conditions.

About the Urgency Policy

The urgency policy defines an urgency value for each job. The urgency value is derived from the sum of three contributions:

Resource requirement contribution
Waiting time contribution
Deadline contribution

The resource requirement contribution is derived from the sum of all hard resource requests, one addend for each request.

If the resource request is of the type numeric, the resource request addend is the product of the following three elements:

The resource's urgency value as defined in the complex. For more information, see Configuring Complex Resource Attributes With QMON.
The assumed slot allocation of the job.
The per slot request specified by the qsub –l command.

If the resource request is of the type string, the resource request addend is the resource's urgency value as defined in the complex.

The waiting time contribution is the product of the job's waiting time, in seconds, and the waiting-weight value specified in the Policy Configuration dialog box.

The deadline contribution is zero for jobs without a deadline. For jobs with a deadline, the deadline contribution is the weight-deadline value, which is defined in the Policy Configuration dialog box, divided by the free time, in seconds, until the deadline initiation time.

For information about configuring the urgency policy, see Configuring the Urgency Policy.

Resource Reservation and Backfilling

Resource reservation enables you to reserve system resources for specified pending jobs. When you reserve resources for a job, those resources are blocked from being used by jobs with lower priority.

Jobs can reserve resources depending on criteria such as resource requirements, job priority, waiting time, resource sharing entitlements, and so forth. The scheduler enforces reservations in such a way that jobs with the highest priority get the earliest possible resource assignment. This avoids such well-known problems as “job starvation”.

You can use resource reservation to guarantee that resources are dedicated to jobs in job-priority order.

Consider the following example. Job A is a large pending job, possibly parallel, that requires a large amount of a particular resource. A stream of smaller jobs B(i) require a smaller amount of the same resource. Without resource reservation, a resource assignment for job A cannot be guaranteed, assuming that the stream of B(i) jobs does not stop. The resource cannot be guaranteed even though job A has a higher priority than the B(i) jobs.

With resource reservation, job A gets a reservation that blocks the lower priority jobs B(i). Resources are guaranteed to be available for job A as soon as possible.

Backfilling enables a lower-priority job to use resources that are blocked due to a resource reservation. Backfilling work only if there is a runnable job whose prospective run time is small enough to allow the blocked resource to be used without interfering with the original reservation.

In the example described earlier, a job C, of very short duration, could use backfilling to start before job A.

Because resource reservation causes the scheduler to look ahead, using resource reservation affects system performance. In a small cluster, the effect on performance is negligible when there are only a few pending jobs. In larger clusters, however, and in clusters with many pending jobs, the effect on performance might be significant.

To offset this potential performance degradation, you can limit the overall number of resource reservations that can be made during a scheduling interval. You can limit resource reservation in two ways:

To limit the absolute number of reservations that can be made during a scheduling interval, set the Maximum Reservation parameter on the Scheduler Configuration dialog box. For example, if you set Maximum Reservation to 20, no more than 20 reservations can be made within an interval.
To limit reservation scheduling to only those jobs that are important, use the –R y option of the qsub command. In the example described earlier, there is no need to schedule B(i) job reservations just for the sake of guaranteeing the resource reservation for job A. Job A is the only job that you need to submit with the –R y option.

You can configure the scheduler to monitor how it is influenced by resource reservation. When you monitor the scheduler, information about each scheduling run is recorded in the file sge-root/cell/common/schedule.

The following example shows what schedule monitoring does. Assume that the following sequence of jobs is submitted to a cluster where the global license consumable resource is limited to 5 licenses:

qsub -N L4_RR -R y -l h_rt=30,license=4 -p 100 $SGE_ROOT/examples/jobs/sleeper.sh 20
qsub -N L5_RR -R y -l h_rt-30,license=5        $SGE_ROOT/examples/jobs/sleeper.sh 20
qsub -N L1_RR -R y -l h_rt=31,license=1        $SGE_ROOT/examples/jobs/sleeper.sh 20

Assume that the default priority settings in the scheduler configuration are being used:

weight_priority          1.000000
weight_urgency           0.100000
weight_ticket            0.010000

The –p 100 priority of job L4_RR supersedes the license-based urgency, which results in the following prioritization:

job-ID  prior    name
---------------------
   3127 1.08000 L4_RR
   3128 0.10500 L5_RR
   3129 0.00500 L1_RR

In this case, traces of these jobs can be found in the schedule file for 6 schedule intervals:

::::::::
      3127:1:STARTING:1077903416:30:G:global:license:4.000000
      3127:1:STARTING:1077903416:30:Q:all.q@carc:slots:1.000000
      3128:1:RESERVING:1077903446:30:G:global:license:5.000000
      3128:1:RESERVING:1077903446:30:Q:all.q@bilbur:slots:1.000000
      3129:1:RESERVING:1077903476:31:G:global:license:1.000000
      3129:1:RESERVING:1077903476:31:Q:all.q@es-ergb01-01:slots:1.000000
      ::::::::
      3127:1:RUNNING:1077903416:30:G:global:license:4.000000
      3127:1:RUNNING:1077903416:30:Q:all.q@carc:slots:1.000000
      3128:1:RESERVING:1077903446:30:G:global:license:5.000000
      3128:1:RESERVING:1077903446:30:Q:all.q@es-ergb01-01:slots:1.000000
      3129:1:RESERVING:1077903476:31:G:global:license:1.000000
      3129:1:RESERVING:1077903476:31:Q:all.q@es-ergb01-01:slots:1.000000
      ::::::::
      3128:1:STARTING:1077903448:30:G:global:license:5.000000
      3128:1:STARTING:1077903448:30:Q:all.q@carc:slots:1.000000
      3129:1:RESERVING:1077903478:31:G:global:license:1.000000
      3129:1:RESERVING:1077903478:31:Q:all.q@bilbur:slots:1.000000
      ::::::::
      3128:1:RUNNING:1077903448:30:G:global:license:5.000000
      3128:1:RUNNING:1077903448:30:Q:all.q@carc:slots:1.000000
      3129:1:RESERVING:1077903478:31:G:global:license:1.000000
      3129:1:RESERVING:1077903478:31:Q:all.q@es-ergb01-01:slots:1.000000
      ::::::::
      3129:1:STARTING:1077903480:31:G:global:license:1.000000
      3129:1:STARTING:1077903480:31:Q:all.q@carc:slots:1.000000
      ::::::::
      3129:1:RUNNING:1077903480:31:G:global:license:1.000000
      3129:1:RUNNING:1077903480:31:Q:all.q@carc:slots:1.000000

Each section shows, for a schedule interval, all resource usage that was taken into account. RUNNING entries show usage of jobs that were already running at the start of the interval. STARTING entries show the immediate uses that were decided within the interval. RESERVING entries show uses that are planned for the future, that is, reservations.

The format of the schedule file is as follows:

jobID: The job ID
taskID: The array task ID, or 1 in the case of nonarray jobs
state: Can be RUNNING, SUSPENDED, MIGRATING, STARTING, RESERVING
start-time: Start time in seconds after 1.1.1070
duration: Assumed job duration in seconds
level-char: Can be P (for parallel environment), G (for global), H (for host), or Q (for queue)
object-name: The name of the parallel environment, host, or queue
resource-name: The name of the consumable resource
usage: The resource usage incurred by the job

The line :::::::: marks the beginning of a new schedule interval.

Note –

The schedule file is not truncated. Be sure to turn monitoring off if you do not have an automated procedure that is set up to truncate the file.

What Happens in a Scheduler Interval

The Scheduler schedules work in intervals. Between scheduling actions, the grid engine system keeps information about significant events such as the following:

Job submission
Job completion
Job cancellation
An update of the cluster configuration
Registration of a new machine in the cluster

When scheduling occurs, the scheduler first does the following:

Takes into account all significant events
Sorts jobs and queues according to the administrator's specifications
Takes into account all the jobs' resource requirements
Reserves resources for jobs in a forward-looking schedule

Then the grid engine system does the following tasks, as needed:

Dispatches new jobs
Suspends running jobs
Increases or decreases the resources allocated to running jobs
Maintains the status quo

If share-based scheduling is used, the calculation takes into account the usage that has already occurred for that user or project.

If scheduling is not at least in part share-based, the calculation ranks all the jobs running and waiting to run. The calculation then takes the most important job until the resources in the cluster (CPU, memory, and I/O bandwidth) are used as fully as possible.

Scheduler Monitoring

If the reasons why a job does not get started are unclear to you, run the qalter -w v command for the job. The grid engine software assumes an empty cluster and checks whether any queue that is suitable for the job is available.

Further information can be obtained by running the qstat -j job-id command. This command prints a summary of the job's request profile. The summary also includes the reasons why the job was not scheduled in the last scheduling run. Running the qstat -j command without a job ID summarizes the reasons for all jobs not being scheduled in the last scheduling interval.

Note –

Collection of job scheduling information must be switched on in the scheduler configuration sched_conf(5). Refer to the schedd_job_info parameter description in the sched_conf(5) man page, or to Changing the Scheduler Configuration With QMON.

To retrieve even more detail about the decisions of the scheduler sge_schedd, use the -tsm option of the qconf command. This command forces sge_schedd to write trace output to the file.

Configuring the Scheduler

Refer to Configuring Policy-Based Resource Management With QMON for details on the scheduling administration of resource-sharing policies of the grid engine system. The following sections focus on administering the scheduler configuration sched_conf and related issues.

Default Scheduling

The default scheduling is a first-in-first-out policy. In other words, the first job that is submitted is the first job the scheduler examines in order to dispatch it to a queue. If the first job in the list of pending jobs finds a queue that is suitable and available, that job is started first. A job ranked behind the first job can be started first only if the first job fails to find a suitable free resource.

The default strategy is to select queue instances on the least-loaded host, provided that the queues deliver suitable service for the job's resource requirements. If several suitable queues share the same load, the queue to be selected is unpredictable.

Scheduling Alternatives

You can modify the job scheduling and queue selection strategy in various ways:

Changing the scheduling algorithm
Scaling system load
Selecting queue by sequence number
Selecting queue by share
Restricting the number of jobs per user or per group

The following sections explore these alternatives in detail.

Changing the Scheduling Algorithm

The scheduler configuration parameter algorithm provides a selection for the scheduling algorithm in use. See the sched_conf(5) man page for further information. Currently, default is the only allowed setting.

Scaling System Load

To select the queue to run a job, the grid engine system uses the system load information on the machines that host queue instances. This queue selection scheme builds up a load-balanced situation, thus guaranteeing better use of the available resources in a cluster.

However, the system load may not always tell the truth. For example, if a multi-CPU machine is compared to a single CPU system, the multiprocessor system usually reports higher load figures, because it probably runs more processes. The system load is a measurement strongly influenced by the number of processes trying to get CPU access. But multi-CPU systems are capable of satisfying a much higher load than single-CPU machines. This problem is addressed by processor-number-adjusted sets of load values that are reported by default by sge_execd. Use these load parameters instead of the raw load values to avoid the problem described earlier. See Load Parameters and the sge-root/doc/load_parameters.asc file for details.

Another example of potentially improper interpretation of load values is when systems have marked differences in their performance potential or in their price performance ratio. In both cases, equal load values do not mean that arbitrary hosts can be selected to run a job. In this situation, the administrator should define load scaling factors for the relevant execution hosts and load parameters. See Configuring Execution Hosts With QMON, and related sections.

Note –

The scaled load parameters are also compared against the load threshold lists load-thresholds and migr-load-thresholds. See the queue_conf(5) man page for details.

Another problem associated with load parameters is the need for an application-dependent and site-dependent interpretation of the values and their relative importance. The CPU load might be dominant for a certain type of application that is common at a particular site. By contrast, the memory load might be more important for another site and for the application profile to which the site's compute cluster is dedicated. To address this problem, the grid engine system enables the administrator to specify a load formula in the scheduler configuration file sched_conf. See the sched_conf(5) man page for more details. Site-specific information on resource usage and capacity planning can be taken into account by using site-defined load parameters and consumable resources in the load formula. See the sections Adding Site-Specific Load Parameters) and Consumable Resources.

Finally, the time dependency of load parameters must be taken into account. The load that is imposed by the jobs that are running on a system varies in time. Often the load, for example, the CPU load, requires some amount of time to be reported in the appropriate quantity by the operating system. If a job recently started, the reported load might not provide an accurate representation of the load that the job has imposed on that host. The reported load adapts to the real load over time. But the period of time in which the reported load is too low might lead to an oversubscription of that host. The grid engine system enables the administrator to specify load adjustment factors that are used in the scheduler to compensate for this problem. See the sched_conf(5) man page for detailed information on how to set these load adjustment factors.

Load adjustments are used to virtually increase the measured load after a job is dispatched. In the case of oversubscribed machines, this helps to align with load thresholds. If you do not need load adjustments, you should turn them off. Load adjustments impose additional work on the scheduler in connection with sorting hosts and load thresholds verification.

To disable load adjustments, on the Load Adjustment tab of the Scheduler Configuration dialog box, set the Decay Time to zero, and delete all load adjustment values in the table. See Changing the Scheduler Configuration With QMON.

Selecting Queue by Sequence Number

Another way to change the default scheme for queue selection is to set the global cluster configuration parameter queue_sort_method to seq_no instead of to the default load. In this case, the system load is no longer used as the primary method to select queues. Instead, the sequence numbers that are assigned to the queues by the queue configuration parameter seq_no define a fixed order for queue selection. The queues must be suitable for the considered job, and they must be available. See the queue_conf(5) and sched_conf(5) man pages for more details.

This queue selection policy is useful if the machines that offer batch services at your site are ranked in a monotonous price per job order. For example, a job running on machine A costs 1 unit of money. The same job costs 10 units on machine B. And on machine C the job costs 100 units. Thus the preferred scheduling policy is to first fill up host A and then to use host B. Host C is used only if no alternative remains.

Note –

If you have changed the method of queue selection to seq_no, and the considered queues all share the same sequence number, queues are selected by the default load.

Selecting Queue by Share

The goal of this method is to place jobs so as to attempt to meet the targeted share of global system resources for each job. This method takes into account the resource capability represented by each host in relation to all the system resources. This method tries to balance the percentage of tickets for each host (that is, the sum of tickets for all jobs running on a host) with the percentage of the resource capability that particular host represents for the system. See Configuring Execution Hosts With QMON for instructions on how to define the capacity of a host.

The host's load, although of secondary importance, is also taken into account in the sorting. Choose this sorting method for a site that uses the share-tree policy.

Restricting the Number of Jobs per User or Group

The administrator can assign an upper limit to the number of jobs that any user or any UNIX group can run at any time. In order to enforce this feature, do one of the following:

Set maxujobs or maxgjobs, or both, as described in the sched_conf(5) man page.
On the General Parameters tab of the Scheduler Configuration dialog box, use the Max Jobs/User field to set the maximum number of jobs a user or user group can run concurrently.

Changing the Scheduler Configuration With `QMON`

On the QMON Main Control window, click the Scheduler Configuration button.

The Scheduler Configuration dialog box appears. The dialog box has two tabs:

General Parameters tab
Load Adjustment tab

To change general scheduling parameters, click the General Parameters tab. The General Parameters tab looks like the following figure.

Dialog box titled Scheduler Configuration. Shows General
Parameters tab with parameters you can set. Shows Ok, Cancel, and Help buttons.

Use the General Parameters tab to set the following parameters:

Algorithm. The scheduling algorithm. See Changing the Scheduling Algorithm.
Schedule Interval. The regular time interval between scheduler runs.
Reprioritize Interval. The regular time interval to reprioritize jobs on the execution hosts, based on the current ticket amount for running jobs. To turn reprioritizing off, set this parameter to zero.
Max Jobs/User. The maximum number of jobs that are allowed to run concurrently per user and per UNIX group. See Restricting the Number of Jobs per User or Group.
Sort by. The queue sorting scheme, either sorting by load or sorting by sequence number. See Selecting Queue by Sequence Number.
Job Scheduling Information. Whether job scheduling information is accessible through qstat -j, or whether this information should be collected only for a range of job IDs. You should turn on general collection of job scheduling information only temporarily, in case an extremely high number of jobs are pending.

Scheduler monitoring can help you find out the reason why certain jobs are not dispatched. However, providing this information for all jobs at all times can consume resources. Such information is usually not needed.
Load Formula. The load formula to use to sort hosts and queues.
Flush Submit Seconds. The number of seconds that the scheduler waits after a job is submitted before the scheduler is triggered. To disable the flush after a job is submitted, set this parameter to zero.
Flush Finish Seconds. The number of seconds that the scheduler waits after a job has finished before the scheduler is triggered. To disable the flush after a job has finished, set this parameter to zero.
Maximum Reservation. The maximum number of resource reservations that can be scheduled within a scheduling interval. See Resource Reservation and Backfilling.
Params. Use this setting to specify additional parameters to pass to the scheduler. Params can be PROFILE or MONITOR. If you specify PROFILE, the scheduler logs profiling information that summarizes each scheduling run. If you specify MONITOR, the scheduler records information for each scheduling run in the file sge-root/cell/common/schedule.

By default, the grid engine system schedules job runs in a fixed schedule interval. You can use the Flush Submit Seconds and Flush Finish Seconds parameters to configure immediate scheduling. For more information, see Immediate Scheduling.

To change load adjustment parameters, click the Load Adjustment tab. The Load Adjustment tab looks like the following figure:

Dialog box titled Scheduler Configuration. Shows Load
Adjustment tab with parameters you can set. Shows Ok, Cancel, and Help buttons.

Use the Load Adjustment tab to set the following parameters:

Decay Time. The decay time for the load adjustment.
A table of load adjustment values listing all load and consumable attributes for which an adjustment value is currently defined.

To add load values to the list, click the Load or the Value column heading. A selection list appears with all resource attributes that are attached to the hosts.

The Attribute Selection dialog box is shown in Figure 1–2. To add a resource attribute to the Load column of the Consumable/Fixed Attributes table, select one of the attributes, and then click OK.

To modify an existing value, double-click the Value field.

To delete a resource attribute, select it, and then press Control-D or click mouse button 3. A dialog box asks you to confirm the deletion.

See Scaling System Load for background information. See the sched_conf(5) man page for more details about the scheduler configuration.

Administering Policies

This section describes how to configure policies to manage cluster resources.

The grid engine software orchestrates the delivery of computational power, based on enterprise resource policies that the administrator manages. The system uses these policies to examine available computer resources in the grid. The system gathers these resources, and then it allocates and delivers them automatically, in a way that optimizes usage across the grid.

To enable cooperation in the grid, project owners must do the following:

Negotiate policies
Ensure that policies for manual overrides for unique project requirements are flexible
Automatically monitor and enforce policies

As administrator, you can define high-level usage policies that are customized for your site. Four such policies are available:

Urgency policy – See Configuring the Urgency Policy
Share-based policy – See Configuring the Share-Based Policy
Functional policy – See Configuring the Functional Policy
Override policy – See Configuring the Override Policy

Policy management automatically controls the use of shared resources in the cluster to achieve your goals. High-priority jobs are dispatched preferentially. These jobs receive greater CPU entitlements when they are competing with other, lower-priority jobs. The grid engine software monitors the progress of all jobs. It adjusts their relative priorities correspondingly, and with respect to the goals that you define in the policies.

This policy-based resource allocation grants each user, team, department, and all projects an allocated share of system resources. This allocation of resources extends over a specified period of time, such as a week, a month, or a quarter.

Configuring Policy-Based Resource Management With `QMON`

On the QMON Main Control window, click the Policy Configuration button. The Policy Configuration dialog box appears.

Figure 5–1 Policy Configuration Dialog Box

Dialog box titled Policy Configuration. Shows priority,
urgency, and ticket policy parameters. Shows Refresh, Apply, Done, and Help
buttons.

The Policy Configuration dialog box shows the following information:

Policy Importance Factor
Urgency Policy
Ticket Policy. You can readjust the policy-related tickets.

From this dialog box you can access specific configuration dialog boxes for the three ticket-based policies.

Specifying Policy Priority

Before the grid engine system dispatches jobs, the jobs are brought into priority order, highest priority first. Without any administrator influence, the order is first-in-first-out (FIFO).

On the Policy Configuration dialog box, under Policy Importance Factor, you can specify the relative importance of the three priority types that control the sorting order of jobs:

Priority. Also called POSIX priority. The –p option of the qsub command specifies site-specific priority policies.
Urgency Policy. Jobs can have an urgency value that determines their relative importance. Pending jobs are sorted according to their urgency value.
Ticket Policy. Jobs are always treated according to their relative importance as defined by the number of tickets that the jobs have. Pending jobs are sorted in ticket order.

For more information about job priorities, see Job Sorting.

You can specify a weighting factor for each priority type. This weighting factor determines the degree to which each type of priority affects overall job priority. To make it easier to control the range of values for each priority type, normalized values are used instead of the raw ticket values, urgency values, and POSIX priority values.

The following formula expresses how a job's priority values are determined:

Job priority = Urgency * normalized urgency value + 
Ticket * normalized ticket value + 
Priority * normalized priority value

Urgency, Ticket, and Priority are the three weighting factors you specify under Policy Importance Factor. For example, if you specify Priority as 1, Urgency as 0.1, and Ticket as 0.01, job priority that is specified by the qsub –p command is given the most weight, job priority that is specified by the Urgency Policy is considered next, and job priority that is specified by the Ticket Policy is given the least weight.

Configuring the Urgency Policy

The Urgency Policy defines an urgency value for each job. This urgency value is determined by the sum of the following three contributing elements:

Resource requirement. Each resource attribute defined in the complex can have an urgency value. For information about the setting urgency values for resource attributes, see Configuring Complex Resource Attributes With QMON. Each job request for a resource attribute adds the attribute's urgency value to the total.
Deadline. The urgency value for deadline jobs is determined by dividing the Weight Deadline specified in the Policy Configuration dialog box by the free time, in seconds, until the job's deadline initiation time specified by the qsub –dl command.
Waiting time. The urgency value for a job's waiting time is determined by multiplying the job's waiting time by the Weight Waiting Time specified in the Policy Configuration dialog box. The job's waiting time is measured in seconds.

For details about how the grid engine system arrives at the urgency value total, see About the Urgency Policy.

Configuring Ticket-Based Policies

The tickets that are currently assigned to individual policies are listed under Current Active Tickets. The numbers reflect the relative importance of the policies. The numbers indicate whether a certain policy currently dominates the cluster or whether policies are in balance.

Tickets provide a quantitative measure. For example, you might assign twice as many tickets to the share-based policy as you assign to the functional policy. This means that twice the resource entitlement is allocated to the share-based policy than is allocated to the functional policy. In this sense, tickets behave very much like stock shares.

The total number of all tickets has no particular meaning. Only the relations between policies counts. Hence, total ticket numbers are usually quite high to allow for fine adjustment of the relative importance of the policies.

Under Edit Tickets, you can modify the number of tickets that are allocated to the share tree policy and the functional policy. For details, see Editing Tickets.

Select the Share Override Tickets check box to control the total ticket amount distributed by the override policy. Clear the check box to control the importance of individual jobs relative to the ticket pools that are available for the other policies and override categories. For detailed information, see Sharing Override Tickets.

Select the Share Functional Tickets check box to give a category member a constant entitlement level for the sum of all its jobs. Clear the check box to give each job the same entitlement level, based on its category member's entitlement. For detailed information, see Sharing Functional Ticket Shares.

You can set the maximum number of jobs that can be scheduled in the functional policy. The default value is 200.

You can set the maximum number of pending subtasks that are allowed for each array job. The default value is 50. Use this setting to reduce scheduling overhead.

You can specify the Ticket Policy Hierarchy to resolve certain cases of conflicting policies. The resolving of policy conflicts applies particularly to pending jobs. For detailed information, see Setting the Ticket Policy Hierarchy.

To refresh the information displayed, click Refresh.

To save any changes that you make to the Policy Configuration, click Apply. To close the dialog box without saving changes, click Done.

Editing Tickets

You can edit the total number of share-tree tickets and functional tickets. Override tickets are assigned directly through the override policy configuration. The other ticket pools are distributed automatically among jobs that are associated with the policies and with respect to the actual policy configuration.

Note –

All share-based tickets and functional tickets are always distributed among the jobs associated with these policies. Override tickets might not be applicable to the currently active jobs. Consequently, the active override tickets might be zero, even though the override policy has tickets defined.

Sharing Override Tickets

The administrator assigns tickets to the different members of the override categories, that is, to individual users, projects, departments, or jobs. Consequently, the number of tickets that are assigned to a category member determines how many tickets are assigned to jobs under that category member. For example, the number of tickets that are assigned to user A determines how many tickets are assigned to all jobs of user A.

Note –

The number of tickets that are assigned to the job category does not determine how many tickets are assigned to jobs in that category.

Use the Share Override Tickets check box to set the share_override_tickets parameter of sched_conf(5). This parameter controls how job ticket values are derived from their category member ticket value. When you select the Share Override Tickets check box, the tickets of the category members are distributed evenly among the jobs under this member. If you clear the Share Override Tickets check box, each job inherits the ticket amount defined for its category member. In other words, the category member tickets are replicated for all jobs underneath.

Select the Share Override Tickets check box to control the total ticket amount distributed by the override policy. With this setting, ticket amounts that are assigned to a job can become negligibly small if many jobs are under one category member. For example, ticket amounts might diminish if many jobs belong to one member of the user category.

Clear the Share Override Tickets check box to control the importance of individual jobs relative to the ticket pools that are available for the other policies and override categories. With this setting, the number of jobs that are under a category member does not matter. The jobs always get the same number of tickets. However, the total number of override tickets in the system increases as the number of jobs with a right to receive override tickets increases. Other policies can lose importance in such cases.

Sharing Functional Ticket Shares

The functional policy defines entitlement shares for the functional categories. Then the policy defines shares for all members of each of these categories. The functional policy is thus similar to a two-level share tree. The difference is that a job can be associated with several categories at the same time. The job belongs to a particular user, for instance, but the job can also belong to a project, a department, and a job class.

However, as in the share tree, the entitlement shares that a job receives from a functional category is determined by the following:

The shares that are defined for its corresponding category member (for example, its project)
The shares that are given to the category (project instead of user, department, and so on)

Use the Share Functional Tickets check box to set the share_functional_shares parameter of sched_conf(5). This parameter defines how the category member shares are used to determine the shares of a job. The shares assigned to the category members, such as a particular user or project, can be replicated for each job. Or shares can be distributed among the jobs under the category member.

Selecting the Share Functional Tickets check box means that functional shares are replicated among jobs.
Clearing the Share Functional Tickets check box means that functional shares are distributed among jobs.

Those shares are comparable to stock shares. Such shares have no effect for the jobs that belong to the same category member. All jobs under the same category member have the same number of shares in both cases. But the share number has an effect when comparing the share amounts within the same category. Jobs with many siblings that belong to the same category member receive relatively small share portions if you select the Share Functional Tickets check box. On the other hand, if you clear the Share Functional Tickets check box, all sibling jobs receive the same share amount as their category member.

Select the Share Functional Tickets check box to give a category member a constant entitlement level for the sum of all its jobs. The entitlement of an individual job can get negligibly small, however, if the job has many siblings.

Clear the Share Functional Tickets check box to give each job the same entitlement level, based on its category member's entitlement. The number of job siblings in the system does not matter.

Note –

A category member with many jobs underneath can dominate the functional policy.

Be aware that the setting of share functional shares does not determine the total number of functional tickets that are distributed. The total number is always as defined by the administrator for the functional policy ticket pool. The share functional shares parameter influences only how functional tickets are distributed within the functional policy.

Example 5–1 Functional Policy Example

The following example describes a common scenario where a user wishes to translate the SGE-5.3 Scheduler Option -user_sort true to an N1GE 6.1 Configuration but does not understand the share override functional policy ticket feature.

For a plain user-based equal share, you configure your global configuration sge_conf(5) with

-enforce_user auto

-auto_user_fshare 100

Then you use -weight_tickets_functional 10000 in the scheduler configuration sched_conf(5). This action causes the functional policy to be used for user-based equal share scheduling with 100 shares for each user.

Tuning Scheduling Run Time

Pending jobs are sorted according to the number of tickets that each job has, as described in Job Sorting. The scheduler reports the number of tickets each pending job has to the master daemon sge_qmaster. However, on systems with very large numbers of jobs, you might want to turn off ticket reporting. When you turn off ticket reporting, you disable ticket-based job priority. The sort order of jobs is based only on the time each job is submitted.

To turn off the reporting of pending job tickets to sge_qmaster, clear the Report Pending Job Tickets check box on the Policy Configuration dialog box. Doing so sets the report_pjob_tickets parameter of sched_conf(5) to false.

Setting the Ticket Policy Hierarchy

Ticket policy hierarchy provides the means to resolve certain cases of conflicting ticket policies. The resolving of ticket policy conflicts applies particularly to pending jobs.

Such cases can occur in combination with the share-based policy and the functional policy. With both policies, assigning priorities to jobs that belong to the same leaf-level entities is done on a first-come-first-served basis. Leaf-level entities include:

User leaves in the share tree
Project leaves in the share tree
Any member of the following categories in the functional policy: user, project, department, or queue

Members of the job category are not included among leaf-level entities. So, for example, the first job of the same user gets the most, the second gets the next most, the third next, and so on.

A conflict can occur if another policy mandates an order that is different. So, for example, the override policy might define the third job as the most important, whereas the first job that is submitted should come last.

A policy hierarchy might gives the override policy higher priority over the share-tree policy or the functional policy. Such a policy hierarchy ensures that high-priority jobs under the override policy get more entitlements than jobs in the other two policies. Such jobs must belong to the same leaf level entity (user or project) in the share tree.

The Ticket Policy Hierarchy can be a combination of up to three letters. These letters are the first letters of the names of the following three ticket policies:

S – Share-based
F – Functional
O – Override

Use these letters to establish a hierarchy of ticket policies. The first letter defines the top policy. The last letter defines the bottom of the hierarchy. Policies that are not listed in the policy hierarchy do not influence the hierarchy. However, policies that are not listed in the hierarchy can still be a source for tickets of jobs. However, those tickets do not influence the ticket calculations in other policies. All tickets of all policies are added up for each job to define its overall entitlement.

The following examples describe two settings and how they influence the order of the pending jobs.

policy_hierarchy=OS

The override policy assigns the appropriate number of tickets to each pending job.
The number of tickets determines the entitlement assignment in the share tree in case two jobs belong to the same user or to the same leaf-level project. Then the share tree tickets are calculated for the pending jobs.
The tickets from the override policy and from the share-tree policy are added together, along with all other active policies not in the hierarchy. The job with the highest resulting number of tickets has the highest entitlement.

policy_hierarchy=OF

The override policy assigns the appropriate number of tickets to each pending job. Then the tickets from the override policy are added up.
The resulting number of tickets influences the entitlement assignment in the functional policy in case two jobs belong to the same functional category member. Based on this entitlement assignment, the functional tickets are calculated for the pending jobs.
The resulting value is added to the ticket amount from the override policy. The job with the highest resulting number of tickets has the highest entitlement.

All combinations of the three letters are theoretically possible, but only a subset of the combinations are meaningful or have practical relevance. The last letter should always be S or F, because only those two policies can be influenced due to their characteristics described in the examples.

The following form is recommended for policy_hierarchy settings:

[O][S|F]

If the override policy is present, O should occur as the first letter only, because the override policy can only influence. The share-based policy and the functional policy can only be influenced. Therefore S or F should occur as the last letter.

Configuring the Share-Based Policy

Share-based scheduling grants each user and project its allocated share of system resources during an accumulation period such as a week, a month, or a quarter. Share-based scheduling is also called share tree scheduling. It constantly adjusts each user's and project's potential resource share for the near term, until the next scheduling interval. Share-based scheduling is defined for user or for project, or for both.

Share-based scheduling ensures that a defined share is guaranteed to the instances that are configured in the share tree over time. Jobs that are associated with share-tree branches where fewer resources were consumed in the past than anticipated are preferred when the system dispatches jobs. At the same time, full resource usage is guaranteed, because unused share proportions are still available for pending jobs associated with other share-tree branches.

By giving each user or project its targeted share as far as possible, groups of users or projects also get their targeted share. Departments or divisions are examples of such groups. Fair share for all entities is attainable only when every entity that is entitled to resources contends for those resources during the accumulation period. If a user, a project, or a group does not submit jobs during a given period, the resources are shared among those who do submit jobs.

Share-based scheduling is a feedback scheme. The share of the system to which any user or user-group, or project or project-group, is entitled is a configuration parameter. The share of the system to which any job is entitled is based on the following factors:

The share allocated to the job's user or project
The accumulated past usage for each user and user group, and for each project and project group. This usage is adjusted by a decay factor. “Old” usage has less impact.

The grid engine software keeps track of how much usage users and projects have already received. At each scheduling interval, the Scheduler adjusts all jobs' share of resources. Doing so ensures that all users, user groups, projects, and project groups get close to their fair share of the system during the accumulation period. In other words, resources are granted or are denied in order to keep everyone more or less at their targeted share of usage.

The Half-Life Factor

Half-life is how fast the system “forgets” about a user's resource consumption. The administrator decides whether to penalize a user for high resource consumption, be it six months ago or six days ago. The administrator also decides how to apply the penalty. On each node of the share tree, grid engine software maintains a record of users' resource consumption.

With this record, the system administrator can decide how far to look back to determine a user's underusage or overusage when setting up a share-based policy. The resource usage in this context is the mathematical sum of all the computer resources that are consumed over a “sliding window of time.”

The length of this window is determined by a “half-life” factor, which in the grid engine system is an internal decay function. This decay function reduces the impact of accrued resource consumption over time. A short half-life quickly lessens the impact of resource overconsumption. A longer half-life gradually lessens the impact of resource overconsumption.

This half-life decay function is a specified unit of time. For example, consider a half-life of seven days that is applied to a resource consumption of 1,000 units. This half-life decay factor results in the following usage “penalty” adjustment over time.

500 after 7 days
250 after 14 days
125 after 21 days
62.5 after 28 days

The half-life-based decay diminishes the impact of a user's resource consumption over time, until the effect of the penalty is negligible.

Note –

Override tickets that a user receives are not subjected to a past usage penalty, because override tickets belong to a different policy system. The decay function is a characteristic of the share-tree policy only.

Compensation Factor

Sometimes the comparison shows that actual usage is well below targeted usage. In such a case, the adjusting of a user's share or a project's share of resource can allow a user to dominate the system. Such an adjustment is based on the goal of reaching target share. This domination might not be desirable.

The compensation factor enables an administrator to limit how much a user or a project can dominate the resources in the near term.

For example, a compensation factor of two limits a user's or project's current share to twice its targeted share. Assume that a user or a project should get 20 percent of the system resources over the accumulation period. If the user or project currently gets much less, the maximum it can get in the near term is only 40 percent.

The share-based policy defines long-term resource entitlements of users or projects as per the share tree. When combined with the share-based policy, the compensation factor makes automatic adjustments in entitlements.

If a user or project is either under or over the defined target entitlement, the grid engine system compensates. The system raises or lowers that user's or project's entitlement for a short term over or under the long-term target. This compensation is calculated by a share tree algorithm.

The compensation factor provides an additional mechanism to control the amount of compensation that the grid engine system assigns. The additional compensation factor (CF) calculation is carried out only if the following conditions are true:

Short-term-entitlement is greater than long-term-entitlement multiplied by the CF
The CF is greater than 0

If either condition is not true, or if both conditions are not true, the compensation as defined and implemented by the share-tree algorithm is used.

The smaller the value of the CF, the greater is its effect. If the value is greater than 1, the grid engine system's compensation is limited. The upper limit for compensation is calculated as long-term-entitlement multiplied by the CF. And as defined earlier, the short-term entitlement must exceed this limit before anything happens based on the compensation factor.

If the CF is 1, the grid engine system compensates in the same way as with the raw share-tree algorithm. So a value of one has an effect that is similar to a value of zero. The only difference is an implementation detail. If the CF is one, the CF calculations are carried out without an effect. If the CF is zero, the calculations are suppressed.

If the value is less than 1, the grid engine system overcompensates. Jobs receive much more compensation than they are entitled to based on the share-tree algorithm. Jobs also receive this overcompensation earlier, because the criterion for activating the compensation is met at lower short-term entitlement values. The activating criterion is short-term-entitlement > long-term-entitlement * CF.

Hierarchical Share Tree

The share-based policy is implemented through a hierarchical share tree. The share tree specifies, for a moving accumulation period, how system resources are to be shared among all users and projects. The length of the accumulation period is determined by a configurable decay constant. The grid engine system bases a job's share entitlement on the degree to which each parent node in the share tree reaches its accumulation limit. A job's share entitlement is based on its leaf node share allocation, which in turn depends on the allocations of its parent nodes. All jobs associated with a leaf node split the associated shares.

The entitlement derived from the share tree is combined with other entitlements, such as entitlements from a functional policy, to determine a job's net entitlement. The share tree is allotted the total number of tickets for share-based scheduling. This number determines the weight of share-based scheduling among the four scheduling policies.

The share tree is defined during installation. The share tree can be altered at any time. When the share tree is edited, the new share allocations take effect at the next scheduling interval.

Configuring the Share-Tree Policy With `QMON`

On the QMON Policy Configuration dialog box (Figure 5–1), click Share Tree Policy. The Share Tree Policy dialog box appears.

Node Attributes

Under Node Attributes, the attributes of the selected node are displayed:

Identifier. A user, project, or agglomeration name.
Shares. The number of shares that are allocated to this user or project.

Note –
Shares define relative importance. They are not percentages. Shares also do not have quantitative meaning. The specification of hundreds or even thousands of shares is generally a good idea, as high numbers allow fine tuning of importance relationships.
Level Percentage. This node's portion of the total shares at the level of the same parent node in the tree. The number of this node's shares divided by the sum of its and its sibling's shares.
Total Percentage. This node's portion of the total shares in the entire share tree. The long-term targeted resource share of the node.
Actual Resource Usage. The percentage of all the resources in the system that this node has consumed so far in the accumulation period. The percentage is expressed in relation to all nodes in the share tree.
Targeted Resource Usage. Same as Actual Resource Usage, but only taking the currently active nodes in the share tree into account. Active nodes have jobs in the system. In the short term, the grid engine system attempts to balance the entitlement among active nodes.
Combined Usage. The total usage for the node. Combined Usage is the sum of the usage that is accumulated at this node. Leaf nodes accumulate the usage of all jobs that run under them. Inner nodes accumulate the usage of all descendant nodes. Combined Usage includes CPU, memory, and I/O usage according to the ratio specified under Share Tree Policy Parameters. Combined usage is decayed at the half-life decay rate that is specified by the parameters.

When a user node or a project node is removed and then added back, the user's or project's usage is retained. A node can be added back either at the same place or at a different place in the share tree. You can zero out that usage before you add the node back to the share tree. To do so, first remove the node from the users or projects configured in the grid engine system. Then add the node back to the users or projects there.

Users or projects that were not in the share tree but that ran jobs have nonzero usage when added to the share tree. To zero out usage when you add such users or projects to the tree, first remove them from the users or projects configured in the grid engine system. Then add them to the tree.

To add an interior node under the selected node, click Add Node. A blank Node Info window appears, where you can enter the node's name and number of shares. You can enter any node name or share number.

To add a leaf node under the selected node, click Add Leaf. A blank Node Info window appears, where you can enter the node's name and number of shares. The node's name must be an existing grid engine user (Configuring User Objects With QMON) or project (Defining Projects)

The following rules apply when you are adding a leaf node:

All nodes have a unique path in share tree.
A project is not referenced more than once in share tree.
A user appears only once in a project subtree.
A user appears only once outside of a project subtree.
A user does not appear as a nonleaf node.
All leaf nodes in a project subtree reference a known user or the reserved name default. See a detailed description of this special user in About the Special User default.
Project subtrees do not have subprojects.
All leaf nodes not in a project subtree reference a known user or known project.
All user leaf nodes in a project subtree have access to the project.

To edit the selected node, click Modify. A Node Info window appears. The window displays the mode's name and its number of shares.

To cut or copy the selected node to a buffer, click Cut or Copy. To Paste under the selected node the contents of the most recently cut or copied node, click Paste.

To delete the selected node and all its descendents, click Delete.

To clear the entire share-tree hierarchy, click Clear Usage. Clear the hierarchy when the share-based policy is aligned to a budget and needs to start from scratch at the beginning of each budget term. The Clear Usage facility also is handy when setting up or modifying test N1 Grid Engine 6.1 software environments.

QMON periodically updates the information displayed in the Share Tree Policy dialog box. Click Refresh to force the display to refresh immediately.

To save all the node changes that you make, click Apply. To close the dialog box without saving changes, click Done.

To search the share tree for a node name, click Find, and then type a search string. Node names are indicated which begin with the case sensitive search string. Click Find Next to find the next occurrence of the search string.

Click Help to open the online help system.

Share Tree Policy Parameters

To display the Share Tree Policy Parameters, click the arrow at the right of the Node Attributes.

CPU [%] slider — This slider's setting indicates what percentage of Combined Usage CPU is. When you change this slider, the MEM and I/O sliders change to compensate for the change in CPU percentage.
MEM [%] slider — This slider's setting indicates what percentage of Combined Usage memory is. When you change this slider, the CPU and I/O sliders change to compensate for the change in MEM percentage.
I/O [%] slider — This slider's setting indicates what percentage of Combined Usage I/O is. When you change this slider, the CPU and MEM sliders change to compensate for the change in I/O percentage.

Note –
CPU [%], MEM [%], and I/O [%] always add up to 100%
Lock Symbol — When a lock is open, the slider that it guards can change freely. The slider can change either because the slider was moved or because it is compensating for another slider's being moved.

When a lock is closed, the slider that it guards cannot change. If two locks are closed and one lock is open, no sliders can be changed.
Half-life — Use this field to specify the half-life for usage. Usage is decayed during each scheduling interval so that any particular contribution to accumulated usage has half the value after a duration of half-life.
Days/Hours selection menu — Select whether half-life is to be measured in days or hours.
Compensation Factor — This field accepts a positive integer for the compensation factor. Reasonable values are in the range between 2 and 10.

The actual usage of a user or project can be far below its targeted usage. The compensation factor prevents such users or projects from dominating resources when they first get those resources. See Compensation Factor for more information.

About the Special User `default`

You can use the special user default to reduce the amount of share-tree maintenance for sites with many users. Under the share-tree policy, a job's priority is determined based on the node the job maps to in the share tree. Users who are not explicitly named in the share tree are mapped to the default node, if it exists.

The specification of a single default node allows for a simple share tree to be created. Such a share tree makes user-based fair sharing possible.

You can use the default user also in cases where the same share entitlement is assigned to most users. Same share entitlement is also known as equal share scheduling.

The default user configures all user entries under the default node, giving the same share amount to each user. Each user who submits jobs receives the same share entitlement as that configured for the default user. To activate the facility for a particular user, you must add this user to the list of grid engine users.

The share tree displays “virtual” nodes for all users who are mapped to the default node. The display of virtual nodes enables you to examine the usage and the fair-share scheduling parameters for users who are mapped to the default node.

You can also use the default user for “hybrid” share trees, where users are subordinated under projects in the share tree. The default user can be a leaf node under a project node.

The short-term entitlements of users vary according to differences in the amount of resources that the users consume. However, long-term entitlements of users remain the same.

You might want to assign lower or higher entitlements to some users while maintaining the same long-term entitlement for all other users. To do so, configure a share tree with individual user entries next to the default user for those users with special entitlements.

In Example A, all users submitting to Project A get equal long-term entitlements. The users submitting to Project B only contribute to the accumulated resource consumption of Project B. Entitlements of Project B users are not managed.

Example 5–2 Example A

Diagram shows project nodes A and B. Project A has the
default user as a leaf node. Project B has no leaf nodes.

Compare Example A with Example B:

Example 5–3 Example B

Diagram shows project nodes A and B. Project A has the
default user as a leaf node. Project B has three leaf nodes: default, and
Users A and B.

In Example B, treatment for Project A is the same as for Example A. But all default users who submit jobs to Project B, except users A and B, receive equal long-term resource entitlements. Default users have 20 shares. User A, with 10 shares, receives half the entitlement of the default users. User B, with 40 shares, receives twice the entitlement as the default users.

Configuring the Share-Based Policy From the Command Line

Note –

Use QMON to configure the share tree policy, because a hierarchical tree is well-suited for graphical display and for editing. However, if you need to integrate share tree modifications in shell scripts, for example, you can use the qconf command and its options.

To configure the share-based policy from the command line, use the qconf command with appropriate options.

The qconf options -astree, -mstree, -dstree, and -sstree, enable you to do the following:
- Add a new share tree
- Modify an existing share tree
- Delete a share tree
- Display the share tree configuration
See the qconf(1) man page for details about these options. The share_tree(5) man page contains a description of the format of the share tree configuration.
The -astnode, -mstnode, -dstnode, and -sstnode options do not address the entire share tree, but only a single node. The node is referenced as path through all parent nodes down the share tree, similar to a directory path. The options enable you to add, modify, delete, and display a node. The information contained in a node includes its name and the attached shares.
The weighting of the usage parameters CPU, memory, and I/O are contained in the scheduler configuration as usage_weight. The weighting of the half-life is contained in the scheduler configuration as halftime. The compensation factor is contained in the scheduler configuration as compensation_factor. You can access the scheduler configuration from the command line by using the -msconf and the -ssconf options of qconf. See the sched_conf(5) man page for details about the format.

How to Create Project-Based Share-Tree Scheduling

The objective of this setup is to guarantee a certain share assignment of all the cluster resources to different projects over time.

Specify the number of share-tree tickets (for example, 1000000) in the scheduler configuration.

See Configuring Policy-Based Resource Management With QMON, and the sched_conf(5) man page.

(Optional) Add one user for each scheduling-relevant user.

See Configuring User Objects With QMON, and the user(5) man page.

Add one project for each scheduling-relevant project.

See Defining Projects With QMON, and the project(5) man page.

Use QMON to set up a share tree that reflects the structure of all scheduling-relevant projects as nodes.

See Configuring the Share-Tree Policy With QMON.

Assign share tree shares to the projects.

For example, if you are creating project-based share-tree scheduling with first-come, first-served scheduling among jobs of the same project, a simple structure might look like the following:

If you are creating project-based share-tree scheduling with equal shares for each user, a simple structure might look like the following:

If you are creating project-based share-tree scheduling with individual user shares in each project, add users as leaves to their projects. Then assign individual shares. A simple structure might look like the following:

If you want to assign individual shares to only a few users, designate the user default in combination with individual users below a project node. For example, you can condense the tree illustrated previously into the following:

Configuring the Functional Policy

Functional scheduling is a nonfeedback scheme for determining a job's importance. Functional scheduling associates a job with the submitting user, project, department, and job class. Functional scheduling is sometimes called priority scheduling. The functional policy setup ensures that a defined share is guaranteed to each user, project, or department at any time. Jobs of users, projects, or departments that have used fewer resources than anticipated are preferred when the system dispatches jobs to idle resources.

At the same time, full resource usage is guaranteed, because unused share proportions are distributed among those users, projects, and departments that need the resources. Past resource consumption is not taken into account.

Functional policy entitlement to system resources is combined with other entitlements in determining a job's net entitlement. For example, functional policy entitlement might be combined with share-based policy entitlement.

The total number of tickets that are allotted to the functional policy determines the weight of functional scheduling among the three scheduling policies. During installation, the administrator divides the total number of functional tickets among the functional categories of user, department, project, job, and job class.

Functional Shares

Functional shares are assigned to every member of each functional category: user, department, project, job, and job class. These shares indicate what proportion of the tickets for a category each job associated with a member of the category is entitled to. For example, user davidson has 200 shares, and user donlee has 100. A job submitted by davidson is entitled to twice as many user-functional-tickets as donlee's job, no matter how many tickets there are.

The functional tickets that are allotted to each category are shared among all the jobs that are associated with a particular category.

Configuring the Functional Share Policy With `QMON`

At the bottom of the QMON Policy Configuration dialog box, click Functional Policy. The Functional Policy dialog box appears.

Dialog box titled Functional Policy. Shows category list,
category members, and ratio of tickets. Shows Refresh, Apply, Done, and Help
buttons.

Function Category List

Select the functional category for which you are defining functional shares: user, project, department, or job.

Functional Shares Table

The table under Functional Shares is scrollable. The table displays the following information:

A list of the members of the category currently selected from the Function Category list.
The number of functional shares for each member of the category. Shares are used as a convenient indication of the relative importance of each member of the functional category. You can edit this field.
The percentage of the functional share allocation for this category of functional ticket that this number of functional shares represents. This field is a feedback device and is not editable.

QMON periodically updates the information displayed in the Functional Policy dialog box. Click Refresh to force the display to refresh immediately.

To save all node changes that you make, click Apply. To close the dialog box without saving changes, click Done.

Changing Functional Configurations

Click the jagged arrow above the Functional Shares table to open a configuration dialog box.

For User functional shares, the User Configuration dialog box appears. Use the User tab to switch to the appropriate mode for changing the configuration of grid engine users. See Configuring User Objects With QMON.
For Department functional shares, the User Configuration dialog box appears. Use the Userset tab to switch to the appropriate mode for changing the configuration of departments that are represented as usersets. See Defining Usersets As Projects and Departments.
For Project functional shares, the Project Configuration dialog box appears. See Defining Projects With QMON.
For Job functional shares, the Job Control dialog box appears. See Monitoring and Controlling Jobs With QMON in Sun N1 Grid Engine 6.1 User’s Guide.

Ratio Between Sorts of Functional Tickets

To display the Ratio Between Sorts Of Functional Tickets, click the arrow at the right of the Functional Shares table .

User [%], Department [%], Project [%], Job [%] and Job Class [%] always add up to 100%.

When you change any of the sliders, all other unlocked sliders change to compensate for the change.

When a lock is open, the slider that it guards can change freely. The slider can change either because it is moved or because the moving of another slider causes this slider to change. When a lock is closed, the slider that it guards cannot change. If four locks are closed and one lock is open, no sliders can change.

User slider – Indicates the percentage of the total functional tickets to allocate to the users category
Departments slider – Indicates the percentage of the total functional tickets to allocate to the departments category
Project slider – Indicates the percentage of the total functional tickets to allocate to the projects category
Job slider – Indicates the percentage of the total functional tickets to allocate to the jobs category

Configuring the Functional Share Policy From the Command Line

Note –

You can assign functional shares to jobs only using QMON. No command-line interface is available for this function.

To configure the functional share policy from the command line, use the qconf command with the appropriate options.

Use the qconf -muser command to configure the user category. The -muser option modifies the fshare parameter of the user entry file. See the user(5) man page for information about the user entry file.
Use the qconf -mu command to configure the department category. The -mu option modifies the fshare parameter of the access list file. See the access_list(5) man page for information about the access list file, which is used to represent departments.
Use the qconf -mprj command to configure the project category. The -mprj option modifies the fshare parameter of the project entry file. See the project(5) man page for information about the project entry file.
Use the qconf -mq command to configure the job class category. The -mq option modifies the fshare parameter of the queue configuration file. See the queue_conf(5) man page for information about the queue configuration file, which is used to represent job classes.
The weighting between different categories is defined in the scheduler configuration sched_conf and can be changed using qconf -msconf. The parameters to change are weight_user, weight_department, weight_project, weight_job, and weight_jobclass. The parameter values range between 0 and 1, and the total sum of parameters must add up to 1.

How to Create User-Based, Project-Based, and Department-Based Functional Scheduling

Use this setup to create a certain share assignment of all the resources in the cluster to different users, projects, or departments. First-come, first-served scheduling is used among jobs of the same user, project, or department.

In the Scheduler Configuration dialog box, select the Share Functional Tickets check box.

See Sharing Functional Ticket Shares, and the sched_conf(5) man page.

Specify the number of functional tickets (for example, 1000000) in the scheduler configuration.

See Configuring Policy-Based Resource Management With QMON, and the sched_conf(5) man page.

Add scheduling-relevant items:
- Add one user for each scheduling-relevant user.
  
  See Configuring User Objects With QMON, and the user(5) man page.
- Add one project for each scheduling-relevant project.
  
  See Defining Projects With QMON, and the project(5) man page.
- Add each scheduling-relevant department.

Assign functional shares to each user, project, or department.

See Configuring User Access Lists With QMON, and the access_list(5) man page.

Assign the shares as a percentage of the whole. Examples follow:

For users:
- UserA (10)
- UserB (20)
- UserC (20)
- UserD (20)
For projects:
- ProjectA (55)
- ProjectB (45)
For departments:
- DepartmentA (90)
- DepartmentB (5)
- DepartmentC (5)

Configuring the Override Policy

Override scheduling enables a grid engine system manager or operator to dynamically adjust the relative importance of one job or of all jobs that are associated with a user, a department, a project, or a job class. This adjustment adds tickets to the specified job, user, department, project, or job class. By adding override tickets, override scheduling increases the total number of tickets that a user, department, project, or job has. As a result, the overall share of resources is increased.

The addition of override tickets also increases the total number of tickets in the system. These additional tickets deflate the value of every job's tickets.

You can use override tickets for the following two purposes:

To temporarily override the share-based policy or the functional policy without having to change the configuration of these policies.
To establish resource entitlement levels with an associated fixed amount of tickets. The establishment of entitlement levels is appropriate for scenarios like high, medium, or low job classes, or high, medium, or low priority classes.

Override tickets that are assigned directly to a job go away when the job finishes. All other tickets are inflated back to their original value. Override tickets that are assigned to users, departments, projects, and job classes remain in effect until the administrator explicitly removes the tickets.

The Policy Configuration dialog box displays the current number of override tickets that are active in the system.

Note –

Override entries remain in the Override dialog box. These entries can influence subsequent work if they are not explicitly deleted by the administrator when they are no longer needed.

Configuring the Override Policy With `QMON`

At the bottom of the QMON Policy Configuration dialog box, click Override Policy. The Override Policy dialog box appears.

Dialog box titled Override Policy. Shows category list
and category members. Shows Refresh, Apply, Done, and Help buttons.

Override Category List

Select the category for which you are defining override tickets: user, project, department, or job.

Override Table

The override table is scrollable. It displays the following information:

A list of the members of the category for which you are defining tickets. The categories are user, project, department, job, and job class.
The number of override tickets for each member of the category. This field is editable.

QMON periodically updates the information that is displayed in the Override Policy dialog box. Click Refresh to force the display to refresh immediately.

To save all override changes that you make, click Apply. To close the dialog box without saving changes, click Done.

Changing Override Configurations

Click the jagged arrow above the override table to open a configuration dialog box.

For User override tickets, the User Configuration dialog box appears. Use the User tab to switch to the appropriate mode for changing the configuration of grid engine users. See Configuring User Objects With QMON.
For Department override tickets, the User Configuration dialog box appears. Use the Userset tab to switch to the appropriate mode for changing the configuration of departments that are represented as usersets. See Defining Usersets As Projects and Departments.
For Project override tickets, the Project Configuration dialog box appears. See Defining Projects With QMON.
For Job override tickets, the Job Control dialog box appears. See Monitoring and Controlling Jobs With QMON in Sun N1 Grid Engine 6.1 User’s Guide.

Configuring the Override Policy From the Command Line

Note –

You can assign override tickets to jobs only using QMON. No command line interface is available for this function.

To configure the override policy from the command line, use the qconf command with the appropriate options.

Use the qconf -muser command to configure the user category. The -muser option modifies the oticket parameter of the user entry file. See the user(5) man page for information about the user entry file.
Use the qconf -mu command to configure the department category. The -mu option modifies the oticket parameter of the access list file. See the access_list(5) man page for information about the access list file, which is used to represent departments.
Use the qconf -mprj command to configure the project category. The -mprj option modifies the oticket parameter of the project entry file. See the project(5) man page for information about the project entry file.
Use the qconf -mq command to configure the job class category. The -mq option modifies the oticket parameter of the queue configuration file. See the queue_conf(5) man page for information about the queue configuration file, which is used to represent job classes.

Chapter 5 Managing Policies and the Scheduler

Administering the Scheduler

About Scheduling

Scheduling Strategies

Dynamic Resource Management

Tickets

Queue Sorting

Job Sorting

About the Urgency Policy

Resource Reservation and Backfilling

What Happens in a Scheduler Interval

Scheduler Monitoring

Configuring the Scheduler

Default Scheduling

Scheduling Alternatives

Changing the Scheduling Algorithm

Scaling System Load

Selecting Queue by Sequence Number

Selecting Queue by Share

Restricting the Number of Jobs per User or Group

Changing the Scheduler Configuration With QMON

Administering Policies

Configuring Policy-Based Resource Management With QMON

Figure 5–1 Policy Configuration Dialog Box

Specifying Policy Priority

Configuring the Urgency Policy

Configuring Ticket-Based Policies

Editing Tickets

Sharing Override Tickets

Sharing Functional Ticket Shares

Example 5–1 Functional Policy Example

Tuning Scheduling Run Time

Setting the Ticket Policy Hierarchy

Configuring the Share-Based Policy

The Half-Life Factor

Compensation Factor

Hierarchical Share Tree

Configuring the Share-Tree Policy With QMON

Node Attributes

Share Tree Policy Parameters

About the Special User default

Example 5–2 Example A

Example 5–3 Example B

Configuring the Share-Based Policy From the Command Line

How to Create Project-Based Share-Tree Scheduling

Configuring the Functional Policy

Functional Shares

Configuring the Functional Share Policy With QMON

Function Category List

Functional Shares Table

Changing Functional Configurations

Ratio Between Sorts of Functional Tickets

Configuring the Functional Share Policy From the Command Line

How to Create User-Based, Project-Based, and Department-Based Functional Scheduling

Configuring the Override Policy

Configuring the Override Policy With QMON

Override Category List

Override Table

Changing Override Configurations

Configuring the Override Policy From the Command Line

Changing the Scheduler Configuration With `QMON`

Configuring Policy-Based Resource Management With `QMON`

Configuring the Share-Tree Policy With `QMON`

About the Special User `default`

Configuring the Functional Share Policy With `QMON`

Configuring the Override Policy With `QMON`