Sun N1 Grid Engine 6.1 Administration Guide

Chapter 5 Managing Policies and the Scheduler

This chapter contains information about grid engine system policies. Topics in this chapter include the following:

In addition to the background information, this chapter includes detailed instructions on how to accomplish the following tasks:

Administering the Scheduler

This section describes how the grid engine system schedules jobs for execution. The section describes different types of scheduling strategies and explains how to configure the scheduler.

About Scheduling

The grid engine system includes the following job-scheduling activities:

The grid engine software schedules jobs across a heterogeneous cluster of computers, based on the following criteria:

Decisions about scheduling are based on the strategy for the site and on the instantaneous load characteristics of each computer in the cluster. A site's scheduling strategy is expressed through the grid engine system's configuration parameters. Load characteristics are ascertained by collecting performance data as the system runs.

Scheduling Strategies

The administrator can set up strategies with respect to the following scheduling tasks:

Dynamic Resource Management

The grid engine software uses a weighted combination of the following three ticket-based policies to implement automated job scheduling strategies:

You can set up the grid engine system to routinely use a share-based policy, a functional policy, or both. You can weight the two policies in any combination. For example, you could give zero weight to one policy and use only the second policy. Or you could give both policies equal weight.

Along with routine policies, administrators can also override share-based and functional scheduling temporarily or, for certain purposes such as express queues, permanently. You can apply an override to one job or to all jobs associated with a user, a department, a project, or a job class (that is, a queue).

In addition to the three policies for mediating among all jobs, the grid engine system sometimes lets users set priorities among the jobs they own. For example, a user might say that jobs one and two are equally important, but that job three is more important than either job one or job two. Users can set their own job priorities if the combination of policies includes the share-based policy, the functional policy, or both. Also, functional tickets must be granted to jobs.

Tickets

The share-based, functional, and override scheduling policies are implemented with tickets. Each policy has a pool of tickets. A policy allocates tickets to jobs as the jobs enter the multimachine grid engine system. Each routine policy that is in force allocates some tickets to each new job. The policy might also reallocate tickets to running jobs at each scheduling interval.

Tickets weight the three policies. For example, if no tickets are allocated to the functional policy, that policy is not used. If the functional ticket pool and the share-based ticket pool have an equal number of tickets, both policies have equal weight in determining a job's importance.

Tickets are allocated to the routine policies at system configuration by grid engine system managers. Managers and operators can change ticket allocations at any time with immediate effect. Additional tickets are injected into the system temporarily to indicate an override. Policies are combined by assignment of tickets. When tickets are allocated to multiple policies, a job gets a portion of each policy's tickets, which indicates the job's importance in each policy in force.

The grid engine system grants tickets to jobs that are entering the system to indicate their importance under each policy in force. At each scheduling interval, each running job can gain tickets, lose tickets, or keep the same number of tickets. For example, a job might gain tickets from an override. A job might lose tickets because it is getting more than its fair share of resources. The number of tickets that a job holds represents the resource share that the grid engine system tries to grant that job during each scheduling interval.

You configure a site's dynamic resource management strategy during installation. First, you allocate tickets to the share-based policy and to the functional policy. You then define the share tree and the functional shares. The share-based ticket allocation and the functional ticket allocation can change automatically at any time. The administrator manually assigns or removes tickets.

Queue Sorting

The following means are provided to determine the order in which the grid engine system attempts to fill up queues:

Job Sorting

Before the grid engine system starts to dispatch jobs, the jobs are brought into priority order, highest priority first. The system then attempts to find suitable resources for the jobs in priority sequence.

Without any administrator influence, the order is first-in-first-out (FIFO). The administrator has the following means to control the job order:

For each priority type, a weighting factor can be specified. This weighting factor determines the degree to which each type of priority affects overall job priority. To make it easier to control the range of values for each priority type, normalized values are used instead of the raw ticket values, urgency values, and POSIX priority values.

The following formula expresses how a job's priority values are determined:


job_priority = weight_urgency * normalized_urgency_value + 
weight_ticket * normalized_ticket_value + 
weight_priority * normalized_POSIX_priority_value

You can use the qstat command to monitor job priorities:
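For example, the following qstat options display the values that enter this formula. This is only a sketch of common usage; see the qstat(1) man page for the exact columns that each option adds.

qstat -pri     # show the overall priority and its normalized POSIX priority, urgency, and ticket components
qstat -urg     # show urgency values and their resource, waiting-time, and deadline contributions
qstat -ext     # show the tickets granted by the override, functional, and share-based policies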

About the Urgency Policy

The urgency policy defines an urgency value for each job. The urgency value is derived from the sum of three contributions:

The resource requirement contribution is derived from the sum of all hard resource requests, one addend for each request.

If the resource request is of the type numeric, the resource request addend is the product of the following three elements:

If the resource request is of the type string, the resource request addend is the resource's urgency value as defined in the complex.
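For example, the urgency value of a resource is defined in the last column of its entry in the complex configuration, which you can display with qconf -sc and edit with qconf -mc. The entry below is only an illustration; the license resource and its values are placeholders:

#name       shortcut   type   relop   requestable   consumable   default   urgency
license     lic        INT    <=      YES           YES          0         1000

With this entry, every hard request for license adds a contribution that scales with the urgency value 1000.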

The waiting time contribution is the product of the job's waiting time, in seconds, and the waiting-weight value specified in the Policy Configuration dialog box.

The deadline contribution is zero for jobs without a deadline. For jobs with a deadline, the deadline contribution is the weight-deadline value, which is defined in the Policy Configuration dialog box, divided by the free time, in seconds, until the deadline initiation time.

For information about configuring the urgency policy, see Configuring the Urgency Policy.

Resource Reservation and Backfilling

Resource reservation enables you to reserve system resources for specified pending jobs. When you reserve resources for a job, those resources are blocked from being used by jobs with lower priority.

Jobs can reserve resources depending on criteria such as resource requirements, job priority, waiting time, resource sharing entitlements, and so forth. The scheduler enforces reservations in such a way that jobs with the highest priority get the earliest possible resource assignment. This avoids such well-known problems as “job starvation”.

You can use resource reservation to guarantee that resources are dedicated to jobs in job-priority order.

Consider the following example. Job A is a large pending job, possibly parallel, that requires a large amount of a particular resource. A stream of smaller jobs B(i) requires a smaller amount of the same resource. Without resource reservation, a resource assignment for job A cannot be guaranteed, assuming that the stream of B(i) jobs does not stop. The resource cannot be guaranteed even though job A has a higher priority than the B(i) jobs.

With resource reservation, job A gets a reservation that blocks the lower priority jobs B(i). Resources are guaranteed to be available for job A as soon as possible.

Backfilling enables a lower-priority job to use resources that are blocked due to a resource reservation. Backfilling works only if there is a runnable job whose prospective run time is small enough to allow the blocked resource to be used without interfering with the original reservation.

In the example described earlier, a job C, of very short duration, could use backfilling to start before job A.

Because resource reservation causes the scheduler to look ahead, using resource reservation affects system performance. In a small cluster, the effect on performance is negligible when there are only a few pending jobs. In larger clusters, however, and in clusters with many pending jobs, the effect on performance might be significant.

To offset this potential performance degradation, you can limit the overall number of resource reservations that can be made during a scheduling interval. You can limit resource reservation in two ways:

You can configure the scheduler to monitor how it is influenced by resource reservation. When you monitor the scheduler, information about each scheduling run is recorded in the file sge-root/cell/common/schedule.
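A sketch of how monitoring and reservation limits might be configured, assuming the params and max_reservation attributes of sched_conf(5); the -R y submit option shown below is the per-job side of the same mechanism:

qconf -msconf
#   max_reservation   20           # reserve resources for at most 20 jobs per scheduling interval
#   params            MONITOR=1    # record each scheduling run in sge-root/cell/common/schedule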

The following example shows what schedule monitoring does. Assume that the following sequence of jobs is submitted to a cluster where the global license consumable resource is limited to 5 licenses:


qsub -N L4_RR -R y -l h_rt=30,license=4 -p 100 $SGE_ROOT/examples/jobs/sleeper.sh 20
qsub -N L5_RR -R y -l h_rt=30,license=5        $SGE_ROOT/examples/jobs/sleeper.sh 20
qsub -N L1_RR -R y -l h_rt=31,license=1        $SGE_ROOT/examples/jobs/sleeper.sh 20

Assume that the default priority settings in the scheduler configuration are being used:


weight_priority          1.000000
weight_urgency           0.100000
weight_ticket            0.010000

The -p 100 priority of job L4_RR supersedes the license-based urgency, which results in the following prioritization:


job-ID  prior    name
---------------------
   3127 1.08000 L4_RR
   3128 0.10500 L5_RR
   3129 0.00500 L1_RR

In this case, traces of these jobs can be found in the schedule file for 6 schedule intervals:


::::::::
      3127:1:STARTING:1077903416:30:G:global:license:4.000000
      3127:1:STARTING:1077903416:30:Q:all.q@carc:slots:1.000000
      3128:1:RESERVING:1077903446:30:G:global:license:5.000000
      3128:1:RESERVING:1077903446:30:Q:all.q@bilbur:slots:1.000000
      3129:1:RESERVING:1077903476:31:G:global:license:1.000000
      3129:1:RESERVING:1077903476:31:Q:all.q@es-ergb01-01:slots:1.000000
      ::::::::
      3127:1:RUNNING:1077903416:30:G:global:license:4.000000
      3127:1:RUNNING:1077903416:30:Q:all.q@carc:slots:1.000000
      3128:1:RESERVING:1077903446:30:G:global:license:5.000000
      3128:1:RESERVING:1077903446:30:Q:all.q@es-ergb01-01:slots:1.000000
      3129:1:RESERVING:1077903476:31:G:global:license:1.000000
      3129:1:RESERVING:1077903476:31:Q:all.q@es-ergb01-01:slots:1.000000
      ::::::::
      3128:1:STARTING:1077903448:30:G:global:license:5.000000
      3128:1:STARTING:1077903448:30:Q:all.q@carc:slots:1.000000
      3129:1:RESERVING:1077903478:31:G:global:license:1.000000
      3129:1:RESERVING:1077903478:31:Q:all.q@bilbur:slots:1.000000
      ::::::::
      3128:1:RUNNING:1077903448:30:G:global:license:5.000000
      3128:1:RUNNING:1077903448:30:Q:all.q@carc:slots:1.000000
      3129:1:RESERVING:1077903478:31:G:global:license:1.000000
      3129:1:RESERVING:1077903478:31:Q:all.q@es-ergb01-01:slots:1.000000
      ::::::::
      3129:1:STARTING:1077903480:31:G:global:license:1.000000
      3129:1:STARTING:1077903480:31:Q:all.q@carc:slots:1.000000
      ::::::::
      3129:1:RUNNING:1077903480:31:G:global:license:1.000000
      3129:1:RUNNING:1077903480:31:Q:all.q@carc:slots:1.000000

Each section shows, for a schedule interval, all resource usage that was taken into account. RUNNING entries show usage of jobs that were already running at the start of the interval. STARTING entries show the immediate uses that were decided within the interval. RESERVING entries show uses that are planned for the future, that is, reservations.

The format of the schedule file is as follows:

jobID

The job ID

taskID

The array task ID, or 1 in the case of nonarray jobs

state

Can be RUNNING, SUSPENDED, MIGRATING, STARTING, RESERVING

start-time

Start time in seconds after 1.1.1970

duration

Assumed job duration in seconds

level-char

Can be P (for parallel environment), G (for global), H (for host), or Q (for queue)

object-name

The name of the parallel environment, host, or queue

resource-name

The name of the consumable resource

usage

The resource usage incurred by the job

The line :::::::: marks the beginning of a new schedule interval.


Note –

The schedule file is not truncated. Be sure to turn monitoring off if you do not have an automated procedure that is set up to truncate the file.


What Happens in a Scheduler Interval

The Scheduler schedules work in intervals. Between scheduling actions, the grid engine system keeps information about significant events such as the following:

When scheduling occurs, the scheduler first does the following:

Then the grid engine system does the following tasks, as needed:

If share-based scheduling is used, the calculation takes into account the usage that has already occurred for that user or project.

If scheduling is not at least in part share-based, the calculation ranks all the jobs running and waiting to run. The calculation then takes the most important job until the resources in the cluster (CPU, memory, and I/O bandwidth) are used as fully as possible.

Scheduler Monitoring

If the reasons why a job does not get started are unclear to you, run the qalter -w v command for the job. The grid engine software assumes an empty cluster and checks whether any queue that is suitable for the job is available.

Further information can be obtained by running the qstat -j job-id command. This command prints a summary of the job's request profile. The summary also includes the reasons why the job was not scheduled in the last scheduling run. Running the qstat -j command without a job ID summarizes the reasons for all jobs not being scheduled in the last scheduling interval.


Note –

Collection of job scheduling information must be switched on in the scheduler configuration sched_conf(5). Refer to the schedd_job_info parameter description in the sched_conf(5) man page, or to Changing the Scheduler Configuration With QMON.


To retrieve even more detail about the decisions of the scheduler sge_schedd, use the -tsm option of the qconf command. This command forces sge_schedd to write trace output to a file.
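The following commands summarize this diagnostic workflow. The job ID 3127 is only an example:

qalter -w v 3127     # check whether any suitable queue would be available in an empty cluster
qstat -j 3127        # show the job's request profile and why it was not scheduled in the last run
qstat -j             # summarize scheduling obstacles for all pending jobs
qconf -tsm           # trigger a scheduling run and force sge_schedd to write trace output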

Configuring the Scheduler

Refer to Configuring Policy-Based Resource Management With QMON for details on the scheduling administration of resource-sharing policies of the grid engine system. The following sections focus on administering the scheduler configuration sched_conf and related issues.

Default Scheduling

The default scheduling is a first-in-first-out policy. In other words, the first job that is submitted is the first job the scheduler examines in order to dispatch it to a queue. If the first job in the list of pending jobs finds a queue that is suitable and available, that job is started first. A job ranked behind the first job can be started first only if the first job fails to find a suitable free resource.

The default strategy is to select queue instances on the least-loaded host, provided that the queues deliver suitable service for the job's resource requirements. If several suitable queues share the same load, the queue to be selected is unpredictable.

Scheduling Alternatives

You can modify the job scheduling and queue selection strategy in various ways:

The following sections explore these alternatives in detail.

Changing the Scheduling Algorithm

The scheduler configuration parameter algorithm provides a selection for the scheduling algorithm in use. See the sched_conf(5) man page for further information. Currently, default is the only allowed setting.

Scaling System Load

To select the queue to run a job, the grid engine system uses the system load information on the machines that host queue instances. This queue selection scheme builds up a load-balanced situation, thus guaranteeing better use of the available resources in a cluster.

However, the system load may not always tell the truth. For example, if a multi-CPU machine is compared to a single CPU system, the multiprocessor system usually reports higher load figures, because it probably runs more processes. The system load is a measurement strongly influenced by the number of processes trying to get CPU access. But multi-CPU systems are capable of satisfying a much higher load than single-CPU machines. This problem is addressed by processor-number-adjusted sets of load values that are reported by default by sge_execd. Use these load parameters instead of the raw load values to avoid the problem described earlier. See Load Parameters and the sge-root/doc/load_parameters.asc file for details.

Another example of potentially improper interpretation of load values is when systems have marked differences in their performance potential or in their price performance ratio. In both cases, equal load values do not mean that arbitrary hosts can be selected to run a job. In this situation, the administrator should define load scaling factors for the relevant execution hosts and load parameters. See Configuring Execution Hosts With QMON, and related sections.
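For example, load scaling factors are part of the execution host configuration (see the host_conf(5) man page) and can be edited with qconf. The host name and factor below are placeholders:

qconf -me bigiron
#   load_scaling   np_load_avg=0.500000    # report only half of the measured np_load_avg for this host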


Note –

The scaled load parameters are also compared against the load threshold lists load-thresholds and migr-load-thresholds. See the queue_conf(5) man page for details.


Another problem associated with load parameters is the need for an application-dependent and site-dependent interpretation of the values and their relative importance. The CPU load might be dominant for a certain type of application that is common at a particular site. By contrast, the memory load might be more important for another site and for the application profile to which the site's compute cluster is dedicated. To address this problem, the grid engine system enables the administrator to specify a load formula in the scheduler configuration file sched_conf. See the sched_conf(5) man page for more details. Site-specific information on resource usage and capacity planning can be taken into account by using site-defined load parameters and consumable resources in the load formula. See the sections Adding Site-Specific Load Parameters and Consumable Resources.
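A sketch of such a load formula, assuming that my_io_load is a site-defined load parameter (it is not a built-in value):

qconf -msconf
#   load_formula   np_load_avg+my_io_load    # weighting factors can also be applied; see sched_conf(5)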

Finally, the time dependency of load parameters must be taken into account. The load that is imposed by the jobs that are running on a system varies in time. Often the load, for example, the CPU load, requires some amount of time to be reported in the appropriate quantity by the operating system. If a job recently started, the reported load might not provide an accurate representation of the load that the job has imposed on that host. The reported load adapts to the real load over time. But the period of time in which the reported load is too low might lead to an oversubscription of that host. The grid engine system enables the administrator to specify load adjustment factors that are used in the scheduler to compensate for this problem. See the sched_conf(5) man page for detailed information on how to set these load adjustment factors.

Load adjustments are used to virtually increase the measured load after a job is dispatched. This adjustment helps to prevent oversubscription of hosts whose reported load has not yet caught up with recently started jobs. If you do not need load adjustments, you should turn them off. Load adjustments impose additional work on the scheduler in connection with sorting hosts and verifying load thresholds.

To disable load adjustments, on the Load Adjustment tab of the Scheduler Configuration dialog box, set the Decay Time to zero, and delete all load adjustment values in the table. See Changing the Scheduler Configuration With QMON.
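The command-line equivalent is sketched below, assuming the job_load_adjustments and load_adjustment_decay_time attributes of sched_conf(5):

qconf -msconf
#   job_load_adjustments         NONE     # no artificial load is added for freshly dispatched jobs
#   load_adjustment_decay_time   0:0:0    # disable the decay of load adjustments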

Selecting Queue by Sequence Number

Another way to change the default scheme for queue selection is to set the scheduler configuration parameter queue_sort_method to seq_no instead of the default, load. In this case, the system load is no longer used as the primary method to select queues. Instead, the sequence numbers that are assigned to the queues by the queue configuration parameter seq_no define a fixed order for queue selection. The queues must be suitable for the considered job, and they must be available. See the queue_conf(5) and sched_conf(5) man pages for more details.

This queue selection policy is useful if the machines that offer batch services at your site are ranked in a monotonic price-per-job order. For example, a job running on machine A costs 1 unit of money. The same job costs 10 units on machine B. And on machine C the job costs 100 units. Thus the preferred scheduling policy is to first fill up host A and then to use host B. Host C is used only if no alternative remains.
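A sketch of such a ranking for a single cluster queue all.q whose instances on hosts a, b, and c should be filled in that order. The host names are placeholders, and the bracket syntax for host-specific values is described in queue_conf(5):

qconf -msconf
#   queue_sort_method   seqno     # sort queues by sequence number (see sched_conf(5) for the exact keyword)

qconf -mq all.q
#   seq_no   0,[b=10],[c=20]      # prefer instances on host a, then b, then c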


Note –

If you have changed the method of queue selection to seq_no, and the queues under consideration all share the same sequence number, queues are selected by the default method, load.


Selecting Queue by Share

The goal of this method is to place jobs so as to attempt to meet the targeted share of global system resources for each job. This method takes into account the resource capability represented by each host in relation to all the system resources. This method tries to balance the percentage of tickets for each host (that is, the sum of tickets for all jobs running on a host) with the percentage of the resource capability that particular host represents for the system. See Configuring Execution Hosts With QMON for instructions on how to define the capacity of a host.

The host's load, although of secondary importance, is also taken into account in the sorting. Choose this sorting method for a site that uses the share-tree policy.

Restricting the Number of Jobs per User or Group

The administrator can assign an upper limit to the number of jobs that any user or any UNIX group can run at any time. In order to enforce this feature, do one of the following:
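Two common approaches are sketched below, assuming the maxujobs attribute of sched_conf(5) and the max_u_jobs attribute of the global cluster configuration sge_conf(5):

qconf -msconf
#   maxujobs     10     # limit on the number of jobs a user may have running at the same time

qconf -mconf
#   max_u_jobs   50     # limit on the number of active (pending plus running) jobs per user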

Changing the Scheduler Configuration With QMON

On the QMON Main Control window, click the Scheduler Configuration button.

The Scheduler Configuration dialog box appears. The dialog box has two tabs:

To change general scheduling parameters, click the General Parameters tab. The General Parameters tab looks like the following figure.

Dialog box titled Scheduler Configuration. Shows General
Parameters tab with parameters you can set. Shows Ok, Cancel, and Help buttons.

Use the General Parameters tab to set the following parameters:

By default, the grid engine system schedules job runs in a fixed schedule interval. You can use the Flush Submit Seconds and Flush Finish Seconds parameters to configure immediate scheduling. For more information, see Immediate Scheduling.
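A sketch of an immediate-scheduling setup, assuming the flush_submit_sec and flush_finish_sec attributes of sched_conf(5):

qconf -msconf
#   flush_submit_sec   1    # schedule about one second after each job submission
#   flush_finish_sec   1    # schedule about one second after each job finishes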

To change load adjustment parameters, click the Load Adjustment tab. The Load Adjustment tab looks like the following figure:

Dialog box titled Scheduler Configuration. Shows Load
Adjustment tab with parameters you can set. Shows Ok, Cancel, and Help buttons.

Use the Load Adjustment tab to set the following parameters:

See Scaling System Load for background information. See the sched_conf(5) man page for more details about the scheduler configuration.

Administering Policies

This section describes how to configure policies to manage cluster resources.

The grid engine software orchestrates the delivery of computational power, based on enterprise resource policies that the administrator manages. The system uses these policies to examine available computer resources in the grid. The system gathers these resources, and then it allocates and delivers them automatically, in a way that optimizes usage across the grid.

To enable cooperation in the grid, project owners must do the following:

As administrator, you can define high-level usage policies that are customized for your site. Four such policies are available:

Policy management automatically controls the use of shared resources in the cluster to achieve your goals. High-priority jobs are dispatched preferentially. These jobs receive greater CPU entitlements when they are competing with other, lower-priority jobs. The grid engine software monitors the progress of all jobs. It adjusts their relative priorities correspondingly, and with respect to the goals that you define in the policies.

This policy-based resource allocation grants each user, team, department, and all projects an allocated share of system resources. This allocation of resources extends over a specified period of time, such as a week, a month, or a quarter.

Configuring Policy-Based Resource Management With QMON

On the QMON Main Control window, click the Policy Configuration button. The Policy Configuration dialog box appears.

Figure 5–1 Policy Configuration Dialog Box

Dialog box titled Policy Configuration. Shows priority,
urgency, and ticket policy parameters. Shows Refresh, Apply, Done, and Help
buttons.

The Policy Configuration dialog box shows the following information:

From this dialog box you can access specific configuration dialog boxes for the three ticket-based policies.

Specifying Policy Priority

Before the grid engine system dispatches jobs, the jobs are brought into priority order, highest priority first. Without any administrator influence, the order is first-in-first-out (FIFO).

On the Policy Configuration dialog box, under Policy Importance Factor, you can specify the relative importance of the three priority types that control the sorting order of jobs:

For more information about job priorities, see Job Sorting.

You can specify a weighting factor for each priority type. This weighting factor determines the degree to which each type of priority affects overall job priority. To make it easier to control the range of values for each priority type, normalized values are used instead of the raw ticket values, urgency values, and POSIX priority values.

The following formula expresses how a job's priority values are determined:


Job priority = Urgency * normalized urgency value + 
Ticket * normalized ticket value + 
Priority * normalized priority value

Urgency, Ticket, and Priority are the three weighting factors you specify under Policy Importance Factor. For example, if you specify Priority as 1, Urgency as 0.1, and Ticket as 0.01, job priority that is specified by the qsub -p command is given the most weight, job priority that is specified by the Urgency Policy is considered next, and job priority that is specified by the Ticket Policy is given the least weight.
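For example, with these weights and hypothetical normalized values of 0.4 for urgency, 0.2 for tickets, and 0.6 for the POSIX priority, the result is:

job_priority = 0.1 * 0.4 + 0.01 * 0.2 + 1 * 0.6 = 0.642

A change in the POSIX priority therefore moves the result about ten times as much as the same change in urgency, and about one hundred times as much as the same change in ticket value.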

Configuring the Urgency Policy

The Urgency Policy defines an urgency value for each job. This urgency value is determined by the sum of the following three contributing elements:

For details about how the grid engine system arrives at the urgency value total, see About the Urgency Policy.

Configuring Ticket-Based Policies

The tickets that are currently assigned to individual policies are listed under Current Active Tickets. The numbers reflect the relative importance of the policies. The numbers indicate whether a certain policy currently dominates the cluster or whether policies are in balance.

Tickets provide a quantitative measure. For example, you might assign twice as many tickets to the share-based policy as you assign to the functional policy. The share-based policy is then allocated twice the resource entitlement of the functional policy. In this sense, tickets behave very much like stock shares.

The total number of all tickets has no particular meaning. Only the ratios between the policies count. Hence, total ticket numbers are usually quite high to allow for fine adjustment of the relative importance of the policies.

Under Edit Tickets, you can modify the number of tickets that are allocated to the share tree policy and the functional policy. For details, see Editing Tickets.

Select the Share Override Tickets check box to control the total ticket amount distributed by the override policy. Clear the check box to control the importance of individual jobs relative to the ticket pools that are available for the other policies and override categories. For detailed information, see Sharing Override Tickets.

Select the Share Functional Tickets check box to give a category member a constant entitlement level for the sum of all its jobs. Clear the check box to give each job the same entitlement level, based on its category member's entitlement. For detailed information, see Sharing Functional Ticket Shares.

You can set the maximum number of jobs that can be scheduled in the functional policy. The default value is 200.

You can set the maximum number of pending subtasks that are allowed for each array job. The default value is 50. Use this setting to reduce scheduling overhead.
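Both limits correspond to scheduler configuration attributes; the sketch below assumes the parameter names max_functional_jobs_to_schedule and max_pending_tasks_per_job from sched_conf(5):

qconf -msconf
#   max_functional_jobs_to_schedule   200    # pending jobs considered by the functional policy per run
#   max_pending_tasks_per_job         50     # pending array tasks considered per array job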

You can specify the Ticket Policy Hierarchy to resolve certain cases of conflicting policies. The resolving of policy conflicts applies particularly to pending jobs. For detailed information, see Setting the Ticket Policy Hierarchy.

To refresh the information displayed, click Refresh.

To save any changes that you make to the Policy Configuration, click Apply. To close the dialog box without saving changes, click Done.

Editing Tickets

You can edit the total number of share-tree tickets and functional tickets. Override tickets are assigned directly through the override policy configuration. The other ticket pools are distributed automatically among jobs that are associated with the policies and with respect to the actual policy configuration.


Note –

All share-based tickets and functional tickets are always distributed among the jobs associated with these policies. Override tickets might not be applicable to the currently active jobs. Consequently, the active override tickets might be zero, even though the override policy has tickets defined.


Sharing Override Tickets

The administrator assigns tickets to the different members of the override categories, that is, to individual users, projects, departments, or jobs. Consequently, the number of tickets that are assigned to a category member determines how many tickets are assigned to jobs under that category member. For example, the number of tickets that are assigned to user A determines how many tickets are assigned to all jobs of user A.


Note –

The number of tickets that are assigned to the job category does not determine how many tickets are assigned to jobs in that category.


Use the Share Override Tickets check box to set the share_override_tickets parameter of sched_conf(5). This parameter controls how job ticket values are derived from their category member ticket value. When you select the Share Override Tickets check box, the tickets of the category members are distributed evenly among the jobs under this member. If you clear the Share Override Tickets check box, each job inherits the ticket amount defined for its category member. In other words, the category member tickets are replicated for all jobs underneath.

Select the Share Override Tickets check box to control the total ticket amount distributed by the override policy. With this setting, ticket amounts that are assigned to a job can become negligibly small if many jobs are under one category member. For example, ticket amounts might diminish if many jobs belong to one member of the user category.

Clear the Share Override Tickets check box to control the importance of individual jobs relative to the ticket pools that are available for the other policies and override categories. With this setting, the number of jobs that are under a category member does not matter. The jobs always get the same number of tickets. However, the total number of override tickets in the system increases as the number of jobs with a right to receive override tickets increases. Other policies can lose importance in such cases.

Sharing Functional Ticket Shares

The functional policy defines entitlement shares for the functional categories. Then the policy defines shares for all members of each of these categories. The functional policy is thus similar to a two-level share tree. The difference is that a job can be associated with several categories at the same time. The job belongs to a particular user, for instance, but the job can also belong to a project, a department, and a job class.

However, as in the share tree, the entitlement shares that a job receives from a functional category are determined by the following:

Use the Share Functional Tickets check box to set the share_functional_shares parameter of sched_conf(5). This parameter defines how the category member shares are used to determine the shares of a job. The shares assigned to the category members, such as a particular user or project, can be replicated for each job. Or shares can be distributed among the jobs under the category member.

Those shares are comparable to stock shares. Such shares have no effect for the jobs that belong to the same category member. All jobs under the same category member have the same number of shares in both cases. But the share number has an effect when comparing the share amounts within the same category. Jobs with many siblings that belong to the same category member receive relatively small share portions if you select the Share Functional Tickets check box. On the other hand, if you clear the Share Functional Tickets check box, all sibling jobs receive the same share amount as their category member.

Select the Share Functional Tickets check box to give a category member a constant entitlement level for the sum of all its jobs. The entitlement of an individual job can get negligibly small, however, if the job has many siblings.

Clear the Share Functional Tickets check box to give each job the same entitlement level, based on its category member's entitlement. The number of job siblings in the system does not matter.


Note –

A category member with many jobs underneath can dominate the functional policy.


Be aware that the setting of share functional shares does not determine the total number of functional tickets that are distributed. The total number is always as defined by the administrator for the functional policy ticket pool. The share functional shares parameter influences only how functional tickets are distributed within the functional policy.


Example 5–1 Functional Policy Example

The following example describes a common scenario: a user wants to translate the SGE 5.3 scheduler option -user_sort true into an N1GE 6.1 configuration but is not familiar with the share-based, functional, and override ticket policies.

For a plain user-based equal share, you configure your global configuration sge_conf(5) with

enforce_user auto

auto_user_fshare 100

Then you set weight_tickets_functional 10000 in the scheduler configuration sched_conf(5). This action causes the functional policy to be used for user-based equal share scheduling, with 100 shares for each user.
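The same setup can be sketched with qconf, assuming the parameter names from sge_conf(5) and sched_conf(5):

qconf -mconf
#   enforce_user                auto     # create a user object automatically at first job submission
#   auto_user_fshare            100      # give every automatically created user 100 functional shares

qconf -msconf
#   weight_tickets_functional   10000    # put 10000 tickets into the functional policy pool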


Tuning Scheduling Run Time

Pending jobs are sorted according to the number of tickets that each job has, as described in Job Sorting. The scheduler reports the number of tickets each pending job has to the master daemon sge_qmaster. However, on systems with very large numbers of jobs, you might want to turn off ticket reporting. When you turn off ticket reporting, you disable ticket-based job priority. The sort order of jobs is based only on the time each job is submitted.

To turn off the reporting of pending job tickets to sge_qmaster, clear the Report Pending Job Tickets check box on the Policy Configuration dialog box. Doing so sets the report_pjob_tickets parameter of sched_conf(5) to false.

Setting the Ticket Policy Hierarchy

Ticket policy hierarchy provides the means to resolve certain cases of conflicting ticket policies. The resolving of ticket policy conflicts applies particularly to pending jobs.

Such cases can occur in combination with the share-based policy and the functional policy. With both policies, assigning priorities to jobs that belong to the same leaf-level entities is done on a first-come-first-served basis. Leaf-level entities include:

Members of the job category are not included among leaf-level entities. So, for example, the first job of a particular user gets the highest entitlement, the second job gets the next highest, the third the next, and so on.

A conflict can occur if another policy mandates an order that is different. So, for example, the override policy might define the third job as the most important, whereas the first job that is submitted should come last.

A policy hierarchy might give the override policy priority over the share-tree policy or the functional policy. Such a policy hierarchy ensures that high-priority jobs under the override policy get more entitlements than jobs in the other two policies. Such jobs must belong to the same leaf-level entity (user or project) in the share tree.

The Ticket Policy Hierarchy can be a combination of up to three letters. These letters are the first letters of the names of the following three ticket policies:

Use these letters to establish a hierarchy of ticket policies. The first letter defines the top policy. The last letter defines the bottom of the hierarchy. Policies that are not listed in the policy hierarchy do not influence the hierarchy. Such policies can still be a source of tickets for jobs, but those tickets do not influence the ticket calculations in the other policies. All tickets of all policies are added up for each job to define its overall entitlement.

The following examples describe two settings and how they influence the order of the pending jobs.


policy_hierarchy=OS
  1. The override policy assigns the appropriate number of tickets to each pending job.

  2. The number of tickets determines the entitlement assignment in the share tree in case two jobs belong to the same user or to the same leaf-level project. Then the share tree tickets are calculated for the pending jobs.

  3. The tickets from the override policy and from the share-tree policy are added together, along with all other active policies not in the hierarchy. The job with the highest resulting number of tickets has the highest entitlement.


policy_hierarchy=OF
  1. The override policy assigns the appropriate number of tickets to each pending job. Then the tickets from the override policy are added up.

  2. The resulting number of tickets influences the entitlement assignment in the functional policy in case two jobs belong to the same functional category member. Based on this entitlement assignment, the functional tickets are calculated for the pending jobs.

  3. The resulting value is added to the ticket amount from the override policy. The job with the highest resulting number of tickets has the highest entitlement.

All combinations of the three letters are theoretically possible, but only a subset of the combinations are meaningful or have practical relevance. The last letter should always be S or F, because only those two policies can be influenced due to their characteristics described in the examples.

The following form is recommended for policy_hierarchy settings:


[O][S|F]

If the override policy is present, O should occur only as the first letter, because the override policy can influence other policies but cannot itself be influenced. The share-based policy and the functional policy can only be influenced, so S or F should occur as the last letter.
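From the command line, the hierarchy is set through the policy_hierarchy attribute of sched_conf(5); the sketch below mirrors the first example above:

qconf -msconf
#   policy_hierarchy   OS      # override tickets influence the share-tree calculation
#   policy_hierarchy   NONE    # alternatively, no policy influences another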

Configuring the Share-Based Policy

Share-based scheduling grants each user and project its allocated share of system resources during an accumulation period such as a week, a month, or a quarter. Share-based scheduling is also called share tree scheduling. It constantly adjusts each user's and project's potential resource share for the near term, until the next scheduling interval. Share-based scheduling is defined for user or for project, or for both.

Share-based scheduling ensures that a defined share is guaranteed to the instances that are configured in the share tree over time. Jobs that are associated with share-tree branches where fewer resources were consumed in the past than anticipated are preferred when the system dispatches jobs. At the same time, full resource usage is guaranteed, because unused share proportions are still available for pending jobs associated with other share-tree branches.

When each user or project receives its targeted share as far as possible, groups of users or projects, such as departments or divisions, also receive their targeted share. Fair share for all entities is attainable only when every entity that is entitled to resources contends for those resources during the accumulation period. If a user, a project, or a group does not submit jobs during a given period, the resources are shared among those who do submit jobs.

Share-based scheduling is a feedback scheme. The share of the system to which any user or user-group, or project or project-group, is entitled is a configuration parameter. The share of the system to which any job is entitled is based on the following factors:

The grid engine software keeps track of how much usage users and projects have already received. At each scheduling interval, the Scheduler adjusts all jobs' share of resources. Doing so ensures that all users, user groups, projects, and project groups get close to their fair share of the system during the accumulation period. In other words, resources are granted or are denied in order to keep everyone more or less at their targeted share of usage.

The Half-Life Factor

Half-life is how fast the system “forgets” about a user's resource consumption. The administrator decides whether to penalize a user for high resource consumption, be it six months ago or six days ago. The administrator also decides how to apply the penalty. On each node of the share tree, grid engine software maintains a record of users' resource consumption.

With this record, the system administrator can decide how far to look back to determine a user's underusage or overusage when setting up a share-based policy. The resource usage in this context is the mathematical sum of all the computer resources that are consumed over a “sliding window of time.”

The length of this window is determined by a “half-life” factor, which in the grid engine system is an internal decay function. This decay function reduces the impact of accrued resource consumption over time. A short half-life quickly lessens the impact of resource overconsumption. A longer half-life gradually lessens the impact of resource overconsumption.

The half-life is specified as a period of time. For example, consider a half-life of seven days that is applied to a resource consumption of 1,000 units. This decay factor results in the following usage “penalty” adjustment over time.
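For illustration only, halving 1,000 units of accrued usage every seven days leaves the following penalty:

after  7 days   500 units
after 14 days   250 units
after 21 days   125 units
after 28 days    62.5 units

In the scheduler configuration, the half-life is set through the halftime attribute of sched_conf(5), which is expressed in hours; a seven-day half-life corresponds to a halftime of 168.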

The half-life-based decay diminishes the impact of a user's resource consumption over time, until the effect of the penalty is negligible.


Note –

Override tickets that a user receives are not subjected to a past usage penalty, because override tickets belong to a different policy system. The decay function is a characteristic of the share-tree policy only.


Compensation Factor

Sometimes a comparison of actual usage with targeted usage shows that a user's or a project's actual usage is well below its target. In such a case, adjusting that user's or project's share of resources toward the target can allow the user or project to dominate the system. This domination might not be desirable.

The compensation factor enables an administrator to limit how much a user or a project can dominate the resources in the near term.

For example, a compensation factor of two limits a user's or project's current share to twice its targeted share. Assume that a user or a project should get 20 percent of the system resources over the accumulation period. If the user or project currently gets much less, the maximum it can get in the near term is only 40 percent.

The share-based policy defines long-term resource entitlements of users or projects as per the share tree. When combined with the share-based policy, the compensation factor makes automatic adjustments in entitlements.

If a user or project is either under or over the defined target entitlement, the grid engine system compensates. The system raises or lowers that user's or project's entitlement for a short term over or under the long-term target. This compensation is calculated by a share tree algorithm.

The compensation factor provides an additional mechanism to control the amount of compensation that the grid engine system assigns. The additional compensation factor (CF) calculation is carried out only if the following conditions are true:

If either condition is not true, or if both conditions are not true, the compensation as defined and implemented by the share-tree algorithm is used.

The smaller the value of the CF, the greater is its effect. If the value is greater than 1, the grid engine system's compensation is limited. The upper limit for compensation is calculated as long-term-entitlement multiplied by the CF. And as defined earlier, the short-term entitlement must exceed this limit before anything happens based on the compensation factor.

If the CF is 1, the grid engine system compensates in the same way as with the raw share-tree algorithm. So a value of one has an effect that is similar to a value of zero. The only difference is an implementation detail. If the CF is one, the CF calculations are carried out without an effect. If the CF is zero, the calculations are suppressed.

If the value is less than 1, the grid engine system overcompensates. Jobs receive much more compensation than they are entitled to based on the share-tree algorithm. Jobs also receive this overcompensation earlier, because the criterion for activating the compensation is met at lower short-term entitlement values. The activating criterion is short-term-entitlement > long-term-entitlement * CF.
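The factor itself is a scheduler configuration attribute; the sketch below assumes the compensation_factor attribute of sched_conf(5):

qconf -msconf
#   compensation_factor   2    # cap short-term entitlement at twice the long-term target
#   compensation_factor   0    # alternatively, rely on the raw share-tree compensation only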

Hierarchical Share Tree

The share-based policy is implemented through a hierarchical share tree. The share tree specifies, for a moving accumulation period, how system resources are to be shared among all users and projects. The length of the accumulation period is determined by a configurable decay constant. The grid engine system bases a job's share entitlement on the degree to which each parent node in the share tree reaches its accumulation limit. A job's share entitlement is based on its leaf node share allocation, which in turn depends on the allocations of its parent nodes. All jobs associated with a leaf node split the associated shares.

The entitlement derived from the share tree is combined with other entitlements, such as entitlements from a functional policy, to determine a job's net entitlement. The share tree is allotted the total number of tickets for share-based scheduling. This number determines the weight of share-based scheduling among the four scheduling policies.

The share tree is defined during installation. The share tree can be altered at any time. When the share tree is edited, the new share allocations take effect at the next scheduling interval.

Configuring the Share-Tree Policy With QMON

On the QMON Policy Configuration dialog box (Figure 5–1), click Share Tree Policy. The Share Tree Policy dialog box appears.

Dialog box titled Share Tree Policy. Shows the share
tree, Node Attributes, and parameters. Shows buttons for maintaining the share
tree policy.

Node Attributes

Under Node Attributes, the attributes of the selected node are displayed:

When a user node or a project node is removed and then added back, the user's or project's usage is retained. A node can be added back either at the same place or at a different place in the share tree. You can zero out that usage before you add the node back to the share tree. To do so, first remove the node from the users or projects configured in the grid engine system. Then add the node back to the users or projects there.

Users or projects that were not in the share tree but that ran jobs have nonzero usage when added to the share tree. To zero out usage when you add such users or projects to the tree, first remove them from the users or projects configured in the grid engine system. Then add them to the tree.

To add an interior node under the selected node, click Add Node. A blank Node Info window appears, where you can enter the node's name and number of shares. You can enter any node name or share number.

To add a leaf node under the selected node, click Add Leaf. A blank Node Info window appears, where you can enter the node's name and number of shares. The node's name must be an existing grid engine user (Configuring User Objects With QMON) or project (Defining Projects).

The following rules apply when you are adding a leaf node:

To edit the selected node, click Modify. A Node Info window appears. The window displays the node's name and its number of shares.

To cut or copy the selected node to a buffer, click Cut or Copy. To Paste under the selected node the contents of the most recently cut or copied node, click Paste.

To delete the selected node and all its descendents, click Delete.

To clear all accumulated usage from the share-tree hierarchy, click Clear Usage. Clear usage when the share-based policy is aligned to a budget and needs to start from scratch at the beginning of each budget term. The Clear Usage facility is also handy when setting up or modifying test N1 Grid Engine 6.1 software environments.

QMON periodically updates the information displayed in the Share Tree Policy dialog box. Click Refresh to force the display to refresh immediately.

To save all the node changes that you make, click Apply. To close the dialog box without saving changes, click Done.

To search the share tree for a node name, click Find, and then type a search string. Node names that begin with the search string are indicated; the search is case sensitive. Click Find Next to find the next occurrence of the search string.

Click Help to open the online help system.

Share Tree Policy Parameters

To display the Share Tree Policy Parameters, click the arrow at the right of the Node Attributes.

About the Special User default

You can use the special user default to reduce the amount of share-tree maintenance for sites with many users. Under the share-tree policy, a job's priority is determined based on the node the job maps to in the share tree. Users who are not explicitly named in the share tree are mapped to the default node, if it exists.

The specification of a single default node allows for a simple share tree to be created. Such a share tree makes user-based fair sharing possible.

You can use the default user also in cases where the same share entitlement is assigned to most users. Same share entitlement is also known as equal share scheduling.

The default user configures all user entries under the default node, giving the same share amount to each user. Each user who submits jobs receives the same share entitlement as that configured for the default user. To activate the facility for a particular user, you must add this user to the list of grid engine users.

The share tree displays “virtual” nodes for all users who are mapped to the default node. The display of virtual nodes enables you to examine the usage and the fair-share scheduling parameters for users who are mapped to the default node.

You can also use the default user for “hybrid” share trees, where users are subordinated under projects in the share tree. The default user can be a leaf node under a project node.

The short-term entitlements of users vary according to differences in the amount of resources that the users consume. However, long-term entitlements of users remain the same.

You might want to assign lower or higher entitlements to some users while maintaining the same long-term entitlement for all other users. To do so, configure a share tree with individual user entries next to the default user for those users with special entitlements.

In Example A, all users submitting to Project A get equal long-term entitlements. The users submitting to Project B only contribute to the accumulated resource consumption of Project B. Entitlements of Project B users are not managed.


Example 5–2 Example A

Diagram shows project nodes A and B. Project A has the
default user as a leaf node. Project B has no leaf nodes.

Compare Example A with Example B:


Example 5–3 Example B

Diagram shows project nodes A and B. Project A has the
default user as a leaf node. Project B has three leaf nodes: default, and
Users A and B.

In Example B, treatment for Project A is the same as for Example A. But all default users who submit jobs to Project B, except users A and B, receive equal long-term resource entitlements. Default users have 20 shares. User A, with 10 shares, receives half the entitlement of the default users. User B, with 40 shares, receives twice the entitlement of the default users.

Configuring the Share-Based Policy From the Command Line


Note –

Use QMON to configure the share tree policy, because a hierarchical tree is well-suited for graphical display and for editing. However, if you need to integrate share tree modifications in shell scripts, for example, you can use the qconf command and its options.


To configure the share-based policy from the command line, use the qconf command with appropriate options.
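The relevant options are sketched below; see the qconf(1) and share_tree(5) man pages for the exact syntax:

qconf -astree        # add a new share tree in an editor
qconf -mstree        # modify the existing share tree in an editor
qconf -sstree        # show the current share tree
qconf -dstree        # delete the share tree
qconf -clearusage    # clear all user and project usage, the command-line counterpart of Clear Usage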

How to Create Project-Based Share-Tree Scheduling

The objective of this setup is to guarantee a certain share assignment of all the cluster resources to different projects over time.

  1. Specify the number of share-tree tickets (for example, 1000000) in the scheduler configuration.

    See Configuring Policy-Based Resource Management With QMON, and the sched_conf(5) man page.

  2. (Optional) Add one user for each scheduling-relevant user.

    See Configuring User Objects With QMON, and the user(5) man page.

  3. Add one project for each scheduling-relevant project.

    See Defining Projects With QMON, and the project(5) man page.

  4. Use QMON to set up a share tree that reflects the structure of all scheduling-relevant projects as nodes.

    See Configuring the Share-Tree Policy With QMON.

  5. Assign share tree shares to the projects.

    For example, if you are creating project-based share-tree scheduling with first-come, first-served scheduling among jobs of the same project, a simple structure might look like the following:

    Diagram showing root node with 2 projects. Project A
has 75 shares, Project B has 25 shares.

    If you are creating project-based share-tree scheduling with equal shares for each user, a simple structure might look like the following:

    Diagram showing root node with 2 projects. Each project
has the default user defined. Each user has 10 shares.

    If you are creating project-based share-tree scheduling with individual user shares in each project, add users as leaves to their projects. Then assign individual shares. A simple structure might look like the following:

    Diagram showing root node with 2 projects. Each project
has 3 users defined. Each user has 10 shares.

    If you want to assign individual shares to only a few users, designate the user default in combination with individual users below a project node. For example, you can condense the tree illustrated previously into the following:

    Diagram showing root node, 2 projects. Project A has
the default user (5 shares), and User2 (90 shares). Project B has the default
user (30 shares).
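As a sketch, the first structure in step 5, with Project A holding 75 shares and Project B holding 25 shares, might look as follows in share_tree(5) format. The node IDs and type codes are illustrative; check the share_tree(5) man page before loading such a file with qconf -Astree:

id=0
name=Root
type=0
shares=1
childnodes=1,2
id=1
name=ProjectA
type=1
shares=75
childnodes=NONE
id=2
name=ProjectB
type=1
shares=25
childnodes=NONE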

Configuring the Functional Policy

Functional scheduling is a nonfeedback scheme for determining a job's importance. Functional scheduling associates a job with the submitting user, project, department, and job class. Functional scheduling is sometimes called priority scheduling. The functional policy setup ensures that a defined share is guaranteed to each user, project, or department at any time. Jobs of users, projects, or departments that have used fewer resources than anticipated are preferred when the system dispatches jobs to idle resources.

At the same time, full resource usage is guaranteed, because unused share proportions are distributed among those users, projects, and departments that need the resources. Past resource consumption is not taken into account.

Functional policy entitlement to system resources is combined with other entitlements in determining a job's net entitlement. For example, functional policy entitlement might be combined with share-based policy entitlement.

The total number of tickets that are allotted to the functional policy determines the weight of functional scheduling among the three scheduling policies. During installation, the administrator divides the total number of functional tickets among the functional categories of user, department, project, job, and job class.

Functional Shares

Functional shares are assigned to every member of each functional category: user, department, project, job, and job class. These shares indicate what proportion of the tickets for a category each job associated with a member of the category is entitled to. For example, user davidson has 200 shares, and user donlee has 100. A job submitted by davidson is entitled to twice as many user-functional-tickets as donlee's job, no matter how many tickets there are.

The functional tickets that are allotted to each category are shared among all the jobs that are associated with a particular category.

Configuring the Functional Share Policy With QMON

At the bottom of the QMON Policy Configuration dialog box, click Functional Policy. The Functional Policy dialog box appears.

Dialog box titled Functional Policy. Shows category list,
category members, and ratio of tickets. Shows Refresh, Apply, Done, and Help
buttons.

Function Category List

Select the functional category for which you are defining functional shares: user, project, department, or job.

Functional Shares Table

The table under Functional Shares is scrollable. The table displays the following information:

QMON periodically updates the information displayed in the Functional Policy dialog box. Click Refresh to force the display to refresh immediately.

To save all node changes that you make, click Apply. To close the dialog box without saving changes, click Done.

Changing Functional Configurations

Click the jagged arrow above the Functional Shares table to open a configuration dialog box.

Ratio Between Sorts of Functional Tickets

To display the Ratio Between Sorts Of Functional Tickets, click the arrow at the right of the Functional Shares table.

User [%], Department [%], Project [%], Job [%] and Job Class [%] always add up to 100%.

When you change any of the sliders, all other unlocked sliders change to compensate for the change.

When a lock is open, the slider that it guards can change freely. The slider can change either because it is moved or because the moving of another slider causes this slider to change. When a lock is closed, the slider that it guards cannot change. If four locks are closed and one lock is open, no sliders can change.

Configuring the Functional Share Policy From the Command Line


Note –

You can assign functional shares to jobs only using QMON. No command-line interface is available for this function.


To configure the functional share policy from the command line, use the qconf command with the appropriate options.
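For example, functional shares are stored in the fshare field of the user and project objects, and the size of the functional ticket pool is set in the scheduler configuration. The object names and share values below are placeholders:

qconf -muser davidson     # user(5): set fshare to 200
qconf -mprj myproject     # project(5): set fshare to 100
qconf -msconf             # sched_conf(5): set weight_tickets_functional to 10000
# Departments are usersets of type DEPT and carry the same fshare field; see access_list(5).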

How to Create User-Based, Project-Based, and Department-Based Functional Scheduling

Use this setup to create a certain share assignment of all the resources in the cluster to different users, projects, or departments. First-come, first-served scheduling is used among jobs of the same user, project, or department.

  1. In the Scheduler Configuration dialog box, select the Share Functional Tickets check box.

    See Sharing Functional Ticket Shares, and the sched_conf(5) man page.

  2. Specify the number of functional tickets (for example, 1000000) in the scheduler configuration.

    See Configuring Policy-Based Resource Management With QMON, and the sched_conf(5) man page.

  3. Add scheduling-relevant items:

  4. Assign functional shares to each user, project, or department.

    See Configuring User Access Lists With QMON, and the access_list(5) man page.

    Assign the shares as a percentage of the whole. Examples follow:

    For users:

    • UserA (10)

    • UserB (20)

    • UserC (20)

    • UserD (20)

    For projects:

    • ProjectA (55)

    • ProjectB (45)

    For departments:

    • DepartmentA (90)

    • DepartmentB (5)

    • DepartmentC (5)

Configuring the Override Policy

Override scheduling enables a grid engine system manager or operator to dynamically adjust the relative importance of one job or of all jobs that are associated with a user, a department, a project, or a job class. This adjustment adds tickets to the specified job, user, department, project, or job class. By adding override tickets, override scheduling increases the total number of tickets that a user, department, project, or job has. As a result, the overall share of resources is increased.

The addition of override tickets also increases the total number of tickets in the system. These additional tickets deflate the value of every job's tickets.

You can use override tickets for the following two purposes:

Override tickets that are assigned directly to a job go away when the job finishes. All other tickets are inflated back to their original value. Override tickets that are assigned to users, departments, projects, and job classes remain in effect until the administrator explicitly removes the tickets.

The Policy Configuration dialog box displays the current number of override tickets that are active in the system.


Note –

Override entries remain in the Override dialog box. These entries can influence subsequent work if they are not explicitly deleted by the administrator when they are no longer needed.


Configuring the Override Policy With QMON

At the bottom of the QMON Policy Configuration dialog box, click Override Policy. The Override Policy dialog box appears.

Dialog box titled Override Policy. Shows category list
and category members. Shows Refresh, Apply, Done, and Help buttons.

Override Category List

Select the category for which you are defining override tickets: user, project, department, or job.

Override Table

The override table is scrollable. It displays the following information:

QMON periodically updates the information that is displayed in the Override Policy dialog box. Click Refresh to force the display to refresh immediately.

To save all override changes that you make, click Apply. To close the dialog box without saving changes, click Done.

Changing Override Configurations

Click the jagged arrow above the override table to open a configuration dialog box.

Configuring the Override Policy From the Command Line


Note –

You can assign override tickets to jobs only using QMON. No command line interface is available for this function.


To configure the override policy from the command line, use the qconf command with the appropriate options.
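For example, override tickets for category members are stored in the oticket field of the corresponding objects. The object names and ticket values below are placeholders:

qconf -muser davidson     # user(5): set oticket to 5000
qconf -mprj myproject     # project(5): set oticket to 20000
# Departments are usersets of type DEPT and carry the same oticket field; see access_list(5).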