Configuring the Share-Based Policy (Sun N1 Grid Engine 6.1 Administration Guide)

Sun N1 Grid Engine 6.1 Administration Guide

Configuring the Share-Based Policy

Share-based scheduling grants each user and project its allocated share of system resources during an accumulation period such as a week, a month, or a quarter. Share-based scheduling is also called share tree scheduling. It constantly adjusts each user's and project's potential resource share for the near term, until the next scheduling interval. Share-based scheduling is defined for user or for project, or for both.

Share-based scheduling ensures that a defined share is guaranteed to the instances that are configured in the share tree over time. Jobs that are associated with share-tree branches where fewer resources were consumed in the past than anticipated are preferred when the system dispatches jobs. At the same time, full resource usage is guaranteed, because unused share proportions are still available for pending jobs associated with other share-tree branches.

By giving each user or project its targeted share as far as possible, groups of users or projects also get their targeted share. Departments or divisions are examples of such groups. Fair share for all entities is attainable only when every entity that is entitled to resources contends for those resources during the accumulation period. If a user, a project, or a group does not submit jobs during a given period, the resources are shared among those who do submit jobs.

Share-based scheduling is a feedback scheme. The share of the system to which any user or user-group, or project or project-group, is entitled is a configuration parameter. The share of the system to which any job is entitled is based on the following factors:

The share allocated to the job's user or project
The accumulated past usage for each user and user group, and for each project and project group. This usage is adjusted by a decay factor. “Old” usage has less impact.

The grid engine software keeps track of how much usage users and projects have already received. At each scheduling interval, the Scheduler adjusts all jobs' share of resources. Doing so ensures that all users, user groups, projects, and project groups get close to their fair share of the system during the accumulation period. In other words, resources are granted or are denied in order to keep everyone more or less at their targeted share of usage.

The Half-Life Factor

Half-life is how fast the system “forgets” about a user's resource consumption. The administrator decides whether to penalize a user for high resource consumption, be it six months ago or six days ago. The administrator also decides how to apply the penalty. On each node of the share tree, grid engine software maintains a record of users' resource consumption.

With this record, the system administrator can decide how far to look back to determine a user's underusage or overusage when setting up a share-based policy. The resource usage in this context is the mathematical sum of all the computer resources that are consumed over a “sliding window of time.”

The length of this window is determined by a “half-life” factor, which in the grid engine system is an internal decay function. This decay function reduces the impact of accrued resource consumption over time. A short half-life quickly lessens the impact of resource overconsumption. A longer half-life gradually lessens the impact of resource overconsumption.

This half-life decay function is a specified unit of time. For example, consider a half-life of seven days that is applied to a resource consumption of 1,000 units. This half-life decay factor results in the following usage “penalty” adjustment over time.

500 after 7 days
250 after 14 days
125 after 21 days
62.5 after 28 days

The half-life-based decay diminishes the impact of a user's resource consumption over time, until the effect of the penalty is negligible.

Note –

Override tickets that a user receives are not subjected to a past usage penalty, because override tickets belong to a different policy system. The decay function is a characteristic of the share-tree policy only.

Compensation Factor

Sometimes the comparison shows that actual usage is well below targeted usage. In such a case, the adjusting of a user's share or a project's share of resource can allow a user to dominate the system. Such an adjustment is based on the goal of reaching target share. This domination might not be desirable.

The compensation factor enables an administrator to limit how much a user or a project can dominate the resources in the near term.

For example, a compensation factor of two limits a user's or project's current share to twice its targeted share. Assume that a user or a project should get 20 percent of the system resources over the accumulation period. If the user or project currently gets much less, the maximum it can get in the near term is only 40 percent.

The share-based policy defines long-term resource entitlements of users or projects as per the share tree. When combined with the share-based policy, the compensation factor makes automatic adjustments in entitlements.

If a user or project is either under or over the defined target entitlement, the grid engine system compensates. The system raises or lowers that user's or project's entitlement for a short term over or under the long-term target. This compensation is calculated by a share tree algorithm.

The compensation factor provides an additional mechanism to control the amount of compensation that the grid engine system assigns. The additional compensation factor (CF) calculation is carried out only if the following conditions are true:

Short-term-entitlement is greater than long-term-entitlement multiplied by the CF
The CF is greater than 0

If either condition is not true, or if both conditions are not true, the compensation as defined and implemented by the share-tree algorithm is used.

The smaller the value of the CF, the greater is its effect. If the value is greater than 1, the grid engine system's compensation is limited. The upper limit for compensation is calculated as long-term-entitlement multiplied by the CF. And as defined earlier, the short-term entitlement must exceed this limit before anything happens based on the compensation factor.

If the CF is 1, the grid engine system compensates in the same way as with the raw share-tree algorithm. So a value of one has an effect that is similar to a value of zero. The only difference is an implementation detail. If the CF is one, the CF calculations are carried out without an effect. If the CF is zero, the calculations are suppressed.

If the value is less than 1, the grid engine system overcompensates. Jobs receive much more compensation than they are entitled to based on the share-tree algorithm. Jobs also receive this overcompensation earlier, because the criterion for activating the compensation is met at lower short-term entitlement values. The activating criterion is short-term-entitlement > long-term-entitlement * CF.

Hierarchical Share Tree

The share-based policy is implemented through a hierarchical share tree. The share tree specifies, for a moving accumulation period, how system resources are to be shared among all users and projects. The length of the accumulation period is determined by a configurable decay constant. The grid engine system bases a job's share entitlement on the degree to which each parent node in the share tree reaches its accumulation limit. A job's share entitlement is based on its leaf node share allocation, which in turn depends on the allocations of its parent nodes. All jobs associated with a leaf node split the associated shares.

The entitlement derived from the share tree is combined with other entitlements, such as entitlements from a functional policy, to determine a job's net entitlement. The share tree is allotted the total number of tickets for share-based scheduling. This number determines the weight of share-based scheduling among the four scheduling policies.

The share tree is defined during installation. The share tree can be altered at any time. When the share tree is edited, the new share allocations take effect at the next scheduling interval.

Configuring the Share-Tree Policy With `QMON`

On the QMON Policy Configuration dialog box (Figure 5–1), click Share Tree Policy. The Share Tree Policy dialog box appears.

Node Attributes

Under Node Attributes, the attributes of the selected node are displayed:

Identifier. A user, project, or agglomeration name.
Shares. The number of shares that are allocated to this user or project.

Note –
Shares define relative importance. They are not percentages. Shares also do not have quantitative meaning. The specification of hundreds or even thousands of shares is generally a good idea, as high numbers allow fine tuning of importance relationships.
Level Percentage. This node's portion of the total shares at the level of the same parent node in the tree. The number of this node's shares divided by the sum of its and its sibling's shares.
Total Percentage. This node's portion of the total shares in the entire share tree. The long-term targeted resource share of the node.
Actual Resource Usage. The percentage of all the resources in the system that this node has consumed so far in the accumulation period. The percentage is expressed in relation to all nodes in the share tree.
Targeted Resource Usage. Same as Actual Resource Usage, but only taking the currently active nodes in the share tree into account. Active nodes have jobs in the system. In the short term, the grid engine system attempts to balance the entitlement among active nodes.
Combined Usage. The total usage for the node. Combined Usage is the sum of the usage that is accumulated at this node. Leaf nodes accumulate the usage of all jobs that run under them. Inner nodes accumulate the usage of all descendant nodes. Combined Usage includes CPU, memory, and I/O usage according to the ratio specified under Share Tree Policy Parameters. Combined usage is decayed at the half-life decay rate that is specified by the parameters.

When a user node or a project node is removed and then added back, the user's or project's usage is retained. A node can be added back either at the same place or at a different place in the share tree. You can zero out that usage before you add the node back to the share tree. To do so, first remove the node from the users or projects configured in the grid engine system. Then add the node back to the users or projects there.

Users or projects that were not in the share tree but that ran jobs have nonzero usage when added to the share tree. To zero out usage when you add such users or projects to the tree, first remove them from the users or projects configured in the grid engine system. Then add them to the tree.

To add an interior node under the selected node, click Add Node. A blank Node Info window appears, where you can enter the node's name and number of shares. You can enter any node name or share number.

To add a leaf node under the selected node, click Add Leaf. A blank Node Info window appears, where you can enter the node's name and number of shares. The node's name must be an existing grid engine user (Configuring User Objects With QMON) or project (Defining Projects)

The following rules apply when you are adding a leaf node:

All nodes have a unique path in share tree.
A project is not referenced more than once in share tree.
A user appears only once in a project subtree.
A user appears only once outside of a project subtree.
A user does not appear as a nonleaf node.
All leaf nodes in a project subtree reference a known user or the reserved name default. See a detailed description of this special user in About the Special User default.
Project subtrees do not have subprojects.
All leaf nodes not in a project subtree reference a known user or known project.
All user leaf nodes in a project subtree have access to the project.

To edit the selected node, click Modify. A Node Info window appears. The window displays the mode's name and its number of shares.

To cut or copy the selected node to a buffer, click Cut or Copy. To Paste under the selected node the contents of the most recently cut or copied node, click Paste.

To delete the selected node and all its descendents, click Delete.

To clear the entire share-tree hierarchy, click Clear Usage. Clear the hierarchy when the share-based policy is aligned to a budget and needs to start from scratch at the beginning of each budget term. The Clear Usage facility also is handy when setting up or modifying test N1 Grid Engine 6.1 software environments.

QMON periodically updates the information displayed in the Share Tree Policy dialog box. Click Refresh to force the display to refresh immediately.

To save all the node changes that you make, click Apply. To close the dialog box without saving changes, click Done.

To search the share tree for a node name, click Find, and then type a search string. Node names are indicated which begin with the case sensitive search string. Click Find Next to find the next occurrence of the search string.

Click Help to open the online help system.

Share Tree Policy Parameters

To display the Share Tree Policy Parameters, click the arrow at the right of the Node Attributes.

CPU [%] slider — This slider's setting indicates what percentage of Combined Usage CPU is. When you change this slider, the MEM and I/O sliders change to compensate for the change in CPU percentage.
MEM [%] slider — This slider's setting indicates what percentage of Combined Usage memory is. When you change this slider, the CPU and I/O sliders change to compensate for the change in MEM percentage.
I/O [%] slider — This slider's setting indicates what percentage of Combined Usage I/O is. When you change this slider, the CPU and MEM sliders change to compensate for the change in I/O percentage.

Note –
CPU [%], MEM [%], and I/O [%] always add up to 100%
Lock Symbol — When a lock is open, the slider that it guards can change freely. The slider can change either because the slider was moved or because it is compensating for another slider's being moved.

When a lock is closed, the slider that it guards cannot change. If two locks are closed and one lock is open, no sliders can be changed.
Half-life — Use this field to specify the half-life for usage. Usage is decayed during each scheduling interval so that any particular contribution to accumulated usage has half the value after a duration of half-life.
Days/Hours selection menu — Select whether half-life is to be measured in days or hours.
Compensation Factor — This field accepts a positive integer for the compensation factor. Reasonable values are in the range between 2 and 10.

The actual usage of a user or project can be far below its targeted usage. The compensation factor prevents such users or projects from dominating resources when they first get those resources. See Compensation Factor for more information.

About the Special User `default`

You can use the special user default to reduce the amount of share-tree maintenance for sites with many users. Under the share-tree policy, a job's priority is determined based on the node the job maps to in the share tree. Users who are not explicitly named in the share tree are mapped to the default node, if it exists.

The specification of a single default node allows for a simple share tree to be created. Such a share tree makes user-based fair sharing possible.

You can use the default user also in cases where the same share entitlement is assigned to most users. Same share entitlement is also known as equal share scheduling.

The default user configures all user entries under the default node, giving the same share amount to each user. Each user who submits jobs receives the same share entitlement as that configured for the default user. To activate the facility for a particular user, you must add this user to the list of grid engine users.

The share tree displays “virtual” nodes for all users who are mapped to the default node. The display of virtual nodes enables you to examine the usage and the fair-share scheduling parameters for users who are mapped to the default node.

You can also use the default user for “hybrid” share trees, where users are subordinated under projects in the share tree. The default user can be a leaf node under a project node.

The short-term entitlements of users vary according to differences in the amount of resources that the users consume. However, long-term entitlements of users remain the same.

You might want to assign lower or higher entitlements to some users while maintaining the same long-term entitlement for all other users. To do so, configure a share tree with individual user entries next to the default user for those users with special entitlements.

In Example A, all users submitting to Project A get equal long-term entitlements. The users submitting to Project B only contribute to the accumulated resource consumption of Project B. Entitlements of Project B users are not managed.

Example 5–2 Example A

Diagram shows project nodes A and B. Project A has the
default user as a leaf node. Project B has no leaf nodes.

Compare Example A with Example B:

Example 5–3 Example B

Diagram shows project nodes A and B. Project A has the
default user as a leaf node. Project B has three leaf nodes: default, and
Users A and B.

In Example B, treatment for Project A is the same as for Example A. But all default users who submit jobs to Project B, except users A and B, receive equal long-term resource entitlements. Default users have 20 shares. User A, with 10 shares, receives half the entitlement of the default users. User B, with 40 shares, receives twice the entitlement as the default users.

Configuring the Share-Based Policy From the Command Line

Note –

Use QMON to configure the share tree policy, because a hierarchical tree is well-suited for graphical display and for editing. However, if you need to integrate share tree modifications in shell scripts, for example, you can use the qconf command and its options.

To configure the share-based policy from the command line, use the qconf command with appropriate options.

The qconf options -astree, -mstree, -dstree, and -sstree, enable you to do the following:
- Add a new share tree
- Modify an existing share tree
- Delete a share tree
- Display the share tree configuration
See the qconf(1) man page for details about these options. The share_tree(5) man page contains a description of the format of the share tree configuration.
The -astnode, -mstnode, -dstnode, and -sstnode options do not address the entire share tree, but only a single node. The node is referenced as path through all parent nodes down the share tree, similar to a directory path. The options enable you to add, modify, delete, and display a node. The information contained in a node includes its name and the attached shares.
The weighting of the usage parameters CPU, memory, and I/O are contained in the scheduler configuration as usage_weight. The weighting of the half-life is contained in the scheduler configuration as halftime. The compensation factor is contained in the scheduler configuration as compensation_factor. You can access the scheduler configuration from the command line by using the -msconf and the -ssconf options of qconf. See the sched_conf(5) man page for details about the format.

How to Create Project-Based Share-Tree Scheduling

The objective of this setup is to guarantee a certain share assignment of all the cluster resources to different projects over time.

Specify the number of share-tree tickets (for example, 1000000) in the scheduler configuration.

See Configuring Policy-Based Resource Management With QMON, and the sched_conf(5) man page.

(Optional) Add one user for each scheduling-relevant user.

See Configuring User Objects With QMON, and the user(5) man page.

Add one project for each scheduling-relevant project.

See Defining Projects With QMON, and the project(5) man page.

Use QMON to set up a share tree that reflects the structure of all scheduling-relevant projects as nodes.

See Configuring the Share-Tree Policy With QMON.

Assign share tree shares to the projects.

For example, if you are creating project-based share-tree scheduling with first-come, first-served scheduling among jobs of the same project, a simple structure might look like the following:

If you are creating project-based share-tree scheduling with equal shares for each user, a simple structure might look like the following:

If you are creating project-based share-tree scheduling with individual user shares in each project, add users as leaves to their projects. Then assign individual shares. A simple structure might look like the following:

If you want to assign individual shares to only a few users, designate the user default in combination with individual users below a project node. For example, you can condense the tree illustrated previously into the following: