Solaris Resource Manager 1.3 System Administration Guide

Chapter 6 SHR Scheduler

The Solaris Resource Manager SHR scheduler controls the allocation of the CPU resource. Shares allow administrators to control the relative CPU entitlements of users, groups, and applications. The concept is analogous to shares in a company: what matters is not how many you hold, but how many you hold compared with other shareholders.

Technical Description

There are four attributes per lnode associated with the Solaris Resource Manager CPU scheduler: cpu.shares, cpu.myshares, cpu.usage, and cpu.accrue. The output of liminfo(1SRM) displays these attributes and other useful values.

Solaris Resource Manager scheduling is implemented using the SHR scheduling class. This includes support for the nice(1), priocntl(1), renice(1), and dispadmin(1M) commands. At the system-call level, SHR is compatible with the TS scheduling class.

Shares

A user's cpu.shares attribute is used to apportion CPU entitlement with respect to the user's parent and active peers. A user's cpu.myshares attribute is meaningful only if the user has child users who are active; it is used to determine the proportion of CPU entitlement with respect to them.

For example, if users A and B are the only children of parent P, and A, B, and P each have one share within group P (that is, A and B have cpu.shares set to 1, while P has cpu.myshares set to 1), then each has a CPU entitlement of one-third of the group's total entitlement.

Thus, the actual CPU entitlement of a user depends on the parent's relative entitlement. This, in turn, depends on the relative values of cpu.shares of the parent to the parent's peers and to the cpu.myshares of the grandparent, and so on up the scheduling tree.

For system management reasons, processes attached to the root lnode are not subject to the shares attributes. Any process attached to the root lnode is always given almost all the CPU resources it requests.

It is important that no CPU-intensive processes be attached to the root lnode, since that would severely impact the execution of other processes. To avoid this, the following precautions should be taken:

Not all group headers in the scheduling tree need to represent actual users who run processes, and in these cases it is not necessary to allocate them a share of CPU. Such lnodes can be indicated by setting their cpu.myshares attribute to zero. The cpu.accrue attribute in such a group header still includes all charges levied on all members of its group.

Allocated Share

The cpu.shares and cpu.myshares attributes determine each active lnode's current allocated share of CPU, as a percentage. The shares of inactive users make no difference to allocated share. If only one user is active, that user will have 100 percent of the available CPU resource. If there are only two active users with equal shares in the same group, each will have allocated shares of 50 percent. See Calculation of Allocated Share for more information on how the allocated share is calculated.
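The two cases above can be reproduced with a short sketch. This is an illustrative model of peers in a single group, not the scheduler's implementation; the function name and the user names are hypothetical.

```python
def allocated_shares(peers):
    """Return each peer's allocated share of CPU as a percentage.

    peers maps a user name to (shares, is_active). Inactive users
    are ignored when the total is computed, so their shares make
    no difference to the result.
    """
    active_total = sum(shares for shares, active in peers.values() if active)
    if active_total == 0:
        # Nobody is active, so nothing is allocated.
        return {name: 0.0 for name in peers}
    return {
        name: (100.0 * shares / active_total if active else 0.0)
        for name, (shares, active) in peers.items()
    }


# Only one active user: that user gets the whole CPU resource.
single = allocated_shares({"u1": (1, True), "u2": (5, False)})

# Two active users with equal shares: 50 percent each.
pair = allocated_shares({"u1": (2, True), "u2": (2, True), "u3": (9, False)})
```

Note that the large share count of the inactive user in each call has no effect on what the active users receive.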

Usage and Decay

The cpu.usage attribute increases whenever a process attached to the lnode is charged for a CPU tick. The usage attribute value exponentially decays at a rate determined by the usage decay global Solaris Resource Manager parameter. The usage decay rate (described by a half-life in seconds) is set by the srmadm(1MSRM) command.
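As a rough illustration of half-life decay, the sketch below applies the standard exponential decay formula to a usage value. The 120-second half-life and the starting value are arbitrary figures chosen for the example, not Solaris Resource Manager defaults, and the function name is hypothetical.

```python
def decayed_usage(usage, elapsed_seconds, half_life_seconds):
    """Return the usage value after exponential decay.

    The value halves every half_life_seconds, assuming no new
    charges arrive during the interval.
    """
    if half_life_seconds <= 0:
        raise ValueError("half-life must be positive")
    return usage * 0.5 ** (elapsed_seconds / half_life_seconds)


# Assumed half-life of 120 seconds, starting from a usage of 100.
after_one_half_life = decayed_usage(100.0, 120.0, 120.0)
after_two_half_lives = decayed_usage(100.0, 240.0, 120.0)
```

A shorter half-life makes the scheduler forget past consumption sooner; a longer one makes heavy past use count against an lnode for longer.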

Although all processes have an lnode regardless of their current scheduling class, those outside the SHR scheduling class are never charged.

Accrued Usage

The accrued usage attribute increases by the same amount as the usage attribute, but is not decayed. It therefore represents the total accumulated usage for all processes that have been attached to the lnode and its members since the attribute was last reset.
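The accumulation described above can be modeled in a few lines. This is a simplified sketch, assuming that each charge is added undecayed to the charged lnode and to every group header above it; the class, method, and lnode names are hypothetical.

```python
class Lnode:
    """Minimal model of an lnode that tracks only accrued usage."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.accrue = 0

    def charge(self, ticks):
        """Add a charge to this lnode and to each ancestor's accrue."""
        node = self
        while node is not None:
            node.accrue += ticks
            node = node.parent

    def reset_accrue(self):
        """Reset the accumulated total for this lnode only."""
        self.accrue = 0


group = Lnode("group")
alice = Lnode("alice", parent=group)
bob = Lnode("bob", parent=group)

alice.charge(30)
bob.charge(12)
# The group header now carries the charges of both members.
group_total = group.accrue
```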

Effective Share

An lnode's allocated share, together with its cpu.usage attribute, determines its current effective share. The Solaris Resource Manager scheduler adjusts the priorities of all processes attached to an lnode so that their rate of work is proportional to the lnode's effective share, and inversely proportional to the number of runnable processes attached to it.
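The proportionality stated above can be shown with a small sketch. It assumes the effective share is already known (this section does not give the formula that derives it from allocated share and usage) and only illustrates how that share is divided among runnable processes. The function name is hypothetical.

```python
def per_process_rate(effective_share, runnable_processes):
    """Relative rate of work for each runnable process on one lnode.

    The rate is proportional to the lnode's effective share and
    inversely proportional to the number of runnable processes.
    """
    if runnable_processes <= 0:
        return 0.0
    return effective_share / runnable_processes


one_process = per_process_rate(0.5, 1)
four_processes = per_process_rate(0.5, 4)
```

The consequence is that starting more processes does not earn an lnode more CPU: four processes each progress at a quarter of the rate of a single one.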

Per-Process Share Priority (sharepri)

Each process attached to an lnode has internal data, specific to Solaris Resource Manager, that is maintained by the operating system kernel. The most important of these values for scheduling purposes is the sharepri value. At any time, the processes with the lowest sharepri values will be the most eligible to be scheduled for running on a CPU.

Sample Share Allocation

Scheduling Tree Structure

The following points relate to the structure of the scheduling tree, which is an area requiring special consideration by the central administrator:

Description of Tree

The tree shown below defines a structure consisting of several group headers and several ordinary users. The top of the tree is the root user. A group header lnode is shown with two integers, which represent the values of its cpu.shares and cpu.myshares attributes, respectively. A leaf lnode is shown with a single integer, which represents the value of its cpu.shares attribute only.

Figure 6-1 Scheduling Tree Structure

Diagram shows group headers with their cpu.shares and cpu.myshares attribute values. Leaf nodes below headers are shown with their cpu.shares only.

Calculation of Allocated Share

Using the previous figure as an example, nodes A, C, and N currently have processes attached to them. At the topmost level, the CPU would only need to be shared between A and M since there are no processes for W or any member of scheduling group W. The ratio of shares between A and M is 3:1, so the allocated share at the topmost level would be 75 percent to group A, and 25 percent to group M.

The 75 percent allocated to group A would then be shared between its active users (A and C), in the ratio of their shares within group A (that is, 1:2). Note that the myshares attribute is used when determining A's shares with respect to its children. User A would therefore get one third of the group's allocated share, and C would get the remaining two thirds. The whole of the allocation for group M would go to lnode N since it is the only lnode with processes.

The overall distribution of allocated share of available CPU would therefore be 0.25 for A, 0.5 for C, and 0.25 for N.
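The calculation above can be reproduced with a short recursive sketch. The share values below are assumptions chosen to match the ratios quoted in the text (3:1 between A and M, 1:2 between A and C), and lnodes B and X are hypothetical inactive members added for illustration; none of these values are read from the figure.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Lnode:
    name: str
    shares: int                  # cpu.shares
    myshares: int = 0            # cpu.myshares (group headers only)
    has_processes: bool = False
    children: list = field(default_factory=list)

    def subtree_active(self):
        """True if this lnode or any descendant has processes."""
        return self.has_processes or any(
            child.subtree_active() for child in self.children
        )


def distribute(children, allocation, out, header=None):
    """Split allocation among a header and its active children."""
    claimants = []
    if header is not None and header.has_processes:
        # The header competes with its children using cpu.myshares.
        claimants.append((header, header.myshares, True))
    claimants += [(c, c.shares, False) for c in children if c.subtree_active()]
    total = sum(weight for _, weight, _ in claimants)
    if total == 0:
        return
    for node, weight, is_header in claimants:
        part = allocation * Fraction(weight, total)
        if is_header or not node.children:
            out[node.name] = part
        else:
            distribute(node.children, part, out, header=node)


# Assumed values; only A, C, and N have processes attached.
top_level = [
    Lnode("A", shares=3, myshares=1, has_processes=True, children=[
        Lnode("B", shares=1),
        Lnode("C", shares=2, has_processes=True),
    ]),
    Lnode("M", shares=1, myshares=1, children=[
        Lnode("N", shares=1, has_processes=True),
    ]),
    Lnode("W", shares=2, myshares=1, children=[Lnode("X", shares=1)]),
]

result = {}
distribute(top_level, Fraction(1), result)
```

With these assumed values, result holds 1/4 for A, 1/2 for C, and 1/4 for N, matching the distribution stated above. The root lnode is left out of the model because its processes are not subject to the shares attributes.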

Further suppose that the A, C, and N processes are all continually demanding CPU and that the system has at most two CPUs. In this case, Solaris Resource Manager will schedule them so that the individual processes receive these percentages of total available CPU:

The rate of progress of the individual processes is controlled so that the target for each lnode is met. On a system with more than two CPUs and only these six runnable processes, the C process will be unable to consume the 50 percent entitlement, and the residue is shared in proportion between A and N.

Solaris Resource Manager and the Solaris nice Facility

The nice facility in the Solaris environment allows a user to reduce the priority of a process so that normal processes will not be slowed by non-urgent ones. With Solaris Resource Manager, the incentive for users to use this facility is a reduced charge rate for CPU time used at a lower priority.

Solaris Resource Manager implements this effect by allowing the central administrator to bias the sharepri decay rate for processes to which nice has been applied. The pridecay global Solaris Resource Manager parameter, set with the srmadm(1MSRM) command, determines the decay rates for the priorities of processes with normal and maximum nice values. The rates for all intervening nice values are interpolated between these two, and the rates down to the minimum nice value are extrapolated in the same way. For example, the priority (that is, sharepri) of normal processes may be decayed with a half-life of 2 seconds, while the priority of processes with a maximum nice value may be decayed with a half-life of 60 seconds.
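To see why a longer half-life disadvantages a process, the sketch below compares how two equal sharepri values decay over 10 seconds using the example half-lives of 2 and 60 seconds. The starting value of 1.0 and the function name are assumptions for illustration. Intermediate nice values are not modeled, because this section does not give the interpolation formula.

```python
def decayed_sharepri(sharepri, elapsed_seconds, half_life_seconds):
    """Return a sharepri value after exponential decay."""
    return sharepri * 0.5 ** (elapsed_seconds / half_life_seconds)


# Both processes start from the same assumed sharepri of 1.0.
normal_process = decayed_sharepri(1.0, 10.0, 2.0)    # normal nice value
niced_process = decayed_sharepri(1.0, 10.0, 60.0)    # maximum nice value
```

After 10 seconds the normal process has decayed through five half-lives, while the process with maximum nice has barely decayed. Because the processes with the lowest sharepri values are the most eligible to run, the normal process regains eligibility much sooner.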

The effect is that processes using nice to reduce their priority get a smaller share of CPU than other processes on the same lnode. Under Solaris Resource Manager, nice has little influence on execution rates for processes on different lnodes unless the queue of runnable processes exceeds the number of CPUs.

Solaris Resource Manager treats processes with a maximum nice value (for example, those started with a nice -19 command) specially. Such processes are granted CPU ticks only if no other process requests them and the ticks would otherwise go idle.

For information on nice, see nice(1) and nice(2SRM). For information on the relationship of Solaris Resource Manager to other resource control features, see Differences Between Solaris Resource Manager and Similar Products.

Dynamic Reconfiguration

The dynamic reconfiguration (DR) feature of Sun Enterprise servers enables users to dynamically add and delete system boards, which contain hardware resources such as processors, memory, and I/O devices. Solaris Resource Manager keeps track of the available processor resources for scheduling purposes and appropriately handles the changes, fairly redistributing currently available processor resources among eligible users and processes.

Because Solaris Resource Manager controls only the virtual memory sizes of processes, not the physical memory used by processes and users, the effect of a DR operation on memory has no impact on Solaris Resource Manager's memory-limit checking.

srmidle Lnode

The idle lnode (srmidle) is the lnode assigned by the central administrator to charge for all the kernel's idle CPU costs. At installation, srmidle was created with a UID of 41. The srmidle lnode should have zero shares, to ensure that the processes attached to it are run only when no other processes are active. The srmidle lnode is assigned using the srmadm command.

At boot time, the default idle lnode is the root lnode. At transition to multi-user mode, the init.d script will set the idle lnode to that of the account srmidle if such an account exists. This behavior can be customized by specifying a different lnode to use in the /etc/init.d/init.srm script.

If the idle lnode is not root, then it must be a direct child of root.

srmother Lnode

The other lnode (srmother) is the lnode assigned by the system administrator as the default parent lnode for new users created after the initial install (where root is the default parent lnode). The srmother lnode, which is created automatically by the system at installation time and cannot be changed, has a default value of 1 share, to ensure that lnodes attached to it will have access to the CPU. The srmother lnode was created with a UID of 43.

The srmother lnode should have no resource limits, a CPU share of 1 or more, and no special privileges.

srmlost Lnode

Under Solaris Resource Manager, the setuid(2SRM) system call has the side effect of attaching the calling process to a new lnode. If the change of attachment fails, typically because the new lnode does not exist, the process is attached instead to the lost lnode (srmlost), which was created when you installed Solaris Resource Manager. If this attachment also fails or no srmlost lnode has been nominated, then the setuid function is unaffected and the process continues on its current lnode.
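The fallback order described above can be summarized as a small decision function. This is a sketch of the described behavior only, not the kernel logic: the function name is hypothetical, lnodes are represented simply by their UIDs, and a failed attachment is modeled only as the lnode not existing.

```python
def lnode_after_setuid(current_uid, target_uid, existing_lnodes, lost_uid=None):
    """Return the UID of the lnode a process is attached to after setuid.

    existing_lnodes is the set of UIDs that have an lnode.
    lost_uid is the nominated srmlost lnode, or None if none is set.
    """
    if target_uid in existing_lnodes:
        # Normal case: attach to the lnode of the new UID.
        return target_uid
    if lost_uid is not None and lost_uid in existing_lnodes:
        # The new lnode does not exist: fall back to the lost lnode.
        return lost_uid
    # No usable lost lnode: the process stays where it is.
    return current_uid


lnodes = {0, 42, 1001}  # 42 is the UID srmlost was created with

attached = lnode_after_setuid(0, 1001, lnodes, lost_uid=42)
fell_back = lnode_after_setuid(0, 2002, lnodes, lost_uid=42)
unchanged = lnode_after_setuid(0, 2002, lnodes, lost_uid=None)
```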

The init.srm script sets the srmlost lnode during the transition to multi-user mode. This behavior can be overridden by specifying an lnode to use in the /etc/init.d/init.srm file. To avoid security breaches, the srmlost lnode should have a CPU share of 1, and no special privileges. If you alter the values, consider the requirements for this user when making the change.

The srmlost lnode was created with a UID of 42.