Solaris Resource Manager 1.3 System Administration Guide

Chapter 2 Normal Operations

This chapter describes Solaris Resource Manager principles of operation and introduces key concepts. Workload Configuration provides an example to reinforce the descriptions and to illustrate a simple hierarchy. (A more complex hierarchy example is presented in Chapter 10, Advanced Usage.)

Limit Node Overview

Solaris Resource Manager is built around a fundamental addition to the Solaris kernel called the lnode (limit node). Lnodes correspond to UNIX user IDs (UIDs) and can represent individual users, groups of users, applications, and special requirements. Lnodes are indexed by UID; they are used to record resource allocation policies and accrued resource usage data by processes at the user, group, and application levels.

Although an lnode is identified by UID, it is separate from the credentials that affect permissions. The credential structure determines whether a process can read, write, and modify a file. The lnode structure is used to track the resource limits and usage.

In certain cases, a user might want to use a different set of limits. This is accomplished by using srmuser(1SRM) to attach to a different lnode. Note that this change does not affect the credential structure, which is still associated with the original UID, and that the process still retains the same permissions.
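
For example (the batch lnode and the script name here are hypothetical), a user with sufficient privilege could charge a long-running job against a separate set of limits:

# srmuser batch /usr/local/bin/nightly_build

The job's processes are attached to the batch lnode for resource management purposes, while file access permissions remain those of the invoking user's credentials.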

Resource Management

Hierarchical Structure

The Solaris Resource Manager management model organizes lnodes into a hierarchical structure called the scheduling tree. The scheduling tree is organized by UID: each lnode references the UID of the lnode's parent in the tree.

Each sub-tree of the scheduling tree is called a scheduling group, and the user at the root of a scheduling group is the group header. (The root user is the group header of the entire scheduling tree.) Setting the flag.admin flag delegates the ability to manage resource policies within the scheduling group to the group header.
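
For example, delegation to a group header might look like this (operations is a hypothetical group header):

# limadm set flag.admin=set operations

The operations user can then manage the lnodes within its own sub-tree without involving the central system administrator.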

Lnodes are initially created by parsing the UID file. After Solaris Resource Manager has been installed, the lnode administration command (limadm(1MSRM)) is used to create additional lnodes and assign them to parents. The scheduling tree data is stored in a flat file database that can be modified as required using limadm.

Although the UID used by an lnode does not have to correspond to a system account with an entry in the system password map, it is recommended that a system account be created for the UID of every lnode. In the case of a non-leaf lnode (one with subordinate lnodes below it in the hierarchy), it is possible for the account associated with the lnode to be purely administrative; no one ever logs in to it. However, it is equally possible that it is the lnode of a real user who does log in and run processes attached to this non-leaf lnode.

Note that Solaris Resource Manager scheduling groups and group headers have nothing to do with the system groups defined in the /etc/group database. Each lnode in the scheduling tree, including group headers, corresponds to a real system user with a unique UID.

Hierarchical Limits

If a hierarchical limit is assigned to a group header, it applies to the usage of that user plus the total usage of all members of the scheduling group. This allows limits to be placed on entire groups, as well as on individual members. Resources are allocated to the group header, who can allocate them to users or groups of users that belong to the same group.
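
As a sketch (all user names and values here are hypothetical), a group header might be given a block of shares at its own level, with its members then splitting the group's entitlement 2:1 between themselves:

# limadm set cpu.shares=60 finance
# limadm set cpu.shares=40 fin_batch
# limadm set cpu.shares=20 fin_report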

Processes

Every process is attached to an lnode. The init process is always attached to the root lnode. When processes are created by the fork(2) system call, they are attached to the same lnode as their parent. Processes may be re-attached to any lnode using a Solaris Resource Manager system call, given sufficient privilege. Privileges are set by the central system administrator or by users with the correct administrative permissions enabled.

Resource Control

The Solaris Resource Manager software provides control of the following system resources: rate of CPU usage, virtual memory, physical memory (Solaris 8 only), number of processes, number of concurrent logins of a user and/or a scheduling group, and terminal connect-time.

Table 2-1 Solaris Resource Manager Functions

System Resource                   Allocation Policy            Control                      Measurement                  Usage Data
CPU Usage                         Yes (per user ID)            Yes                          Yes (per user ID)            Yes
Virtual Memory                    Yes (per-user, per-process)  Yes (per-user, per-process)  Yes (per-user, per-process)  Yes
Physical Memory (Solaris 8 Only)  Yes                          Yes                          Yes                          Yes
No. of Processes                  Yes                          Yes                          Yes                          Yes
User/Scheduling Group Logins      Yes                          Yes                          Yes                          Yes
Connect-Time                      Yes                          Yes                          Yes                          Yes

Solaris Resource Manager keeps track of the usage of each resource by each user. Users may be assigned hard limits on resource usage for all resources except CPU. Once usage reaches a hard limit, further attempts to consume that resource fail. Hard limits are directly enforced by either the kernel or the software that is responsible for managing the respective resource.
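
For example, a hard limit on the number of processes might be imposed as follows (the user name and value are hypothetical):

# limadm set process.limit=50 chuck

Once 50 processes are attached to chuck's lnode, further process-creation attempts fail (typically surfacing as a failed fork(2)) until existing processes exit.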

A limit value of zero indicates no limit. All limit attributes of the root lnode should be left set to zero.

Generally, system resources can be divided into two classes: fixed (or non-renewable) resources and renewable resources. However, Solaris Resource Manager 1.3 used in the Solaris 8 environment introduces a third category of limit: the soft limit. Because resident set size (RSS) soft limits are enforced indirectly by the resource cap enforcement daemon, usage can temporarily exceed the limit. See Chapter 8, Physical Memory Management Using the Resource Capping Daemon for additional information.

Solaris Resource Manager manages fixed and renewable resources differently.

Fixed Resources

Fixed or non-renewable resources are those which are available in a finite quantity. Examples include virtual memory, number of processes, concurrent logins of a user and/or a scheduling group, and connect-time. A fixed resource can be consumed (allocated) and relinquished (deallocated), but no other entity can use the resource until the owner deallocates it. Solaris Resource Manager employs a usage and limit model to control the amount of fixed resources used. Usage is defined as the amount of the resource currently in use, and limit is the maximum level of usage that Solaris Resource Manager permits.

Renewable Resources

Renewable resources are those which are in continuous supply, such as CPU time. Renewable resources can only be consumed and, once consumed, cannot be reclaimed. At any one time, a renewable resource has limited availability, and any portion not used at that time is no longer available. (An analogy is sunlight. Only a certain amount arrives from the sun at any given instant, but more will surely be coming for the next few million years.) For this reason, unused renewable resources are reassigned to other users without explicit reallocation, so that none of the supply is wasted.

Solaris Resource Manager employs a usage, limit, and decay rate to control a user's rate of consumption of a renewable resource. Usage is defined as the total resource used, with a limit set on the ratio of usages in comparison to other users in the group. Decay rate refers to the rate at which historical usage is discounted. The next resource quantum (for example, a clock tick) is allocated to the active lnode with the lowest decayed total usage value in relation to its allocated share. The decayed usage value is a measure of the total usage over time, less some portion of historical usage determined by a half-life decay model.
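
A short calculation illustrates the half-life decay model (the numbers are hypothetical). Decayed usage follows:

    decayed_usage(t) = usage * 0.5^(t / half_life)

With a half-life of 120 seconds, 1,000 ticks of accrued usage count as 500 ticks after 120 seconds of inactivity and 250 ticks after 240 seconds, so an lnode that stops consuming CPU quickly regains standing relative to its share.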

CPU Resource Management

The allocation of the renewable CPU resource is controlled using a fair share scheduler called the Solaris Resource Manager SHR scheduler.

Scheduler Methodology

Each lnode is assigned a number of CPU shares. The processes associated with each lnode are allocated CPU resources in proportion to the lnode's shares relative to the total number of outstanding active shares (active means that the lnode has running processes attached). Only active lnodes are considered for an allocation of the resource, because only they have processes running and in need of CPU time.

As a process consumes CPU ticks, the CPU usage attribute of its lnode increases. The scheduler regularly adjusts the priorities of all processes to force the relative ratios of CPU usages to converge on the relative ratios of CPU shares for all active lnodes at their respective levels. In this way, users can expect to receive at least their entitlements of CPU service in the long run, regardless of the behavior of other users.

The scheduler is hierarchical because it also ensures that groups receive their group entitlements independently of the behavior of the members. The Solaris Resource Manager SHR scheduler is a long-term scheduler; it ensures that all users and applications receive a fair share over the course of the scheduler term. This means that when a light user starts to request the CPU, that user receives commensurately more resource than heavy users until their comparative usages are in line with their relative "fair" share allocation. The more you use over your entitlement now, the less you will receive in the future.

Additionally, Solaris Resource Manager decays past usage over a period set by the system administrator, so that older usage counts for progressively less. The decay model is one of half-life decay, in which 50 percent of the accrued usage decays away within one half-life. This ensures that steady, even users are not penalized by short-term, process-intensive users. The half-life decay period sets the responsiveness, or term, of the scheduler; the default value is 120 seconds. A long half-life favors even usage, typical of longer batch jobs, while a short half-life favors interactive users. Shorter values tend to provide more even response across the system, at the expense of slightly less accuracy in computing and maintaining system-wide resource allocation. Regardless of administrative settings, the scheduler tries to prevent resource starvation and ensure reasonable behavior, even in extreme situations.

Scheduler Advantages

The primary advantage of the Solaris Resource Manager SHR scheduler over the standard Solaris scheduler is that it schedules users or applications rather than individual processes. Every process associated with an lnode is subject to a set of limits. For the simple case of one user running a single active process, this is the same as subjecting each process to the limits listed in the corresponding lnode. When more than one process is attached to an lnode, as when members of a group each run multiple processes, all of the processes are collectively subject to the listed limits. This means that users or applications cannot consume CPU at a greater rate than their entitlements allow, regardless of how many concurrent processes they run. The method for assigning entitlements as a number of shares is simple and understandable, and the effect of changing a user's shares is predictable.

Another advantage of the SHR scheduler is that while it manages the scheduling of individual threads (technically, in Solaris, the scheduled entity is a lightweight process (LWP)), it also apportions CPU resources between users.

These concepts are illustrated by the following equation:

    new_SRM_priority = current_SRM_priority + (CPU_usage / number_of_shares)

The new_SRM_priority is then mapped to the system priority. The higher the Solaris Resource Manager priority, the lower the system priority, and vice versa. Each decay period, CPU_usage is halved and then incremented by the most recent usage.
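
As a worked example with hypothetical numbers: suppose lnodes A and B have equal current priority and equal decayed CPU_usage of 1,000 ticks, but A holds 50 shares and B holds 25. A's Solaris Resource Manager priority increases by 1000/50 = 20, while B's increases by 1000/25 = 40. Because a higher Solaris Resource Manager priority maps to a lower system priority, A's processes are subsequently favored by the dispatcher.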

Each user also has a set of flags, which are boolean-like variables used to enable or disable selective system privileges, such as login. Flags can be set individually per user, or be inherited from a parent lnode.

A user's usages, limits, and flags can be read by any user, but they can be altered only by users with the correct administrative privileges.

Eliminating CPU Waste

Solaris Resource Manager never wastes CPU availability. No matter how low a user's allocation, that user is always given all the available CPU if there are no competing users. One consequence of this is that users may notice performance that is less smooth than usual. If a user with a very low effective share is running an interactive process without any competition, the job will appear to run quickly. However, as soon as another user with a greater effective share demands CPU time, it will be given to that user in preference to the first user, so the first user will notice a marked job slow-down. Nevertheless, Solaris Resource Manager goes to some lengths to ensure that legitimate users are not cut off and unable to do any work. All processes being scheduled by Solaris Resource Manager (except those with a maximum nice value) will be allocated CPU regularly by the scheduler. There is also logic to prevent a new user that has just logged in from being given an arithmetically "fair" but excessively large proportion of the CPU to the detriment of existing users.

Virtual Memory (Per-User and Per-Process Limits)

Virtual memory is managed using a fixed resource model. The per-user virtual memory limit applies to the sum of the memory sizes of all processes attached to the lnode. In addition, there is a per-process virtual memory limit that restricts the total size of each process's virtual address space, including all code, data, stack, file mappings, and shared libraries. Both limits are hierarchical. Limiting virtual memory is useful for avoiding virtual memory starvation. For example, Solaris Resource Manager will stop an application that is leaking memory from consuming unwarranted amounts of virtual memory to the detriment of all users. Instead, such a process starves only itself or, at worst, others in its resource group.
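
A sketch of setting both limits follows (the name dbgroup and the values are hypothetical, and the scaled-value syntax is an assumption; memory.limit and memory.plimit correspond to the Mem limit and Proc mem limit fields reported by liminfo):

# limadm set memory.limit=512M dbgroup
# limadm set memory.plimit=64M dbgroup

With these settings, all processes attached to dbgroup may together use at most 512 Mbytes of virtual memory, and no single process may exceed 64 Mbytes.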

Physical Memory

If you are using Solaris Resource Manager 1.3 in the Solaris 8 operating environment, you can regulate the resource consumption of physical memory by collections of processes attached to an lnode or project. See Chapter 8, Physical Memory Management Using the Resource Capping Daemon for information on this functionality.

Number of Processes

The number of processes that users can run simultaneously is controlled using a fixed resource model with hierarchical limits.
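
Because the limit is hierarchical, it can be applied to a group header as well as to an individual user. For example (hypothetical name and value):

# limadm set process.limit=200 operations

This caps the combined process count of the operations group header and all members of its scheduling group at 200.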

Terminal and Login Connect-Time Limits

The system administrator and group header can set terminal login privileges, number of logins, and connect-time limits, which are enforced hierarchically by Solaris Resource Manager. As a user approaches a connect-time limit, warning messages are sent to the user's terminal. When the limit is reached, the user is notified, then forcibly logged out after a short grace period.
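
A connect-time limit might be set as follows; note that the attribute name and value syntax shown here are assumptions inferred from the Term limit field of liminfo output, so consult limadm(1MSRM) for the exact forms:

# limadm set terminal.limit=8h chuck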

Solaris Resource Manager progressively decays past usage of connect-time so that only the most recent usage is significant. The system administrator sets a half-life parameter that controls the rate of decay. A long half-life favors even usage, while a short half-life favors interactive users.

User Administration

The central system administrator (or superuser) can create and remove user lnodes. This administrator may alter the limits, usages, and flags of any user, including the administrator's own, and may set administrative privileges for any lnode, selectively delegating administrative capabilities to users.

A sub-administrator can be granted these privileges by the setting of the flag.uselimadm flag. A sub-administrator can execute any limadm command as user root, and can be thought of as an assistant to the superuser.

A user granted hierarchical administrative privilege by the setting of the flag.admin flag is called a group administrator. A group administrator can modify the lnodes of users within the sub-tree of which they are the group header, and manage the group's resource allocation and scheduling policy. Group administrators cannot alter their own limits or flags, and cannot circumvent their own flags or limits by altering flags or usages within their group.

Measurement

Resource usage information is visible to the administrators of the system in two views: resource usage attributed to each user, and resource usage attributed to each workload.

Usage Data Overview

The Solaris Resource Manager system maintains information (primarily current and accrued resource usage) that can be used by administrators to conduct comprehensive system resource accounting. No accounting programs are supplied as part of Solaris Resource Manager, but its utility programs provide a base for the development of a customized resource accounting system.

For more information on setting up accounting procedures, see Chapter 9, Usage Data.

Workload Configuration

The key to effective resource management using Solaris Resource Manager is a well-designed resource hierarchy. Solaris Resource Manager uses the lnode tree to implement the resource hierarchy.

Mapping the Workload to the Lnode Hierarchy

Each node in the lnode tree maps to a UID in the password map, which means that workloads must be mapped to align with entries in the password map. In some cases, additional users may need to be created to cater to the leaf nodes in the hierarchy. These special users do not actually run processes or jobs, but act as an administration point for the leaf node.
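
For example, a purely administrative account might be created and placed in the tree as follows (the UID, name, and comment are hypothetical):

# useradd -u 2050 -c "Database workload admin" dbadmin
# limadm set sgroup=root dbadmin

No one ever logs in as dbadmin; it exists only so that the workload has an lnode to which shares and limits can be attached.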

A Simple Flat Hierarchy

This simple hierarchy was constructed to control the processing resources of two users, Chuck and Mark. Both of these users consume large amounts of CPU resources at various points, and thus affect each other at different times during the day.

To resolve this, a single-level hierarchy is constructed, and equal shares of CPU are allocated to each user.

Figure 2-1 A Simple Flat Solaris Resource Manager Hierarchy

Diagram shows an example flat hierarchy where each user gets 50 CPU shares out of 100 shares.

This simple hierarchy is established using the limadm command to make Chuck and Mark children of the root share group:

# limadm set sgroup=root chuck
# limadm set sgroup=root mark

To allocate 50 percent of the resources to each user, give each user the same number of CPU shares. (For simplicity, in this example 50 shares have been allocated to each user, but allocating 1 share to each would achieve the same result.) The limadm command is used to allocate the shares:

# limadm set cpu.shares=50 chuck
# limadm set cpu.shares=50 mark

Use the liminfo command to view the changes to the lnode associated with Chuck:

# liminfo -c chuck
Login name:                  chuck       Uid (Real,Eff):         2001 (-,-)     
Sgroup (uid):             root (0)       Gid (Real,Eff):          200 (-,-)     

Shares:                         50       Myshares:                        1     
Share:                          41 %     E-share:                         0 %   
Usage:                           0       Accrued usage:                   0     

Mem usage:                       0 B     Term usage:                     0s     
Mem limit:                       0 B     Term accrue:                    0s     
Proc mem limit:                  0 B     Term limit:                     0s     
Mem accrue:                      0 B.s 

Processes:                       0       Current logins:                  0     
Process limit:                   0     

Last used: Tue Oct 4 15:04:20 1998
Directory: /users/chuck
Name:       Hungry user     
Shell:      /bin/csh     

Flags:

The fields displayed by the liminfo command are explained in A Typical Application Server. Also refer to the liminfo(1SRM) man page for more information on liminfo fields.