Solaris Resource Manager 1.3 System Administration Guide

Chapter 10 Advanced Usage

This chapter further explores ways to prioritize and manage dissimilar applications on the system. An example that illustrates these capabilities and other key concepts associated with the Solaris Resource Manager software is provided. The last part of this chapter discusses how to configure Solaris Resource Manager in a Sun Cluster 3.0 12/01 (and later update) environment.

Batch Workloads

Most commercial installations have a requirement for batch processing. Batch processing is typically done at night, after the daily online workload has diminished. This is usually the practice for two reasons: to consolidate the day's transactions into reports, and to prevent batch workloads from impacting the online load.

See Examples for an illustration of a hierarchy to control the environment in which batch jobs are run; this section also covers the Solaris Resource Manager commands used in the process.

Resources Used by Batch Workloads

Batch workloads are a hybrid between online transaction processing (OLTP) and decision support system (DSS) workloads, and their effect on the system lies somewhere between the two. A batch workload can consist of many repetitive transactions to a database, with some heavy computational work for each. A simple example would be the calculation of total sales for the day. In this case, the batch process would retrieve every sales transaction for the day from the database, extract the sales amount, and maintain a running sum.

Batch processing typically places high demands on both processor and I/O resources, since a large amount of CPU is required for the batch process and the database, and a large number of I/Os are generated from the backend database for each transaction retrieved.

A batch workload is controlled by effectively limiting the rate of consumption of both CPU and I/O. Solaris Resource Manager allows fine-grained resource control of CPU, but I/O resources must be managed by allocating different I/O devices to each workload.

Two methods are typically used to isolate batch resource impact:

Make a copy of the database on a separate system and run the batch and reporting workloads on that separate system. (Note, however, that in most situations, the batch process updates parts of the online database and cannot be separated from it.)
Use CPU resource control.

Because the amount of I/O generated from a batch workload is proportional to the amount of CPU consumed, limits on CPU cycles can be used to indirectly control the I/O rate of the batch workload. Note, however, that care must be taken to ensure that excessive I/O is not generated on workloads that have very light CPU requirements.

Problems Associated With Batch Processing

By definition, a batch workload is a workload that runs unconstrained, and it will attempt to complete in the shortest time possible. This means that batch is the worst resource consumer, because it will take all the resources it needs until it is constrained by a system bottleneck (generally the smallest dataflow point in the system).

Batch presents two problems for system managers; it can impact other batch jobs running concurrently, and it can never be run together with the online portion of the workload during business hours.

Even if the batch jobs are scheduled to run during off-hours, for example, from 12:00 a.m. to 6:00 a.m., a system problem or a day of high sales could cause the batch workload to spill over into business hours. Although not quite as bad as downtime, having a batch workload still running at 10:30 a.m. the next day could make online customers wait several minutes for each transaction, ultimately leading to fewer transactions.

Using resource allocation will limit the amount of resources available to the batch workloads and constrain them in a controlled manner.

Consolidation

Solaris Resource Manager permits system resources to be allocated at a system wide level, sometimes in proportion with business arrangements of machine usage between departments. Examples shows how the Solaris Resource Manager hierarchy is used to achieve this.

Virtual Memory and Databases

The Solaris Resource Manager product also provides the capability to limit the amount of virtual memory used by users and workloads. This capability does not manage physical memory; rather it effectively restricts the amount of global swap space that is consumed by each user.

When a user or workload reaches the virtual memory limit set for their lnode, the system returns a memory allocation error to the application; that is, calls to malloc() fail. This error code is reported to the application as though the application had run out of swap space.

Few applications respond well to memory allocation errors. Thus, it is risky to ever let a database server reach its virtual memory limit. In the event that the limit is reached, the database engine might crash, resulting in a corrupted database.

Virtual memory limits should be set high so they are not reached under normal circumstances. Additionally, the virtual memory limit can be used to place a ceiling over the entire database server, which will stop a failing database with a memory leak from affecting other databases or workloads on the system.

Managing NFS

NFS, Sun's distributed computing environment, runs as kernel threads and uses the kernel scheduling class SYS. Since scheduling allocation for NFS is not managed by the Solaris Resource Manager SHR class, no CPU resource control of NFS is possible. The Solaris Resource Manager product's ability to allocate processor resources may be reduced on systems offering extensive NFS service.

NFS can, however, be controlled by using network port resource management. For example, Solaris Bandwidth Manager can be used to control the number of NFS packets on the server. NFS can also be managed in some cases by using processor sets to limit the number of CPUs available in the system class.

Managing Web Servers

The Solaris Resouce Manager software can be used to manage resources on web servers by controlling the amount of CPU and virtual memory. Three basic topologies are used on systems that host web servers.

Resource Management of a Consolidated Web Server

A single web server can be managed by controlling the amount of resource that the entire web server can use. This is useful in an environment in which a web server is being consolidated with other workloads. This is the most basic form of resource management, and simply prevents other workloads from impacting the performance of the web server, and vice versa. For example, if a CGI script in the web server runs out of control with a memory leak, the entire system will not run out of swap space; only the web server will be affected.

In this example, a web server is allocated 20 shares, which means that it is guaranteed at least 20 percent of the processor resources should the database place excessive demands on the processor.

Diagram shows that web server is guaranteed its percentage of processor resources even if another application places excessive demands on the CPU.

See Putting on a Web Front-end Process for an additional web server example.

Finer-Grained Resource Management of a Single Web Server

There are often requirements to use resource management to control behavior within a single web server. For example, a single web server can be shared between many users, each with their own cgi-bin programs.

An error in a single cgi-bin program can cause the entire web server to run slow, or in the case of a memory leak, could even bring down the web server. To prevent this from happening, per-process limits can be used.

Diagram shows use of per-process limits within a single web server.

Resource Management of Multiple Virtual Web Servers

Single machines are often used to host multiple virtual web servers in a consolidated fashion. In this case, multiple instances of the httpd web server process exist, and there is far greater opportunity to exploit resource control through Solaris Resource Manager.

It is possible to run each web server as a different UNIX UID by setting a parameter in the web server configuration file. This effectively attatches each web server to a different lnode in the Solaris Resource Manager hierarchy.

For example, the Sun WebServer^TM has the following parameters in the configuration file /etc/http/httpd.conf:

# Server parameters
 server  {
   server_root                   "/var/http/"
   server_user                   "webserver1"
   mime_file                     "/etc/http/mime.types"
   mime_default_type             text/nlain
   acl_enable                    "yes"
   acl_file                      "/etc/http/access.acl"
   acl_delegate_depth            3
   cache_enable                  "yes"
   cache_small_file_cache_size   8                       # megabytes
   cache_large_file_cache_size   256                     # megabytes
   cache_max_file_size           1                       # megabytes
   cache_verification_time       10                      # seconds
   comment                       "Sun WebServer Default Configuration"

   # The following are the server wide aliases

   map   /cgi-bin/               /var/http/cgi-bin/              cgi
   map   /sws-icons/             /var/http/demo/sws-icons/
   map   /admin/                 /usr/http/admin/

 # To enable viewing of server stats via command line,
 # uncomment the following line
   map   /sws-stats              dummy                           stats
 }

By configuring each web server to run as a different UNIX UID, you can set different limits on each web server. This is particularly useful for both control and accounting for resource usage on a machine hosting many web servers.

In this case, you can make use of many or all of the Solaris Resource Manager resource controls and limits:

Shares [cpu.shares]: The cpu.shares can be used to proportionally allocate resources to the different web servers.
Mem limit [memory.limit]: The memory.limit can be used to limit the amount of virtual memory that the web server can use, which will prevent any one web server from causing another to fail due to memory allocation.
Proc mem limit [memory.plimit]: The per-process memory limit can be used to limit the amount of virtual memory a single cgi-bin process can use, which will stop any cgi-bin process from bringing down its respective web server.
Process limit [process.limit]: The maximum total number of processes allowed to attach to a web server can effectively limit the number of concurrent cgi-bin processes.

The Role and Effect of Processor Sets

Even with the Solaris Resource Manager software in effect, processor sets can still play an important role in resource allocation. There might be cases in which a system must have hard limits applied to the resource policies. For example, a company may purchase a single 24-processor system, and then host two different business units from the same machine. Each of the business units pays for a proportion of the machine, 40 percent and 60 percent, for example. In this scenario, the administrator might want to establish that the business that pays for 40 percent of the machine never gets more than that share.

With processor sets, it is possible to divide the workloads into 40 percent and 60 percent by allocating 10 processors to the unit with 40 percent, and 14 processors to the unit with 60 percent.

When using processor sets with the Solaris Resource Manager product, it is important to understand the interaction between these two technologies. In some circumstances, the net effect might be different than anticipated.

A Simple Example

The following illustration shows a simple combination of Solaris Resource Manager and processor sets. In this example, processor sets and Solaris Resource Manager CPU shares are mixed.

Graphic description provided in context. Refer to two paragraphs that immediately follow graphic.

User 1 has 25 Solaris Resource Manager shares and is restricted to processor set A (1 CPU). User 2 has 75 Solaris Resource Manager shares and is restricted to processor set B (1 CPU).

In this example, user 2 will consume its entire processor set (50 percent of the system). Because user 2 is only using 50 percent (rather than its allocated 75 percent), user 1 is able to use the remaining 50 percent. In summary, each user will be granted 50 percent of the system.

A More Complex Example

The following example shows a more complex scenario in which processor sets and Solaris Resource Manager CPU shares are mixed.

Users 1 and 3 have 10 Solaris Resource Manager shares each and are restricted to processor set A (1 CPU). User 2 has 80 Solaris Resource Manager shares and is restricted to processor set B (1 CPU).

In this example, user 2 will consume its entire processor set (50 percent of the system). Because user 2 is only using 50 percent (rather than its allocated 80 percent), users 1 and 3 are able to use the remaining 50 percent. This will mean that users 1 and 3 get 25 percent of the system, even though they are allocated only 10 shares each.

A Scenario to Avoid

The following scenario should be avoided.

In this scenario, one user has processes in both processor sets. User 1 has 20 Solaris Resource Manager shares and has processes in each processor set. User 2 has 80 Solaris Resource Manager shares and is restricted to processor set B (1 CPU).

In this example, user 1's first process will consume its entire processor set (50 percent of the system). Since user 2 is allowed 80 shares, user 2's process will consume its entire processor set (50 percent). Thus, user 1's second process will get no share of the CPU.

Examples

The examples in this section demonstrate Solaris Resource Manager functions used to control system resources and allocation, and to display information.

Server Consolidation

The first example illustrates these commands:

liminfo: Prints user attributes and limits information for one or more users to a terminal window
limadm: Changes limit attributes or deletes limits database entries for a list of users
srmadm: Displays or sets operation modes and system-wide Solaris Resource Manager tunable parameters
srmstat: Displays lnode activity information

Consider the case of consolidating two servers, each running a database application, onto a single machine. Simply running both applications on the single machine results in a working system. Without Solaris Resource Manager, the Solaris system allocates resources to the applications on an equal-use basis, and does not protect one application from competing demands by the other application. However, Solaris Resource Manager provides mechanisms that keep the applications from suffering resource starvation. With Solaris Resource Manager, this is accomplished by starting each database attached to lnodes referring to the databases, db1 and db2. To do this, three new administrative placeholder users must be created, for example, databases, db1, and db2. These are added to the limits database; since lnodes correspond to UNIX UIDs, these must also be added to the passwd file (or password map, if the system is using a name service such as NIS or NIS+). Assuming that the UIDs are added to the passwd file or password map, the placeholder users db1 and db2 are assigned to the databases lnode group with the commands:

# limadm set sgroup=0 databases
# limadm set sgroup=databases db1 db2

which assumes that /usr/srm/bin is in the user's path.

Figure 10-1 Server Consolidation

Diagram illustrates the consolidation of two servers, each running a database application, onto a single machine.

Because there are no other defined groups, the databases group currently has full use of the machine. Two lnodes associated with the databases are running, and the processes that run the database applications are attached to the appropriate lnodes with the srmuser command in the startup script for the database instances.

# srmuser db1 /usr/bin/database1/init.db1
# srmuser db2 /usr/bin/database2/init.db2

When either database, db1 or db2, is started up, use the srmuser command to ensure that the database is attached to the correct lnode and charged correctly (srmuser does not affect the ownership of the process to do this). To run the above command, a user must have the UNIX permissions required to run init.db1 and the administrative permission to attach processes to the lnode db1. As users log in and use the databases, activities performed by the databases are accrued to the lnodes db1 and db2.

By using the default allocation of one share to each lnode, the usage in the databases group will average out over time to ensure that the databases, db1 and db2, receive equal allocation of the machine. Specifically, there is one share outstanding-to the databases group-and databases owns it. Each of the lnodes db1 and db2 are also granted the default allocation of one share. Within the databases group, there are two shares outstanding, so db1 and db2 get equal allocation out of databases' resources (in this simple example, there are no competing allocations, so databases has access to the entire system).

If it turns out that activity on Database1 requires 60 percent of the machine's CPU capacity and Database2 requires 20 percent of the capacity, the administrator can specify that the system provide at least this much (assuming that the application demands it) by increasing the number of cpu.shares allocated to db1:

# limadm set cpu.shares=3 db1

There are now four shares outstanding in the databases group; db1 has three, and db2 has one. This change is effected immediately upon execution of the above command. There will be a period of settling when the lnode db1 (Database1) will actually receive more than its entitled 60 percent of the machine resource, as Solaris Resource Manager works to average the usage over the course of time. However, depending on the decay global parameter, this period will not last long.

To monitor this activity at any point, use the commands liminfo (see A Typical Application Server) and srmstat, in separate windows. Note that srmstat provides a regularly updating display. For additional information on srmstat, see srmstat(1SRM).

You now have a machine running with two database applications, one receiving 75 percent of the resource and the other receiving 25 percent. Remember that root is the top-level group header user. Processes running as root thus have access to the entire system, if they so request. Accordingly, additional lnodes should be created for running backups, daemons, and other scripts so that the root processes cannot possibly take over the whole machine, as they might if run in the traditional manner.

Adding a Computational Batch Application User

This example introduces the following command:

srmkill: Kills all the active processes attached to an lnode

The Finance department owns the database system, but Joe, a user from Engineering, has to run a computational batch job and would like to use Finance's machine during off hours when the system is generally idle. The Finance department dictates that Joe's job is less important than the databases, and agrees to run his work only if it will not interfere with the system's primary job. To enforce this policy, add a new group (batch) to the lnode database, and add Joe to the new batch group of the server's lnode hierarchy:

# limadm set cpu.shares=20 databases
# limadm set cpu.shares=1 batch
# limadm set cpu.shares=1 joe
# limadm set sgroup=batch joe

Figure 10-2 Adding a Computation Batch Application

Diagram shows addition of a new group called batch to the lnode database and server hierarchy, and addition of user Joe to the new batch group.

This command sequence changes the allocation of shares so that the databases group has 20 shares, while the batch group has just one. This specifies that members of the batch group (only Joe) will use at most 1/21 of the machine if the databases group is active. The databases group receives 20/21, or 95.2 percent, more than the 60% + 20% = 80% previously determined to be sufficient to handle the database work. If the databases are not requesting their full allocation, Joe will receive more than his 4.8 percent allocation. If the databases are completely inactive, Joe's allocation might reach 100 percent. When the number of outstanding shares allocated to databases is increased from 1 to 20, there is no need to make any changes to the allocation of shares for db1 and db2. Within the databases group, there are still four shares outstanding, allocated in the 3:1 ratio. Different levels of the scheduling tree are totally independent; what matters is the ratio of shares between peer groups.

Despite these assurances, the Finance department further wants to ensure that Joe is not even able to log in during prime daytime hours. This can be accomplished by putting some login controls on the batch group. Since the controls are sensitive to time of day, run a script that only permits the batch group to log in at specific times. For example, this could be implemented with crontab entries, such as:

0 6 * * * /usr/srm/bin/limadm set flag.nologin=set batch 
0 18 * * * /usr/srm/bin/limadm set flag.nologin=clear batch

At 6:00 a.m., batch does not have permission to log in, but at 18:00 (6 p.m.), the limitation is removed.

An even stricter policy can be implemented by adding another line to the crontab entry:

01 6 * * * /usr/srm/bin/srmkill joe

This uses the srmkill(1MSRM) command to kill any processes attached to the lnode Joe at 6:01 a.m. This will not be necessary if the only resources that the job requires are those controlled by Solaris Resource Manager. This action could be useful if Joe's job could reasonably tie up other resources that would interfere with normal work. An example would be a job that holds a key database lock or dominates an I/O channel.

Joe can now log in and run his job only at night. Because Joe (and the entire batch group) has significantly fewer shares than the other applications, his application will run with less than 5 percent of the machine. Similarly, nice(1) can be used to reduce the priority of processes attached to this job, so it runs at lower priority than other jobs running with equal Solaris Resource Manager shares.

At this point, the Finance department has ensured that its database applications have sufficient access to this system and will not interfere with each other's work. The department has also accommodated Joe's overnight batch processing loads, while ensuring that his work also will not interfere with the department's mission-critical processing.

Putting on a Web Front-end Process

Assume a decision has been made to put a web front-end on Database1, but limit this application to no more than 10 users at a time. Use the process limits function to do this.

First, create a new lnode called ws1. By starting the Webserver application under the ws1 lnode, you can control the number of processes that are available to it, and hence the number of active http sessions.

Figure 10-3 Adding a Web Front-end Process

Diagram shows adding a web front-end process under the db1 lnode.

Since Webserver is part of the Database1 application, you might want to give it a share of the db1 lnode and allow it to compete with Database1 for resources. Allocate 60 percent of compute resources to the Webserver and 40 percent to the Database1 application itself:

# limadm set cpu.shares=6 ws1
# limadm set sgroup=db1 ws1
# limadm set cpu.myshares=4 db1
# srmuser ws1 /etc/bin/Webserver1/init.webserver

The last line starts up the Webserver and charges the application to the ws1 lnode. Note that for Database1, the cpu.myshares have been allocated at 4. This sets the ratio of shares for which db1 will compete with its child process, Webserver, at a ratio of 4:6.

Note -

cpu.shares shows the ratio for resource allocation at the peer level in a hierarchy, while cpu.myshares shows the ratio for resource allocation at the parent:children level when the parent is actively running applications. Solaris Resource Manager allocates resources based on the ratio of outstanding shares of all active lnodes at their respective levels, where "respective level" includes the my.shares of the group parent and all children.

To control the number of processes that Webserver can run, put a process limit on the ws1 lnode. The example uses 20 since a Webserver query will typically spawn 2 processes, so this in fact limits the number of active Webserver queries to 10:

# limadm set process.limit=20 ws1

Another application has now been added to the scheduling tree, as a leaf node under an active lnode. To distribute the CPU resource between the active parent and child, use cpu.myshares to allocate some portion of the available resource to the parent and some to the child. Process limits are used to limit the number of active sessions on an lnode.

Adding More Users Who Have Special Memory Requirements

This example implements the resource control mechanisms CPU sharing, process limits, and login controls, and it addresses display tools for printing lnodes and showing active lnodes.

srmadm: Administers Solaris Resource Manager
limreport: Outputs information on selected users
limdaemon: Directs daemon to send messages when any limits are reached

Another user, Sally, has also asked to use the machine at night, for her application. Since her application is CPU-intensive, to ensure that Joe's application does not suffer, put a limit on Sally's usage of virtual memory, in terms of both her total usage and her "per-process" usage:

# limadm set memory.limit=50M sally
# limadm set memory.plimit=25M sally

Figure 10-4 Adding More Users

Diagram shows adding more users with specific memory limits.

If and when Sally's application tries to exceed either her total virtual memory limit or process memory limit, the limdaemon command will notify Sally and the system administrator, through the console, that the limit has been exceeded.

Use the limreport command to generate a report of who is on the system and their usages to date. A typical use of limreport is to see who is using the machine at any time and how they fit within the hierarchy of users:

% limreport 'flag.real' - uid sgroup lname cpu.shares cpu.usage |sort +1n +0n

Note -

limreport has several parameters. In this example, a check is made on "flag.real" (only looking for "real" lnodes/UIDs); the dash (-) is used to indicate that the default best guess for the output format should be used, and the list "uid sgroup lname cpu.shares cpu.usage" indicates limreport should output these five parameters for each lnode with flag.real set to TRUE. Output is piped to a UNIX primary sort on the second column and secondary sort on the first column to provide a simple report of who is using the server.

Anyone with the correct path and permissions can check on the status of Solaris Resource Manager at any time using the command srmadm show. This will output a formatted report of the current operation state of Solaris Resource Manager and its main configuration parameters. This is useful to verify that Solaris Resource Manager is active and all the controlling parameters are active. It also shows the values of global parameters such as the decay rate and location of the Solaris Resource Manager data store.

It is possible to run Solaris Resource Manager without limits active and without CPU scheduling active, which can be useful at startup for debugging and for initially configuring the Solaris Resource Manager product:

# srmadm set share=n:limits=n

Sharing a Machine Across Departments

A different development group would like to purchase an upgrade for this machine (more processors and memory) in exchange for gaining access to the system when it is idle. Both groups should benefit. To set this up, establish a new group called development at the same level as databases and batch. Allocate development 33 percent of the machine since they have added 50 percent more CPU power and memory to the original system.

Figure 10-5 Sharing a Machine, Step 1

Diagram illustrates sharing a machine. Context provided in surrounding text.

The Development group has hundreds of users. To avoid being involved in the distribution of that group's resources, use the administration flag capability of Solaris Resource Manager to enable the Development system administrator to allocate their resources. You set up limits at the operations and development level as agreed jointly and then you each do the work required to control your own portions of the machine.

To add the new level into the hierarchy, add the group operations as a new lnode, and change the parent group of batch and databases to operations:

# limadm set sgroup=operations batch databases

To set the administration flag:

# limadm set flag.admin=set operations development

Since under normal circumstances all servers have daemons and backup processes to be run, these should be added on a separate high-level lnode.

Note -

Do not use the root lnode, since it has no limits.

Figure 10-6 Sharing a Machine, Step 2

Diagram continues the example of sharing a machine. Context provided in surrounding paragraphs.

As seen in the examples, you can use Solaris Resource Manager to consolidate several different types of users and applications on the same machine. By the judicious use of CPU share controls, virtual memory limits, process limits, and login controls, you can ensure that these diverse applications receive only the resources that they need. The limits ensure that no application or user is going to adversely impact any other user's or group of users' application. The Solaris Resource Manager product supports simple reporting tools that show users and system administrators exactly what is happening at any given moment, and over the course of time. The report generation capability can be used to show the breakdown of resource usage across applications and groups for capacity planning and billing purposes.

A Typical Application Server

This output would be displayed from a liminfo listing of db1 at the end of the example in the previous section. Typing:

# liminfo db1

produces:

Figure 10-7 `liminfo` Listing

liminfo output for db1 as constructed in the previous Examples section.

The remainder of this section describes the liminfo output produced in Figure 9-7. Refer to liminfo(1SRM) and srm(5SRM) for more information on the fields described below.

The first two lines of output from the liminfo command relate to aspects of the lnode UID and its position in the lnode tree:

Login name

The login name and initial GID from the password map that corresponds to the UID of the attached lnode. Every lnode is associated with a system UID. A system account should be created for the UID of every lnode. In this instance, a placeholder UID, db1, is used for Database1.

Note that the default PAM configuration under Solaris Resource Manager creates an lnode for any user who logs in without one. By default, lnodes created by the superuser or by a user with the uselimadm flag set are created with the lnode srmother as their parent, or if that does not exist, with the root lnode as their parent. The parent of an lnode can be changed with the command generally used to revise lnode attributes, limadm.

Uid

The UID of the lnode attached to the current process. Normally, this will be the same as that of the real UID of the process (the logged in user), but in some circumstances (described later) it may differ.

Gid

The GID of the lnode attached to the current process.

R,Euid and R,Egid

The real and effective UID and GID of the current process. This is the same information that is provided by the standard system id(1M) command. It is not strictly related to Solaris Resource Manager, but it is displayed for convenience. These fields are not displayed if liminfo is displaying information on a user other than the default (that is, if a login name or UID was given as an argument).

Sgroup (uid) [sgroup]

The name and UID of the parent lnode in the lnode tree hierarchy. This will be blank for the root lnode. Many Solaris Resource Manager features depend on the position of an lnode within the tree hierarchy, so it is useful for a user to trace successive parent lnodes back to the root of the tree.

After the blank line, the next three lines of the liminfo display show fields that relate to CPU scheduling:

Shares [cpu.shares]: This is the number of shares of CPU entitlement allocated to this user. It is only directly comparable to other users with the same parent lnode, and to the Myshares value of the parent lnode itself. Administrators might normally set the shares of all users within a particular scheduling group to the same value (giving those users equal entitlements). This value will normally be something greater than 1, so that administrators have some leeway to decrease the shares of specific users when appropriate.
Myshares [cpu.myshares]: This value is only used if this user has child lnodes (that is, if there are other lnodes that have an sgroup value of this user) that are active (that is, have processes attached). When this is the case, this value gives the relative share of CPU for processes attached to this lnode, compared with those attached to its child lnodes.
Share: The calculated percentage of the system CPU resources to which the current user is entitled. As other users log in and log out (or lnodes become active or inactive), this value will change, because only active users are included in the calculation. Recent usage by the current user is not included in this calculation.
E-share: This is the effective share of this user (that is, the actual percentage of the system CPU resources that this user would be given in the short term if the user required it and all other active users were also demanding their share). It can be thought of as the current willingness of Solaris Resource Manager to allocate CPU resources to that lnode. This value will change over time as the user uses (or refrains from using) CPU resources. Lnodes that are active but idle (that is, with attached processes sleeping), and so have a low usage, will have a high effective share value. Correspondingly, the effective share can be very small for users with attached processes that are actively using the CPU.
Usage [cpu.usage]: The accumulated usage of system resources that are used to determine scheduling priority. Typically, this indicates recent CPU usage, though other parameters may also be taken into account. The parameter mix used can be viewed with the srmadm command. Each increment to this value decays exponentially over time so that Solaris Resource Manager will eventually "forget" about the resource usage. The rate of this decay is most easily represented by its half-life, which can be seen with the srmadm command.
Accrued usage [cpu.accrue]: This is the same resource accumulation measurement as cpu.usage, but it is never decayed. It is not used directly by Solaris Resource Manager but can be used by administration for accounting purposes. Unlike usage, this value represents the sum of the accrued usages for all lnodes within the group, as well as that of the current lnode.

After the second blank line, the next four lines of the liminfo listing show fields that relate to virtual memory and terminal usage:

Mem usage [memory.usage][memory.myusage]

This is the combined memory usage of all processes attached to this lnode.

If two values are displayed, separated by a frontslash (/) character, then this lnode is a group header. The first value is the usage for the whole scheduling group, while the second value is that of the current user only.

Mem limit [memory.limit]

The maximum memory usage allowed for all processes attached to this lnode and its members (if any). That is, the sum of the memory usage for all processes within the group plus those attached to the group header will not be allowed to exceed this value. Note that in this instance, a value of zero (0) indicates that there is no limit.

Proc mem limit [memory.plimit]

The per-process memory limit is the maximum memory usage allowed for any single process attached to this lnode and its members.

Mem accrue [memory.accrue]

The memory.accrue value is measured in byte-seconds and is an indication of overall memory resources used over a period of time.

Term usage [terminal.usage]

The number of seconds of connect-time currently charged to the group.

Term accrue [terminal.accrue]

The number of seconds of connect-time used by the group.

Term limit [terminal.limit]

The maximum allowed value of the terminal.usage attribute. If zero, there is no limit, unless limited by inheritance.

After the third blank line, the next two lines of the liminfo listing display fields that relate to the user and processes:

Processes [process.usage][process.myusage]

The number of processes attached to this lnode. Note that this refers to processes, not a count of threads within a process.

If two values are displayed, separated by a frontslash (/) character, then this lnode is a group header and the first value is the usage for the whole scheduling group, while the second value is that of just the current user.

Process limit [process.limit]

The maximum total number of processes allowed to be attached to this lnode and its members.

Current logins [logins]

The current number of simultaneous Solaris Resource Manager login sessions for this user. When a user logs in through any of the standard system login mechanisms (including login(1), rlogin(1)-basically, anything that uses PAM for authentication and creates a utmp(4) entry), this counter is incremented. When the session ends, the count is decremented.

If a user's flag.onelogin flag evaluates to set, the user is only permitted to have a single Solaris Resource Manager login session.

After the fourth blank line, the next four lines of the liminfo display show these fields:

Last used [lastused]: This field shows the last time the lnode was active. This will normally be the last time the user logged out.
Directory: The user's home directory (items from the password map rather than from Solaris Resource Manager are shown for convenience).
Name: The db1 (finger) information, which is usually the user's name (items from the password map rather than from Solaris Resource Manager are shown for convenience).
Shell: The user's initial login shell (items from the password map rather than from Solaris Resource Manager are shown for convenience).

After the fifth blank line, the last line of the liminfo display shows this field:

Flags: Flags that evaluate to set or group in the lnode are displayed here. Each flag displayed is followed by suffix characters indicating the value and the way in which the flag was set (for example, whether it was explicitly from this lnode (+) or inherited (^)).

Configuring Solaris Resource Manager in Sun Cluster 3.0 Update Environments

Valid Topologies

You can install Solaris Resource Manager on any valid Sun Cluster 3.0 Update topology. See Sun Cluster 3.0 12/01 Concepts for descriptions of valid topologies.

Determining Requirements

Before you configure the Solaris Resource Manager product in a Sun Cluster environment, you must decide how you want to control and track resources across switchovers or failovers. If you configure all cluster nodes identically, usage limits will be enforced identically on primary and backup nodes.

While the configuration parameters need not be identical for all applications in the configuration files on all nodes, all applications must at least be represented in the configuration files on all potential masters of that application. For example, if Application 1 is mastered by phys-schost-1 but could potentially be switched or failed-over to phys-schost-2 or phys-schost-3, then Application 1 must be included in the configuration files on all three nodes (phys-schost-1, phys-schost-2, and phys-schost-3).

Solaris Resource Manager is very flexible with regard to configuration of usage and accrual parameters, and few restrictions are imposed by Sun Cluster. Configuration choices depend on the needs of the site. Consider the general guidelines in the following sections before configuring your systems.

Configuring Memory Limits Parameters

When using the Solaris Resource Manager product with Sun Cluster, you should configure memory limits appropriately to prevent unnecessary failover of applications and a ping-pong effect of applications. In general:

Do not set memory limits too low.

When an application reaches its memory limit, it might fail over. This is especially important for database applications, when reaching a virtual memory limit can have unexpected consequences.
Do not set memory limits identically on primary and backup nodes.

Identical limits can cause a ping-pong effect when an application hits its memory limit and fails over to a backup node with an identical memory limit. Set the memory limit slightly higher on the backup node. The applications, resources, and preferences at the site determine how much higher the limit is set. The difference in memory limits helps prevent the ping-pong scenario and gives you a period of time in which to adjust the parameters as necessary.
Do use the Solaris Resource Manager memory limits for coarse-grained problem scenario load-balancing.

For example, you can use memory limits to prevent an errant application from consuming excess resources.

Using Accrued Usage Parameters

Several Solaris Resource Manager parameters are used for keeping track of system resource usage accrual: CPU shares, number of logins, and connect-time. However, in the case of switchover or failover, usage accrual data (CPU usage, number of logins, and connect-time) will restart at zero by default on the new master for all applications that were switched or failed over. Accrual data is not transferred dynamically across nodes.

To avoid invalidating the accuracy of the Solaris Resource Manager usage accrual reporting feature, you can create scripts to gather accrual information from the cluster nodes. Because an application might run on any of its potential masters during an accrual period, the scripts should gather accrual information from all possible masters of a given application. For more information, see Chapter 9, Usage Data.

Failover Scenarios

On Sun Cluster, Solaris Resource Manager can be configured so that the resource allocation configuration described in the lnode configuration (/var/srm/srmDB) remains the same in normal cluster operation and in switchover or failover situations. For more information, see Sample Share Allocation.

The following sections are example scenarios.

The first two sections, Two-Node Cluster With Two Applications and Two-Node Cluster With Three Applications, show failover scenarios for entire nodes.
The section Failover of Resource Group Only illustrates failover operation for an application only.

In a cluster environment, an application is configured as part of a resource group (RG). When a failure occurs, the resource group, along with its associated applications, fails over to another node. In the following examples, Application 1 (App-1) is configured in resource group RG-1, Application 2 (App-2) is configured in resource group RG-2, and Application 3 (App-3) is configured in resource group RG-3.

Although the numbers of assigned shares remain the same, the percentage of CPU resources allocated to each application will change after failover, depending on the number of applications running on the node and the number of shares assigned to each active application.

In these scenarios, assume the following configurations.

All applications are configured under a common parent lnode.
The applications are the only active processes on the nodes.
The limits databases are configured the same on each node of the cluster.

Two-Node Cluster With Two Applications

You can configure two applications on a two-node cluster such that each physical host (phys-schost-1, phys-schost-2) acts as the default master for one application. Each physical host acts as the backup node for the other physical host. All applications must be represented in the Solaris Resource Manager limits database files on both nodes. When the cluster is running normally, each application is running on its default master, where it is allocated all CPU resources by Solaris Resource Manager.

After a failover or switchover occurs, both applications run on a single node where they are allocated shares as specified in the configuration file. For example, this configuration file specifies that Application 1 is allocated 80 shares and Application 2 is allocated 20 shares.

# limadm set cpu.shares=80 App-1 
# limadm set cpu.shares=20 App-2 
...

The following diagram illustrates the normal and failover operations of this configuration. Note that although the number of shares assigned does not change, the percentage of CPU resources available to each application can change, depending on the number of shares assigned to each process demanding CPU time.

The preceding context describes the graphic.

Two-Node Cluster With Three Applications

On a two-node cluster with three applications, you can configure it such that one physical host (phys-schost-1) is the default master of one application and the second physical host (phys-schost-2) is the default master for the remaining two applications. Assume the following example limits database file on every node. The limits database file does not change when a failover or switchover occurs.

# limadm set cpu.shares=50	 App-1
# limadm set cpu.shares=30	 App-2
# limadm set cpu.shares=20	 App-3 
...

When the cluster is running normally, Application 1 is allocated 50 shares on its default master, phys-schost-1. This is equivalent to 100 percent of CPU resources because it is the only application demanding CPU resources on that node. Applications 2 and 3 are allocated 30 and 20 shares, respectively, on their default master, phys-schost-2. Application 2 would receive 60 percent and Application 3 would receive 40 percent of CPU resources during normal operation.

If a failover or switchover occurs and Application 1 is switched over to phys-schost-2, the shares for all three applications remain the same, but the percentages of CPU resources are re-allocated according to the limits database file.

Application 1, with 50 shares, receives 50 percent of CPU.
Application 2, with 30 shares, receives 30 percent of CPU.
Application 3, with 20 shares, receives 20 percent of CPU.

The following diagram illustrates the normal and failover operations of this configuration.

Failover of Resource Group Only

In a configuration in which multiple resource groups have the same default master, it is possible for a resource group (and its associated applications) to fail over or be switched over to a backup node, while the default master remains up and running in the cluster.

Note -

During failover, the application that fails over will be allocated resources as specified in the configuration file on the backup node. In this example, the limits database files on the primary and backup nodes have the same configurations.

For example, this sample configuration file specifies that Application 1 is allocated 30 shares, Application 2 is allocated 60 shares, and Application 3 is allocated 60 shares.

# limadm set cpu.shares=30 App-1
# limadm set cpu.shares=60 App-2
# limadm set cpu.shares=60 App-3
...

The following diagram illustrates the normal and failover operations of this configuration, where RG-2, containing Application 2, fails over to phys-schost-2. Note that although the number of shares assigned does not change, the percentage of CPU resources available to each application can change, depending on the number of shares assigned to each application demanding CPU time.

Chapter 10 Advanced Usage

Batch Workloads

Resources Used by Batch Workloads

Problems Associated With Batch Processing

Consolidation

Virtual Memory and Databases

Managing NFS

Managing Web Servers

Resource Management of a Consolidated Web Server

Finer-Grained Resource Management of a Single Web Server

Resource Management of Multiple Virtual Web Servers

The Role and Effect of Processor Sets

A Simple Example

A More Complex Example

A Scenario to Avoid

Examples

Server Consolidation

Figure 10-1 Server Consolidation

Adding a Computational Batch Application User

Figure 10-2 Adding a Computation Batch Application

Putting on a Web Front-end Process

Figure 10-3 Adding a Web Front-end Process

Adding More Users Who Have Special Memory Requirements

Figure 10-4 Adding More Users

Sharing a Machine Across Departments

Figure 10-5 Sharing a Machine, Step 1

Figure 10-6 Sharing a Machine, Step 2

A Typical Application Server

Figure 10-7 liminfo Listing

Configuring Solaris Resource Manager in Sun Cluster 3.0 Update Environments

Valid Topologies

Determining Requirements

Configuring Memory Limits Parameters

Using Accrued Usage Parameters

Failover Scenarios

Two-Node Cluster With Two Applications

Two-Node Cluster With Three Applications

Failover of Resource Group Only

Figure 10-7 `liminfo` Listing