Appendix E

Sun Management Center Software Rules

This appendix lists the Sun Management Center rules for the following modules:

A rule is an alarm check mechanism that allows for complex or special purpose logic in determining the status of a monitored host or node.

There are two types of rules--simple and complex:

Simple rules are based on the rCompare rule, in which monitored properties are compared to the rule. If the rule condition becomes true, an alarm is generated. For example, a simple rule can be the percentage of disk space used. If the percentage of disk space used equals or is greater than the percentage specified in the rule, then an alarm is generated.
Complex rules are based on multiple conditions becoming true. For example, one complex rule states that when a disk is over 75% busy and the average queue length is over 10 and the wait queue is increasing, then an alert alarm is generated.
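
The difference between the two rule types can be illustrated with a minimal Python sketch. This is not Sun Management Center rule syntax or code. The function names, parameter names, and the 90% disk-space threshold are hypothetical. Only the 75% busy and queue-length-over-10 figures are taken from the example above.

```python
def simple_rule(value, threshold):
    """Simple rule: alarm when the monitored value equals or exceeds the threshold."""
    return value >= threshold


def complex_rule(busy_pct, avg_queue_len, wait_queue_now, wait_queue_prev):
    """Complex rule: alarm only when every condition is true at the same time."""
    return (
        busy_pct > 75
        and avg_queue_len > 10
        and wait_queue_now > wait_queue_prev  # the wait queue is increasing
    )


# Hypothetical readings; the 90% disk-space threshold is illustrative only.
disk_alarm = simple_rule(value=92.0, threshold=90.0)
busy_disk_alarm = complex_rule(80.0, 12.0, wait_queue_now=5, wait_queue_prev=3)
print(disk_alarm, busy_disk_alarm)
```

In this sketch a simple rule needs only one property and one threshold, whereas the complex rule also needs the previous sample to decide whether the wait queue is growing.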

Note - Any user-customized Solstice SyMON 1.x rules must be ported to the new environment before the rules can be used in Sun Management Center software.

Kernel Reader

The following table lists the Kernel Reader simple rules.

TABLE E-1 Kernel Reader Simple Rules
Property	Description
avg_1min	Load Average Over The Last 1 Minute
avg_5min	Load Average Over The Last 5 Minutes
avg_15min	Load Average Over The Last 15 Minutes
cpu_delta	Difference between the previous and current time
cpu_idle	CPU idle time
cpu_kernel	CPU kernel time
cpu_user	CPU user time
cpu_wait	CPU wait time
ipctused	Percent of inodes used
kpctused	Percent of Kbytes used
mem-inuse	Physical Memory In Use (MBytes)
numusers	Number Of Users
numsessions	Number Of User Sessions
swap_used	Swap Used (Kbytes)
wait_io	CPU wait time breakdown
wait_pio	CPU wait time breakdown
wait_swap	CPU wait time breakdown

The following table lists the Kernel Reader complex rules.

TABLE E-2 Kernel Reader Complex Rules
Rule ID	Description	Type of Alarm
rknrd100	This rule covers a transitory event and generates an alert alarm when the disk is over 75% busy, the average queue length is over 10, and the wait queue is increasing. The alert alarm stays on until the disk is no more than 70% busy and the average queue length is no more than 8.	Alert
rknrd102	This rule covers a transitory event and generates an alert alarm if 90% of swap space is in use. The event causing the alarm stays open until the swap space in use is less than 80% of the total swap space.	Alert
rknrd103	This rule covers a transitory event and generates an alert alarm if swapping and paging are high for a given CPU, which indicates that the CPU may be thrashing. The alert alarm is generated when the CPU exceeds 1 swap-out, 10 page-ins, and 10 page-outs per second. The alert alarm stays on while the CPU exceeds 1 swap-out, 8 page-ins, and 8 page-outs per second.	Alert
rknrd105	File System Full error. This rule looks for a file system full error message in the syslog (/var/adm/messages).	Alert alarm that is closed immediately
rknrd106	No swap space error. This rule looks for a no swap space error message in the syslog (/var/adm/messages).	Alert alarm that is closed immediately
rknrd400	This rule checks for a continuous CPU load over 6 per CPU for four hours.	Informational
rknrd401	This rule checks for disks being busy more than 90% of the time for x hours. The parameters field holds the last time CPU load was below 6, and is initialized to some date in the year 2001.	Informational
rknrd402	This rule checks if available swap space drops below 10% for x hours. The parameters field holds the last time CPU load was below 6, and is initialized to some date in the year 2001.	Informational
rknrd403	This rule is not currently supported.	Informational
rknrd404	An informational alarm is generated if the rule rknrd401 is triggered 4 times.	Informational
rknrd405	An informational alarm is generated if the rule rknrd402 is triggered 4 times.	Informational
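
Two patterns recur in the Kernel Reader complex rules: an alarm that opens at one threshold and closes at a lower one (as in rknrd102), and a condition that must hold continuously for a set time (as in rknrd400). The following Python sketch shows one way such logic could be written. It is an illustration only, not the Sun Management Center implementation. The class names, method names, and sample values are hypothetical, and the thresholds are copied from the rule descriptions.

```python
class HysteresisAlarm:
    """Opens at or above open_at and closes only below close_below."""

    def __init__(self, open_at, close_below):
        self.open_at = open_at
        self.close_below = close_below
        self.is_open = False

    def update(self, value):
        """Feed one sample and return whether the alarm is open afterwards."""
        if not self.is_open and value >= self.open_at:
            self.is_open = True
        elif self.is_open and value < self.close_below:
            self.is_open = False
        return self.is_open


class SustainedLoadRule:
    """Fires once load per CPU has stayed above the limit for a full window."""

    def __init__(self, limit_per_cpu=6.0, window_s=4 * 3600, start_s=0.0):
        self.limit_per_cpu = limit_per_cpu
        self.window_s = window_s
        # Last time the load per CPU was at or below the limit.
        self.last_below_s = start_s

    def update(self, now_s, load, n_cpus):
        """Feed one timestamped sample and return whether the rule fires."""
        if load / n_cpus <= self.limit_per_cpu:
            self.last_below_s = now_s
            return False
        return now_s - self.last_below_s >= self.window_s


# rknrd102-style thresholds: open at 90% swap in use, close below 80%.
swap_alarm = HysteresisAlarm(open_at=90.0, close_below=80.0)
samples = [85.0, 91.0, 85.0, 79.0]  # hypothetical swap-in-use percentages
states = [swap_alarm.update(s) for s in samples]
print(states)
```

The gap between the opening and closing thresholds keeps the alarm from flapping when a value hovers near a single limit. Storing the last time the value was below the limit is enough to test for a sustained condition without keeping a history of samples.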

Health Monitor

The following table lists the Health Monitor complex rules.

TABLE E-3 Health Monitor Complex Rules
Rule ID	Description	Type of Alarm
rhltm000	This rule checks whether there is enough swap space.	Critical, Alert, Caution
rhltm001	The kernel uses mutually exclusive locks to synchronize its operation and keep multiple CPUs from concurrently accessing critical code and data regions. Each time a CPU has to wait for a lock to become free, it wastes CPU power, and this rule counts those events.	Critical, Alert, Caution
rhltm002	This rule is based on the observation that NFS remote procedure call timeouts may be associated with duplicate responses after the call is retransmitted. This indicates that the network is okay but the server is responding slowly.	Critical, Alert, Caution
rhltm003	This rule divides the run queue length by the number of CPUs, because every CPU takes a job off the run queue in each time slice.	Critical, Alert, Caution
rhltm004	A busy or slow disk reduces system throughput and increases user response times. This rule identifies the disks that are heavily loaded so that the load can be rebalanced.	Critical, Alert, Caution
rhltm005	This RAM rule is based on the residency time of an unreferenced page. The virtual memory system indicates that it needs more memory when it scans for idle pages to reclaim for other uses.	Critical, Alert, Caution
rhltm006	This rule covers kernel memory allocation problems, which show up when login attempts or network connections fail unexpectedly. There are two possible causes: either the kernel has reached the extent of its address space, or the free list does not contain any pages to allocate. The rule flags a problem that may otherwise be overlooked.	Critical, Alert, Caution
rhltm007	This is the Directory Name Lookup Cache (DNLC) rule. The DNLC is a global cache of directory path name components. A cache miss means that directory entries must be read from disk and scanned to locate the right file.	Critical, Alert, Caution
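
Several Health Monitor rules reduce a raw measurement to a single figure and then grade it as Caution, Alert, or Critical. The Python sketch below illustrates the idea for the run queue (rhltm003) and the DNLC (rhltm007). It is an assumption-laden illustration, not the product's logic: the function names are hypothetical, the hit-rate formula is the usual hits divided by total lookups, and the threshold values 2, 4, and 8 are placeholders, because this appendix does not state the actual limits.

```python
def run_queue_per_cpu(run_queue_len, n_cpus):
    """Normalize the run queue length by the number of CPUs."""
    return run_queue_len / n_cpus


def dnlc_hit_rate(hits, misses):
    """Return the cache hit rate as a percentage, or None with no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total


def severity(value, caution, alert, critical):
    """Return the highest severity whose threshold the value reaches."""
    if value >= critical:
        return "Critical"
    if value >= alert:
        return "Alert"
    if value >= caution:
        return "Caution"
    return None


# Hypothetical sample: 12 runnable jobs on a 4-CPU host.
per_cpu = run_queue_per_cpu(run_queue_len=12, n_cpus=4)
# Placeholder thresholds; the real limits are not given in this appendix.
level = severity(per_cpu, caution=2.0, alert=4.0, critical=8.0)
print(per_cpu, level)
```

Dividing by the CPU count lets the same thresholds apply to hosts of different sizes, which is the reasoning the rhltm003 description gives.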
