Sun Management Center 3.5 User's Guide

Appendix D Sun Management Center Software Rules

This appendix lists the Sun Management Center rules for the following modules:
- Kernel Reader
- Health Monitor

Rules Concepts

A rule is an alarm check mechanism that allows for complex or special purpose logic in determining the status of a monitored host or node.

There are two types of rules:

- Simple rules are based on the rCompare rule, which compares a monitored property to a value specified in the rule. If the rule condition becomes true, an alarm is generated. For example, a simple rule can check the percentage of disk space used. If the percentage of disk space used is greater than or equal to the percentage specified in the rule, an alarm is generated.
- Complex rules are based on multiple conditions. For example, one complex rule generates an alert alarm when all of the following conditions are met:
  - The disk is over 75% busy
  - The average queue length is over 10
  - The wait queue is increasing
Note –
Any user-customized Solstice SyMON™ 1.x rules must be ported to the Sun Management Center environment before the rules can be used in Sun Management Center software.
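
The difference between the two rule types can be illustrated with a short Python sketch. This is only an illustration of the logic described above, not the Sun Management Center rule syntax or the rCompare implementation. The names `simple_rule`, `DiskSample`, and `complex_rule` are hypothetical, and the sketch assumes that "the wait queue is increasing" means the current sample is larger than the previous one.

```python
from dataclasses import dataclass


def simple_rule(value: float, threshold: float) -> bool:
    """Simple rule: alarm when the monitored value meets or exceeds the threshold."""
    return value >= threshold


@dataclass
class DiskSample:
    """One hypothetical sample of the disk properties used by the complex rule."""
    busy_pct: float        # percentage of time the disk was busy
    avg_queue_len: float   # average queue length
    wait_queue_len: float  # wait queue length at sample time


def complex_rule(prev: DiskSample, cur: DiskSample) -> bool:
    """Complex rule: alarm only when all three conditions hold together."""
    return (
        cur.busy_pct > 75                             # disk is over 75% busy
        and cur.avg_queue_len > 10                    # average queue length is over 10
        and cur.wait_queue_len > prev.wait_queue_len  # wait queue is increasing (assumed meaning)
    )
```

In this sketch, `simple_rule(92.0, 90.0)` reports an alarm because a single property crosses its threshold, whereas `complex_rule` reports an alarm only if every condition in the list is true for the same pair of samples.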

Kernel Reader

The following table lists the Kernel Reader simple rules.

Table D–1 Kernel Reader Simple Rules


Property	Description
avg_1min	Load average over the last minute
avg_5min	Load average over the last 5 minutes
avg_15min	Load average over the last 15 minutes
cpu_delta	Difference between the previous and current time
cpu_idle	CPU idle time
cpu_kernel	CPU kernel time
cpu_user	CPU user time
cpu_wait	CPU wait time
ipctused	Percent of inodes used
kpctused	Percent of Kbytes used
mem-inuse	Physical memory in use (Mbytes)
numusers	Number of users
numsessions	Number of user sessions
swap_used	Swap used (Kbytes)
wait_io	CPU wait time breakdown (I/O wait)
wait_pio	CPU wait time breakdown (physical I/O wait)
wait_swap	CPU wait time breakdown (swap wait)

The following table lists the Kernel Reader complex rules.

Table D–2 Kernel Reader Complex Rules


Rule ID	Description	Type of Alarm
rknrd100	This rule covers a transitory event. The rule generates an alert alarm when the disk is over 75% busy, the average queue length is over 10, and the wait queue is increasing. The alert alarm remains until the disk is less than 70% busy and the average queue length is less than 8.	Alert
rknrd102	This rule covers a transitory event. The rule generates an alert alarm if 90% of swap space is in use. The event causing the alarm remains until swap space in use is less than 80% of the total swap space.	Alert
rknrd103	This rule covers a transitory event. The rule generates an alert alarm if swapping and paging are high for a given CPU. This behavior indicates that the CPU may be thrashing. An alert alarm is generated when the CPU exceeds 1 swap-out, 10 page-ins, and 10 page-outs per second. The alert alarm stays on while the CPU exceeds 1 swap-out, 8 page-ins, and 8 page-outs per second.	Alert
rknrd105	File system full error. This rule looks for a file system full error message in the syslog (`/var/adm/messages`).	Alert alarm that is closed immediately
rknrd106	No swap space error. This rule looks for a no swap space error message in the syslog (`/var/adm/messages`).	Alert alarm that is closed immediately
rknrd400	This rule checks for a continuous CPU load over six per CPU for four hours.	Informational
rknrd401	This rule checks for disks that are busy more than 90% of the time for `x` hours. The parameters field holds the last time CPU load was below six, and is initialized to some date in the year 2001.	Informational
rknrd402	This rule checks if available swap space drops below 10% for `x` hours. The parameters field indicates the last time that the CPU load was below six. This field is initialized to some date in the year 2001.	Informational
rknrd403	This rule is not currently supported.	Informational
rknrd404	An informational alarm is generated if rule rknrd401 is triggered four times.	Informational
rknrd405	An informational alarm is generated if rule rknrd402 is triggered four times.	Informational
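
Several of the rules in the preceding table, such as rknrd100 and rknrd102, open an alarm at one threshold and clear the alarm only at a lower threshold. This keeps the alarm from toggling when a value hovers near a single limit. The following Python sketch shows that pattern using the rknrd102 percentages (open at 90% of swap in use, clear below 80%). The class name `HysteresisAlarm` and its interface are hypothetical, and the sketch assumes that "90% in use" means greater than or equal to 90%. It is not how the Kernel Reader module is implemented.

```python
class HysteresisAlarm:
    """Alarm that opens at one threshold and clears at a lower one."""

    def __init__(self, open_at: float, clear_below: float) -> None:
        if clear_below > open_at:
            raise ValueError("clear_below must not exceed open_at")
        self.open_at = open_at
        self.clear_below = clear_below
        self.active = False

    def update(self, value: float) -> bool:
        """Feed one sample and return whether the alarm is currently open."""
        if not self.active and value >= self.open_at:
            self.active = True    # condition reached: open the alarm
        elif self.active and value < self.clear_below:
            self.active = False   # value fell below the lower limit: clear it
        return self.active


# Swap usage samples, in percent of total swap space.
swap_alarm = HysteresisAlarm(open_at=90, clear_below=80)
states = [swap_alarm.update(pct) for pct in (85, 91, 85, 79, 85)]
```

With these samples the alarm opens at 91, stays open at 85 because usage has not yet dropped below 80, and clears at 79.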

Health Monitor

The following table lists the Health Monitor complex rules.

Table D–3 Health Monitor Complex Rules


Rule ID	Description	Type of Alarm
rhltm000	This rule checks whether there is enough swap space.	Critical, Alert, Caution
rhltm001	CPU power is wasted each time a CPU has to wait for a lock to become free. This event is counted because the kernel uses mutually exclusive locks to synchronize its operation and to keep multiple CPUs from concurrently accessing critical code and data regions.	Critical, Alert, Caution
rhltm002	NFS remote procedure call timeouts may be associated with duplicate responses after the call is retransmitted. These timeouts indicate that the network is okay but the server is responding slowly.	Critical, Alert, Caution
rhltm003	The run queue length is divided by the number of CPUs because every CPU takes a job off the run queue in each time slice.	Critical, Alert, Caution
rhltm004	A busy disk or a slow disk reduces system throughput and increases user response times. This rule identifies the disks that are loaded so that the load can be rebalanced.	Critical, Alert, Caution
rhltm005	RAM rule based on residency time for an unreferenced page. The virtual memory system indicates that the system needs more memory when the system scans to look for idle pages to reclaim for other uses.	Critical, Alert, Caution
rhltm006	This rule refers to the problem with kernel memory allocation that occurs when login attempts or network connections fail unexpectedly. There are two possible causes: either the kernel has reached the extent of its address space, or the free list does not contain any pages to allocate. The repeated failures signify a problem that might otherwise be overlooked.	Critical, Alert, Caution
rhltm007	A global cache of directory path name components exists. This cache is called the directory name lookup cache (DNLC). If this cache does not exist, directory entries must be read from disk and be scanned to locate the right file.	Critical, Alert, Caution
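
The normalization described for rhltm003, together with the three alarm levels listed for the Health Monitor rules, can be sketched in Python as follows. The function names and the threshold values (2, 4, and 8 jobs per CPU) are assumptions chosen only for illustration. The source does not state the actual thresholds that the Health Monitor module uses.

```python
from typing import Optional


def run_queue_per_cpu(run_queue_len: float, num_cpus: int) -> float:
    """Divide the run queue length by the number of CPUs, as rhltm003 describes."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    return run_queue_len / num_cpus


def classify(per_cpu: float,
             caution: float = 2.0,    # hypothetical threshold
             alert: float = 4.0,      # hypothetical threshold
             critical: float = 8.0    # hypothetical threshold
             ) -> Optional[str]:
    """Map a per-CPU run queue length to an alarm level, or None for no alarm."""
    if per_cpu >= critical:
        return "Critical"
    if per_cpu >= alert:
        return "Alert"
    if per_cpu >= caution:
        return "Caution"
    return None
```

For example, a run queue of 12 jobs on a 4-CPU host gives 3 jobs per CPU, which this sketch classifies as Caution under the assumed thresholds. The same queue on a single-CPU host would be classified as Critical.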