Appendix D Sun Management Center Software Rules
This appendix lists the Sun Management Center rules for the following modules:
Rules Concepts
A rule is an alarm check mechanism that allows for complex or special purpose
logic in determining the status of a monitored host or node.
There are two types of rules:
- Simple rules are based on the rCompare rule, in which a monitored property is compared to the value specified in the rule. If the rule condition becomes true, an alarm is generated. For example, a simple rule can check the percentage of disk space used. If the percentage of disk space used is greater than or equal to the percentage specified in the rule, then an alarm is generated.
- Complex rules are based on multiple conditions. For example, one complex rule states that an alert alarm is generated when all of the following conditions are met:
  - The disk is over 75% busy
  - The average queue length is over 10
  - The wait queue is increasing
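The two rule types described above can be sketched in Python as follows. This is only an illustration of the logic, not the Sun Management Center rule syntax. The names `simple_rule`, `DiskSample`, and `complex_rule` are hypothetical and do not come from the product.

```python
from dataclasses import dataclass


def simple_rule(percent_used: float, threshold: float) -> bool:
    """Simple rule: alarm when the monitored value reaches the rule's value."""
    return percent_used >= threshold


@dataclass
class DiskSample:
    """One observation of a disk (hypothetical structure for this sketch)."""
    busy_percent: float
    avg_queue_length: float
    wait_queue: float


def complex_rule(previous: DiskSample, current: DiskSample) -> bool:
    """Complex rule: alarm only when all three conditions hold at once."""
    return (
        current.busy_percent > 75
        and current.avg_queue_length > 10
        # "Increasing" is assumed here to mean larger than the previous sample.
        and current.wait_queue > previous.wait_queue
    )
```

The complex rule needs two samples because "the wait queue is increasing" cannot be decided from a single reading. How the product measures the increase is not stated in this appendix.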
Note – Any user-customized Solstice SyMON™ 1.x rules must be ported to the Sun Management Center environment before the rules can be used in Sun Management Center software.
Kernel Reader
The following table lists the Kernel Reader simple rules.
Table D–1 Kernel Reader Simple Rules
| Property | Description |
| --- | --- |
| avg_1min | Load average over the last minute |
| avg_5min | Load average over the last 5 minutes |
| avg_15min | Load average over the last 15 minutes |
| cpu_delta | Difference between the previous and current time |
| cpu_idle | CPU idle time |
| cpu_kernel | CPU kernel time |
| cpu_user | CPU user time |
| cpu_wait | CPU wait time |
| ipctused | Percent of inodes used |
| kpctused | Percent of Kbytes used |
| mem-inuse | Physical memory in use (Mbytes) |
| numusers | Number of users |
| numsessions | Number of user sessions |
| swap_used | Swap used (Kbytes) |
| wait_io | CPU wait time breakdown |
| wait_pio | CPU wait time breakdown |
| wait_swap | CPU wait time breakdown |
The following table lists the Kernel Reader complex rules.
Table D–2 Kernel Reader Complex Rules
| Rule ID | Description | Type of Alarm |
| --- | --- | --- |
| rknrd100 | This rule covers a transitory event. The rule generates an alert alarm when the disk is over 75% busy, the average queue length is over 10, and the wait queue is increasing. The alert alarm remains until the disk is less than 70% busy and the average queue length is less than 8. | Alert |
| rknrd102 | This rule covers a transitory event. The rule generates an alert alarm if 90% of swap space is in use. The event causing the alarm remains until swap space in use is less than 80% of the total swap space. | Alert |
| rknrd103 | This rule covers a transitory event. The rule generates an alert alarm if swapping and paging are high for a given CPU. This behavior indicates that a CPU might be thrashing. An alert alarm is generated when a CPU exceeds 1 swap-out, 10 page-ins, and 10 page-outs per second. The alert alarm stays on while the CPU exceeds 1 swap-out, 8 page-ins, and 8 page-outs per second. | Alert |
| rknrd105 | File System Full error. This rule looks for a file system full error message in the syslog (/var/adm/messages). | Alert alarm that is closed immediately |
| rknrd106 | No swap space error. This rule looks for a no swap space error message in the syslog (/var/adm/messages). | Alert alarm that is closed immediately |
| rknrd400 | This rule checks for a continuous CPU load over six per CPU for four hours. | Informational |
| rknrd401 | This rule checks for disks that are busy more than 90% of the time for x hours. The parameters field holds the last time CPU load was below six, and is initialized to some date in the year 2001. | Informational |
| rknrd402 | This rule checks if available swap space drops below 10% for x hours. The parameters field indicates the last time that the CPU load was below six. This field is initialized to some date in the year 2001. | Informational |
| rknrd403 | This rule is not currently supported. | Informational |
| rknrd404 | An informational alarm is generated if rule rknrd401 is triggered 4 times. | Informational |
| rknrd405 | An informational alarm is generated if rule rknrd402 is triggered 4 times. | Informational |
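Several of the Kernel Reader complex rules raise an alarm at one threshold and clear it at a lower one, so that a value hovering near the limit does not make the alarm flap. The sketch below illustrates this pattern using the rknrd102 swap thresholds (raise at 90%, clear below 80%). The class name `SwapAlarm` and its method are hypothetical, and treating "90% in use" as "greater than or equal to 90" is an assumption. This is not how the product implements the rule.

```python
class SwapAlarm:
    """Two-threshold (hysteresis) alarm modeled on the rknrd102 description."""

    def __init__(self, raise_at: float = 90.0, clear_below: float = 80.0):
        self.raise_at = raise_at        # percent of swap in use that opens the alarm
        self.clear_below = clear_below  # alarm closes once usage drops under this
        self.active = False

    def update(self, swap_used_percent: float) -> bool:
        """Feed one sample and return whether the alarm is active afterwards."""
        if not self.active and swap_used_percent >= self.raise_at:
            self.active = True
        elif self.active and swap_used_percent < self.clear_below:
            self.active = False
        return self.active


alarm = SwapAlarm()
states = [alarm.update(sample) for sample in (50, 91, 85, 79, 85)]
```

In this run the alarm opens at 91%, stays open at 85% because usage is still at or above the 80% clear level, closes at 79%, and stays closed at 85% because the 90% raise level has not been reached again.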
Health Monitor
The following table lists the Health Monitor complex rules.
Table D–3 Health Monitor Complex Rules
| Rule ID | Description | Type of Alarm |
| --- | --- | --- |
| rhltm000 | This rule checks whether there is enough swap space. | Critical, Alert, Caution |
| rhltm001 | CPU power is wasted each time a CPU has to wait for a lock to become free. This event is counted because the kernel uses mutually exclusive locks to synchronize its operation and to keep multiple CPUs from concurrently accessing critical code and data regions. | Critical, Alert, Caution |
| rhltm002 | NFS remote procedure call timeouts may be associated with duplicate responses after the call is retransmitted. These timeouts indicate that the network is okay but the server is responding slowly. | Critical, Alert, Caution |
| rhltm003 | The run queue length is divided by the number of CPUs because every CPU takes a job off the run queue in each time slice. | Critical, Alert, Caution |
| rhltm004 | A busy disk or a slow disk reduces system throughput and increases user response times. This rule identifies the disks that are loaded so that the load can be rebalanced. | Critical, Alert, Caution |
| rhltm005 | RAM rule based on residency time for an unreferenced page. The virtual memory system indicates that the system needs more memory when the system scans to look for idle pages to reclaim for other uses. | Critical, Alert, Caution |
| rhltm006 | This rule refers to the problem with kernel memory allocation that occurs when login attempts or network connections fail unexpectedly. There are two possible causes: Either the kernel has reached the extent of its address space, or the free list does not contain any pages to allocate. The repeated failures signify a problem that might otherwise be overlooked. | Critical, Alert, Caution |
| rhltm007 | A global cache of directory path name components exists. This cache is called the directory name lookup cache (DNLC). If this cache does not exist, directory entries must be read from disk and be scanned to locate the right file. | Critical, Alert, Caution |
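The rhltm003 description normalizes the run queue length by the number of CPUs before judging it. The sketch below shows that normalization and maps the result to the three alarm types listed in the table. The function name `run_queue_severity` and all three threshold values are placeholders chosen for illustration. This appendix does not state the thresholds that the Health Monitor module actually uses.

```python
from typing import Optional


def run_queue_severity(
    run_queue_length: float,
    num_cpus: int,
    caution: float = 2.0,   # placeholder threshold, not from the product
    alert: float = 4.0,     # placeholder threshold, not from the product
    critical: float = 6.0,  # placeholder threshold, not from the product
) -> Optional[str]:
    """Return the alarm type for the per-CPU run queue length, or None."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    # Each CPU takes one job off the run queue per time slice, so the
    # meaningful figure is the queue length per CPU, not the raw length.
    per_cpu = run_queue_length / num_cpus
    if per_cpu >= critical:
        return "Critical"
    if per_cpu >= alert:
        return "Alert"
    if per_cpu >= caution:
        return "Caution"
    return None
```

With these placeholder values, a run queue of 8 on a 4-CPU host yields 2 jobs per CPU and a Caution alarm, while the same queue on a single-CPU host would be Critical.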