Sun Management Center 3.6.1 User's Guide

Appendix D Sun Management Center Software Rules

This appendix lists the Sun Management Center rules for the Kernel Reader and Health Monitor modules.

Rules Concepts

A rule is an alarm check mechanism that allows for complex or special purpose logic in determining the status of a monitored host or node.

There are two types of rules: simple rules and complex rules.

Kernel Reader

The following table lists the Kernel Reader simple rules.

Table D–1 Kernel Reader Simple Rules

| Property | Description |
| --- | --- |
| avg_1min | Load average over the last minute |
| avg_5min | Load average over the last 5 minutes |
| avg_15min | Load average over the last 15 minutes |
| cpu_delta | Difference between the previous and current time |
| cpu_idle | CPU idle time |
| cpu_kernel | CPU kernel time |
| cpu_user | CPU user time |
| cpu_wait | CPU wait time |
| ipctused | Percent of inodes used |
| kpctused | Percent of Kbytes used |
| mem-inuse | Physical memory in use (Mbytes) |
| numusers | Number of users |
| numsessions | Number of user sessions |
| swap_used | Swap used (Kbytes) |
| wait_io | CPU wait time breakdown |
| wait_pio | CPU wait time breakdown |
| wait_swap | CPU wait time breakdown |

The following table lists the Kernel Reader complex rules.

Table D–2 Kernel Reader Complex Rules

| Rule ID | Description | Type of Alarm |
| --- | --- | --- |
| rknrd100 | This rule covers a transitory event. The rule generates an alert alarm when the disk is over 75% busy, the average queue length is over 10, and the wait queue is increasing. The alert alarm remains until the disk is less than 70% busy and the average queue length is less than 8. | Alert |
| rknrd102 | This rule covers a transitory event. The rule generates an alert alarm if 90% of swap space is in use. The alert alarm remains until swap space in use is less than 80% of the total swap space. | Alert |
| rknrd103 | This rule covers a transitory event. The rule generates an alert alarm if swapping and paging are high for a given CPU. This behavior indicates that a CPU might be thrashing. An alert alarm is generated when a CPU exceeds 1 swap-out, 10 page-ins, and 10 page-outs per second. The alert alarm stays on if the CPU exceeds 1 swap-out, 8 page-ins, and 8 page-outs per second. | Alert |
| rknrd105 | File System Full error. This rule looks for a file system full error message in the syslog (/var/adm/messages). | Alert alarm that is closed immediately |
| rknrd106 | No swap space error. This rule looks for a no swap space error message in the syslog (/var/adm/messages). | Alert alarm that is closed immediately |
| rknrd400 | This rule checks for a continuous CPU load over six per CPU for four hours. | Informational |
| rknrd401 | This rule checks for disks that are busy more than 90% of the time for x hours. The parameters field holds the last time CPU load was below six, and is initialized to some date in the year 2001. | Informational |
| rknrd402 | This rule checks if available swap space drops below 10% for x hours. The parameters field indicates the last time that the CPU load was below six. This field is initialized to some date in the year 2001. | Informational |
| rknrd403 | This rule is not currently supported. | Informational |
| rknrd404 | An informational alarm is generated if rule rknrd401 is triggered 4 times. | Informational |
| rknrd405 | An informational alarm is generated if rule rknrd402 is triggered 4 times. | Informational |
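
Several of the transitory-event rules in Table D–2 open an alarm at one threshold and clear it only at a lower one, so that a value hovering near the limit does not toggle the alarm repeatedly. The following Python sketch illustrates that hysteresis pattern using the figures from the rknrd102 description (open at 90% swap in use, clear below 80%). It is not the Sun Management Center implementation: the class name `HysteresisAlarm`, its fields, and the choice of an inclusive comparison at the upper threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlarm:
    """Hypothetical helper: alarm that opens at one level and clears at a lower one."""

    open_at: float      # open the alarm when the value reaches this level
    close_below: float  # clear the alarm only when the value drops below this level
    active: bool = False

    def update(self, value: float) -> bool:
        """Feed one sample and return whether the alarm is active afterwards."""
        if not self.active and value >= self.open_at:
            # Assumption: reaching the threshold exactly counts as exceeding it.
            self.active = True
        elif self.active and value < self.close_below:
            self.active = False
        return self.active


# Thresholds from the rknrd102 description; the sample values are made up.
swap_alarm = HysteresisAlarm(open_at=90.0, close_below=80.0)
samples = [70.0, 91.0, 85.0, 79.0, 85.0]
states = [swap_alarm.update(pct) for pct in samples]
print(states)  # [False, True, True, False, False]
```

Note that the 85% sample is reported differently depending on history: it keeps an open alarm open, but it does not open a new one.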

Health Monitor

The following table lists the Health Monitor complex rules.

Table D–3 Health Monitor Complex Rules

| Rule ID | Description | Type of Alarm |
| --- | --- | --- |
| rhltm000 | This rule checks whether there is enough swap space. | Critical, Alert, Caution |
| rhltm001 | CPU power is wasted each time a CPU has to wait for a lock to become free. This event is counted because the kernel uses mutually exclusive locks to synchronize its operation and to keep multiple CPUs from concurrently accessing critical code and data regions. | Critical, Alert, Caution |
| rhltm002 | NFS remote procedure call timeouts may be associated with duplicate responses after the call is retransmitted. These timeouts indicate that the network is okay but the server is responding slowly. | Critical, Alert, Caution |
| rhltm003 | The run queue length is divided by the number of CPUs because every CPU takes a job off the run queue in each time slice. | Critical, Alert, Caution |
| rhltm004 | A busy disk or a slow disk reduces system throughput and increases user response times. This rule identifies the disks that are loaded so that the load can be rebalanced. | Critical, Alert, Caution |
| rhltm005 | RAM rule based on residency time for an unreferenced page. The virtual memory system indicates that the system needs more memory when the system scans to look for idle pages to reclaim for other uses. | Critical, Alert, Caution |
| rhltm006 | This rule refers to the problem with kernel memory allocation that occurs when login attempts or network connections fail unexpectedly. There are two possible causes: Either the kernel has reached the extent of its address space, or the free list does not contain any pages to allocate. The repeated failures signify a problem that might otherwise be overlooked. | Critical, Alert, Caution |
| rhltm007 | A global cache of directory path name components exists. This cache is called the directory name lookup cache (DNLC). If this cache does not exist, directory entries must be read from disk and be scanned to locate the right file. | Critical, Alert, Caution |
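
The rhltm003 description normalizes the run queue length by the number of CPUs before judging load. The Python sketch below shows that arithmetic and one possible way to map the result onto the three alarm levels. The function names and every threshold value are hypothetical, chosen only to make the example concrete; the source does not state the limits that the Health Monitor module actually uses.

```python
from typing import Optional

# Hypothetical limits on run queue length per CPU, highest severity first.
# These numbers are illustrative assumptions, not values from the product.
HYPOTHETICAL_THRESHOLDS = [
    (10.0, "Critical"),
    (5.0, "Alert"),
    (3.0, "Caution"),
]


def run_queue_per_cpu(run_queue_length: float, cpu_count: int) -> float:
    """Divide the run queue length by the CPU count, as rhltm003 describes."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return run_queue_length / cpu_count


def severity(per_cpu_queue: float) -> Optional[str]:
    """Return the first (most severe) level whose limit is reached, else None."""
    for limit, level in HYPOTHETICAL_THRESHOLDS:
        if per_cpu_queue >= limit:
            return level
    return None


# A run queue of 12 on a 4-CPU host is 3 jobs per CPU.
per_cpu = run_queue_per_cpu(12, 4)
print(per_cpu, severity(per_cpu))  # 3.0 Caution
```

The same queue length of 12 on a single-CPU host would yield 12 jobs per CPU and, under these assumed limits, a Critical result, which is why the raw queue length alone is not compared against the thresholds.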