Sun N1 System Manager 1.1 Administration Guide

Handling Threshold Breaches

When a monitored attribute breaches a threshold value, an event is generated. You can create notification rules to warn you about this type of event. Threshold breaches and warnings are recorded in the event log, which is most easily viewed through the browser interface.

Create notifications with the create notification command. The resulting notification can be sent by email or to a pager. See create notification in the Sun N1 System Manager 1.1 Command Line Reference Manual for syntax details.

Identifying Hardware and OS Threshold Breaches

If the value of a monitored hardware health attribute or OS resource utilization attribute breaches a threshold value, an event log indicates that the threshold has been breached. The event log becomes available from the browser interface after a delay that depends on the polling interval for the attribute:

t + polling interval

The time at which the breach occurs is indicated by t. The polling interval is in seconds, and is the amount of time between successive polls of the monitored attribute. See Setting Polling Intervals for more information. Use the show log command to verify that the event log has been generated:


N1-ok> show log
Id            Date                       Severity    Subject     Message
.
. 
10            2004-11-22T01:45:02-0800   WARNING     Sun_V20z_XG041105786
A critical high threshold was violated for server Sun_V20z_XG041105786: Attribute cpu0.vtt-s3 Value 1.32

13            2004-11-22T01:50:08-0800   WARNING     Sun_V20z_XG041105786
A normal low  threshold was violated for server Sun_V20z_XG041105786: Attribute cpu0.vtt-s3 Value 1.2
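The expression t + polling interval can be worked through with concrete numbers. The following Python sketch is illustrative only: the function name expected_log_time and the 300-second polling interval are assumptions, not values taken from N1 System Manager, and the breach time simply reuses the timestamp format shown in the log output.

```python
from datetime import datetime, timedelta


def expected_log_time(breach_time: datetime, polling_interval_s: int) -> datetime:
    """Return t + polling interval, where t is the time of the breach."""
    if polling_interval_s <= 0:
        raise ValueError("polling interval must be a positive number of seconds")
    return breach_time + timedelta(seconds=polling_interval_s)


# Hypothetical breach time, written in the timestamp format used by show log.
t = datetime.strptime("2004-11-22T01:45:02-0800", "%Y-%m-%dT%H:%M:%S%z")

# Assumed polling interval of 300 seconds (illustrative value only).
print(expected_log_time(t, 300).isoformat())  # 2004-11-22T01:50:02-08:00
```

With a shorter polling interval, the event log becomes available sooner after the breach.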

Identifying Network Connectivity Failure

If the IP addresses of the management server, the monitoring agent, or the data network are unavailable, an event indicates that there is a network connectivity problem. This is part of network reachability monitoring. See Network Reachability Monitoring for more information. The event log becomes available from the browser interface after a delay that depends on the polling interval for the attribute:

t + polling interval

The time at which the connectivity failure occurs is indicated by t. The polling interval is in seconds, and is the amount of time between successive polls of the monitored attribute. See Setting Polling Intervals for more information. Use the show log command to verify that the event log has been generated:


N1-ok> show log
.
.
13            2004-11-19T10:24:33-0800   INFORMATION  Sun_V20z_XGserial_number
Ip Address /<ip_address> on server Sun_V20z_XGserial_number is unreachable.

14            2004-11-19T10:24:38-0800   INFORMATION  Sun_V20z_XGserial_number
Ip Address /<ip_address> on server Sun_V20z_XGserial_number is unreachable.
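If the show log output is captured as text, the unreachable events can be filtered out with a short script. The following Python sketch assumes the layout shown above, in which each entry's message appears on the line after the Id, Date, Severity, and Subject fields. The function name parse_show_log and the regular expression are illustrative assumptions, not part of N1 System Manager.

```python
import re

# One entry line: Id, Date, Severity, Subject, and an optional message.
ENTRY = re.compile(
    r"^(?P<id>\d+)\s+(?P<date>\d{4}-\d{2}-\d{2}T\S+)\s+"
    r"(?P<severity>\S+)\s+(?P<subject>\S+)\s*(?P<message>.*)$"
)


def parse_show_log(text):
    """Parse captured show log text into a list of dictionaries."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == ".":
            continue  # skip blank lines and elision markers
        match = ENTRY.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries:
            # Continuation line: treat it as the previous entry's message.
            previous = entries[-1]
            previous["message"] = (previous["message"] + " " + line).strip()
    return entries


sample = """\
13            2004-11-19T10:24:33-0800   INFORMATION  Sun_V20z_XGserial_number
Ip Address /<ip_address> on server Sun_V20z_XGserial_number is unreachable.

14            2004-11-19T10:24:38-0800   INFORMATION  Sun_V20z_XGserial_number
Ip Address /<ip_address> on server Sun_V20z_XGserial_number is unreachable.
"""

unreachable = [
    e for e in parse_show_log(sample) if e["message"].endswith("is unreachable.")
]
print([e["id"] for e in unreachable])  # ['13', '14']
```

Lines that appear before the first entry, such as the prompt and the column header, are ignored because there is no entry to attach them to.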

Identifying Monitoring Failure

If monitoring is enabled, as described in Enabling Monitoring, and the status in the output of the show server or show group commands is unknown or unreachable, then the server or server group is not being reached successfully for monitoring. If the status remains unknown or unreachable for fewer than five polling intervals, a transient network problem might be occurring. However, if the status remains unknown or unreachable for more than five polling intervals, monitoring might have failed. This could be the result of a failure in the monitoring agent.
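The five-polling-interval guideline can be expressed as a simple comparison. The following Python sketch is illustrative only: the function name classify_outage and the returned strings are assumptions. The guide does not say how to treat a duration of exactly five polling intervals, so the sketch reports that case separately rather than guessing.

```python
def classify_outage(elapsed_s: float, polling_interval_s: float) -> str:
    """Classify how long a status has remained unknown or unreachable."""
    if polling_interval_s <= 0:
        raise ValueError("polling interval must be a positive number of seconds")
    intervals = elapsed_s / polling_interval_s
    if intervals < 5:
        return "possible transient network problem"
    if intervals > 5:
        return "possible monitoring failure"
    # Exactly five intervals: not covered by the guideline.
    return "boundary case (not specified)"


# Assumed polling interval of 300 seconds (illustrative value only).
print(classify_outage(600, 300))   # possible transient network problem
print(classify_outage(1800, 300))  # possible monitoring failure
```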

A timestamp is provided in the monitoring data output. The relationship between this timestamp and the value of the polling interval can also be used to judge whether there is an error with the monitoring agent. If the monitored output for a provisionable server continues to show the same timestamp, even after several polling intervals have passed, the provisionable server has not been successfully polled and is no longer being monitored. This could be the result of a failure in the monitoring agent.
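The unchanged-timestamp check can be sketched as follows, assuming that the timestamp from the monitoring data output is recorded once per polling interval. The function name timestamp_is_stuck and the default of three consecutive observations are assumptions. The guide says only "several polling intervals", so choose a value that suits your environment.

```python
def timestamp_is_stuck(observed, min_polls=3):
    """Return True if the last min_polls recorded timestamps are identical.

    observed is a sequence of timestamp strings, one per polling interval,
    oldest first.
    """
    if min_polls < 2:
        raise ValueError("min_polls must be at least 2")
    recent = list(observed)[-min_polls:]
    return len(recent) == min_polls and len(set(recent)) == 1


# Hypothetical observations recorded at successive polling intervals.
stuck = ["2004-11-22T01:45:02-0800"] * 3
healthy = ["2004-11-22T01:45:02-0800", "2004-11-22T01:50:08-0800"]

print(timestamp_is_stuck(stuck))    # True
print(timestamp_is_stuck(healthy))  # False
```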