Sun N1 System Manager 1.3.1 Troubleshooting Guide

Chapter 5 Monitoring Problems

This chapter describes the most common monitoring problems, their causes, and the solution for each problem. The following topics are discussed:

Adding OS Monitoring to a Managed Server on Which Base Management Is Installed Fails

Adding the OS monitoring feature to a managed server that has the base management feature installed might fail. The following job output shows the error:


N1-ok> show job 61
Job ID: 61
Date: 2005-08-16T16:14:27-0400
Type: Modify OS Monitoring Support
Status: Error (2005-08-16T16:14:38-0400)
Command: add server 192.168.2.10 feature osmonitor agentssh root/rootpasswd
Owner: root
Errors: 1
Warnings: 0

Steps
ID  Type          Start                     Completion                Result
1   Acquire Host  2005-08-16T16:14:27-0400  2005-08-16T16:14:28-0400  Completed
2   Run Command   2005-08-16T16:14:28-0400  2005-08-16T16:14:28-0400  Completed
3   Acquire Host  2005-08-16T16:14:29-0400  2005-08-16T16:14:30-0400  Completed
4   Run Command   2005-08-16T16:14:30-0400  2005-08-16T16:14:36-0400  Error

Results
Result 1:
Server: 192.168.2.10
Status: -3
Message: Repeate attempts for this operation are not allowed.

This error indicates that SSH credentials have previously been supplied for this managed server and cannot be altered. To avoid this error, issue the add server feature osmonitor command without the agentssh credentials.
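When many jobs need to be checked, the job output shown above can be scanned by a script. The following Python sketch is an illustration only and is not part of N1 System Manager: it assumes the output has been captured as text in the layout shown above, and the names parse_job_output and has_repeat_attempt_error are hypothetical. It matches on the message text rather than on the -3 status code, because the meaning of that code outside this example is not documented here.

```python
import re

# Assumption: "Result N:" starts a per-server block, and every other
# "Name: value" line is a simple field, as in the job output shown above.
RESULT_HEADER = re.compile(r"^Result \d+:$")
FIELD = re.compile(r"^([A-Za-z][A-Za-z ]*):\s*(.*)$")


def parse_job_output(text):
    """Split captured `show job` text into job fields and per-result fields."""
    job, results, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if RESULT_HEADER.match(line):
            current = {}
            results.append(current)
            continue
        match = FIELD.match(line)
        if not match:
            continue  # prompt lines, step rows, and section titles are skipped
        target = job if current is None else current
        target[match.group(1).strip()] = match.group(2)
    return job, results


def has_repeat_attempt_error(results):
    """True if any result carries the 'attempts ... not allowed' message."""
    needle = "attempts for this operation are not allowed"
    return any(needle in r.get("Message", "") for r in results)


SAMPLE = """\
Job ID: 61
Type: Modify OS Monitoring Support
Status: Error (2005-08-16T16:14:38-0400)
Errors: 1

Results
Result 1:
Server: 192.168.2.10
Status: -3
Message: Repeate attempts for this operation are not allowed.
"""

job, results = parse_job_output(SAMPLE)
if has_repeat_attempt_error(results):
    print("Retry without agentssh credentials on", results[0]["Server"])
```

The substring match is deliberately loose so that it still works if the spelling of the first word of the message changes between releases.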

Use the grep command as follows to determine whether the OS monitoring agents were successfully installed.

ALOM-based Managed Server Notifications Are Not Displayed

The ports of some models of manageable servers use the Advanced Lights Out Manager (ALOM) standard. These servers, detailed in Manageable Server Requirements in Sun N1 System Manager 1.3 Site Preparation Guide, use email instead of SNMP traps to send notifications about hardware events to the management server. For information about other events, see Managing Event Log Entries in Sun N1 System Manager 1.3 Discovery and Administration Guide and Setting Up Event Notifications in Sun N1 System Manager 1.3 Discovery and Administration Guide.

If no notifications about hardware events arrive from ALOM-based managed servers, the most likely explanation is that all managed servers are healthy. However, if you are using an external mail service instead of the internal secure N1 System Manager mail service, the external mail service might not have been configured correctly as an email server, or its configuration might have been invalidated by another issue such as a network error or a domain name change.

To resolve, do one of the following:

Base Management Installation for a Managed Server Fails

Installing the base management feature support might fail due to stale SSH entries on the management server. If the add server feature command fails and no true security breach has occurred, note the name and IP address of the managed server. Remove the entry for that server as described in To Update the ssh_known_hosts File.
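The effect of removing a stale entry can be sketched in Python. This is a hedged illustration, not the documented procedure: it assumes the common OpenSSH layout in which each line begins with a comma-separated list of host names and addresses, and it does not handle hashed host names, wildcard patterns, or [host]:port entries. The function name remove_known_host is hypothetical.

```python
def remove_known_host(lines, names):
    """Return known_hosts lines, dropping entries that name any of `names`."""
    targets = set(names)
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            kept.append(line)  # keep blank lines and comments untouched
            continue
        fields = stripped.split()
        # An optional @marker (such as @revoked) precedes the host field.
        if fields[0].startswith("@") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        if set(host_field.split(",")) & targets:
            continue  # stale entry for the managed server: drop it
        kept.append(line)
    return kept


entries = [
    "# management server known hosts",
    "server1,192.168.2.10 ssh-rsa AAAAexamplekey1",
    "server2,192.168.2.11 ssh-rsa AAAAexamplekey2",
]
print(remove_known_host(entries, ["server1", "192.168.2.10"]))
```

Passing both the name and the IP address that you noted covers entries that were recorded under only one of them.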

Basic Monitoring

If monitoring is enabled as described in Enabling and Disabling Monitoring in Sun N1 System Manager 1.3 Discovery and Administration Guide, and the status in the output of the show server or show group commands is unknown or unreachable, then the server or server group is not being reached successfully for monitoring.

If the status remains unknown or unreachable for less than 10 minutes, a transient network problem might be occurring. However, if the status remains unknown or unreachable for more than 30 minutes, monitoring might have failed. This failure could be the result of any of the following issues.

If monitoring traps are lost, a particular threshold status may not be refreshed for up to 30 hours, although the overall status should still be refreshed every 10 minutes.

A time stamp is provided in the monitoring data output. The relationship between this time stamp and the current time can also be used to judge whether a problem exists with the monitoring agent.
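The comparison between the time stamp and the current time can be sketched as follows. This Python example is an assumption-laden illustration: it assumes the time stamp uses the same format as the job output dates in this chapter (for example, 2005-08-16T16:14:27-0400), and it borrows the 10-minute and 30-minute figures from the status guidance above as cutoffs for the age of the data. The function names and the labels current, late, and stale are hypothetical.

```python
from datetime import datetime, timedelta, timezone

REFRESH_INTERVAL = timedelta(minutes=10)  # overall status refresh period
FAILURE_CUTOFF = timedelta(minutes=30)    # beyond this, suspect the agent


def data_age(timestamp, now):
    """Age of a monitoring time stamp; `now` must be timezone-aware."""
    stamp = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S%z")
    return now - stamp


def assess(age):
    """Map an age to a rough judgment (labels are illustrative only)."""
    if age <= REFRESH_INTERVAL:
        return "current"
    if age <= FAILURE_CUTOFF:
        return "late"   # possibly a transient network problem
    return "stale"      # monitoring might have failed


now = datetime(2005, 8, 16, 20, 30, 0, tzinfo=timezone.utc)
age = data_age("2005-08-16T16:14:27-0400", now)
print(age, assess(age))
```

Because the parsed time stamp carries its own UTC offset, the subtraction is correct even when the management server and the script run in different time zones.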

OS Monitoring

It can take 5 to 7 minutes for all OS monitoring data to initialize fully. During that time, CPU idle might be reported as 0.0%, which causes a Failed Critical status for OS usage. This status should clear within 5 to 7 minutes after you add or upgrade the OS monitoring feature on the managed server. At that point, OS monitoring data for the managed server should be available through the show server server command. For further information, see To Add the OS Monitoring Feature in Sun N1 System Manager 1.3 Discovery and Administration Guide.
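A script that reacts to monitoring status could use this initialization window to avoid raising false alarms. The following Python sketch is illustrative only: it assumes the script itself records when the OS monitoring feature was added or upgraded, and the name is_probably_initializing is hypothetical.

```python
from datetime import datetime, timedelta

INIT_WINDOW = timedelta(minutes=7)  # upper end of the 5 to 7 minute range


def is_probably_initializing(feature_added_at, now, cpu_idle_percent):
    """True if a 0.0% CPU idle reading falls inside the initialization window."""
    return cpu_idle_percent == 0.0 and (now - feature_added_at) <= INIT_WINDOW


added = datetime(2005, 8, 16, 16, 14, 27)
print(is_probably_initializing(added, added + timedelta(minutes=3), 0.0))
```

A 0.0% reading that persists after the window has passed should be treated as a real Failed Critical condition.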

Adding the OS monitoring feature to a managed server might fail due to stale or obsolete SSH entries for that managed server in the known_hosts file on the management server. If the add server server-name feature osmonitor agentip command fails and no true security breach has occurred, remove the entry for that managed server from the known_hosts file as described in To Update the ssh_known_hosts File. Then, retry the add command.

Sun Blade X8400 Server Blade Is Not Displayed in Its Chassis Group and Is Displayed as a Separate Managed Server

Under certain circumstances, a Sun Blade X8400 server blade will not be listed in its chassis group, but will be listed as a separate managed server with the status unreachable.

This problem can be caused by any one or more of the following situations:

To resolve this problem:

After you have verified that the Sun Blade X8400 server blade is accessible using N1 System Manager and standard access protocols, refresh the server blade using either of the following two methods: