Sun N1 System Manager 1.3 Discovery and Administration Guide

Hardware Health Monitoring

The N1 System Manager monitors the hardware health of managed servers. It uses the sensors built into each managed server's hardware to monitor temperature, voltage, and fan speed. A managed server must have a service processor for the N1 System Manager to monitor its hardware health. For information about supported hardware, see Manageable Server Requirements in Sun N1 System Manager 1.3 Site Preparation Guide.

For SPARC servers, sensor data is retrieved from the service processor through the Advanced Lights Out Manager (ALOM) interface. For x64 servers, sensor data is retrieved through the Intelligent Platform Management Interface (IPMI).


Note –

Managed servers that use ALOM do not send data to the management server by use of traps. Instead, they send management data by email. To collect data from these servers, the management server runs its own email (SMTP) server on port 25.


The following characteristics of a managed server's hardware can be monitored:


Note –

The N1 System Manager does not monitor RAID controller states.


All details for a managed server's hardware health, where available, are displayed in the hardware monitoring table on the Server Details page of the browser interface, and in the Event Log.

Table 6–1 Hard Disk and Memory Failure Monitoring

Type                                        Disk Monitoring  Memory Failure Monitoring
ALOM servers: Netra 240 and Netra 440       None             None
ALOM servers: Sun Fire V210, V240 and V440  None             None
ALOM servers: Sun Fire T1000 and T2000      None             None
IPMI server: Sun Fire X2100                 None             None
ILOM servers: X4100 and X4200               Yes              Yes
IPMI servers: Sun Fire V20z and V40z        None             Yes
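
A script that reacts to hardware health events may need to know whether disk or memory failure monitoring is available for a given model. The following Python sketch transcribes Table 6–1 into a lookup. The names MONITORING_SUPPORT and monitoring_support are hypothetical and are not part of the N1 System Manager, and the "Sun Fire" prefix on the X4100 and X4200 keys is an assumption taken from the way those servers are named elsewhere in this guide.

```python
# Monitoring support per server model, transcribed from Table 6-1.
# Each value is (disk_monitoring, memory_failure_monitoring).
# This mapping is an illustration only; it is not an N1 System Manager API.
MONITORING_SUPPORT = {
    "Netra 240": (False, False),
    "Netra 440": (False, False),
    "Sun Fire V210": (False, False),
    "Sun Fire V240": (False, False),
    "Sun Fire V440": (False, False),
    "Sun Fire T1000": (False, False),
    "Sun Fire T2000": (False, False),
    "Sun Fire X2100": (False, False),
    "Sun Fire X4100": (True, True),   # "Sun Fire" prefix assumed
    "Sun Fire X4200": (True, True),   # "Sun Fire" prefix assumed
    "Sun Fire V20z": (False, True),
    "Sun Fire V40z": (False, True),
}


def monitoring_support(model):
    """Return (disk, memory) support flags for a model, or None if unlisted."""
    return MONITORING_SUPPORT.get(model)
```

Returning None for an unlisted model keeps "not in the table" distinct from "listed, but not monitored".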

A detailed list of hardware health sensors is provided in the documentation that accompanies your hardware.

You can view filtered hardware health monitoring information for all servers by using the show server command:


N1-ok> show server hardwarehealth hardwarehealth

See show server in Sun N1 System Manager 1.3 Command Line Reference Manual for details of possible values of the hardwarehealth filters. For more information and a graphic explaining filtering servers by health state, see To View Failed Managed Servers.

The locator lights for Sun Fire X2100, X4100 and X4200 servers can be switched on or off using the N1 System Manager. To switch a managed server's locator light on or off, use the set server command:


N1-ok> set server server locator locator-state

The locator-state value can be either on or off. For a group of servers, use the set group command with the group's name.
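
When several servers need their locator lights switched, the command lines can be generated rather than typed. The following Python sketch only builds command strings in the set server syntax shown above; it does not run them. The function name locator_command and the server names in the usage note are hypothetical.

```python
def locator_command(server, state):
    """Build a 'set server' command line that switches a locator light.

    'server' is the managed server's name; 'state' must be 'on' or 'off',
    the only two locator-state values described in this guide.
    """
    if state not in ("on", "off"):
        raise ValueError("locator state must be 'on' or 'off'")
    return f"set server {server} locator {state}"


def locator_commands(servers, state):
    """Build one command line per server name, in the order given."""
    return [locator_command(name, state) for name in servers]
```

For example, locator_commands(["server1", "server2"], "on") yields two command lines that can be pasted at the N1-ok> prompt. Validating the state before building the string avoids sending a command that the CLI would reject.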

Hardware Memory Problems on Sun Fire V20z and V40z Managed Servers

The N1 System Manager handles memory problems on Sun Fire V20z and V40z managed servers differently: if they occur, they are detected by polling through the managed server's service processor.

A memory error has occurred on a Sun Fire V20z or V40z server if all of the following are true:

If a memory error has occurred, see the example on how to correct it. To avoid false warning statuses in the future, the service processor's event log must be cleared after the defective memory has been replaced or repaired.


Example 6–1 Examining Memory Errors on Sun Fire V20z or V40z Managed Servers

If a memory error has occurred on a Sun Fire V20z or V40z managed server, log in to the server's service processor.


# ssh -l admin 10.0.3.2

Enter the password and check the managed server's status.


# sp get status

Check the service processor's event log.


# sp get events
ID Last Update      Component Severity      Message
1  01/01/1970 00:02 SP        informational SP localhost.localdomain IP is now set to 0.0.0.0
2  01/01/1970 18:47 SP        informational SP localhost.localdomain IP is now set to 0.0.0.0
3  01/01/1970 18:47 SP        informational SP localhost.localdomain IP is now set to 10.0.3.2 

Clear the service processor's event log.


# sp delete event -a
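
Before clearing the log, it can help to check whether any entries are more severe than informational. The following Python sketch parses text in the column layout shown in the sp get events output above. It assumes that the Component and Severity fields are single words, as they are in that sample. The names SpEvent, parse_sp_events, and non_informational are hypothetical and are not part of the service processor software.

```python
from typing import NamedTuple


class SpEvent(NamedTuple):
    event_id: int
    last_update: str   # date and time, as printed by the service processor
    component: str
    severity: str
    message: str


def parse_sp_events(text):
    """Parse 'sp get events' output into a list of SpEvent records.

    Assumes the layout shown in Example 6-1: ID, date, time, component,
    severity, then a free-form message. Lines that do not start with a
    numeric ID (blank lines, the header row) are skipped.
    """
    events = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        event_id, date, time, component, severity, message = fields
        events.append(
            SpEvent(int(event_id), f"{date} {time}", component,
                    severity, message.rstrip())
        )
    return events


def non_informational(events):
    """Return the events whose severity is anything other than informational."""
    return [event for event in events if event.severity != "informational"]
```

Applied to the three sample rows above, non_informational returns an empty list, because every entry has the severity informational.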

Hardware Sensor Attributes

For x64 servers, the management server software obtains the list of hardware sensor attributes to monitor through IPMI from the service processor of the server. For servers running the SPARC architecture, the ALOM interface is used. The list of hardware sensor attributes can vary from server to server, and between firmware versions. A sample listing for some servers and firmware versions is provided in this section. The attributes depend on the server type and on the number of CPUs that the server has.

To receive notifications for events from discrete sensors, create a notification rule and subscribe to the Ereport.Physical.ThresholdExceeded topic, as described in Setting Up Event Notifications.

For Sun Fire X4100 and Sun Fire X4200 servers, refer to the hardware documentation for the list of monitored hardware sensors.

For Sun Fire X2100 servers, only the sensors that report fan speed, voltage, and temperature are used to retrieve data. The following sensors are monitored for SP firmware version 4.11:


DDR 2.6V
CPU Core Voltage
VCC 3.3V
VCC 5V
VCC 12V
Battery Volt
CPU TEMP
SYS TEMP
CPU FAN
SYSTEM FAN3
SYSTEM FAN1
SYSTEM FAN2

For X2100 servers with SP firmware versions earlier than 4.11, the CPU Core Voltage sensor was called CPU Voltage.
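
Because the sensor name changed between firmware versions, a script that compares readings across X2100 servers may want to normalize names first. The following Python sketch maps the older name to the 4.11 name and groups the listed sensors by kind. The function names are hypothetical, and the grouping rule (FAN or TEMP in the name, otherwise voltage) is a heuristic assumed for this sensor list only, not a rule taken from the N1 System Manager.

```python
# Sensor names listed in this guide for X2100 SP firmware version 4.11.
X2100_SENSORS = [
    "DDR 2.6V", "CPU Core Voltage", "VCC 3.3V", "VCC 5V", "VCC 12V",
    "Battery Volt", "CPU TEMP", "SYS TEMP", "CPU FAN",
    "SYSTEM FAN3", "SYSTEM FAN1", "SYSTEM FAN2",
]

# Older firmware name -> name used from firmware 4.11 onward.
RENAMED_SENSORS = {"CPU Voltage": "CPU Core Voltage"}


def normalize_sensor_name(name):
    """Return the firmware 4.11 name for a sensor, leaving other names as-is."""
    return RENAMED_SENSORS.get(name, name)


def sensor_kind(name):
    """Classify a sensor from the list above as fan, temperature, or voltage.

    Heuristic for this list only: names containing FAN or TEMP are fans or
    temperatures; every remaining sensor in the list reports a voltage.
    """
    upper = normalize_sensor_name(name).upper()
    if "FAN" in upper:
        return "fan"
    if "TEMP" in upper:
        return "temperature"
    return "voltage"
```

Keeping the rename in a separate mapping means that a later firmware change needs only a new dictionary entry, not a change to the classification logic.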