The N1 System Manager monitors the hardware health of managed servers. It uses the sensors built into each managed server's hardware to monitor temperature, voltage, and fan speed. For information about supported hardware, see Manageable Server Requirements in Sun N1 System Manager 1.3 Site Preparation Guide. The N1 System Manager can monitor a managed server's hardware health only if the server has a service processor.
For SPARC servers, sensor data is retrieved from the service processor through the Advanced Lights Out Manager (ALOM) interface. For x64 servers, sensor data is retrieved through the Intelligent Platform Management Interface (IPMI).
Managed servers that use ALOM do not send data to the management server as traps. Instead, they send management data by email. To collect data from these servers, the management server runs its own email server on port 25.
The following characteristics of a managed server's hardware can be monitored:
CPU temperature
Ambient temperature
Fan speed in revolutions per minute
Voltages
LEDs (for Sun Fire X4100 and Sun Fire X4200 only)
Hard disks and memory. Hard disks and memory can be monitored only on some hardware types. See Table 6–1 for more information.
The N1 System Manager does not monitor RAID controller states.
All details for a managed server's hardware health, where available, are displayed in the hardware monitoring table on the Server Details page of the browser interface, and in the Event Log.
Table 6–1 Hard Disk and Memory Failure Monitoring
Type | Disk Monitoring | Memory Failure Monitoring |
---|---|---|
ALOM servers: Netra 240 and Netra 440 | None | None |
ALOM servers: Sun Fire V210, V240 and V440 | None | None |
ALOM servers: Sun Fire T1000 and T2000 | None | None |
IPMI server: Sun Fire X2100 | None | None |
ILOM servers: X4100 and X4200 | Yes | Yes |
IPMI servers: Sun Fire V20z and V40z | None | Yes |
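When scripting around monitoring coverage, the table can be encoded as a simple lookup. The following Python sketch is illustrative only: the `MONITORING_SUPPORT` dictionary and the `supports` helper are hypothetical names and are not part of the N1 System Manager.

```python
# Hypothetical encoding of Table 6-1.
# Each value is (disk monitoring, memory failure monitoring).
MONITORING_SUPPORT = {
    "Netra 240": (False, False),
    "Netra 440": (False, False),
    "Sun Fire V210": (False, False),
    "Sun Fire V240": (False, False),
    "Sun Fire V440": (False, False),
    "Sun Fire T1000": (False, False),
    "Sun Fire T2000": (False, False),
    "Sun Fire X2100": (False, False),
    "Sun Fire X4100": (True, True),
    "Sun Fire X4200": (True, True),
    "Sun Fire V20z": (False, True),
    "Sun Fire V40z": (False, True),
}


def supports(model, feature):
    """Return True if the table lists `feature` ("disk" or "memory") for `model`."""
    if feature not in ("disk", "memory"):
        raise ValueError("feature must be 'disk' or 'memory'")
    disk, memory = MONITORING_SUPPORT[model]  # KeyError for models not in the table
    return disk if feature == "disk" else memory
```

For example, `supports("Sun Fire V20z", "memory")` is true while `supports("Sun Fire V20z", "disk")` is false, matching the last row of the table.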
A detailed list of hardware health sensors is provided in the documentation that accompanies your hardware.
You can view filtered hardware health monitoring information for all servers by using the show server command:
N1-ok> show server hardwarehealth hardwarehealth
See show server in Sun N1 System Manager 1.3 Command Line Reference Manual for details of possible values of the hardwarehealth filters. For more information and a graphic explaining filtering servers by health state, see To View Failed Managed Servers.
The N1 System Manager can switch the locator lights of Sun Fire X2100, X4100 and X4200 servers on or off. To switch a managed server's locator light on or off, use the set server command:
N1-ok> set server server locator locator-state
The locator-state value can be either on or off. For a group of servers, use the set group command with the group's name.
Memory problems on the Sun Fire V20z and V40z managed servers are handled differently by the N1 System Manager. Sun Fire V20z and V40z memory problems, if they occur, are detected by polling through the managed server's service processor.
A memory error has occurred on a Sun Fire V20z or V40z server if all of the following are true:
The Sun Fire V20z or V40z managed server's status in the Server Details of the browser interface shows a warning or critical state
No sensors for the managed server are in the warning or critical state
No detail about the event is provided in the event log, but there is a memory event error shown by the server's service processor.
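The three conditions above combine into a single check. The following Python sketch shows that logic only; the function name, its parameters, and the lowercase state strings are assumptions made for illustration, and the inputs would have to be gathered by hand or by your own tooling.

```python
ABNORMAL_STATES = {"warning", "critical"}


def memory_error_suspected(server_status, sensor_states,
                           event_log_has_detail, sp_shows_memory_event):
    """Apply the three V20z/V40z conditions; all must hold.

    server_status         -- status shown in Server Details, e.g. "warning"
    sensor_states         -- iterable of per-sensor states
    event_log_has_detail  -- True if the N1 event log explains the event
    sp_shows_memory_event -- True if the service processor lists a memory event
    """
    status_abnormal = server_status in ABNORMAL_STATES
    no_sensor_abnormal = not any(s in ABNORMAL_STATES for s in sensor_states)
    unexplained = (not event_log_has_detail) and sp_shows_memory_event
    return status_abnormal and no_sensor_abnormal and unexplained
```

If any sensor is itself in a warning or critical state, the function returns false, because the server status is then explained by that sensor rather than by memory.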
If a memory error has occurred, see the following example for how to correct it. To avoid false warning statuses in the future, clear the service processor's event log after the defective memory has been replaced or repaired.
If a memory error has occurred on a Sun Fire V20z or V40z managed server, log into the server's service processor.
# ssh -l admin 10.0.3.2
Enter the password and check the managed server's status.
# sp get status
Check the service processor's event log.
# sp get events
ID Last Update      Component Severity      Message
1  01/01/1970 00:02 SP        informational SP localhost.localdomain IP is now set to 0.0.0.0
2  01/01/1970 18:47 SP        informational SP localhost.localdomain IP is now set to 0.0.0.0
3  01/01/1970 18:47 SP        informational SP localhost.localdomain IP is now set to 10.0.3.2
Clear the service processor's event log.
# sp delete event -a
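If you save the output of sp get events before clearing the log, you can search it later for memory-related entries. The following Python sketch parses text laid out like the sample output above. It assumes whitespace-separated columns and a single-word component, which is inferred from that sample only; `SpEvent` and `parse_sp_events` are hypothetical names, not part of any Sun tool.

```python
from typing import NamedTuple


class SpEvent(NamedTuple):
    event_id: int
    last_update: str
    component: str
    severity: str
    message: str


def parse_sp_events(text):
    """Parse saved 'sp get events' output into a list of SpEvent records."""
    events = []
    for line in text.splitlines():
        # ID, date, time, component, severity, then the free-text message.
        parts = line.split(None, 5)
        if len(parts) < 6 or not parts[0].isdigit():
            continue  # skips the header row, the command line, and blank lines
        event_id, date, time, component, severity, message = parts
        events.append(SpEvent(int(event_id), date + " " + time,
                              component, severity, message.strip()))
    return events
```

Filtering the result, for example with `[e for e in events if "memory" in e.message.lower()]`, gives a quick view of memory events, assuming the message text mentions memory.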
For x64 servers, the management server software obtains the list of hardware sensor attributes to monitor through IPMI from the service processor of the server. For servers running the SPARC architecture, the ALOM interface is used. The list of hardware sensor attributes can vary from server to server, and between firmware versions. A sample listing for some servers and firmware versions is provided in this section. The attributes depend on the server type and on the number of CPUs that the server has.
To receive notifications for events from discrete sensors, create a notification rule and subscribe to the Ereport.Physical.ThresholdExceeded topic, as described in Setting Up Event Notifications.
For Sun Fire X4100 and Sun Fire X4200 servers, refer to the hardware documentation to see the monitored hardware sensors.
For Sun Fire X2100 servers, only sensors describing fan speed, voltage, and temperature are used to retrieve data. Here is a list of sensors that are monitored for SP firmware version 4.11:
DDR 2.6V
CPU Core Voltage
VCC 3.3V
VCC 5V
VCC 12V
Battery Volt
CPU TEMP
SYS TEMP
CPU FAN
SYSTEM FAN3
SYSTEM FAN1
SYSTEM FAN2
For X2100 servers with SP firmware versions earlier than version 4.11, CPU Core Voltage was called CPU Voltage.
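Scripts that compare sensor readings across X2100 servers with different firmware may need to account for this rename. The following Python sketch maps the older name to the current one. It assumes that firmware versions are dotted numbers that compare component by component, which this guide does not state; `normalize_sensor_name` is a hypothetical helper.

```python
# Sensor names that changed at SP firmware 4.11 (old name -> current name).
RENAMED_BEFORE_4_11 = {"CPU Voltage": "CPU Core Voltage"}


def parse_version(version):
    """Turn a dotted version string such as '4.11' into (4, 11)."""
    return tuple(int(part) for part in version.split("."))


def normalize_sensor_name(name, firmware_version):
    """Return the 4.11-era sensor name for a reading from any firmware version."""
    if parse_version(firmware_version) < (4, 11):
        return RENAMED_BEFORE_4_11.get(name, name)
    return name
```

Sensor names that did not change, such as CPU FAN, pass through unmodified for every firmware version.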