CHAPTER 2

Fault Management

This chapter contains information about the following topics related to fault management on the Sun Blade 6048 modular system.

Section 2.1, About Fault Management

Section 2.2, Monitoring Faults

Section 2.3, Determining That Hardware Has Failed

Section 2.4, Replacing a Faulted Component

2.1 About Fault Management

The fault management software of the Sun Blade 6048 modular system monitors hardware health, and diagnoses and reports hardware failures on system components. Fault management also monitors environmental conditions and reports when the system’s environment is outside acceptable parameters. Various sensors on the system chassis shelf, the power supplies (PSUs), server modules, and fans are continuously monitored. When a sensor registers a problem, the fault management software, which runs on the chassis management module (CMM), is notified.

Fault management then diagnoses the problem. If it determines that a hardware or environmental failure has occurred, fault management lights the Service Action Required LED on the affected component. The ILOM management interfaces are updated to reflect the failure (the fault), and the failure is recorded as a fault in the event log.

Note - The Sun Blade 6048 modular system’s fault management software is entirely unrelated to Solaris Fault Management Architecture (FMA). Fault management is part of the system management software and does not interact with the server module hosts or their operating systems.

2.1.1 External Compared With Internal Faults

When a system component experiences a hardware failure, it is called an internal fault; that is, the fault is the result of a problem with the hardware of the Sun Blade 6048 modular system itself. Internal faults are cleared when a repair action takes place, most likely the replacement of the failed component.

There are some faults, however, that are external faults. In these cases, the system hardware has not failed, but a condition external to the system is causing a potential problem. If, for example, the ambient air temperature (external to the chassis) exceeds a certain threshold, it is a fault because it can adversely affect the operation of the system if not corrected. External faults are autoclearing; they are cleared when the external condition no longer exists. Nonetheless, an external fault can, if not attended to, cause components or the system as a whole to shut down.

2.1.2 Components Monitored by Fault Management

Fault management monitors the following system components.

TABLE 2-1 Component Fault Management
System Component	Refer to This Section
System chassis shelf	Section 2.3.1, Chassis Shelf Faults
Power supply units (PSUs)	Section 2.3.2, Power Supply Module Faults
Front fans (within power supply modules)	Section 2.3.2, Power Supply Module Faults
Rear fans	Section 2.3.3, Rear Fan Faults

Note - For information about faults for other system components such as NEMs, PCI EMs, and blades, refer to the documentation for that component.

2.2 Monitoring Faults

There are three ways to tell when a fault has occurred somewhere in the system:

The amber Service Action Required LEDs on the failed component and on the system chassis shelf are illuminated (see Section 2.2.1, Monitoring the Service Action Required LEDs).

Component status information, available through the ILOM web interface and CLI, registers that the component is in a faulted state (see Section 2.2.2, Monitoring Faults From the Management Interfaces).

The fault is recorded in the system event log (see Section 2.2.3, Monitoring the Event Log).

2.2.1 Monitoring the Service Action Required LEDs

When a component experiences a hardware failure (enters a fault state), fault management illuminates the Service Action Required (amber) LED on that component. In addition, fault management illuminates the Service Action Required LEDs on the system chassis shelf (both front and back) when any system component is in a faulted state.

2.2.1.1 When Service Action Required LEDs Are Turned Off

Since a Service Action Required LED indicates a hardware failure, it remains illuminated until fault management detects that the failed hardware has been replaced or repaired. The chassis shelf Service Action Required LEDs, which serve as summary indicators for all component faults, remain illuminated as long as any system component remains in a faulted state.

2.2.1.2 When Only the Chassis Shelf LEDs Are Lit

If the chassis Service Action Required LEDs are illuminated but no other system component has a lit Service Action Required LED, then fault management has diagnosed an external fault: a problem outside the system that potentially affects the system as a whole. For example, if the external ambient air temperature exceeds 45°C, a fault is declared and the system shuts down although there is nothing physically wrong with any system hardware.

Refer to Section 2.3.1, Chassis Shelf Faults for information about the external conditions that can cause these chassis faults.
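
The LED rules in Section 2.2.1 can be summarized as a short decision sketch. The following Python is illustrative only: the function name and the dictionary of LED states are assumptions made for this example (the states would come from visual inspection) and are not part of ILOM.

```python
def classify_led_state(chassis_led_lit, component_leds):
    """Interpret Service Action Required LEDs as described in Section 2.2.1.

    chassis_led_lit: True if the chassis shelf LEDs are lit.
    component_leds:  dict mapping a component label to True if its LED is lit.
    Both inputs are hypothetical; they are not read from the system.
    """
    lit = sorted(name for name, on in component_leds.items() if on)
    if lit:
        # At least one component LED is lit: those components are faulted.
        return "component fault: " + ", ".join(lit)
    if chassis_led_lit:
        # Only the chassis shelf LEDs are lit: an external fault.
        return "external fault"
    return "no fault indicated"


print(classify_led_state(True, {"FM1": False, "PS0": False}))
print(classify_led_state(True, {"FM1": True, "PS0": False}))
```

Because a PSU controls its own Service Action Required LED (see Section 2.2.1.3), an "external fault" result from LEDs alone is best confirmed in the ILOM management interfaces or the event log.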

2.2.1.3 About Power Supply Faults

The power supply units (PSUs) are a special case; they monitor their own fault status and control their own Service Action Required LEDs. The fault management software cannot turn the PSU LEDs on or off. However, because fault management is monitoring sensors on the PSUs, it is notified when a PSU fault occurs. Fault management illuminates the chassis shelf Service Action Required LEDs and notes the fault occurrence in the ILOM management interfaces and in the event log.

Note that it is possible for a PSU to extinguish its Service Action Required LED (declare that the fault is cleared), but for fault management to continue to assert that the PSU is still in a faulted state. If this happens, the ILOM management interfaces, the chassis shelf Service Action Required LEDs, and the event log reflect that the faulted state is ongoing.

Refer to Section 2.3.2, Power Supply Module Faults for more information.

2.2.2 Monitoring Faults From the Management Interfaces

You can monitor chassis shelf and component faults from the ILOM CLI or the web interface.

Note - Refer to the Sun Integrated Lights Out Manager 2.0 User’s Guide for information about the object namespace and how to identify the targets and properties that might pertain to faults.

Section 2.2.2.1, Obtaining Sensor Readings Using the CLI

Section 2.2.2.2, Obtaining Sensor Readings Using the Web Interface

2.2.2.1 Obtaining Sensor Readings Using the CLI

To obtain sensor readings using the CLI:

1. Establish a local serial console connection or SSH connection to the CMM, and log in to the ILOM.

2. Issue the appropriate show command to display information about system components.

For example, if a power-supply AC-1 light is lit, you would issue the following command:

> show /CH/PS0/S1/AC_FAIL
 
 /CH/PS0/S1/AC_FAIL
    Targets:
 
    Properties:
        type = Voltage
        class = Discrete Sensor
        value = Predictive Failure Asserted
 
    Commands:
        cd
        show

The value = Predictive Failure Asserted shows the faulted power supply. Since one of the power supplies in power supply module 0 has failed, the entire power supply module will need to be replaced.
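
If you capture the output of a show command (for example, from an SSH session log), the check for a faulted sensor can be scripted. The following Python is a minimal sketch that parses the sample output shown above. The function names are assumptions for this example, and the parser relies only on the layout visible in that sample.

```python
SAMPLE_OUTPUT = """\
 /CH/PS0/S1/AC_FAIL
    Targets:

    Properties:
        type = Voltage
        class = Discrete Sensor
        value = Predictive Failure Asserted

    Commands:
        cd
        show
"""

FAULT_VALUE = "Predictive Failure Asserted"


def parse_properties(show_output):
    """Return the 'name = value' pairs listed under 'Properties:'."""
    props = {}
    in_props = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.endswith(":") and " = " not in line:
            # Section headers such as "Targets:", "Properties:", "Commands:".
            in_props = (line == "Properties:")
            continue
        if in_props and " = " in line:
            name, _, value = line.partition(" = ")
            props[name.strip()] = value.strip()
    return props


def is_faulted(show_output):
    """True if the sensor's value property reports the fault value."""
    return parse_properties(show_output).get("value") == FAULT_VALUE


print(parse_properties(SAMPLE_OUTPUT))
print(is_faulted(SAMPLE_OUTPUT))
```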

2.2.2.2 Obtaining Sensor Readings Using the Web Interface

In the ILOM web interface, you can obtain instantaneous sensor readings about system FRUs (field-replaceable units) or other system inventory on the System Monitoring -> Sensor Readings page.

To obtain sensor readings from the ILOM web interface:

1. Open a web browser, and type the IP address of the server SP or CMM.

The Login page for the ILOM web interface appears.

2. In the ILOM Login page, enter a user name and password, and then click OK.

The ILOM web interface appears.

3. In the web interface page, click System Monitoring -> Sensor Readings.

The Sensor Readings page appears.

FIGURE 2-1 Sensor Readings Page

Screen capture of ILOM Sensor Readings page.

Note - If the server is powered off, many components will appear as “no reading.”

4. In the Sensor Readings page, do the following:

a. Locate the name of the sensor you want to view.

b. Click the name of the sensor to view the property values associated with that sensor.

For specific details about the type of discrete sensor targets you can access, as well as the paths to access them, consult the user documentation provided with the Sun server platform.

2.2.3 Monitoring the Event Log

Faults are recorded in the system event log, which can be viewed from the ILOM CLI or web interface.

Section 2.2.3.1, Viewing or Clearing the ILOM Event Log Using the CLI

Section 2.2.3.2, Viewing or Clearing the ILOM Event Log Using the Web Interface

2.2.3.1 Viewing or Clearing the ILOM Event Log Using the CLI

To view or clear events in the system event log using the ILOM CLI:

1. Establish a local serial console connection or SSH connection to the CMM, and log in to the ILOM.

2. Type the following command to set the working directory:

cd /CMM/logs/event

3. Type the following command to display the event log list:

show list

The contents of the event log appear. An example follows.

ID     Date/Time                 Class     Type      Severity
-----  ------------------------  --------  --------  --------
50611  Wed Aug 15 16:55:56 2007  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
50610  Wed Aug 15 16:44:44 2007  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
50609  Tue Aug 14 18:03:45 2007  Audit     Log       minor

4. In the event log, perform any of the following tasks:

Scroll down the list to view entries. Press any key except q. The following table provides descriptions of the columns that appear in the log.

Column Label	Description
Event ID	The number of the event, in sequence from number 1.
Date/Time	The day and time the event occurred. If the Network Time Protocol (NTP) server is enabled to set the ILOM time, the ILOM clock will use Coordinated Universal Time (UTC).
Class/Type	Audit/Log: Commands that result in a configuration change. Description includes user, command, command parameters, and success or fail. IPMI/Log: Any event that is placed in the IPMI SEL is also put in the management log. Chassis/State: For changes to the inventory and general system state. Chassis/Action: Category for shutdown events for server module or chassis, hot insert or removal of a FRU, and Reset Parameters button pushed. FMA/Fault: For Fault Management Architecture (FMA) faults. Description gives time of fault as detected by FMA and suspect component. FMA/Repair: For FMA repairs. Description gives component.
Severity	Critical, Major, or Minor
Description	A description of the event.

Dismiss the event log (stop displaying the log). Press the q key.

Clear entries in the event log. Perform these steps:

a. Type set clear=true

A confirmation message appears.

b. Type one of the following:

To clear the entries, type y.

To cancel clearing the log, type n.

Note - The ILOM event log accumulates many types of events, including copies of IPMI entries. Clearing the ILOM event log clears all entries in the log, including the IPMI entries. However, clearing the ILOM event log entries does not clear the actual entries posted directly to an IPMI log.
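
When the event log is long, it can help to save the output of show list to a file and filter it offline. The following Python is a sketch under stated assumptions: it expects the two-line layout shown in the example above (a header line with ID, date, class, type, and severity, followed by an indented description line), and the names EVENT_LINE and parse_event_log are hypothetical.

```python
import re

# Matches an entry header such as:
# 50611  Wed Aug 15 16:55:56 2007  Audit     Log       minor
EVENT_LINE = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<date>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"(?P<cls>\S+)\s+(?P<type>\S+)\s+(?P<severity>\S+)\s*$"
)

SAMPLE_LOG = """\
ID     Date/Time                 Class     Type      Severity
-----  ------------------------  --------  --------  --------
50611  Wed Aug 15 16:55:56 2007  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
50610  Wed Aug 15 16:44:44 2007  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
50609  Tue Aug 14 18:03:45 2007  Audit     Log       minor
"""


def parse_event_log(text):
    """Turn captured 'show list' output into a list of dictionaries."""
    events = []
    for raw in text.splitlines():
        match = EVENT_LINE.match(raw)
        if match:
            event = match.groupdict()
            event["id"] = int(event["id"])
            event["description"] = ""
            events.append(event)
        elif events and raw.strip():
            # Any other non-blank line continues the previous entry.
            last = events[-1]
            last["description"] = (last["description"] + " " + raw.strip()).strip()
    return events


events = parse_event_log(SAMPLE_LOG)
# Keep only fault entries (class FMA, type Fault, per the column table above).
faults = [e for e in events if e["cls"] == "FMA" and e["type"] == "Fault"]
print(len(events), "entries,", len(faults), "fault entries")
```

The date is kept as a string rather than converted, because this sketch makes no assumption about the time zone or locale of the ILOM clock.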

2.2.3.2 Viewing or Clearing the ILOM Event Log Using the Web Interface

To view or clear events in the ILOM event log using the ILOM web interface:

1. Open a web browser, and type the IP address of the server CMM.

The Login page for the ILOM web interface appears.

2. In the ILOM Login page, enter a user name and password, and then click OK.

The ILOM web interface appears.

3. In the web interface page, select System Monitoring -> Event Logs.

The Event Log page appears.

FIGURE 2-2 ILOM Web Interface Event Log

Screen capture of ILOM Web Interface Event Log.

4. In the Event Log page, perform any of the following:

Page through entries: Use the page navigation controls at the top and the bottom of the table to navigate forward and backward through the available data in the table.

Note that selecting a larger number of entries might cause the web interface to respond more slowly than selecting a smaller number of entries.

View the entries in the display by scrolling through the list: The following table provides descriptions of the columns that appear in the log.

Column Label	Description
Event ID	The number of the event, in sequence from number 1.
Date/Time	The day and time the event occurred. If the Network Time Protocol (NTP) server is enabled to set the ILOM time, the ILOM clock will use Coordinated Universal Time (UTC).
Class/Type	Audit/Log: Commands that result in a configuration change. Description includes user, command, command parameters, and success or fail. IPMI/Log: Any event that is placed in the IPMI SEL is also put in the management log. Chassis/State: For changes to the inventory and general system state. Chassis/Action: Category for shutdown events for server module or chassis, hot insert or removal of a FRU, and Reset Parameters button pushed. FMA/Fault: For Fault Management Architecture (FMA) faults. Description gives time of fault as detected by FMA and suspect component. FMA/Repair: For FMA repairs. Description gives component.
Severity	Critical, Major, or Minor
Description	A description of the event.

Clear the event log: Click the Clear Event Log button. A confirmation dialog box appears. In the confirmation dialog box, click OK to clear the entries.

2.3 Determining That Hardware Has Failed

When a hardware failure occurs, the following actions take place:

One of the following fault LEDs is illuminated:

The amber Service Action Required LED is illuminated on the failed component, and the chassis shelf Service Action Required LEDs (both front and back) are illuminated.

The Temperature Fail LED is illuminated on the chassis shelf, showing that the ambient temperature for the chassis shelf has moved above an acceptable range.

The chassis shelf Service Action Required LEDs serve as summary indicators, notifying you that a hardware failure has occurred on one (or more) of the components in the chassis shelf.

The sensor information in the CMM ILOM identifies which component has experienced a hardware failure. The following topics in this section describe the fault sensors that are activated with component faults.

The fault associated with the hardware failure is recorded in the system event log.

See Section 2.2.3, Monitoring the Event Log or the Sun Integrated Lights Out Manager 2.0 User’s Guide for more information about reading component sensors and the event log.

The following sections contain further details on identifying faults in the system or specific components:

Section 2.3.1, Chassis Shelf Faults

Section 2.3.2, Power Supply Module Faults

Section 2.3.3, Rear Fan Faults

2.3.1 Chassis Shelf Faults

Chassis shelf faults are external faults: There is no hardware failure, but an external condition exists that can adversely affect the operation of the system. Because they are external, chassis shelf faults are auto-clearing; when fault management detects that the external condition has returned to within normal parameters, it clears the fault.

A fault is declared and the chassis shelf Temperature Fail LEDs are illuminated when the external condition represents a potential hazard to the system. It is possible for an external fault to force a shutdown of the entire system.

The chassis shelf Service Action Required LED also lights when there is a fault on a chassis shelf component.

2.3.1.1 Chassis Shelf LED Locations

FIGURE 2-3 and FIGURE 2-4 show the location of the LEDs on the front and rear of the chassis.

FIGURE 2-3 Front Chassis Shelf Fault Indicators

Figure showing front chassis LEDs.

FIGURE 2-4 Rear Chassis Shelf Fault Indicators

Figure showing rear chassis LEDs.

2.3.1.2 Checking Other LEDs

If the Service Action Required LED is lit on the FIM or CMM, check the indicators on the power supplies and fan modules to see if one of these is also lit. Refer to the following sections for more information.

Section 2.3.2, Power Supply Module Faults

Section 2.3.3, Rear Fan Faults

If a blade Service Action Required LED is lit, refer to the blade documentation for servicing the blade.

2.3.1.3 Viewing Chassis Shelf Faults in ILOM

The chassis shelf Temperature Fail LED turns on when at least one of the ambient temperature sensors in the power supply modules reaches 40°C, and the chassis shelf shuts down when the temperature reaches 45°C. See TABLE 2-3 for information about viewing this sensor information.

See the Sun Integrated Lights Out Manager 2.0 User’s Guide for more information about reading this and other chassis shelf sensors.
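
The two ambient temperature thresholds described above can be expressed as a small classification sketch. The following Python is illustrative only: the function and constant names are assumptions, and the readings are hypothetical values that you would take from the T_AMB sensors listed in TABLE 2-3.

```python
LED_ON_C = 40.0     # Temperature Fail LED turns on at this reading
SHUTDOWN_C = 45.0   # chassis shelf shuts down at this reading


def chassis_ambient_status(readings_c):
    """Classify ambient readings (degrees C), one per power supply module."""
    hottest = max(readings_c)  # a single sensor at the threshold is enough
    if hottest >= SHUTDOWN_C:
        return "shutdown"
    if hottest >= LED_ON_C:
        return "temperature fail LED on"
    return "normal"


print(chassis_ambient_status([31.0, 40.5]))
```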

2.3.2 Power Supply Module Faults

There are three power supplies located within each power supply module. The AC-0 LED corresponds to power supply 0 within the power supply module, AC-1 corresponds to power supply 1, and AC-2 corresponds to power supply 2.

If you do not need the full 8400 W of power from the power supplies, you can connect only two of the three plugs, using the AC0 and AC1 connectors for each power supply. Do not connect AC2.

When only two of the available three plugs are connected to the power supplies, 5600 W of power will be supplied to the chassis. The LEDs and ILOM will show different readings than for the three power cord connections. See the notes in the following sections for the differences in configurations.
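
The two power figures above are consistent with each connected cord contributing an equal share of the total. The following Python sketch works through that arithmetic. The equal-share assumption and the function name are introduced for this example only and are not stated in the product specifications.

```python
FULL_POWER_W = 8400   # three-cord configuration, from the text above
CORDS_FULL = 3

# Assumption: each connected cord contributes an equal share.
WATTS_PER_CORD = FULL_POWER_W // CORDS_FULL


def available_power_w(cords):
    """Power available with 2 or 3 cords connected, under the assumption above."""
    if cords not in (2, 3):
        raise ValueError("only the two-cord and three-cord configurations are described")
    return WATTS_PER_CORD * cords


print(available_power_w(2))  # 5600, matching the two-cord figure in the text
```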

2.3.2.1 Power Supply LED Locations

FIGURE 2-5 9000W Power Supply LED Location

Figure showing power supply LEDs.

2.3.2.2 Power Supply Fault LED Functions

TABLE 2-2 shows the operation of the LEDs during normal operation or when a fault has occurred. Refer to the appropriate sensor table to find the location of the fault in the ILOM CLI.

TABLE 2-2 Power Supply Fault LED Functions
Condition	AC-0 LED (Green)	AC-1 LED (Green)	AC-2 LED (Green)	DC LED (Green)	PSU Service LED (Amber)	Fan Service LED (Amber)	Sensor Table
Normal operation (3 cord configuration)	On	On	On	On	Off	Off	n/a
Normal operation (2 cord configuration)	On	On	Off	Off	On	Off	See Appendix A
Over current, over voltage, or over temperature warning fault	On	On	On	Off	On	Off	TABLE 2-4
AC 0 failed	Off	On	On	Off	Off	Off	TABLE 2-4
AC 1 failed	On	Off	On	Off	Off	Off	TABLE 2-4
AC 2 failed	On	On	Off	Off	Off	Off	TABLE 2-4
Front fan failed	On	On	On	On	Off	On	TABLE 2-5
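
TABLE 2-2 can also be read as a lookup from an observed LED pattern to a condition. The following Python sketch encodes the rows of the table exactly as listed. The function name and the order of the tuple (AC-0, AC-1, AC-2, DC, PSU Service, Fan Service) are choices made for this example.

```python
ON, OFF = True, False

# Key order: (AC-0, AC-1, AC-2, DC, PSU Service, Fan Service), per TABLE 2-2.
LED_PATTERNS = {
    (ON, ON, ON, ON, OFF, OFF): "Normal operation (3 cord configuration)",
    (ON, ON, OFF, OFF, ON, OFF): "Normal operation (2 cord configuration)",
    (ON, ON, ON, OFF, ON, OFF):
        "Over current, over voltage, or over temperature warning fault",
    (OFF, ON, ON, OFF, OFF, OFF): "AC 0 failed",
    (ON, OFF, ON, OFF, OFF, OFF): "AC 1 failed",
    (ON, ON, OFF, OFF, OFF, OFF): "AC 2 failed",
    (ON, ON, ON, ON, OFF, ON): "Front fan failed",
}


def describe_leds(ac0, ac1, ac2, dc, psu_service, fan_service):
    """Return the TABLE 2-2 condition for an observed LED pattern."""
    key = (ac0, ac1, ac2, dc, psu_service, fan_service)
    return LED_PATTERNS.get(key, "Pattern not listed in TABLE 2-2")


print(describe_leds(ON, OFF, ON, OFF, OFF, OFF))
```

Note that the two-cord normal pattern and the AC 2 failed pattern differ only in the PSU Service LED, so all six LEDs need to be checked before drawing a conclusion.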

2.3.2.3 Viewing Power Supply Faults in ILOM

If the power supply module LEDs indicate that a power supply or front fan failure has occurred, you can verify the fault by viewing the appropriate sensor through the ILOM CLI. See the Sun Integrated Lights Out Manager 2.0 User’s Guide and Appendix A for details on locating and reading the sensors in the ILOM.

Note - In the tables below, the variable n represents one of the following values: power supply module 0 (PS0), power supply module 1 (PS1), 12V output 0 (S0), 12V output 1 (S1), or 12V output 2 (S2). For example, /CH/PS0/S1 represents 12V output 1 located within power supply module 0.

Unless noted otherwise, the sensors shown in the following tables will display the following value if a fault has occurred:

value = Predictive Failure Asserted

Note - If you are using two power cords per power supply, the ILOM readings will be different. Refer to Section A.2, ILOM Behavior With Two Power Cord Configuration for more information.

TABLE 2-3 Power Supply Module Warnings
Fault Type	CLI Path to Sensor
Power supply input lost or out of range. Possible values are: Presence detected; Power supply failure detected; Predictive failure; Power supply input lost (AC/DC); Power supply input lost or out of range; Power supply input out of range, but present.	/CH/PSn/STATUS
This sensor shows the ambient temperature of the power supply. The CMM LED turns on when the ambient temperature reaches 40°C, and the chassis shelf shuts down when the temperature reaches 45°C.	/CH/PSn/T_AMB
12V_n output current exceeds 240A for 100 msec.	/CH/PSn/Sn/I+12V_WARN
Ambient temperature reaches the following range: 50°C to 60°C.	/CH/PSn/T_AMB_WARN

TABLE 2-4 Power Supply Module Faults
Fault Type	CLI Path to Sensor
Power supply has failed.	/CH/PSn/Sn/AcFAIL
Ambient temperature reaches the following range: 65°C to 75°C. This sensor causes the power supply to shut down.	/CH/PSn/T_AMB_FAULT
12V power output has exceeded 14V for more than 400 msec.	/CH/PSn/Sn/V+12V_FAULT
3.3V power output reaches the following range: 3.7V to 4.3V.	/CH/PSn/V+3_3V_FAULT
12V_n output current exceeds 240A for more than 60 seconds, or 12V_n output current exceeds 275A for 20 msec.	/CH/PSn/Sn/I+12V_FAULT
3.3V output current exceeds 13A for more than 20 msec.	/CH/PSn/I+3_3V_Fault

TABLE 2-5 Front Fan Faults
Fault Type	CLI Path to Sensor
Front fan has failed.	/CH/PSn/FAN_FAIL
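
The sensor paths in the preceding tables follow a regular pattern, so the show commands for a given power supply module can be generated rather than typed by hand. The following Python is a minimal sketch: the helper names are hypothetical, and the sensor names passed in are taken from the tables above.

```python
def psu_sensor_path(module, sensor, output=None):
    """Build /CH/PSn/SENSOR, or /CH/PSn/Sn/SENSOR when a 12V output is given.

    module: power supply module number (the n in PSn)
    output: 12V output number (the n in Sn), or None for module-level sensors
    """
    base = f"/CH/PS{module}"
    if output is not None:
        base += f"/S{output}"
    return f"{base}/{sensor}"


def show_command(path):
    """Return the ILOM CLI command that displays a sensor."""
    return f"show {path}"


# Module-level sensor in power supply module 0.
print(show_command(psu_sensor_path(0, "T_AMB")))
# Sensor on 12V output 1 of power supply module 0.
print(show_command(psu_sensor_path(0, "V+12V_FAULT", output=1)))
```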

2.3.3 Rear Fan Faults

The Sun Blade 6048 chassis shelf contains six rear fans.

2.3.3.1 Rear Fan LED Location

FIGURE 2-6 Rear Fan LED Location

Figure showing rear fan LEDs

2.3.3.2 Rear Fan Fault LED Functions

The rear fan fault LEDs indicate when a failure has occurred on a fan module. The source of the failure could be mechanical, electrical, or the result of a midplane controller failure.

2.3.3.3 Viewing Rear Fan Faults in ILOM

Use the following command to view the sensor for a rear fan fault:

show /CH/FMn/FAIL

The variable n represents the fan module number. For example, /CH/FM1/FAIL indicates a fan failure in fan module 1.

See the Sun Integrated Lights Out Manager 2.0 User’s Guide and Appendix A for more information about reading this and other rear fan sensors.

2.4 Replacing a Faulted Component

When a fault indicates a hardware failure, the recommended method for clearing the fault is to replace the failed component.

To replace a failed component:

1. Determine which system component has experienced a hardware failure.

Look at the Service Action Required LEDs and the event log to get information about the component failure.

See Section 2.3, Determining That Hardware Has Failed.

2. Remove and replace the failed component.

Refer to the instructions in Chapter 4.

3. Monitor the component LEDs to confirm that the fault is cleared.