Sun Netra T5440 Server

Updated: September 2015
 
 

1.6.1 Identifying PSH Detected Faults

When a PSH fault is detected, a Solaris console message similar to the one in Example 1-8 is displayed.

Example 1-8  Console Message Showing Fault Detected by PSH
SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SUNW,Sun-Netra-T5440, CSN: -, HOSTNAME: wgs48-37
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
DESC: The number of errors associated with this memory module has exceeded acceptable levels.  Refer to http://sun.com/msg/SUN4V-8000-DX for more information.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module.  Use fmdump -v -u <EVENT_ID> to identify the module.
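
The message fields can be extracted programmatically, for example to forward the message ID and event ID to a ticketing system. The following Python fragment is a minimal sketch, not a Sun-provided tool. It assumes the message layout shown in Example 1-8 (uppercase labels followed by a colon and a space, with multiple fields on a line separated by a comma and a space). The function name parse_psh_message is hypothetical.

```python
import re

# Sample text copied from the first lines of Example 1-8.
SAMPLE = """\
SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SUNW,Sun-Netra-T5440, CSN: -, HOSTNAME: wgs48-37
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
"""

# Assumption: a field label is uppercase (hyphens allowed) and either starts
# the line or follows ", ". This keeps "SUNW,Sun-Netra-T5440" in one value.
FIELD = re.compile(r"(?:^|, )([A-Z][A-Z-]*): ")


def parse_psh_message(text):
    """Return a dict that maps each field label to its value."""
    fields = {}
    for line in text.splitlines():
        matches = list(FIELD.finditer(line))
        for i, match in enumerate(matches):
            # A value runs up to the next label on the line, or to end of line.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
            fields[match.group(1)] = line[match.end():end].strip()
    return fields
```

Calling parse_psh_message(SAMPLE) returns a dictionary in which the EVENT-ID entry holds the UUID to pass to fmdump -v -u.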

Faults detected by the Solaris PSH facility are also reported through service processor alerts. Example 1-9 shows an ALOM CMT CLI alert for the same fault that Solaris PSH reported in Example 1-8.

Example 1-9  ALOM CMT CLI Alert of PSH Diagnosed Fault
SC Alert: Host detected fault, MSGID: SUN4V-8000-DX

The ALOM CMT CLI showfaults command provides summary information about the fault. See Detecting Faults for more information about the showfaults command.


Note - The Service Required LED also turns on for PSH diagnosed faults.

Using the fmdump Command to Identify Faults

The fmdump command displays the list of faults detected by the Solaris PSH facility and identifies the faulty FRU for a particular EVENT_ID (UUID).

Do not use fmdump to verify that a FRU replacement has cleared a fault, because the fmdump output is the same after the FRU has been replaced. Use the fmadm faulty command to verify that the fault has cleared.

  1. Check the event log using the fmdump command with -v for verbose output.

    In Example 1-10, a fault is displayed with the following details:

    • Date and time of the fault (Jul 31 12:47:42.2007)
    • Universally Unique Identifier (UUID), which is unique for every fault (fd940ac2-d21e-c94a-f258-f8a9bb69d05b)
    • Sun message identifier, which can be used to obtain additional fault information (SUN4V-8000-JA)
    • Faulted FRU. The example includes the part number of the FRU (part=541215101) and the serial number of the FRU (serial=101083). The Location field provides the name of the FRU. In Example 1-10, the FRU name is MB, meaning the motherboard.

      Note - fmdump displays the PSH event log. Entries remain in the log after the fault has been repaired.
  2. Use the Sun message ID to obtain more information about this type of fault.
    1. In a browser, go to the Predictive Self-Healing Knowledge Article web site: http://www.sun.com/msg
    2. Obtain the message ID from the console output or the ALOM CMT CLI showfaults command.
    3. Enter the message ID in the SUNW-MSG-ID field, and click Lookup.

      In Example 1-11, the message ID SUN4V-8000-JA provides information for corrective action.

  3. Follow the suggested actions to repair the fault.
Example 1-10  Output from the fmdump -v Command
# fmdump -v -u fd940ac2-d21e-c94a-f258-f8a9bb69d05b
TIME                 UUID                                 SUNW-MSG-ID
Jul 31 12:47:42.2007 fd940ac2-d21e-c94a-f258-f8a9bb69d05b SUN4V-8000-JA
  100%  fault.cpu.ultraSPARC-T2.misc_regs
 
        Problem in: cpu:///cpuid=16/serial=5D67334847
           Affects: cpu:///cpuid=16/serial=5D67334847
               FRU: hc://:serial=101083:part=541215101/motherboard=0
          Location: MB
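
The details listed in Step 1 can also be pulled out of the fmdump -v output by a script. The following Python fragment is a minimal sketch that assumes the output layout shown in Example 1-10. It only parses captured text and does not run fmdump itself. The function name parse_fmdump_verbose and the result keys are hypothetical.

```python
import re

# Sample text copied from Example 1-10 (without the command line).
SAMPLE = """\
TIME                 UUID                                 SUNW-MSG-ID
Jul 31 12:47:42.2007 fd940ac2-d21e-c94a-f258-f8a9bb69d05b SUN4V-8000-JA
  100%  fault.cpu.ultraSPARC-T2.misc_regs

        Problem in: cpu:///cpuid=16/serial=5D67334847
           Affects: cpu:///cpuid=16/serial=5D67334847
               FRU: hc://:serial=101083:part=541215101/motherboard=0
          Location: MB
"""

UUID_RE = re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b")
LABELS = ("Problem in", "Affects", "FRU", "Location")


def parse_fmdump_verbose(text):
    """Return the UUID, message ID, and FRU details of the first fault."""
    info = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip an echoed shell command line, if present
        uuid = UUID_RE.search(stripped)
        if uuid and "uuid" not in info:
            info["uuid"] = uuid.group(0)
            # Assumption: the message ID is the last column of the event row.
            info["msg_id"] = stripped.split()[-1]
            continue
        label, sep, value = stripped.partition(": ")
        if sep and label in LABELS:
            info[label] = value.strip()
    # Pull serial= and part= out of the FRU identifier.
    for key in ("serial", "part"):
        match = re.search(key + r"=([^:/]+)", info.get("FRU", ""))
        if match:
            info["fru_" + key] = match.group(1)
    return info
```

For the sample text, the result identifies the FRU by its Location value (MB) together with the part and serial numbers, which is the information a service technician needs to order the replacement.
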
Example 1-11  PSH Message Output
CPU errors exceeded acceptable levels
 
Type
    Fault 
Severity
    Major 
Description
    The number of errors associated with this CPU has exceeded acceptable levels. 
Automated Response
    The fault manager will attempt to remove the affected CPU from service. 
Impact
    System performance may be affected. 
 
Suggested Action for System Administrator
    Schedule a repair procedure to replace the affected CPU, the identity of which can be determined using fmdump -v -u <EVENT_ID>. 
 
Details
    The Message ID:  SUN4V-8000-JA indicates diagnosis has determined that a CPU is faulty. The Solaris fault manager arranged an automated attempt to disable this CPU. The recommended action for the system administrator is to contact Sun support so a Sun service technician can replace the affected component. 

Clearing PSH Detected Faults

When the Solaris PSH facility detects faults, the faults are logged and displayed on the console. In most cases, after the fault is repaired, the system detects the corrected state and clears the fault condition automatically. However, you must verify that the fault has been cleared. If the fault condition is not cleared automatically, clear it manually.

  1. After replacing a faulty FRU, power on the server.
  2. At the ALOM CMT CLI prompt, use the showfaults command to identify PSH detected faults.

    PSH detected faults are distinguished from other kinds of faults by the text: Host detected fault.

    Example:

    sc> showfaults -v
    Last POST Run: Wed Jun 29 11:29:02 2007
     
    Post Status: Passed all devices
    ID  Time              FRU                      Fault
    0  Jun 30 22:13:02   /SYS/MB/CMP0/BR1/CH0/D0  Host detected fault, MSGID: SUN4V-8000-DX  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
    
    • If no fault is reported, you do not need to do anything else. Do not perform the subsequent steps.
    • If a fault is reported, perform Step 3 and Step 4.
  3. Run the ALOM CMT CLI clearfault command with the UUID provided in the showfaults output.

    Example:

    sc> clearfault 7ee0e46b-ea64-6565-e684-e996963f7b86
    Clearing fault from all indicted FRUs...
    Fault cleared.
    
  4. Clear the fault from all persistent fault records.

    In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time. To ensure that these messages are not displayed, run the following Solaris command:

    fmadm repair UUID

    Example:

    # fmadm repair 7ee0e46b-ea64-6565-e684-e996963f7b86
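
The procedure above can be partly scripted when showfaults output is captured from the service processor. The following Python fragment is a minimal sketch that assumes the showfaults -v layout shown in Step 2. It finds entries marked Host detected fault and builds the clearfault and fmadm repair command strings for each UUID. It does not connect to the service processor or run any command. The function names psh_faults and clearing_commands are hypothetical.

```python
import re

# Sample text copied from the showfaults -v example in Step 2.
SAMPLE = """\
Last POST Run: Wed Jun 29 11:29:02 2007

Post Status: Passed all devices
ID  Time              FRU                      Fault
0  Jun 30 22:13:02   /SYS/MB/CMP0/BR1/CH0/D0  Host detected fault, MSGID: SUN4V-8000-DX  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
"""

# PSH detected faults carry the text "Host detected fault". The FRU is
# assumed to be the token immediately before that text.
PSH_FAULT = re.compile(
    r"(?P<fru>\S+)\s+Host detected fault, MSGID: (?P<msg_id>\S+)\s+"
    r"UUID: (?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def psh_faults(showfaults_output):
    """Return one dict (fru, msg_id, uuid) per PSH detected fault."""
    return [m.groupdict() for m in PSH_FAULT.finditer(showfaults_output)]


def clearing_commands(fault):
    """Return the ALOM CMT command (Step 3) and the Solaris command (Step 4)."""
    return ("clearfault " + fault["uuid"], "fmadm repair " + fault["uuid"])
```

An empty list from psh_faults corresponds to the case in Step 2 where no fault is reported and no further action is needed. Otherwise, run the first command of each pair at the sc> prompt and the second at the Solaris superuser prompt.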