SPARC M5-32 and SPARC M6-32 Servers Product Notes

Updated: March 2016
 
 

Intermittent Recovery Failure From SLINK UEs (17290820)

A red state condition can be identified by host console output similar to this:

Redstate trap occurred on socket 4 strand 80
2013-08-08 18:17:03  4:10:0> NOTICE:  Redstate handler finished

After a red state condition, autorecovery is initiated. However, even when the host's autorunonerror property is set to powercycle, the host might not complete the automatic restart. Fault messages similar to the following might appear on the host console during autorecovery:

2013-08-08 18:41:51     SP> NOTICE: Faulted /SYS/SSB7/SA/SLINK12 will exclude /SYS/CMU2/CMP1 on future reboots
2013-08-08 18:41:52     SP> NOTICE: Abort boot due to /SYS/SSB7/SA/SLINK12. Power Cycle Host
2013-08-08 18:41:53     SP> NOTICE: Faulted /SYS/CMU2/CMP1/SLINK4 will exclude /SYS/CMU2/CMP1 on future reboots
2013-08-08 18:41:56     SP> NOTICE: Start Host in progress: Step 6 of 9
2013-08-08 18:42:04     SP> NOTICE: Faulted /SYS/SSB7/SA/SLINK13 will exclude /SYS/CMU0/CMP1 on future reboots
.
.
.
2013-08-08 18:43:13     SP> NOTICE: Check for usable CPUs in /SYS/DCU0
2013-08-08 18:43:14     SP> NOTICE: Exclude /SYS/CMU0/CMP0. Reason: Prior fault on dependent resource
2013-08-08 18:43:15     SP> NOTICE: Exclude /SYS/CMU0/CMP1. Reason: Prior fault on dependent resource
.
.
.
2013-08-08 18:43:19     SP> NOTICE: Apply configuration rules to /SYS/DCU0
2013-08-08 18:43:20     SP> NOTICE: Exclude all of /SYS/DCU0.  Reason: No configurable CPU in an even slot
2013-08-08 18:43:21     SP> NOTICE: HOST0 cannot be restarted. Reason: No configurable CPUs
2013-08-08 18:44:03     SP> NOTICE: Host is off 
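
When a saved console log is available, the failure signature shown above can be spotted with a short script. The following is a minimal sketch in Python, not an Oracle-provided tool: the function name `summarize_console` is hypothetical, and the patterns are derived only from the sample messages in this note, so real console output might need adjusted expressions.

```python
import re

# Patterns are based solely on the sample console messages in this note
# (assumption: real output follows the same wording).
FAULTED_RE = re.compile(
    r"NOTICE: Faulted (\S+) will exclude (\S+) on future reboots"
)
NO_RESTART_RE = re.compile(
    r"NOTICE: (HOST\d+) cannot be restarted\. Reason: (.+)"
)


def summarize_console(lines):
    """Return (faulted, blocked) extracted from SP console lines.

    faulted: list of (faulted component, excluded component) pairs
    blocked: dict mapping host name to the reason it could not restart
    """
    faulted = []
    blocked = {}
    for line in lines:
        match = FAULTED_RE.search(line)
        if match:
            faulted.append((match.group(1), match.group(2)))
            continue
        match = NO_RESTART_RE.search(line)
        if match:
            blocked[match.group(1)] = match.group(2).strip()
    return faulted, blocked
```

A non-empty `blocked` result together with SLINK paths in `faulted` suggests the condition described in this note, in which case the manual workaround applies.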

Workaround: Manually stop the hosts, acquit the faults, and start the hosts.

  1. Stop all hosts.

    -> stop /Servers/PDomains/PDomain_x/HOST
    

    where x is 0, 1, 2, or 3. Run the command once for each host.

  2. Start the Oracle ILOM fault management shell.

    -> start -script /SP/faultmgmt/shell
    
  3. List the faults.

    faultmgmtsp> fmadm faulty
    
  4. Record the UUIDs of the faults that affect SLINKs.

    For example:

    Time                UUID                                 msgid           Severity 
    ------------------- ------------------------------------ --------------  --------
    2013-08-16/12:56:32 09135d98-eafb-ee84-8643-fd8bb879cb6f SPSUN4V-8001-83 Critical 
    .
    .
    .
    Suspect 1 of 2 
    		Fault class  : fault.asic.switch.c2c-uc 
    		Certainty    : 50% 
    		Affects      : /SYS/SSB7/SA/SLINK13 
    		Status       : faulted 
    .
    .
    .
    Suspect 2 of 2 
    		Fault class  : fault.cpu.generic-sparc.c2c-uc 
    		Certainty    : 50% 
    		Affects      : /SYS/CMU0/CMP1/SLINK4 
    		Status       : faulted 
    .
    .
    .
    

    In this example, both suspects (/SYS/SSB7/SA/SLINK13 and /SYS/CMU0/CMP1/SLINK4) belong to the same fault, whose UUID is 09135d98-eafb-ee84-8643-fd8bb879cb6f.

  5. Acquit the faults.

    faultmgmtsp> fmadm acquit UUID
    

    where UUID is the UUID of a fault recorded in Step 4. For example:

    faultmgmtsp> fmadm acquit 09135d98-eafb-ee84-8643-fd8bb879cb6f
    
  6. Repeat Step 5 for each remaining fault recorded in Step 4.

  7. Exit the Oracle ILOM fault management shell.

    faultmgmtsp> exit
    ->
    
  8. Start all hosts.

    -> start /Servers/PDomains/PDomain_x/HOST
    

    where x is 0, 1, 2, or 3. Run the command once for each host.
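
When `fmadm faulty` reports many faults, picking out the SLINK-related UUIDs by hand (Steps 4 through 6) is error-prone. The following Python sketch shows one way to extract them from captured output and build the matching acquit commands. It is an illustration only: the function names `slink_fault_uuids` and `acquit_commands` are hypothetical, and the parsing assumes the output layout shown in the Step 4 example. It does not connect to the SP or run any command.

```python
import re

# Row that opens a fault entry: timestamp, UUID, msgid, severity.
# Layout is assumed from the Step 4 example in this note.
ENTRY_RE = re.compile(
    r"^\s*\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2}\s+"
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+\S+\s+\S+"
)
AFFECTS_RE = re.compile(r"^\s*Affects\s*:\s*(\S+)")


def slink_fault_uuids(text):
    """Return UUIDs of faults with at least one suspect that affects an SLINK."""
    uuids = []
    current = None  # UUID of the fault entry being read
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            current = match.group(1)
            continue
        match = AFFECTS_RE.match(line)
        if match and current is not None:
            # An SLINK is identified by the last component of the path.
            component = match.group(1).rsplit("/", 1)[-1]
            if component.startswith("SLINK") and current not in uuids:
                uuids.append(current)
    return uuids


def acquit_commands(uuids):
    """Build the fault management shell commands for Step 5."""
    return [f"fmadm acquit {uuid}" for uuid in uuids]
```

Review the generated commands before entering them at the `faultmgmtsp>` prompt, since the sketch selects faults purely by component name.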