This issue was fixed in Sun System Firmware 9.5.2.
If the primary domain is configured without enough resources (two SCCs or fewer) and correctable errors trigger an FMA retirement action affecting both these SCCs, then the domain hangs upon reboot. Other domains are not affected, and continue to run normally as long as their own network cards and drives are still available. If an error triggers a domain retirement, you can view the fault using the fmadm faulty command.
SUNW-MSG-ID: SPSUN4V-8001-YA, TYPE: Problem, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Oct 6 18:50:50 EDT 2015
PLATFORM: SPARC T7-2, CSN: 12345678, HOSTNAME: bur-t72-303-sp
SOURCE: fdd, REV: 1.0
EVENT-ID: f78853a2-87cf-e147-efb3-ecc370ef147e
DESC: An event was received indicating a fault was diagnosed by another fault manager.
AUTO-RESPONSE: Refer to the document at http://support.oracle.com/msg/SPSUN4V-8001-YA.
IMPACT: Refer to the document at http://support.oracle.com/msg/SPSUN4V-8001-YA.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/SPSUN4V-8001-YA for the latest service procedures and policies regarding this diagnosis.

-> fmadm faulty
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2015-10-06/22:51:00 abea80bd-6d18-46a4-e9cc-fda7df765748 SPSUN4V-8001-YA Major

Problem Status : open [injected]
Diag Engine : fdd 1.0
System
   Manufacturer : Oracle Corporation
   Name : SPARC T7-2
   Part_Number : 87654321
   Serial_Number : 12345678

----------------------------------------
Suspect 1 of 1
   Fault class : fault.cpu.generic-sparc.l2d-uc
   Certainty : 100%
   Affects : /SYS/MB/CM0/CMP/SCC3/L2D1
   Status : faulted

   FRU
      Status : faulty
      Location : /SYS/MB
      Manufacturer : Oracle Corporation
      Name : ASY,MB,T7-2
      Part_Number : 7093274
      Revision : 02
      Serial_Number : 465769T+1434NH00JJ
      Chassis
         Manufacturer : Oracle Corporation
         Name : SPARC T7-2
         Part_Number : 87654321
         Serial_Number : 12345678

Description : A cpu has experienced an uncorrectable level 2 data cache error (UE).
Response : Cpu cores associated with the cache will be deconfigured.
Impact : Some services may be lost and performance may be impacted.
Action : Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/SPSUN4V-8001-YA for the latest service procedures and policies regarding this diagnosis.

------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2015-10-06/22:50:50 f78853a2-87cf-e147-efb3-ecc370ef147e SPSUN4V-8001-YA Major

Problem Status : open [injected]
Diag Engine : fdd 1.0
System
   Manufacturer : Oracle Corporation
   Name : SPARC T7-2
   Part_Number : 87654321
   Serial_Number : 12345678

----------------------------------------
Suspect 1 of 1
   Fault class : fault.cpu.generic-sparc.l2d-uc
   Certainty : 100%
   Affects : /SYS/MB/CM0/CMP/SCC3/L2D0
   Status : faulted

   FRU
      Status : faulty
      Location : /SYS/MB
      Manufacturer : Oracle Corporation
      Name : ASY,MB,T7-2
      Part_Number : 7093274
      Revision : 02
      Serial_Number : 465769T+1434NH00JJ
      Chassis
         Manufacturer : Oracle Corporation
         Name : SPARC T7-2
         Part_Number : 87654321
         Serial_Number : 12345678

Description : A cpu has experienced an uncorrectable level 2 data cache error (UE).
Response : Cpu cores associated with the cache will be deconfigured.
Impact : Some services may be lost and performance may be impacted.
Action : Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/SPSUN4V-8001-YA for the latest service procedures and policies regarding this diagnosis.
This issue is the root cause of a domain retirement if the fault is reported on the same cores that run the primary domain and the primary domain then hangs upon reboot.
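The comparison described above, between the components named in the fault report and the cores that run the primary domain, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads the "Affects" lines from captured fmadm faulty output, and the set of SCC paths assigned to the primary domain is a hypothetical value that you would have to supply yourself, since this note does not describe how to obtain it. The function names are likewise made up for the example.

```python
import re

# Captures the resource path up to and including the SCCn component of an
# "Affects" line, for example /SYS/MB/CM0/CMP/SCC3 from
# "Affects : /SYS/MB/CM0/CMP/SCC3/L2D1".
AFFECTS_RE = re.compile(r"Affects\s*:\s*(/SYS/\S*?/SCC\d+)(?:/\S*)?")


def faulted_sccs(fmadm_output):
    """Return the set of SCC paths named in 'Affects' lines."""
    return set(AFFECTS_RE.findall(fmadm_output))


def primary_sccs_with_faults(fmadm_output, primary_sccs):
    """Return the primary-domain SCCs that appear in the fault report."""
    return faulted_sccs(fmadm_output) & set(primary_sccs)


# The two "Affects" lines from the sample output in this note.
sample = """\
Affects : /SYS/MB/CM0/CMP/SCC3/L2D1
Affects : /SYS/MB/CM0/CMP/SCC3/L2D0
"""

# Hypothetical assignment: assume the primary domain runs on these two SCCs.
primary = {"/SYS/MB/CM0/CMP/SCC2", "/SYS/MB/CM0/CMP/SCC3"}

print(sorted(primary_sccs_with_faults(sample, primary)))
# prints ['/SYS/MB/CM0/CMP/SCC3']
```

A non-empty result only indicates that a reported fault overlaps the SCCs you listed for the primary domain. It does not by itself confirm that the domain will hang.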
Workaround: Ensure that the primary domain is assigned more than two SCCs (that is, a minimum of two SCCs and a few additional cores) on the same node.
Recovery: Force a reset of the domain (reset -f /HOST) to regain access. Upon reboot, the server cannot access the most recently saved SPM configuration and reverts to the factory default configuration instead.