This issue was fixed in Sun System Firmware 9.5.2.
If the primary domain is configured without enough resources (two SCCs or fewer) and correctable errors trigger an FMA retirement action affecting both of these SCCs, the domain hangs upon reboot. Other domains are not affected and continue to run normally as long as their own network cards and drives are still available. If an error triggers a domain retirement, you can view the fault using the fmadm faulty command, as in the following example:
SUNW-MSG-ID: SPSUN4V-8001-YA, TYPE: Problem, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Oct 6 18:50:50 EDT 2015
PLATFORM: SPARC T7-2, CSN: 12345678, HOSTNAME: bur-t72-303-sp
SOURCE: fdd, REV: 1.0
EVENT-ID: f78853a2-87cf-e147-efb3-ecc370ef147e
DESC: An event was received indicating a fault was diagnosed by another fault manager.
AUTO-RESPONSE: Refer to the document at http://support.oracle.com/msg/SPSUN4V-8001-YA.
IMPACT: Refer to the document at http://support.oracle.com/msg/SPSUN4V-8001-YA.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/SPSUN4V-8001-YA for the latest service procedures and policies regarding this diagnosis.
-> fmadm faulty
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------
2015-10-06/22:51:00 abea80bd-6d18-46a4-e9cc-fda7df765748 SPSUN4V-8001-YA Major
Problem Status : open [injected]
Diag Engine : fdd 1.0
System
Manufacturer : Oracle Corporation
Name : SPARC T7-2
Part_Number : 87654321
Serial_Number : 12345678
----------------------------------------
Suspect 1 of 1
Fault class : fault.cpu.generic-sparc.l2d-uc
Certainty : 100%
Affects : /SYS/MB/CM0/CMP/SCC3/L2D1
Status : faulted
FRU
Status : faulty
Location : /SYS/MB
Manufacturer : Oracle Corporation
Name : ASY,MB,T7-2
Part_Number : 7093274
Revision : 02
Serial_Number : 465769T+1434NH00JJ
Chassis
Manufacturer : Oracle Corporation
Name : SPARC T7-2
Part_Number : 87654321
Serial_Number : 12345678
Description : A cpu has experienced an uncorrectable level 2 data cache
error (UE).
Response : Cpu cores associated with the cache will be deconfigured.
Impact : Some services may be lost and performance may be impacted.
Action : Use 'fmadm faulty' to provide a more detailed view of this
event. Please refer to the associated reference document at
http://support.oracle.com/msg/SPSUN4V-8001-YA for the latest
service procedures and policies regarding this diagnosis.
------------------- ------------------------------------ -------------- --------
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------
2015-10-06/22:50:50 f78853a2-87cf-e147-efb3-ecc370ef147e SPSUN4V-8001-YA Major
Problem Status : open [injected]
Diag Engine : fdd 1.0
System
Manufacturer : Oracle Corporation
Name : SPARC T7-2
Part_Number : 87654321
Serial_Number : 12345678
----------------------------------------
Suspect 1 of 1
Fault class : fault.cpu.generic-sparc.l2d-uc
Certainty : 100%
Affects : /SYS/MB/CM0/CMP/SCC3/L2D0
Status : faulted
FRU
Status : faulty
Location : /SYS/MB
Manufacturer : Oracle Corporation
Name : ASY,MB,T7-2
Part_Number : 7093274
Revision : 02
Serial_Number : 465769T+1434NH00JJ
Chassis
Manufacturer : Oracle Corporation
Name : SPARC T7-2
Part_Number : 87654321
Serial_Number : 12345678
Description : A cpu has experienced an uncorrectable level 2 data cache
error (UE).
Response : Cpu cores associated with the cache will be deconfigured.
Impact : Some services may be lost and performance may be impacted.
Action : Use 'fmadm faulty' to provide a more detailed view of this
event. Please refer to the associated reference document at
http://support.oracle.com/msg/SPSUN4V-8001-YA for the latest
service procedures and policies regarding this diagnosis.
If the fault is reported on the same cores that run the primary domain, it causes a domain retirement, and the primary domain hangs upon reboot.
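To see quickly whether several reported faults land on the same SCC, the "Affects" lines of the fmadm faulty output can be grouped by their SCC path. The following is a minimal sketch in Python, not an Oracle-provided tool. It assumes the output has been saved as text and that component paths follow the /SYS/.../SCCn/... form shown in the example above. The function name faults_by_scc is hypothetical.

```python
import re
from collections import defaultdict

# Captures the component path on each "Affects : <path>" line.
AFFECTS_RE = re.compile(r"^[ \t]*Affects[ \t]*:[ \t]*(\S+)", re.MULTILINE)
# Captures the path up to and including the SCC component (assumed naming: SCC<n>).
SCC_RE = re.compile(r"^(.*?/SCC\d+)(?:/|$)")


def faults_by_scc(fmadm_output):
    """Group faulted component paths by their parent SCC path.

    Paths that do not contain an SCC component are ignored.
    """
    grouped = defaultdict(list)
    for path in AFFECTS_RE.findall(fmadm_output):
        match = SCC_RE.match(path)
        if match:
            grouped[match.group(1)].append(path)
    return dict(grouped)


if __name__ == "__main__":
    # Two "Affects" lines taken from the example output above.
    sample = (
        "Affects : /SYS/MB/CM0/CMP/SCC3/L2D1\n"
        "Status : faulted\n"
        "Affects : /SYS/MB/CM0/CMP/SCC3/L2D0\n"
    )
    for scc, components in faults_by_scc(sample).items():
        print(scc, len(components), "faulted component(s)")
```

An SCC that appears with more than one faulted component, or several faulted SCCs that all belong to the primary domain, indicates the situation described in this issue. Mapping SCCs to the cores assigned to the primary domain is outside the scope of this sketch.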
Workaround: Ensure that the primary domain is assigned more than two SCCs (that is, a minimum of two SCCs plus a few additional cores) on the same node.
Recovery: Force a reset of the domain (reset -f /HOST) to regain access. Upon reboot, the server cannot access the most recently saved SPM configuration and reverts to the factory default configuration instead.