Hardware Issues

This section describes issues related to SPARC T4-4 server components.

Maximizing Memory Bandwidth

To maximize processor module memory bandwidth, Oracle recommends that only fully populated memory configurations, as opposed to half-populated configurations, be considered for performance-critical applications.

For specific memory installation and upgrade instructions, see the SPARC T4-4 Server Service Manual.

Direct I/O Support

Only certain PCIe cards can be used as direct I/O endpoint devices on an I/O domain. You can still use other cards in your Oracle VM Server for SPARC environment, but these other cards cannot be used with the Direct I/O feature. Instead, other PCIe cards can be used for service domains and for I/O domains that have entire root complexes assigned to them.

For the most up-to-date list of supported PCIe cards, refer to https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=REFERENCE&id=1325454.1

Note - Not all cards listed on the Direct I/O web page are supported in the SPARC T4-4 server. Check the server hardware compatibility list before installing any PCIe cards.

Use Links Labeled SPARC T3 to Download `sas2ircu` Software for SPARC T4 Servers

To download sas2ircu software and documentation for the SPARC T4-4 server from the current LSI web site, you must use links labeled SPARC T3-1 and T3-2. The software and documentation are the same for both sets of servers.

This is the web site for downloading sas2ircu software from LSI:

http://www.lsi.com/sep/Pages/oracle/index.aspx

This is the web site for downloading sas2ircu documentation from LSI:

http://www.lsi.com/sep/Pages/oracle/sparc_t3_series.aspx

Sun Type 6 Keyboards Are Not Supported by SPARC T4 Series Servers

Sun Type 6 keyboards cannot be used with SPARC T4 series servers.

Hardware RAID 1E Not Supported

Although hardware RAID 0 and 1 are supported on the SPARC T4-4 server, hardware RAID 1E is not supported. Other RAID formats are available through software RAID.

Server Panics When Booting From a USB Thumbdrive Attached to the Front USB Ports (Bug ID 15667682)

Note - This issue was originally listed as CR 6983185.

When attempting to boot from a USB thumb drive inserted in either front USB port (USB2 or USB3), the server might panic.

Workaround: Use the server's rear USB ports (USB0 or USB1) whenever booting from an external USB device.

Performance Limitations Occur When Performing a Hot-Plug Installation of a x8 Card Into a Slot Previously Occupied With a x4 Card (Bug ID 15671185)

Note - This issue was originally listed as CR 6987359.

If you hot-plug a Dual 10GbE SFP+ PCIe2.0 EM Network Interface Card (NIC) (part number 1110A-Z) into a PCI Express Module slot that had previously held a 4-Port (Cu) PCIe (x4) ExpressModule (part number (X)7284A-Z-N), the expected performance benefit of the Dual 10GbE SFP+ PCIe2.0 NIC might not occur.

This problem does not occur if the slot was previously unoccupied, or if it had been occupied by any other option card. In addition, this problem does not occur if the card is present when the system is powered on.

Workaround: Hot-plug the Dual 10GbE SFP+ PCIe2.0 EM card a second time, using one of the following methods.

Use the cfgadm(1m) command to disconnect, then reconnect, the card:

# cfgadm -c disconnect slot-name 
# cfgadm -c configure slot-name

Use the hotplug(1m) command to disable and poweroff the device, and then poweron and enable the device:

# hotplug disable device-path slot-name
# hotplug poweroff device-path slot-name
# hotplug poweron device-path slot-name
# hotplug enable device-path slot-name

Use the Attention (ATTN) button on the card to deconfigure and then reconfigure the card.

Note - You don't need to physically remove and re-insert the card as part of the second hot plug operation.

Unrecoverable USB Hardware Errors Occur In Some Circumstances (Bug ID 15677875, Bug ID 15765407)

Note - This issue was originally listed as CR 6995634.

In some rare instances, unrecoverable USB hardware errors occur, such as the following:

usba: WARNING: /pci@400/pci@1/pci@0/pci@8/pci@0/usb@0,2 (ehci0): Unrecoverable USB Hardware Error
usba: WARNING: /pci@400/pci@1/pci@0/pci@8/pci@0/usb@0,1/hub@1/hub@3 (hubd5): Connecting device on port 2 failed

Workaround: Reboot the system. Contact your service representative if these error messages persist.

PSH Might Not Clear a Retired Cache Line on a Replaced CPU Module (Bug ID 15705327, Bug ID 15713018)

Note - This issue was originally listed as CR 7031216.

Note - This issue was fixed in Oracle Solaris 11.1.

When a CPU module is replaced to repair a faulty CPU, PSH might not clear retired cache lines on the replacement FRU. In such cases, the cache line remains disabled.

Workaround: Manually clear the disabled cache line by running the following command:

# fmadm repaired fmri | label

For example:

# fmdump -av
TIME                 UUID                                 SUNW-MSG-ID
Nov 03 10:34:56.6192 e1ee44ed-72f7-c32b-855b-e9f4b03144af SUN4V-8002-V3
  100%  fault.cpu.generic-sparc.cacheline
 
        Problem in: hc://:product-id=ORCL,SPARC-T4-4:product-sn=xxxxyyyxxx:server-id=xxxxx:chassis-id=xxxxyyyxxx/chassis=0/cpuboard=0/chip=0/l3cache=0/cacheindex=256/cacheway=7
           Affects: hc://:product-id=ORCL,SPARC-T4-4:product-sn=xxxxyyyxxx:server-id=xxxxx:chassis-id=xxxxyyyxxx/chassis=0/cpuboard=0/chip=0/l3cache=0/cacheindex=256/cacheway=7
               FRU: hc://:product-id=ORCL,SPARC-T4-4:product-sn=xxxxyyyxxx:server-id=xxxxx:chassis-id=xxxxyyyxxx:serial=465769T+1115H50061:part=7013822:revision=01/chassis=0/cpuboard=0
          Location: /SYS/PM0
 
# fmadm repaired hc://:product-id=ORCL,SPARC-T4-4:product-sn=xxxxyyyxxx:server-id=xxxxx:chassis-id=xxxxyyyxxx/chassis=0/cpuboard=0/chip=0/l3cache=0/cacheindex=256/cacheway=7
fmadm: recorded repair to of hc://:product-id=ORCL,SPARC-T4-4:product-sn=xxxxyyyxxx:server-id=xxxxx:chassis-id=xxxxyyyxxx/chassis=0/cpuboard=0/chip=0/l3cache=0/cacheindex=256/cacheway=7
 
# fmdump -a
TIME                 UUID                                 SUNW-MSG-ID
Nov 03 10:34:56.6192 e1ee44ed-72f7-c32b-855b-e9f4b03144af SUN4V-8002-V3
Nov 03 10:37:40.3545 e1ee44ed-72f7-c32b-855b-e9f4b03144af FMD-8000-4M Repaired
Nov 03 10:37:40.3610 e1ee44ed-72f7-c32b-855b-e9f4b03144af FMD-8000-6U Resolved
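
If several cache lines have to be cleared, the final `fmdump -a` check can be scripted. The following Python sketch is illustrative only and is not an Oracle tool. It assumes the column layout shown in the example (timestamp, UUID, message ID, optional status word). The function name `fault_states` and the placeholder label `Diagnosed` for lines without a status word are hypothetical.

```python
import re
from collections import defaultdict

# Matches event lines such as:
#   Nov 03 10:37:40.3545 e1ee44ed-72f7-c32b-855b-e9f4b03144af FMD-8000-4M Repaired
# The trailing status word is optional (the initial diagnosis line has none).
LINE_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d+\s+"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msgid>\S+)(?:\s+(?P<status>\S+))?\s*$"
)


def fault_states(fmdump_output):
    """Map each fault UUID to the ordered list of status words seen for it.

    Lines without a status word are recorded as "Diagnosed", which is a
    label chosen for this sketch and not something fmdump prints.
    """
    states = defaultdict(list)
    for line in fmdump_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # header line or unrelated text
        states[match.group("uuid")].append(match.group("status") or "Diagnosed")
    return dict(states)
```

A fault whose last recorded status is `Resolved` has been cleared. Any other final status suggests that the `fmadm repaired` step should be checked again.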

PCIe Correctable Errors Might Be Reported (Bug ID 15720000, 15722832)

Note - This issue was originally listed as CR 7051331.

Note - This issue was fixed in Oracle Solaris 11.

In rare cases, PCI Express Gen2 or low-profile PCIe devices in the server might report I/O errors that are identified and reported by PSH. For example:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Aug 10 13:03:23 a7d43aeb-61ca-626a-f47b-c05635f2cf5a  PCIEX-8000-KP  Major
 
Host        : dt214-154
Platform    : ORCL,SPARC-T4-4  Chassis_id  :
Product_sn  :
 
Fault class : fault.io.pciex.device-interr-corr 67%
              fault.io.pciex.bus-linkerr-corr 33%
Affects     : dev:////pci@400/pci@1/pci@0/pci@c
              dev:////pci@400/pci@1/pci@0/pci@c/pci@0
                  faulted but still in service
FRU         : "/SYS/MB" (hc://:product-id=ORCL,SPARC-T4-4:product-sn=xxxx:server-id=xxxx:chassis-id=0000000-0000000000:serial=xxxx:part=541-424304:revision=50/chassis=0/motherboard=0) 67%
              "FEM0" (hc://:product-id=ORCL,SPARCT4-4:product-sn=xxxxx:server-id=xxxxx:chassis-id=0000000-0000000000/chassis=0/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0/pciexbus=2/pciexdev=12/pciexfn=0/pciexbus=62/pciexdev=0) 33%
                  faulty
 
Description : Too many recovered bus errors have been detected, which indicates
              a problem with the specified bus or with the specified
              transmitting device. This may degrade into an unrecoverable
              fault.
              ...
 
Response    : One or more device instances may be disabled
 
Impact      : Loss of services provided by the device instances associated with
              this fault
 
Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device.  Use fmadm faulty to identify the device or
              contact Sun for support.

These errors might indicate a faulty or incorrectly seated device, or they might be erroneous.

Workaround: Ensure that the device is properly seated and functioning. If the errors continue, apply patch 147705-01 or higher.

L2 Cache Uncorrectable Errors Might Lead to an Entire Processor Being Faulted (Bug ID 15727651, Bug ID 15732875, Bug ID 15732876, Bug ID 15733117)

Note - This issue was originally listed as CR 7065563.

Note - This issue was fixed in System Firmware 8.1.4.

An L2 cache uncorrectable error might lead to an entire processor being faulted when only specific core strands should be faulted.

Workaround: Schedule a service call with your authorized Oracle service provider to replace the processor module containing the faulty core. Until it is replaced, you can return the strands related to the functioning cores to service using the following procedure. This restores as much system functionality as the active cores provide.

Identify the faulty core:

# fmdump -eV -c ereport.cpu.generic-sparc.l2tagctl-uc

The detector portion of the fmdump output is displayed as follows.

Note - Key elements in the example are highlighted for emphasis. They would not be highlighted in the actual output.

     detector = (embedded nvlist)
          nvlist version: 0
                  version = 0x0
                  scheme = hc
                  hc-root =
                  hc-list-sz = 4
                  hc-list = (array of embedded nvlists)
                  (start hc-list[0])
                  nvlist version: 0
                          hc-name = chassis
                          hc-id = 0
                  (end hc-list[0])
                  (start hc-list[1])
                  nvlist version: 0
                          hc-name = cpuboard
                          hc-id = 1
                  (end hc-list[1])
                  (start hc-list[2])
                  nvlist version: 0
                          hc-name = chip
                          hc-id = 2
                  (end hc-list[2])
                  (start hc-list[3])
                  nvlist version: 0
                          hc-name = core
                          hc-id = 19
                  (end hc-list[3])
 
          (end detector)

In this example, the faulted chip is indicated by the following FMRI values:

Chassis = 0
CPU Board = 1
Chip = 2
Core = 19
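
Reading these values out of the detector nvlist by hand is error-prone when many ereports are involved. The following Python sketch shows one way to extract them. It is a hedged illustration that assumes every `hc-name` entry in the detector text is followed by its matching `hc-id`, as in the example. The function name `parse_hc_list` is hypothetical.

```python
import re


def parse_hc_list(detector_text):
    """Return {hc-name: hc-id} pairs from an fmdump -eV detector nvlist.

    Names and IDs are paired in the order they appear, so the text must
    list one hc-id for each hc-name (as in the example output).
    """
    names = re.findall(r"hc-name\s*=\s*(\S+)", detector_text)
    ids = re.findall(r"hc-id\s*=\s*(\d+)", detector_text)
    return {name: int(value) for name, value in zip(names, ids)}
```

For the detector shown above, the result would contain `chassis`, `cpuboard`, `chip`, and `core` keys with the values 0, 1, 2, and 19.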

The following table includes additional examples with corresponding Nomenclature Architecture Council (NAC) names.

Sample `fmdump` Output	Corresponding NAC Name
`cpuboard=0/chip=0/core=0`	`/SYS/PM0/CMP0/CORE0`
`cpuboard=1/chip=2/core=16`	`/SYS/PM1/CMP0/CORE0`
`cpuboard=1/chip=2/core=19`	`/SYS/PM1/CMP0/CORE3`

For example, given an FMRI of chassis=0/cpuboard=x/chip=y/core=z, the corresponding NAC name /SYS/PMa/CMPb/COREc can be derived as follows:

a = x
b = (y mod 2)
c = (z mod 8)
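
The derivation above can be expressed as a short Python sketch. This is an illustration of the arithmetic only, not an Oracle-supplied utility. The function name `fmri_to_nac` is hypothetical, and the sketch assumes the FMRI contains the `cpuboard=x/chip=y/core=z` segment in that order.

```python
import re

# Captures x, y, and z from ".../cpuboard=x/chip=y/core=z".
FMRI_RE = re.compile(r"cpuboard=(\d+)/chip=(\d+)/core=(\d+)")


def fmri_to_nac(fmri):
    """Convert an FMRI fragment to a NAC name using a = x, b = y mod 2, c = z mod 8."""
    match = FMRI_RE.search(fmri)
    if match is None:
        raise ValueError(f"no cpuboard/chip/core segment in: {fmri!r}")
    x, y, z = (int(group) for group in match.groups())
    return f"/SYS/PM{x}/CMP{y % 2}/CORE{z % 8}"


print(fmri_to_nac("chassis=0/cpuboard=1/chip=2/core=19"))  # /SYS/PM1/CMP0/CORE3
```

The printed value matches the last row of the table above.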

Halt the Oracle Solaris OS, and power off the server.

Disable the faulty core. From the Oracle ILOM CLI:

-> cd /SYS/PM1/CMP0/CORE0
/SYS/PM1/CMP0/CORE0
-> show
 /SYS/PM1/CMP0/CORE0
          Targets:
              P0
              P1
              P2
              P3
              P4
              P5
              P6
              P7
              L2CACHE
              L1CACHE
 
          Properties:
              type = CPU Core
              component_state = Enabled
 
          Commands:
              cd
              set
              show
 
-> set component_state=disabled

Power on the server, and restart the Oracle Solaris OS.

Refer to the SPARC T4 Series Servers Administration Guide for information on powering on the server from the Oracle ILOM prompt.

Override the FMA diagnosis manually.

The faulty component's UUID value is provided in the first line of the fmdump output.
```
 # fmadm repair uuid-of-fault
```

L2 Cache UEs Are Sometimes Reported as Core Faults Without Any Cache Line Retirements (Bug ID 15731176)

Note - This issue was originally listed as CR 7071237.

When a processor cache line encounters an uncorrectable error (UE), the fault manager is supposed to attempt to retire the cache line involved in the error. Because of this defect, the fault manager might not retire the faulty cache line and instead report the entire chip as faulted.

Workaround: Schedule a replacement of the FRU containing the faulty component. For additional information about UEs in processor cache lines, search for message ID SUN4V-8002-WY on the Oracle support site, http://support.oracle.com.

Upon a Reboot After an Unrecoverable Hardware Error, CPUs Might Not Start (Bug ID 15733431)

Note - This issue was originally listed as CR 7075336.

In rare cases, if the server or server module experiences a serious problem that results in a panic, a number of CPUs might not start when the server is rebooted, even though the CPUs are not faulty.

Example of the type of error displayed:

rebooting...
Resetting...
 
ERROR: 63 CPUs in MD did not start

Workaround: Power cycle the server.

-> stop /SYS
Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS
-> start /SYS
Are you sure you want to start /SYS (y/n) ? y
Starting /SYS

Intermittent Power Supply Faults Occur During Power On (Bug ID 15727974)

Note - This issue was originally listed as CR 7066165.

In rare instances, the system FRU power-up probing routine might fail to list all installed system power supplies. The power supplies themselves are not faulted, but commands listing system FRUs do not show the presence of the non-probed power supply.

The fault sets the system fault LED, but no power supply fault LED is illuminated. To find the fault, use the fmadm utility from the ILOM fault management shell.

Start the fmadm utility from the ILOM CLI:

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp>

To view the fault, type the following:

faultmgmtsp> fmadm faulty
------------------- ------------------------------------ -------------- ------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- ------
2011-09-21/13:59:35 f13524d6-9970-4002-c2e6-de5d750f4088 ILOM-8000-2V   Major
 
Fault class : fault.fruid.corrupt
 
FRU         : /SYS/PS0
              (Part Number: 300-2159)
              (Serial Number: 476856F+1115CC0001)
 
Description : A Field Replaceable Unit (FRU) has a corrupt FRUID SEEPROM
 
Response    : The service-required LED may be illuminated on the affected
              FRU and chassis.
 
Impact      : The system may not be able to use one or more components on
              the affected FRU.  This may prevent the system from powering
              on.
 
Action      : The administrator should review the ILOM event log for
              additional information pertaining to this diagnosis.  Please
              refer to the Details section of the Knowledge Article for
              additional information.

Workaround: From the fault management shell prompt, clear the fault, exit the fault management shell, and reset the SP. For example:

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp> fmadm repair /SYS/PS0
faultmgmtsp> exit
 
-> reset /SP
Are you sure you want to reset /SP (y/n)? y

After the SP has reset, verify that all installed power supplies appear in the list of system devices:

-> ls /SYS

If the problem occurs again after applying this workaround, contact your authorized Oracle Service Provider for further assistance.

Non-Critical Power Supply Threshold Messages Occur Under Heavy Load (Bug ID 15728319)

Note - This issue was originally listed as CR 7066726.

In some instances under heavy load, power supply threshold messages similar to the following appear in the /var/adm/messages file:

SC Alert: [ID 579591 daemon.notice] Sensor | minor: Power Unit : /SYS/VPS : Upper Non-critical going high : reading 2140 >= threshold 2140 Watts
SC Alert: [ID 807701 daemon.notice] Sensor | minor: Power Unit : /SYS/VPS : Upper Non-critical going low  : reading 2100 <= threshold 2140 Watts
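
To see how often these threshold crossings occur, the messages can be extracted from a copy of the log. The following Python sketch is illustrative only and assumes the exact message layout shown in the two sample lines. The function name `parse_power_alerts` and the dictionary keys are hypothetical.

```python
import re

# Matches the sensor portion of an "SC Alert" power threshold message, e.g.:
#   Sensor | minor: Power Unit : /SYS/VPS : Upper Non-critical going high :
#   reading 2140 >= threshold 2140 Watts
ALERT_RE = re.compile(
    r"Sensor \| (?P<severity>\w+): Power Unit : (?P<sensor>\S+) : "
    r"(?P<event>.+?)\s*: reading (?P<reading>\d+) (?P<op>[<>]=) "
    r"threshold (?P<threshold>\d+) Watts"
)


def parse_power_alerts(log_text):
    """Return one dict per power threshold message found in log_text."""
    alerts = []
    for line in log_text.splitlines():
        match = ALERT_RE.search(line)
        if match is None:
            continue  # not a power threshold message
        alert = match.groupdict()
        alert["reading"] = int(alert["reading"])      # watts
        alert["threshold"] = int(alert["threshold"])  # watts
        alerts.append(alert)
    return alerts
```

Each returned entry carries the sensor path, the event text (for example, `Upper Non-critical going high`), and the reading and threshold in watts.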