Hardware Issues

This section describes issues related to server components.

Direct I/O Support

Only certain PCIe cards can be used as direct I/O endpoint devices on an I/O domain. You can still use other cards in your Oracle VM Server for SPARC environment, but these other cards cannot be used with the Direct I/O feature. Instead, these PCIe cards can be used for service domains and for I/O domains that have entire root complexes assigned to them.

For the most up-to-date list of PCIe cards that support the Direct I/O feature, refer to https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=REFERENCE&id=1325454.1

Note - Not all cards listed on the Direct I/O web page are supported in the SPARC T3-4 server. Check the server hardware compatibility list before installing any PCIe cards.

Sun Type 6 Keyboards Are Not Supported by SPARC T3 Series Servers

Sun Type 6 keyboards cannot be used with SPARC T3 series servers.

Hardware RAID 1E Not Supported

Although hardware RAID 0 and 1 are supported on the SPARC T3-4 server, hardware RAID 1E is not supported. Other RAID formats are available through software RAID.

I/O Performance Might Degrade When Using More Than Two Ports Across Multiple Sun Dual 10 GbE SFP+ PCIe Cards (CR 6943558)

Excessive packet loss has been seen when three or more ports are used across multiple Sun Dual 10GbE SFP+ PCIe cards. This loss is likely to significantly degrade transmit and receive performance. When only two ports are used, packet loss is minimal and transmit/receive performance is as expected.

Workaround:

If you are experiencing network performance issues, use one of the following procedures to enable flow control for the interfaces. This will greatly reduce packet loss and improve performance.

Enable Flow Control (With a System Reboot)

Add the following lines to the /kernel/drv/ixgbe.conf file:

fm_capable = 0;
flow_control = 3;
tx_queue_number = 2;
rx_queue_number = 6;
intr_throttling = 1000;

Reboot the system.
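If you apply this change on several systems, you might prefer to script the edit to ixgbe.conf. The following Python sketch is a hypothetical helper, not an Oracle tool. merge_ixgbe_tunables takes the current file contents as a string and returns new contents in which each of the five tunables above appears once with the listed value. It assumes the one-line name = value; syntax shown above, and it leaves writing the file (and keeping a backup) to you.

```python
import re

# Tunables from the CR 6943558 workaround, in the order listed above.
IXGBE_TUNABLES = [
    ("fm_capable", "0"),
    ("flow_control", "3"),
    ("tx_queue_number", "2"),
    ("rx_queue_number", "6"),
    ("intr_throttling", "1000"),
]


def merge_ixgbe_tunables(conf_text: str) -> str:
    """Return conf_text with each workaround tunable set exactly once.

    Hypothetical helper: existing single-line 'name = value;' settings for
    these tunables are dropped and the workaround values are appended at the
    end, so running it twice gives the same result. Comment lines are kept.
    """
    names = {name for name, _ in IXGBE_TUNABLES}
    kept = []
    for line in conf_text.splitlines():
        # Match settings such as 'flow_control = 0;' (leading whitespace allowed).
        m = re.match(r"\s*([A-Za-z_][A-Za-z0-9_]*)\s*=", line)
        if m and m.group(1) in names:
            continue  # superseded by the appended block below
        kept.append(line)
    kept.extend(f"{name} = {value};" for name, value in IXGBE_TUNABLES)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    sample = "# ixgbe driver configuration\nflow_control = 0;\n"
    print(merge_ixgbe_tunables(sample), end="")
```

Appending the values after removing old ones (rather than editing in place) keeps the helper idempotent, which makes it safe to rerun.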

Enable Flow Control (Without a System Reboot)

Add the following lines to the /kernel/drv/ixgbe.conf file:

fm_capable = 0;
flow_control = 3;
tx_queue_number = 2;
rx_queue_number = 6;
intr_throttling = 1000;

Unplumb all the ixgbe interfaces.
Type the update_drv ixgbe command:
```
# update_drv ixgbe
```
Plumb all the ixgbe interfaces.
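The order of these steps matters: every ixgbe interface must be unplumbed before update_drv runs. The following Python sketch only builds that command sequence for a list of interface names; it runs nothing. ixgbe_reload_commands is a hypothetical helper, the interface names are examples, and the ifconfig plumb/unplumb form assumes Oracle Solaris 10 networking (on Oracle Solaris 11, IP interfaces are typically managed with ipadm instead). It does not restore IP address configuration after plumbing.

```python
def ixgbe_reload_commands(interfaces):
    """Build the unplumb / update_drv / plumb sequence for ixgbe interfaces.

    Hypothetical helper: assumes Oracle Solaris 10-style 'ifconfig <if> plumb'
    and 'ifconfig <if> unplumb' syntax; it only returns command strings.
    """
    # Only ixgbe instances (ixgbe0, ixgbe1, ...) need to be unplumbed.
    ixgbe = sorted(name for name in interfaces if name.startswith("ixgbe"))
    commands = [f"ifconfig {name} unplumb" for name in ixgbe]
    # update_drv makes the system reread /kernel/drv/ixgbe.conf.
    commands.append("update_drv ixgbe")
    commands += [f"ifconfig {name} plumb" for name in ixgbe]
    return commands


if __name__ == "__main__":
    for cmd in ixgbe_reload_commands(["e1000g0", "ixgbe1", "ixgbe0"]):
        print(cmd)
```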

`PARALLEL_BOOT/HOST_LAST_POWER_STATE=enabled` Failed, Unexpected Power State (Off) After AC Cycle (CR 6994047)

When HOST_LAST_POWER_STATE is set to enabled and the system then goes through an AC power cycle, the host is sometimes shown as OFF when the power-up operation completes. This status information might be false.

Recovery:

Power cycle the system again to clear the false status information.

Server Panics When Booting From a USB Thumbdrive Attached to the Front USB Ports (CR 6983185)

When you attempt to boot from a USB thumbdrive (portable USB flash drive) inserted in one of the front USB ports (USB2 or USB3), the server panics and fails to boot.

Workaround:

Use the server's rear USB ports (USB0 or USB1) when booting from an external USB thumbdrive.

Copper QSFP Cables Not Supported (CR 6941888)

The SPARC T3-4 Server 10 Gb Network Module does not support copper QSFP cables. The network module supports only optical QSFP transceiver modules and cables.

Workaround:

Use the cable specified in the list of supported system options.

Performance Limitations Occur When Performing a Hot-Plug Installation of an x8 Card Into a Slot Previously Occupied With an x4 Card (CR 6987359)

If you hot-plug a Dual 10GbE SFP+ PCIe2.0 Niantic EM Network Interface Card (NIC) (part number 1110A-Z) into a PCI Express Module slot that had previously held a 4-Port (Cu) PCIe (x4) Northstar ExpressModule (part number (X)7284A-Z-N), the expected performance benefit of the Dual 10GbE SFP+ PCIe2.0 Niantic NIC might not occur.

This problem does not occur if the slot was previously unoccupied or had been occupied by any other option card. Nor does it occur if the card is present when the system is powered on.

Workaround:

Hot-plug the Dual 10GbE SFP+ PCIe2.0 Niantic EM card a second time, using one of the following methods.

Use the cfgadm(1M) command to disconnect the card and then reconnect and configure it:

# cfgadm -c disconnect slot-name 
# cfgadm -c configure slot-name

Use the hotplug(1M) command to disable and power off the device, and then power on and enable it:

# hotplug disable device-path slot-name
# hotplug poweroff device-path slot-name
# hotplug poweron device-path slot-name
# hotplug enable device-path slot-name

Use the Attention (ATTN) button on the card to deconfigure and then reconfigure the card.

Note - You don't need to physically remove and re-insert the card as part of the second hot plug operation.

Error Messages Not Retained After UE and CE Memory Failures (CR 6990058)

If your server's memory experiences an uncorrectable error (UE) followed by a correctable error (CE), the correct error messages are not generated and are not retained by the service processor. As a result, you cannot diagnose the memory problem.

Workaround:

Reboot the system. If memory problems persist, contact your service representative for assistance.

Watchdog Timeouts Might Occur Under Very Heavy Load (CR 6994535)

Under certain unusually heavy workloads, the host might appear to reset suddenly back to OBP without any sign of a crash or a panic. The ILOM event log contains a “Host watchdog expired” entry.

Display the SP event log:

-> show /SP/logs/event/list

If this issue is affecting the server, the event log contains an entry labeled “Host watchdog expired.”

Workaround:

Contact your authorized service provider to see if a fix is available.

You can also extend the watchdog timeout period by adding this entry to the Oracle Solaris /etc/system file:

set watchdog_timeout = 600000

This extends the watchdog timeout period to 10 minutes (600,000 milliseconds).
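Because the value is in milliseconds, a timeout of N minutes corresponds to N × 60 × 1000; for 10 minutes that is 600,000. The following Python sketch builds the /etc/system line for a chosen number of minutes. watchdog_timeout_entry is a hypothetical helper, not an Oracle utility, and it relies on the milliseconds unit stated above.

```python
def watchdog_timeout_entry(minutes: int) -> str:
    """Return an /etc/system line that sets watchdog_timeout to `minutes`.

    Hypothetical helper: assumes watchdog_timeout is in milliseconds,
    as stated in these notes.
    """
    if minutes <= 0:
        raise ValueError("timeout must be a positive number of minutes")
    milliseconds = minutes * 60 * 1000  # 1 minute = 60 s = 60,000 ms
    return f"set watchdog_timeout = {milliseconds}"


if __name__ == "__main__":
    print(watchdog_timeout_entry(10))  # set watchdog_timeout = 600000
```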

In extreme cases, you can also disable the watchdog timeout altogether by adding this entry to the /etc/system file:

set watchdog_enabled = 0

Note - You must reboot the server for any /etc/system modification to take effect.

Unrecoverable USB Hardware Errors Occur In Some Circumstances (CR 6995634)

In some rare instances, unrecoverable USB hardware errors occur, such as the following:

usba: WARNING: /pci@400/pci@1/pci@0/pci@8/pci@0/usb@0,2 (ehci0): Unrecoverable USB Hardware Error
usba: WARNING: /pci@400/pci@1/pci@0/pci@8/pci@0/usb@0,1/hub@1/hub@3 (hubd5): Connecting device on port 2 failed

Workaround:

Reboot the system. Contact your service representative if these error messages persist.
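Because the workaround depends on whether these messages persist after a reboot, it can help to count them. The following Python sketch counts usba warnings per driver instance (such as ehci0 or hubd5) in saved log text, for example a copy of /var/adm/messages. usb_error_counts is a hypothetical helper, and the pattern assumes the message layout shown above, optionally with syslog fields (such as an [ID ... kern.warning] tag) between usba: and WARNING:.

```python
import re
import sys
from collections import Counter

# usba warnings as shown above; '.*?' also tolerates syslog fields such as
# an '[ID ... kern.warning]' tag between 'usba:' and 'WARNING:'.
USBA_WARNING = re.compile(r"usba:.*?WARNING: (\S+) \((\w+)\): (.+)$")


def usb_error_counts(messages_text):
    """Count usba warnings per (driver instance, message) in saved log text."""
    counts = Counter()
    for line in messages_text.splitlines():
        m = USBA_WARNING.search(line)
        if m:
            _device_path, instance, message = m.groups()
            counts[(instance, message.strip())] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python usb_errors.py /var/adm/messages
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for (instance, message), n in usb_error_counts(f.read()).most_common():
            print(f"{n:5d}  {instance}: {message}")
```

Comparing the counts from before and after the reboot shows whether the errors persist.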

Replace Faulty DIMMs With Uncorrectable Errors (UEs) As Soon As Possible (CR 6996144)

If a DIMM has an uncorrectable error (UE), the server will generate a fault.memory.bank error that labels a DIMM as faulty. You can view this error using the Oracle ILOM show faulty command or the fmdump -v command.

If a DIMM in your system contains a persistent uncorrectable error (an error that continually occurs even after multiple reboots), replace this DIMM as soon as possible to avoid any server downtime.

Workaround:

Instead of scheduling downtime to replace the faulty DIMMs, replace the faulty DIMMs as soon as possible. Contact your service representative for assistance.

Service Processor Does Not Always Initialize When AC Power Is Removed for Less Than 120 Seconds (CR 6997182)

The service processor (SP) does not always initialize when AC power is removed for less than 120 seconds.

Workaround:

To initialize the SP, unplug all four server power cords. Wait at least 120 seconds before reconnecting the power cords.

Intermittent Power Supply Faults Occur During Power On (CR 7066165)

In rare instances, the system FRU power-up probing routine might fail to list all installed system power supplies. The power supplies themselves are not faulted, but commands listing system FRUs do not show the presence of the non-probed power supply.

The fault sets the system fault LED, but no power supply fault LED is illuminated. To find the fault, use the fmadm utility from the ILOM fault management shell.

Start the fmadm utility from the ILOM CLI:

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp>

To view the fault, type the following:

faultmgmtsp> fmadm faulty
------------------- ------------------------------------ -------------- ------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- ------
2011-09-21/13:59:35 f13524d6-9970-4002-c2e6-de5d750f4088 ILOM-8000-2V   Major
 
Fault class : fault.fruid.corrupt
 
FRU         : /SYS/PS0
              (Part Number: 300-2159)
              (Serial Number: 476856F+1115CC0001)
 
Description : A Field Replaceable Unit (FRU) has a corrupt FRUID SEEPROM
 
Response    : The service-required LED may be illuminated on the affected
              FRU and chassis.
 
Impact      : The system may not be able to use one or more components on
              the affected FRU.  This may prevent the system from powering
              on.
 
Action      : The administrator should review the ILOM event log for
              additional information pertaining to this diagnosis.  Please
              refer to the Details section of the Knowledge Article for
              additional information.
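If you save the fmadm faulty output (for example, by copying it from the terminal session into a text file), a short script can pick out the fault class and FRU of each entry. The following Python sketch assumes the "Fault class : ..." and "FRU : ..." layout shown in the example above. parse_faults and looks_like_cr_7066165 are hypothetical helpers, and the matching rule (a fault.fruid.corrupt fault on a /SYS/PSn FRU) is inferred from this example rather than taken from an official signature for CR 7066165.

```python
import re
import sys

# Matches the 'Fault class : ...' and 'FRU         : ...' lines of fmadm faulty.
FIELD = re.compile(r"^\s*(Fault class|FRU)\s*:\s*(\S+)")


def parse_faults(fmadm_output):
    """Pair each 'Fault class' value with the 'FRU' value that follows it."""
    faults = []
    fault_class = None
    for line in fmadm_output.splitlines():
        m = FIELD.match(line)
        if not m:
            continue  # skip headers, part numbers, descriptions, etc.
        key, value = m.groups()
        if key == "Fault class":
            fault_class = value
        elif fault_class is not None:
            faults.append((fault_class, value))
            fault_class = None
    return faults


def looks_like_cr_7066165(fault_class: str, fru: str) -> bool:
    """Heuristic: a corrupt-FRUID fault reported against a power supply."""
    return (fault_class == "fault.fruid.corrupt"
            and re.fullmatch(r"/SYS/PS\d+", fru) is not None)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_faults.py saved_fmadm_faulty.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        for fault_class, fru in parse_faults(f.read()):
            flag = "possible CR 7066165" if looks_like_cr_7066165(fault_class, fru) else ""
            print(f"{fru:12} {fault_class:30} {flag}")
```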

Workaround:

From the fault management shell prompt, clear the fault, exit the fault management shell, and reset the SP. For example:

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp> fmadm repair /SYS/PS0
faultmgmtsp> exit
 
-> reset /SP
Are you sure you want to reset /SP (y/n)? y

After the SP has reset, verify that all installed power supplies appear in the list of system devices:

-> ls /SYS
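To check the result, you can save the output of ls /SYS and compare the listed targets against the power supplies you expect. The following Python sketch assumes one target name per line (as in an ILOM Targets list). It also assumes four installed power supplies named PS0 through PS3, matching the four power cords mentioned earlier in these notes, so adjust EXPECTED_PSUS to your configuration. missing_power_supplies is a hypothetical helper.

```python
import re
import sys

# Assumption: four installed power supplies, named as in /SYS/PS0 above.
EXPECTED_PSUS = {"PS0", "PS1", "PS2", "PS3"}


def missing_power_supplies(ls_sys_output, expected=EXPECTED_PSUS):
    """Return expected PSn targets that do not appear in saved 'ls /SYS' output."""
    listed = {line.strip() for line in ls_sys_output.splitlines()
              if re.fullmatch(r"PS\d+", line.strip())}
    return sorted(expected - listed)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_psus.py saved_ls_sys.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        missing = missing_power_supplies(f.read())
    print("missing: " + ", ".join(missing) if missing else "all expected power supplies listed")
```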

If the problem occurs again after applying this workaround, contact your authorized Oracle Service Provider for further assistance.

Voltage Fault Prevents Host Power-On (CR 7003014)

In a very small percentage of power-on attempts, ILOM might report a problem with a 12V sensor on one of the processor modules (PM0 or PM1), log a system fault, and abort the power-on sequence. The following is an example of the error message displayed in the ILOM command-line interface if the ILOM start /SYS command fails and the power-on sequence is aborted:

-> start /SYS
Are you sure you want to start /SYS (y/n)? y
start: System faults or hardware configuration prevents power on.

If your system does not power on using the ILOM start /SYS command, view the ILOM event log:

-> show /SP/logs/event/list

This issue might be present if you see an error in the ILOM event log that includes PMx/PDx/V_+12V0 (where x is either 0 or 1), such as the following:

1115   Sat Jan  1 12:44:15 2000  IPMI      Log       minor
       ID =   b2 : 01/01/2000 : 12:44:15 : Voltage : PM0/PD1/V_+12V0 : Lower Non
       -critical going low  : reading 0 <= threshold 11.43 Volts
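If you save the event log output to a file, a regular expression can find the sensors named above. The following Python sketch matches PMx/PDx/V_+12V0 with x limited to 0 or 1, as these notes describe. The + must be escaped in the pattern because it is otherwise a regex quantifier. cr_7003014_sensors is a hypothetical helper, and a match only indicates that this issue might be present.

```python
import re
import sys

# PMx/PDx/V_+12V0 where x is 0 or 1; '\+' matches the literal plus sign.
SENSOR = re.compile(r"PM[01]/PD[01]/V_\+12V0")


def cr_7003014_sensors(event_log_text):
    """Return the distinct 12V sensor names that appear in saved event log text."""
    return sorted(set(SENSOR.findall(event_log_text)))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_12v.py saved_event_log.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        for sensor in cr_7003014_sensors(f.read()):
            print(sensor)
```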

In addition, the ILOM fault management shell indicates that the processor module is faulty. To view a list of faulty components, do the following:

Start the ILOM fault management shell:

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

Display the list of faulty components:

faultmgmtsp> fmadm faulty

The following example displays a voltage sensor fault on Processor Module 0 (PM0):

------------------- ------------------------------------ ------------- -------
Time                UUID                                 msgid         Severity
------------------- ------------------------------------ ------------- -------
2010-11-12/19:59:33 c55af62d-2da0-48de-f02f-b437146752f7 SPT-8000-DH   Critical
 
Fault class : fault.chassis.voltage.fail
 
FRU         : /SYS/PM0
              (Part Number: 541-4182-08)
              (Serial Number: 1005LCB-1041HB01A1)
 
Description : A chassis voltage supply is operating outside of the
              allowable range.
 
Response    : The system will be powered off.  The chassis-wide service
              required LED will be illuminated.
 
Impact      : The system is not usable until repaired.  ILOM will not allow
              the system to be powered on until repaired.
 
Action      : The administrator should review the ILOM event log for
              additional information pertaining to this diagnosis.  Please
              refer to the Details section of the Knowledge Article for
              additional information.

Workaround:

This issue is fixed in firmware release 8.0.4.b and later. If your system has a firmware release earlier than 8.0.4.b, contact your authorized service provider to see if a fix is available, and otherwise continue with these workaround instructions.
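To decide whether a system already has the fix, compare its firmware version with 8.0.4.b field by field rather than as plain strings. A plain string comparison would wrongly place a hypothetical 8.0.10.a before 8.0.4.b. The following Python sketch assumes version strings made of three dotted numbers and an optional single lowercase letter (such as 8.0.4.b), and it treats a missing letter as earlier than a. Both the format assumption and the helper names are hypothetical, so confirm the exact version string your system reports before relying on it.

```python
import re

# Assumed format: three dotted numbers plus an optional '.<letter>' suffix.
VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:\.([a-z]))?")


def parse_fw(version: str) -> tuple:
    """Split a version such as '8.0.4.b' into (8, 0, 4, 'b').

    A version without a letter suffix gets '' so it sorts before '.a'.
    """
    m = VERSION.fullmatch(version.strip())
    if m is None:
        raise ValueError(f"unrecognized firmware version: {version!r}")
    major, minor, micro, letter = m.groups()
    return int(major), int(minor), int(micro), letter or ""


def has_cr_7003014_fix(version: str, fixed_in: str = "8.0.4.b") -> bool:
    """True if `version` is 8.0.4.b or later under the assumed ordering."""
    return parse_fw(version) >= parse_fw(fixed_in)


if __name__ == "__main__":
    for v in ("8.0.3.c", "8.0.4.b", "8.0.10.a"):
        print(v, has_cr_7003014_fix(v))
```

Converting the numeric fields to integers is what makes 10 sort after 4. Tuples then compare field by field from left to right.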

If a fix is not available and you do encounter a power-on failure and an event with one of the PMx/PDx/V_+12V0 sensors, clear the fault using one of the procedures below, and attempt to power on the system again.

Clear the fault using one of the following methods:
1. To clear the fault using the ILOM CLI:
```
-> set FRU-name clear_fault_action=true 
```
  For example, to clear a fault on Processor Module 0 (PM0):
```
-> set /SYS/PM0 clear_fault_action=true 
```
2. To clear the fault using the ILOM fault management shell:
```
faultmgmtsp> fmadm repair FRU-name
```
  For example, to clear a fault on Processor Module 0 (PM0):
```
faultmgmtsp> fmadm repair /SYS/PM0
...
faultmgmtsp> exit
->
```
Attempt to power on the system:
```
-> start /SYS
```

If the system powers on without failure after you clear the fault, you have encountered CR 7003014, and your system should power on and operate normally.

If the error persists and the system fails to power on, it should be treated as a genuine failure. Contact your authorized service provider for assistance.

	SPARC T3-4 Server Product Notes
