Hardware Issues

This section describes the hardware issues in the Oracle Solaris 11.4 release.

HBA Connected with ALUA Multipath SAS Can Cause I/O Failures During Failover (28337990)

Oracle Solaris multipathing might experience I/O failures on Asymmetric Logical Unit Access (ALUA) storage targets during path failover. This issue occurs only when such storage is attached through the SAS SCSI transport. The cfgadm -alv command lists a device that is connected through SAS SCSI under a controller of type scsi-sas:

c7                             connected    configured   unknown
unavailable  scsi-sas     n /devices/pci@301/pci@1/scsi@0/iport@1:scsi
c7::w5000cca02f187da1,0        connected    configured   unknown
                          Client Device: /dev/dsk/c0t5000CCA02F187DA0d0s0(sd7)

In addition, the mpathadm show lu command reports the logical unit as asymmetric:

# mpathadm show lu /dev/dsk/c0t5000CCA02F187DA0d0s0
Logical Unit:  /dev/rdsk/c0t5000CCA02F187DA0d0s2
        mpath-support:  libmpscsi_vhci.so
        ...
        Asymmetric:  yes
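
The two checks above can be scripted against saved command output. The following is a minimal sketch, not an Oracle-provided tool: the function names are hypothetical, and it only looks for the scsi-sas type field and the Asymmetric: yes line shown in the examples. It does not verify that the logical unit actually sits behind the SAS controller that was found, so treat a positive result as a hint to check manually.

```python
import re


def has_sas_controller(cfgadm_output: str) -> bool:
    """Return True if any line of `cfgadm -alv` output has a field equal to scsi-sas."""
    return any("scsi-sas" in line.split() for line in cfgadm_output.splitlines())


def is_asymmetric(mpathadm_output: str) -> bool:
    """Return True if `mpathadm show lu` output contains an 'Asymmetric: yes' field."""
    match = re.search(r"^\s*Asymmetric:\s*(\S+)", mpathadm_output, re.MULTILINE)
    return match is not None and match.group(1).lower() == "yes"


def may_be_affected(cfgadm_output: str, mpathadm_output: str) -> bool:
    """Coarse check: both conditions described in this section are present."""
    return has_sas_controller(cfgadm_output) and is_asymmetric(mpathadm_output)
```

Both functions take the captured text of the respective command, for example text saved to a file while running the commands as root.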

If this issue occurs, you will see an error similar to the following (lines are artificially broken for readability):

Jul 15 2018 13:22:45.123456789 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
       class = ereport.io.scsi.cmd.disk.tran
       ...
       thread-stacks = stack[0] = genunix`fm_dev_report_postv+2c8()
                                  |scsi`scsi_fm_report_post+204()
                                  |sd`sd_report_post+a04()
                                  |sd`sd_intr_report_post+150()
                                  |sd`sd_return_command+15c()
                                  |sd`sdintr+a00()|scsi`scsi_hba_pkt_comp+e94()
                                  |scsi_vhci`vhci_intr+d6c()
                                  |scsi`scsi_hba_pkt_comp+e94()
                                  |scsi`scsi_pkt_comp_daemon+c8()
       ...
       pkt-reason = 0x1a
       pkt-state = 0x0
       pkt-stats = 0x0
       ...

Workaround: Until a fix is released, you can work around this issue by increasing the values of the sd and ssd tunables for an affected vendor ID/product ID (VID/PID) pair. Modify /etc/driver/drv/sd.conf or /etc/driver/drv/ssd.conf as shown in the following example:

sd-config-list = "VID PID", "path-busy-retry-count:4294967295, path-busy-retry-timeout:180000";

Note that the value shown for path-busy-retry-count in this example is the maximum allowed setting. A lower value should also work, but the lowest value that works depends on the system architecture and other circumstances, so no single minimum can be stated for all cases.
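
As an illustration of how the sd-config-list entry is assembled, the following sketch generates the line for a given VID/PID pair. It rests on two assumptions that are not stated in this section: that the VID is padded with spaces to 8 characters (the width of the SCSI INQUIRY vendor field) before the PID is appended, and that the helper name and the ACME/DISK01 strings are purely hypothetical. Use the real VID and PID reported by the affected device, and check the result against the sd driver documentation before editing the configuration file.

```python
MAX_RETRY_COUNT = 4294967295  # maximum allowed setting, per the note above


def sd_config_entry(vid: str, pid: str,
                    retry_count: int = MAX_RETRY_COUNT,
                    retry_timeout: int = 180000) -> str:
    """Build an sd-config-list line for one VID/PID pair (hypothetical helper)."""
    if len(vid) > 8:
        raise ValueError("VID is assumed to be at most 8 characters")
    if not 0 <= retry_count <= MAX_RETRY_COUNT:
        raise ValueError("path-busy-retry-count is out of range")
    # Assumption: the VID field is space-padded to 8 characters.
    device_id = vid.ljust(8) + pid
    return ('sd-config-list = "{}", '
            '"path-busy-retry-count:{}, path-busy-retry-timeout:{}";'
            .format(device_id, retry_count, retry_timeout))


# Hypothetical VID/PID values, for illustration only.
print(sd_config_entry("ACME", "DISK01"))
```

Passing a lower retry_count is how you would experiment with values below the maximum.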

This workaround has the following restrictions and limitations:

A large path-busy-retry-count value might cause the kernel to spin while waiting for the failover to occur, which leads to high CPU usage. A system with this workaround enabled might therefore experience higher load and poor performance until the failover is complete, after which the system recovers.
These tunables might change in the future and should not be used after a fix for bug 28337990 is available. See the Bugs Fixed section of the SRU Readme files.
These tunables should not be used for any other purpose unless explicitly recommended by Oracle.

Panic When Performing a DR Operation on an InfiniBand HCA Device (28150723)

A panic can occur if an InfiniBand (IB) tool or utility such as ibqueryerrors or ibdiagnet is running while a Dynamic Reconfiguration (DR) operation is being performed on an HCA. The DR operation can be initiated by commands such as cfgadm or ldm remove-io that remove or unconfigure an HCA device. See the ibqueryerrors(8), ibdiagnet(1), cfgadm(8), and ldm(8) man pages for more information.

If a panic occurs for this reason, you will see an error message similar to the following:

panic[cpu14]/thread=c0405b9fe3980: BAD TRAP: type=31 rp=2a101bcf320 addr=62
mmu_fsr=0 occurred in module "ibtl" due to a NULL pointer dereference

Normally, if an IB tool is active and using an HCA on which a DR operation is attempted, the DR operation fails with an indication that the HCA is in use.

Workaround: Ensure that no InfiniBand tools, utilities, or applications (such as ibqueryerrors or ibdiagnet) are active while performing a DR operation on an InfiniBand HCA device.
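
One way to apply this workaround is to check the process list before starting the DR operation. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, the caller supplies the list of running command names (for example, collected from ps output), and the default tool set contains only the two tools named in this section, so it is not an exhaustive list of InfiniBand utilities or applications.

```python
import os

# Only the examples named in this section; extend for your environment.
IB_TOOLS = {"ibqueryerrors", "ibdiagnet"}


def active_ib_tools(running_commands, tool_names=IB_TOOLS):
    """Return the sorted names of known IB tools found among running commands."""
    # Strip whitespace and any leading path so /usr/sbin/ibqueryerrors matches.
    found = {os.path.basename(cmd.strip()) for cmd in running_commands}
    return sorted(found & set(tool_names))
```

An empty result means none of the listed tools were seen. It does not prove that no other application is using the HCA.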

iSCSI Driver Might Give Up Prematurely When Trying to Reconnect to a Target (21216881)

When the connection to a target is temporarily disrupted, the default iSCSI maximum connection retry time of 180 seconds (3 minutes) might be insufficient for initiators that are using an iSCSI boot device. In that case, the following error message is displayed:

NOTICE: iscsi connection(19) unable to connect to target iqn.1986-03.com.sun:02:hostname, target address 192.168.001.160

Workaround: Increase the iSCSI maximum connection retry time to at least 1080 seconds (18 minutes) on initiators that are using the iSCSI boot device.
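
The following sketch builds the command that would raise the retry time on the initiator. It assumes that the "maximum connection retry" in this section corresponds to the conn-login-max tunable of iscsiadm modify initiator-node -T. This section does not name the tunable, so confirm it in the iscsiadm(8) man page before use. The helper name is hypothetical, and the function only constructs the argument list without running anything.

```python
RECOMMENDED_MIN_SECONDS = 1080  # 18 minutes, per the workaround above


def conn_login_max_command(seconds: int = RECOMMENDED_MIN_SECONDS) -> list:
    """Return the argv list for raising the initiator's connection retry time."""
    if seconds < RECOMMENDED_MIN_SECONDS:
        raise ValueError("the workaround calls for at least 1080 seconds")
    # Assumption: conn-login-max is the tunable behind the 180-second default.
    return ["iscsiadm", "modify", "initiator-node",
            "-T", "conn-login-max={}".format(seconds)]
```

The returned list can be passed to subprocess.run on the initiator as a privileged user, or joined with spaces and typed at a shell prompt.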