Bug ID: 28244670
Issue: Oracle ILOM might incorrectly report a fault.io.scsi.cmd.disk.dev.rqs.baddrv error for NVMe devices, faulting the device with the message Fault fault.io.scsi.cmd.disk.dev.rqs.baddrv on FRU /SYS.
Affected Hardware and Software: NVMe storage devices on all supported operating systems
If a system encounters this issue, complete the following steps.
Look for the NVMe ILOM fault code: fault.io.scsi.cmd.disk.dev.rqs.baddrv
The following screen shows a fault.io.scsi.cmd.disk.dev.rqs.baddrv error for an Oracle 6.4 TB NVMe SSD v2.

  ereport.io.scsi.cmd.disk.dev.rqs.baddrv@/SYS/DBP/HDD10/NVME
    status_flags = 0xc3
    smart_warning = 0xff
    reason = Drive is not functional
You can also use the Oracle ILOM show faulty command at the Oracle ILOM command-line prompt (->) to identify a drive failure.
To list all known faults in the server, log in to the Oracle ILOM service processor from the Oracle ILOM Fault Management Shell and issue the fmadm faulty command. For more information about how to use the Oracle ILOM Fault Management Shell and supported commands, refer to the Oracle ILOM User's Guide for System Monitoring and Diagnostics Firmware Release 4.0.x in the Oracle Integrated Lights Out Manager (ILOM) 4.0 Documentation Library at https://www.oracle.com/goto/ilom/docs.
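A minimal sketch of that session, assuming the Fault Management Shell is started from the ILOM CLI at /SP/faultmgmt/shell (prompt text may differ between firmware releases; verify against the ILOM 4.0 documentation):

```
-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp> fmadm faulty
faultmgmtsp> exit
->
```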
Upgrade the drive firmware if it is not current.
See Oracle 6.4 TB NVMe SSD v2 Supported Hardware and Software.
Do one of the following:
If SMBus status_flags = 0xbb is displayed, clear the fault; no power cycle is required. To clear the fault code in Oracle ILOM, go to step 4.
If SMBus status_flags = 0xc3 is displayed, complete a server power cycle and then clear the fault. Perform the following recovery steps, then go to step 4.
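The decision above can be sketched as a small helper. The 0xbb/0xc3 values come from this note; the function name is hypothetical and for illustration only:

```shell
# Hypothetical helper: map the SMBus status_flags value from the ereport
# to the recovery path described in this note.
recovery_action() {
  case "$1" in
    0xbb) echo "clear-fault" ;;             # no power cycle required
    0xc3) echo "power-cycle-then-clear" ;;  # power cycle the server first
    *)    echo "unknown" ;;                 # any other value: investigate
  esac
}
recovery_action 0xc3   # prints power-cycle-then-clear
```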
To identify the drive slot, type:
  # lspci -vv -s 1b:00.0
  1b:00.0 Non-Volatile memory controller: [NVM Express]
          Subsystem: Oracle/SUN Device
          Physical Slot: 900
          Control: I/O-

The PCIe address of /dev/nvme10n1 is 0000:e7:00.0.
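The slot number needed later for /sys/bus/pci/slots/<slot>/power can be pulled out of saved lspci output. This helper is a sketch (the function name is an assumption, not part of the note); the sample input matches the output shown above:

```shell
# Hypothetical helper: extract the "Physical Slot" number from `lspci -vv`
# output so it can be used with /sys/bus/pci/slots/<slot>/power.
slot_from_lspci() {
  sed -n 's/.*Physical Slot: *\([0-9][0-9]*\).*/\1/p'
}
printf 'Physical Slot: 900\n' | slot_from_lspci   # prints 900
```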
Take the affected drive off-line.
Disconnect all users of the NVMe drive and back up the NVMe drive data as needed. Use the umount command to unmount any file systems that are mounted on the device. Remove the device from any multiple device (md) or Logical Volume Manager (LVM) volume that uses it.
If the device is a member of an LVM volume group, it might be necessary to move data off the device with the pvmove command, remove the physical volume with the vgreduce command, and (optionally) remove the LVM metadata from the disk with pvremove. If the device uses multipathing, run multipath -l and note all paths to the device, then remove the multipathed device with the multipath -f device command. Run the blockdev --flushbufs device command to flush any outstanding I/O on all paths to the device.
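A dry-run sketch of that teardown order follows. DEV, the volume group name (vg0) and the multipath map name (mpatha) are placeholders, not values from this note; the commands are only printed here for review, not executed:

```shell
# Dry-run sketch: print the teardown commands in the order described above
# so they can be reviewed before being run against real storage.
DEV=/dev/nvme10n1   # example device from this note
cat <<EOF
umount $DEV
pvmove $DEV
vgreduce vg0 $DEV
pvremove $DEV
multipath -f mpatha
blockdev --flushbufs $DEV
EOF
```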
To prepare the NVMe drive for removal, that is, to detach the NVMe device driver and power off the NVMe drive slot, type:

  # echo 0 > /sys/bus/pci/slots/900/power
To power on the drive, type:

  # echo 1 > /sys/bus/pci/slots/900/power
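The two power steps can be combined into one helper. The function name and the one-second settle delay are assumptions, not from this note; the argument is the slot's sysfs power file, for example /sys/bus/pci/slots/900/power:

```shell
# Sketch: power cycle an NVMe drive slot via its sysfs power attribute.
power_cycle_slot() {
  echo 0 > "$1"   # detach the NVMe driver and power off the slot
  sleep 1         # brief settle time before powering back on (arbitrary)
  echo 1 > "$1"   # power the slot back on
}
```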
To clear the fault code in Oracle ILOM, type:
  -> set /SYS/DBP/HDD0 clear_fault_action=true
  Are you sure you want to clear /SYS/DBP/HDD0 (y/n)? y
  Set 'clear_fault_action' to 'true'
  ->
Enable the drive.
Rescan the PCI bus to rediscover the NVMe drive.
  # echo 1 > /sys/bus/pci/rescan
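After the rescan, it is worth confirming that the block device node has reappeared. This check is a sketch (the helper name is hypothetical; /dev/nvme10n1 is the example device from this note):

```shell
# Sketch: report whether the NVMe block device node is present again
# after the PCI bus rescan.
drive_present() {
  [ -b "$1" ] && echo present || echo missing
}
drive_present /dev/nvme10n1
```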
If the same failure occurs again, repeat the recovery process described above. If the failure recurs within minutes, the drive has failed. If the problem persists, replace the faulty drive identified in the fmadm faulty output.
Refer to the following document for the latest procedures for displaying event content in preparation for submitting a service request and for applying any post-repair actions that might be required: PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1).