Known Issues

This section describes important operating issues and known hardware and software issues for Oracle 3.84 TB NVMe SSDs.

Supplementary and workaround information is provided for Oracle 3.84 TB NVMe SSDs, with specific Bug ID numbers for service personnel.

Oracle ILOM Incorrectly Faults the Device with Message fault.io.scsi.cmd.disk.dev.rqs.baddrv

Bug ID: 28244670

Issue: Oracle ILOM might incorrectly report a fault.io.scsi.cmd.disk.dev.rqs.baddrv error for NVMe devices, with the message Fault fault.io.scsi.cmd.disk.dev.rqs.baddrv on FRU /SYS.

Affected Hardware and Software: NVMe storage devices on all supported operating systems

Workaround: None

Recovery:

If a system encounters this issue, complete the following steps.

  1. Look for the NVMe ILOM fault code: fault.io.scsi.cmd.disk.dev.rqs.baddrv

    The following output shows a fault.io.scsi.cmd.disk.dev.rqs.baddrv error for an Oracle 3.84 TB NVMe SSD.

    ereport.io.scsi.cmd.disk.dev.rqs.baddrv@/SYS/MB/PCIE5
                             status_flags  = 0xc3
                             smart_warning = 0xff
                             reason        = Drive is not functional

    You can also use the Oracle ILOM show faulty command at the Oracle ILOM command-line prompt (->) to identify a drive failure.

    To list all known faults in the server, log in to the Oracle ILOM service processor from the Oracle ILOM Fault Management Shell and issue the fmadm faulty command. For more information about how to use the Oracle ILOM Fault Management Shell and supported commands, refer to the Oracle ILOM User's Guide for System Monitoring and Diagnostics in the Oracle Integrated Lights Out Manager (ILOM) 5.0 Documentation Library at https://www.oracle.com/goto/ilom/docs .

  2. Upgrade drive firmware if not current.

    See Oracle 3.84 TB NVMe SSD Supported Hardware and Software.

  3. Do one of the following:

    If SMBus status_flags = 0xbb displays, clear the fault. No power cycling is required. To clear the fault code in Oracle ILOM, go to step 4.

    If SMBus status_flags = 0xc3 displays, complete a server power cycle, then clear the fault. Do the following to recover, then go to step 4.

    1. To identify the drive slot, type:

      # lspci -vv -s 1b:00.0
        1b:00.0 Non-Volatile memory controller: [NVM Express]
            Subsystem: Oracle/SUN Device
            Physical Slot: 900
            Control: I/O-

      In this example, the PCIe address of /dev/nvme10n1 is 0000:e7:00.0.
    2. Take the affected drive off-line.

      Disconnect all users of the NVMe drive and back up the NVMe drive data as needed. Use the umount command to unmount any file systems that are mounted on the device. Remove the device from any multiple device (md) and Logical Volume Manager (LVM) volume using it.

      If the device is a member of an LVM Volume group, then it might be necessary to move data off the device using the pvmove command, then use the vgreduce command to remove the physical volume, and (optionally) pvremove to remove the LVM metadata from the disk. If the device uses multipathing, run multipath -l and note all the paths to the device. Then, remove the multipathed device using the multipath -f device command. Run the blockdev --flushbufs device command to flush any outstanding I/O to all paths to the device.

    3. To prepare the NVMe drive for removal, that is, to detach the NVMe device driver and power off the NVMe drive slot, type:

      # echo 0 > /sys/bus/pci/slots/900/power

    4. To power on the drive, type:

      # echo 1 > /sys/bus/pci/slots/900/power

  4. To clear the fault code in Oracle ILOM, type:

    -> set /SYS/MB/PCIE5 clear_fault_action=true
       Are you sure you want to clear /SYS/MB/PCIE5 (y/n)? y
       Set 'clear_fault_action' to 'true'
    ->
  5. Enable the drive.

    Rescan the PCI bus to rediscover the NVMe drive.

    # echo 1 > /sys/bus/pci/rescan
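The power-cycle portion of the recovery above can be sketched as a shell script. This is a hedged sketch, not the official procedure: slot 900 and the 0xc3 status value are the example values from this document, and the script only writes to sysfs when the slot's power file is actually present and writable.

```shell
#!/bin/sh
# Sketch of the recovery flow for a 0xc3 status. Slot 900 is the
# example value from the lspci "Physical Slot:" output in this
# document; substitute the slot reported on your system.

# status_flags 0xc3 requires a slot power cycle before clearing the
# fault; 0xbb only requires clearing the fault in Oracle ILOM.
needs_power_cycle() {
  [ "$1" = "0xc3" ]
}

SLOT=900
STATUS_FLAGS="0xc3"   # as reported in the ereport output

if needs_power_cycle "$STATUS_FLAGS" && [ -w "/sys/bus/pci/slots/$SLOT/power" ]
then
  echo 0 > "/sys/bus/pci/slots/$SLOT/power"   # detach driver, power off slot
  echo 1 > "/sys/bus/pci/slots/$SLOT/power"   # power the slot back on
fi

# After clearing the fault in Oracle ILOM, rescan the PCI bus to
# rediscover the NVMe drive.
if [ -w /sys/bus/pci/rescan ]; then
  echo 1 > /sys/bus/pci/rescan
fi
```

Clearing the fault itself still happens at the Oracle ILOM prompt, as shown in step 4.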

If the failure occurs again, repeat the recovery process above. If the failure recurs within minutes, the drive has failed. If the problem persists, replace the faulty card identified in the fmadm faulty output.

For the latest procedures for displaying event content in preparation for submitting a service request, and for any post-repair actions that may be required, refer to PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1).

Oracle ILOM Reports a Fault for NVMe Devices When Performing a Reboot, Firmware Update, or Hot-Plug Operation

Bug ID: 28654297

Issue: Oracle ILOM might report a fault.chassis.device.fail error for NVMe devices when performing a reboot, a firmware update, or hot-plug operation.

Affected Hardware and Software: NVMe storage devices on all supported operating systems

Workaround: Disable the device_monitor feature in Oracle ILOM using the following command:

set /SP/services/device_monitor servicestate=disabled

Oracle ILOM Reports Faults for Correctable Errors

Bug ID: 28601316

Issue: The PCIe link retrains, a PCIe PHY reset event occurs on PCIe channels, and Oracle ILOM reports three types of correctable errors. The OS logs contain the following errors:

  • Bad DLLP

  • Bad TLP

  • RTTO

Workaround: None

These correctable errors result from the TCRH (Train Cold – Run Hot) compensation feature and are expected behavior on Oracle Server X9 series servers.

Secure Erase Drives Before Use

Oracle 3.84 TB NVMe SSDs may report uncorrectable errors or assert after being unpowered for three or more months. If the NAND media is not refreshed for approximately three months, the drive may experience media errors. As a best practice, secure erase an Oracle 3.84 TB NVMe SSD before use, especially if the drive has been unpowered for more than three months or if the planned use involves reading from the drive as a test.

Over time, the drive firmware policy refreshes the media in the background while it remains powered-on. If the drive has been powered on long enough for the background refresh policy to be applied to all bits, the drive is not at risk for this issue. The time required to refresh all the bits is approximately 14 days and varies by product.

If the number of bits experiencing this issue exceeds the error-correction code (ECC) capability, it may result in an uncorrectable read error. If the uncorrectable read errors occur during normal drive operation, the drive will report an increased number of SMART media errors to the host.

Workaround:

Secure erase the drive to return it to service. A secure erase starts with an empty Flash Translation Layer (FTL) table, so all blocks, including any LBAs that may have held degraded data, are released as free blocks to be reused.

Before using the drive for operation or test, choose one of the following erase options. An offline server can be used:

  • Secure erase the drive, using the nvmeadmin utility.

  • Download and use third party utilities to secure erase the drive.

  • Wait two weeks for a media refresh while the drive is powered-on before using the drive.

Caution:

A secure erase destroys all data on the drive.

Secure Erase Drive Using nvmeadmin Utility

To secure erase the drive using the Oracle Hardware Management Pack NVMe admin utility:

  1. Stop all IO to the NVMe device before attempting this action.

  2. To securely erase all namespaces, type:

      # nvmeadm erase -s -a controller_name

    For example:

      # nvmeadm erase -s -a SUNW-NVME-1

  3. List all server devices.

  4. Verify drive health.

Refer to Oracle Hardware Management Pack 2.4 Server CLI Tools User's Guide: https://www.oracle.com/goto/ohmp/docs . See Server Management Tools.
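The nvmeadm steps above can be sketched as a small shell script. This is a hedged sketch: SUNW-NVME-1 is the example controller name from this document, the "nvmeadm getlog" invocation for the health check is an assumption (check the Server CLI Tools User's Guide for the exact syntax in your Oracle Hardware Management Pack release), and the script defaults to a dry run so the destructive erase can be reviewed first.

```shell
#!/bin/sh
# Sketch of the nvmeadm secure-erase procedure. SUNW-NVME-1 is the
# example controller name from this document; the getlog syntax is an
# assumption -- verify it against the HMP CLI Tools User's Guide.

CTRL="SUNW-NVME-1"

# With DRY_RUN=1 (the default here) commands are only printed, so the
# sequence can be reviewed before it destroys data. Set DRY_RUN=0 to
# actually run the commands.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run nvmeadm erase -s -a "$CTRL"    # step 2: securely erase all namespaces
run nvmeadm list                   # step 3: list all server NVMe devices
run nvmeadm getlog smart "$CTRL"   # step 4: verify drive health (assumed syntax)
```

Stop all I/O to the device before switching the script out of dry-run mode.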

Secure Erase Drive Using Third-party Utilities

To secure erase the drive using the Intel Solid-State Drive Configuration Manager utility, if available:

  1. Install the Intel Solid-State Drive Configuration Manager.

  2. Stop all IO to the NVMe device before attempting this action.

  3. Use the -secure_erase option to erase all the data on the drive.

    issdcm -drive_index 1 -secure_erase
  4. The user is prompted unless the -force option is used:

    WARNING: You have selected to secure erase the drive!
    Proceed with the secure erase? (Y/N)
  5. If the drive contains a partition, the prompt contains a second warning message:

    WARNING: You have selected to secure erase the drive!
    WARNING: Tool has detected a partition on the drive!
    Proceed with the secure erase? (Y/N)
  6. To bypass the warning prompts, use the -force option:

    issdcm -drive_index 1 -secure_erase -force
  7. List all server devices.

  8. Verify drive health.