Replacing PCIe Hardware on a System With an IOR Configuration

This section outlines a streamlined procedure for replacing a PCIe device in a running system without rebooting any domain. The procedure achieves the same result as a manual replacement but is less error-prone, and the existing SR-IOV and IOR configuration is re-created automatically after the new PCIe device is inserted. It is especially useful for systems with large or complex configurations, such as those often used for IOR.

The ldm evacuate-io command saves the configuration and powers off the device. The ldm restore-io command performs the subsequent restoration.

System Requirements

Ensure that the I/O domain, root domain, service domain, and primary domain run at least the Oracle Solaris 11.4 SRU 13 OS. This also ensures that Oracle VM Server for SPARC version 3.6.1 or later is running, which is also a requirement.

The system hardware must be capable of using SR-IOV enabled PCIe cards in an IOR configuration. See Resilient I/O Domain Requirements.

Limitations

  • This feature applies only to Dynamic I/O using SR-IOV. See Dynamic PCIe Bus Assignment Requirements.

  • The system must fully support SR-IOV and IOR in both hardware and software versions as outlined above.

  • Only an equivalent PCIe card may be used as the replacement device. The replacement must be from the same manufacturer, be the same model, and support the same number of SR-IOV PFs and VFs.

  • For Fujitsu M10 servers or Fujitsu SPARC M12 servers, Oracle Solaris 11.4 SRU 24 OS or later is required for the I/O domain, root domain, service domain, and primary domain.

Example 8-29 Example Faulty PCIe Card Replacement Procedure

In this example, the PCIe device at path /SYS/IOU1/PCIE13 is replaced in a non-primary root domain (NPRD) that owns that PCIe slot. In effect, the target slot and all its children (PFs and VFs) are removed and restored during this procedure.

The target for the commands is the SR-IOV device itself, as represented in NAC name format. Thus, you can copy the device name from the output of an ldm ls-io command and paste it directly into an ldm evacuate-io or ldm restore-io command. NAC name format is the standard for ldm commands and ILOM utilities.

The target device must be considered by the hotplug daemon to be a "connector". A connector is a device that is listed in the output of a hotplug list -c command run in the root domain which owns the target PCIe device (be it the primary or an NPRD). See hotplug(8), and for information about Oracle Solaris OS hotplug capabilities, see Chapter 2, Dynamically Configuring Devices in Managing Devices in Oracle Solaris 11.4.
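
The connector check described above can be scripted. The following is a minimal sketch and not part of the documented procedure: is_listed_connector is a hypothetical helper that searches a listing read from standard input for a name as a whole word. It assumes a grep that supports the -q, -w, and -F options, and it makes no assumption about the column layout of the hotplug list -c output. CONNECTOR_NAME is a placeholder; take the actual name from the hotplug list -c output in the owning root domain, because it may be written differently from the NAC name.

```shell
# Hypothetical helper (not an Oracle-provided command).
# Succeeds (exit status 0) if the name given as $1 appears as a whole
# word anywhere in the listing read from standard input.
is_listed_connector() {
    # -q: no output, status only; -w: whole-word match; -F: fixed string
    grep -qwF -- "$1"
}

# Intended use in the root domain that owns the device
# (CONNECTOR_NAME is a placeholder):
#
#   hotplug list -c | is_listed_connector CONNECTOR_NAME \
#       && echo "listed as a connector"
```

A non-zero exit status means only that the name was not found in the listing, so confirm the result by reading the hotplug list -c output directly before relying on it.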

Steps

  1. Identify the card to remove by reviewing the fault logs on the primary domain.

    In this example the target is /SYS/IOU1/PCIE13.

    primary# fmadm faulty
  2. Review and save a copy of the current I/O configuration on the machine. This step is optional, but it allows manual verification of the restored configuration.

    primary# ldm ls-io > io_config.txt
  3. Run the evacuation command to automatically save the current configuration, remove and destroy the VF children, and power down the device.

    nprd# ldm evacuate-io /SYS/IOU1/PCIE13
  4. Wait until the ldm command completes and the power LED on the target SR-IOV card or slot is unlit.

  5. Physically remove the device and replace it in the same slot with an equivalent card.

  6. Restore the previous configuration by running the following command.

    nprd# ldm restore-io /SYS/IOU1/PCIE13
  7. Wait until the ldm command completes and the power LED on the target SR-IOV card or slot is lit.

  8. Check that the configuration matches the previously saved configuration. This check is optional.

    primary# ldm ls-io

    If any portion of either the ldm evacuate-io or the ldm restore-io command fails, the remaining steps are not attempted, and no attempt is made to undo the actions that have already completed. The error message printed to the console should indicate the cause of the failure. If the command is run again, it attempts to complete all unfinished work.

    For background details see Making PCIe Hardware Changes. For general guidelines on hardware changes and manual instructions for PCIe device replacement in systems configured with Oracle VM Server, see How to Replace PCIe Direct I/O Cards Assigned to an Oracle VM Server for SPARC Guest Domain (Doc ID 1684273.1) (https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=226878266536565&id=1684273.1&_adf.ctrl-state=bo9fbmr1n_49).
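
The optional verification in steps 2 and 8 can be done with a plain file comparison. The following is a minimal sketch under stated assumptions: compare_io_config is a hypothetical helper, and io_config_after.txt is a hypothetical file name for a second ldm ls-io listing captured after the restore. The sketch does not interpret the listing, so any reported difference is only a prompt for manual review.

```shell
# Hypothetical helper (not an Oracle-provided command).
# Compares a saved `ldm ls-io` listing ($1) with a fresh one ($2).
# Prints the differing lines, if any, and returns non-zero when the
# files differ or cannot be read.
compare_io_config() {
    saved=$1
    current=$2
    if diff "$saved" "$current"; then
        echo "I/O configuration matches the saved copy"
    else
        echo "I/O configuration differs or could not be compared" >&2
        return 1
    fi
}

# Intended use on the primary domain after step 7
# (io_config_after.txt is a hypothetical file name):
#
#   ldm ls-io > io_config_after.txt
#   compare_io_config io_config.txt io_config_after.txt
```

Comparing whole files keeps the check simple, at the cost of flagging every textual change, including ones that may be harmless.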