The Fault Management Architecture (FMA) contacts the Logical Domains Manager when it detects a faulty resource. The Logical Domains Manager then attempts to stop using that resource in all running domains. To ensure that a faulty resource cannot be assigned to a domain in the future, FMA adds the resource to a blacklist.
The Logical Domains Manager supports blacklisting only for CPU and memory resources, not for I/O resources.
If a faulty resource is not in use, the Logical Domains Manager removes it from the available resource list, which you can see in the ldm list-devices output. The resource is then internally marked as “blacklisted” so that it cannot be reassigned to a domain in the future.
If the faulty resource is in use, the Logical Domains Manager attempts to evacuate the resource. To avoid a service interruption on the running domains, the Logical Domains Manager first attempts to use CPU or memory dynamic reconfiguration to evacuate the faulty resource. The Logical Domains Manager remaps a faulty core if a free core is available to use as a target. If this “live evacuation” succeeds, the faulty resource is internally marked as blacklisted and is not shown in the ldm list-devices output, so it will not be assigned to a domain in the future.
If the live evacuation fails, the Logical Domains Manager internally marks the faulty resource as “evacuation pending.” The resource is shown as normal in the ldm list-devices output because the resource is still in use on the running domains until the affected guest domains are rebooted or stopped.
When the affected guest domain is stopped or rebooted, the Logical Domains Manager attempts to evacuate the faulty resources and internally mark them as blacklisted so that they cannot be assigned in the future. Such a resource is not shown in the ldm list-devices output. After the pending evacuation completes, the Logical Domains Manager attempts to start the guest domain. However, if the guest domain cannot be started because sufficient resources are not available, the guest domain is marked as “degraded” and the following warning message is logged so that you can intervene and perform a manual recovery.
primary# ldm ls
NAME             STATE      FLAGS   CONS    VCPU  MEMORY    UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    368   2079488M  0.1%  0.0%  16h 57m
gd0              bound      -d----  5000    8

warning: Could not restart domain gd0 after completing pending evacuation.
The domain has been marked degraded and should be examined to see if manual recovery is possible.
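A script can pick out degraded domains from a listing like the one above. The following Python sketch is illustrative only: the function name degraded_domains is hypothetical, and the rule that a d in the second position of the FLAGS column marks a degraded domain is an assumption inferred from the sample listing, so confirm it against the ldm man page for your release before relying on it.

```python
import re
from typing import List

# A FLAGS value in the sample listing is six characters, each a lowercase
# letter or a hyphen (for example "-n-cv-" or "-d----").
FLAGS_RE = re.compile(r"^[-a-z]{6}$")


def degraded_domains(ldm_ls_output: str) -> List[str]:
    """Return names of domains whose FLAGS column looks degraded.

    Assumption: "d" in the second flag position marks a degraded domain,
    as the gd0 row in the sample listing suggests.
    """
    names = []
    for line in ldm_ls_output.splitlines():
        fields = line.split()
        # Skip the prompt, the header, blank lines, and warning text:
        # none of them has a six-character flags token in third position.
        if len(fields) < 3 or not FLAGS_RE.match(fields[2]):
            continue
        if fields[2][1] == "d":
            names.append(fields[0])
    return names


SAMPLE = """\
primary# ldm ls
NAME             STATE      FLAGS   CONS    VCPU  MEMORY    UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    368   2079488M  0.1%  0.0%  16h 57m
gd0              bound      -d----  5000    8

warning: Could not restart domain gd0 after completing pending evacuation.
The domain has been marked degraded and should be examined to see if manual recovery is possible.
"""

print(degraded_domains(SAMPLE))  # prints ['gd0']
```

The parser reads the human-readable listing only because that is the form shown here. It does not run ldm itself, so the captured text must be supplied by the caller.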
When the system is power-cycled, FMA repeats the evacuation requests for resources that are still faulty, and the Logical Domains Manager handles those requests by evacuating the faulty resources and internally marking them as blacklisted.
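The handling described in this section can be summarized as a small state model. The following Python sketch only restates the rules from the text and is not how the Logical Domains Manager is implemented. All names in it (ResourceState, handle_fault, complete_pending_evacuation) are hypothetical, and it assumes that the evacuation attempted at domain stop or reboot succeeds.

```python
from enum import Enum
from typing import Tuple


class ResourceState(Enum):
    """Internal markings named in the text for a faulty resource."""
    EVACUATION_PENDING = "evacuation pending"
    BLACKLISTED = "blacklisted"


def handle_fault(in_use: bool, live_evacuation_succeeds: bool) -> ResourceState:
    """Marking of a faulty CPU or memory resource after FMA reports it."""
    if not in_use:
        # Unused resources are removed from the available list right away.
        return ResourceState.BLACKLISTED
    if live_evacuation_succeeds:
        # Dynamic reconfiguration moved the domain off the resource.
        return ResourceState.BLACKLISTED
    # Still in use: wait until the affected domain stops or reboots.
    return ResourceState.EVACUATION_PENDING


def complete_pending_evacuation(restart_possible: bool) -> Tuple[ResourceState, str]:
    """Outcome once the affected guest domain is stopped or rebooted.

    Assumes the evacuation itself succeeds. The domain is restarted if
    enough resources remain; otherwise it is marked degraded.
    """
    domain_state = "restarted" if restart_possible else "degraded"
    return ResourceState.BLACKLISTED, domain_state


# A fault on an in-use resource whose live evacuation fails:
state = handle_fault(in_use=True, live_evacuation_succeeds=False)
print(state.value)  # prints evacuation pending
```

After a power cycle, FMA repeats its requests for resources that are still faulty, which in this model corresponds to calling handle_fault again for each such resource.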
Prior to support for FMA blacklisting, a guest domain that panicked because of a faulty resource might enter a never-ending panic-reboot loop. Because resource evacuation and blacklisting occur when the guest domain is rebooted, you can avoid this panic-reboot loop and prevent future attempts to use a faulty resource.