FMA contacts the Logical Domains Manager when it detects a faulty resource. Then, the Logical Domains Manager attempts to stop using that resource in all running domains. To ensure that a faulty resource cannot be assigned to a domain in the future, FMA adds the resource to a blacklist.
The Logical Domains Manager supports blacklisting only for CPU and memory resources, not for I/O resources.
If a faulty resource is not in use, the Logical Domains Manager removes it from the list of available resources, which you can view in the ldm list-devices output. The resource is then internally marked as “blacklisted” so that it cannot be reassigned to a domain in the future.
If the faulty resource is in use, the Logical Domains Manager attempts to evacuate the resource. To avoid a service interruption on the running domains, the Logical Domains Manager first attempts to use CPU or memory dynamic reconfiguration to evacuate the faulty resource. A faulty core is remapped only if a free core is available to use as a target. If this “live evacuation” succeeds, the faulty resource is internally marked as blacklisted and is no longer shown in the ldm list-devices output, so it cannot be assigned to a domain in the future.
If the live evacuation fails, the Logical Domains Manager internally marks the faulty resource as “evacuation pending.” The resource is shown as normal in the ldm list-devices output because the resource is still in use on the running domains until the affected guest domains are rebooted or stopped.
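The three outcomes described above (resource not in use, live evacuation succeeds, live evacuation fails) can be summarized as a small decision function. The following Python sketch is only an illustrative model of the documented behavior, not Logical Domains Manager code. The names Status, handle_fault, and shown_as_available are hypothetical, and the status strings are borrowed from the ldm list-devices -B output.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical labels for the internal state of a faulty resource."""
    BLACKLISTED = "Blacklisted"
    EVAC_PENDING = "Evac_pending"


def handle_fault(in_use: bool, live_evacuation_ok: bool = False) -> Status:
    """Model the state a faulty resource reaches after FMA reports it.

    in_use: the resource is assigned to a running domain.
    live_evacuation_ok: dynamic reconfiguration managed to move the
        workload off the resource (for a core, a free target core existed).
    """
    if not in_use:
        # Unused resource: removed from the available list and blacklisted.
        return Status.BLACKLISTED
    if live_evacuation_ok:
        # Live evacuation succeeded: blacklisted without interrupting service.
        return Status.BLACKLISTED
    # Live evacuation failed: stays in use until the domain stops or reboots.
    return Status.EVAC_PENDING


def shown_as_available(status: Status) -> bool:
    """Only a resource pending evacuation still appears as normal in
    plain ldm list-devices output; a blacklisted one is hidden."""
    return status is Status.EVAC_PENDING


if __name__ == "__main__":
    for in_use, ok in [(False, False), (True, True), (True, False)]:
        status = handle_fault(in_use, ok)
        print(in_use, ok, status.value, shown_as_available(status))
```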
You can use the ldm list-devices -B command to view blacklisted resources or resources pending evacuation. The following example shows core and memory resources that are either blacklisted or pending evacuation:
primary# ldm list-devices -B
CORE
    ID      STATUS          DOMAIN
    1       Blacklisted
    2       Evac_pending    ldg1
MEMORY
    PA                  SIZE    STATUS          DOMAIN
    0xa30000000         87G     Blacklisted
    0x80000000000       128G    Evac_pending    ldg1
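If you want to act on this output from a script, for example to find the domains that must be stopped or rebooted before a pending evacuation can complete, the listing can be parsed into records. The following Python sketch makes several assumptions: section names (CORE, MEMORY) start in the first column, each section begins with an indented header row, and only the last column (DOMAIN) can be empty. The function name parse_blacklist and the embedded SAMPLE text are illustrative. Pass only the command output, without the primary# prompt line.

```python
SAMPLE = """\
CORE
    ID      STATUS          DOMAIN
    1       Blacklisted
    2       Evac_pending    ldg1
MEMORY
    PA                  SIZE    STATUS          DOMAIN
    0xa30000000         87G     Blacklisted
    0x80000000000       128G    Evac_pending    ldg1
"""


def parse_blacklist(text: str) -> dict:
    """Parse `ldm list-devices -B` style output into {section: [row, ...]}.

    Assumes whitespace-separated columns where only the trailing
    DOMAIN column may be missing from a data row.
    """
    sections = {}
    current = None   # name of the section being read
    headers = None   # column names of the current section
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new section such as CORE.
            current = line.strip()
            sections[current] = []
            headers = None
        elif current is None:
            continue  # indented text before any section: ignore
        elif headers is None:
            headers = line.split()
        else:
            row = dict.fromkeys(headers, "")
            row.update(zip(headers, line.split()))
            sections[current].append(row)
    return sections


def domains_pending_evacuation(sections: dict) -> list:
    """Return the sorted names of domains holding Evac_pending resources."""
    names = {
        row["DOMAIN"]
        for rows in sections.values()
        for row in rows
        if row.get("STATUS") == "Evac_pending" and row.get("DOMAIN")
    }
    return sorted(names)


if __name__ == "__main__":
    print(domains_pending_evacuation(parse_blacklist(SAMPLE)))
```

On a real system, the text would come from running the command itself, for instance through subprocess.run(["ldm", "list-devices", "-B"], capture_output=True, text=True).stdout, provided that the actual output matches the layout assumed here.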
When the affected guest domain is stopped or rebooted, the Logical Domains Manager attempts to evacuate the faulty resources and internally marks them as blacklisted so that they cannot be assigned in the future. Such resources are no longer shown in the ldm list-devices output. After the pending evacuation completes, the Logical Domains Manager attempts to start the guest domain. However, if the guest domain cannot be started because sufficient resources are not available, the guest domain is marked as “degraded” and the following warning message is logged so that the user can perform a manual recovery:
primary# ldm list
NAME     STATE   FLAGS   CONS  VCPU  MEMORY    UTIL  NORM  UPTIME
primary  active  -n-cv-  UART  368   2079488M  0.1%  0.0%  16h 57m
gd0      bound   -d----  5000  8
Notice: the system is running in a degraded mode as domain <guest> could not be started because required resources were blacklisted and evacuated.
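A monitoring script could flag degraded domains by inspecting the FLAGS column of the ldm list output. The following Python sketch assumes that a d in the second FLAGS position marks a degraded domain, as the gd0 row in this example suggests, and that columns are whitespace-separated. The function name degraded_domains and the embedded SAMPLE text are illustrative. Verify the flag semantics against the ldm man page for your release before relying on this check.

```python
SAMPLE = """\
NAME     STATE   FLAGS   CONS  VCPU  MEMORY    UTIL  NORM  UPTIME
primary  active  -n-cv-  UART  368   2079488M  0.1%  0.0%  16h 57m
gd0      bound   -d----  5000  8
Notice: the system is running in a degraded mode as domain <guest> could not be started because required resources were blacklisted and evacuated.
"""


def degraded_domains(text: str) -> list:
    """Return names of domains whose FLAGS field has 'd' in position 2.

    Assumption: position 2 of FLAGS indicates a degraded domain, as in
    the sample output. Header and Notice lines are skipped.
    """
    result = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        if fields[0] == "NAME" or fields[0].startswith("Notice"):
            continue
        name, _state, flags = fields[:3]
        if len(flags) >= 2 and flags[1] == "d":
            result.append(name)
    return result


if __name__ == "__main__":
    print(degraded_domains(SAMPLE))
```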
When the system is power-cycled, FMA repeats the evacuation requests for resources that are still faulty. The Logical Domains Manager handles those requests by evacuating the faulty resources and internally marking them as blacklisted.
Prior to support for FMA blacklisting, a guest domain that panicked because of a faulty resource might enter a never-ending panic-reboot loop. Because faulty resources are now evacuated and blacklisted when the guest domain is rebooted, this panic-reboot loop is avoided and future attempts to use a faulty resource are prevented.