Go to main content

Oracle® VM Server for SPARC 3.5 Administration Guide

Exit Print View

Updated: November 2017
 
 

Using FMA to Blacklist or Unconfigure Faulty Resources

FMA contacts the Logical Domains Manager when it detects a faulty resource. Then, the Logical Domains Manager attempts to stop using that resource in all running domains. To ensure that a faulty resource cannot be assigned to a domain in the future, FMA adds the resource to a blacklist.

The Logical Domains Manager supports blacklisting only for CPU and memory resources, not for I/O resources.

If a faulty resource is not in use, the Logical Domains Manager removes it from the available resource list, which you can see in the ldm list-devices output. At this time, this resource is internally marked as “blacklisted” so that it cannot be re-assigned to a domain in the future.

If the faulty resource is in use, the Logical Domains Manager attempts to evacuate the resource. To avoid a service interruption on the running domains, the Logical Domains Manager first attempts to use CPU or memory dynamic reconfiguration to evacuate the faulty resource. The Logical Domains Manager remaps a faulted core if a core is free to use as a target. If this “live evacuation” succeeds, the faulty resource is internally marked as blacklisted and is not shown in the ldm list-devices output so that it will not be assigned to a domain in the future.

If the live evacuation fails, the Logical Domains Manager internally marks the faulty resource as “evacuation pending.” The resource is shown as normal in the ldm list-devices output because the resource is still in use on the running domains until the affected guest domains are rebooted or stopped.

You can use the ldm list-devices -B command to view blacklisted resources or resources pending evacuation. The following command shows the blacklisted memory and core resources:

primary# ldm list-devices -B
CORE
ID STATUS DOMAIN
1 Blacklisted
2 Evac_pending ldg1
MEMORY
PA SIZE STATUS DOMAIN
0xa30000000 87G Blacklisted
0x80000000000 128G Evac_pending ldg1

When the affected guest domain is stopped or rebooted, the Logical Domains Manager attempts to evacuate the faulty resources and internally mark them as blacklisted so that the resource cannot be assigned in the future. Such a device is not shown in the ldm output. After the pending evacuation completes, the Logical Domains Manager attempts to start the guest domain. However, if the guest domain cannot be started because sufficient resources are not available, the guest domain is marked as “degraded” and the following warning message is logged for the user intervention to perform the manual recovery.

primary# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    368   2079488M 0.1%  0.0%  16h 57m
gd0              bound      -d----  5000    8

Notice: the system is running in a degraded mode as domain <guest> could not
be started because required resources were blacklisted and evacuated.

When the system is power-cycled, FMA repeats the evacuation requests for resources that are still faulty and the Logical Domains Manager handles those requests by evacuating the faulty resources and internally marking them as blacklisted.

Prior to support for FMA blacklisting, a guest domain that panicked because of a faulty resource might result in a never-ending panic-reboot loop. By using resource evacuation and blacklisting when the guest domain is rebooted, you can avoid this panic-reboot loop and prevent future attempts to use a faulty resource.