Tuning the HA for Logical Domains Fault Monitor

This section describes the probing algorithm and functionality of the HA for Logical Domains fault monitor, and states the conditions, messages, and recovery actions associated with unsuccessful probing.

For conceptual information about fault monitors, see the Oracle Solaris Cluster Concepts Guide.

Resource Properties

The HA for Logical Domains guest domain fault monitor uses the resource properties specified in the resource type SUNW.ldom. Refer to the SUNW.ldom(5) man page for a complete list of resource properties used.

Probing Algorithm and Functionality

The HA for Logical Domains fault monitor is controlled by resource properties that determine how often it probes and how it responds to failures. The default values of these properties determine the preset behavior of the fault monitor and are suitable for most Oracle Solaris Cluster installations. You can modify this preset behavior by performing the following actions:

Setting the interval between fault monitor probes (Thorough_probe_interval)
Setting the timeout for fault monitor probes (Probe_timeout)
Setting the number of times the fault monitor attempts to restart the resource (Retry_count)
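
As an illustration, properties such as these are changed with the clresource set command and its -p option. The sketch below only builds the command line and prints it; it does not run anything. The resource name ldom-rs and the values shown are hypothetical, and the SUNW.ldom(5) man page remains the reference for which properties can be tuned on a given release.

```python
import shlex

def build_set_command(resource, **properties):
    """Return the argv for a `clresource set` call that changes the given properties."""
    argv = ["clresource", "set"]
    for name, value in properties.items():
        # Each property is passed as its own -p name=value pair.
        argv += ["-p", f"{name}={value}"]
    argv.append(resource)
    return argv

# Hypothetical resource name and values, for illustration only.
cmd = build_set_command(
    "ldom-rs",
    Thorough_probe_interval=120,
    Probe_timeout=60,
    Retry_count=3,
)
print(shlex.join(cmd))
```

Building the argument list separately makes it easy to review the exact command before running it on a cluster node.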

The HA for Logical Domains fault monitor checks the domain status within an infinite loop. During each cycle, the fault monitor checks the domain state and reports either a failure or success.

If the probe succeeds, the fault monitor returns to its infinite loop and continues the next cycle of probing and sleeping.

If the fault monitor reports a failure, a request is made to the cluster to restart the resource. Each subsequent failure triggers another restart request. If successive restarts exceed the Retry_count within the Retry_interval, a request is made to fail over the resource group onto a different node.
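
The restart-then-failover behavior described above can be sketched as a counter over a sliding time window. This is a minimal model, not the actual fault monitor code: the class name, method name, and return strings are hypothetical, and the window length is passed in as a parameter (the probe operations listed later in this section name Retry_interval as that window).

```python
from collections import deque

class RestartTracker:
    """Count restart requests inside a sliding time window (illustrative model)."""

    def __init__(self, retry_count, window_seconds):
        self.retry_count = retry_count
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recent restart requests

    def on_probe_failure(self, now):
        """Return "restart" or "failover" for a failure seen at time `now` (seconds)."""
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        # Retry_count restarts already used inside the window: give up locally.
        if len(self._restarts) >= self.retry_count:
            self._restarts.clear()
            return "failover"
        self._restarts.append(now)
        return "restart"

# Hypothetical values: 2 restarts allowed within 300 seconds.
tracker = RestartTracker(retry_count=2, window_seconds=300)
decisions = [tracker.on_probe_failure(t) for t in (0, 60, 120)]
print(decisions)
```

In this model, failures that are spaced further apart than the window never accumulate, so an occasional failure only ever results in a restart.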

Operations of the Logical Domains Probe

The probe checks the domain state every 60 seconds by using the ldm list-domain command.
The ldm list-domain command produces a status line for the domain and is accurate at the instant that the command executes.
The status modes that are considered to be normal operational modes are as follows: active, suspending, resuming, suspended, and starting. Whenever the ldm command reports these status modes, the probe considers that the domain is operating in an acceptable mode.
The status modes that are considered to be restartable modes are as follows: inactive and stopping. These modes are not considered acceptable. If the probe encounters one of them, it requests a restart of the resource.
The probe also requests a restart of the resource if the ldm command reports any unknown status mode.
If the guest domain configuration has changed, the probe updates this information in the Cluster Configuration Repository (CCR).
The probe runs the user-supplied script or binary provided for plugin_probe. If this process fails, the probe requests a restart of the Logical Domains guest domain resource.
If the Logical Domains guest domain resource is repeatedly restarted and subsequently exhausts the Retry_count within the Retry_interval, then a failover is initiated for the resource group onto another node if Failover_enabled is set to TRUE.
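
The status handling in the operations above can be summarized as a small decision function. This is a sketch only: it assumes the status string has already been extracted from the ldm list-domain output, and the function name, return values, and the plugin_ok flag (standing in for the result of the plugin_probe script) are hypothetical.

```python
# Status modes taken from the lists above.
NORMAL_STATES = {"active", "suspending", "resuming", "suspended", "starting"}
RESTARTABLE_STATES = {"inactive", "stopping"}

def probe_action(state, plugin_ok=True):
    """Return "ok" or "restart" for one probe cycle (illustrative model)."""
    state = state.strip().lower()
    if state in RESTARTABLE_STATES:
        return "restart"
    if state not in NORMAL_STATES:
        # Unknown status modes are treated like restartable ones.
        return "restart"
    # The domain state is acceptable; the plugin probe can still fail the cycle.
    return "ok" if plugin_ok else "restart"

for s in ("active", "suspended", "inactive", "stopping", "bogus"):
    print(s, "->", probe_action(s))
```

Treating every unrecognized status as a restart condition mirrors the rule stated above for unknown status modes.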

	Oracle Solaris Cluster Data Service for Oracle VM Server for SPARC Guide