
Oracle® Solaris Cluster Data Service for Oracle VM Server for SPARC Guide

Updated: June 2017

SPARC: Tuning the HA for Oracle VM Server Fault Monitor

This section describes the probing algorithm of the HA for Oracle VM Server fault monitor, and the conditions, messages, and recovery actions that are associated with an unsuccessful probe.


Note -  Before you perform any maintenance or modification activities on the domain, you must disable monitoring. Re-enable the resource monitor after the maintenance tasks are complete.

For conceptual information about fault monitors, see the Oracle Solaris Cluster 4.3 Concepts Guide.

Resource Properties

The HA for Oracle VM Server logical domain fault monitor uses the resource properties specified in the resource type SUNW.ldom. Refer to the SUNW.ldom(5) man page for a complete list of resource properties used.
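
As a quick reference, you can display the current property values from the command line. The commands below are a sketch only; the resource name ldg1-rs is a hypothetical placeholder, shown in the same root-prompt style used elsewhere in this guide.

```shell
# List all property values for a logical domain resource
# (ldg1-rs is a placeholder name):
# clresource show -v ldg1-rs

# Display the property definitions registered for the resource type:
# clresourcetype show SUNW.ldom
```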

Probing Algorithm and Functionality

The HA for Oracle VM Server fault monitor is controlled by extension properties that set the probing frequency. The default values of these properties determine the preset behavior of the fault monitor and are suitable for most Oracle Solaris Cluster installations. You can modify this preset behavior by performing the following actions:

  • Setting the interval between fault monitor probes (Thorough_probe_interval)

  • Setting the timeout for fault monitor probes (Probe_timeout)

  • Setting the number of times the fault monitor attempts to restart the resource (Retry_count)
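
These properties can be tuned with the clresource command. The following is a sketch, assuming a logical domain resource named ldg1-rs (a placeholder); the values shown are illustrative, not recommendations.

```shell
# Probe every 120 seconds, allow 60 seconds per probe, and permit two
# restarts before failover is considered (ldg1-rs is a placeholder):
# clresource set -p Thorough_probe_interval=120 \
#     -p Probe_timeout=60 \
#     -p Retry_count=2 \
#     ldg1-rs
```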

The HA for Oracle VM Server fault monitor checks the domain status within an infinite loop. During each cycle, the fault monitor checks the domain state and reports either a failure or success.

If the fault monitor is successful, it returns to its infinite loop and continues the next cycle of probing and sleeping.

If the fault monitor reports a failure, a request is made to the cluster to restart the resource. Each subsequent failure triggers another restart request. If the number of successive restarts exceeds the Retry_count within the Retry_interval, a request is made to fail over the resource group onto a different node.
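
This restart-then-failover escalation can be sketched in shell. The sketch is an illustrative model only, not the fault monitor's actual implementation: probe_failed stands in for the real domain-state probe, and retry_count mirrors the Retry_count resource property.

```shell
#!/bin/sh
# Illustrative model of the escalation logic; probe_failed and
# retry_count are stand-ins, not part of the real fault monitor.

retry_count=2      # mirrors the Retry_count resource property
restarts=0         # restarts requested so far within the retry interval
action=none

probe_failed() {
    return 0       # simulate a probe that keeps reporting failure
}

while probe_failed; do
    if [ "$restarts" -lt "$retry_count" ]; then
        restarts=$((restarts + 1))
        action=restart        # ask the cluster to restart the resource
    else
        action=failover       # Retry_count exhausted: fail over the group
        break
    fi
done

echo "$action"
```

With a persistent failure, the sketch requests Retry_count restarts and then escalates to a failover request.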

Operations of the Oracle VM Server for SPARC Probe

  • The probe checks the domain state every 60 seconds by using the ldm list-domain command.

  • The ldm list-domain command produces a status line for the domain and is accurate at the instant that the command executes.

  • The status modes that are considered to be normal operational modes are as follows: active, suspending, resuming, suspended, and starting. Whenever the ldm command reports these status modes, the probe considers that the domain is operating in an acceptable mode.

  • The status modes that are considered to be restartable modes are as follows: inactive and stopping. These modes are not acceptable; if one of them is encountered, the probe requests a restart of the resource.

  • The probe also requests a restart of the resource if the ldm command reports any unknown status mode.

  • If the logical domain configuration has changed, the probe updates this information in the CCR (cluster configuration repository) during the next probe cycle. Alternatively, you can perform the following steps to update the CCR with the changed configuration immediately:

    1. Make a dummy update to the resource. For example:

      # clresource set \
      -p R_DESCRIPTION="Oracle Solaris Cluster HA for Oracle VM Server SPARC Guest Domains - Modified" \
      ldg1-rs
    2. Verify that the configuration change was applied successfully.

      From the node where the guest domain is online, type the following commands, where ldg1-rs is the logical domain resource name and ldg1 is the guest domain name:

      # (/usr/cluster/lib/sc/ccradm showkey --key xml_ldg1-rs ldom_domain_config | \
      xmllint --format -) > /var/tmp/ldg1_ccr.xml
      # ldm list-constraints -x ldg1 > /var/tmp/ldg1_current.xml
      # diff /var/tmp/ldg1_current.xml /var/tmp/ldg1_ccr.xml
  • The probe runs the user-supplied script or binary provided for plugin_probe. If this command fails, the probe restarts the logical domain resource. The exit status of the plugin_probe command determines the severity of the application failure. This exit status, called the plugin_probe status, must be an integer between 0 (success) and 100 (complete failure). The plugin_probe status can also be the special value 201, which results in an immediate failover of the application unless Failover_enabled is set to FALSE.

  • If the logical domain resource is repeatedly restarted and exhausts the Retry_count within the Retry_interval, and Failover_enabled is set to TRUE, a failover of the resource group onto another node is initiated.
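
The state classification described above can be sketched as a small shell function. This is illustrative only; classify_state is a hypothetical helper, and on a real system the state would come from the STATE column of ldm list-domain output rather than a literal argument.

```shell
#!/bin/sh
# Illustrative classification of ldm list-domain STATE values, following
# the rules above; classify_state is a hypothetical helper, not part of
# the actual probe.

classify_state() {
    case "$1" in
        active|suspending|resuming|suspended|starting)
            echo ok ;;        # acceptable operational modes
        inactive|stopping)
            echo restart ;;   # restartable modes: request a resource restart
        *)
            echo restart ;;   # unknown modes also trigger a restart
    esac
}

classify_state active       # acceptable: the probe takes no action
classify_state stopping     # restartable: the probe requests a restart
```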