Enable High-Availability for Self-Hosted Engine Host

The host that houses the self-hosted engine is not highly available by default. Since the self-hosted engine runs inside a virtual machine on a host, if you do not configure high-availability for the host, then virtual machine recovery after a host crash is not possible.

If you want the self-hosted engine host to be responsive and available when unexpected failures happen, you should use fencing. Fencing allows the host to react to unexpected failures and enforce power saving, load balancing, and virtual machine availability policies. You should configure the fencing parameters for your host’s power management device and test their correctness from time to time.

A Non Operational host is different from a Non Responsive host. A Non Operational host can communicate with the Manager but has an incorrect configuration, for example, a missing logical network. A Non Responsive host cannot communicate with the Manager.

In a fencing operation, a non-responsive host is rebooted. If the host does not return to an active status within a prescribed time, it remains non-responsive pending manual intervention and troubleshooting.

Power management operations can be performed by the Manager after it reboots, by a proxy host, or manually in the Administration Portal. All the virtual machines running on the non-responsive host are stopped, and highly available virtual machines are restarted on a different host. At least two hosts are required for power management operations.

Important:

If a host runs virtual machines that are highly available, power management must be enabled and configured.

Configure Power Management and Fencing for Host

The Manager uses a proxy to send power management commands to a host power management device because the engine does not communicate directly with fence agents. The host agent (VDSM) executes power management device actions and another host in the environment is used as a fencing proxy. This means that you must have at least two hosts for power management operations.

When you configure a fencing proxy host, make sure the host is in:

- the same cluster as the host requiring fencing.
- the same data center as the host requiring fencing.
- Up or Maintenance status, so that it remains viable as a proxy.
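
The following is a minimal sketch of how these requirements narrow down the candidate proxies. It is not the Manager's implementation. It assumes the default search order (cluster first, then data center) and reads a hypothetical inventory format of one host per line: name, cluster, data center, status. The function name `pick_fence_proxy` and the status strings are illustrative only.

```shell
#!/bin/sh
# Sketch: choose a fencing proxy for a host that requires fencing.
# Input on stdin: one candidate per line as "name cluster datacenter status"
# (a hypothetical inventory format, not something the Manager produces).

# pick_fence_proxy TARGET_CLUSTER TARGET_DC
# Prints the first viable host in the target's cluster. If there is none,
# prints the first viable host in the target's data center.
# Returns 1 when no viable proxy exists.
pick_fence_proxy() {
    awk -v cluster="$1" -v dc="$2" '
        # Only hosts in Up or Maintenance status are viable proxies.
        $4 != "Up" && $4 != "Maintenance" { next }
        $2 == cluster && $3 == dc && in_cluster == "" { in_cluster = $1 }
        $3 == dc && in_dc == ""                       { in_dc = $1 }
        END {
            if (in_cluster != "")  print in_cluster
            else if (in_dc != "")  print in_dc
            else                   exit 1
        }
    '
}

# Example: host2 is skipped because of its status, host3 is only in the
# same data center, and host4 is in the same cluster, so host4 is chosen.
pick_fence_proxy cluster1 dc1 <<'EOF'
host2 cluster1 dc1 NonResponsive
host3 cluster2 dc1 Up
host4 cluster1 dc1 Maintenance
EOF
```

The host that requires fencing does not need to be excluded explicitly in this sketch, because a non-responsive host fails the status check.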

Power management operations can be performed in three ways:

- by the Manager after it reboots
- by a proxy host
- manually in the Administration Portal

To configure power management and fencing on a host:

Click Compute and select Hosts.
Select a host and click Edit.
Click the Power Management tab.
Check Enable Power Management to enable the rest of the fields.
Check Kdump integration to prevent the host from being fenced while it performs a kernel crash dump. Kdump integration is enabled by default.

Important:

If you enable or disable Kdump integration on an existing host, you must reinstall the host.
(Optional) Check Disable policy control of power management if you do not want your host’s power management to be controlled by the scheduling policy of the host's cluster.
To configure a fence agent, click the plus sign (+) next to Add Fence Agent.

The Edit fence agent pane opens.
Enter the Address (IP Address or FQDN) to access the host's power management device.
Enter the User Name and Password of the account used to access the power management device.
Select the power management device Type from the drop-down list.
Enter the Port (SSH) number used by the power management device to communicate with the host.
Enter the Slot number used to identify the blade of the power management device.
Enter the Options for the power management device. Use a comma-separated list of key-value pairs.
- If you leave the Options field blank, you can use both IPv4 and IPv6 addresses.
- To use only IPv4 addresses, enter inet4_only=1
- To use only IPv6 addresses, enter inet6_only=1
Check Secure to enable the power management device to connect securely to the host.

You can use ssh, ssl, or any other authentication protocol your power management device supports.
Click Test to ensure the settings are correct and then click OK.

If the test succeeds, the message "Test Succeeded, Host Status is: on" is displayed.

Attention:

Power management parameters (userid, password, options, etc.) are tested by the Manager only during setup, and manually after that. If you ignore alerts about incorrect parameters, or if the parameters are changed on the power management hardware without a corresponding change in the Manager, fencing is likely to fail when it is most needed.
Fence agents are sequential by default. To change the sequence in which the fence agents are used:
1. Review your fence agent order in the Agents by Sequential Order field.
2. To make two fence agents concurrent, next to one fence agent click the Concurrent with drop-down list and select the other fence agent.
  
  You can add additional fence agents to this concurrent fence agent group.
Expand the Advanced Parameters and use the up and down buttons to specify the order in which the Manager searches the host’s cluster and dc (data center) for a power management proxy.
To add an additional power management proxy:
1. Click the plus sign (+) next to Add Power Management Proxy.
  
  The Select fence proxy preference type to add pane opens.
2. Select a power management proxy from the drop-down list and then click OK.
  
  Your new proxy displays in the Power Management Proxy Preference list.
Note:

By default, the Manager searches for a fencing proxy within the same cluster as the host. If the Manager cannot find a fencing proxy within the cluster, it searches the data center.
Click OK.

In the list of hosts, the exclamation mark next to the host’s name disappears, signifying that you have successfully configured power management and fencing.
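
Before you paste a value into the Options field, it can help to check its format. The following is a minimal sketch of such a check, not part of the Manager or of any fence agent. The function name `validate_fence_options` is hypothetical, and rejecting a string that sets both inet4_only=1 and inet6_only=1 is an assumption made for this example.

```shell
#!/bin/sh
# validate_fence_options OPTIONS
# Returns 0 if OPTIONS is blank or a comma-separated list of key=value pairs.
validate_fence_options() {
    vfo_rest="$1"
    vfo_v4=0
    vfo_v6=0
    # Reject a trailing comma, which would leave an empty last item.
    case "$vfo_rest" in
        *,) echo "invalid options: trailing comma" >&2; return 1 ;;
    esac
    while [ -n "$vfo_rest" ]; do
        vfo_pair="${vfo_rest%%,*}"             # text before the first comma
        case "$vfo_rest" in
            *,*) vfo_rest="${vfo_rest#*,}" ;;  # drop the item just read
            *)   vfo_rest="" ;;
        esac
        case "$vfo_pair" in
            ?*=?*) ;;                          # non-empty key and value
            *) echo "invalid option: '$vfo_pair'" >&2; return 1 ;;
        esac
        case "$vfo_pair" in
            inet4_only=1) vfo_v4=1 ;;
            inet6_only=1) vfo_v6=1 ;;
        esac
    done
    # Assumption: asking for IPv4-only and IPv6-only together is a mistake.
    if [ "$vfo_v4" = 1 ] && [ "$vfo_v6" = 1 ]; then
        echo "invalid options: inet4_only and inet6_only are both set" >&2
        return 1
    fi
    return 0
}

# A blank string is valid and leaves both IPv4 and IPv6 available.
validate_fence_options "" && echo "blank options are valid"
validate_fence_options "inet4_only=1" && echo "IPv4-only options are valid"
```

The check only covers the syntax described in this procedure. Whether a given key is understood depends on the fence agent for your power management device.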

Prevent Host Fencing During Boot

After you configure power management and fencing, when you start the Manager it automatically attempts to fence non-responsive hosts that have power management enabled after the quiet time (5 minutes by default) has elapsed. You can opt to extend the quiet time to prevent, for example, a scenario where the Manager attempts to fence hosts while they boot up. This can happen after a data center outage because a host’s boot process is normally longer than the Manager boot process.

You can configure quiet time using the engine-config command option DisableFenceAtStartupInSec:

# engine-config -s DisableFenceAtStartupInSec=number
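
As an illustration, the sketch below extends the quiet time to 10 minutes. The option takes seconds, so the value is converted first. The helper names `run` and `set_quiet_time` are hypothetical, and the script defaults to a dry run that only prints the commands. It assumes that the ovirt-engine service must be restarted before a changed engine-config value takes effect.

```shell
#!/bin/sh
# Sketch: extend the fencing quiet time. DRY_RUN defaults to 1, so the
# commands are only printed. Set DRY_RUN=0 on the Manager machine to run them.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper that prints a command in dry-run mode,
# or executes it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_quiet_time MINUTES: convert minutes to seconds and set the option.
set_quiet_time() {
    case "$1" in
        # Reject empty input, non-digits, and leading zeros.
        ''|*[!0-9]*|0?*) echo "usage: set_quiet_time MINUTES" >&2; return 1 ;;
    esac
    run engine-config -s "DisableFenceAtStartupInSec=$(( $1 * 60 ))"
    # Assumption: the new value is picked up when the engine restarts.
    run systemctl restart ovirt-engine
}

set_quiet_time 10
```

In dry-run mode, the last line prints the engine-config command with DisableFenceAtStartupInSec=600, followed by the restart command.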

Check Fencing Parameters

To automatically check the fencing parameters, you can configure the PMHealthCheckEnabled (false by default) and PMHealthCheckIntervalInSec (3600 sec by default) engine-config options.

# engine-config -s PMHealthCheckEnabled=True

# engine-config -s PMHealthCheckIntervalInSec=number

When PMHealthCheckEnabled is set to true, the Manager checks all host agents at the interval specified by PMHealthCheckIntervalInSec and raises warnings if it detects issues.
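
The two options can be set together. The following sketch enables the health check with a 30-minute interval and then reads the values back with engine-config -g. The helper names `run` and `enable_pm_health_check` are hypothetical, the interval is only an example value, and the script defaults to a dry run that prints the commands instead of executing them. It assumes that the ovirt-engine service must be restarted for the new values to take effect.

```shell
#!/bin/sh
# Sketch: enable the periodic power management health check.
# DRY_RUN defaults to 1, so the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper that prints a command in dry-run mode,
# or executes it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# enable_pm_health_check INTERVAL_SEC
enable_pm_health_check() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: enable_pm_health_check INTERVAL_SEC" >&2
                     return 1 ;;
    esac
    run engine-config -s PMHealthCheckEnabled=True
    run engine-config -s "PMHealthCheckIntervalInSec=$1"
    # Assumption: the new values are picked up when the engine restarts.
    run systemctl restart ovirt-engine
    # Read the values back to confirm what is stored.
    run engine-config -g PMHealthCheckEnabled
    run engine-config -g PMHealthCheckIntervalInSec
}

enable_pm_health_check 1800
```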