Sun N1 Grid Engine 6.1 Administration Guide

Configuring Shadow Master Hosts

Shadow master hosts are machines in the cluster that can detect a failure of the master daemon and take over the role of master host. When the shadow master daemon, sge_shadowd, detects that the master daemon sge_qmaster has failed abnormally, it starts a new sge_qmaster on the host where the shadow master daemon is running.


Note –

If the master daemon is shut down gracefully, the shadow master daemon does not start up. If you want the shadow master daemon to take over after you shut down the master daemon gracefully, remove the lock file that is located in the sge_qmaster spool directory. The default location of this spool directory is sge-root/cell/spool/qmaster.
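
For example, to allow a shadow master takeover after a graceful shutdown, the lock file could be removed as follows. This is a minimal sketch that assumes the default cell name default, the default spool directory location, and that the lock file is named lock; adjust the path for your installation:

  # rm sge-root/default/spool/qmaster/lock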


The automatic failover start of sge_qmaster on a shadow master host takes approximately one minute. During this time, an error message is returned whenever a grid engine system command is run.


Note –

The file sge-root/cell/common/act_qmaster contains the name of the host actually running the sge_qmaster daemon.
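
For example, to see which host is currently acting as the master host, you can display this file (assuming the default cell name default):

  % cat sge-root/default/common/act_qmaster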


Shadow Master Host Requirements

To prepare a host as a shadow master, the following requirements must be met:

  - The host must run the shadow master daemon, sge_shadowd.

  - The host must share sge_qmaster's status information, job configuration, and queue configuration that is logged to disk. In particular, the host needs read/write access to the master daemon's spool directory and to the directory sge-root/cell/common.

  - The host name must be included in the shadow master hostname file, sge-root/cell/common/shadow_masters.

As soon as these requirements are met, the shadow master host facility is activated for this host. You do not have to restart the grid engine system daemons to activate the feature.
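
For example, you can quickly verify on a shadow master candidate that it is listed in the shadow master hostname file and that sge_shadowd is running. This is a sketch, assuming the default cell name default:

  % cat sge-root/default/common/shadow_masters
  % ps -ef | grep sge_shadowd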

Shadow Master Hosts File

The shadow master host file, sge-root/cell/common/shadow_masters, contains the name of the primary master host and the names of the shadow master hosts.

The format of the shadow master hostname file is one host name per line.

The order of the shadow master hosts is significant. The primary master host is listed on the first line of the file. If the primary master host fails, the shadow master host defined on the second line takes over. If that shadow master host also fails, the shadow master host defined on the third line takes over, and so forth.
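
For illustration, a shadow_masters file for a hypothetical cluster might look like the following, where nodeA is the primary master host and nodeB and nodeC are shadow master hosts in failover order (the host names are placeholders):

  nodeA
  nodeB
  nodeC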

Starting Shadow Master Hosts

To start a shadow sge_qmaster, the system must be sure either that the old sge_qmaster has terminated, or that it will terminate without performing actions that interfere with the newly started shadow sge_qmaster.

In very rare circumstances, you might not be able to determine that the old sge_qmaster has terminated or that it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts. See Chapter 9, Fine Tuning, Error Messages, and Troubleshooting. Also, any attempt to open a TCP connection to a sge_qmaster daemon fails permanently. If this occurs, make sure that no master daemon is running, and then restart sge_qmaster manually on any of the shadow master machines. See Restarting Daemons From the Command Line.
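
A minimal sketch of this manual recovery, assuming the default cell name default and a standard installation in which the sgemaster startup script resides in the common directory (the exact script location and invocation can differ between installations): first confirm on each host that no sge_qmaster process is still running, then start the master daemon on one of the shadow master machines.

  % ps -ef | grep sge_qmaster
  # sge-root/default/common/sgemaster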

Configuring Shadow Master Hosts Environment Variables

There are three environment variables that affect the takeover time for a shadow master:

  - SGE_CHECK_INTERVAL: the interval, in seconds, at which the sge_shadowd daemon checks the heartbeat file for changes.

  - SGE_GET_ACTIVE_INTERVAL: the time, in seconds, that the sge_shadowd daemon waits for the heartbeat file to be updated before it takes over.

  - SGE_DELAY_TIME: the time, in seconds, that the sge_shadowd daemon waits before it retries a takeover after a failed takeover attempt.

These variables interact in the following way.

  1. The master host updates the heartbeat file every 30 seconds.

  2. The sge_shadowd daemon checks for changes to the heartbeat file every SGE_CHECK_INTERVAL seconds. This value must therefore be greater than 30 seconds.

  3. If the sge_shadowd daemon notices that the heartbeat file has been updated, it waits another SGE_CHECK_INTERVAL seconds before checking the heartbeat file again.

  4. If the sge_shadowd daemon notices that the heartbeat file has not been updated, it waits for the number of seconds defined by the SGE_GET_ACTIVE_INTERVAL variable to expire. This delay ensures that the sge_shadowd daemon is not too aggressive in trying to take over and gives the master host some leeway to update the heartbeat file.

  5. When the SGE_GET_ACTIVE_INTERVAL has expired, the sge_shadowd daemon takes over if the heartbeat file has still not been updated.

A reasonable configuration might be to set SGE_CHECK_INTERVAL to 45 seconds and SGE_GET_ACTIVE_INTERVAL to 90 seconds. With these values, the takeover occurs after about two minutes. To verify the operation of the shadow host after you have configured these environment variables, simulate a master host failure, for example by disconnecting the master host's network cable. Shutting down the master daemon gracefully does not trigger a takeover.
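
As an illustration, these values could be placed in the environment before sge_shadowd is started on the shadow master host. This is a sketch for a Bourne-type shell; how the environment is propagated to the daemon depends on how sge_shadowd is started at your site, and the binary path shown uses the usual sge-root/bin/arch layout as a placeholder:

  # SGE_CHECK_INTERVAL=45
  # SGE_GET_ACTIVE_INTERVAL=90
  # export SGE_CHECK_INTERVAL SGE_GET_ACTIVE_INTERVAL
  # sge-root/bin/arch/sge_shadowd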