There are three environment variables which affect the takeover time for a shadow master:
SGE_DELAY_TIME - This variable controls how long sge_shadowd pauses after a failed takeover bid. It is used only when multiple sge_shadowd instances are contending to become the master. The default is 600 seconds.
SGE_CHECK_INTERVAL - This variable controls how often sge_shadowd checks the heartbeat file. The default is 60 seconds.
SGE_GET_ACTIVE_INTERVAL - This variable controls how long a sge_shadowd instance waits before it tries to take over once it finds that the heartbeat file has not changed.
These variables interact in the following way.
The master host updates the heartbeat file every 30 seconds.
The sge_shadowd daemon checks the heartbeat file for changes every SGE_CHECK_INTERVAL seconds, so this value must be greater than 30 seconds.
If the sge_shadowd daemon finds that the heartbeat file has been updated, it waits until the next scheduled check.
If the sge_shadowd daemon finds that the heartbeat file has not been updated, it waits for the number of seconds defined by the SGE_GET_ACTIVE_INTERVAL variable. This wait keeps the sge_shadowd daemon from being too aggressive in trying to take over and allows the master host some leeway in updating the heartbeat file.
When the SGE_GET_ACTIVE_INTERVAL has expired, the sge_shadowd daemon takes over if the heartbeat file has still not been updated.
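The sequence above can be sketched as a small simulation. This is only an illustrative model of the steps as described here, not the actual sge_shadowd implementation: the function name simulate_takeover, its parameters, and the idea of counting heartbeat writes are all assumptions made for the example.

```python
def simulate_takeover(check_interval, get_active_interval,
                      master_dies_at, heartbeat_period=30):
    """Return the simulated time (seconds) at which the shadow takes over.

    Hypothetical model: the master writes the heartbeat file every
    heartbeat_period seconds until master_dies_at, then stops.
    """
    def heartbeat_count(t):
        # Number of heartbeat writes the master completed by time t.
        return int(min(t, master_dies_at) // heartbeat_period)

    now = 0
    seen = heartbeat_count(now)
    while True:
        # Sleep until the next scheduled check of the heartbeat file.
        now += check_interval
        current = heartbeat_count(now)
        if current != seen:
            # Heartbeat changed: master is alive, go back to waiting.
            seen = current
            continue
        # Heartbeat unchanged: give the master some leeway.
        now += get_active_interval
        if heartbeat_count(now) == seen:
            # Still unchanged after the grace period: take over.
            return now
        seen = heartbeat_count(now)


print(simulate_takeover(45, 90, master_dies_at=100))  # 225
```

In this run the master stops updating the heartbeat file at 100 seconds and the simulated shadow takes over at 225 seconds, roughly two minutes later.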
A reasonable configuration might be to set SGE_CHECK_INTERVAL to 45 seconds and SGE_GET_ACTIVE_INTERVAL to 90 seconds, so that the takeover occurs about 2 minutes after the master fails. If you want to check the operation of the shadow host after you have configured these environment variables, you will have to unplug the master host's network cable to simulate a failure.
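As a rough illustration of the 45/90 example, the following sketch validates a candidate configuration against the 30-second heartbeat rule and adds the two intervals to estimate the takeover time. The helper names shadowd_env and approx_takeover_seconds are hypothetical, and the printed export lines are only one possible way of setting the variables in the environment that starts sge_shadowd.

```python
HEARTBEAT_PERIOD = 30  # seconds between master heartbeat updates (from the text)


def shadowd_env(check_interval, get_active_interval, delay_time=600):
    """Build the environment settings, rejecting a too-short check interval."""
    if check_interval <= HEARTBEAT_PERIOD:
        raise ValueError("SGE_CHECK_INTERVAL must be greater than 30 seconds")
    return {
        "SGE_CHECK_INTERVAL": str(check_interval),
        "SGE_GET_ACTIVE_INTERVAL": str(get_active_interval),
        "SGE_DELAY_TIME": str(delay_time),
    }


def approx_takeover_seconds(env):
    """Estimate: one check interval plus the grace period before takeover."""
    return int(env["SGE_CHECK_INTERVAL"]) + int(env["SGE_GET_ACTIVE_INTERVAL"])


env = shadowd_env(45, 90)
for name, value in env.items():
    print(f"export {name}={value}")
print(approx_takeover_seconds(env))  # 135
```

The estimate of 135 seconds matches the "about 2 minutes" figure; the actual delay depends on where in the check cycle the master fails.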