11 Using the Service Guardian

This chapter provides instructions for using and configuring the service guardian to detect and resolve deadlocked service threads and includes instructions for implementing custom failure policies.

This chapter includes the following sections:

Overview
Configuring the Service Guardian
Issuing Manual Guardian Heartbeats

11.1 Overview

The service guardian is a mechanism that detects and attempts to resolve deadlocks in Coherence threads. Deadlocked threads on a member can cause problems that are visible to the rest of the cluster, such as the inability to add new nodes to the cluster and the inability to service requests from nodes that are already in the cluster.

The service guardian receives periodic heartbeats that are issued by Coherence-owned and created threads. Should a thread fail to issue a heartbeat before the configured timeout, the service guardian takes corrective action. Both the timeout and corrective action (recovery) can be configured as required.

Note:

The term deadlock does not necessarily indicate a true deadlock; a thread that does not issue a timely heartbeat may be executing a long-running process or waiting on a slow resource. The service guardian cannot distinguish a deadlocked thread from a slow one.

Interfaces That Are Executed By Coherence

Implementations of the following interfaces are executed by Coherence-owned threads. Any processing in an implementation that exceeds the configured guardian timeout results in the service guardian attempting to recover the thread. The list is not exhaustive and only provides the most common interfaces that are implemented by end users.

com.tangosol.net.Invocable
com.tangosol.net.cache.CacheStore
com.tangosol.util.Filter
com.tangosol.util.InvocableMap.EntryAggregator
com.tangosol.util.InvocableMap.EntryProcessor
com.tangosol.util.MapListener
com.tangosol.util.MapTrigger

Understanding Recovery

The service guardian's recovery mechanism uses a series of steps to determine if a thread is deadlocked. Corrective action is taken if the service guardian concludes that the thread is deadlocked. The action to take can be configured and custom actions can be created if required. The recovery mechanism is outlined below:

Soft Timeout – The recovery mechanism first attempts to interrupt the thread just before the configured timeout is reached. The following example log message demonstrates a soft timeout message:

```
<Error> (thread=DistributedCache, member=1): Attempting recovery (due to soft timeout) of Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper(com.tangosol.examples.rwbm.TimeoutTest),5,WriteBehindThread:CacheStoreWrapper(com.tangosol.examples.rwbm.TimeoutTest)]", State=Running}
```

If the thread can be interrupted and it results in a heartbeat, normal processing resumes.

Hard Timeout – The recovery mechanism attempts to stop a thread after the configured timeout is reached. The following example log message demonstrates a hard timeout message:

```
<Error> (thread=DistributedCache, member=1): Terminating guarded execution (due to hard timeout) of Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper(com.tangosol.examples.rwbm.TimeoutTest),5,WriteBehindThread:CacheStoreWrapper(com.tangosol.examples.rwbm.TimeoutTest)]", State=Running}
```

Lastly, if the thread cannot be stopped, the recovery mechanism performs an action based on the configured failure policy. Actions that can be performed include: shutting down the cluster service, shutting down the JVM, and performing a custom action. The following example log message demonstrates an action taken by the recovery mechanism:
```
<Error> (thread=Termination Thread, member=1): Write-behind thread timed out; 
stopping the cache service
```
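
The escalation described above can be modeled in a few lines of plain Java. The following sketch is illustrative only and is not Coherence's implementation: the GuardianSketch class, its Stage values, and the softMillis threshold (standing in for "just before the configured timeout") are assumptions made for this example. Time is passed in explicitly so that the behavior is deterministic.

```java
/**
 * Illustrative model of the recovery steps; NOT Coherence's guardian code.
 * All names in this class are hypothetical.
 */
final class GuardianSketch {
    enum Stage { HEALTHY, SOFT_TIMEOUT, HARD_TIMEOUT, FAILURE_POLICY }

    private final long softMillis;  // assumed point shortly before the timeout
    private final long hardMillis;  // the configured guardian timeout
    private long lastHeartbeat;     // time of the most recent heartbeat

    GuardianSketch(long softMillis, long hardMillis, long now) {
        if (softMillis <= 0 || softMillis > hardMillis) {
            throw new IllegalArgumentException("expected 0 < softMillis <= hardMillis");
        }
        this.softMillis = softMillis;
        this.hardMillis = hardMillis;
        this.lastHeartbeat = now;
    }

    /** Called by the guarded thread to report that it is making progress. */
    void heartbeat(long now) {
        lastHeartbeat = now;
    }

    /**
     * Decides which recovery step applies at the given time.
     * stopSucceeds models whether the thread could be stopped at the hard timeout.
     */
    Stage check(long now, boolean stopSucceeds) {
        long silentFor = now - lastHeartbeat;
        if (silentFor < softMillis) {
            return Stage.HEALTHY;          // heartbeat is recent enough
        }
        if (silentFor < hardMillis) {
            return Stage.SOFT_TIMEOUT;     // step 1: try to interrupt the thread
        }
        // step 2: try to stop the thread; step 3: fall back to the failure policy
        return stopSucceeds ? Stage.HARD_TIMEOUT : Stage.FAILURE_POLICY;
    }
}
```

In this model, a heartbeat that arrives after the soft timeout returns the thread to the HEALTHY stage, which mirrors the statement that normal processing resumes when an interrupt results in a heartbeat.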

11.2 Configuring the Service Guardian

The service guardian is enabled out of the box and has two configurable items: the timeout value and the failure policy. The timeout value is the length of time the service guardian waits to receive a heartbeat from a thread before starting recovery. The failure policy is the corrective action that the service guardian takes after it concludes that the thread is deadlocked.

11.2.1 Setting the Guardian Timeout

The service guardian timeout can be set in three different ways based on the level of granularity that is required:

All threads – This option allows a single timeout value to be applied to all Coherence-owned threads on a cluster node. This is the out-of-box configuration and is set at 305000 milliseconds by default.
Threads per service type – This option allows different timeout values to be set for specific service types. The timeout value is applied to the threads of all instances of that service type. If a timeout is not specified for a particular service type, then the timeout defaults to the value that is set for all threads.
Threads per service instance – This option allows different timeout values to be set for specific service instances. If a timeout is not set for a specific service instance, then the service's timeout value, if specified, is used; otherwise, the timeout that is set for all threads is used.

Setting the timeout value to 0 stops threads from being guarded. In general, the service guardian timeout value should be set equal to or greater than the timeout value for packet delivery.
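
The fallback order among the three levels can be expressed as a short lookup. The following sketch is a plain Java illustration and is not part of the Coherence API: the class and method names are hypothetical, and a null argument stands for "not configured at this level".

```java
/**
 * Hypothetical helper that mirrors the documented fallback order:
 * service instance, then service type, then all threads.
 */
final class TimeoutResolutionSketch {
    /** Out-of-box timeout for all threads, in milliseconds. */
    static final long DEFAULT_ALL_THREADS_MILLIS = 305_000L;

    /** Returns the timeout that applies; null means "not set at this level". */
    static long resolve(Long instanceMillis, Long serviceTypeMillis, Long allThreadsMillis) {
        if (instanceMillis != null) {
            return instanceMillis;       // most specific setting wins
        }
        if (serviceTypeMillis != null) {
            return serviceTypeMillis;    // then the service type setting
        }
        if (allThreadsMillis != null) {
            return allThreadsMillis;     // then the setting for all threads
        }
        return DEFAULT_ALL_THREADS_MILLIS;
    }

    /** A timeout of 0 means that the threads are not guarded. */
    static boolean isGuarded(long timeoutMillis) {
        return timeoutMillis != 0;
    }
}
```

For example, resolve(null, 120000L, 60000L) yields 120000, because the service type setting takes precedence over the setting for all threads.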

Note:

The guardian timeout can also be used for cache store implementations that are configured with a read-write-backing-map scheme. In this case, the <cachestore-timeout> element is set to 0, which defaults the timeout to the guardian timeout. See "read-write-backing-map-scheme".

11.2.1.1 Setting the Guardian Timeout for All Threads

To set the guardian timeout for all threads in a cluster node, add a <timeout-milliseconds> element to an operational override file within the <service-guardian> element. The following example sets the timeout value to 120000 milliseconds:

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config
   coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <timeout-milliseconds>120000</timeout-milliseconds>
      </service-guardian>
   </cluster-config>
</coherence>

The <timeout-milliseconds> value can also be set using the tangosol.coherence.guard.timeout system property.

11.2.1.2 Setting the Guardian Timeout Per Service Type

To set the guardian timeout per service type, override the service's guardian-timeout initialization parameter in an operational override file. The following example sets the guardian timeout for the DistributedCache service to 120000 milliseconds:

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config
   coherence-operational-config.xsd">
   <cluster-config>
      <services>
         <service id="3">
            <init-params>
               <init-param id="17">
                  <param-name>guardian-timeout</param-name>
                  <param-value>120000</param-value>
               </init-param>
            </init-params>
         </service>
      </services>
   </cluster-config>
</coherence>

The guardian-timeout initialization parameter can be set for the DistributedCache, ReplicatedCache, OptimisticCache, Invocation, and Proxy services. Refer to the tangosol-coherence.xml file that is located in the coherence.jar file for the correct service ID and initialization parameter ID to use when overriding the guardian-timeout parameter for a service.

Each of these services also has a corresponding system property that sets the guardian timeout:

tangosol.coherence.distributed.guard.timeout
tangosol.coherence.replicated.guard.timeout
tangosol.coherence.optimistic.guard.timeout
tangosol.coherence.invocation.guard.timeout
tangosol.coherence.proxy.guard.timeout
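
System properties are normally passed to the JVM with the -D option on the command line. Where that is not convenient, they can in principle be set programmatically, on the assumption that this happens before Coherence reads its configuration. The following sketch collects the property names documented in this chapter; the GuardTimeoutProperty enum and its apply method are hypothetical helper names, not part of the Coherence API.

```java
/**
 * Hypothetical helper that wraps the documented guardian timeout properties.
 * Assumption: apply() is called before any Coherence service is started.
 */
enum GuardTimeoutProperty {
    ALL_THREADS("tangosol.coherence.guard.timeout"),
    DISTRIBUTED("tangosol.coherence.distributed.guard.timeout"),
    REPLICATED("tangosol.coherence.replicated.guard.timeout"),
    OPTIMISTIC("tangosol.coherence.optimistic.guard.timeout"),
    INVOCATION("tangosol.coherence.invocation.guard.timeout"),
    PROXY("tangosol.coherence.proxy.guard.timeout");

    /** The system property name as listed in this chapter. */
    final String key;

    GuardTimeoutProperty(String key) {
        this.key = key;
    }

    /** Sets the property for the current JVM; 0 disables guarding. */
    void apply(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("timeout must not be negative");
        }
        System.setProperty(key, Long.toString(millis));
    }
}
```

Calling GuardTimeoutProperty.DISTRIBUTED.apply(120000L) has the same effect on the current JVM as starting it with -Dtangosol.coherence.distributed.guard.timeout=120000.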

11.2.1.3 Setting the Guardian Timeout Per Service Instance

To set the guardian timeout per service instance, add a <guardian-timeout> element to a cache scheme definition in the cache configuration file. The following example sets the guardian timeout for a distributed cache scheme to 120000 milliseconds:

<distributed-scheme>
   <scheme-name>example-distributed</scheme-name>
   <service-name>DistributedCache</service-name>
   <guardian-timeout>120000</guardian-timeout>
   <backing-map-scheme>
      <local-scheme>
         <scheme-ref>example-binary-backing-map</scheme-ref>
      </local-scheme>
   </backing-map-scheme>
   <autostart>true</autostart>
</distributed-scheme>

The <guardian-timeout> element can be used in the following schemes: <distributed-scheme>, <replicated-scheme>, <optimistic-scheme>, <transaction-scheme>, <invocation-scheme>, and <proxy-scheme>.

11.2.2 Using the Timeout Value From the PriorityTask API

Custom implementations of the Invocable, EntryProcessor, and EntryAggregator interfaces can implement the com.tangosol.net.PriorityTask interface. In this case, the service guardian attempts recovery after the task has been executing for longer than the value returned by getExecutionTimeoutMillis(). See Chapter 31, "Managing Thread Execution," for more information on using the API.

The execution timeout can be set using the <task-timeout> element within an <invocation-scheme> element defined in the cache configuration file. For the Invocation service, the <task-timeout> element specifies the timeout value for Invocable tasks that implement the PriorityTask interface, but do not explicitly specify the execution timeout value; that is, the getExecutionTimeoutMillis() method returns 0.

If the <task-timeout> element is set to 0, the default guardian timeout is used. See Appendix B, "Cache Configuration Elements" for more information on the different cache schemes that support the use of the <task-timeout> element.
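
The fallback chain described in this section can be summarized as a small function. The following sketch is a plain Java illustration under the rules stated above (a value of 0 means "not specified"); the TaskTimeoutSketch class and its parameter names are hypothetical, and any other special values that the PriorityTask API may define are not modeled.

```java
/**
 * Hypothetical model of how the execution timeout for a PriorityTask
 * is chosen, based only on the rules described in this section.
 */
final class TaskTimeoutSketch {
    /**
     * @param taskMillis        value returned by getExecutionTimeoutMillis()
     * @param schemeTaskMillis  value of the task-timeout element for the scheme
     * @param guardianMillis    default guardian timeout
     * @return the timeout after which the guardian attempts recovery
     */
    static long effectiveTimeout(long taskMillis, long schemeTaskMillis, long guardianMillis) {
        if (taskMillis != 0) {
            return taskMillis;         // the task states its own timeout
        }
        if (schemeTaskMillis != 0) {
            return schemeTaskMillis;   // the scheme supplies task-timeout
        }
        return guardianMillis;         // neither is set: guardian timeout
    }
}
```

For example, a task that returns 0 in a scheme with a task timeout of 30000 is given 30000 milliseconds; if the scheme value is also 0, the guardian timeout applies.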

11.2.3 Setting the Guardian Service Failure Policy

The service failure policy determines the corrective action that the service guardian takes after it concludes that a thread is deadlocked. The following policies are available:

exit-cluster – This policy attempts to recover threads that appear to be unresponsive. If the attempt fails, an attempt is made to stop the associated service. If the associated service cannot be stopped, this policy causes the local node to stop the cluster services. This is the default policy if no policy is specified.
exit-process – This policy attempts to recover threads that appear to be unresponsive. If the attempt fails, an attempt is made to stop the associated service. If the associated service cannot be stopped, this policy causes the local node to exit the JVM and terminate abruptly.
logging – This policy logs any detected problems but takes no corrective action.
custom – This policy is specified as the name of a Java class that implements the com.tangosol.net.ServiceFailurePolicy interface. See "Enabling a Custom Guardian Failure Policy".

The service guardian failure policy can be set in three different ways based on the level of granularity that is required:

All threads – This option allows a single failure policy to be applied to all Coherence-owned threads on a cluster node. This is the out-of-box configuration.
Threads per service type – This option allows different failure policies to be set for specific service types. The policy is applied to the threads of all instances of that service type. If a policy is not specified for a particular service type, then the policy defaults to the policy that is set for all threads.
Threads per service instance – This option allows different failure policies to be set for specific service instances. If a policy is not set for a specific service instance, then the service's policy, if specified, is used; otherwise, the policy that is set for all threads is used.

11.2.3.1 Setting the Guardian Failure Policy for All Threads

To set a guardian failure policy, add a <service-failure-policy> element to an operational override file within the <service-guardian> element. The following example sets the failure policy to exit-process:

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config
   coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <service-failure-policy>exit-process</service-failure-policy>
      </service-guardian>
   </cluster-config>
</coherence>

11.2.3.2 Setting the Guardian Failure Policy Per Service Type

To set the failure policy per service type, override the service's service-failure-policy initialization parameter in an operational override file. The following example sets the failure policy for the DistributedCache service to the logging policy:

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config
   coherence-operational-config.xsd">
   <cluster-config>
      <services>
         <service id="3">
            <init-params>
               <init-param id="18">
                  <param-name>service-failure-policy</param-name>
                  <param-value>logging</param-value>
               </init-param>
            </init-params>
         </service>
      </services>
   </cluster-config>
</coherence>

The service-failure-policy initialization parameter can be set for the DistributedCache, ReplicatedCache, OptimisticCache, Invocation, and Proxy services. Refer to the tangosol-coherence.xml file that is located in the coherence.jar file for the correct service ID and initialization parameter ID to use when overriding the service-failure-policy parameter for a service.

11.2.3.3 Setting the Guardian Failure Policy Per Service Instance

To set the failure policy per service instance, add a <service-failure-policy> element to a cache scheme definition in the cache configuration file. The following example sets the failure policy to logging for a distributed cache scheme:

<distributed-scheme>
   <scheme-name>example-distributed</scheme-name>
   <service-name>DistributedCache</service-name>
   <guardian-timeout>120000</guardian-timeout>
   <service-failure-policy>logging</service-failure-policy>
   <backing-map-scheme>
      <local-scheme>
         <scheme-ref>example-binary-backing-map</scheme-ref>
      </local-scheme>
   </backing-map-scheme>
   <autostart>true</autostart>
</distributed-scheme>

The <service-failure-policy> element can be used in the following schemes: <distributed-scheme>, <replicated-scheme>, <optimistic-scheme>, <transaction-scheme>, <invocation-scheme>, and <proxy-scheme>.

11.2.3.4 Enabling a Custom Guardian Failure Policy

To use a custom failure policy, include an <instance> subelement and provide a fully qualified class name that implements the ServiceFailurePolicy interface. See "instance" for detailed instructions on using the <instance> element. The following example enables a custom failure policy that is implemented in the MyFailurePolicy class. Custom failure policies can be enabled for all threads (as shown below) or can be enabled per service instance within a cache scheme definition.

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config
   coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <service-failure-policy>
            <instance>
               <class-name>package.MyFailurePolicy</class-name>
            </instance>
         </service-failure-policy>
      </service-guardian>
   </cluster-config>
</coherence>

As an alternative, the <instance> element supports the use of a <class-factory-name> element to use a factory class that is responsible for creating ServiceFailurePolicy instances, and a <method-name> element to specify the static factory method on the factory class that performs object instantiation. The following example gets a custom failure policy instance using the getPolicy method on the MyPolicyFactory class.

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config
   coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <service-failure-policy>
            <instance>
               <class-factory-name>package.MyPolicyFactory</class-factory-name>
               <method-name>getPolicy</method-name>
            </instance>
         </service-failure-policy>
      </service-guardian>
   </cluster-config>
</coherence>

Any initialization parameters that are required for an implementation can be specified using the <init-params> element. The following example sets the iMaxTime parameter to 2000.

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config
   coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <service-failure-policy>
            <instance>
               <class-name>package.MyFailurePolicy</class-name>
               <init-params>
                  <init-param>
                     <param-name>iMaxTime</param-name>
                     <param-value>2000</param-value>
                  </init-param>
               </init-params>
            </instance>
         </service-failure-policy>
      </service-guardian>
   </cluster-config>
</coherence>

11.3 Issuing Manual Guardian Heartbeats

The com.tangosol.net.GuardSupport class provides heartbeat methods that applications can use to manually issue heartbeats to the guardian:

GuardSupport.heartbeat();

For known long-running operations, the heartbeat can be issued with the number of milliseconds that should pass before the operation is considered "stuck":

GuardSupport.heartbeat(long cMillis);
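
A common pattern is to issue a heartbeat at regular intervals inside a long-running loop. The following sketch shows the pattern in plain Java so that it compiles without Coherence on the class path: the heartbeat is passed in as a Runnable, and the HeartbeatLoopSketch class, the processAll method, and the heartbeatEvery parameter are hypothetical names chosen for this example. In a Coherence application, the Runnable would typically be GuardSupport::heartbeat.

```java
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper that processes items and issues a heartbeat
 * after every heartbeatEvery items, so the guardian sees progress.
 */
final class HeartbeatLoopSketch {
    /**
     * @param items           the work items to process
     * @param heartbeatEvery  number of items between heartbeats (positive)
     * @param work            the operation applied to each item
     * @param heartbeat       the heartbeat call, for example GuardSupport::heartbeat
     * @return the number of heartbeats that were issued
     */
    static <T> int processAll(List<T> items, int heartbeatEvery,
                              Consumer<T> work, Runnable heartbeat) {
        if (heartbeatEvery <= 0) {
            throw new IllegalArgumentException("heartbeatEvery must be positive");
        }
        int beats = 0;
        for (int i = 0; i < items.size(); i++) {
            work.accept(items.get(i));
            // Signal progress after each full batch of items.
            if ((i + 1) % heartbeatEvery == 0) {
                heartbeat.run();
                beats++;
            }
        }
        return beats;
    }
}
```

The batch size is a trade-off: the time needed to process one batch should stay well below the guardian timeout, while very small batches add a heartbeat call to almost every iteration.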