17 Using Quorum

This chapter provides instructions for using and configuring quorum policies to control when specific service actions are allowed in a cluster in order to ensure that a cluster is adequately provisioned.

This chapter includes the following sections:

  • Overview of Using Quorum

  • Using the Cluster Quorum

  • Using the Partitioned Cache Quorums

  • Using the Proxy Quorum

  • Using Custom Action Policies

17.1 Overview of Using Quorum

A quorum, in Coherence, refers to the minimum number of service members that are required in a cluster before a service action is allowed or disallowed. Quorums are beneficial because they automatically provide assurances that a cluster behaves in an expected way when member thresholds are reached. For example, a partitioned cache backup quorum might require at least 5 storage-enabled members before the partitioned cache service is allowed to back up partitions.

Quorums are service-specific and defined within a quorum policy; there is a cluster quorum policy for the Cluster service, a partitioned quorum policy for the Partitioned Cache service, and a proxy quorum policy for the Proxy service. Quorum thresholds are set on these policies declaratively: the cluster quorum in an operational override file, and the partitioned cache and proxy quorums in a cache configuration file.

Each quorum provides benefits for its particular service. However, in general, quorums:

  • control service behavior at different service member levels

  • mandate the minimum service member levels that are required for service operations

  • ensure an optimal cluster and cache environment for a particular application or solution

17.2 Using the Cluster Quorum

The cluster quorum policy defines a single quorum (the timeout survivor quorum) for the Cluster Service. The timeout survivor quorum mandates the minimum number of cluster members that must remain in the cluster when the cluster service is terminating suspect members. A member is considered suspect if it has not responded to network communications and is in imminent danger of being disconnected from the cluster. The quorum can be specified generically across all members or constrained to members that have a specific role in the cluster, such as client or server members. See the <role-name> element in "member-identity" for more information on defining role names for cluster members.

This quorum is typically used in environments where network performance varies. For example, intermittent network outages may cause a high number of cluster members to be removed from the cluster. Using this quorum, a certain number of members are maintained during the outage and are available when the network recovers. This behavior also minimizes the manual intervention required to restart members. Naturally, requests that require cooperation by the nodes that are not responding are not able to complete and are either blocked for the duration of the outage or are timed out.

17.2.1 Configuring the Cluster Quorum Policy

The timeout survivor quorum threshold is configured in an operational override file using the <timeout-survivor-quorum> element and optionally the role attribute. This element must be used within a <cluster-quorum-policy> element. The following example demonstrates configuring the timeout survivor quorum threshold to ensure that 5 cluster members with the server role are always kept in the cluster while removing suspect members:

<cluster-config>
   <member-identity>
      <role-name>server</role-name>
   </member-identity>
   <cluster-quorum-policy>
      <timeout-survivor-quorum role="server">5</timeout-survivor-quorum>
   </cluster-quorum-policy>
</cluster-config>

17.3 Using the Partitioned Cache Quorums

The partitioned cache quorum policy defines four quorums for the partitioned cache service (DistributedCache) that mandate how many service members are required before different partitioned cache service operations can be performed:

  • Distribution Quorum – This quorum mandates the minimum number of storage-enabled members of a partitioned cache service that must be present before the partitioned cache service is allowed to perform partition distribution.

  • Restore Quorum – This quorum mandates the minimum number of storage-enabled members of a partitioned cache service that must be present before the partitioned cache service is allowed to restore lost primary partitions from backup.

  • Read Quorum – This quorum mandates the minimum number of storage-enabled members of a partitioned cache service that must be present to process read requests. A read request is any request that does not mutate the state or contents of a cache.

  • Write Quorum – This quorum mandates the minimum number of storage-enabled members of a partitioned cache service that must be present to process write requests. A write request is any request that may mutate the state or contents of a cache.

These quorums are typically used to indicate at what service member levels different service operations are best performed given the intended use and requirements of a distributed cache. For example, a small distributed cache may only require three storage-enabled members to adequately store data and handle projected request volumes, while a large distributed cache may require ten or more storage-enabled members. Optimal member levels are tested during development and then set accordingly to ensure that the minimum service member levels are provisioned in a production environment.

If the number of storage-enabled nodes running the service drops below the configured level of the read or write quorum, the corresponding client operations are rejected and a com.tangosol.net.RequestPolicyException is thrown. If the number of storage-enabled nodes drops below the configured level of the distribution quorum, some data may become "endangered" (have no backup) until the quorum is reached. Dropping below the restore quorum may cause some operations to be blocked until the quorum is reached or to be timed out.
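
Clients can anticipate these rejections by handling the exception explicitly. The following is a minimal sketch of a quorum-aware client, assuming a hypothetical cache named example; the appropriate retry or back-off strategy is application-specific and is not part of the quorum API:

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.net.RequestPolicyException;

public class QuorumAwareClient {
   public static void main(String[] args) {
      NamedCache cache = CacheFactory.getCache("example");
      try {
         // Rejected with RequestPolicyException if the write quorum is not met
         cache.put("key", "value");
      }
      catch (RequestPolicyException e) {
         // Too few storage-enabled members are running; defer, retry, or
         // fail gracefully as appropriate for the application
         System.err.println("Write rejected by quorum policy: " + e.getMessage());
      }
   }
}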

17.3.1 Configuring the Partitioned Cache Quorum Policy

Partitioned cache quorums are configured in a cache configuration file within the <partitioned-quorum-policy-scheme> element. The element must be used within a <distributed-scheme> element. The following example demonstrates configuring thresholds for the partitioned cache quorums. Ideally, the threshold values indicate the minimum number of service members that are required to perform each operation.

<distributed-scheme>
   <scheme-name>partitioned-cache-with-quorum</scheme-name>
   <service-name>PartitionedCacheWithQuorum</service-name>
   <backing-map-scheme>
      <local-scheme/>
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <distribution-quorum>4</distribution-quorum>
      <restore-quorum>3</restore-quorum>
      <read-quorum>3</read-quorum>
      <write-quorum>5</write-quorum>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>

The <partitioned-quorum-policy-scheme> element also supports the use of scheme references. In the following example, a <partitioned-quorum-policy-scheme> with the name partitioned-cache-quorum is referenced from within the <distributed-scheme> element:

<distributed-scheme>
   <scheme-name>partitioned-cache-with-quorum</scheme-name>
   <service-name>PartitionedCacheWithQuorum</service-name>
   <backing-map-scheme>
      <local-scheme/>
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <scheme-ref>partitioned-cache-quorum</scheme-ref>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>
 
<distributed-scheme>
   <scheme-name>dist-example</scheme-name>
   <service-name>DistributedCache</service-name>
   <backing-map-scheme>
      <local-scheme/>
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <scheme-name>partitioned-cache-quorum</scheme-name>
      <distribution-quorum>4</distribution-quorum>
      <restore-quorum>3</restore-quorum>
      <read-quorum>3</read-quorum>
      <write-quorum>5</write-quorum>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>

17.4 Using the Proxy Quorum

The proxy quorum policy defines a single quorum (the connection quorum) for the proxy service. The connection quorum mandates the minimum number of proxy service members that must be available before the proxy service can allow client connections.

This quorum is typically used to ensure enough proxy service members are available to optimally support a given set of TCP clients. For example, a small number of clients may efficiently connect to a cluster using two proxy services, while a large number of clients may require three or more proxy services to efficiently connect to a cluster. Optimal levels are tested during development and then set accordingly to ensure that the minimum service member levels are provisioned in a production environment.

17.4.1 Configuring the Proxy Quorum Policy

The connection quorum threshold is configured in a cache configuration file within the <proxy-quorum-policy-scheme> element. The element must be used within a <proxy-scheme> element. The following example demonstrates configuring the connection quorum threshold to ensure that 3 proxy service members are present in the cluster before the proxy service is allowed to accept TCP client connections:

<proxy-scheme>
   <scheme-name>proxy-with-quorum</scheme-name>
   <service-name>TcpProxyService</service-name>
   <acceptor-config>
      <tcp-acceptor>
         <local-address>
            <address>localhost</address>
            <port>32000</port>
         </local-address>
      </tcp-acceptor>
   </acceptor-config>
   <proxy-quorum-policy-scheme>
      <connect-quorum>3</connect-quorum>
   </proxy-quorum-policy-scheme>
   <autostart>true</autostart>
</proxy-scheme>

The <proxy-quorum-policy-scheme> element also supports the use of scheme references. In the following example, a <proxy-quorum-policy-scheme> with the name proxy-quorum is referenced from within the <proxy-scheme> element:

<proxy-scheme>
   <scheme-name>proxy-with-quorum</scheme-name>
   <service-name>TcpProxyService</service-name>
   ...
   <proxy-quorum-policy-scheme>
      <scheme-ref>proxy-quorum</scheme-ref>
   </proxy-quorum-policy-scheme>
   <autostart>true</autostart>
</proxy-scheme>
<proxy-scheme>
   <scheme-name>proxy-example</scheme-name>
   <service-name>TcpProxyService</service-name>
   ...
   <proxy-quorum-policy-scheme>
      <scheme-name>proxy-quorum</scheme-name>
      <connect-quorum>3</connect-quorum>
   </proxy-quorum-policy-scheme>
   <autostart>true</autostart>
</proxy-scheme>

17.5 Using Custom Action Policies

Custom action policies can be used instead of the default quorum policies for the Cluster service, Partitioned Cache service, and Proxy service. Custom action policies must implement the com.tangosol.net.ActionPolicy interface.

This section includes the following topics:

  • Enabling Custom Action Policies

  • Enabling the Custom Failover Access Policy

17.5.1 Enabling Custom Action Policies

To enable a custom policy, add a <class-name> element within a quorum policy scheme element that contains the fully qualified name of the implementation class. The following example adds a custom action policy to the partitioned quorum policy for a distributed cache scheme definition:

<distributed-scheme>
   <scheme-name>partitioned-cache-with-quorum</scheme-name>
   <service-name>PartitionedCacheWithQuorum</service-name>
   <backing-map-scheme>
     <local-scheme/>
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <class-name>package.MyCustomAction</class-name>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>
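
The referenced class must implement the com.tangosol.net.ActionPolicy interface, which defines an init method and an isAllowed method. The following is a minimal sketch of what the hypothetical MyCustomAction class from the example above might look like; the two-member rule shown is purely illustrative:

import com.tangosol.net.Action;
import com.tangosol.net.ActionPolicy;
import com.tangosol.net.Service;

// The fully qualified name of this class must match the <class-name> element.
public class MyCustomAction implements ActionPolicy {
   @Override
   public void init(Service service) {
      // Called once when the service starts; perform any setup here.
   }

   @Override
   public boolean isAllowed(Service service, Action action) {
      // Illustrative rule: allow actions only when at least two members
      // are running the service. A real policy would encode its own rules,
      // possibly based on the specific action being requested.
      return service.getInfo().getServiceMembers().size() >= 2;
   }
}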

As an alternative, a factory class can create custom action policy instances. To define a factory class, use the <class-factory-name> element to enter the fully qualified class name and the <method-name> element to specify the name of a static factory method on the factory class that performs object instantiation. For example:

<distributed-scheme>
   <scheme-name>partitioned-cache-with-quorum</scheme-name>
   <service-name>PartitionedCacheWithQuorum</service-name>
   <backing-map-scheme>
     <local-scheme/>
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <class-factory-name>package.MyFactory</class-factory-name>
      <method-name>createPolicy</method-name>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>
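
A minimal sketch of such a factory, pairing the hypothetical MyFactory class from the example above with the MyCustomAction policy shown earlier, might look as follows; the factory method must be static and return the policy instance:

import com.tangosol.net.ActionPolicy;

// The fully qualified name of this class must match the <class-factory-name>
// element, and the method name must match the <method-name> element.
public class MyFactory {
   public static ActionPolicy createPolicy() {
      return new MyCustomAction();
   }
}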

17.5.2 Enabling the Custom Failover Access Policy

Coherence provides a pre-defined custom action policy that moderates client request load during a failover event in order to allow cache servers adequate opportunity to re-establish partition backups. Use this policy in situations where a heavy load of high-latency requests may prevent, or significantly delay, cache servers from successfully acquiring exclusive access to partitions needing to be transferred or backed up.

To enable the custom failover access policy, add a <class-name> element within a <partitioned-quorum-policy-scheme> element that contains the fully qualified name of the failover access policy (com.tangosol.net.partition.FailoverAccessPolicy). The policy accepts the following parameters:

  • cThresholdMillis – Specifies the delay before the policy should start holding requests (after becoming endangered). The default value is 5000 milliseconds.

  • cLimitMillis – Specifies the delay before the policy makes a maximal effort to hold requests (after becoming endangered). The default value is 60000 milliseconds.

  • cMaxDelayMillis – Specifies the maximum amount of time to hold a request. The default value is 5000 milliseconds.

The following example enables the custom failover access policy and sets each of the parameters:

<distributed-scheme>
   <scheme-name>partitioned-cache-with-quorum</scheme-name>
   <service-name>PartitionedCacheWithQuorum</service-name>
   <backing-map-scheme>
     <local-scheme/>
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <class-name>com.tangosol.net.partition.FailoverAccessPolicy</class-name>
      <init-params>
         <init-param>
            <param-name>cThresholdMillis</param-name>
            <param-value>7000</param-value>
         </init-param>
         <init-param>
            <param-name>cLimitMillis</param-name>
            <param-value>30000</param-value>
         </init-param>
         <init-param>
            <param-name>cMaxDelayMillis</param-name>
            <param-value>2000</param-value>
         </init-param>
      </init-params>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>