5 Using the Health Check API

Coherence 14.1.1.2206 introduces a health check API to enable application code to determine the health of the local Coherence member, and corresponding HTTP and management endpoints to allow external applications to query the health of a cluster and its members.
The health API also enables applications to register their own health checks, which are then included in the member’s and cluster’s health status.

Note:

To enable health checks, you must use the Bootstrap API to start the Coherence cluster members. See Using the Bootstrap API.


About the Health Check API

You can use the health check API from application code to determine whether Coherence is healthy. It is also exposed through an HTTP endpoint, which makes it useful for health checks in containerized environments such as Kubernetes and Docker.

All health checks in Coherence implement a simple interface.
public interface HealthCheck
    {
    /**
     * Returns the unique name of this health check.
     *
     * @return the unique name of this health check
     */
    String getName();

    /**
     * Return {@code true} if this {@link HealthCheck} should
     * be included when working out this Coherence member's
     * health status.
     *
     * @return {@code true} if this {@link HealthCheck} should
     *         be included in the member's health status
     */
    default boolean isMemberHealthCheck()
        {
        return true;
        }

    /**
     * Returns {@code true} if the resource represented by
     * this {@link HealthCheck} is ready, otherwise returns
     * {@code false}.
     * <p>
     * The concept of what "ready" means may vary for different
     * types of resources.
     *
     * @return {@code true} if the resource represented by this
     *         {@link HealthCheck} is ready, otherwise {@code false}
     */
    boolean isReady();

    /**
     * Returns {@code true} if the resource represented by
     * this {@link HealthCheck} is alive, otherwise returns
     * {@code false}.
     * <p>
     * The concept of what "alive" means may vary for different
     * types of resources.
     *
     * @return {@code true} if the resource represented by this
     *         {@link HealthCheck} is alive, otherwise returns
     *         {@code false}
     */
    boolean isLive();

    /**
     * Returns {@code true} if the resource represented by
     * this {@link HealthCheck} is started, otherwise returns
     * {@code false}.
     * <p>
     * The concept of what "started" means may vary for different
     * types of resources.
     *
     * @return {@code true} if the resource represented by this
     *         {@link HealthCheck} is started, otherwise returns
     *         {@code false}
     */
    boolean isStarted();

    /**
     * Returns {@code true} if the resource represented by this
     * {@link HealthCheck} is in a safe state to allow a rolling
     * upgrade to proceed, otherwise returns {@code false}.
     * <p>
     * The concept of what "safe" means may vary for different
     * types of resources.
     *
     * @return {@code true} if the resource represented by this
     *         {@link HealthCheck} is in a safe state to allow
     *         a rolling upgrade to proceed, otherwise returns
     *         {@code false}
     */
    boolean isSafe();
    }

The methods were specifically chosen to integrate with other systems in which Coherence runs, such as Kubernetes, that use similar "started", "live", and "ready" health checks. The "safe" check is specific to Coherence and is intended for use cases such as rolling upgrades, where it is important to know that a cluster is "safe" before rolling the next cluster member.

The health check API is part of the Coherence management APIs and can be accessed from the com.tangosol.net.management.Registry class. The Registry is typically obtained from the current Coherence Cluster instance. For example, when Coherence has been started by running com.tangosol.net.Coherence.main(), or by using the bootstrap API, you can obtain the management Registry as shown below:
Cluster  cluster  = Coherence.getInstance().getCluster();
Registry registry = cluster.getManagement();

The health check API can only see registered health checks for the local Coherence member; it is not a cluster wide API. For cluster wide health checks, use the corresponding health MBeans through the Coherence management API, JMX, or management over REST.


Obtaining All Health Checks

To obtain a collection of all the registered health checks, the getHealthChecks() method can be called on the Registry instance. This method returns an immutable collection of registered HealthCheck instances.

For example, the code below obtains a Set of names of HealthCheck instances that are not ready:
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
Collection<HealthCheck> healthChecks = registry.getHealthChecks();
Set<String> names = healthChecks.stream()
        .filter(hc -> !hc.isReady())
        .map(HealthCheck::getName)
        .collect(Collectors.toSet());

Obtaining a Health Check by Name

To obtain a specific health check by name, the getHealthCheck(String name) method can be called on the Registry instance. This method returns an Optional containing the requested HealthCheck if one has been registered with the requested name, or an empty Optional if no HealthCheck has been registered with that name.

For example, the code below obtains the HealthCheck with the name "Foo":
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
Optional<HealthCheck> healthCheck = registry.getHealthCheck("Foo");
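
Because the lookup returns an Optional, the caller has to decide what a missing health check means. The sketch below treats an unregistered check as not ready. So that it compiles without the Coherence jar, it declares a minimal stand-in HealthCheck interface that mirrors only two of the methods shown earlier; in an application, the Optional would come from registry.getHealthCheck("Foo") and the real com.tangosol.util.HealthCheck would be used. The class and method names are hypothetical.

```java
import java.util.Optional;

class OptionalHealthCheckSketch {
    // Stand-in for com.tangosol.util.HealthCheck (only the methods used here).
    interface HealthCheck {
        String getName();
        boolean isReady();
    }

    // Treats an unregistered (empty) health check as "not ready".
    static boolean isReady(Optional<HealthCheck> check) {
        return check.map(HealthCheck::isReady).orElse(false);
    }

    public static void main(String[] args) {
        HealthCheck foo = new HealthCheck() {
            @Override public String getName() { return "Foo"; }
            @Override public boolean isReady() { return true; }
        };
        System.out.println(isReady(Optional.of(foo))); // a registered check that is ready
        System.out.println(isReady(Optional.empty())); // nothing registered under the name
    }
}
```

Whether a missing check should count as not ready or be ignored depends on the application; orElse(true) would implement the opposite policy.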

Checking All Health Checks Are Ready

The allHealthChecksReady() method on the Registry instance can be used to determine whether all locally registered health checks are ready. Only health checks that return true from their isMemberHealthCheck() method are included in the ready check.
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
boolean ready = registry.allHealthChecksReady();

Checking All Health Checks Have Started

The allHealthChecksStarted() method on the Registry instance can be used to determine whether all locally registered health checks have been started. Only health checks that return true from their isMemberHealthCheck() method are included in the started check.

Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
boolean started = registry.allHealthChecksStarted();

Checking All Health Checks Are Live

The allHealthChecksLive() method on the Registry instance can be used to determine whether all locally registered health checks are live. Only health checks that return true from their isMemberHealthCheck() method are included in the live check.
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
boolean live = registry.allHealthChecksLive();

Checking All Health Checks Are Safe

The allHealthChecksSafe() method on the Registry instance can be used to determine whether all locally registered health checks are safe. Only health checks that return true from their isMemberHealthCheck() method are included in the safe check.
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
boolean safe = registry.allHealthChecksSafe();
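
The four aggregate methods above share one rule: only health checks whose isMemberHealthCheck() method returns true are counted. The sketch below illustrates that documented rule; it is not Coherence's implementation. It uses a stand-in HealthCheck interface so that it compiles without the Coherence jar, and the helper names allMemberChecks and check are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

class MemberHealthSketch {
    // Stand-in for com.tangosol.util.HealthCheck (only the methods used here).
    interface HealthCheck {
        String getName();
        default boolean isMemberHealthCheck() { return true; }
        boolean isReady();
    }

    // True when every check that counts toward member health passes the given state test.
    static boolean allMemberChecks(List<HealthCheck> checks, Predicate<HealthCheck> state) {
        return checks.stream()
                .filter(HealthCheck::isMemberHealthCheck)
                .allMatch(state);
    }

    // Hypothetical factory for a check with fixed answers, used for illustration.
    static HealthCheck check(String name, boolean member, boolean ready) {
        return new HealthCheck() {
            @Override public String getName() { return name; }
            @Override public boolean isMemberHealthCheck() { return member; }
            @Override public boolean isReady() { return ready; }
        };
    }
}
```

For example, a list containing a ready member check and a not-ready check that is excluded from member health evaluates to true with HealthCheck::isReady as the state test.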

Using the Built-In Health Checks

Coherence has a number of health checks that are ready-to-use.
  • Each Coherence service has a corresponding health check.
  • Instances of com.tangosol.net.Coherence provide a corresponding health check.
  • When using Coherence gRPC integrations, the gRPC proxy server has a health check.


Using the Service Health Checks

For Coherence services, health checks have the following functionality:
  • Started: The isStarted() method for a service health check will return true if the corresponding service is running.
  • Live: The isLive() method for a service health check will return true if the corresponding service is running.
  • Ready: For a service, the isReady() method will return false until a service becomes "safe", after which the "ready" state will remain true. This is specifically for use cases such as Kubernetes, where Pods will be removed from a Service if not Ready. However, this behavior is typically not required for Coherence.
  • Safe: For all services except a partitioned cache service, the isSafe() method will always return true.
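
The "ready" behavior described in the list above is a latch: the value stays false until "safe" has been observed once, and stays true afterward even if the service later stops being safe. A minimal sketch of that idea follows. It is not the Coherence implementation; the class name and the BooleanSupplier used as the source of the "safe" state are assumptions made for illustration, and the latch is only updated when isReady() is polled.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class LatchedReadySketch {
    private final BooleanSupplier safe;                       // hypothetical source of the current "safe" state
    private final AtomicBoolean ready = new AtomicBoolean(false);

    LatchedReadySketch(BooleanSupplier safe) {
        this.safe = safe;
    }

    // Returns false until "safe" has been observed once; after that, always true.
    boolean isReady() {
        if (!ready.get() && safe.getAsBoolean()) {
            ready.set(true);
        }
        return ready.get();
    }
}
```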

Using the PartitionedCache Service isSafe Check

A Coherence PartitionedCache service is more complex than most services in Coherence, and as such, its health checks also do more. The isSafe() check for a PartitionedCache service performs a number of checks to ensure that the service is stable and safe. The main use-cases for the "safe" check are when performing a rolling upgrade or safely scaling down a cluster.

  • The isSafe() health check for a PartitionedCache service on a non-storage enabled member will return true as long as the service is running.
  • The isSafe() health check for a PartitionedCache service will return false if this member is the only storage enabled member for the service, but does not own all the partitions. This can happen just after all the other members of the cluster have been stopped but the partition recovery and reallocation logic is still in progress. Therefore, this member does not yet know that it owns all the partitions.
  • The isSafe() health check for a PartitionedCache service will return false if the backup count is configured to be greater than zero and the StatusHA state for the service is endangered. You can change this behavior for individual services in the cache configuration file to allow them to be endangered. A service with a backup count of zero is allowed to be endangered for the safe check.
  • The isSafe() health check for a PartitionedCache service will return false if partition redistribution is in progress.
  • The isSafe() health check for a PartitionedCache service will return false if recovery from persistent storage is in progress.
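
The rules in the list above can be read as a single decision function. The sketch below encodes them under stated assumptions: the ServiceState class and all of its field names are hypothetical, the allowEndangered flag stands for the per-service cache configuration setting, and treating a stopped service as unsafe is an assumption that the list does not state. It illustrates the documented rules and is not the Coherence implementation.

```java
class PartitionedSafeSketch {
    // Hypothetical snapshot of the inputs named in the rules above.
    static class ServiceState {
        boolean running = true;
        boolean storageEnabled = true;
        boolean onlyStorageMember = false;
        boolean ownsAllPartitions = false;
        int backupCount = 1;
        boolean endangered = false;       // StatusHA reports endangered
        boolean allowEndangered = false;  // per-service configuration override
        boolean redistributing = false;
        boolean recovering = false;
    }

    static boolean isSafe(ServiceState s) {
        if (!s.running) return false;                // assumption: a stopped service is never safe
        if (!s.storageEnabled) return true;          // non-storage member: running is enough
        if (s.onlyStorageMember && !s.ownsAllPartitions) return false;
        if (s.backupCount > 0 && s.endangered && !s.allowEndangered) return false;
        if (s.redistributing) return false;          // partition redistribution in progress
        if (s.recovering) return false;              // recovery from persistent storage in progress
        return true;
    }
}
```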

Excluding Services from Member Health

Sometimes it may be desirable to exclude a Coherence service from the member’s overall health check. To do this, set the <member-health-check> element to false in the service’s <health> element in the cache configuration file.

For example, the proxy-scheme below has the <member-health-check> element value set to false. The health checks for the Proxy service will still be accessible through the health API, but checks of the overall member health, such as the Registry class’s allHealthChecksReady() method will not include the Proxy service.

<proxy-scheme>
  <service-name>Proxy</service-name>
  <autostart>true</autostart>
  <health>
    <member-health-check>false</member-health-check>
  </health>
</proxy-scheme>

Allowing Endangered Services

Sometimes an application may configure a distributed cache service that can intentionally become endangered, and this state should not be reflected in the member’s overall health. To allow this, set the <allow-endangered> element to true in the distributed scheme’s <health> element in the cache configuration file.

For example, the distributed-scheme below has the <allow-endangered> element value set to true. The health checks for the PartitionedCache will report that the service is "ready" or "safe" even if the StatusHA value for the service is 'ENDANGERED'.

<distributed-scheme>
  <scheme-name>distributed-scheme</scheme-name>
  <service-name>PartitionedCacheOne</service-name>
  <backing-map-scheme>
    <local-scheme/>
  </backing-map-scheme>
  <autostart>true</autostart>
  <health>
    <allow-endangered>true</allow-endangered>
  </health>
</distributed-scheme>

Enabling HTTP Health Checks

The health check HTTP endpoints are enabled when Coherence is run using the bootstrap API, or when Coherence is started with com.tangosol.net.Coherence as the main class. If Coherence is started by any other method, the health check API is still available, but the HTTP endpoints will not be running. By default, the HTTP server binds to an ephemeral port, but this can be changed by setting the coherence.health.http.port system property or the COHERENCE_HEALTH_HTTP_PORT environment variable.
For example, running the following command will start Coherence with the health endpoints on http://localhost:6676:
java -cp coherence.jar -Dcoherence.health.http.port=6676 \
    com.tangosol.net.Coherence
or with Java modules:
java -p coherence.jar \
    -Dcoherence.health.http.port=6676 \
    --module com.oracle.coherence
The curl utility can then be used to poll one of the endpoints, for example /ready:
curl -i -X GET http://localhost:6676/ready
The above command returns output like the following:
HTTP/1.1 200 OK
Date: Tue, 19 Apr 2022 17:59:05 GMT
Content-type: application/json
Vary: Accept-Encoding
Content-length: 0
X-content-type-options: nosniff

If a Coherence health check fails, the response code is 503 (Service Unavailable).
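
The same poll can be made from Java with the JDK's built-in HTTP client (Java 11 or later). The sketch below is an illustration only: the class and method names are hypothetical, the timeouts are arbitrary choices, and treating a connection failure as unhealthy is an assumption. It relies only on the 200 and 503 status codes described above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class HealthEndpointClientSketch {
    // Only a 200 response counts as healthy; 503 and anything else counts as unhealthy.
    static boolean isHealthy(int statusCode) {
        return statusCode == 200;
    }

    // Sends a GET to the endpoint; a failure to connect is treated as unhealthy.
    static boolean check(String url) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            // The endpoints return no body, so the response body is discarded.
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return isHealthy(response.statusCode());
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a member started as in the command above, check("http://localhost:6676/ready") would be expected to return true once the member is ready.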

Health HTTP Endpoints

The health check HTTP server has a number of endpoints.

Note:

None of the endpoints accepts a payload or returns a response body. The only response is either a 200 or a 503 status code. This means that although the health endpoints can be configured to use SSL/TLS, there is little need for encryption, making their use by external tooling such as Kubernetes and other container environments simpler.

Table 5-1 Health Check HTTP Server Endpoints

Endpoint    Description
/started    Returns a 200 status code if all the health checks for the member that receives the request have been "started". If one or more health checks are not started, the endpoint returns a 503 response.
/live       Returns a 200 status code if all the health checks for the member that receives the request are "live". If one or more health checks are not live, the endpoint returns a 503 response.
/ready      Returns a 200 status code if all the health checks for the member that receives the request are "ready". If one or more health checks are not ready, the endpoint returns a 503 response.
/safe       Returns a 200 status code if all the health checks for the member that receives the request are "safe". If one or more health checks are not safe, the endpoint returns a 503 response.

Using Application Health Checks

The health check API allows application developers to add custom health checks. This can be useful where an application provides a service that should be used to determine the overall health of a Coherence member. For example, an application could include a web server and should not be considered "ready" until the web server is started.

To add a custom health check, write an implementation of com.tangosol.util.HealthCheck.

The getName() method for the custom health check should return a unique name that represents this health check. As health checks are exposed as MBeans, the name must be a name that is valid in a JMX MBean object name.

The health check implementation should then use relevant application logic to determine the result to return for each of the methods. Some methods may not apply, in which case they should just return true.

It is important to understand how the results of the different health check methods will be used outside the application code, for example, when the application is deployed and managed by an external system that monitors application health. An application deployed into Kubernetes could be killed if it reports not being "live" too many times. An application that does not report being "ready" may be excluded from request routing. An application that is not "safe" will block rolling upgrades or safe scaling of a Coherence cluster.
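
The web server example mentioned above could be sketched as follows. To keep the example compilable without the Coherence jar, it declares a stand-in interface that mirrors the HealthCheck interface shown earlier in this chapter; a real implementation would implement com.tangosol.util.HealthCheck directly. The class name, the "WebServer" check name, and the markStarted() hook are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class WebServerHealthSketch {
    // Stand-in mirroring the HealthCheck interface shown earlier in this chapter.
    interface HealthCheck {
        String getName();
        default boolean isMemberHealthCheck() { return true; }
        boolean isReady();
        boolean isLive();
        boolean isStarted();
        boolean isSafe();
    }

    static class WebServerHealthCheck implements HealthCheck {
        private final AtomicBoolean started = new AtomicBoolean(false);

        // Called by the (hypothetical) application once its web server is listening.
        void markStarted() {
            started.set(true);
        }

        // The name must be unique and valid in a JMX MBean object name.
        @Override public String getName() { return "WebServer"; }

        @Override public boolean isStarted() { return started.get(); }

        // The member is not considered "ready" until the web server has started.
        @Override public boolean isReady() { return started.get(); }

        // Liveness and safety do not apply to this check, so they report true.
        @Override public boolean isLive() { return true; }
        @Override public boolean isSafe() { return true; }
    }
}
```

The instance then needs to be registered with the management Registry of the local member. This sketch assumes that the Registry exposes a register(HealthCheck) method for that purpose; confirm the exact method against the API documentation for the release in use.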

Excluding Custom Health Checks from Member Health

An application developer may want to add custom health checks for application services, but not have these checks impact the overall Coherence member health. The HealthCheck interface has an isMemberHealthCheck() method for this purpose. The default implementation of isMemberHealthCheck() always returns true, so by default all health checks are included in the member’s health. To exclude a health check from the member’s health, override the isMemberHealthCheck() method to return false.
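
A minimal sketch of such an override is shown below. It uses a stand-in interface with only the relevant methods so that it compiles without the Coherence jar; the class name, the "ReportingService" check name, and the BooleanSupplier that supplies the state are hypothetical.

```java
import java.util.function.BooleanSupplier;

class ExcludedCheckSketch {
    // Stand-in for com.tangosol.util.HealthCheck (only the methods used here).
    interface HealthCheck {
        String getName();
        default boolean isMemberHealthCheck() { return true; }
        boolean isReady();
    }

    // Reports its own state but does not count toward the member's overall health.
    static class ReportingServiceCheck implements HealthCheck {
        private final BooleanSupplier ready; // hypothetical application state

        ReportingServiceCheck(BooleanSupplier ready) {
            this.ready = ready;
        }

        @Override public String getName() { return "ReportingService"; }

        // Overrides the default of true to exclude this check from member health.
        @Override public boolean isMemberHealthCheck() { return false; }

        @Override public boolean isReady() { return ready.getAsBoolean(); }
    }
}
```

The check can still be looked up by name and queried individually; only the aggregate member checks skip it.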

Using Containerized Health Checks

When running Coherence applications in containers, for example, in Docker or Kubernetes, it is useful to be able to make use of health and readiness checks. Running Coherence with the health HTTP endpoints enabled makes it simple to configure container health checks.

When using the health check endpoints in a container, the HTTP port needs to be fixed so that the image’s health checks can be configured. The default behavior of binding to an ephemeral port would mean that the system does not know which port the health check API is bound to. The HTTP port can be set using the coherence.health.http.port system property or the COHERENCE_HEALTH_HTTP_PORT environment variable. When creating images, it is typically simpler to use environment variables, which is what the examples in this section show.


Using the Docker Health Checks

It is possible to build a Coherence Docker image configured with a health check using the HEALTHCHECK configuration in the Dockerfile.

The example Dockerfile below sets the health check port to 6676 using the ENV COHERENCE_HEALTH_HTTP_PORT=6676 setting. The Dockerfile is then configured with a HEALTHCHECK where the command will run curl against the HTTP endpoint on http://127.0.0.1:6676/ready. This command will fail if the response is not 200.
FROM openjdk:11-jre

ADD coherence.jar /coherence/lib/coherence.jar

ENTRYPOINT [ "java" ]
CMD [ "-cp", "/coherence/lib/*", "com.tangosol.net.Coherence" ]

ENV COHERENCE_HEALTH_HTTP_PORT=6676

HEALTHCHECK CMD curl --fail http://127.0.0.1:6676/ready || exit 1

The check above assumes that the base image has curl installed. This is not always the case: some very slim Linux base images, and distroless images, do not include additional tools such as curl. When all the image has is Java, you can configure the health check to use the Java health check client class com.tangosol.util.HealthCheckClient, which is built into the Coherence jar. This class is run with a single parameter, which is the URL of the HTTP endpoint to check.

The example Dockerfile below uses a distroless base image that contains only a minimal Linux runtime and Java 11. The health check port is set to 6676 using the ENV COHERENCE_HEALTH_HTTP_PORT=6676 setting. The Dockerfile is then configured with a HEALTHCHECK where the command will run java -cp /coherence/lib/coherence.jar com.tangosol.util.HealthCheckClient http://127.0.0.1:6676/ready. This command will fail if the response is not 200.
FROM gcr.io/distroless/java11

ADD coherence.jar /coherence/lib/coherence.jar

ENTRYPOINT [ "java" ]
CMD [ "-cp", "/coherence/lib/*", "com.tangosol.net.Coherence" ]

ENV COHERENCE_HEALTH_HTTP_PORT=6676

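# Exec form is used because a distroless image has no shell to run the shell form.
HEALTHCHECK CMD [ "java", "-cp", "/coherence/lib/coherence.jar", "com.tangosol.util.HealthCheckClient", "http://127.0.0.1:6676/ready" ]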

Using the Kubernetes Readiness and Liveness Probes

In Kubernetes, there are various readiness and liveness probes that can be configured. The image itself does not need a health check (see Using the Docker Health Checks) because Kubernetes readiness and liveness probes are independent of the image. For complete details about configuring Kubernetes readiness and liveness, see the Kubernetes documentation.

The example below is a simple Pod that uses a Coherence image and health checks. The COHERENCE_HEALTH_HTTP_PORT environment variable is used to fix the health check HTTP port to 6676. The readinessProbe is then configured to use an HTTP GET request on port 6676 with the request path /ready. The host for the request defaults to the Pod IP address, so the probe effectively requests http://<pod-ip>:6676/ready.
apiVersion: v1
kind: Pod
metadata:
  name: coherence
spec:
  containers:
  - name: coherence
    image: ghcr.io/oracle/coherence-ce:22.06
    env:
      - name: COHERENCE_HEALTH_HTTP_PORT
        value: "6676"
      - name: COHERENCE_WKA
        value: coherence_wka.svc.cluster.local
    readinessProbe:
      httpGet:
        path: "/ready"
        port: 6676
      initialDelaySeconds: 30
      periodSeconds: 30