5 Using the Health Check API
Note:
To enable health checks, you must use the Bootstrap API to start the Coherence cluster members. See Using the Bootstrap API.
This chapter includes the following topics:
- About the Health Check API
You can use the health check API from application code to determine whether Coherence is healthy, and also from an HTTP endpoint, which makes it useful for health checks in containerized environments such as Kubernetes and Docker.
- Using the Built-In Health Checks
Coherence has a number of health checks that are ready to use.
- Enabling HTTP Health Checks
The health check HTTP endpoints are enabled when Coherence is run using the bootstrap API, or when Coherence is started using com.tangosol.net.Coherence as the main class. If Coherence is started by any other method, the health check API is still available, but the HTTP endpoints will not be running.
- Using Application Health Checks
The health check API allows application developers to add custom health checks. This can be useful where an application provides a service that should be used to determine the overall health of a Coherence member.
- Using Containerized Health Checks
When running Coherence applications in containers, for example, in Docker or Kubernetes, it is useful to be able to make use of health and readiness checks. By running Coherence with the health HTTP endpoints enabled, configuring container health is made simple.
About the Health Check API
You can use the health check API from application code to determine whether Coherence is healthy, and also from an HTTP endpoint, which makes it useful for health checks in containerized environments such as Kubernetes and Docker.
The API is defined by the com.tangosol.util.HealthCheck interface:
public interface HealthCheck
    {
    /**
     * Returns the unique name of this health check.
     *
     * @return the unique name of this health check
     */
    String getName();

    /**
     * Returns {@code true} if this {@link HealthCheck} should
     * be included when working out this Coherence member's
     * health status.
     *
     * @return {@code true} if this {@link HealthCheck} should
     *         be included in the member's health status
     */
    default boolean isMemberHealthCheck()
        {
        return true;
        }

    /**
     * Returns {@code true} if the resource represented by
     * this {@link HealthCheck} is ready, otherwise returns
     * {@code false}.
     * <p>
     * The concept of what "ready" means may vary for different
     * types of resources.
     *
     * @return {@code true} if the resource represented by this
     *         {@link HealthCheck} is ready, otherwise {@code false}
     */
    boolean isReady();

    /**
     * Returns {@code true} if the resource represented by
     * this {@link HealthCheck} is alive, otherwise returns
     * {@code false}.
     * <p>
     * The concept of what "alive" means may vary for different
     * types of resources.
     *
     * @return {@code true} if the resource represented by this
     *         {@link HealthCheck} is alive, otherwise returns
     *         {@code false}
     */
    boolean isLive();

    /**
     * Returns {@code true} if the resource represented by
     * this {@link HealthCheck} is started, otherwise returns
     * {@code false}.
     * <p>
     * The concept of what "started" means may vary for different
     * types of resources.
     *
     * @return {@code true} if the resource represented by this
     *         {@link HealthCheck} is started, otherwise returns
     *         {@code false}
     */
    boolean isStarted();

    /**
     * Returns {@code true} if the resource represented by this
     * {@link HealthCheck} is in a safe state to allow a rolling
     * upgrade to proceed, otherwise returns {@code false}.
     * <p>
     * The concept of what "safe" means may vary for different
     * types of resources.
     *
     * @return {@code true} if the resource represented by this
     *         {@link HealthCheck} is in a safe state to allow
     *         a rolling upgrade to proceed, otherwise returns
     *         {@code false}
     */
    boolean isSafe();
    }
The methods were specifically chosen to integrate with other systems where Coherence is run, for example Kubernetes, that use similar "started", "live", and "ready" health checks. The "safe" check is specific to Coherence and is used to control use cases such as rolling upgrades, where it is important to know that a cluster is "safe" before rolling the next cluster member.
Health checks are accessed through the com.tangosol.net.management.Registry class. The Registry is typically obtained from the current Coherence Cluster instance. For example, when Coherence has been started by running com.tangosol.net.Coherence.main(), or by using the bootstrap API, you can obtain the management Registry as shown below:
Cluster cluster = Coherence.getInstance().getCluster();
Registry registry = cluster.getManagement();
The health check API can only see health checks registered for the local Coherence member; it is not a cluster-wide API. For cluster-wide health checks, use the corresponding health MBeans through the Coherence management API, JMX, or management over REST.
This section includes the following topics:
- Obtaining All Health Checks
- Obtaining a Health Check by Name
- Checking All Health Checks Are Ready
- Checking All Health Checks Have Started
- Checking All Health Checks Are Live
- Checking All Health Checks Are Safe
Obtaining All Health Checks
To obtain a collection of all the registered health checks, call the getHealthChecks() method on the Registry instance. This method returns an immutable collection of registered HealthCheck instances.
The following example obtains a Set of names of HealthCheck instances that are not ready:
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
Collection<HealthCheck> healthChecks = registry.getHealthChecks();
Set<String> names = healthChecks.stream()
        .filter(hc -> !hc.isReady())
        .map(HealthCheck::getName)
        .collect(Collectors.toSet());
Obtaining a Health Check by Name
To obtain a specific health check by name, call the getHealthCheck(String name) method on the Registry instance. This method returns an Optional containing the requested HealthCheck if one has been registered with the requested name, or an empty Optional if no HealthCheck has been registered with the requested name.
The following example obtains the HealthCheck with the name "Foo":
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
Optional<HealthCheck> healthCheck = registry.getHealthCheck("Foo");
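One way to consume the returned Optional is sketched below. Because the real Registry needs a running Coherence member, this sketch uses a reduced stand-in HealthCheck interface and a hypothetical Map-backed lookup helper; neither is part of Coherence, and only the Optional handling is the point. Treating an unregistered check as "not ready" is a design choice of the example, not documented behavior.

```java
import java.util.Map;
import java.util.Optional;

final class HealthCheckLookupSketch {
    /** Stand-in for com.tangosol.util.HealthCheck, reduced to two methods (assumption for illustration). */
    interface HealthCheck {
        String getName();
        boolean isReady();
    }

    /** Hypothetical helper that mimics an Optional-returning lookup by name. */
    static Optional<HealthCheck> getHealthCheck(Map<String, HealthCheck> registered, String name) {
        return Optional.ofNullable(registered.get(name));
    }

    /** Treats a health check that is not registered as "not ready". */
    static boolean isReady(Map<String, HealthCheck> registered, String name) {
        return getHealthCheck(registered, name)
                .map(HealthCheck::isReady)
                .orElse(false);
    }

    /** Builds a fixed-result check for the demonstration. */
    static HealthCheck check(String name, boolean ready) {
        return new HealthCheck() {
            @Override public String getName() { return name; }
            @Override public boolean isReady() { return ready; }
        };
    }

    public static void main(String[] args) {
        Map<String, HealthCheck> registered = Map.of("Foo", check("Foo", true));
        System.out.println(isReady(registered, "Foo")); // true
        System.out.println(isReady(registered, "Bar")); // false: nothing is registered as "Bar"
    }
}
```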
Checking All Health Checks Are Ready
The allHealthChecksReady() method on the Registry instance can be used to determine whether all locally registered health checks are ready. Only health checks that return true from their isMemberHealthCheck() method are included in the ready check.
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
boolean ready = registry.allHealthChecksReady();
Checking All Health Checks Have Started
The allHealthChecksStarted() method on the Registry instance can be used to determine whether all locally registered health checks have been started. Only health checks that return true from their isMemberHealthCheck() method are included in the started check.
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
boolean started = registry.allHealthChecksStarted();
Checking All Health Checks Are Live
The allHealthChecksLive() method on the Registry instance can be used to determine whether all locally registered health checks are live. Only health checks that return true from their isMemberHealthCheck() method are included in the live check.
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
boolean live = registry.allHealthChecksLive();
Checking All Health Checks Are Safe
The allHealthChecksSafe() method on the Registry instance can be used to determine whether all locally registered health checks are safe. Only health checks that return true from their isMemberHealthCheck() method are included in the safe check.
Coherence coherence = Coherence.getInstance();
Registry registry = coherence.getManagement();
boolean safe = registry.allHealthChecksSafe();
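The four "all health checks" methods share one rule: a check only counts if its isMemberHealthCheck() method returns true. The sketch below models that rule with a stand-in HealthCheck interface and a hypothetical allPass helper; it illustrates the documented filtering and is not the Registry implementation.

```java
import java.util.List;
import java.util.function.Predicate;

final class MemberHealthSketch {
    /** Stand-in for com.tangosol.util.HealthCheck, reduced for illustration (assumption). */
    interface HealthCheck {
        String getName();
        default boolean isMemberHealthCheck() { return true; }
        boolean isReady();
    }

    /** Simple fixed-result check used only by this example. */
    static final class SimpleCheck implements HealthCheck {
        private final String name;
        private final boolean ready;
        private final boolean memberCheck;

        SimpleCheck(String name, boolean ready, boolean memberCheck) {
            this.name = name;
            this.ready = ready;
            this.memberCheck = memberCheck;
        }

        @Override public String getName() { return name; }
        @Override public boolean isReady() { return ready; }
        @Override public boolean isMemberHealthCheck() { return memberCheck; }
    }

    /** Returns true if every check that opts in to member health passes the given probe. */
    static boolean allPass(List<HealthCheck> checks, Predicate<HealthCheck> probe) {
        return checks.stream()
                .filter(HealthCheck::isMemberHealthCheck)
                .allMatch(probe);
    }

    public static void main(String[] args) {
        List<HealthCheck> checks = List.of(
                new SimpleCheck("Cache", true, true),
                new SimpleCheck("Reporting", false, false)); // not ready, but excluded
        System.out.println(allPass(checks, HealthCheck::isReady)); // true
    }
}
```

In this model the "Reporting" check is still visible individually, but its failing state does not affect the aggregate result.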
Using the Built-In Health Checks
Coherence has a number of health checks that are ready to use:
- Each Coherence service has a corresponding health check.
- Instances of com.tangosol.net.Coherence provide a corresponding health check.
- When using Coherence gRPC integrations, the gRPC proxy server has a health check.
This section includes the following topics:
- Using the Service Health Checks
- Using the PartitionedCache Service isSafe Check
- Excluding Services from Member Health
- Allowing Endangered Services
Using the Service Health Checks
- Started: The isStarted() method for a service health check will return true if the corresponding service is running.
- Live: The isLive() method for a service health check will return true if the corresponding service is running.
- Ready: For a service, the isReady() method will return false until a service becomes "safe", after which the "ready" state will remain true. This is specifically for use cases such as Kubernetes, where Pods will be removed from a Service if not Ready. However, this behavior is typically not required for Coherence.
- Safe: For all services except a partitioned cache service, the isSafe() method will always return true.
Using the PartitionedCache Service isSafe Check
A Coherence PartitionedCache service is more complex than most services in Coherence, and as such, its health checks also do more. The isSafe() check for a PartitionedCache service performs a number of checks to ensure that the service is stable and safe. The main use cases for the "safe" check are when performing a rolling upgrade or safely scaling down a cluster.
- The isSafe() health check for a PartitionedCache service on a non-storage enabled member will return true as long as the service is running.
- The isSafe() health check for a PartitionedCache service will return false if this member is the only storage enabled member for the service, but does not own all the partitions. This can happen just after all the other members of the cluster have been stopped but the partition recovery and reallocation logic is still in progress. Therefore, this member does not yet know that it owns all the partitions.
- The isSafe() health check for a PartitionedCache service will return false if the backup count is configured to be greater than zero and the StatusHA state for the service is endangered. You can change this behavior for individual services in the cache configuration file to allow them to be endangered. A service with a backup count of zero is allowed to be endangered for the safe check.
- The isSafe() health check for a PartitionedCache service will return false if partition redistribution is in progress.
- The isSafe() health check for a PartitionedCache service will return false if recovery from persistent storage is in progress.
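The rules above can be read as a single decision function. The sketch below is only a model of those rules, not the Coherence implementation: the ServiceState class and its field names are hypothetical, and both the order in which the conditions are tested and the treatment of a stopped service as unsafe are assumptions made for the example.

```java
final class PartitionedSafeSketch {
    /** Hypothetical snapshot of the inputs named in the rules above. */
    static final class ServiceState {
        boolean running = true;
        boolean storageEnabled = true;
        boolean onlyStorageMember;
        boolean ownsAllPartitions;
        int backupCount = 1;
        boolean endangered;          // StatusHA is endangered
        boolean allowEndangered;     // service configured to tolerate the endangered state
        boolean redistributionInProgress;
        boolean recoveryInProgress;
    }

    /** Models the documented "safe" rules for a partitioned cache service. */
    static boolean isSafe(ServiceState s) {
        if (!s.running) {
            return false; // assumption: a stopped service is never safe
        }
        if (!s.storageEnabled) {
            return true; // non-storage members are safe while the service runs
        }
        if (s.onlyStorageMember && !s.ownsAllPartitions) {
            return false; // partition recovery or reallocation has not finished
        }
        if (s.backupCount > 0 && s.endangered && !s.allowEndangered) {
            return false; // backups are expected but currently missing
        }
        return !s.redistributionInProgress && !s.recoveryInProgress;
    }

    public static void main(String[] args) {
        ServiceState state = new ServiceState();
        System.out.println(isSafe(state)); // true

        state.endangered = true;
        System.out.println(isSafe(state)); // false: backup count is 1 and the service is endangered

        state.backupCount = 0;
        System.out.println(isSafe(state)); // true: no backups configured, endangered is allowed
    }
}
```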
Excluding Services from Member Health
Sometimes it may be desirable to exclude a Coherence service from the member's overall health check. This can be done by setting the <member-health-check> element in the service's <health> element in the cache configuration file.
For example, the proxy-scheme below has the <member-health-check> element value set to false. The health checks for the Proxy service will still be accessible through the health API, but checks of the overall member health, such as the Registry class's allHealthChecksReady() method, will not include the Proxy service.
<proxy-scheme>
  <service-name>Proxy</service-name>
  <autostart>true</autostart>
  <health>
    <member-health-check>false</member-health-check>
  </health>
</proxy-scheme>
Allowing Endangered Services
Sometimes an application may configure a distributed cache service that can intentionally become endangered, and this state should not be reflected in the member's overall health. This can be done by setting the <allow-endangered> element in the distributed scheme's <health> element in the cache configuration file.
For example, the distributed-scheme below has the <allow-endangered> element value set to true. The health checks for the PartitionedCache will report that the service is "ready" or "safe" even if the StatusHA value for the service is ENDANGERED.
<distributed-scheme>
  <scheme-name>distributed-scheme</scheme-name>
  <service-name>PartitionedCacheOne</service-name>
  <backing-map-scheme>
    <local-scheme/>
  </backing-map-scheme>
  <autostart>true</autostart>
  <health>
    <allow-endangered>true</allow-endangered>
  </health>
</distributed-scheme>
Enabling HTTP Health Checks
The health check HTTP endpoints are enabled when Coherence is run using the bootstrap API, or when Coherence is started using com.tangosol.net.Coherence as the main class. If Coherence is started by any other method, the health check API is still available, but the HTTP endpoints will not be running. By default, the HTTP server will bind to an ephemeral port, but this can be changed by setting the coherence.health.http.port system property or the COHERENCE_HEALTH_HTTP_PORT environment variable.
For example, the following command starts Coherence with the health endpoints on http://localhost:6676:
java -cp coherence.jar -Dcoherence.health.http.port=6676 \
    com.tangosol.net.Coherence
The equivalent command when running Coherence from the module path is:
java -p coherence.jar \
    -Dcoherence.health.http.port=6676 \
    --module com.oracle.coherence
The curl utility can then be used to poll one of the endpoints, for example /ready:
curl -i -X GET http://localhost:6676/ready
HTTP/1.1 200 OK
Date: Tue, 19 Apr 2022 17:59:05 GMT
Content-type: application/json
Vary: Accept-Encoding
Content-length: 0
X-content-type-options: nosniff
If a Coherence health check fails, the response code will be 503 (Service Unavailable).
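The same 200 or 503 contract can be checked from Java. The sketch below uses only JDK classes (java.net.http.HttpClient, available from Java 11). So that it runs without Coherence, it starts a throwaway local stub server that always answers 503; the stub and the HealthProbeSketch class name are assumptions of the example, and against a real member you would pass the member's own health URL instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class HealthProbeSketch {
    /** Returns true only when the endpoint answers 200; 503 or any other status counts as unhealthy. */
    static boolean isHealthy(URI endpoint) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(endpoint).GET().build();
        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode() == 200;
    }

    /** Starts a local stub that answers /ready with a fixed status (stand-in for a Coherence member). */
    static HttpServer startStub(int status) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/ready", exchange -> {
            exchange.sendResponseHeaders(status, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer stub = startStub(503);
        try {
            URI uri = URI.create("http://127.0.0.1:" + stub.getAddress().getPort() + "/ready");
            System.out.println(isHealthy(uri)); // false: the stub always answers 503
        } finally {
            stub.stop(0);
        }
    }
}
```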
Health HTTP Endpoints
The health check HTTP server has a number of endpoints.
Note:
None of the endpoints accepts a payload or returns a response body. The only response is either a 200 or a 503 status code. This means that although the health endpoints can be configured to use SSL/TLS, there is little need for encryption, making their use by external tooling such as Kubernetes and other container environments simpler.
Table 5-1 Health Check HTTP Server Endpoints
Endpoint | Description |
---|---|
| This endpoint returns a 200 status code if all the health checks for the member the request is sent to have been "started". If one or more health checks are not started, the endpoint returns a 503 response. |
| This endpoint returns a 200 status code if all the health checks for the member the request is sent to are "live". If one or more health checks are not live, the endpoint returns a 503 response. |
| This endpoint returns a 200 status code if all the health checks for the member the request is sent to are "ready". If one or more health checks are not ready, the endpoint returns a 503 response. |
| This endpoint returns a 200 status code if all the health checks for the member the request is sent to are "safe". If one or more health checks are not safe, the endpoint returns a 503 response. |
Using Application Health Checks
To register a custom health check, write an implementation of com.tangosol.util.HealthCheck.
The getName() method for the custom health check should return a unique name that represents this health check. As health checks are exposed as MBeans, the name must be valid in a JMX MBean object name.
The health check implementation should then use relevant application logic to determine the result to return for each of the methods. Some methods may not apply, in which case they should just return true.
It is important to understand how the results of the different health check methods will be used outside the application code, for example, when the application is deployed and managed by an external system that monitors application health. An application deployed into Kubernetes could be killed if it reports not being "live" too many times. An application that does not report being "ready" may be excluded from request routing. An application that is not "safe" will block rolling upgrades or safe scaling of a Coherence cluster.
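A minimal sketch of such an implementation follows. To stay runnable without the Coherence jar, it declares a local copy of the interface shape shown earlier in this chapter; in a real application the class would implement com.tangosol.util.HealthCheck instead. The WorkQueueHealthCheck class, its backlog limit, and the name "WorkQueue" are hypothetical, and registration with the Coherence Registry is not shown.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class CustomHealthCheckSketch {
    /** Local copy of the interface shape; the real type is com.tangosol.util.HealthCheck. */
    interface HealthCheck {
        String getName();
        default boolean isMemberHealthCheck() { return true; }
        boolean isReady();
        boolean isLive();
        boolean isStarted();
        boolean isSafe();
    }

    /** Hypothetical application service: a work queue that is "ready" while its backlog is below a limit. */
    static final class WorkQueueHealthCheck implements HealthCheck {
        private final AtomicBoolean started = new AtomicBoolean();
        private final AtomicInteger backlog = new AtomicInteger();
        private final int maxBacklog;

        WorkQueueHealthCheck(int maxBacklog) { this.maxBacklog = maxBacklog; }

        void start() { started.set(true); }
        void setBacklog(int size) { backlog.set(size); }

        @Override public String getName() { return "WorkQueue"; } // plain name, usable in a JMX object name
        @Override public boolean isStarted() { return started.get(); }
        @Override public boolean isReady() { return started.get() && backlog.get() < maxBacklog; }
        @Override public boolean isLive() { return true; } // does not apply to this service
        @Override public boolean isSafe() { return true; } // no bearing on rolling upgrades here
    }

    public static void main(String[] args) {
        WorkQueueHealthCheck check = new WorkQueueHealthCheck(100);
        System.out.println(check.isReady()); // false: not started yet
        check.start();
        System.out.println(check.isReady()); // true: started with an empty backlog
        check.setBacklog(250);
        System.out.println(check.isReady()); // false: backlog exceeds the limit
    }
}
```

Returning true from isLive() here is deliberate: a large backlog should take the member out of request routing, not cause the container to be restarted.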
Excluding Custom Health Checks from Member Health
An application developer may want to add custom health checks for application services, but not have these checks impact the overall Coherence member health. The HealthCheck interface has an isMemberHealthCheck() method for this purpose. The default implementation of isMemberHealthCheck() always returns true, so by default all health checks are included in the member's health. To exclude a health check from the member's health, override the isMemberHealthCheck() method to return false.
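The override can be sketched as follows. As before, a local copy of the interface stands in for com.tangosol.util.HealthCheck so that the example runs on its own, and the ReportingHealthCheck class and its "Reporting" name are hypothetical.

```java
final class ExcludedHealthCheckSketch {
    /** Local copy of the interface shape; the real type is com.tangosol.util.HealthCheck. */
    interface HealthCheck {
        String getName();
        default boolean isMemberHealthCheck() { return true; }
        boolean isReady();
        boolean isLive();
        boolean isStarted();
        boolean isSafe();
    }

    /** Hypothetical check for an optional feature that must not affect member health. */
    static final class ReportingHealthCheck implements HealthCheck {
        private volatile boolean available;

        void setAvailable(boolean available) { this.available = available; }

        @Override public String getName() { return "Reporting"; }

        /** Opt out of the member's overall health status. */
        @Override public boolean isMemberHealthCheck() { return false; }

        @Override public boolean isReady() { return available; }
        @Override public boolean isLive() { return true; }
        @Override public boolean isStarted() { return true; }
        @Override public boolean isSafe() { return true; }
    }

    public static void main(String[] args) {
        ReportingHealthCheck check = new ReportingHealthCheck();
        System.out.println(check.isMemberHealthCheck()); // false
        System.out.println(check.isReady());             // false until setAvailable(true) is called
    }
}
```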
Using Containerized Health Checks
When running Coherence applications in containers, for example, in Docker or Kubernetes, it is useful to be able to make use of health and readiness checks. By running Coherence with the health HTTP endpoints enabled, configuring container health is made simple.
When using the health check endpoints in a container, the HTTP port needs to be fixed so that the image's health checks can be configured. The default behavior of binding to an ephemeral port would mean that the system does not know which port the health check API is bound to. The HTTP port can be set using the coherence.health.http.port system property or the COHERENCE_HEALTH_HTTP_PORT environment variable. When creating images, it is typically simpler to use environment variables, which is what the examples in this section show.
This section includes the following topics:
- Using the Docker Health Checks
- Using the Kubernetes Readiness and Liveness Probes
Using the Docker Health Checks
It is possible to build a Coherence Docker image configured with a health check using the HEALTHCHECK configuration in the Dockerfile.
The Dockerfile below sets the health check port to 6676 using the ENV COHERENCE_HEALTH_HTTP_PORT=6676 setting. The Dockerfile is then configured with a HEALTHCHECK where the command will run curl against the HTTP endpoint on http://127.0.0.1:6676/ready. This command will fail if the response is not 200.
FROM openjdk:11-jre
ADD coherence.jar /coherence/lib/coherence.jar
ENTRYPOINT [ "java" ]
CMD [ "-cp", "/coherence/lib/*", "com.tangosol.net.Coherence" ]
ENV COHERENCE_HEALTH_HTTP_PORT=6676
HEALTHCHECK CMD curl --fail http://127.0.0.1:6676/ready || exit 1
The check above assumes that the base image has curl installed. This is not always the case; for example, some very slim Linux base images or distroless images will not have any additional tools such as curl. In this case, all the image has is Java. Therefore, you can configure the health check to use a Java health check client class, com.tangosol.util.HealthCheckClient, that is built into the Coherence jar. This class can be run with a single parameter, which is the URL of the HTTP endpoint to check.
The Dockerfile below uses a distroless base image that only has a Linux kernel and Java 11 installed. The health check port is set to 6676 using the ENV COHERENCE_HEALTH_HTTP_PORT=6676 setting. The Dockerfile is then configured with a HEALTHCHECK where the command will run java -cp /coherence/lib/coherence.jar com.tangosol.util.HealthCheckClient http://127.0.0.1:6676/ready. This command will fail if the response is not 200.
FROM gcr.io/distroless/java11
ADD coherence.jar /coherence/lib/coherence.jar
ENTRYPOINT [ "java" ]
CMD [ "-cp", "/coherence/lib/*", "com.tangosol.net.Coherence" ]
ENV COHERENCE_HEALTH_HTTP_PORT=6676
HEALTHCHECK CMD java -cp /coherence/lib/coherence.jar com.tangosol.util.HealthCheckClient http://127.0.0.1:6676/ready
Using the Kubernetes Readiness and Liveness Probes
In Kubernetes, there are various readiness and liveness probes that can be configured. The image itself does not need a health check (see Using the Docker Health Checks) as Kubernetes readiness and liveness is independent of the image. For complete details about configuring Kubernetes readiness and liveness, see the Kubernetes documentation.
The example below shows a Kubernetes Pod using a Coherence image and health checks. The COHERENCE_HEALTH_HTTP_PORT environment variable is used to fix the health check HTTP port to 6676. The readinessProbe is then configured to use an HTTP GET request on port 6676 using the request path /ready. The host for the request defaults to the Pod IP address, so the request will effectively be the same as http://<pod-ip>:6676/ready.
apiVersion: v1
kind: Pod
metadata:
  name: coherence
spec:
  containers:
    - name: coherence
      image: ghcr.io/oracle/coherence-ce:22.06
      env:
        - name: COHERENCE_HEALTH_HTTP_PORT
          value: "6676"
        - name: COHERENCE_WKA
          value: coherence_wka.svc.cluster.local
      readinessProbe:
        httpGet:
          path: "/ready"
          port: 6676
        initialDelaySeconds: 30
        periodSeconds: 30