Disaster Recovery Service and Peering Health Check

When the Disaster Recovery (DR) service is active, Private Cloud Appliance monitors both the status of its local DR service and the health of the peer system it is connected to.

Internally, the controller software ensures that the container hosting the DR service is running and functioning correctly. A Kubernetes liveness probe exchanges messages over the RabbitMQ bus at regular intervals to confirm the container's status. If the communication times out or the response indicates that the container is unhealthy, Kubernetes tries to restart it.
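
The pass/fail logic of such a probe can be illustrated with a minimal Python sketch. It is not the appliance's actual implementation: the function name liveness_check, the "healthy" reply string, and the in-memory queue standing in for the RabbitMQ reply channel are all assumptions made for illustration. The only behavior taken from the description above is that a timeout or an unhealthy response counts as a failed check.

```python
import queue


def liveness_check(replies: queue.Queue, timeout_s: float = 1.0) -> bool:
    """Return True only if a 'healthy' reply arrives before the timeout.

    `replies` is a hypothetical stand-in for the message bus reply channel.
    """
    try:
        # Wait for the service to answer the status request.
        reply = replies.get(timeout=timeout_s)
    except queue.Empty:
        # No answer in time: treated as a failed probe.
        return False
    # Any answer other than "healthy" also fails the probe.
    return reply == "healthy"


# Usage: a reply that arrives in time passes, silence fails.
answered = queue.Queue()
answered.put("healthy")
silent = queue.Queue()
print(liveness_check(answered, timeout_s=0.1))
print(liveness_check(silent, timeout_s=0.1))
```

In a real liveness probe, a failed check is what leads Kubernetes to restart the container once the configured failure threshold is reached.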

To verify the operational status of the DR service, health checks are performed on both the local service and the remote target. The checks report the status of the DR service, the replication status, and whether a peer connection is present. If a peer connection is active, the health status of the remote DR service is also reported.

Data from DR service health checks is stored in Prometheus.

DR Metric

Description

dr_health_status

Status of the local DR service. Possible values are:

  • 0 = healthy

  • 1 = unhealthy

  • 2 = service not set up

  • 3 = peer connection not enabled

dr_peer_health_status

Status of the DR service on the peer system. Possible values are:

  • 0 = healthy

  • 1 = unhealthy

  • 2 = service not set up

  • 3 = peer connection not enabled
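
As an illustration of how these values might be read, the following Python sketch builds a URL for the standard Prometheus instant-query endpoint (/api/v1/query) and translates the numeric values in a response into the descriptions listed above. The base URL, the helper names, and the sample payload are hypothetical. How Prometheus is reached on a given appliance is not covered here, so the sketch decodes a hard-coded response instead of making a network call.

```python
from urllib.parse import urlencode

# Value meanings, as listed in the metric descriptions above.
DR_STATUS = {
    0: "healthy",
    1: "unhealthy",
    2: "service not set up",
    3: "peer connection not enabled",
}


def instant_query_url(base_url: str, metric: str) -> str:
    """Build a Prometheus instant-query URL for one metric name."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': metric})}"


def decode_status(payload: dict) -> list:
    """Map each sample in an instant-query response to its description."""
    samples = payload["data"]["result"]
    # Prometheus returns each sample value as [timestamp, "value-as-string"].
    return [
        DR_STATUS.get(int(float(sample["value"][1])), "unknown")
        for sample in samples
    ]


# Hypothetical response body, shaped like a Prometheus instant-query result.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "dr_peer_health_status"},
             "value": [1700000000.0, "3"]},
        ],
    },
}

print(instant_query_url("http://prometheus.example:9090", "dr_peer_health_status"))
print(decode_status(sample_payload))
```

Keeping the value-to-description mapping in a single dictionary means the same decoding works for both dr_health_status and dr_peer_health_status, since the two metrics share the same set of values.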