A Resilient BSF Microservices with Kubernetes

BSF microservice pods run on a Kubernetes cluster. Occasionally, pod disruptions may occur within the cluster, from either voluntary or involuntary causes. Such disruptions cause pods or services to fail, and the resulting outage interrupts service continuity. To mitigate these disruptions, the BSF Kubernetes cluster and the services running on it are designed to be resilient by adopting the recommendations and strategies of the Kubernetes framework, thereby improving availability and preventing or minimizing downtime and outages before they occur.

Described below are the various failure points that might occur in the BSF cluster, and the resilience model adopted to handle each of them.
Failure Point: Worker Node failure
Recovery - Multiple pods

Having multiple nodes in a cluster provides high availability: pods can be scheduled on different nodes, which removes the single point of failure. Running multiple copies of the BSF services/pods reduces the chance of outages and service degradation.
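
As a minimal sketch, a Deployment that runs several replicas of one service could look as follows. The service name, image, and replica count are illustrative assumptions, not BSF defaults.

```yaml
# Sketch only: run more than one copy of a service so the loss of a single
# pod or worker node does not take down the whole service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bsf-management            # hypothetical service name
spec:
  replicas: 3                     # more than one copy of the service
  selector:
    matchLabels:
      app: bsf-management
  template:
    metadata:
      labels:
        app: bsf-management
    spec:
      containers:
        - name: bsf-management
          image: example.registry/bsf-management:1.0   # placeholder image
```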

For more information about this functionality, see section BSF Services.

Recovery - Anti-Affinity rules

The placement of a pod and its replicas can be controlled using Kubernetes pod affinity and anti-affinity rules. A pod anti-affinity rule instructs Kubernetes not to co-locate the pods on the same node, which avoids an outage due to the loss of a single node.
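
A minimal sketch of such a rule inside a Deployment's pod template is shown below; the app: bsf-management label is an illustrative assumption. The topologyKey of kubernetes.io/hostname tells the scheduler not to place two matching pods on the same node.

```yaml
# Sketch only: pod anti-affinity rule in a pod template
# (label values are illustrative assumptions).
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: bsf-management
              topologyKey: kubernetes.io/hostname   # at most one matching pod per node
```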

Failure Point: Physical Server (Hosting Worker Node/s) failure
Recovery - Pod Topology Spread (PTS)

Pod topology spread constraints tell the Kubernetes scheduler how to spread pods across topology domains in a cluster: nodes, zones, regions, or other user-defined domains. They allow users to split nodes into groups using labels; a label selector then identifies the pods to be spread, and the constraint indicates to the scheduler how evenly or unevenly those pods can be distributed across the groups.
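
A minimal sketch of a topology spread constraint in a pod spec follows; the label, topology key, and maxSkew value are illustrative assumptions rather than BSF settings.

```yaml
# Sketch only: spread matching pods evenly across a topology domain.
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                  # allow at most 1 pod of imbalance
      topologyKey: topology.kubernetes.io/zone    # spread across zones (or another node label)
      whenUnsatisfiable: ScheduleAnyway           # prefer, but do not block scheduling
      labelSelector:
        matchLabels:
          app: bsf-management                     # hypothetical label
```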

Failure Point: Cluster needs to be upgraded or needs to shut down
Recovery - PodDisruptionBudget (PDB)

Setting a PDB ensures that the cluster has a sufficient number of available replicas to keep the service functioning even during maintenance. Using the PDB, you define the number (or percentage) of pods that can be voluntarily disrupted at a time. With a PDB configured, Kubernetes drains a node while honoring the configured disruption budget, and new pods are deployed on other available nodes. This approach ensures that Kubernetes schedules workloads in an optimal way while controlling the disruption based on the PDB configuration.
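
A minimal sketch of a PodDisruptionBudget follows; the name, selector, and 50% value are illustrative assumptions. The budgets actually used by BSF are described in the PodDisruptionBudget Configuration section.

```yaml
# Sketch only: keep at least half of the matching pods available during
# voluntary disruptions such as a node drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bsf-management-pdb        # hypothetical name
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: bsf-management
```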

For more information about this functionality, see section PodDisruptionBudget Configuration.

Recovery - Terminate gracefully

When a pod is evicted, it is terminated gracefully, honoring the termination grace period (terminationGracePeriodSeconds) setting in the custom YAML file.
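
A minimal sketch of the relevant pod spec fields is shown below, with illustrative values rather than the BSF defaults; the preStop hook runs before the container receives SIGTERM, giving in-flight requests time to drain, and the grace period bounds the total shutdown time.

```yaml
# Sketch only: termination grace period plus an optional preStop drain delay.
spec:
  terminationGracePeriodSeconds: 30             # illustrative value
  containers:
    - name: bsf-management                      # hypothetical container name
      image: example.registry/bsf-management:1.0
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]   # small drain delay before SIGTERM
```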

For more information about this functionality, see section Graceful Shutdown Configurations.

Failure Point: Pod/Application failure
Recovery - Kubernetes Probes

Kubernetes provides probes, that is, health checks, to monitor and act on the state of pods (containers) and to make sure that only healthy pods serve traffic. With the help of probes, you can control when a pod should be deemed started, ready for service, or live to serve traffic. Kubernetes provides three types of health-check probes (a configuration sketch follows the list):
  • Liveness probes let Kubernetes know whether the application is running.
  • Readiness probes let Kubernetes know when the application is ready to serve traffic.
  • Startup probes let Kubernetes know whether the application has started properly.
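
A minimal sketch of all three probes on one container follows; the paths, port, and timings are assumptions, not the values shipped with BSF.

```yaml
# Sketch only: startup, readiness, and liveness probes on a single container.
containers:
  - name: bsf-management                 # hypothetical container name
    image: example.registry/bsf-management:1.0
    startupProbe:                        # gates the other probes until startup succeeds
      httpGet:
        path: /health/started            # assumed health endpoint
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    readinessProbe:                      # removes the pod from Service endpoints when failing
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
    livenessProbe:                       # restarts the container when failing
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
```
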
Failure Point: High traffic Rate
Recovery - Horizontal Pod Auto-Scaling (HPA)

When there is an increase or drop in traffic, Kubernetes can automatically increase or decrease the number of pod replicas that serve the traffic. Horizontal scaling means that the response to increased load is to deploy more pods. If the load decreases and the number of pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource to scale back down.
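
A minimal sketch of an autoscaling/v2 HorizontalPodAutoscaler follows; the target Deployment, replica limits, and CPU threshold are illustrative assumptions.

```yaml
# Sketch only: scale a Deployment on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bsf-management-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bsf-management          # hypothetical target
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # illustrative threshold
```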

Failure Point: Any failure (including intra-NF/inter-NF communication failures)
Recovery - All BSF services support metrics and logs to capture the behavior.

BSF Microservices Resilience details

The criticality of a service failure is indicated as HIGH, MEDIUM, or LOW, with the following meanings:
  • HIGH - Service failure impacts the traffic, and the traffic cannot be handled successfully.
  • MEDIUM - Service failure impacts the traffic, and the traffic cannot be handled successfully by the default processing model.
  • LOW - Service failure does not impact the traffic directly.

Table A-1 BSF Kubernetes cluster Resiliency details

Each entry below lists the resilience attributes of a BSF service (Multi-pod, Affinity/Anti-affinity rule, HPA, PDB, PTS, Node Selector, Serviceability status tracking), followed by the criticality, the impact of service loss/failure, the overload control/protection support, and the dependent service tracking and reporting.

Alternate Route Service
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: 0%/80%; PDB: 1; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: When DNS-SRV based alternate routing is enabled, this service handles subsequent messages on failure of the initial producer, handles notifications on failure of the consumer, and performs SRV-based lookup for NRF and SCP.
  • Overload control/protection: N
  • Dependent service tracking and reporting: N

App-info
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: 2%/80%; PDB: 50%; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: This service tracks the status of all services and cnDBTier. It is used by services such as:
    - Nrf-client, for NF registration: on App-info pod failure, Nrf-client uses the last known state fetched from App-info. However, if the Nrf-client pod also restarts, the cached data is lost and the NF service will be suspended at NRF.
    - Diameter Gateway, to track the readiness status of cnDBTier: on App-info pod failure, Diameter Gateway uses the last known state fetched from App-info. However, if the Diameter Gateway pod also restarts, it will fail to detect DB availability and will not be able to accept signaling traffic.
  • Overload control/protection: N
  • Dependent service tracking and reporting: N

Audit Service
  • Multi-pod: N; Affinity/Anti-affinity rule: Y; HPA: 0%/60%; PDB: --; PTS: Y; Node Selector: N; Serviceability status tracking: Y
  • Criticality: LOW
  • Impact of service loss/failure: This service handles stale session cleanup and the retry binding create operation. Loss of this service leads to a large number of stale records and to failures of retry binding sessions.
  • Overload control/protection: N
  • Dependent service tracking and reporting: N

BSF Management Service
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: 0%/40%; PDB: 50%; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: --
  • Overload control/protection: N
  • Dependent service tracking and reporting: N

CM Service
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: -- (fixed set of replicas); PDB: 50%; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: This service provides the user interface for making configuration changes for policy (including common service configuration, for example Ingress Gateway, Egress Gateway, and Nrf-client). If this service fails, other services are not able to fetch configuration data. Common service pods can continue to run with their existing configurations, but they are impacted if a pod restarts.
  • Overload control/protection: N
  • Dependent service tracking and reporting: N

Config Server
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: 5%/80%; PDB: 50%; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: This service is responsible for providing any configuration change to the other services. Other services continue to work with their existing configuration data, but a container restart or pod scaling leads to readiness failures, as those pods cannot accept traffic without the new configuration.
  • Overload control/protection: N
  • Dependent service tracking and reporting: N

Diameter Gateway
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: -- (fixed set of replicas); PDB: 50%; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: The failure of this service impacts all Diameter-related traffic in a site.
  • Overload control/protection: Y. Enforces overload control for backend services.
  • Dependent service tracking and reporting: DB status is tracked (through App-info) using Helm configuration to determine the readiness status of the gateway pod.

Egress Gateway
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: 0%/80%; PDB: 1; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: The loss of this service means that NRF marks the site as "SUSPENDED" due to loss of heartbeat (HB).
  • Overload control/protection: N
  • Dependent service tracking and reporting: None

Ingress Gateway
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: 0%/80%; PDB: 1; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: The loss of this service impacts connectivity from SCP, NRF, or other peers for the ingress flow, and therefore indicates site failure to peers at the network/transport level. Note: There is no signaling failure if consumers perform alternate routing to an alternate site to achieve session continuity.
  • Overload control/protection: Y. Enforces overload control for backend services; rate limiting is supported at the backend as well as at the Ingress Gateway.
  • Dependent service tracking and reporting: None

NRF Client NF Management Service
  • Multi-pod: Y (Policy needs to enable multi-pod support; currently set to 1); Affinity/Anti-affinity rule: Y; HPA: 0%/80%; PDB: 25%; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: HIGH
  • Impact of service loss/failure: This service is responsible for NF profile registration with NRF, and it also performs NRF heartbeat (HB) and NRF health check functionality. Loss of this service for an HB timer interval means that NRF can mark a given PCF instance as SUSPENDED. As soon as the Nrf-mgmt pod becomes available again, it automatically refreshes the NF's profile at NRF and brings the site back to the REGISTERED state (if the NRF state was SUSPENDED).
  • Overload control/protection: None
  • Dependent service tracking and reporting: None

Perf-Info
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: -- (fixed set of replicas); PDB: 50%; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: MEDIUM
  • Impact of service loss/failure: This service is responsible for calculating the load and overload level. It is used by services such as:
    - Nrf-client, for load reporting: when the Perf-Info pod is down, Nrf-client uses the last known load level fetched from Perf-Info. However, if the Nrf-client pod also restarts, it loses its cached information and reports the load level as zero.
    - Ingress Gateway/Diameter Gateway, to track the overload status of backend services: when the Perf-Info pod is down, these services use the last known state reported by Perf-Info.
  • Overload control/protection: None
  • Dependent service tracking and reporting: None

Query Service
  • Multi-pod: Y; Affinity/Anti-affinity rule: Y; HPA: 0%/80%; PDB: 50%; PTS: N; Node Selector: Y; Serviceability status tracking: Y
  • Criticality: LOW
  • Impact of service loss/failure: The loss of this service means that the operator cannot use the query/session viewer functionality. HA is used to provide better serviceability.
  • Overload control/protection: None
  • Dependent service tracking and reporting: None