A Resilient Policy Microservices with Kubernetes

Policy microservice pods run on a Kubernetes cluster. Occasionally, pod disruptions occur within the cluster, from either voluntary or involuntary causes. A disruption can cause a pod or service to fail, and the resulting outage interrupts service continuity. To mitigate these disruptions, the policy Kubernetes cluster and the services running on it are designed to be resilient by adopting the recommendations and strategies of the Kubernetes framework. This improves availability and prevents or minimizes downtime and outages before they occur.

Described below are the various failure points that can occur in the policy cluster, and the resilience model adopted to handle each of them.
Failure Point: Worker Node failure
Recovery - Multiple pods

Having multiple nodes in a cluster provides high availability: pods can be scheduled on different nodes, removing the single point of failure. Running multiple copies of the policy services/pods reduces the chances of outages and service degradation.

For more information about this functionality, see section Policy Services.
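
As an illustration, a minimal Deployment sketch running two replicas is shown below; the service name, labels, and image are hypothetical placeholders, not actual policy chart values.

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sm-service              # hypothetical policy service name
spec:
  replicas: 2                   # multiple copies remove the single point of failure
  selector:
    matchLabels:
      app: sm-service
  template:
    metadata:
      labels:
        app: sm-service
    spec:
      containers:
        - name: sm-service
          image: example/sm-service:1.0   # placeholder image
```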

Recovery - Anti-Affinity rules

The placement of a pod and any of its replicas can be controlled using Kubernetes pod affinity and anti-affinity rules. A pod anti-affinity rule instructs Kubernetes not to co-locate replicas on the same node, which avoids an outage due to the loss of a single node.

For more information about this functionality, see section Anti-affinity Approach to Assign Pods to Nodes.
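
As a minimal sketch, the pod-spec fragment below expresses such a rule; the label key and value are illustrative, not the actual chart values.

```
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: sm-service                   # hypothetical service label
        topologyKey: kubernetes.io/hostname   # at most one replica per node
```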

Failure Point: Physical Server (Hosting Worker Node/s) failure
Recovery - Pod Topology Spread (PTS)

Pod topology spread constraints tell the Kubernetes scheduler how to spread pods across a cluster: across nodes, zones, regions, or other user-defined topology domains. They allow users to split nodes into groups using labels. Users can then select pods with a label selector and indicate to the scheduler how evenly or unevenly those pods may be distributed, as shown in the sketch below.
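
Assuming nodes carry a topology label such as topology.kubernetes.io/zone (the key and the selected app label below are illustrative), a pod-spec fragment could look like this:

```
topologySpreadConstraints:
  - maxSkew: 1                                # allow at most one pod of imbalance
    topologyKey: topology.kubernetes.io/zone  # spread across zones (or another domain)
    whenUnsatisfiable: ScheduleAnyway         # soft constraint; DoNotSchedule makes it hard
    labelSelector:
      matchLabels:
        app: sm-service                       # hypothetical service label
```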

Failure Point: Cluster needs to be upgraded or shut down
Recovery - PodDisruptionBudget (PDB)

Setting a PDB ensures that the cluster keeps a sufficient number of replicas available to remain functional even during maintenance. The PDB defines the number (or percentage) of pods that can be terminated at a time. With a PDB configured, Kubernetes drains a node following the configured disruption schedule, and new pods are deployed on other available nodes. This approach lets Kubernetes schedule workloads in an optimal way while controlling the disruption based on the PDB configuration.

For more information about this functionality, see section PodDisruptionBudget Configuration.
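
A minimal sketch of such a budget is shown below; the name, selector, and threshold are illustrative, not the values shipped with the product (Table A-1 lists the per-service PDB settings).

```
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sm-service-pdb            # hypothetical name
spec:
  maxUnavailable: 20%             # at most 20% of replicas may be disrupted at once
  selector:
    matchLabels:
      app: sm-service             # hypothetical service label
```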

Recovery - Terminate gracefully

When a pod is evicted, it is terminated gracefully, honoring the termination grace period setting in the custom YAML file.
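
For illustration, the pod-spec field involved is terminationGracePeriodSeconds; the value below is the Kubernetes default, shown only as an assumption.

```
spec:
  terminationGracePeriodSeconds: 30   # time allowed for in-flight work to drain on eviction
```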

Failure Point: Pod/Application failure
Recovery - Kubernetes Probes

Kubernetes provides probes, that is, health checks, to monitor and act on the state of pods (containers) and to ensure that only healthy pods serve traffic. With the help of probes, we can control when a pod should be deemed started, ready for service, or live to serve traffic. Kubernetes offers three types of health check probes, illustrated in the sketch after this list:
  • Liveness probes let Kubernetes know whether the application is running.
  • Readiness probes let Kubernetes know when the application is ready to serve traffic.
  • Startup probes let Kubernetes know whether the application has started properly.
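
A minimal container fragment configuring all three probe types might look as follows; the endpoints, port, and timings are hypothetical, not the actual policy service settings.

```
containers:
  - name: sm-service                 # hypothetical service container
    image: example/sm-service:1.0
    startupProbe:                    # gates the other probes until startup completes
      httpGet:
        path: /started
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    readinessProbe:                  # controls whether the pod receives traffic
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
    livenessProbe:                   # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
```
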
Failure Point: High traffic Rate
Recovery - Horizontal Pod Auto-Scaling (HPA)

When traffic increases or drops, Kubernetes can automatically increase or decrease the number of pod replicas serving it. Horizontal scaling means that the response to increased load is to deploy more pods. If the load decreases and the number of pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource to scale back down.
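
As a sketch, an HPA scaling a hypothetical Deployment on average CPU utilization could be declared as follows; the replica bounds and target are illustrative (Table A-1 lists the per-service HPA settings).

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sm-service-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sm-service              # hypothetical target workload
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # scale out above 80% average CPU
```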

Failure Point: Any (including intra/inter-NF communication failures)
Recovery - All policy services support metrics/logs to capture the behavior.

Policy Microservices Resilience details

The criticality of a service failure is indicated as HIGH, MEDIUM, or LOW, meaning:
  • HIGH: the service failure impacts the traffic, and the traffic cannot be handled successfully.
  • MEDIUM: the service failure impacts the traffic, but the traffic can still be handled by the default processing model.
  • LOW: the service failure does not impact the traffic directly.

Note:

The performance and capacity of the Policy system may vary based on the call model, feature/interface configuration, and the underlying CNE and hardware environment, including but not limited to the complexity of deployed policies, policy table size, and the use of object expressions and custom JSON in policy design.

Table A-1 Policy Kubernetes Cluster Resiliency Details

The table is presented as one entry per service. Each entry lists the resiliency attributes (Multi-pod, Affinity/Anti-affinity rule, HPA, PDB, PTS, Node Selector, Serviceability status tracking¹, Criticality), followed by the impact of service loss/failure, overload control/protection, and dependent service tracking and reporting. A hyphen (-) marks a cell that is empty in the source table.

Alternate Route Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/80% | PDB: 1 | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: When DNS-SRV based alternate routing is enabled, this service:
    • Handles subsequent messages on failure of the initial producer.
    • Handles notifications on failure of the consumer.
    • Performs SRV-based lookup for NRF and SCP.
  Overload control/protection: N
  Dependent service tracking and reporting: N
AM Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/30% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: The loss of this service leads to:
    • AM call failures for a site.
    • The NF being marked as de-registered at NRF.
  Overload control/protection: N
  Dependent service tracking and reporting: N
App-info
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 3%/80% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: This service tracks the status of all services and cnDbTier. It is used by services such as:
    • Nrf-client, for NF registration: on App-info pod failure, Nrf-client uses the last known state fetched from App-info. However, if the Nrf-client pod also restarts, the cached data is lost and the NF service will be suspended at NRF.
    • Diameter-gateway, to track the readiness status of cnDbTier: on App-info pod failure, Diameter-gateway uses the last known state fetched from App-info. However, if the Diameter-gateway pod also restarts, it will fail to detect DB availability and will not be able to accept signaling traffic.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Audit Service
  Multi-pod: N | Affinity/Anti-affinity rule: Y | HPA: 1%/60% | PDB: 50% | PTS: - | Node Selector: N | Serviceability status tracking¹: Y | Criticality: LOW
  Impact of service loss/failure: This service handles stale session cleanup and the retry of binding create operations. Loss of this service leads to a large number of stale records and failure of binding retry sessions.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Binding Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/60% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: This service is responsible for creating bindings with BSF. Failure of this service means failure of N5/Rx flows.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Bulwark Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/60% | PDB: 0 | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: MEDIUM
  Impact of service loss/failure: This service provides concurrency handling across the various policy interfaces. Failure of this service means there can be concurrency issues when processing requests for the same subscriber over the same or multiple interfaces.
  Overload control/protection: N
  Dependent service tracking and reporting: N
CHF-Connector
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/50% | PDB: 20% | PTS: N | Node Selector: - | Serviceability status tracking¹: Y | Criticality: MEDIUM
  Impact of service loss/failure: Failure of this service means:
    • The spending limit flow with CHF is impacted, so spending counter based policies cannot be enforced.
    • However, SM sessions are created or updated without spending limit data.
  Overload control/protection: N
  Dependent service tracking and reporting: N
CM Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: -- (fixed set of replicas) | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: This service provides the user interface for making configuration changes for policy (including common service configuration, for example, Ingress gateway, Egress gateway, and Nrf-client). If this service fails, other services cannot fetch configuration data. Common service pods can continue to run with their existing configurations, but they are impacted on pod restart.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Config Server
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 7%/80% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: This service is responsible for delivering configuration changes to other services. Other services continue to work with their existing configuration data, but a container restart or pod scaling leads to readiness failure, as pods cannot accept traffic without fresh configuration.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Diameter Connector
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: <unknown>/40% | PDB: 20% | PTS: N | Node Selector: - | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: -
  Overload control/protection: N
  Dependent service tracking and reporting: N
Diameter Gateway
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: -- (fixed set of replicas) | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: The failure of this service impacts all Diameter related traffic in a site.
    • PCF mode: BSF/AF can perform alternate routing due to the connectivity failure.
    • PCRF mode: PCEF and CSCF can perform alternate routing to select an alternate site.
    • Egress flows, for example, RAR over Gx/Rx, are impacted.
  Overload control/protection: Y. Enforces overload control for backend services.
  Dependent service tracking and reporting: DB status is tracked (through App-info) using Helm configuration, to determine the readiness status of the gateway pod.
Egress Gateway
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/80% | PDB: 1 | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: The loss of this service means:
    • All Egress gateway flows over HTTP (that is, UDR, CHF, BSF, SMF notifications, AMF notifications, and NRF management and discovery flows) are impacted.
    • NRF marks the site as "SUSPENDED" due to loss of heartbeat (HB).
  Overload control/protection: N
  Dependent service tracking and reporting: N
Ingress Gateway
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/80% | PDB: 1 | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: The loss of this service impacts connectivity from SCP, NRF, or other peers for ingress flows, and therefore indicates site failure to peers at the network/transport level.
    Note: There shall be no signaling failure if consumers perform alternate routing to an alternate site to achieve session continuity.
  Overload control/protection: Y.
    • Enforces overload control for backend services.
    • Supports rate limiting at the backend as well as at the IGW.
  Dependent service tracking and reporting: N
LDAP Gateway
  Multi-pod: - | Affinity/Anti-affinity rule: - | HPA: 0%/60% | PDB: 20% | PTS: - | Node Selector: Y | Serviceability status tracking¹: - | Criticality: -
  Impact of service loss/failure: -
  Overload control/protection: -
  Dependent service tracking and reporting: -
Notifier Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/60% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: LOW
  Impact of service loss/failure: This service is responsible for custom notifications. Since PRE-to-notifier communication is fire-and-forget, the loss of such notifications does not cause any functionality loss. There is no impact on 3GPP signaling.
  Overload control/protection: N
  Dependent service tracking and reporting: N
NRF Client NF Discovery
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/80% | PDB: 25% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: MEDIUM
  Impact of service loss/failure: The loss of this service means:
    • On-demand discovery procedures are directly impacted.
    • Signaling flows are impacted. Policy performs on-demand discovery of UDR, so a failure leads to missing subscriber profile information.
    Note: Based on configuration, SM/AM/UE services may still accept service requests.
  Overload control/protection: N
  Dependent service tracking and reporting: N
NRF Client NF Management
  Multi-pod: Y (Policy needs to enable multi-pod support; currently set to 1) | Affinity/Anti-affinity rule: Y | HPA: 0%/80% | PDB: NA | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: This service is responsible for NF profile registration with NRF. It also performs the NRF heartbeat (HB) and NRF health check functionality. Loss of this service for the HB timer interval means that NRF can mark the given PCF instance as SUSPENDED. As soon as the Nrf-mgmt pod becomes available again, it automatically refreshes the NF's profile at NRF and brings the site back to the REGISTERED state (if the NRF state was SUSPENDED).
  Overload control/protection: N
  Dependent service tracking and reporting: N
PCRF-Core
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/40% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: This service is responsible for all 4G PCRF flows, so its failure impacts all PCRF flows. Diameter peers can detect the error response from Diameter-gateway and retry those sessions at an alternate site.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Perf-Info
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: -- (fixed set of replicas) | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: MEDIUM
  Impact of service loss/failure: This service is responsible for calculating load and overload levels. It is used by services such as:
    • Nrf-client, for load reporting: when the Perf-Info pod is down, Nrf-client uses the last known load level fetched from Perf-Info. However, if the Nrf-client pod also restarts, it loses its cached information and reports the load level as zero.
    • Ingress gateway/Diameter-gateway, to track the overload status of backend services: when the Perf-Info pod is down, these services use the last known state reported by Perf-Info.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Policy Data Source (PDS)
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/60% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: MEDIUM
  Impact of service loss/failure: This service is responsible for UDR, LDAP, SOAP, and similar communication, and for caching subscriber/context information. Failure of this service means failure to handle subscriber data, which is crucial for policy signaling. Each core service (for example, SM/AM/UE) can still handle service requests gracefully even with missing data.
  Overload control/protection: N
  Dependent service tracking and reporting: N
PRE (Policy Run Time)
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/80% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: MEDIUM
  Impact of service loss/failure: This service is responsible for policy evaluation. Without it, core services run their default policy evaluation, so operator-defined policies are not applied.
  Overload control/protection: N
  Dependent service tracking and reporting: N
PRE-Test
  Multi-pod: N | Affinity/Anti-affinity rule: Y | HPA: -- | PDB: -- | PTS: N | Node Selector: - | Serviceability status tracking¹: Y | Criticality: LOW
  Impact of service loss/failure: Test service for test projects.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Query Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/80% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: LOW
  Impact of service loss/failure: The loss of this service means the operator cannot use the query/session viewer functionality. High availability is provided for better serviceability.
  Overload control/protection: N
  Dependent service tracking and reporting: N
SM Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/50% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: This service is responsible for handling N7 and N5 requests. On loss of this service, PCF cannot handle signaling traffic from SMF.
  Overload control/protection: N
  Dependent service tracking and reporting: N
SOAP Connector
  Multi-pod: - | Affinity/Anti-affinity rule: - | HPA: 0%/60% | PDB: 20% | PTS: - | Node Selector: Y | Serviceability status tracking¹: - | Criticality: -
  Impact of service loss/failure: -
  Overload control/protection: -
  Dependent service tracking and reporting: -
UDR Connector
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/50% | PDB: 20% | PTS: N | Node Selector: - | Serviceability status tracking¹: Y | Criticality: MEDIUM
  Impact of service loss/failure: This is a critical service for the signaling flow with UDR, so its loss impacts signaling traffic. However, using the configurations in the core services, sessions can be processed without subscriber profile data.
  Overload control/protection: N
  Dependent service tracking and reporting: N
UE Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/30% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: HIGH
  Impact of service loss/failure: This service is responsible for handling the UE policy flow. On failure of this service, PCF cannot handle the AMF flow for UE policies.
  Overload control/protection: N
  Dependent service tracking and reporting: N
Usage Monitoring Service
  Multi-pod: Y | Affinity/Anti-affinity rule: Y | HPA: 0%/80% | PDB: 20% | PTS: N | Node Selector: Y | Serviceability status tracking¹: Y | Criticality: MEDIUM
  Impact of service loss/failure: This service is responsible for usage monitoring and grant related functions. Failure of this service impacts the usage monitoring functionality; however, SM/PCRF sessions continue to function.
  Overload control/protection: N
  Dependent service tracking and reporting: N
  1. Service status tracking model:
    • AppInfo monitors the state of the policy services, and the state is published through the appinfo_service_running metric.
    • The alert "POLICY_SERVICES_DOWN" is raised if a service is down (that is, 0 running pods for that service).
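
As an illustration of how such an alert could be expressed, a hedged Prometheus alerting rule sketch based on the appinfo_service_running metric is shown below; the exact expression, labels, and thresholds shipped with the product may differ.

```
groups:
  - name: policy-service-status
    rules:
      - alert: POLICY_SERVICES_DOWN
        # Assumes the metric reports the number of running pods per service,
        # and that a "service" label exists; both are assumptions for this sketch.
        expr: appinfo_service_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Policy service {{ $labels.service }} has no running pods"
```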