Configuring Service Communication Proxy Alerts Using the SCPAlertrules.yaml File

Note:

The default namespace for Service Communication Proxy is scpsvc. Update the namespace in the alert expressions to match your deployment.

The following is a sample YAML file.

apiVersion: v1
data:
  alertsscp: |
    groups:
    - name: SCPAlerts
      rules:
         # Alerts for SCP Ingress Traffic Rate; uses the namespace where SCP is deployed
      - alert: SCPIngressTrafficRateAboveMinorThreshold
        annotations:
          description: 'Ingress Traffic Rate at Locality: {{$labels.ocscp_locality}} is above minor threshold (i.e. 1400 mps)'
          summary: 'Namespace: {{$labels.kubernetes_namespace}}, Pod: {{$labels.kubernetes_pod_name}}: Current Ingress Traffic Rate is {{ $value | printf "%.2f" }} mps which is above 70 Percent of Max MPS(2000)'
         # Provide app and kubernetes_namespace of scp deployed
        expr: sum(rate(ocscp_metric_total_http_rx_req{app="scp-worker",kubernetes_namespace="scpsvc"}[2m])) by (kubernetes_namespace,ocscp_locality,kubernetes_pod_name) >= 1400 < 1600
        labels:
          severity: Minor
      - alert: SCPIngressTrafficRateAboveMajorThreshold
        annotations:
          description: 'Ingress Traffic Rate at Locality: {{$labels.ocscp_locality}} is above major threshold (i.e. 1600 mps)'
          summary: 'Namespace: {{$labels.kubernetes_namespace}}, Pod: {{$labels.kubernetes_pod_name}}: Current Ingress Traffic Rate is {{ $value | printf "%.2f" }} mps which is above 80 Percent of Max MPS(2000)'
         # Provide app and kubernetes_namespace of scp deployed
        expr: sum(rate(ocscp_metric_total_http_rx_req{app="scp-worker",kubernetes_namespace="scpsvc"}[2m])) by (kubernetes_namespace,ocscp_locality,kubernetes_pod_name)  >= 1600 < 1800
        labels:
          severity: Major
      - alert: SCPIngressTrafficRateAboveCriticalThreshold
        annotations:
          description: 'Ingress Traffic Rate at Locality: {{$labels.ocscp_locality}} is above critical threshold (i.e. 1800 mps)'
          summary: 'Namespace: {{$labels.kubernetes_namespace}}, Pod: {{$labels.kubernetes_pod_name}}: Current Ingress Traffic Rate is {{ $value | printf "%.2f" }} mps which is above 90 Percent of Max MPS(2000)'
         # Provide app and kubernetes_namespace of scp deployed
        expr: sum(rate(ocscp_metric_total_http_rx_req{app="scp-worker",kubernetes_namespace="scpsvc"}[2m])) by (kubernetes_namespace,ocscp_locality,kubernetes_pod_name) >= 1800
        labels:
          severity: Critical
      - alert: SCPRoutingFailedForService
        annotations:
          description: 'Routing failed for service'
          summary: 'Routing failed for service: NFService Type = "{{$labels.ocscp_nf_service_type}}", NFType = "{{$labels.ocscp_nf_type}}", Locality = "{{$labels.ocscp_locality}}" and value = "{{ $value }}" '
         # Provide app and kubernetes_namespace of scp deployed
        expr: ocscp_metric_total_routing_send_fail{app="scp-worker",kubernetes_namespace="scpsvc"}
        labels:
          severity: Minor
      - alert: SCPSoothsayerPodMemoryUsage
         # Provide the namespace where SCP is deployed and a regex that matches a substring of the pod name
        expr: sum(container_memory_usage_bytes{image!="",namespace="scpsvc",pod_name=~".+soothsayer.+"})  by (pod_name, namespace, instance) > 8589934592
        for: 2m
        labels:
          severity: Warning
        annotations:
          summary: "Instance: {{$labels.instance}}, NameSpace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Soothsayer Pod High Memory usage detected"
          description: "Instance: {{$labels.instance}}, Namespace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Memory usage is above 8G (current value is: {{ $value }})"
      - alert: SCPWorkerPodMemoryUsage
         # Provide the namespace where SCP is deployed and a regex that matches a substring of the pod name
        expr: sum(container_memory_usage_bytes{image!="",namespace="scpsvc",pod_name=~".+worker.+"})  by (pod_name, namespace, instance) > 4294967296
        for: 2m
        labels:
          severity: Warning
        annotations:
          summary: "Instance: {{$labels.instance}}, NameSpace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Worker Pod High Memory usage detected"
          description: "Instance: {{$labels.instance}}, Namespace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Memory usage is above 4G (current value is: {{ $value }})"
      - alert: SCPPilotPodMemoryUsage
         # Provide the namespace where SCP is deployed and a regex that matches a substring of the pod name
        expr: sum(container_memory_usage_bytes{image!="",namespace="scpsvc",pod_name=~".+pilot.+"})  by (pod_name, namespace, instance)  > 6442450944
        for: 2m
        labels:
          severity: Warning
        annotations:
          summary: "Instance: {{$labels.instance}}, NameSpace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Pilot Pod High Memory usage detected"
          description: "Instance: {{$labels.instance}}, Namespace: {{$labels.namespace}},Pod: {{$labels.pod_name}}: Memory usage is above 6G (current value is: {{ $value }})"
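
As a minimal sketch of adapting the sample to another deployment, the fragment below shows two of the rules with the namespace changed. The namespace scp-prod is a hypothetical placeholder, not a value from this guide, and the annotations are omitted only to keep the fragment short. Note that the traffic rules select on the kubernetes_namespace label while the memory rules select on the namespace label, so both selectors need the new value.

```yaml
# Entries for the rules list of the SCPAlerts group.
# "scp-prod" is a hypothetical namespace that replaces the default "scpsvc".
- alert: SCPIngressTrafficRateAboveMinorThreshold
  # Traffic metrics are selected with the kubernetes_namespace label.
  expr: sum(rate(ocscp_metric_total_http_rx_req{app="scp-worker",kubernetes_namespace="scp-prod"}[2m])) by (kubernetes_namespace,ocscp_locality,kubernetes_pod_name) >= 1400 < 1600
  labels:
    severity: Minor
- alert: SCPWorkerPodMemoryUsage
  # Container memory metrics are selected with the namespace label.
  expr: sum(container_memory_usage_bytes{image!="",namespace="scp-prod",pod_name=~".+worker.+"}) by (pod_name, namespace, instance) > 4294967296
  for: 2m
  labels:
    severity: Warning
```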

Alert Details:

Description and Summary are added by Prometheus Alertmanager.

Alerts are supported for three conditions: ingress traffic rate crossing a threshold, routing failure for a service, and pod memory usage crossing a threshold.

SCP Ingress Traffic Rate Above Threshold
- Has three threshold levels: Minor (1400 mps up to, but not including, 1600 mps), Major (1600 mps up to, but not including, 1800 mps), and Critical (1800 mps and above). These values are configurable.
- Description: "Ingress Traffic Rate at Locality: <Locality of SCP> is above <threshold level (minor/major/critical)> threshold (i.e. <value of threshold>)"
- Summary: "Namespace: <Namespace of the SCP deployment at that Locality>, Pod: <SCP-worker Pod name>: Current Ingress Traffic Rate is <Current rate of Ingress traffic> mps which is above <percentage for the threshold level> Percent of Max MPS(<upper limit of ingress traffic rate per pod>)"
  Note:
  The ingress traffic rate is measured per scp-worker pod in a namespace at a particular SCP locality. Currently, the upper limit is 2000 mps per scp-worker pod.
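
Because the threshold value appears both in the expression and in the annotation text, a change to one level has to be made in both places. The sketch below assumes, purely for illustration, that the minor threshold is lowered from 1400 mps to 1200 mps, which is 60 percent of the 2000 mps upper limit; the value 1200 is a hypothetical choice, not a recommendation from this guide.

```yaml
# Entry for the rules list of the SCPAlerts group.
# Hypothetical retuning: minor threshold lowered to 1200 mps (60% of 2000 mps).
- alert: SCPIngressTrafficRateAboveMinorThreshold
  annotations:
    description: 'Ingress Traffic Rate at Locality: {{$labels.ocscp_locality}} is above minor threshold (i.e. 1200 mps)'
    summary: 'Namespace: {{$labels.kubernetes_namespace}}, Pod: {{$labels.kubernetes_pod_name}}: Current Ingress Traffic Rate is {{ $value | printf "%.2f" }} mps which is above 60 Percent of Max MPS(2000)'
  # The lower bound changes to 1200; the upper bound stays at the major threshold (1600).
  expr: sum(rate(ocscp_metric_total_http_rx_req{app="scp-worker",kubernetes_namespace="scpsvc"}[2m])) by (kubernetes_namespace,ocscp_locality,kubernetes_pod_name) >= 1200 < 1600
  labels:
    severity: Minor
```

Keeping the upper bound of one level equal to the lower bound of the next ensures that only one of the three alerts fires for a given traffic rate.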
SCP Routing Failed For Service
- Alerts when routing fails, and identifies the NF Service Type and NF Type at a particular locality for which routing failed.
- Description: "Routing failed for service"
- Summary: "Routing failed for service: NFService Type = <Message NF Service Type>, NFType = <Message NF Type>, Locality = <SCP Locality where Routing Failed> and value = <Accumulated failures so far for messages of that NFType and NFService Type>"
  Note:
  The value field currently does not provide the number of failures in a particular time interval; instead, it provides the total number of routing failures.
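
If a per-interval count is needed, one possible approach is to wrap the metric in the PromQL increase() function. The sketch below is not part of the shipped rules file: the alert name SCPRoutingFailedForServiceRecent and the 5m window are hypothetical, and it assumes that ocscp_metric_total_routing_send_fail is a monotonically increasing counter, as the "total number of routing failures" wording suggests.

```yaml
# Entry for the rules list of the SCPAlerts group.
# Hypothetical variant: fires only when new routing failures occurred in the last 5 minutes.
- alert: SCPRoutingFailedForServiceRecent
  annotations:
    description: 'Routing failed for service in the last 5 minutes'
    summary: 'Routing failed for service: NFService Type = "{{$labels.ocscp_nf_service_type}}", NFType = "{{$labels.ocscp_nf_type}}", Locality = "{{$labels.ocscp_locality}}" and approximate failures in the last 5 minutes = "{{ $value | printf "%.0f" }}"'
  # increase() estimates how much the counter grew over the window; the result may be fractional.
  expr: increase(ocscp_metric_total_routing_send_fail{app="scp-worker",kubernetes_namespace="scpsvc"}[5m]) > 0
  labels:
    severity: Minor
```

With this form the alert resolves once no further failures are recorded within the window, whereas a rule on the raw counter keeps reporting the accumulated total.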
SCP Pod Memory Usage: three types of alerts, namely SCPSoothsayerPodMemoryUsage, SCPWorkerPodMemoryUsage, and SCPPilotPodMemoryUsage
- Reports pod memory usage for the SCP pods (Soothsayer, Worker, and Pilot) deployed at a particular node instance.
- The Soothsayer pod threshold is 8 GB.
- The Worker pod threshold is 4 GB.
- The Pilot pod threshold is 6 GB.
- Summary: "Instance: <Node Instance name>, NameSpace: <Namespace of SCP deployment>, Pod: <(Soothsayer/Worker/Pilot) Pod name>: <Soothsayer/Worker/Pilot> Pod High Memory usage detected"
- Description: "Instance: <Node Instance name>, Namespace: <Namespace of SCP deployment>, Pod: <(Soothsayer/Worker/Pilot) Pod name>: Memory usage is above <threshold value>G (current value is: <current value of memory usage>)"
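
The byte values in the sample expressions are these thresholds expressed in powers of 1024: 8 x 1024 x 1024 x 1024 = 8589934592, 4 x 1024 x 1024 x 1024 = 4294967296, and 6 x 1024 x 1024 x 1024 = 6442450944. As a readability sketch only, the Soothsayer rule could spell out that arithmetic in the expression and format the current value with the Prometheus humanize1024 template function; this rewording is an illustration and not the shipped rule.

```yaml
# Entry for the rules list of the SCPAlerts group.
# Illustrative variant of the Soothsayer memory rule with the threshold written as arithmetic.
- alert: SCPSoothsayerPodMemoryUsage
  # 8 * 1024 * 1024 * 1024 = 8589934592 bytes; multiplication binds tighter than the comparison.
  expr: sum(container_memory_usage_bytes{image!="",namespace="scpsvc",pod_name=~".+soothsayer.+"}) by (pod_name, namespace, instance) > 8 * 1024 * 1024 * 1024
  for: 2m
  labels:
    severity: Warning
  annotations:
    summary: "Instance: {{$labels.instance}}, NameSpace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Soothsayer Pod High Memory usage detected"
    # humanize1024 renders the raw byte count with IEC prefixes instead of a long integer.
    description: "Instance: {{$labels.instance}}, Namespace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Memory usage is above 8G (current value is: {{ $value | humanize1024 }})"
```

The same pattern applies to the Worker and Pilot rules with factors of 4 and 6 respectively.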