Configuring Service Communication Proxy Alerts using the SCPAlertrules.yaml file
Note:
The default namespace for Service Communication Proxy is scpsvc. You can update the namespace to match your deployment. The following is a sample YAML file.
apiVersion: v1
data:
  alertsscp: |
    groups:
    - name: SCPAlerts
      rules:
      # Alerts for SCP Ingress Traffic Rate; uses the namespace in which scp is deployed
      - alert: SCPIngressTrafficRateAboveMinorThreshold
        annotations:
          description: 'Ingress Traffic Rate at Locality: "{{$labels.ocscp_locality}}" is above minor threshold (i.e. 1400 mps)'
          summary: 'Namespace: {{$labels.kubernetes_namespace}}, Pod: {{$labels.kubernetes_pod_name}}: Current Ingress Traffic Rate is {{ $value | printf "%.2f" }} mps which is above 70 Percent of Max MPS(2000)'
        # Provide app and kubernetes_namespace of scp deployed
        expr: sum(rate(ocscp_metric_total_http_rx_req{app="scp-worker",kubernetes_namespace="scpsvc"}[2m])) by (kubernetes_namespace,ocscp_locality,kubernetes_pod_name) >= 1400 < 1600
        labels:
          severity: Minor
      - alert: SCPIngressTrafficRateAboveMajorThreshold
        annotations:
          description: 'Ingress Traffic Rate at Locality: {{$labels.ocscp_locality}} is above major threshold (i.e. 1600 mps)'
          summary: 'Namespace: {{$labels.kubernetes_namespace}}, Pod: {{$labels.kubernetes_pod_name}}: Current Ingress Traffic Rate is {{ $value | printf "%.2f" }} mps which is above 80 Percent of Max MPS(2000)'
        # Provide app and kubernetes_namespace of scp deployed
        expr: sum(rate(ocscp_metric_total_http_rx_req{app="scp-worker",kubernetes_namespace="scpsvc"}[2m])) by (kubernetes_namespace,ocscp_locality,kubernetes_pod_name) >= 1600 < 1800
        labels:
          severity: Major
      - alert: SCPIngressTrafficRateAboveCriticalThreshold
        annotations:
          description: 'Ingress Traffic Rate at Locality: {{$labels.ocscp_locality}} is above critical threshold (i.e. 1800 mps)'
          summary: 'Namespace: {{$labels.kubernetes_namespace}}, Pod: {{$labels.kubernetes_pod_name}}: Current Ingress Traffic Rate is {{ $value | printf "%.2f" }} mps which is above 90 Percent of Max MPS(2000)'
        # Provide app and kubernetes_namespace of scp deployed
        expr: sum(rate(ocscp_metric_total_http_rx_req{app="scp-worker",kubernetes_namespace="scpsvc"}[2m])) by (kubernetes_namespace,ocscp_locality,kubernetes_pod_name) >= 1800
        labels:
          severity: Critical
      - alert: SCPRoutingFailedForService
        annotations:
          description: 'Routing failed for service'
          summary: 'Routing failed for service: NFService Type = "{{$labels.ocscp_nf_service_type}}", NFType = "{{$labels.ocscp_nf_type}}", Locality = "{{$labels.ocscp_locality}}" and value = "{{ $value }}"'
        # Provide app and kubernetes_namespace of scp deployed
        expr: ocscp_metric_total_routing_send_fail{app="scp-worker",kubernetes_namespace="scpsvc"}
        labels:
          severity: Minor
      - alert: SCPSoothsayerPodMemoryUsage
        # Provide kubernetes_namespace of scp deployed and a pod name substring for the regex match on pod name
        expr: sum(container_memory_usage_bytes{image!="",namespace="scpsvc",pod_name=~".+soothsayer.+"}) by (pod_name, namespace, instance) > 8589934592
        for: 2m
        labels:
          severity: Warning
        annotations:
          summary: "Instance: {{$labels.instance}}, NameSpace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Soothsayer Pod High Memory usage detected"
          description: "Instance: {{$labels.instance}}, Namespace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Memory usage is above 8G (current value is: {{ $value }})"
      - alert: SCPWorkerPodMemoryUsage
        # Provide kubernetes_namespace of scp deployed and a pod name substring for the regex match on pod name
        expr: sum(container_memory_usage_bytes{image!="",namespace="scpsvc",pod_name=~".+worker.+"}) by (pod_name, namespace, instance) > 4294967296
        for: 2m
        labels:
          severity: Warning
        annotations:
          summary: "Instance: {{$labels.instance}}, NameSpace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Worker Pod High Memory usage detected"
          description: "Instance: {{$labels.instance}}, Namespace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Memory usage is above 4G (current value is: {{ $value }})"
      - alert: SCPPilotPodMemoryUsage
        # Provide kubernetes_namespace of scp deployed and a pod name substring for the regex match on pod name
        expr: sum(container_memory_usage_bytes{image!="",namespace="scpsvc",pod_name=~".+pilot.+"}) by (pod_name, namespace, instance) > 6442450944
        for: 2m
        labels:
          severity: Warning
        annotations:
          summary: "Instance: {{$labels.instance}}, NameSpace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Pilot Pod High Memory usage detected"
          description: "Instance: {{$labels.instance}}, Namespace: {{$labels.namespace}}, Pod: {{$labels.pod_name}}: Memory usage is above 6G (current value is: {{ $value }})"
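Because the namespace appears in every expression of the sample (as kubernetes_namespace="scpsvc" or namespace="scpsvc"), updating it by hand is easy to get wrong. The following is a minimal Python sketch of one way to substitute it in the rules text. The helper name set_namespace and the target namespace scp-prod are hypothetical and used only for illustration; this is not part of the SCP tooling.

```python
import re


def set_namespace(rules_text: str, new_namespace: str,
                  old_namespace: str = "scpsvc") -> str:
    """Replace the namespace label value in every selector of the rules text."""
    # The pattern matches both kubernetes_namespace="scpsvc" and namespace="scpsvc".
    pattern = re.compile(r'(namespace=")' + re.escape(old_namespace) + r'(")')
    # A function replacement avoids backreference surprises in new_namespace.
    return pattern.sub(lambda m: m.group(1) + new_namespace + m.group(2), rules_text)


# Two expressions taken from the sample file, shortened to the selector part.
sample = (
    'expr: ocscp_metric_total_routing_send_fail{app="scp-worker",kubernetes_namespace="scpsvc"}\n'
    'expr: sum(container_memory_usage_bytes{image!="",namespace="scpsvc",pod_name=~".+worker.+"})'
)

updated = set_namespace(sample, "scp-prod")  # "scp-prod" is a made-up namespace
print(updated)
```

Review the result before loading it, since any other text of the form namespace="scpsvc" in the file would also be rewritten.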
Alert Details:
Description and Summary are added by Prometheus Alertmanager.
Alerts are supported for three conditions in which a resource or routing measurement crosses its threshold:
- SCP Ingress Traffic Rate Above Threshold
- Has three threshold levels: Minor (from 1400 mps up to 1600 mps), Major (from 1600 mps up to 1800 mps), and Critical (1800 mps and above). These values are configurable.
- Description: "Ingress Traffic Rate at Locality: <Locality of SCP> is above <threshold level (minor/major/critical)> threshold (i.e. <value of threshold>)"
- Summary: "Namespace: <Namespace of the SCP deployment at that Locality>, Pod: <scp-worker Pod name>: Current Ingress Traffic Rate is <current rate of ingress traffic> mps which is above <threshold percentage> Percent of Max MPS(<upper limit of ingress traffic rate per pod>)"
Note:
The ingress traffic rate is measured per scp-worker pod in a namespace at a particular SCP locality. Currently, 2000 mps is the upper limit per scp-worker pod.
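The threshold values in the sample expressions (1400, 1600, and 1800 mps) are fixed percentages of the 2000 mps per-pod limit. The short Python sketch below recomputes them and classifies a rate the same way the expressions do. The names MAX_MPS, LEVELS, threshold_mps, and severity are illustrative assumptions, not SCP identifiers.

```python
MAX_MPS = 2000  # upper limit per scp-worker pod, as stated in the note above

# Percentages of MAX_MPS implied by the sample thresholds (1800, 1600, 1400 mps),
# listed from highest to lowest so the first match is the most severe level.
LEVELS = [("Critical", 90), ("Major", 80), ("Minor", 70)]


def threshold_mps(percent: int, max_mps: int = MAX_MPS) -> int:
    """Return the threshold in mps for a given percentage of the per-pod limit."""
    return max_mps * percent // 100


def severity(rate_mps: float, max_mps: int = MAX_MPS):
    """Return the highest level whose threshold the rate meets, or None."""
    for name, percent in LEVELS:
        if rate_mps >= threshold_mps(percent, max_mps):
            return name
    return None


print([threshold_mps(p) for _, p in LEVELS])  # [1800, 1600, 1400]
print(severity(1650.0))  # Major
```

If the per-pod limit changes in your deployment, the same percentages give the values to place in the three expressions.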
- SCP Routing Failed For Service
- Identifies the NF Service Type and NF Type for which routing failed at a particular locality.
- Description: "Routing failed for service"
- Summary: "Routing failed for service: NFService Type = <Message NF Service Type>, NFType = <Message NF Type>, Locality = <SCP Locality where routing failed> and value = <accumulated number of failures so far for messages of this NFType and NFService Type>"
Note:
The value field currently does not provide the number of failures in a particular time interval; instead, it provides the total number of routing failures.
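Since the value is a running total, the number of failures in an interval has to be derived from the difference between samples. The Python sketch below shows that calculation on made-up sample values; the function name failures_in_window is hypothetical. It also treats a drop in the counter as a restart from zero, which is an assumption about how the counter behaves when a pod restarts. PromQL provides the increase() function for the same purpose, should you prefer to do this inside the alert expression.

```python
def failures_in_window(samples):
    """Sum the increases between consecutive cumulative counter samples.

    samples: cumulative counter values, oldest first.
    A drop between two samples is treated as a counter reset (for example
    after a pod restart), so the new value is counted from zero.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted from zero
    return total


# Illustrative values only, not measurements from a real deployment.
print(failures_in_window([120, 120, 125, 131]))  # 11.0
print(failures_in_window([120, 125, 3, 7]))      # 12.0
```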
- SCP Pod Memory Usage
- There are three types of alerts: SCPSoothsayerPodMemoryUsage, SCPWorkerPodMemoryUsage, and SCPPilotPodMemoryUsage.
- Pod memory usage for SCP Pods (Soothsayer, Worker and Pilot) deployed at a particular node instance is provided.
- The Soothsayer pod threshold is 8 GB
- The Worker pod threshold is 4 GB
- The Pilot pod threshold is 6 GB
- Summary: "Instance: <Node Instance name>, NameSpace: <Namespace of SCP deployment>, Pod: <(Soothsayer/Worker/Pilot) Pod name>: <Soothsayer/Worker/Pilot> Pod High Memory usage detected"
- Description: "Instance: <Node Instance name>, Namespace: <Namespace of SCP deployment>, Pod: <(Soothsayer/Worker/Pilot) Pod name>: Memory usage is above <threshold value>G (current value is: <current value of memory usage>)"
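The memory thresholds in the sample expressions are written in bytes, and each one equals the stated number of gigabytes multiplied by 1024 x 1024 x 1024. The Python sketch below checks that relationship and shows how a different threshold could be converted. The names GIB, THRESHOLD_BYTES, and gib_to_bytes are illustrative assumptions.

```python
GIB = 1024 ** 3  # bytes in one binary gigabyte (1073741824)

# Thresholds copied from the sample expressions, in bytes.
THRESHOLD_BYTES = {
    "soothsayer": 8589934592,
    "worker": 4294967296,
    "pilot": 6442450944,
}


def gib_to_bytes(gib: int) -> int:
    """Convert a threshold in binary gigabytes to the byte value used in expr."""
    return gib * GIB


for pod, limit in THRESHOLD_BYTES.items():
    # Integer division is exact here because each limit is a multiple of GIB.
    print(pod, limit // GIB, "GB")

print(gib_to_bytes(8))  # 8589934592
```

To use a different threshold, convert the new value the same way and replace the number in the corresponding expression.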