6 Configuring Alerts

To configure Provisioning Gateway alerts on the Prometheus server:

Note:

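In the following procedure, _NAME_ is the Helm chart release name and _Namespace_ is the Prometheus namespace. The commands use the config map name occne-prometheus-server and the namespace occne-infra; adjust them to match your deployment.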
  1. Run the following command to back up the current Prometheus config map.

    kubectl get configmaps occne-prometheus-server -o yaml -n occne-infra > /tmp/tempConfig.yaml

  2. Remove any existing Provisioning Gateway alert file entry from the saved config map, then add the alert file name under rule_files:

    sed -i '/etc\/config\/alertsprovgw/d' /tmp/tempConfig.yaml
    sed -i '/rule_files:/a\    \- /etc/config/alertsprovgw' /tmp/tempConfig.yaml

  3. Run the following command to replace the config map with the updated copy, which now lists the provgw alert file.

    kubectl replace configmap occne-prometheus-server -f /tmp/tempConfig.yaml

  4. Run the following command to add the provgw alert rules to the config map under the name of the provgw alert file.

    kubectl patch configmap occne-prometheus-server -n occne-infra --type merge --patch "$(cat ~/ProvgwAlertrules.yaml)"

    Note:

    The Prometheus server picks up the updated config map and reloads it automatically after a short delay (about 20 seconds).
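
The two sed commands in step 2 are written so that the step can be repeated without duplicating the rule file entry: the first deletes any existing alertsprovgw line and the second appends it again after rule_files. The following minimal sketch shows this on a scratch file. It assumes GNU sed (for -i without a suffix and the one-line a\ form), and the sample file contents are hypothetical stand-ins for /tmp/tempConfig.yaml.

```shell
# Scratch copy standing in for /tmp/tempConfig.yaml (contents are hypothetical).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules
    - /etc/config/alerts
EOF

# Run the step 2 commands twice to show that repeating them is safe.
for run in 1 2; do
  sed -i '/etc\/config\/alertsprovgw/d' "$tmp"
  sed -i '/rule_files:/a\    \- /etc/config/alertsprovgw' "$tmp"
done

# Count how many times the entry appears after both runs.
count=$(grep -c '/etc/config/alertsprovgw' "$tmp")
echo "alertsprovgw entries: $count"
cat "$tmp"
```

Inspecting the scratch file this way before running kubectl replace is a cheap check that the entry landed under rule_files with the expected indentation.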

Provisioning Gateway Alert Config Details

This section lists the alert configuration contained in the ProvgwAlertrules.yaml file.

Note:

The default namespace of Provisioning Gateway is provgw. Update it to match your deployment.
apiVersion: v1
data:
  alertsprovgw: |
    groups:
    - name: ProvgwAlerts
      rules:
      - alert: ProvgwTrafficRateAboveMinorThreshold
        annotations:
          description: 'Ingress traffic Rate is above minor threshold i.e. 800 requests per second (current value is: {{ $value }})'
          summary: 'Traffic Rate is above 80 Percent of Max requests per second(1000)'
        expr: sum(rate(oc_ingressgateway_http_requests_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) >= 800 < 900
        labels:
          severity: Minor
      - alert: ProvgwTrafficRateAboveMajorThreshold
        annotations:
          description: 'Ingress traffic Rate is above major threshold i.e. 900 requests per second (current value is: {{ $value }})'
          summary: 'Traffic Rate is above 90 Percent of Max requests per second(1000)'
        expr: sum(rate(oc_ingressgateway_http_requests_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) >= 900 < 950
        labels:
          severity: Major
      - alert: ProvgwTrafficRateAboveCriticalThreshold
        annotations:
          description: 'Ingress traffic Rate is above critical threshold i.e. 950 requests per second (current value is: {{ $value }})'
          summary: 'Traffic Rate is above 95 Percent of Max requests per second(1000)'
        expr: sum(rate(oc_ingressgateway_http_requests_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) >= 950
        labels:
          severity: Critical
      - alert: ProvgwTransactionErrorRateAbove0.1Percent
        annotations:
          description: 'Transaction Error rate is above 0.1 Percent of Total Transactions (current value is {{ $value }})'
          summary: 'Transaction Error Rate detected above 0.1 Percent of Total Transactions'
        expr: (sum(rate(oc_ingressgateway_http_responses_total{Status!~"2.*",app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m]) or (up * 0 ) ) )/sum(rate(oc_ingressgateway_http_responses_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) * 100 >= 0.1 < 1
        labels:
          severity: Warning
      - alert: ProvgwTransactionErrorRateAbove1Percent
        annotations:
          description: 'Transaction Error rate is above 1 Percent of Total Transactions (current value is {{ $value }})'
          summary: 'Transaction Error Rate detected above 1 Percent of Total Transactions'
        expr: (sum(rate(oc_ingressgateway_http_responses_total{Status!~"2.*",app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m]) or (up * 0 ) ) )/sum(rate(oc_ingressgateway_http_responses_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) * 100 >= 1 < 10
        labels:
          severity: Warning
      - alert: ProvgwTransactionErrorRateAbove10Percent
        annotations:
          description: 'Transaction Error rate is above 10 Percent of Total Transactions (current value is {{ $value }})'
          summary: 'Transaction Error Rate detected above 10 Percent of Total Transactions'
        expr: (sum(rate(oc_ingressgateway_http_responses_total{Status!~"2.*",app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m]) or (up * 0 ) ) )/sum(rate(oc_ingressgateway_http_responses_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) * 100 >= 10 < 25
        labels:
          severity: Minor
      - alert: ProvgwTransactionErrorRateAbove25Percent
        annotations:
          description: 'Transaction Error Rate detected above 25 Percent of Total Transactions (current value is {{ $value }})'
          summary: 'Transaction Error Rate detected above 25 Percent of Total Transactions'
        expr: (sum(rate(oc_ingressgateway_http_responses_total{Status!~"2.*",app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m]) or (up * 0 ) ) )/sum(rate(oc_ingressgateway_http_responses_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) * 100 >= 25 < 50
        labels:
          severity: Major
      - alert: ProvgwTransactionErrorRateAbove50Percent
        annotations:
          description: 'Transaction Error Rate detected above 50 Percent of Total Transactions (current value is {{ $value }})'
          summary: 'Transaction Error Rate detected above 50 Percent of Total Transactions'
        expr: (sum(rate(oc_ingressgateway_http_responses_total{Status!~"2.*",app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m]) or (up * 0 ) ) )/sum(rate(oc_ingressgateway_http_responses_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) * 100 >= 50
        labels:
          severity: Critical
      - alert: ProvgwTransientErrorAbove1Percent
        annotations:
          description: 'Total number of response if subscriber not found is about 1% of ingress traffic'
          summary: 'Total number of response if subscriber not found is about 1% of ingress traffic'
        expr: (sum(rate(udr_rest_transient_error{kubernetes_namespace="provgw"}[10m]))/sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="provgw"}[10m])))*100 >= 1 < 10
        labels:
          severity: Warning
      - alert: ProvgwTransientErrorAbove10Percent
        annotations:
          description: 'Total number of response if subscriber not found is about 10% of ingress traffic'
          summary: 'Total number of response if subscriber not found is about 10% of ingress traffic'
        expr: (sum(rate(udr_rest_transient_error{kubernetes_namespace="provgw"}[10m]))/sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="provgw"}[10m])))*100 >= 10 < 25
        labels:
          severity: Minor
      - alert: ProvgwTransientErrorAbove25Percent
        annotations:
          description: 'Total number of response if subscriber not found is about 25% of ingress traffic'
          summary: 'Total number of response if subscriber not found is about 25% of ingress traffic'
        expr: (sum(rate(udr_rest_transient_error{kubernetes_namespace="provgw"}[10m]))/sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="provgw"}[10m])))*100 >= 25 < 50
        labels:
          severity: Major
      - alert: ProvgwTransientErrorAbove50Percent
        annotations:
          description: 'Total number of response if subscriber not found is about 50% of ingress traffic'
          summary: 'Total number of response if subscriber not found is about 50% of ingress traffic'
        expr: (sum(rate(udr_rest_transient_error{kubernetes_namespace="provgw"}[10m]))/sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="provgw"}[10m])))*100 >= 50
        labels:
          severity: Critical
      - alert: ProvgwSegmentDownAbove1Percent
        annotations:
          description: 'Service-unavailable responses are at or above 1% of ingress traffic'
          summary: 'Service-unavailable responses are at or above 1% of ingress traffic'
        expr: (sum(rate(udr_rest_service_unavailable{kubernetes_namespace="provgw"}[10m]))/sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="provgw"}[10m])))*100 >= 1 < 10
        labels:
          severity: Warning
      - alert: ProvgwSegmentDownAbove10Percent
        annotations:
          description: 'Service-unavailable responses are at or above 10% of ingress traffic'
          summary: 'Service-unavailable responses are at or above 10% of ingress traffic'
        expr: (sum(rate(udr_rest_service_unavailable{kubernetes_namespace="provgw"}[10m]))/sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="provgw"}[10m])))*100 >= 10 < 25
        labels:
          severity: Minor
      - alert: ProvgwSegmentDownAbove25Percent
        annotations:
          description: 'Service-unavailable responses are at or above 25% of ingress traffic'
          summary: 'Service-unavailable responses are at or above 25% of ingress traffic'
        expr: (sum(rate(udr_rest_service_unavailable{kubernetes_namespace="provgw"}[10m]))/sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="provgw"}[10m])))*100 >= 25 < 50
        labels:
          severity: Major
      - alert: ProvgwSegmentDownAbove50Percent
        annotations:
          description: 'Service-unavailable responses are at or above 50% of ingress traffic'
          summary: 'Service-unavailable responses are at or above 50% of ingress traffic'
        expr: (sum(rate(udr_rest_service_unavailable{kubernetes_namespace="provgw"}[10m]))/sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="provgw"}[10m])))*100 >= 50
        labels:
          severity: Critical
      - alert: ProvgwAuditMismatchAbove1Percent
        annotations:
          description: 'Audit response mismatches are at or above 1% of total audits'
          summary: 'Audit response mismatches are at or above 1% of total audits'
        expr: (sum(rate(provgw_audit_responsemismatch{kubernetes_namespace="provgw"}[10m]))/sum(rate(provgw_audit_total{kubernetes_namespace="provgw"}[10m])))*100 >= 1 < 10
        labels:
          severity: Warning
      - alert: ProvgwAuditMismatchAbove10Percent
        annotations:
          description: 'Audit response mismatches are at or above 10% of total audits'
          summary: 'Audit response mismatches are at or above 10% of total audits'
        expr: (sum(rate(provgw_audit_responsemismatch{kubernetes_namespace="provgw"}[10m]))/sum(rate(provgw_audit_total{kubernetes_namespace="provgw"}[10m])))*100 >= 10 < 25
        labels:
          severity: Minor
      - alert: ProvgwAuditMismatchAbove25Percent
        annotations:
          description: 'Audit response mismatches are at or above 25% of total audits'
          summary: 'Audit response mismatches are at or above 25% of total audits'
        expr: (sum(rate(provgw_audit_responsemismatch{kubernetes_namespace="provgw"}[10m]))/sum(rate(provgw_audit_total{kubernetes_namespace="provgw"}[10m])))*100 >= 25 < 50
        labels:
          severity: Major
      - alert: ProvgwAuditMismatchAbove50Percent
        annotations:
          description: 'Audit response mismatches are at or above 50% of total audits'
          summary: 'Audit response mismatches are at or above 50% of total audits'
        expr: (sum(rate(provgw_audit_responsemismatch{kubernetes_namespace="provgw"}[10m]))/sum(rate(provgw_audit_total{kubernetes_namespace="provgw"}[10m])))*100 >= 50
        labels:
          severity: Critical
      - alert: ProvgwAuditTransientErrorAbove1Percent
        annotations:
          description: 'Audit transient errors are at or above 1% of total audits'
          summary: 'Audit transient errors are at or above 1% of total audits'
        expr: (sum(rate(provgw_audit_transient_error{kubernetes_namespace="provgw"}[10m]))/sum(rate(provgw_audit_total{kubernetes_namespace="provgw"}[10m])))*100 >= 1 < 10
        labels:
          severity: Warning
      - alert: ProvgwAuditTransientErrorAbove10Percent
        annotations:
          description: 'Audit transient errors are at or above 10% of total audits'
          summary: 'Audit transient errors are at or above 10% of total audits'
        expr: (sum(rate(provgw_audit_transient_error{kubernetes_namespace="provgw"}[10m]))/sum(rate(provgw_audit_total{kubernetes_namespace="provgw"}[10m])))*100 >= 10 < 25
        labels:
          severity: Minor
      - alert: ProvgwAuditTransientErrorAbove25Percent
        annotations:
          description: 'Audit transient errors are at or above 25% of total audits'
          summary: 'Audit transient errors are at or above 25% of total audits'
        expr: (sum(rate(provgw_audit_transient_error{kubernetes_namespace="provgw"}[10m]))/sum(rate(provgw_audit_total{kubernetes_namespace="provgw"}[10m])))*100 >= 25 < 50
        labels:
          severity: Major
      - alert: ProvgwAuditTransientErrorAbove50Percent
        annotations:
          description: 'Audit transient errors are at or above 50% of total audits'
          summary: 'Audit transient errors are at or above 50% of total audits'
        expr: (sum(rate(provgw_audit_transient_error{kubernetes_namespace="provgw"}[10m]))/sum(rate(provgw_audit_total{kubernetes_namespace="provgw"}[10m])))*100 >= 50
        labels:
          severity: Critical
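
Every expression in the rules file matches on kubernetes_namespace="provgw", so a deployment in another namespace needs that label value changed throughout. The following is a minimal sketch of one way to do this with sed. The namespace my-provgw and the scratch file are hypothetical; in practice the input would be your copy of ProvgwAlertrules.yaml.

```shell
# Hypothetical target namespace; replace with the namespace of your deployment.
NEW_NS="my-provgw"

# One expression from the rules file, standing in for ProvgwAlertrules.yaml.
rules=$(mktemp)
cat > "$rules" <<'EOF'
- alert: ProvgwTrafficRateAboveCriticalThreshold
  expr: sum(rate(oc_ingressgateway_http_requests_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="provgw"}[20m])) >= 950
EOF

# Rewrite only the namespace label matcher; alert names that contain
# "Provgw" are left untouched because the pattern includes the label name.
updated=$(sed "s/kubernetes_namespace=\"provgw\"/kubernetes_namespace=\"${NEW_NS}\"/g" "$rules")
echo "$updated"
```

Matching on the full label matcher rather than on the bare word provgw keeps the substitution from altering alert names or the ProvgwAlerts group name.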