10 OCNWDAF Alerts

This chapter describes how to configure OCNWDAF alerts and lists the system level and application level alerts.

10.1 OCNWDAF Alert Configuration

This section describes the measurement-based alert rule configuration for OCNWDAF. The Alert Manager uses the Prometheus measurement values reported by the microservices in the conditions of the alert rules to trigger alerts.

OCNWDAF Alert configuration in Prometheus

The following procedure is used to configure alerts in Prometheus:

  1. Download the ocn-nwdaf-alerting-rules.yaml file and edit it to configure the alert rules. The editable parameters include the name of the alert (name) and the rules for the alert (rules), which include the alert name (alert) and the expression (expr) defined to trigger the alert.
  2. Copy the updated ocn-nwdaf-alerting-rules.yaml file to Bastion Host.
  3. Run the following command:

    kubectl apply -f ocn-nwdaf-alerting-rules.yaml -n ocn-nwdaf

  4. To verify that the PrometheusRule custom resource is created, run the following command:

    kubectl get prometheusrule -n ocn-nwdaf

  5. Verify the alerts in the Prometheus GUI, where the alert name and expression are listed. See the example below:

    Figure 10-1 Prometheus GUI
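
Because step 4 lists the applied resource with kubectl get prometheusrule, the file applied in step 3 is presumably a PrometheusRule manifest. The following is a minimal sketch of what such a file might look like, assuming the Prometheus Operator API group monitoring.coreos.com/v1. The metadata name and the release label are hypothetical and are not taken from the shipped ocn-nwdaf-alerting-rules.yaml file; the single rule group is the status alert example documented in this section.

```yaml
# Sketch only: the shipped ocn-nwdaf-alerting-rules.yaml may differ.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ocn-nwdaf-alerting-rules   # hypothetical resource name
  namespace: ocn-nwdaf             # namespace used in the kubectl commands above
  labels:
    release: prometheus            # hypothetical; must match the rule selector of your Prometheus instance
spec:
  groups:
    # One group per alert, following the naming used in this chapter.
    - name: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
      rules:
        - alert: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
          expr: up{app="ocn-nwdaf-data-collection"} == 0
```

If the rule does not appear in the Prometheus GUI after it is applied, a label mismatch between the manifest and the Prometheus rule selector is one possible cause to check.
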

Alert Rules

The alerts are configured on the Prometheus server. Each set of scraped metrics corresponds to a pod that runs a single microservice, so each alert belongs to one of the running pods. Prometheus continuously collects metrics, and when the condition of any alerting rule is met, the corresponding alert is triggered. All the alert rules are written in one or more .yml files and deployed as described in the procedure OCNWDAF Alert configuration in Prometheus. The alert rules for the various alerts captured for OCNWDAF are listed below:

Status Alert Rule

    - name: <ALERT NAME>
      rules:
      - alert: <ALERT NAME>
        expr: up{app="<SERVICE LABEL>"} == 0

Example:

    - name: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
      rules:
      - alert: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
        expr: up{app="ocn-nwdaf-data-collection"} == 0

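
Prometheus alerting rules also accept the optional standard fields for, labels, and annotations. The sketch below shows how the status alert might be extended with them. The 1m hold period and the severity value are assumptions chosen for illustration and are not part of the documented rule; the annotation text is taken from the description and cause of this alert in the Application Level Alerts section.

```yaml
- name: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
  rules:
    - alert: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
      expr: up{app="ocn-nwdaf-data-collection"} == 0
      # Assumption: require the condition to hold for 1 minute before the
      # alert fires, so that a brief restart does not raise it.
      for: 1m
      labels:
        severity: critical   # hypothetical severity value
      annotations:
        summary: "Microservice ocn-nwdaf-data-collection is down."
        description: "The microservice is not available or not reachable."
```
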
Traffic Alert Rule

  • Request rate rule:

    - name: <ALERT NAME>
      rules:
      - alert: <ALERT NAME>
        expr: >
          sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>"}[1m])) > 1000

    Example:

    - name: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE
      rules:
      - alert: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE
        expr: sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR"}[1m])) > 1000

  • Request failure rate rule:

    - name: <ALERT NAME>
      rules:
      - alert: <ALERT NAME>
        expr: >
          (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>",status=~"[4-5].."}[1m])) / ignoring(status) sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>"}[1m]))) * 100 > 70

    Example:

    - name: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
      rules:
      - alert: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
        expr: (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR",status=~"[4-5].."}[1m])) / ignoring(status) sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR"}[1m]))) * 100 > 70

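
The same two templates can be applied to the other endpoints listed in the Application Level Alerts section. As an illustration, the sketch below fills in the request rate template for the UE_MOBILITY endpoint and the failure rate template for the event subscription endpoint. The alert names and URI endpoints come from that section; the exact rules in the shipped file may differ.

```yaml
# Request rate: fires when the endpoint receives more than 1000 requests per second.
- name: HIGH_UE_MOBILITY_REQUEST_RATE
  rules:
    - alert: HIGH_UE_MOBILITY_REQUEST_RATE
      expr: >
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY"}[1m])) > 1000

# Failure rate: fires when more than 70% of the requests fail.
- name: HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE
  rules:
    - alert: HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE
      # Worked example: if requests with a 4xx or 5xx status arrive at 80/s
      # while all requests arrive at 100/s, the expression evaluates to
      # (80 / 100) * 100 = 80, which is above the threshold of 70.
      # The numerator keeps the status label, so the ratio is evaluated
      # per status code.
      expr: >
        (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-eventssubscription/v1/subscriptions",status=~"[4-5].."}[1m]))
        / ignoring(status)
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-eventssubscription/v1/subscriptions"}[1m]))) * 100 > 70
```

The folded scalar (expr: >) joins the continuation lines with single spaces, so each expression reaches Prometheus as one line.
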
CPU Alert Rule

    - name: <ALERT NAME>
      rules:
      - alert: <ALERT NAME>
        expr: system_cpu_usage{app="<SERVICE LABEL>"} * 100 > 80

Example:

    - name: OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD
      rules:
      - alert: OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD
        expr: system_cpu_usage{app="ocn-nwdaf-data-collection"} * 100 > 80

JVM Memory Usage Alert Rule

    - name: <ALERT NAME>
      rules:
      - alert: <ALERT NAME>
        expr: >
          (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="<SERVICE LABEL>"}[1m])) / sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="<SERVICE LABEL>"}[1m]))) * 100 > 80

Example:

    - name: OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE
      rules:
      - alert: OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE
        expr: (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="ocn-nwdaf-data-collection"}[1m])) / sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="ocn-nwdaf-data-collection"}[1m]))) * 100 > 80
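
To show how the CPU and JVM memory templates carry over to another microservice, the sketch below fills them in for the analytics service. It assumes that the app label equals the microservice name ocn-nwdaf-analytics, by analogy with the data collection examples; verify the label value in your deployment before using it.

```yaml
- name: OCN_NWDAF_ANALYTICS_HIGH_CPU_LOAD
  rules:
    - alert: OCN_NWDAF_ANALYTICS_HIGH_CPU_LOAD
      # The rule multiplies the reported value by 100, so a value of
      # 0.85 evaluates to 85, which is above the threshold of 80.
      expr: system_cpu_usage{app="ocn-nwdaf-analytics"} * 100 > 80   # app label value is an assumption

- name: OCN_NWDAF_ANALYTICS_HIGH_JVM_HEAP_MEMORY_USAGE
  rules:
    - alert: OCN_NWDAF_ANALYTICS_HIGH_JVM_HEAP_MEMORY_USAGE
      # Worked example: an average of 1.7 GiB of heap used against a
      # 2 GiB maximum gives (1.7 / 2) * 100 = 85, above the threshold of 80.
      expr: >
        (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="ocn-nwdaf-analytics"}[1m]))
        / sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="ocn-nwdaf-analytics"}[1m]))) * 100 > 80
```
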

10.2 System Level Alerts

This section lists the system level alerts.

OCN_NWDAF_ANALYTICS_HIGH_CPU_LOAD

Table 10-1 OCN_NWDAF_ANALYTICS_HIGH_CPU_LOAD

Field Details
Description CPU load is high at the pod where the microservice is running.
Affected Functions All
Cause CPU load is more than 80% of the allocated resources.

OCN_NWDAF_COMMUNICATION_HIGH_CPU_LOAD

Table 10-2 OCN_NWDAF_COMMUNICATION_HIGH_CPU_LOAD

Field Details
Description CPU load is high at the pod where the microservice is running.
Affected Functions All
Cause CPU load is more than 80% of the allocated resources.

OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_CPU_LOAD

Table 10-3 OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_CPU_LOAD

Field Details
Description CPU load is high at the pod where the microservice is running.
Affected Functions All
Cause CPU load is more than 80% of the allocated resources.

OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD

Table 10-4 OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD

Field Details
Description CPU load is high at the pod where the microservice is running.
Affected Functions All
Cause CPU load is more than 80% of the allocated resources.

OCN_NWDAF_GATEWAY_HIGH_CPU_LOAD

Table 10-5 OCN_NWDAF_GATEWAY_HIGH_CPU_LOAD

Field Details
Description CPU load is high at the pod where the microservice is running.
Affected Functions All
Cause CPU load is more than 80% of the allocated resources.

OCN_NWDAF_MTLF_HIGH_CPU_LOAD

Table 10-6 OCN_NWDAF_MTLF_HIGH_CPU_LOAD

Field Details
Description CPU load is high at the pod where the microservice is running.
Affected Functions All
Cause CPU load is more than 80% of the allocated resources.

OCN_NWDAF_SUBSCRIPTION_HIGH_CPU_LOAD

Table 10-7 OCN_NWDAF_SUBSCRIPTION_HIGH_CPU_LOAD

Field Details
Description CPU load is high at the pod where the microservice is running.
Affected Functions All
Cause CPU load is more than 80% of the allocated resources.

OCN_NWDAF_ANALYTICS_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-8 OCN_NWDAF_ANALYTICS_HIGH_JVM_HEAP_MEMORY_USAGE

Field Details
Description The average JVM heap memory usage is high.
Affected Functions All
Cause The heap memory usage is more than 80%.

OCN_NWDAF_COMMUNICATION_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-9 OCN_NWDAF_COMMUNICATION_HIGH_JVM_HEAP_MEMORY_USAGE

Field Details
Description The average JVM heap memory usage is high.
Affected Functions All
Cause The heap memory usage is more than 80%.

OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-10 OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_JVM_HEAP_MEMORY_USAGE

Field Details
Description The average JVM heap memory usage is high.
Affected Functions All
Cause The heap memory usage is more than 80%.

OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-11 OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE

Field Details
Description The average JVM heap memory usage is high.
Affected Functions All
Cause The heap memory usage is more than 80%.

OCN_NWDAF_GATEWAY_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-12 OCN_NWDAF_GATEWAY_HIGH_JVM_HEAP_MEMORY_USAGE

Field Details
Description The average JVM heap memory usage is high.
Affected Functions All
Cause The heap memory usage is more than 80%.

OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-13 OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE

Field Details
Description The average JVM heap memory usage is high.
Affected Functions All
Cause The heap memory usage is more than 80%.

OCN_NWDAF_SUBSCRIPTION_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-14 OCN_NWDAF_SUBSCRIPTION_HIGH_JVM_HEAP_MEMORY_USAGE

Field Details
Description The average JVM heap memory usage is high.
Affected Functions All
Cause The heap memory usage is more than 80%.

10.3 Application Level Alerts

This section lists the application level alerts.

OCN_NWDAF_ANALYTICS_NOT_RUNNING

Table 10-15 OCN_NWDAF_ANALYTICS_NOT_RUNNING

Field Details
Description The microservice is not available or not reachable.
Cause Microservice ocn-nwdaf-analytics is down.

OCN_NWDAF_COMMUNICATION_NOT_RUNNING

Table 10-16 OCN_NWDAF_COMMUNICATION_NOT_RUNNING

Field Details
Description The microservice is not available or not reachable.
Cause Microservice ocn-nwdaf-communication is down.

OCN_NWDAF_CONFIGURATION_SERVICE_NOT_RUNNING

Table 10-17 OCN_NWDAF_CONFIGURATION_SERVICE_NOT_RUNNING

Field Details
Description The microservice is not available or not reachable.
Cause Microservice ocn-nwdaf-configuration-service is down.

OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING

Table 10-18 OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING

Field Details
Description The microservice is not available or not reachable.
Cause Microservice ocn-nwdaf-data-collection is down.

OCN_NWDAF_GATEWAY_NOT_RUNNING

Table 10-19 OCN_NWDAF_GATEWAY_NOT_RUNNING

Field Details
Description The microservice is not available or not reachable.
Cause Microservice ocn-nwdaf-gateway is down.

OCN_NWDAF_MTLF_NOT_RUNNING

Table 10-20 OCN_NWDAF_MTLF_NOT_RUNNING

Field Details
Description The microservice is not available or not reachable.
Cause Microservice ocn-nwdaf-mtlf is down.

OCN_NWDAF_SUBSCRIPTION_NOT_RUNNING

Table 10-21 OCN_NWDAF_SUBSCRIPTION_NOT_RUNNING

Field Details
Description The microservice is not available or not reachable.
Cause Microservice ocn-nwdaf-subscription is down.

HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE

Table 10-22 HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE

Field Details
Description The number of requests received per second is high.
Cause Traffic is high, above 1000 requests per second.
URI Endpoint nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR
Affected Functions ABNORMAL_BEHAVIOUR

HIGH_UE_MOBILITY_REQUEST_RATE

Table 10-23 HIGH_UE_MOBILITY_REQUEST_RATE

Field Details
Description The number of requests received per second is high.
Cause Traffic is high, above 1000 requests per second.
URI Endpoint nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY
Affected Functions UE_MOBILITY

HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE

Table 10-24 HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE

Field Details
Description The number of requests received per second is high.
Cause Traffic is high, above 1000 requests per second.
URI Endpoint nnwdaf-eventssubscription/v1/subscriptions
Affected Functions UE_MOBILITY, SLICE_LOAD_LEVEL, ABNORMAL_BEHAVIOUR

HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE

Table 10-25 HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE

Field Details
Description The request failure rate is high.
Cause The request failure rate is more than 70%.
URI Endpoint nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR
Affected Functions ABNORMAL_BEHAVIOUR

HIGH_UE_MOBILITY_REQUEST_FAILURE_RATE

Table 10-26 HIGH_UE_MOBILITY_REQUEST_FAILURE_RATE

Field Details
Description The request failure rate is high.
Cause The request failure rate is more than 70%.
URI Endpoint nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY
Affected Functions UE_MOBILITY

HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE

Table 10-27 HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE

Field Details
Description The request failure rate is high.
Cause The request failure rate is more than 70%.
URI Endpoint nnwdaf-eventssubscription/v1/subscriptions
Affected Functions UE_MOBILITY, SLICE_LOAD_LEVEL, ABNORMAL_BEHAVIOUR