10 OCNWDAF Alerts

This chapter describes OCNWDAF alert configuration and lists the system level and application level alerts.

10.1 OCNWDAF Alert Configuration

This section describes how to configure measurement-based alert rules for OCNWDAF. The Alert Manager uses the Prometheus measurement values reported by the microservices in the conditions of the alert rules to trigger alerts.

OCNWDAF Alert configuration in Prometheus

The following procedure is used to configure alerts in Prometheus:

1. Download the ocn-nwdaf-alerting-rules.yaml file and edit it to configure the alert rules. The editable parameters include the name of the alert and the rules for the alert, including the alert name and the expression (expr) that triggers the alert.
2. Copy the updated ocn-nwdaf-alerting-rules.yaml file to the Bastion Host.
3. Run the following command:
   kubectl apply -f ocn-nwdaf-alerting-rules.yaml -n ocn-nwdaf
4. To verify that the PrometheusRule custom resource is created, run the following command:
   kubectl get prometheusrule -n ocn-nwdaf
5. Verify the alerts in the Prometheus GUI, where the alert name and expression are listed. See the example below:

Figure 10-1 Prometheus GUI
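
For orientation, the rule groups described in this section sit inside a PrometheusRule manifest. The following is a minimal sketch of what ocn-nwdaf-alerting-rules.yaml might look like, assuming the Prometheus Operator API group monitoring.coreos.com/v1. The metadata name, the "for" duration, the severity label, and the annotation text are illustrative assumptions and are not taken from the shipped file.

```yaml
# Sketch only: metadata.name, "for", labels, and annotations are assumed values.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ocn-nwdaf-alerting-rules      # hypothetical resource name
  namespace: ocn-nwdaf                # same namespace as in the kubectl commands
spec:
  groups:
  - name: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
    rules:
    - alert: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
      # Fires when Prometheus cannot scrape the data collection pod.
      expr: up{app="ocn-nwdaf-data-collection"} == 0
      for: 1m                         # assumed: condition must hold for 1 minute
      labels:
        severity: critical            # assumed label
      annotations:
        summary: "ocn-nwdaf-data-collection is down"   # assumed text
```

Depending on how the Prometheus Operator is deployed, the resource may also need metadata labels that match the operator's rule selector before the rules show up in the Prometheus GUI.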

Alert Rules

The alerts are configured on the Prometheus server. The scraped metrics correspond to a pod that runs a single microservice, so each alert belongs to one of the running pods. Prometheus continuously collects metrics and triggers an alert when the condition of any alerting rule is met. All the alert rules are written in one or more .yml files and deployed as described in the procedure OCNWDAF Alert configuration in Prometheus. The alert rules for the various OCNWDAF alerts are listed below:

Status Alert Rule

  - name: <ALERT NAME>
    rules:
    - alert: <ALERT NAME>
      expr: up{app="<SERVICE LABEL>"} == 0

Example:

  - name: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
    rules:
    - alert: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
      expr: up{app="ocn-nwdaf-data-collection"} == 0

Traffic Alert Rule

Request rate rule:

  - name: <ALERT NAME>
    rules:
    - alert: <ALERT NAME>
      expr: >
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>"}[1m])) > 1000

Example:

  - name: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE
    rules:
    - alert: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE
      expr: sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR"}[1m])) > 1000

Request failure rate rule:

  - name: <ALERT NAME>
    rules:
    - alert: <ALERT NAME>
      expr: >
        (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>",status=~"[4-5].."}[1m])) / ignoring(status) sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>"}[1m]))) * 100 > 70

Example:

  - name: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
    rules:
    - alert: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
      expr: (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR",status=~"[4-5].."}[1m]))/ ignoring(status) sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR"}[1m]))) * 100 > 70
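
In the failure rate expression, the numerator keeps the status label, so it can hold one series per failing status code, while ignoring(status) requests one-to-one matching against a denominator that has no status label. When several failing status codes occur at the same time, these series may not match one-to-one. A possible variant, given here as an untested sketch and not as the rule shipped with OCNWDAF, also aggregates status away in the numerator so that all 4xx and 5xx responses are combined into a single percentage. The alert name is reused from the example only for illustration.

```yaml
  # Sketch only: a variant of the example rule, not the shipped rule.
  - name: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
    rules:
    - alert: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
      # Both sides drop the same labels (including status), so the default
      # one-to-one vector matching applies and no ignoring() clause is needed.
      expr: >
        (sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR",status=~"[4-5].."}[1m]))
        /
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR"}[1m])))
        * 100 > 70
```

As a worked case under this variant: if the endpoint receives 100 requests per second and 80 of them return a 4xx or 5xx status, the expression evaluates to 80, which is above the threshold of 70, so the alert condition is met.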

CPU Alert Rule

  - name: <ALERT NAME>
    rules:
    - alert: <ALERT NAME>
      expr: system_cpu_usage{app="<SERVICE LABEL>"} * 100 > 80

Example:

  - name: OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD
    rules:
    - alert: OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD
      expr: system_cpu_usage{app="ocn-nwdaf-data-collection"} * 100 > 80

JVM Memory Usage Alert Rule

  - name: <ALERT NAME>
    rules:
    - alert: <ALERT NAME>
      expr: >
        (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="<SERVICE LABEL>"}[1m]))/sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="<SERVICE LABEL>"}[1m]))) * 100 > 80

Example:

  - name: OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE
    rules:
    - alert: OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE
      expr: (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="ocn-nwdaf-data-collection"} [1m]))/sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="ocn-nwdaf-data-collection"}[1m]))) * 100 > 80

10.2 System Level Alerts

This section lists the system level alerts.

OCN_NWDAF_ANALYTICS_HIGH_CPU_LOAD

Table 10-1 OCN_NWDAF_ANALYTICS_HIGH_CPU_LOAD

Field	Details
Description	CPU load is high at the pod where the microservice is running.
Affected Functions	All
Cause	CPU load is more than 80% of the allocated resources.

OCN_NWDAF_COMMUNICATION_HIGH_CPU_LOAD

Table 10-2 OCN_NWDAF_COMMUNICATION_HIGH_CPU_LOAD

Field	Details
Description	CPU load is high at the pod where the microservice is running.
Affected Functions	All
Cause	CPU load is more than 80% of the allocated resources.

OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_CPU_LOAD

Table 10-3 OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_CPU_LOAD

Field	Details
Description	CPU load is high at the pod where the microservice is running.
Affected Functions	All
Cause	CPU load is more than 80% of the allocated resources.

OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD

Table 10-4 OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD

Field	Details
Description	CPU load is high at the pod where the microservice is running.
Affected Functions	All
Cause	CPU load is more than 80% of the allocated resources.

OCN_NWDAF_GATEWAY_HIGH_CPU_LOAD

Table 10-5 OCN_NWDAF_GATEWAY_HIGH_CPU_LOAD

Field	Details
Description	CPU load is high at the pod where the microservice is running.
Affected Functions	All
Cause	CPU load is more than 80% of the allocated resources.

OCN_NWDAF_MTLF_HIGH_CPU_LOAD

Table 10-6 OCN_NWDAF_MTLF_HIGH_CPU_LOAD

Field	Details
Description	CPU load is high at the pod where the microservice is running.
Affected Functions	All
Cause	CPU load is more than 80% of the allocated resources.

OCN_NWDAF_SUBSCRIPTION_HIGH_CPU_LOAD

Table 10-7 OCN_NWDAF_SUBSCRIPTION_HIGH_CPU_LOAD

Field	Details
Description	CPU load is high at the pod where the microservice is running.
Affected Functions	All
Cause	CPU load is more than 80% of the allocated resources.

OCN_NWDAF_ANALYTICS_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-8 OCN_NWDAF_ANALYTICS_HIGH_JVM_HEAP_MEMORY_USAGE

Field	Details
Description	The average heap memory usage is high.
Affected Functions	All
Cause	The heap memory usage is more than 80% of the maximum heap memory.

OCN_NWDAF_COMMUNICATION_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-9 OCN_NWDAF_COMMUNICATION_HIGH_JVM_HEAP_MEMORY_USAGE

Field	Details
Description	The average heap memory usage is high.
Affected Functions	All
Cause	The heap memory usage is more than 80% of the maximum heap memory.

OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-10 OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_JVM_HEAP_MEMORY_USAGE

Field	Details
Description	The average heap memory usage is high.
Affected Functions	All
Cause	The heap memory usage is more than 80% of the maximum heap memory.

OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-11 OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE

Field	Details
Description	The average heap memory usage is high.
Affected Functions	All
Cause	The heap memory usage is more than 80% of the maximum heap memory.

OCN_NWDAF_GATEWAY_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-12 OCN_NWDAF_GATEWAY_HIGH_JVM_HEAP_MEMORY_USAGE

Field	Details
Description	The average heap memory usage is high.
Affected Functions	All
Cause	The heap memory usage is more than 80% of the maximum heap memory.

OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-13 OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE

Field	Details
Description	The average heap memory usage is high.
Affected Functions	All
Cause	The heap memory usage is more than 80% of the maximum heap memory.

OCN_NWDAF_SUBSCRIPTION_HIGH_JVM_HEAP_MEMORY_USAGE

Table 10-14 OCN_NWDAF_SUBSCRIPTION_HIGH_JVM_HEAP_MEMORY_USAGE

Field	Details
Description	The average heap memory usage is high.
Affected Functions	All
Cause	The heap memory usage is more than 80% of the maximum heap memory.
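
To illustrate how the system level alerts map onto the CPU and JVM memory rule templates from section 10.1, the following sketch fills in both templates for the MTLF microservice. It assumes that the app label value is ocn-nwdaf-mtlf, matching the microservice name; verify the label against the metrics actually exposed in the deployment.

```yaml
  # Sketch only: the app label value "ocn-nwdaf-mtlf" is an assumption.
  - name: OCN_NWDAF_MTLF_HIGH_CPU_LOAD
    rules:
    - alert: OCN_NWDAF_MTLF_HIGH_CPU_LOAD
      # The CPU usage value is scaled by 100 and compared with the 80% threshold.
      expr: system_cpu_usage{app="ocn-nwdaf-mtlf"} * 100 > 80
  - name: OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE
    rules:
    - alert: OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE
      # Ratio of the 1-minute average used heap to the 1-minute average maximum heap.
      expr: >
        (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="ocn-nwdaf-mtlf"}[1m]))
        /
        sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="ocn-nwdaf-mtlf"}[1m])))
        * 100 > 80
```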

10.3 Application Level Alerts

This section lists the application level alerts.

OCN_NWDAF_ANALYTICS_NOT_RUNNING

Table 10-15 OCN_NWDAF_ANALYTICS_NOT_RUNNING

Field	Details
Description	The microservice is not available or not reachable.
Cause	Microservice ocn-nwdaf-analytics is down.

OCN_NWDAF_COMMUNICATION_NOT_RUNNING

Table 10-16 OCN_NWDAF_COMMUNICATION_NOT_RUNNING

Field	Details
Description	The microservice is not available or not reachable.
Cause	Microservice ocn-nwdaf-communication is down.

OCN_NWDAF_CONFIGURATION_SERVICE_NOT_RUNNING

Table 10-17 OCN_NWDAF_CONFIGURATION_SERVICE_NOT_RUNNING

Field	Details
Description	The microservice is not available or not reachable.
Cause	Microservice ocn-nwdaf-configuration-service is down.

OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING

Table 10-18 OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING

Field	Details
Description	The microservice is not available or not reachable.
Cause	Microservice ocn-nwdaf-data-collection is down.

OCN_NWDAF_GATEWAY_NOT_RUNNING

Table 10-19 OCN_NWDAF_GATEWAY_NOT_RUNNING

Field	Details
Description	The microservice is not available or not reachable.
Cause	Microservice ocn-nwdaf-gateway is down.

OCN_NWDAF_MTLF_NOT_RUNNING

Table 10-20 OCN_NWDAF_MTLF_NOT_RUNNING

Field	Details
Description	The microservice is not available or not reachable.
Cause	Microservice ocn-nwdaf-mtlf is down.

OCN_NWDAF_SUBSCRIPTION_NOT_RUNNING

Table 10-21 OCN_NWDAF_SUBSCRIPTION_NOT_RUNNING

Field	Details
Description	The microservice is not available or not reachable.
Cause	Microservice ocn-nwdaf-subscription is down.

HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE

Table 10-22 HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE

Field	Details
Description	The number of requests received per second is high.
Cause	Traffic is high, above 1000 requests per second.
URI Endpoint	`nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR`
Affected Functions	ABNORMAL_BEHAVIOUR

HIGH_UE_MOBILITY_REQUEST_RATE

Table 10-23 HIGH_UE_MOBILITY_REQUEST_RATE

Field	Details
Description	The number of requests received per second is high.
Cause	Traffic is high, above 1000 requests per second.
URI Endpoint	`nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY`
Affected Functions	UE_MOBILITY

HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE

Table 10-24 HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE

Field	Details
Description	The number of requests received per second is high.
Cause	Traffic is high, above 1000 requests per second.
URI Endpoint	`nnwdaf-eventssubscription/v1/subscriptions`
Affected Functions	UE_MOBILITY, SLICE_LOAD_LEVEL, ABNORMAL_BEHAVIOUR
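
As an illustration of how the request rate alerts above relate to the request rate rule template from section 10.1, the following sketch fills in the template for HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE. The uri label value is copied from Table 10-24 as is; whether the exposed metric carries exactly this value is an assumption to check in the deployment.

```yaml
  # Sketch only: the uri label value is taken verbatim from Table 10-24.
  - name: HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE
    rules:
    - alert: HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE
      # Per-second request rate over the last minute, summed across pods,
      # methods, and status codes, compared with the 1000 requests/s threshold.
      expr: >
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-eventssubscription/v1/subscriptions"}[1m]))
        > 1000
```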

HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE

Table 10-25 HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE

Field	Details
Description	The rate of failing requests is high.
Cause	The request failure rate is more than 70%.
URI Endpoint	`nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR`
Affected Functions	ABNORMAL_BEHAVIOUR

HIGH_UE_MOBILITY_REQUEST_FAILURE_RATE

Table 10-26 HIGH_UE_MOBILITY_REQUEST_FAILURE_RATE

Field	Details
Description	The rate of failing requests is high.
Cause	The request failure rate is more than 70%.
URI Endpoint	`nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY`
Affected Functions	UE_MOBILITY

HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE

Table 10-27 HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE

Field	Details
Description	The rate of failing requests is high.
Cause	The request failure rate is more than 70%.
URI Endpoint	`nnwdaf-eventssubscription/v1/subscriptions`
Affected Functions	UE_MOBILITY, SLICE_LOAD_LEVEL, ABNORMAL_BEHAVIOUR