10 OCNWDAF Alerts
10.1 OCNWDAF Alert Configuration
This section describes how to configure measurement-based alert rules for OCNWDAF. The Alert Manager uses the Prometheus measurement values reported by the microservices, evaluated against the conditions in the alert rules, to trigger alerts.
OCNWDAF Alert configuration in Prometheus
The following procedure is used to configure alerts in Prometheus:
- Download the `ocn-nwdaf-alerting-rules.yaml` file. Edit this file to configure the alert rules. The parameters in the file that can be edited include the `name` of the alert, the `rules` for the alert (including the `alert` name), and the expression `expr` defined to trigger the alert.
- Copy the updated `ocn-nwdaf-alerting-rules.yaml` file to the Bastion Host.
- Run the following command:
kubectl apply -f ocn-nwdaf-alerting-rules.yaml -n ocn-nwdaf
- To verify if the Custom Resource Definition (CRD) is created, run the following command:
kubectl get prometheusrule -n ocn-nwdaf
- Verify the alerts in the Prometheus GUI, where the alert name and expression are listed. See the example below:
Figure 10-1 Prometheus GUI
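Because the procedure lists the result with `kubectl get prometheusrule`, the rule groups are deployed as a PrometheusRule resource. The following is a minimal sketch of how such a file is commonly laid out, assuming the Prometheus Operator API group `monitoring.coreos.com/v1`. The `metadata.name` value is hypothetical, and the file shipped with OCNWDAF may carry additional labels or fields.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  # Hypothetical resource name; use the name from the downloaded file.
  name: ocn-nwdaf-alerting-rules
  # Namespace taken from the kubectl commands in the procedure above.
  namespace: ocn-nwdaf
spec:
  groups:
    # Each group holds one alert, following the rule layout in this section.
    - name: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
      rules:
        - alert: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
          expr: up{app="ocn-nwdaf-data-collection"} == 0
```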
Alert Rules
The alerts are configured on the Prometheus server. The metrics scraped correspond to a pod that runs a single microservice, so each alert belongs to one of the running pods. Prometheus continuously collects metrics, and when the condition of any alerting rule is met, the alert is triggered. All the alert rules are written in one or more `.yml` files and deployed as described in the procedure OCNWDAF Alert configuration in Prometheus. The alert rules for the various OCNWDAF alerts are listed below:
Service not running rule:
- name: <ALERT NAME>
  rules:
    - alert: <ALERT NAME>
      expr: up{app="<SERVICE LABEL>"} == 0
Example:
- name: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
  rules:
    - alert: OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
      expr: up{app="ocn-nwdaf-data-collection"} == 0
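Prometheus alerting rules also accept the optional fields `for`, `labels`, and `annotations` alongside `alert` and `expr`. The sketch below applies the template to the gateway microservice. It assumes that the `app` label value matches the microservice name given in the Application Level Alerts section. The `for` duration, the `severity` label, and the annotation text are hypothetical values, not taken from the shipped rules file.

```yaml
- name: OCN_NWDAF_GATEWAY_NOT_RUNNING
  rules:
    - alert: OCN_NWDAF_GATEWAY_NOT_RUNNING
      # Assumption: the app label equals the microservice name.
      expr: up{app="ocn-nwdaf-gateway"} == 0
      # Hypothetical: fire only after the condition has held for 1 minute.
      for: 1m
      labels:
        # Hypothetical label, useful for routing in the Alert Manager.
        severity: critical
      annotations:
        summary: "Microservice ocn-nwdaf-gateway is down."
```

Adding a `for` duration can reduce alerts caused by a single failed scrape, at the cost of a slightly later notification.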
Request rate rule:
- name: <ALERT NAME>
  rules:
    - alert: <ALERT NAME>
      expr: >
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>"}[1m])) > 1000
Example:
- name: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE
  rules:
    - alert: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE
      expr: sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR"}[1m])) > 1000
Request failure rate rule:
- name: <ALERT NAME>
  rules:
    - alert: <ALERT NAME>
      expr: >
        (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>",status=~"[4-5].."}[1m]))/ ignoring(status) sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="<URI ENDPOINT>"}[1m]))) * 100 > 70
Example:
- name: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
  rules:
    - alert: HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
      expr: (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR",status=~"[4-5].."}[1m]))/ ignoring(status) sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash) (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR"}[1m]))) * 100 > 70
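The same two templates can be applied to the other endpoints listed in the Application Level Alerts section. As an illustration, the sketch below fills them in for the UE_MOBILITY endpoint, using the alert names and URI given there. The folded block scalar (`>`) is used only to break the long expressions across lines; YAML joins the lines with spaces before Prometheus parses them. This is a sketch of how the rules might look, not a copy of the shipped file.

```yaml
- name: HIGH_UE_MOBILITY_REQUEST_RATE
  rules:
    - alert: HIGH_UE_MOBILITY_REQUEST_RATE
      # Per-second request rate over the last minute, summed across pods.
      expr: >
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY"}[1m]))
        > 1000
- name: HIGH_UE_MOBILITY_REQUEST_FAILURE_RATE
  rules:
    - alert: HIGH_UE_MOBILITY_REQUEST_FAILURE_RATE
      # Share of requests answered with a 4xx or 5xx status, in percent.
      expr: >
        (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY",status=~"[4-5].."}[1m]))
        / ignoring(status)
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY"}[1m])))
        * 100 > 70
```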
High CPU load rule:
- name: <ALERT NAME>
  rules:
    - alert: <ALERT NAME>
      expr: system_cpu_usage{app="<SERVICE LABEL>"} * 100 > 80
Example:
- name: OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD
  rules:
    - alert: OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD
      expr: system_cpu_usage{app="ocn-nwdaf-data-collection"} * 100 > 80
High JVM heap memory usage rule:
- name: <ALERT NAME>
  rules:
    - alert: <ALERT NAME>
      expr: >
        (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="<SERVICE LABEL>"}[1m]))/sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="<SERVICE LABEL>"}[1m]))) * 100 > 80
Example:
- name: OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE
  rules:
    - alert: OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE
      expr: (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="ocn-nwdaf-data-collection"}[1m]))/sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="ocn-nwdaf-data-collection"}[1m]))) * 100 > 80
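The system-level templates carry over to the other microservices by changing only the alert name and the `app` label. The sketch below shows the CPU and heap rules for the MTLF microservice, using the alert names from the System Level Alerts section. It assumes that the `app` label value is `ocn-nwdaf-mtlf`, following the naming pattern of the data collection example; check the labels on the deployed pods before relying on it.

```yaml
- name: OCN_NWDAF_MTLF_HIGH_CPU_LOAD
  rules:
    - alert: OCN_NWDAF_MTLF_HIGH_CPU_LOAD
      # system_cpu_usage is a 0..1 ratio; multiply by 100 to compare in percent.
      # Assumption: the app label equals the microservice name.
      expr: system_cpu_usage{app="ocn-nwdaf-mtlf"} * 100 > 80
- name: OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE
  rules:
    - alert: OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE
      # One-minute average of used heap divided by maximum heap, in percent.
      expr: >
        (sum(avg_over_time(jvm_memory_used_bytes{area="heap",app="ocn-nwdaf-mtlf"}[1m]))
        / sum(avg_over_time(jvm_memory_max_bytes{area="heap",app="ocn-nwdaf-mtlf"}[1m])))
        * 100 > 80
```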
10.2 System Level Alerts
This section lists the system level alerts.
OCN_NWDAF_ANALYTICS_HIGH_CPU_LOAD
Table 10-1 OCN_NWDAF_ANALYTICS_HIGH_CPU_LOAD
Field | Details |
---|---|
Description | CPU load is high at the pod where the microservice is running. |
Affected Functions | All |
Cause | CPU load is more than 80% of the allocated resources. |
OCN_NWDAF_COMMUNICATION_HIGH_CPU_LOAD
Table 10-2 OCN_NWDAF_COMMUNICATION_HIGH_CPU_LOAD
Field | Details |
---|---|
Description | CPU load is high at the pod where the microservice is running. |
Affected Functions | All |
Cause | CPU load is more than 80% of the allocated resources. |
OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_CPU_LOAD
Table 10-3 OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_CPU_LOAD
Field | Details |
---|---|
Description | CPU load is high at the pod where the microservice is running. |
Affected Functions | All |
Cause | CPU load is more than 80% of the allocated resources. |
OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD
Table 10-4 OCN_NWDAF_DATA_COLLECTION_HIGH_CPU_LOAD
Field | Details |
---|---|
Description | CPU load is high at the pod where the microservice is running. |
Affected Functions | All |
Cause | CPU load is more than 80% of the allocated resources. |
OCN_NWDAF_GATEWAY_HIGH_CPU_LOAD
Table 10-5 OCN_NWDAF_GATEWAY_HIGH_CPU_LOAD
Field | Details |
---|---|
Description | CPU load is high at the pod where the microservice is running. |
Affected Functions | All |
Cause | CPU load is more than 80% of the allocated resources. |
OCN_NWDAF_MTLF_HIGH_CPU_LOAD
Table 10-6 OCN_NWDAF_MTLF_HIGH_CPU_LOAD
Field | Details |
---|---|
Description | CPU load is high at the pod where the microservice is running. |
Affected Functions | All |
Cause | CPU load is more than 80% of the allocated resources. |
OCN_NWDAF_SUBSCRIPTION_HIGH_CPU_LOAD
Table 10-7 OCN_NWDAF_SUBSCRIPTION_HIGH_CPU_LOAD
Field | Details |
---|---|
Description | CPU load is high at the pod where the microservice is running. |
Affected Functions | All |
Cause | CPU load is more than 80% of the allocated resources. |
OCN_NWDAF_ANALYTICS_HIGH_JVM_HEAP_MEMORY_USAGE
Table 10-8 OCN_NWDAF_ANALYTICS_HIGH_JVM_HEAP_MEMORY_USAGE
Field | Details |
---|---|
Description | The average heap memory usage is high. |
Affected Functions | All |
Cause | The heap memory usage is more than 80%. |
OCN_NWDAF_COMMUNICATION_HIGH_JVM_HEAP_MEMORY_USAGE
Table 10-9 OCN_NWDAF_COMMUNICATION_HIGH_JVM_HEAP_MEMORY_USAGE
Field | Details |
---|---|
Description | The average heap memory usage is high. |
Affected Functions | All |
Cause | The heap memory usage is more than 80%. |
OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_JVM_HEAP_MEMORY_USAGE
Table 10-10 OCN_NWDAF_CONFIGURATION_SERVICE_HIGH_JVM_HEAP_MEMORY_USAGE
Field | Details |
---|---|
Description | The average heap memory usage is high. |
Affected Functions | All |
Cause | The heap memory usage is more than 80%. |
OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE
Table 10-11 OCN_NWDAF_DATA_COLLECTION_HIGH_JVM_HEAP_MEMORY_USAGE
Field | Details |
---|---|
Description | The average heap memory usage is high. |
Affected Functions | All |
Cause | The heap memory usage is more than 80%. |
OCN_NWDAF_GATEWAY_HIGH_JVM_HEAP_MEMORY_USAGE
Table 10-12 OCN_NWDAF_GATEWAY_HIGH_JVM_HEAP_MEMORY_USAGE
Field | Details |
---|---|
Description | The average heap memory usage is high. |
Affected Functions | All |
Cause | The heap memory usage is more than 80%. |
OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE
Table 10-13 OCN_NWDAF_MTLF_HIGH_JVM_HEAP_MEMORY_USAGE
Field | Details |
---|---|
Description | The average heap memory usage is high. |
Affected Functions | All |
Cause | The heap memory usage is more than 80%. |
OCN_NWDAF_SUBSCRIPTION_HIGH_JVM_HEAP_MEMORY_USAGE
Table 10-14 OCN_NWDAF_SUBSCRIPTION_HIGH_JVM_HEAP_MEMORY_USAGE
Field | Details |
---|---|
Description | The average heap memory usage is high. |
Affected Functions | All |
Cause | The heap memory usage is more than 80%. |
10.3 Application Level Alerts
This section lists the application level alerts.
OCN_NWDAF_ANALYTICS_NOT_RUNNING
Table 10-15 OCN_NWDAF_ANALYTICS_NOT_RUNNING
Field | Details |
---|---|
Description | The microservice is not available or not reachable. |
Cause | Microservice ocn-nwdaf-analytics is down. |
OCN_NWDAF_COMMUNICATION_NOT_RUNNING
Table 10-16 OCN_NWDAF_COMMUNICATION_NOT_RUNNING
Field | Details |
---|---|
Description | The microservice is not available or not reachable. |
Cause | Microservice ocn-nwdaf-communication is down. |
OCN_NWDAF_CONFIGURATION_SERVICE_NOT_RUNNING
Table 10-17 OCN_NWDAF_CONFIGURATION_SERVICE_NOT_RUNNING
Field | Details |
---|---|
Description | The microservice is not available or not reachable. |
Cause | Microservice ocn-nwdaf-configuration-service is down. |
OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
Table 10-18 OCN_NWDAF_DATA_COLLECTION_NOT_RUNNING
Field | Details |
---|---|
Description | The microservice is not available or not reachable. |
Cause | Microservice ocn-nwdaf-data-collection is down. |
OCN_NWDAF_GATEWAY_NOT_RUNNING
Table 10-19 OCN_NWDAF_GATEWAY_NOT_RUNNING
Field | Details |
---|---|
Description | The microservice is not available or not reachable. |
Cause | Microservice ocn-nwdaf-gateway is down. |
OCN_NWDAF_MTLF_NOT_RUNNING
Table 10-20 OCN_NWDAF_MTLF_NOT_RUNNING
Field | Details |
---|---|
Description | The microservice is not available or not reachable. |
Cause | Microservice ocn-nwdaf-mtlf is down. |
OCN_NWDAF_SUBSCRIPTION_NOT_RUNNING
Table 10-21 OCN_NWDAF_SUBSCRIPTION_NOT_RUNNING
Field | Details |
---|---|
Description | The microservice is not available or not reachable. |
Cause | Microservice ocn-nwdaf-subscription is down. |
HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE
Table 10-22 HIGH_ABNORMAL_BEHAVIOUR_REQUEST_RATE
Field | Details |
---|---|
Description | The number of requests received per second is high. |
Cause | Traffic is high, above 1000 requests per second. |
URI Endpoint | nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR |
Affected Functions | ABNORMAL_BEHAVIOUR |
HIGH_UE_MOBILITY_REQUEST_RATE
Table 10-23 HIGH_UE_MOBILITY_REQUEST_RATE
Field | Details |
---|---|
Description | The number of requests received per second is high. |
Cause | Traffic is high, above 1000 requests per second. |
URI Endpoint | nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY |
Affected Functions | UE_MOBILITY |
HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE
Table 10-24 HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE
Field | Details |
---|---|
Description | The number of requests received per second is high. |
Cause | Traffic is high, above 1000 requests per second. |
URI Endpoint | nnwdaf-eventssubscription/v1/subscriptions |
Affected Functions | UE_MOBILITY, SLICE_LOAD_LEVEL, ABNORMAL_BEHAVIOUR |
HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
Table 10-25 HIGH_ABNORMAL_BEHAVIOUR_REQUEST_FAILURE_RATE
Field | Details |
---|---|
Description | The number of requests failing per second is high. |
Cause | The request failure rate is more than 70%. |
URI Endpoint | nnwdaf-analyticsinfo/v1/analytics?event-id=ABNORMAL_BEHAVIOUR |
Affected Functions | ABNORMAL_BEHAVIOUR |
HIGH_UE_MOBILITY_REQUEST_FAILURE_RATE
Table 10-26 HIGH_UE_MOBILITY_REQUEST_FAILURE_RATE
Field | Details |
---|---|
Description | The number of requests failing per second is high. |
Cause | The request failure rate is more than 70%. |
URI Endpoint | nnwdaf-analyticsinfo/v1/analytics?event-id=UE_MOBILITY |
Affected Functions | UE_MOBILITY |
HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE
Table 10-27 HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE
Field | Details |
---|---|
Description | The number of requests failing per second is high. |
Cause | The request failure rate is more than 70%. |
URI Endpoint | nnwdaf-eventssubscription/v1/subscriptions |
Affected Functions | UE_MOBILITY, SLICE_LOAD_LEVEL, ABNORMAL_BEHAVIOUR |
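For reference, the event subscription alerts in Tables 10-24 and 10-27 could be expressed by filling the request rate and failure rate templates from section 10.1 with the URI endpoint shown in those tables. This is a sketch under that assumption; the rules in the shipped `ocn-nwdaf-alerting-rules.yaml` file may differ in layout or thresholds.

```yaml
- name: HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE
  rules:
    - alert: HIGH_EVENT_SUBSCRIPTION_REQUEST_RATE
      # Fires above 1000 requests per second, measured over the last minute.
      expr: >
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-eventssubscription/v1/subscriptions"}[1m]))
        > 1000
- name: HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE
  rules:
    - alert: HIGH_EVENT_SUBSCRIPTION_REQUEST_FAILURE_RATE
      # Fires when more than 70% of requests return a 4xx or 5xx status.
      expr: >
        (sum without(method,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-eventssubscription/v1/subscriptions",status=~"[4-5].."}[1m]))
        / ignoring(status)
        sum without(method,status,outcome,exception,app,instance,container,pod,pod_template_hash)
        (rate(http_server_requests_seconds_count{uri="nnwdaf-eventssubscription/v1/subscriptions"}[1m])))
        * 100 > 70
```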