15 OCNADD KPIs, Metrics, and Alerts

This chapter provides information on OCNADD Metrics, KPIs, and Alerts.

15.1 OCNADD Metrics

This section includes information about Dimensions and Common Attributes of metrics for Oracle Communications Network Analytics Data Director (OCNADD).

Dimension Description

Table 15-1 Dimensions

Dimension Values Description
HttpVersion HTTP/2.0 Specifies HTTP protocol version.
Method GET, PUT, POST, DELETE, PATCH HTTP method.
Scheme HTTP, HTTPS, UNKNOWN Specifies the HTTP protocol scheme.
route_path NA Path predicate that matched the current request.
status NA HTTP response code.
quantile Integer values Captures the latency values with ranges of 10ms, 20ms, 40ms, 80ms, 100ms, 200ms, 500ms, 1000ms, and 5000ms.
instance_identifier Prefix configured in Helm, UNKNOWN Prefix of the pod configured in Helm when there are multiple instances in the same deployment.
3rd_party_consumer_name or consumer_name - Name of the 3rd-party consumer application as configured from the UI.
destination_endpoint IP/FQDN Destination IP address or FQDN.
processor_node_id - Stream processor node ID in aggregation service.
serviceId serviceType-N It is the identifier for the service instance used for registration with the health monitoring service.
serviceType CONSUMER_ADAPTER, CONFIGURATION, ALARM, AGGREGATION-NRF, AGGREGATION-SCP, AGGREGATION-SEPP, AGGREGATION-BSF, AGGREGATION-PCF, AGGREGATION-NON-ORACLE, OCNADD-ADMIN The ocnadd service type.
service ocnaddnrfaggregation, ocnaddseppaggregation, ocnaddscpaggregation, ocnaddbsfaggregation, etc. The name of the Data Director microservice service.
request_type HTTP2, H2C, TCP, TCP_SECURED Type of the data feed created using the UI; used to identify whether the feed is for HTTP2 or synthetic packets.
destination_endpoint URI It is the REST URI for the 3rd-party monitoring application configured on the data feed.
nf_feed_type SCP, NRF, PCF, BSF, SEPP The source NF for the feed.
error_reason - The error reason for the failure of the HTTP request sent to the 3rd-party application from the egress adapter.
correlation-id - Taken from correlation ID present in the metadata list.
way - It is taken from the message direction present in the metadata list.
srcIP - Taken from the source IP present in the metadata list, the global L3L4 configuration, or the least-priority address configured in the feed.
dstIP - Taken from the destination IP present in the metadata list, the global L3L4 configuration, or the least-priority address configured in the feed.
srcPort - Taken from the source port present in the metadata list, the global L3L4 configuration, or the least-priority address configured in the feed.
dstPort - Taken from the destination port present in the metadata list, the global L3L4 configuration, or the least-priority address configured in the feed.
MD - Indicates that the value is taken from the metadata list. It is attached to srcIP, dstIP, srcPort, or dstPort based on the mapping.
LP - Indicates that the value is taken from the least-priority address configured in the feed. It is attached to srcIP, dstIP, srcPort, or dstPort based on the mapping.
L3L4 - Indicates that the value is taken from the global L3L4 configuration. It is attached to srcIP, dstIP, srcPort, or dstPort based on the mapping.
worker_group String Name of the worker group in which the corresponding traffic processing service is running.
The following table includes information about common attributes for OCNADD:

Table 15-2 Attributes

Attribute Description
application The name of the application that the microservice is a part of.
microservice The name of the microservice.
namespace The Kubernetes namespace in which the microservice is running.
node The name of the worker node that the microservice is running on.
pod The name of the Kubernetes POD.

OCNADD Metrics

The following table lists important metrics related to OCNADD:

Table 15-3 Metrics

Metric Name Description Dimensions
kafka_stream_processor_node_process_total

The total number of records processed by a source node. The metric will be pegged by the aggregation services (SCP, SEPP, BSF, PCF and NRF) and egress adapter in the data director and denotes the total number of records consumed from the source topic.

Metric Type: Counter

Namespace will identify the worker group for the corresponding Kafka Cluster

  • application
  • container
  • service
  • namespace
  • processor_node_id
  • task_id
  • thread_id
  • name
  • microservice
kafka_stream_processor_node_process_rate

The average number of records processed per second by a source node.

The metric will be pegged by the aggregation services (SCP, SEPP, BSF, PCF and NRF) and egress adapter in the data director and denotes the records consumed per sec from the source topic.

Metric Type: Gauge

Namespace will identify the worker group for the corresponding Kafka Cluster

  • application
  • container
  • service
  • namespace
  • processor_node_id
  • task_id
  • thread_id
  • name
  • microservice
kafka_stream_task_dropped_records_total

The total number of records dropped within the stream processing task. The metric will be pegged by the aggregation services (SCP, SEPP, BSF, PCF and NRF) and egress adapter in the data director.

Metric Type: Counter

Namespace will identify the worker group for the corresponding Kafka Cluster

  • application
  • container
  • service
  • namespace
  • processor_node_id
  • task_id
  • thread_id
  • name
  • microservice
kafka_stream_task_dropped_records_rate

The average number of records dropped within the stream processing task. The metric will be pegged by the aggregation services (SCP, SEPP, BSF, PCF and NRF) and egress adapter in the data director.

Metric Type: Gauge

Namespace will identify the worker group for the corresponding Kafka Cluster

  • application
  • container
  • service
  • namespace
  • processor_node_id
  • task_id
  • thread_id
  • name
  • microservice
ocnadd_egress_requests_total

This metric will be pegged as soon as the request reaches the ocnadd egress adapter.

This metric pegs the count of total requests that are to be forwarded to the third-party application. This metric is used for the Egress MPS at DD.

Metric Type: Counter

  • method
  • instance_identifier
  • nf_feed_type
  • request_type
  • Third_party_consumer_name
  • worker_group
  • destination_endpoint
ocnadd_egress_responses_total

This metric is pegged by the ocnadd egress adapter service when a response is received at the egress adapter. It pegs the count of total responses (successful or failed) received from the third-party application.

Metric Type: Counter

  • method
  • status
  • instance_identifier
  • destination_endpoint
  • nf_feed_type
  • request_type
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_failed_request_total

This metric pegs the count of total requests that failed to be sent to the third-party application. It is pegged by the egress adapter service.

Metric Type: Counter

  • destination_endpoint
  • Third_party_consumer_name
  • instance_identifier
  • error-reason
  • nf_feed_type
  • request_type
  • worker_group
ocnadd_egress_e2e_request_processing_latency_seconds_bucket

The metric is pegged on the egress adapter service. It is the latency between the packet timestamp provided by producer NF and the egress adapter when the request packet is sent to the third-party application. This latency is calculated for each message.

Metric Type: Histogram

  • instance_identifier
  • quantile(or le)
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_e2e_request_processing_latency_seconds_sum

This is the sum of end-to-end request processing time for all the requests in seconds. It is the latency between the packet timestamp provided by the producer NF and the egress adapter when the packet is sent out.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_e2e_request_processing_latency_seconds_count

This is the count of batches of messages for which the processing time is summed up. It is the latency between the packet timestamp provided by producer NF and the egress adapter when the packet is sent out.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_service_request_processing_latency_seconds_bucket

The metric is pegged on the egress adapter service. It is the egress adapter service processing latency and is pegged when a request is sent out from the egress gateway to the third-party application. This latency is calculated for each of the messages.

Metric Type: Histogram

  • instance_identifier
  • quantile(or le)
  • worker_group
ocnadd_egress_service_request_processing_latency_seconds_sum

The metric is pegged on the egress adapter service. It is the sum of the egress adapter service processing time for all the requests in seconds and is pegged when a request is sent out from the egress adapter to the third-party application.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_service_request_processing_latency_seconds_count

The metric is pegged on the egress adapter service. It is the count for which the processing time is summed up and is pegged when a request is sent out from the egress adapter to the third-party application.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • worker_group
ocnadd_adapter_synthetic_packet_generation_count_total

This metric pegs the count of synthetic packets generated with either success or failed status.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • status
  • nf_feed_type
  • worker_group
ocnadd_egress_filtered_message_total

This metric pegs the count of messages that match the filter criteria for egress.

Metric Type: Counter

Filter criteria have been enhanced to provide a description with action and filter rules.

  • service_name
  • filter_name
  • filter_association_type
  • filter_criteria
  • worker_group
ocnadd_egress_unmatched_filter_message_total

This metric pegs the count of messages that do not match the filter criteria for egress.

Metric Type: Counter

Filter criteria have been enhanced to provide a description with action and filter rules.

  • service_name
  • filter_name
  • filter_association_type
  • filter_criteria
  • worker_group
ocnadd_ingress_filtered_message_total

This metric pegs the count of messages that match the filter criteria for ingress.

Metric Type: Counter

Filter criteria have been enhanced to provide a description with action and filter rules.

  • service_name
  • filter_name
  • filter_association_type
  • filter_criteria
  • worker_group
ocnadd_ingress_unmatched_filter_message_total

This metric pegs the count of messages that do not match the filter criteria for ingress.

Metric Type: Counter

Filter criteria have been enhanced to provide a description with action and filter rules.

  • service_name
  • filter_name
  • filter_association_type
  • filter_criteria
  • worker_group
ocnadd_health_total_alarm_raised_total

This metric will be pegged whenever a new alarm is raised to the alarm service from the Health Monitoring service.

Metric Type: Counter

  • serviceId
  • serviceType
ocnadd_health_total_alarm_cleared_total

This metric will be pegged whenever a clear alarm is invoked from the Health Monitoring service.

Metric Type: Counter

  • serviceId
  • serviceType
ocnadd_health_total_active_number_of_alarm_raised_total

This metric will be pegged whenever a raise or clear alarm is sent to the alarm service from the Health Monitoring service. It denotes the active alarms raised by the Health Monitoring service.

Metric Type: Counter

  • serviceId
  • serviceType
ocnadd_l3l4mapping_info_count_total

This metric will be pegged to provide information about L3L4 mapping in synthetic messages. By default, it is disabled in the chart.

Metric Type: Counter

  • correlation_id
  • dstIP
  • dstPort
  • srcIP
  • srcPort
  • way
  • service
  • worker_group
ocnadd_ext_kafka_feed_record_total

This metric will be pegged by the admin service to provide the total consumed messages by the external Kafka consumer application. The admin service retrieves the consumer offsets count from all the partitions of the aggregated topic and pegs the metric periodically.

Metric Type: Counter

  • feed_name
  • worker_group
ocnadd_data_export_failure_records_count

The metric will be pegged by the export service to provide the total number of records or messages that were not exported successfully.

Metric Type: Counter

  • filelocation
  • configurationname
  • correlationfeedname
  • reason
  • exporttype
  • namespace
ocnadd_xdr_database_records_sent

The metric will be pegged by the storage adapter service to provide the total number of xDRs sent to the XDR database. The xDRs will be pegged for the correlation service corresponding to the worker group.

Metric Type: Counter

  • app
  • worker_group
ocnadd_ingress_request_total

This metric will be pegged as soon as the request reaches the ocnadd ingress adapter.

This metric pegs the count of total requests received by the ocnadd ingress adapter from non-Oracle NFs. It is used for the Ingress MPS at OCNADD with respect to non-Oracle NFs.

Metric Type: Counter

  • method
  • scheme
  • http_version
  • instance_identifier
  • source_host
  • status
  • responsecode
  • error_reason
  • worker_group
ocnadd_ingress_message_processed_total

This metric will be pegged as soon as the request is processed at the ocnadd ingress adapter.

This metric pegs the count of total requests that were processed successfully, failed, or discarded by the ocnadd ingress adapter.

Metric Type: Counter

  • status
  • error_reason
  • worker_group
  • instance_identifier
ocnadd_ingress_service_request_processing_latency_seconds_sum

The metric is pegged on the ingress adapter service. It is the sum of the ingress adapter service processing latency in seconds and is pegged when a request is completely processed at the ingress adapter. This is pegged for each message.

Metric Type: Counter

  • instance_identifier
  • worker_group
ocnadd_ingress_service_request_processing_latency_seconds_count

The metric is pegged on the ingress adapter service. It is the cumulative count of the messages processed at the ingress adapter service instance.

Metric Type: Counter

  • instance_identifier
  • worker_group

15.2 OCNADD KPIs

This section provides information about Key Performance Indicators (KPIs) used for Oracle Communications Network Analytics Data Director (OCNADD).

Note:

The "namespace" in the KPIs should be updated to reflect the current namespace used in the Data Director deployment.

Ensure that the queries are tailored per worker group wherever applicable, such as for KPIs related to ingress and egress MPS, failure/success rate, packet drop, etc. Utilize the "worker_group" label to filter based on the worker group name in the KPI queries.

For queries, adhere to PromQL syntax for CNE-based deployments and MQL syntax for OCI-based deployments.
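
For example, the egress request rate KPI from Table 15-15 can be restricted to a single worker group by adding the "worker_group" label to the selector. The following PromQL is a minimal illustrative sketch, assuming a hypothetical worker group named "wg1" (replace it with the actual worker group name):

  sum by (instance_identifier)(rate(ocnadd_egress_requests_total{namespace="$NAMESPACE", worker_group="wg1"}[10m]))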

The following KPIs are added in OCNADD 25.1.200.

Table 15-4 ocnadd_ingress_record_count_by_service

KPI Detail Measures the total ingress records in Kafka source topics per aggregation service at the current time
Metric Used for the KPI (CNE) PromQL: sum by (service)(kafka_stream_processor_node_process_total{namespace="$NAMESPACE", service=~".*aggregation.*"})
Metric Used for the KPI (OCI) MQL: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*", k8Namespace="$NAMESPACE"}.groupby(microservice).sum()
Service Operation NA
Response Code NA

Table 15-5 ocnadd_ingress_record_count_total

KPI Detail Measures the total ingress records in Kafka source topics at the current time
Metric Used for the KPI (CNE) PromQL: sum (kafka_stream_processor_node_process_total{namespace="$NAMESPACE", service=~".*aggregation.*"})
Metric Used for the KPI (OCI) MQL: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*", k8Namespace="$NAMESPACE"}.sum()
Service Operation NA
Response Code NA

Table 15-6 ocnadd_ingress_mps_per_service_10mAgg

KPI Detail Measures the ingress MPS per service aggregated over 10min
Metric Used for the KPI (CNE) PromQL: sum by (service)(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))
Metric Used for the KPI (OCI) MQL: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupby(k8Namespace,microservice).sum()
Service Operation NA
Response Code NA

Table 15-7 ocnadd_ingress_mps_10mAgg

KPI Detail Measures the ingress MPS aggregated over 10min
Metric Used for the KPI (CNE) PromQL: sum(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))
Metric Used for the KPI (OCI) MQL: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*", k8Namespace="$NAMESPACE"}.rate().grouping().sum()
Service Operation NA
Response Code NA

Table 15-8 ocnadd_ingress_mps_per_service_10mAgg_last_24h

KPI Detail Measures the ingress MPS per service aggregated over 10min for the last 24 hours
Metric Used for the KPI (CNE) PromQL: sum by (service)(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))[24h:5m]
Metric Used for the KPI (OCI) MQL: No valid MQL equivalent is available
Service Operation NA
Response Code NA

Table 15-9 ocnadd_ingress_record_count_per_service_10mAgg_last_24h

KPI Detail Measures the ingress messages per service aggregated over 10min for the last 24 hours
Metric Used for the KPI (CNE) PromQL: sum by (service)(increase(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))[24h:5m]
Metric Used for the KPI (OCI) MQL: No valid MQL equivalent is available
Service Operation NA
Response Code NA

Table 15-10 ocnadd_kafka_ingress_record_drop_rate_10minAgg

KPI Detail Measures the total ingress message drop rate aggregated over 10min
Metric Used for the KPI (CNE) PromQL: sum(rate(kafka_stream_task_dropped_records_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))
Metric Used for the KPI (OCI) MQL: kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().grouping().sum()
Service Operation NA
Response Code NA

Table 15-11 ocnadd_kafka_ingress_record_drop_rate_per_service_10minAgg

KPI Detail Measures the total ingress message drop rate per service aggregated over 10min
Metric Used for the KPI (CNE) PromQL: sum(rate(kafka_stream_task_dropped_records_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m])) by (service,pod)
Metric Used for the KPI (OCI) MQL: kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupby(nodeName, microservice).sum()
Service Operation NA
Response Code NA

Table 15-12 ocnadd_egress_request_count_total_by_3rdparty_destination_endpoint

KPI Detail Total egress requests per third-party application per destination endpoint
Metric Used for the KPI (CNE) PromQL: sum by (instance_identifier,destination_endpoint)(ocnadd_egress_requests_total{namespace="$NAMESPACE"})
Metric Used for the KPI (OCI) MQL: ocnadd_egress_requests_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()
Service Operation NA
Response Code NA

Table 15-13 ocnadd_egress_response_count_total_by_3rdparty_destination_endpoint

KPI Detail Total egress responses per third-party application per destination endpoint
Metric Used for the KPI (CNE) PromQL: sum by (instance_identifier,destination_endpoint)(ocnadd_egress_responses_total{namespace="$NAMESPACE"})
Metric Used for the KPI (OCI) MQL: ocnadd_egress_responses_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()
Service Operation NA
Response Code NA

Table 15-14 ocnadd_egress_failure_count_total_by_3rdparty_destination_endpoint

KPI Detail Total egress failure count per third-party application per destination endpoint
Metric Used for the KPI (CNE) PromQL: sum by (instance_identifier,destination_endpoint)(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"})
Metric Used for the KPI (OCI) MQL: ocnadd_egress_failed_request_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()
Service Operation NA
Response Code NA

Table 15-15 ocnadd_egress_request_rate_by_3rdparty_10mAgg

KPI Detail Total egress request rate per third-party application in 10min Aggregation
Metric Used for the KPI (CNE) PromQL: sum by (instance_identifier)(rate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[10m]))
Metric Used for the KPI (OCI) MQL: ocnadd_egress_requests_total[10m]{app=~"*adapter*"}.rate().groupby(worker_group,app).sum()
Service Operation NA
Response Code NA

Table 15-16 ocnadd_egress_failure_rate_by_3rdparty_10mAgg

KPI Detail Total egress failure rate per third-party application in 10min Aggregation
Metric Used for the KPI (CNE)

PromQL: sum by (instance_identifier)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[10m]))

/

sum by (instance_identifier) (irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[10m]))

Metric Used for the KPI (OCI)

MQL: ocnadd_egress_failed_request_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier).sum()

/

ocnadd_egress_requests_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier).sum()

Service Operation NA
Response Code NA
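
The ratio above yields a fraction between 0 and 1. If a percentage value is preferred, the same expression can be scaled by 100; the following PromQL is an illustrative sketch rather than a predefined KPI:

  100 * sum by (instance_identifier)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[10m]))
  /
  sum by (instance_identifier)(irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[10m]))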

Table 15-17 ocnadd_egress_failure_rate_by_3rdparty_per_destination_endpoint_10mAgg

KPI Detail Total egress failure rate per third-party application per destination endpoint in 10min Aggregation
Metric Used for the KPI (CNE)

PromQL: sum by (instance_identifier, destination_endpoint)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[10m]))

/

sum by (instance_identifier, destination_endpoint) (irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[10m]))

Metric Used for the KPI (OCI)

MQL: ocnadd_egress_failed_request_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()

/

ocnadd_egress_requests_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()

Service Operation NA
Response Code NA

Table 15-18 ocnadd_e2e_avg_latency_by_3rdparty

KPI Detail Total e2e average latency per third-party application in 10min Aggregation
Metric Used for the KPI (CNE)

PromQL: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[10m])) by (instance_identifier)

/

(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[10m])) by (instance_identifier)))

Metric Used for the KPI (OCI) MQL: ocnadd_egress_service_request_processing_latency_seconds_sum[10m]{app=~"*adapter*"}.rate().groupby(app, worker_group).sum() / ocnadd_egress_service_request_processing_latency_seconds_count[10m]{app=~"*adapter*"}.rate().groupby(app, worker_group).sum()
Service Operation NA
Response Code NA

Table 15-19 ocnadd_e2e_avg_latency_by_3rdparty_per_adapter_pod

KPI Detail Total e2e average latency per third-party application per egress adapter POD in 10min aggregation
Metric Used for the KPI (CNE)

PromQL: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[10m])) by (instance_identifier,pod)

/

(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[10m])) by (instance_identifier,pod)))

Metric Used for the KPI (OCI)

MQL: (ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m]{app=~"*adapter*"}.rate().groupBy(worker_group,app).sum()

/

ocnadd_egress_e2e_request_processing_latency_seconds_count[10m]{app=~"*adapter*"}.rate().groupBy(worker_group,app).sum())

Service Operation NA
Response Code NA

Table 15-20 ocnadd_egress_adapter_processing_avg_latency_by_3rdparty_per_adapter_pod

KPI Detail Total service processing average latency per third-party application per adapter POD in 10min aggregation
Metric Used for the KPI (CNE)

PromQL: (sum (irate(ocnadd_egress_service_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[10m])) by (instance_identifier,pod)

/

(sum (irate(ocnadd_egress_service_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[10m])) by (instance_identifier,pod)))

Metric Used for the KPI (OCI)

MQL: ocnadd_egress_service_request_processing_latency_seconds_sum[10m]{app=~"*adapter*"}.rate().groupby(app, worker_group).sum()

/

ocnadd_egress_service_request_processing_latency_seconds_count[10m]{app=~"*adapter*"}.rate().groupby(app, worker_group).sum()

Service Operation NA
Response Code NA

Table 15-21 ocnadd_egress_e2e_avg_latency_buckets

KPI Detail The latency buckets for the feed in a worker group namespace
Metric Used for the KPI (CNE) PromQL: sum(rate(ocnadd_egress_e2e_request_processing_latency_seconds_bucket{app=~".*adapter.*"}[10m])) by (le,namespace,service)
Metric Used for the KPI (OCI) MQL: (ocnadd_egress_e2e_request_processing_latency_seconds_bucket[10m]{app=~"*adapter*"}.rate().groupby(k8Namespace,app,le).sum())
Service Operation NA
Response Code NA
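
The same bucket metric can also be used to approximate latency percentiles. The following PromQL is an illustrative sketch (not a predefined KPI) that estimates the 95th percentile end-to-end latency per feed using the standard histogram_quantile() function:

  histogram_quantile(0.95, sum(rate(ocnadd_egress_e2e_request_processing_latency_seconds_bucket{app=~".*adapter.*"}[10m])) by (le,namespace,service))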

Table 15-22 ocnadd_ext_kafka_feed_record_total per external feed rate(MPS)

KPI Detail The rate of messages consumed per sec per external Kafka consumer, calculated over a period of 5min
Metric Used for the KPI (CNE) PromQL: sum(rate(ocnadd_ext_kafka_feed_record_total{namespace="$Namespace"}[5m])) by (feed_name)
Metric Used for the KPI (OCI) MQL: ocnadd_ext_kafka_feed_record_total[10m].rate().groupby(k8Namespace,feed_name).sum()
Service Operation NA
Response Code NA
Memory Usage per POD

This KPI should be used in the context of both the management group and the worker group. The namespaces may differ for the management and worker groups if there is no default worker group.

Table 15-23 Memory Usage per POD

KPI Detail Measures the memory usage per POD
Metric Used for the KPI (CNE) PromQL: sum(container_memory_working_set_bytes{namespace=~"$Namespace",image!=""}/(1024*1024*1024)) by (pod)
Metric Used for the KPI (OCI) MQL: (container_memory_working_set_bytes[10m]{container=~"*ocnadd*|*zookeeper*|*kafka*|*adapter*|corr*"}.groupby(namespace,pod).mean())/1000000
Service Operation NA
Response Code NA
CPU Usage per POD

This KPI should be used in the context of both the management group and the worker group. The namespaces may differ for the management and worker groups if there is no default worker group.

Table 15-24 CPU Usage per POD

KPI Detail Measures the CPU usage per POD
Metric Used for the KPI (CNE) PromQL: sum(rate(container_cpu_usage_seconds_total{namespace=~"$Namespace",image!=""}[2m])) by (pod) * 1000
Metric Used for the KPI (OCI) MQL: container_cpu_usage_seconds_total[10m]{pod=~"*ocnadd*|*kafka*|*zookeeper*|*adapter*|*corr*|*export*"}.rate().groupby(namespace,pod).sum()
Service Operation NA
Response Code NA
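
If the namespace contains pods that do not belong to the Data Director, the CNE query can be restricted by pod name, similar to the pod patterns used in the MQL query above. The following PromQL is an illustrative sketch; the pod name patterns are assumptions and should be adjusted to match the actual deployment:

  sum(rate(container_cpu_usage_seconds_total{namespace=~"$Namespace", pod=~".*ocnadd.*|.*kafka.*|.*zookeeper.*|.*adapter.*|.*corr.*|.*export.*", image!=""}[2m])) by (pod) * 1000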
Service Status

Table 15-25 Service Status

KPI Detail Provides the status of each Data Director service running in the specified namespace
Metric Used for the KPI (CNE) PromQL: up{namespace="$NAMESPACE"}
Metric Used for the KPI (OCI) MQL: podStatus[10m]{podOwner=~"*adapter*|*ocnadd*|*kafka*|*zookeeper*|corr*|*export*"}.groupby(clusterNamespace,podName).mean()
Service Operation NA
Response Code NA
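
To check the status of a single Data Director service, the same metric can be filtered by the service name. The following PromQL is an illustrative sketch using the configuration service as an example:

  up{namespace="$NAMESPACE", service="ocnaddconfiguration"}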

15.3 OCNADD Alerts

This section provides information on Oracle Communications Network Analytics Data Director (OCNADD) alerts and their configuration.

Alerts Interpretation

The following table defines the alerts severity interpretation based on the infrastructure.

Table 15-26 Alerts Interpretation

CNE OCI
Critical Critical
Major Error
Minor Error
Warning Warning
Info Info

Note:

Alert OIDs are deprecated for OCI deployments.

15.3.1 OCNADD Alert Configuration

This section describes how to configure alerts in OCNADD.

OCNADD on OCCNE

If OCNADD is deployed on the OCCNE setup, all services are monitored by Prometheus by default, and no modifications to the Helm charts are required. Update all Prometheus Alert Rules present in the Helm chart.

Note:

The label used to update the Prometheus Server is "role: cnc-alerting-rules," which is added by default in helm charts.

OCNADD on OCI

Alerts on OCI are made available by the OCI Alarm service. The monitoring service on OCI fetches metrics from OCNADD services, and the Alarm service triggers alarms when the defined threshold is breached. Metrics on OCI are fetched using MQL, and MQL queries are used in the Alarm template on OCI. Alarms can be created using the OCI GUI. OCNADD provides a Terraform script to create supported alarms on OCI:

  1. Extract the Terraform script provided in the OCNADD package under <release-name>/custom-templates/oci/terraform.
  2. Follow these steps:
    1. Log in to the OCI console.
    2. Click Hamburger menu and select Developer Services.
    3. Under Developer Services, select Resource Manager.
    4. Under Resource Manager, select Stacks.
    5. Click Create stack button.
    6. Select the default My Configuration radio button.
    7. Under Stack configuration, click on the folder radio button and upload the Terraform package <release-name>/custom-templates/oci/terraform.
    8. Enter the Name and Description and select the compartment.
    9. Click Next.
  3. Provide appropriate values for the parameters requested in the Terraform script for the following configuration sections:

    Tenancy Configuration

    Metric Namespace Configuration

    Notification Configuration

    Alerts Configuration

    Thresholds

OCNADD supports alarm subscription through email on OCI. Note the following important points when configuring alarms:

  1. Alarm Categories in OCI: Alarms in OCI are categorized into critical, warning, info, and error. Note that the error category is not available in Prometheus alert rules. Therefore, alarms with severity minor and major in Prometheus are converted to error in OCI. For more information, see OCI Alert Template.
  2. Notification and Topic Setup: During the execution of the Terraform script, notifications and topics for the alerts will be automatically created.
  3. User Modification/Deletion: If users need to create new alarms or modify and delete the alarms added through Terraform, they can perform these actions by editing the corresponding alarm definitions through the OCI Console.
  4. OCI Notification Reference: For more information on OCI Notification, see OCI Notification.

OCNADD Configuration When Prometheus is Deployed Without Operator

This section covers the steps to follow when Prometheus is deployed without Operator support (occne-nf-cnc-servicemonitor service), in order to receive all metrics on the OCNADD UI.

  1. Changes in Custom Values of Management Group:
    PROMETHEUS_API: http://<prometheus-service-name>.<prometheus-namespace>.svc.<cluster-domain>:80
    # Replace the placeholders with correct information. 
    # Example: PROMETHEUS_API: http://occne-kube-prom-stack-kube-prometheus.occne-infra.ocnadd:80
    
    DD_PROMETHEUS_PATH: /prometheus/api/v1/query_range
    # Replace the default DD_PROMETHEUS_PATH with this
    
  2. Add Prometheus Annotations in All Deployments and StatefulSets:
    Steps to Update Annotations in All Deployments and StatefulSets:
    1. Run: kubectl edit deployment <deployment-name> -n <namespace>

      Add the Prometheus annotations as shown below to the respective deployments:

      Edit Deployment Example for Adapter

      Before:
      ...
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: app-1-adapter
            role: adapter
      ...
      
      After Adding Annotations:
      ...
      template:
        metadata:
          annotations:                                  # Add these Prometheus annotations to charts
            prometheus.io/path: /actuator/prometheus
            prometheus.io/port: "9000"
            prometheus.io/scrape: "true"
          creationTimestamp: null
      ...
      
    2. Edit the chart files at the specified locations (paths mentioned in the table below) to include the same Prometheus annotations, ensuring changes persist during upgrades.
    3. Verification of Changes:
      1. Run the following to verify annotations are applied:
        kubectl describe deployments.apps -n <namespace> app-1-adapter | grep "prometheus"
        
        Expected Output:
        Annotations:      prometheus.io/path: /actuator/prometheus
                          prometheus.io/port: 9000
                          prometheus.io/scrape: true
        
      2. Verify metrics availability in Prometheus.

      3. Confirm "ACTIVE" status of feeds on the DD UI when traffic is successfully flowing.
        Confirm "ACTIVE" status of feeds on the DD UI when traffic is successfully flowing.

Chart paths for adding annotations manually:

Services Path
kafka ocnadd/charts/ocnaddkafka/templates/ocnaddkafkaBroker.yaml
zookeeper ocnadd/charts/ocnaddkafka/templates/ocnadd-zookeeper.yaml
admin svc ocnadd/charts/ocnaddadminsvc/templates/ocnaddadminservice.yaml
correlation svc ocnadd/charts/ocnaddadminsvc/templates/correlation-deploy.yaml
storage adapter ocnadd/charts/ocnaddadminsvc/templates/ocnaddstorageadapter-deploy.yaml
consumer adapter ocnadd/charts/ocnaddadminsvc/templates/ocnaddingressadapter-deploy.yaml
alarm svc ocnadd/charts/ocnaddalarm/templates/ocnadd-alarm.yaml
configuration svc ocnadd/charts/ocnaddconfiguration/templates/ocnadd-configuration.yaml
healthmonitoring svc ocnadd/charts/ocnaddhealthmonitoring/templates/ocnadd-health.yaml
aggregation svc ocnadd/charts/ocnaddaggregation/templates/ocnadd-<NF>aggregation.yaml (NF - scp,sepp,pcf,nrf,bsf)
export svc ocnadd/charts/ocnaddexport/templates/ocnadd-export.yaml

15.3.2 List of Alerts

This section provides detailed information about the alert rules defined for OCNADD.

15.3.2.1 System Level Alerts

This section lists the system level alerts for OCNADD.

Table 15-27 OCNADD_POD_CPU_USAGE_ALERT

Field Details
Triggering Condition POD CPU usage is above the set threshold (default 85%)
Severity Major
Description OCNADD Pod High CPU usage detected for a continuous period of 5min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % '

PromQL Expression:

expr:

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*aggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*6) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kraft.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*correlation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*filter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*configuration.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*admin.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*health.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*alarm.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ui.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*export.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*4) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*storageadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ingressadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3)

Alert Details OCI

Summary:

Alarm "OCNADD_POD_CPU_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>=85||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*85||pod_cpu_usage_seconds_total[10m]{pod=~"*corr*"}.rate().groupby(namespace,pod).sum()*100>=3*85||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85", with a trigger delay of 1 minute

where, X = FIRING/OK,

n = Different services that violated the rule.

MQL Expression:

pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>={{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"corr*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}

Note: The CPU Threshold value is assigned while executing the Terraform script.

OID 1.3.6.1.4.1.323.5.3.53.1.29.4002
Metric Used

container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the CPU utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd-custom-values.yaml file. If guidance is required, contact My Oracle Support.

Table 15-28 OCNADD_POD_MEMORY_USAGE_ALERT

Field Details
Triggering Condition POD Memory usage is above the set threshold (default 90%)
Severity Major
Description OCNADD Pod High Memory usage detected for a continuous period of 5min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % '

PromQL Expression:

expr:

(sum(container_memory_working_set_bytes{image!="" , pod=~".*aggregation.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kraft.*"}) by (pod,namespace) > 1*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*filter.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*adapter.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*correlation.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*configuration.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*admin.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*health.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*alarm.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ui.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*export.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*storageadapter.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ingressadapter.*"}) by (pod,namespace) > 8*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100)

Alert Details OCI

Summary:

Alarm "OCNADD_POD_MEMORY_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "container_memory_usage_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>=90", with a trigger delay of 1 minute

where, X = FIRING/OK,

n = Different services that violated the rule.

MQL Expression:

container_memory_usage_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>={{ Memory Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.4005
Metric Used

container_memory_working_set_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the memory utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd_custom_values.yaml file. If guidance is required, contact My Oracle Support.

Table 15-29 OCNADD_POD_RESTARTED

Field Details
Triggering Condition A POD has restarted
Severity Minor
Description A POD has restarted in the last 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted'

PromQL Expression:

expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"} > 1

Alert Details OCI

MQL Expression:

No MQL equivalent is available

OID 1.3.6.1.4.1.323.5.3.53.1.29.5006
Metric Used

kube_pod_container_status_restarts_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically if the specific pod is up.

Steps:

1. Check the application logs. Check for database related failures such as connectivity, Kubernetes secrets, and so on.

2. Run the following command to check orchestration logs for liveness or readiness probe failures:

kubectl get po -n <namespace>

Note the full name of the pod that is not running, and use it in the following command:

kubectl describe pod <desired full pod name> -n <namespace>

3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide".

4. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

15.3.2.2 Application Level Alerts

This section lists the application level alerts for OCNADD.

Table 15-30 OCNADD_CONFIG_SVC_DOWN

Field Details
Triggering Condition The configuration service went down or is not accessible
Severity Critical
Description OCNADD Configuration service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddconfiguration service is down'

PromQL Expression:

expr: up{service="ocnaddconfiguration"} != 1

Alert Details OCI

Summary:

Alarm "OCNADD_CONFIG_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddconfiguration"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddconfiguration"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.20.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Configuration service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-31 OCNADD_ALARM_SVC_DOWN

Field Details
Triggering Condition The alarm service went down or is not accessible
Severity Critical
Description OCNADD Alarm service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddalarm service is down'

PromQL Expression:

expr: up{service="ocnaddalarm"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_ALARM_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddalarm"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddalarm"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.24.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Alarm service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-32 OCNADD_HEALTH_MONITORING_SVC_DOWN

Field Details
Triggering Condition The health monitoring service went down or is not accessible
Severity Critical
Description OCNADD Health monitoring service not available for more than 2 min
Alert Details CNE

Summary:

summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddhealthmonitoring service is down'

PromQL Expression:

expr: up{service="ocnaddhealthmonitoring"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_HEALTH_MONITORING_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddhealthmonitoring"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddhealthmonitoring"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.28.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Health monitoring service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-33 OCNADD_SCP_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The SCP Aggregation service went down or is not accessible
Severity Critical
Description OCNADD SCP Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddscpaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddscpaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_SCP_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddscpaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddscpaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.22.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD SCP Aggregation service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-34 OCNADD_NRF_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The NRF Aggregation service went down or is not accessible
Severity Critical
Description OCNADD NRF Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnrfaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddnrfaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_NRF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddnrfaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddnrfaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.31.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD NRF Aggregation service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-35 OCNADD_SEPP_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The SEPP Aggregation service went down or is not accessible
Severity Critical
Description OCNADD SEPP Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddseppaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddseppaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_SEPP_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddseppaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddseppaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.32.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD SEPP Aggregation service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-36 OCNADD_BSF_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The BSF Aggregation service went down or not accessible
Severity Critical
Description OCNADD BSF Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddbsfaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddbsfaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_BSF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddbsfaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddbsfaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.40.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD BSF Aggregation service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-37 OCNADD_PCF_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The PCF Aggregation service went down or not accessible
Severity Critical
Description OCNADD PCF Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddpcfaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddpcfaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_PCF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddpcfaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddpcfaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.41.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD PCF Aggregation service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-38 OCNADD_NON_ORACLE_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The Non Oracle Aggregation service went down or not accessible
Severity Critical
Description OCNADD Non Oracle Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnonoracleaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddnonoracleaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_NON_ORACLE_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddnonoracleaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddnonoracleaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.37.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Non Oracle Aggregation service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-39 OCNADD_ADMIN_SVC_DOWN

Field Details
Triggering Condition The OCNADD Admin service went down or not accessible
Severity Critical
Description OCNADD Admin service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddadminservice service is down'

PromQL Expression:

expr: up{service="ocnaddadminservice"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_ADMIN_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddadminservice"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddadminservice"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.30.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Admin service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-40 OCNADD_CONSUMER_ADAPTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Consumer Adapter service went down or not accessible
Severity Critical
Description OCNADD Consumer Adapter service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Consumer Adapter service is down'

PromQL Expression:

expr: up{service=~".*adapter.*", role="adapter"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_CONSUMER_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="correlation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner=~"adapter*"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.25.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Consumer Adapter service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-41 OCNADD_FILTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Filter service went down or not accessible
Severity Critical
Description OCNADD Filter service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Filter service is down'

PromQL Expression:

expr: up{service=~".*filter.*"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_FILTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddfilter"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddfilter"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.34.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Filter service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-42 OCNADD_CORRELATION_SVC_DOWN

Field Details
Triggering Condition The OCNADD Correlation service went down or not accessible
Severity Critical
Description OCNADD Correlation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Correlation service is down'

PromQL Expression:

expr: up{service=~".*correlation.*"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_CORRELATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="correlation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="correlation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.33.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Correlation service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-43 OCNADD_EXPORT_SVC_DOWN

Field Details
Triggering Condition The OCNADD Export service went down or not accessible
Severity Critical
Description OCNADD Export service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Export service is down'

PromQL Expression:

expr: up{service=~".*export.*"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_EXPORT_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="export"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="export"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.39.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD export service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-44 OCNADD_STORAGE_ADAPTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Storage adapter service went down or not accessible
Severity Critical
Description OCNADD Storage adapter service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Storage adapter service is down'

PromQL Expression:

expr: up{service=~".*storage-adapter.*", role="storageAdapter"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_STORAGE_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="storageadapter"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="storageadapter"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.38.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Storage adapter service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-45 OCNADD_INGRESS_ADAPTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Ingress Adapter service went down or not accessible
Severity Critical
Description OCNADD Ingress Adapter service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress Adapter service is down'

PromQL Expression:

expr: up{service=~".*ingress-adapter.*", role="ingressadapter"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_INGRESS_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ingressadapter"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ingressadapter"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.36.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Ingress Adapter service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-46 OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the warning threshold of 80% of the supported MPS
Severity Warn
Description Total Ingress Message Rate is above the configured warning threshold (80%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.8*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.8*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.8*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5007
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the warning threshold level of 80%.
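To see how close the current ingress rate is to the 80%, 90%, 95%, and 100% tiers used by the OCNADD_MPS_*_INGRESS_THRESHOLD_CROSSED alerts, the documented PromQL expression can be evaluated directly against the Prometheus HTTP API. The following is a minimal sketch; the Prometheus URL is an assumption, and the comparison value must be the rendered {{ .Values.global.cluster.mps }} for your cluster.

# Sketch: query the ingress message rate behind the MPS ingress threshold alerts.
# PROM_URL is an assumption; point it at the Prometheus instance scraping OCNADD.
PROM_URL="http://prometheus:9090"
QUERY='sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
# Compare the returned per-namespace value with 0.8, 0.9, 0.95, and 1.0 times the
# configured cluster MPS ({{ .Values.global.cluster.mps }}).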

Table 15-47 OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the minor threshold alert level of 90% of the supported MPS
Severity Minor
Description Total Ingress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.9*{{ .Values.global.cluster.mps }}

Alert Details OCI

Summary:

Alarm "OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.9*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.9*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5008
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90%.

Table 15-48 OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the major threshold alert level of 95% of the supported MPS
Severity Major
Description Total Ingress Message Rate is above the configured major threshold alert (95%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }}

Alert Details OCI

Summary:

Alarm "OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>0.95*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>0.95*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5009
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95%.

Table 15-49 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity Critical
Description Total Ingress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.36.5010
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

Table 15-50 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity Critical
Description Total Ingress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5010
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

Table 15-51 OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the warning threshold alert level of 80% of the supported MPS
Severity Warn
Description The total Egress Message Rate is above the configured warning threshold alert (80%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80% of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.80*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.80*{{ MPS Threshold }}"", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.80*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5011
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the warning threshold alert level of 80% of supported MPS
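The egress-side thresholds can be checked in the same way using ocnadd_egress_requests_total. A minimal sketch follows, assuming Prometheus is reachable at the URL shown and that jq is available for flattening the API response; both are assumptions, not OCNADD requirements.

# Sketch: current egress message rate per namespace (compare with 0.80–1.0 x MPS).
PROM_URL="http://prometheus:9090"   # assumption
QUERY='sum(irate(ocnadd_egress_requests_total[5m])) by (namespace)'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[] | "\(.metric.namespace) \(.value[1])"'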

Table 15-52 OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the minor threshold alert level of 90% of the supported MPS
Severity Minor
Description The total Egress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90% of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.90*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.90*{{ MPS Threshold }}"", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.90*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5012
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90% of supported MPS

Table 15-53 OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the major threshold alert level of 95% of the supported MPS
Severity Major
Description The total Egress Message Rate is above the configured major threshold alert (95%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95% of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.95*{{ MPS Threshold }}"", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.95*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5013
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95% of supported MPS

Table 15-54 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity Critical
Description The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}"", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5014
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

Table 15-55 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER

Field Details
Triggering Condition The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS for a consumer
Severity Critical
Description The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (rate(ocnadd_egress_requests_total[5m])) by (namespace, instance_identifier) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0 {{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5015
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

Table 15-56 OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured warning threshold alert level of 80%
Severity Warn
Description Average E2E Latency is above the configured warning threshold alert level (80%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 80% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

PromQL Expression:

expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .80*{{ .Values.global.cluster.max_latency }} <= .90*{{ .Values.global.cluster.max_latency }}

Alert Details OCI

Summary:

Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.040&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.045", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.040&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.045

Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05

OID 1.3.6.1.4.1.323.5.3.53.1.29.5016
Metric Used ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the warning threshold alert level of 80% of permissible latency
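The average end-to-end latency that the OCNADD_E2E_AVG_RECORD_LATENCY_* alerts compare against {{ .Values.global.cluster.max_latency }} is the ratio of the latency sum and count rates. The following is a minimal sketch of the same calculation against the Prometheus HTTP API; the URL is an assumption.

# Sketch: current average E2E latency in seconds, per namespace.
PROM_URL="http://prometheus:9090"   # assumption
QUERY='(sum(irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) / (sum(irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace))'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
# The warning, minor, major, and critical tiers fire at 80%, 90%, 95%, and 100% of max_latency.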

Table 15-57 OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured minor threshold alert level of 90%
Severity Minor
Description Average E2E Latency is above the configured minor threshold alert level (90%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 90% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

PromQL Expression:

expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .90*{{ .Values.global.cluster.max_latency }} <= 0.95*{{ .Values.global.cluster.max_latency }}

Alert Details OCI

Summary:

Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.045&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.0475", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.045&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.0475

Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05

OID 1.3.6.1.4.1.323.5.3.53.1.29.5017
Metric Used ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the minor threshold alert level of 90% of permissible latency

Table 15-58 OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured major threshold alert level of 95%
Severity Major
Description Average E2E Latency is above the configured major threshold alert level (95%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 95% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

PromQL Expression:

expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .95*{{ .Values.global.cluster.max_latency }} <= 1.0*{{ .Values.global.cluster.max_latency }}

Alert Details OCI

Summary:

Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.0475&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.05", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.0475&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.05

Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05

OID 1.3.6.1.4.1.323.5.3.53.1.29.5018
Metric Used ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the major threshold alert level of 95% of permissible latency

Table 15-59 OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured critical threshold alert level of 100%
Severity Critical
Description Average E2E Latency is above the configured critical threshold alert level (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 100% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

PromQL Expression:

expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > 1.0*{{ .Values.global.cluster.max_latency }}

Alert Details OCI

Summary:

Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.05", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.05

Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05

OID 1.3.6.1.4.1.323.5.3.53.1.29.5019
Metric Used ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the critical threshold alert level of permissible latency

Table 15-60 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS

Field Details
Triggering Condition The packet drop rate in Kafka streams is above the configured major threshold of 1% of the total supported MPS
Severity Major
Description The packet drop rate in Kafka streams is above the configured major threshold of 1% of total supported MPS in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 1% threshold of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.01*{{ .Values.global.cluster.mps }}

Alert Details OCI

Summary:

Alarm "OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100> {{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100> {{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5020
Metric Used kafka_stream_task_dropped_records_total
Resolution The alert is cleared automatically when the packet drop rate goes below the major threshold (1%) alert level of supported MPS
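The dropped-record rate behind the OCNADD_KAFKA_PACKET_DROP_THRESHOLD_* alerts can be inspected with the same kind of API query; it should stay below 1% (major) and 10% (critical) of the supported MPS. A minimal sketch with an assumed Prometheus URL:

# Sketch: Kafka stream dropped-record rate, per namespace.
PROM_URL="http://prometheus:9090"   # assumption
QUERY='sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace)'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
# Compare the result with 0.01 and 0.1 times the configured cluster MPS.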

Table 15-61 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS

Field Details
Triggering Condition The packet drop rate in Kafka streams is above the configured critical threshold of 10% of the total supported MPS
Severity Critical
Description The packet drop rate in Kafka streams is above the configured critical threshold of 10% of total supported MPS in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 10% threshold of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.1*{{ .Values.global.cluster.mps }}

Alert Details OCI

Summary:

Alarm "OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100>10*{{MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100>10*{{MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5021
Metric Used kafka_stream_task_dropped_records_total
Resolution The alert is cleared automatically when the packet drop rate goes below the critical threshold (10%) alert level of supported MPS

Table 15-62 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_0.1PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 0.1% of the total supported MPS
Severity Info
Description Egress external connection failure rate towards 3rd party application is crossing the info threshold of 0.1% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 0.1 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 0.1 < 1

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_01PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=0.1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<1", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=0.1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<1

OID 1.3.6.1.4.1.323.5.3.53.1.29.5022
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (0.1%) alert level of supported MPS
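The egress failure-rate tiers (0.1%, 1%, 10%, 25%, and 50%) all derive from the same percentage of failed versus total egress requests. The following is a minimal sketch that evaluates that percentage directly; the Prometheus URL is an assumption.

# Sketch: egress failure rate towards third-party consumers, in percent.
PROM_URL="http://prometheus:9090"   # assumption
QUERY='(sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace)) / (sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) * 100'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
# The result maps onto the Info/Warn/Minor/Major/Critical tiers in Tables 15-62 to 15-66.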

Table 15-63 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the third-party application is above the configured threshold of 1% of the total supported MPS
Severity Warn
Description Egress external connection failure rate towards 3rd party application is crossing the warning threshold of 1% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 1 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 1 < 10

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<10", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<10

OID 1.3.6.1.4.1.323.5.3.53.1.29.5023
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (1%) alert level of supported MPS

Table 15-64 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the third-party application is above the configured threshold of 10% of the total supported MPS
Severity Minor
Description Egress external connection failure rate towards 3rd party application is crossing a minor threshold of 10% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 10 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 10 < 25

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=10&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<25", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=10&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<25

OID 1.3.6.1.4.1.323.5.3.53.1.29.5024
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (10%) alert level of supported MPS

Table 15-65 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the third-party application is above the configured threshold of 25% of the total supported MPS
Severity Major
Description Egress external connection failure rate towards 3rd party application is crossing the major threshold of 25% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 25 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 25 < 50

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=25&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<50", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=25&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<50

OID 1.3.6.1.4.1.323.5.3.53.1.29.5025
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (25%) alert level of supported MPS

Table 15-66 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 50% of the total supported MPS
Severity Critical
Description Egress external connection failure rate towards 3rd party application is crossing the critical threshold of 50% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 50 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 50

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=50", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=50

OID 1.3.6.1.4.1.323.5.3.53.1.29.5026
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the 50% threshold alert level of the supported MPS

Table 15-67 OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT

Field Details
Triggering Condition The ingress traffic increase is more than 10% of the supported MPS
Severity Major
Description The ingress traffic increase is more than 10% of the supported MPS in the last 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS increase is more than 10 Percent of current supported MPS'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) >= 1.1

Alert Details OCI Not Available
OID 1.3.6.1.4.1.323.5.3.53.1.29.5027
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the increase in MPS comes back to lower than 10% of the supported MPS
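
As an illustration of the expression above (the numbers are hypothetical): if the aggregation services processed an average of 110,000 messages per second over the last 5 minutes, while the average for the preceding 5-minute window (the "offset 5m" term) was 100,000 messages per second, the ratio evaluates to 1.1 and the alert is raised.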

Table 15-68 OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT

Field Details
Triggering Condition The ingress traffic decrease is more than 10% of the supported MPS
Severity Major
Description The ingress traffic decrease is more than 10% of the supported MPS in the last 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS decrease is more than 10 Percent of current supported MPS'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) <= 0.9

Alert Details OCI Not Available
OID 1.3.6.1.4.1.323.5.3.53.1.29.5028
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the decrease in MPS comes back to lower than 10% of the supported MPS

Table 15-69 OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED

Field Details
Triggering Condition The total transaction success xDR rate has dropped below the critical threshold alert level of 90%
Severity Critical
Description The total transaction success xDR rate has dropped below the critical threshold alert level of 90% for a period of 5 minutes
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Transaction Success Rate is below 90% per hour:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(ocnadd_total_transactions_total{service=~".*correlation.*",status="SUCCESS"}[5m]))by (namespace,service) / sum(irate(ocnadd_total_transactions_total{service=~".*correlation.*"}[5m]))by (namespace,service) *100 < 90

Alert Details OCI

Summary:

Alarm "OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_total_transactions_total[10m]{status="SUCCESS",app=~"*corr*"}.rate().groupBy(workername,app).sum()/ocnadd_total_transactions_total[10m]{app=~"*corr*"}.rate().groupBy(workername,app).sum()*100<90", with a trigger delay of 1 minute

MQL Expression:

ocnadd_total_transactions_total[10m]{status="SUCCESS",app=~"*corr*"}.rate().groupBy(workername,app).sum()/ocnadd_total_transactions_total[10m]{app=~"*corr*"}.rate().groupBy(workername,app).sum()*100<90

OID 1.3.6.1.4.1.323.5.3.53.1.33.5029
Metric Used ocnadd_total_transactions_total
Resolution The alert is cleared automatically when the transaction success rate goes above the critical threshold alert level of 90%

15.3.3 Adding SNMP Support

OCNADD forwards Prometheus alerts as Simple Network Management Protocol (SNMP) traps to the southbound SNMP servers. OCNADD uses two SNMP MIB files to generate the traps. The Alert Manager configuration is modified by updating the alertmanager.yaml file, in which alerts can be grouped based on podname, alertname, severity, namespace, and so on. The Prometheus Alert Manager is integrated with the Oracle Communications Cloud Native Core, Cloud Native Environment (CNE) snmp-notifier service, and the external SNMP servers are set up to receive the Prometheus alerts as SNMP traps. The operator must update the MIB files along with the Alert Manager file to receive the SNMP traps in their environment.

Note:

  • SNMP is not supported on OCI.
  • Only a user with admin privileges can perform the following procedures.

Alert Manager Configuration

  • Run the following command to obtain the Alert Manager Secret configuration from the Bastion Host and save it to a file:
    $ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -o yaml -n occne-infra > alertmanager-secret-k8s.yaml

    Sample output:

    apiVersion: v1
    data:
      alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCnJvdXRlOgogIGdyb3VwX2J5OgogIC0gam9iCiAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgZ3JvdXBfd2FpdDogMzBzCiAgcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbA==
    kind: Secret
    metadata:
      annotations:
        meta.helm.sh/release-name: occne-kube-prom-stack
        meta.helm.sh/release-namespace: occne-infra
      creationTimestamp: "2022-01-24T22:46:34Z"
      labels:
        app: kube-prometheus-stack-alertmanager
        app.kubernetes.io/instance: occne-kube-prom-stack
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/part-of: kube-prometheus-stack
        app.kubernetes.io/version: 18.0.1
        chart: kube-prometheus-stack-18.0.1
        heritage: Helm
        release: occne-kube-prom-stack
      name: alertmanager-occne-kube-prom-stack-kube-alertmanager
      namespace: occne-infra
      resourceVersion: "5175"
      uid: a38eb420-a4d0-4020-a375-ab87421defde
    type: Opaque
  • Extract the Alert Manager configuration. In the alertmanager-secret-k8s.yaml file saved in the previous step, the value of the alertmanager.yaml key (the third line of the file) contains the Alert Manager configuration encoded in Base64 format. To extract the configuration, decode this value by running the following command:
    echo 'Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCnJvdXRlOgogIGdyb3VwX2J5OgogIC0gam9iCiAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgZ3JvdXBfd2FpdDogMzBzCiAgcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbA=='  | base64 --decode
    

    Sample output:

    global:
      resolve_timeout: 5m
    receivers:
    - name: default-receiver
      webhook_configs:
      - url: http://occne-snmp-notifier:9464/alerts
    route:
      group_by:
      - job
      group_interval: 5m
      group_wait: 30s
      receiver: default-receiver
      repeat_interval: 12h
      routes:
      - match:
          alertname: Watchdog
        receiver: default-receiver
    templates:
    - /etc/alertmanager/config/*.tmpl
  • Update the alertmanager.yaml file. Alerts can be grouped based on the following:
    • podname
    • alertname
    • severity
    • namespace

    Save the changes to alertmanager.yaml file.

    For example:

    route:
      group_by: [podname, alertname, severity, namespace]
      group_interval: 5m
      group_wait: 30s
      receiver: default-receiver
      repeat_interval: 12h
  • Encode the updated alertmanager.yaml file by running the following command:
    $ cat alertmanager.yaml | base64 -w0
    Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCi0gbmFtZTogbmV3LXJlY2VpdmVyLTEKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyLTE6OTQ2NS9hbGVydHMKcm91dGU6CiAgZ3JvdXBfYnk6CiAgLSBqb2IKICBncm91cF9pbnRlcnZhbDogNW0KICBncm91cF93YWl0OiAzMHMKICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIHJlcGVhdF9pbnRlcnZhbDogMTJoCiAgcm91dGVzOgogIC0gcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICAgIGdyb3VwX3dhaXQ6IDMwcwogICAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIC0gcmVjZWl2ZXI6IG5ldy1yZWNlaXZlci0xCiAgICBncm91cF93YWl0OiAzMHMKICAgIGdyb3VwX2ludGVydmFsOiA1bQogICAgcmVwZWF0X2ludGVydmFsOiAxMmgKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIC0gbWF0Y2g6CiAgICAgIGFsZXJ0bmFtZTogV2F0Y2hkb2cKICAgIHJlY2VpdmVyOiBuZXctcmVjZWl2ZXItMQp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbAo=
    
  • Edit the alertmanager-secret-k8s.yaml file created in the first step. Replace the encoded alertmanager.yaml content with the output generated in the previous step.

    For example:

    $ vi alertmanager-secret-k8s.yaml
    apiVersion: v1
    data:
      alertmanager.yaml: <paste here the encoded content of alertmanager.yaml>
    kind: Secret
    metadata:
      annotations:
        meta.helm.sh/release-name: occne-kube-prom-stack
        meta.helm.sh/release-namespace: occne-infra
      creationTimestamp: "2023-02-16T09:44:58Z"
      labels:
        app: kube-prometheus-stack-alertmanager
        app.kubernetes.io/instance: occne-kube-prom-stack
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/part-of: kube-prometheus-stack
        app.kubernetes.io/version: 36.2.0
        chart: kube-prometheus-stack-36.2.0
        heritage: Helm
        release: occne-kube-prom-stack
      name: alertmanager-occne-kube-prom-stack-kube-alertmanager
      namespace: occne-infra
      resourceVersion: "8211"
      uid: 9b499b32-6ad2-4754-8691-70665f9daab4
    type: Opaque
  • Apply the updated secret by running the following command:
    $ kubectl apply -f alertmanager-secret-k8s.yaml -n occne-infra
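
    Optionally, to confirm that the updated configuration has been applied, decode the secret back and inspect it; if the Alert Manager amtool utility is available, it can also be used to validate the decoded file. This verification step is optional and not part of the configuration procedure:

    $ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -n occne-infra -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode > alertmanager-decoded.yaml
    $ amtool check-config alertmanager-decoded.yaml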

Integrate the Alert Manager with snmp-notifier Service

  • Update the SNMP client destination in the occne-snmp-notifier service with the SNMP destination client IP.

    Note:

    For a persistent client configuration, edit the values of the snmp-notifier in Helm charts and perform a Helm upgrade.

    Add "warn" to the alert severity list to receive warning alerts from OCNADD. Run the following command:

    $ kubectl edit deployment -n occne-infra occne-snmp-notifier
     
    1. Update the "--snmp.destination=<IP>:<port>" field inside the container args with the SNMP client destination IP.
       Example:
     
        spec:
          containers:
          - args:
            - --snmp.destination=10.20.30.40:162
     
    2. Add "warn" to the severity list, as some of the Data Director alerts are raised with severity "warn".
       Example:
     
        - --alert.severities=critical,major,minor,warning,info,clear,warn 
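
    After saving the deployment, the effective container arguments can be confirmed with a read-only query. This check is optional; the deployment and namespace names below follow the CNE defaults used in this procedure:

    $ kubectl get deployment occne-snmp-notifier -n occne-infra -o jsonpath='{.spec.template.spec.containers[0].args}'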

Verifying SNMP Notification

  • Update the SNMP client destination in the occne-snmp-notifier service with the SNMP destination client IP.

    Note:

    For a persistent client configuration, edit the values of the snmp-notifier in Helm charts and perform a Helm upgrade.
    Add the "alert.severities" parameter to the container arguments of the occne-snmp-notifier to receive alerts from OCNADD. Run the following command:
    $ kubectl edit deployment -n occne-infra occne-snmp-notifier
      
    1. Update the "--snmp.destination=<IP>:<port>" field inside the container args with the SNMP client destination IP.
       Example:
      
        spec:
          containers:
          - args:
            - --snmp.destination=10.20.30.40:162
      
    2. Add the "--alert.severities" parameter to the container arguments, as shown in the following line:
      --alert.severities=critical,major,minor,warning,info,clear,warn

       Example:
        spec:
          containers:
          - args:
            - --snmp.destination=10.20.30.40:162
            - --alert.severities=critical,major,minor,warning,info,clear,warn
  • To verify the SNMP notifications, check for new notifications in the pod logs of the occne-snmp-notifier. Run the following command to view the logs:
    $ kubectl logs -n occne-infra <occne-snmp-notifier-pod-name>

    Sample output:

    10.20.30.50 - - [26/Mar/2023:13:58:14 +0000] "POST /alerts HTTP/1.1" 200 0
    10.20.30.60 - - [26/Mar/2023:14:02:51 +0000] "POST /alerts HTTP/1.1" 200 0
    10.20.30.70 - - [26/Mar/2023:14:03:14 +0000] "POST /alerts HTTP/1.1" 200 0
    10.20.30.80 - - [26/Mar/2023:14:07:51 +0000] "POST /alerts HTTP/1.1" 200 0
    10.20.30.90 - - [26/Mar/2023:14:08:14 +0000] "POST /alerts HTTP/1.1" 200 0
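
    The pod logs above only confirm that the alerts reached the snmp-notifier service. To confirm that the traps also arrive at the configured SNMP destination, a packet capture can be taken on the trap port of the SNMP server host. The command below assumes that a standard packet-capture tool such as tcpdump is available on that host and that the default SNMP trap port (162) is used:

    $ tcpdump -ni any udp port 162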

OCNADD MIB Files

Two OCNADD MIB files are used to generate the traps. The operator must update the MIB files and the Alert Manager file to obtain the traps in their environment. The files are:

  • OCNADD-MIB-TC-25.1.200.mib: This is the top-level MIB file, where the objects and their data types are defined.
  • OCNADD-MIB-25.1.200.mib: This file fetches the objects from the top-level MIB file; based on the alert notification, the appropriate objects are selected for display.

Note:

MIB files are packaged along with OCNADD Custom Templates. Download the files from MOS. See Oracle Communications Cloud Native Core Network Analytics Data Director Installation, Upgrade, and Fault Recovery Guide for more information.
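
After the MIB files are downloaded, they can be sanity-checked on the SNMP manager host before they are loaded into the trap receiver. The following optional check uses the net-snmp snmptranslate utility and assumes that net-snmp is installed and that both .mib files are placed in the current directory; any parse errors in the MIB files are reported when the OID tree is printed:

$ snmptranslate -M +. -m ALL -Tp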