15 OCNADD KPIs, Metrics, and Alerts

This chapter provides information on OCNADD Metrics, KPIs, and Alerts.

15.1 OCNADD Metrics

This section includes information about Dimensions and Common Attributes of metrics for Oracle Communications Network Analytics Data Director (OCNADD).

Dimension Description

Table 15-1 Dimensions

Dimension Values Description
HttpVersion HTTP/2.0 Specifies HTTP protocol version.
Method GET, PUT, POST, DELETE, PATCH HTTP method.
Scheme HTTP, HTTPS, UNKNOWN Specifies the HTTP protocol scheme.
route_path NA Path predicate that matched the current request.
status NA HTTP response code.
quantile Integer values Captures the latency values with ranges of 10ms, 20ms, 40ms, 80ms, 100ms, 200ms, 500ms, 1000ms, and 5000ms.
instance_identifier Prefix configured in Helm, UNKNOWN Prefix of the pod configured in Helm when there are multiple instances in the same deployment.
3rd_party_consumer_name or consumer_name - Name of the 3rd-party consumer application as configured from the UI.
destination_endpoint IP/FQDN Destination IP address or FQDN.
processor_node_id - Stream processor node ID in aggregation service.
serviceId serviceType-N It is the identifier for the service instance used for registration with the health monitoring service.
serviceType CONSUMER_ADAPTER, CONFIGURATION, ALARM, AGGREGATION-NRF, AGGREGATION-SCP, AGGREGATION-SEPP, AGGREGATION-BSF, AGGREGATION-PCF, AGGREGATION-NON-ORACLE, OCNADD-ADMIN The ocnadd service type.
service ocnaddnrfaggregation, ocnaddseppaggregation, ocnaddscpaggregation, ocnaddbsfaggregation, etc. The name of the Data Director microservice service.
request_type HTTP2, H2C, TCP, TCP_SECURED Type of the data feed created using the UI; used to identify whether the feed is for HTTP2 or synthetic packets.
destination_endpoint URI It is the REST URI for the 3rd-party monitoring application configured on the data feed.
nf_feed_type SCP, NRF, PCF, BSF, SEPP The source NF for the feed.
error_reason - The error reason for the failure of the HTTP request sent to the 3rd-party application from the egress adapter.
correlation-id - Taken from correlation ID present in the metadata list.
way - It is taken from the message direction present in the metadata list.
srcIP - Taken from the source IP present in the metadata list, the global L3L4 configuration, or the least-priority address configured in the feed.
dstIP - Taken from the destination IP present in the metadata list, the global L3L4 configuration, or the least-priority address configured in the feed.
srcPort - Taken from the source port present in the metadata list, the global L3L4 configuration, or the least-priority address configured in the feed.
dstPort - Taken from the destination port present in the metadata list, the global L3L4 configuration, or the least-priority address configured in the feed.
MD - Indicates that the value is taken from the metadata list. It is attached to srcIP, dstIP, srcPort, or dstPort based on the mapping.
LP - Indicates that the value is taken from the least-priority address configured in the feed. It is attached to srcIP, dstIP, srcPort, or dstPort based on the mapping.
L3L4 - Indicates that the value is taken from the global L3L4 configuration. It is attached to srcIP, dstIP, srcPort, or dstPort based on the mapping.
worker_group String Name of the worker group in which the corresponding traffic processing service is running.
The following table includes information about common attributes for OCNADD:

Table 15-2 Attributes

Attribute Description
application The name of the application that the microservice is a part of.
microservice The name of the microservice.
namespace The Kubernetes namespace in which the microservice is running.
node The name of the worker node that the microservice is running on.
pod The name of the Kubernetes POD.

OCNADD Metrics

The following table lists important metrics related to OCNADD:

Table 15-3 Metrics

Metric Name Description Dimensions
kafka_stream_processor_node_process_total

The total number of records processed by a source node. The metric will be pegged by the aggregation services (SCP, SEPP, BSF, PCF and NRF) and egress adapter in the data director and denotes the total number of records consumed from the source topic.

Metric Type: Counter

Namespace will identify the worker group for the corresponding Kafka Cluster

  • application
  • container
  • service
  • namespace
  • processor_node_id
  • task_id
  • thread_id
  • name
  • microservice
kafka_stream_processor_node_process_rate

The average number of records processed per second by a source node.

The metric will be pegged by the aggregation services (SCP, SEPP, BSF, PCF and NRF) and egress adapter in the data director and denotes the records consumed per sec from the source topic.

Metric Type: Gauge

Namespace will identify the worker group for the corresponding Kafka Cluster

  • application
  • container
  • service
  • namespace
  • processor_node_id
  • task_id
  • thread_id
  • name
  • microservice
kafka_stream_task_dropped_records_total

The total number of records dropped within the stream processing task. The metric will be pegged by the aggregation services (SCP, SEPP, BSF, PCF and NRF) and egress adapter in the data director.

Metric Type: Counter

Namespace will identify the worker group for the corresponding Kafka Cluster

  • application
  • container
  • service
  • namespace
  • processor_node_id
  • task_id
  • thread_id
  • name
  • microservice
kafka_stream_task_dropped_records_rate

The average number of records dropped within the stream processing task. The metric will be pegged by the aggregation services (SCP, SEPP, BSF, PCF and NRF) and egress adapter in the data director.

Metric Type: Gauge

Namespace will identify the worker group for the corresponding Kafka Cluster

  • application
  • container
  • service
  • namespace
  • processor_node_id
  • task_id
  • thread_id
  • name
  • microservice
ocnadd_egress_requests_total

This metric will be pegged as soon as the request reaches the ocnadd egress adapter.

This metric pegs the count of total requests that are to be forwarded to the third-party application. This metric is used for the Egress MPS at DD.

Metric Type: Counter

  • method
  • instance_identifier
  • nf_feed_type
  • request_type
  • Third_party_consumer_name
  • worker_group
  • destination_endpoint
ocnadd_egress_responses_total

This metric is pegged by the ocnadd egress adapter service when a response is received at the egress adapter. It pegs the count of total responses (successful or failed) received from the third-party application.

Metric Type: Counter

  • method
  • status
  • instance_identifier
  • destination_endpoint
  • nf_feed_type
  • request_type
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_failed_request_total

This metric pegs the count of total requests that failed to be sent to the third-party application. It is pegged by the egress adapter service.

Metric Type: Counter

  • destination_endpoint
  • Third_party_consumer_name
  • instance_identifier
  • error-reason
  • nf_feed_type
  • request_type
  • worker_group
ocnadd_egress_e2e_request_processing_latency_seconds_bucket

The metric is pegged on the egress adapter service. It is the latency between the packet timestamp provided by producer NF and the egress adapter when the request packet is sent to the third-party application. This latency is calculated for each message.

Metric Type: Histogram

  • instance_identifier
  • quantile(or le)
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_e2e_request_processing_latency_seconds_sum

This is the sum of end-to-end request processing time for all the requests in seconds. It is the latency between the packet timestamp provided by the producer NF and the egress adapter when the packet is sent out.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_e2e_request_processing_latency_seconds_count

This is the count of batches of messages for which the processing time is summed up. It is the latency between the packet timestamp provided by producer NF and the egress adapter when the packet is sent out.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_service_request_processing_latency_seconds_bucket

The metric is pegged on the egress adapter service. It is the egress adapter service processing latency and is pegged when a request is sent out from the egress gateway to the third-party application. This latency is calculated for each of the messages.

Metric Type: Histogram

  • instance_identifier
  • quantile(or le)
  • worker_group
ocnadd_egress_service_request_processing_latency_seconds_sum

The metric is pegged on the egress adapter service. It is the sum of the egress adapter service processing time for all the requests in seconds and is pegged when a request is sent out from the egress adapter to the third-party application.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • worker_group
ocnadd_egress_service_request_processing_latency_seconds_count

The metric is pegged on the egress adapter service. It is the count for which the processing time is summed up and is pegged when a request is sent out from the egress adapter to the third-party application.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • worker_group
ocnadd_adapter_synthetic_packet_generation_count_total

This metric pegs the count of synthetic packets generated with either success or failed status.

Metric Type: Counter

  • instance_identifier
  • Third_party_consumer_name
  • status
  • nf_feed_type
  • worker_group
ocnadd_egress_filtered_message_total

This metric pegs the count of messages that match the filter criteria for egress.

Metric Type: Counter

Filter criteria have been enhanced to provide a description with action and filter rules.

  • service_name
  • filter_name
  • filter_association_type
  • filter_criteria
  • worker_group
ocnadd_egress_unmatched_filter_message_total

This metric pegs the count of messages that do not match the filter criteria for egress.

Metric Type: Counter

Filter criteria have been enhanced to provide a description with action and filter rules.

  • service_name
  • filter_name
  • filter_association_type
  • filter_criteria
  • worker_group
ocnadd_ingress_filtered_message_total

This metric pegs the count of messages that match the filter criteria for ingress.

Metric Type: Counter

Filter criteria have been enhanced to provide a description with action and filter rules.

  • service_name
  • filter_name
  • filter_association_type
  • filter_criteria
  • worker_group
ocnadd_ingress_unmatched_filter_message_total

This metric pegs the count of messages that do not match the filter criteria for ingress.

Metric Type: Counter

Filter criteria have been enhanced to provide a description with action and filter rules.

  • service_name
  • filter_name
  • filter_association_type
  • filter_criteria
  • worker_group
ocnadd_health_total_alarm_raised_total

This metric will be pegged whenever a new alarm is raised to the alarm service from the Health Monitoring service.

Metric Type: Counter

  • serviceId
  • serviceType
ocnadd_health_total_alarm_cleared_total

This metric will be pegged whenever a clear alarm is invoked from the Health Monitoring service.

Metric Type: Counter

  • serviceId
  • serviceType
ocnadd_health_total_active_number_of_alarm_raised_total

This metric will be pegged whenever a raise or clear alarm is sent to the alarm service from the Health Monitoring service. It denotes the active alarms raised by the Health Monitoring service.

Metric Type: Counter

  • serviceId
  • serviceType
ocnadd_l3l4mapping_info_count_total

This metric will be pegged to provide information about L3L4 mapping in synthetic messages. By default, it is disabled in the chart.

Metric Type: Counter

  • correlation_id
  • dstIP
  • dstPort
  • srcIP
  • srcPort
  • way
  • service
  • worker_group
ocnadd_ext_kafka_feed_record_total

This metric will be pegged by the admin service to provide the total consumed messages by the external Kafka consumer application. The admin service retrieves the consumer offsets count from all the partitions of the aggregated topic and pegs the metric periodically.

Metric Type: Counter

  • feed_name
  • worker_group
ocnadd_data_export_failure_records_count

The metric will be pegged by the export service to provide the total number of records or messages that were not exported successfully.

Metric Type: Counter

  • filelocation
  • configurationname
  • correlationfeedname
  • reason
  • exporttype
  • namespace
ocnadd_xdr_database_records_sent

The metric will be pegged by the storage adapter service to provide the total number of xDRs sent to the XDR database. The xDRs will be pegged for the correlation service corresponding to the worker group.

Metric Type: Counter

  • app
  • worker_group
ocnadd_ingress_request_total

This metric will be pegged as soon as the request reaches the ocnadd ingress adapter.

This metric pegs the count of total requests received by the ocnadd ingress adapter from non-Oracle NFs. It is used for the Ingress MPS at OCNADD with respect to non-Oracle NFs.

Metric Type: Counter

  • method
  • scheme
  • http_version
  • instance_identifier
  • source_host
  • status
  • responsecode
  • error_reason
  • worker_group
ocnadd_ingress_message_processed_total

This metric will be pegged as soon as the request is processed at the ocnadd ingress adapter.

This metric pegs the count of total requests that were processed successfully, failed, or discarded by the ocnadd ingress adapter.

Metric Type: Counter

  • status
  • error_reason
  • worker_group
  • instance_identifier
ocnadd_ingress_service_request_processing_latency_seconds_sum

The metric is pegged on the ingress adapter service. It is the sum of the ingress adapter service processing latency in seconds and is pegged when a request is completely processed at the ingress adapter. This is pegged for each message.

Metric Type: Counter

  • instance_identifier
  • worker_group
ocnadd_ingress_service_request_processing_latency_seconds_count

The metric is pegged on the ingress adapter service. It is the cumulative count of the messages processed at the ingress adapter service instance.

Metric Type: Counter

  • instance_identifier
  • worker_group

15.2 OCNADD KPIs

This section provides information about Key Performance Indicators (KPIs) used for Oracle Communications Network Analytics Data Director (OCNADD).

Note:

The "namespace" in the KPIs should be updated to reflect the current namespace used in the Data Director deployment.

Ensure that the queries are tailored per worker group wherever applicable, such as for KPIs related to ingress and egress MPS, failure/success rate, packet drop, etc. Utilize the "worker_group" label to filter based on the worker group name in the KPI queries.

For queries, adhere to PromQL syntax for CNE-based deployments and MQL syntax for OCI-based deployments.
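
For example, the egress request rate KPI from Table 15-15 can be restricted to a single worker group by adding the "worker_group" label to the selector. The following PromQL is a minimal illustrative sketch, assuming a hypothetical worker group named "wg1" (replace it with the actual worker group name):

  sum by (instance_identifier)(rate(ocnadd_egress_requests_total{namespace="$NAMESPACE", worker_group="wg1"}[10m]))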

The following KPIs are added in OCNADD 25.1.200.

Table 15-4 ocnadd_ingress_record_count_by_service

KPI Detail Measures the total ingress records in Kafka source topics per aggregation service at the current time
Metric Used for the KPI (CNE) PromQL: sum by (service)(kafka_stream_processor_node_process_total{namespace="$NAMESPACE", service=~".*aggregation.*"})
Metric Used for the KPI (OCI) MQL: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*", k8Namespace="$NAMESPACE"}.groupby(microservice).sum()
Service Operation NA
Response Code NA

Table 15-5 ocnadd_ingress_record_count_total

KPI Detail Measures the total ingress records in Kafka source topics at the current time
Metric Used for the KPI (CNE) PromQL: sum (kafka_stream_processor_node_process_total{namespace="$NAMESPACE", service=~".*aggregation.*"})
Metric Used for the KPI (OCI) MQL: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*", k8Namespace="$NAMESPACE"}.sum()
Service Operation NA
Response Code NA

Table 15-6 ocnadd_ingress_mps_per_service_10mAgg

KPI Detail Measures the ingress MPS per service aggregated over 10min
Metric Used for the KPI (CNE) PromQL: sum by (service)(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))
Metric Used for the KPI (OCI) MQL: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupby(k8Namespace,microservice).sum()
Service Operation NA
Response Code NA

Table 15-7 ocnadd_ingress_mps_10mAgg

KPI Detail Measures the ingress MPS aggregated over 10min
Metric Used for the KPI (CNE) PromQL: sum(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))
Metric Used for the KPI (OCI) MQL: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*", k8Namespace="$NAMESPACE"}.rate().grouping().sum()
Service Operation NA
Response Code NA

Table 15-8 ocnadd_ingress_mps_per_service_10mAgg_last_24h

KPI Detail Measures the ingress MPS per service aggregated over 10min for the last 24 hours
Metric Used for the KPI (CNE) PromQL: sum by (service)(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))[24h:5m]
Metric Used for the KPI (OCI) MQL: No valid MQL equivalent is available
Service Operation NA
Response Code NA

Table 15-9 ocnadd_ingress_record_count_per_service_10mAgg_last_24h

KPI Detail Measures the ingress messages per service aggregated over 10min for the last 24 hours
Metric Used for the KPI (CNE) PromQL: sum by (service)(increase(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))[24h:5m]
Metric Used for the KPI (OCI) MQL: No valid MQL equivalent is available
Service Operation NA
Response Code NA

Table 15-10 ocnadd_kafka_ingress_record_drop_rate_10minAgg

KPI Detail Measures the total ingress message drop rate aggregated over 10min
Metric Used for the KPI (CNE) PromQL: sum(rate(kafka_stream_task_dropped_records_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m]))
Metric Used for the KPI (OCI) MQL: kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().grouping().sum()
Service Operation NA
Response Code NA

Table 15-11 ocnadd_kafka_ingress_record_drop_rate_per_service_10minAgg

KPI Detail Measures the total ingress message drop rate per service aggregated over 10min
Metric Used for the KPI (CNE) PromQL: sum(rate(kafka_stream_task_dropped_records_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[10m])) by (service,pod)
Metric Used for the KPI (OCI) MQL: kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupby(nodeName, microservice).sum()
Service Operation NA
Response Code NA

Table 15-12 ocnadd_egress_request_count_total_by_3rdparty_destination_endpoint

KPI Detail Total egress requests per third-party application per destination endpoint
Metric Used for the KPI (CNE) PromQL: sum by (instance_identifier,destination_endpoint)(ocnadd_egress_requests_total{namespace="$NAMESPACE"})
Metric Used for the KPI (OCI) MQL: ocnadd_egress_requests_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()
Service Operation NA
Response Code NA

Table 15-13 ocnadd_egress_response_count_total_by_3rdparty_destination_endpoint

KPI Detail Total egress responses per third-party application per destination endpoint
Metric Used for the KPI (CNE) PromQL: sum by (instance_identifier,destination_endpoint)(ocnadd_egress_responses_total{namespace="$NAMESPACE"})
Metric Used for the KPI (OCI) MQL: ocnadd_egress_responses_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()
Service Operation NA
Response Code NA

Table 15-14 ocnadd_egress_failure_count_total_by_3rdparty_destination_endpoint

KPI Detail Total egress failure count per third-party application per destination endpoint
Metric Used for the KPI (CNE) PromQL: sum by (instance_identifier,destination_endpoint)(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"})
Metric Used for the KPI (OCI) MQL: ocnadd_egress_failed_request_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()
Service Operation NA
Response Code NA

Table 15-15 ocnadd_egress_request_rate_by_3rdparty_10mAgg

KPI Detail Total egress request rate per third-party application in 10min Aggregation
Metric Used for the KPI (CNE) PromQL: sum by (instance_identifier)(rate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[10m]))
Metric Used for the KPI (OCI) MQL: ocnadd_egress_requests_total[10m]{app=~"*adapter*"}.rate().groupby(worker_group,app).sum()
Service Operation NA
Response Code NA

Table 15-16 ocnadd_egress_failure_rate_by_3rdparty_10mAgg

KPI Detail Total egress failure rate per third-party application in 10min Aggregation
Metric Used for the KPI (CNE)

PromQL: sum by (instance_identifier)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[10m]))

/

sum by (instance_identifier) (irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[10m]))

Metric Used for the KPI (OCI)

MQL: ocnadd_egress_failed_request_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier).sum()

/

ocnadd_egress_requests_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier).sum()

Service Operation NA
Response Code NA
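
The ratio above yields a fraction between 0 and 1. If a percentage value is preferred, the same expression can be scaled by 100; the following PromQL is an illustrative sketch rather than a predefined KPI:

  100 * sum by (instance_identifier)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[10m]))
  /
  sum by (instance_identifier)(irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[10m]))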

Table 15-17 ocnadd_egress_failure_rate_by_3rdparty_per_destination_endpoint_10mAgg

KPI Detail Total egress failure rate per third-party application per destination endpoint in 10min Aggregation
Metric Used for the KPI (CNE)

PromQL: sum by (instance_identifier, destination_endpoint)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[10m]))

/

sum by (instance_identifier, destination_endpoint) (irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[10m]))

Metric Used for the KPI (OCI)

MQL: ocnadd_egress_failed_request_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()

/

ocnadd_egress_requests_total[10m]{namespace="$NAMESPACE"}.groupby(instance_identifier,destination_endpoint).sum()

Service Operation NA
Response Code NA

Table 15-18 ocnadd_e2e_avg_latency_by_3rdparty

KPI Detail Total e2e average latency per third-party application in 10min Aggregation
Metric Used for the KPI (CNE)

PromQL: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[10m])) by (instance_identifier)

/

(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[10m])) by (instance_identifier)))

Metric Used for the KPI (OCI) MQL: ocnadd_egress_service_request_processing_latency_seconds_sum[10m]{app=~"*adapter*"}.rate().groupby(app, worker_group).sum() / ocnadd_egress_service_request_processing_latency_seconds_count[10m]{app=~"*adapter*"}.rate().groupby(app, worker_group).sum()
Service Operation NA
Response Code NA

Table 15-19 ocnadd_e2e_avg_latency_by_3rdparty_per_adapter_pod

KPI Detail Total e2e average latency per third-party application per egress adapter POD in 10min aggregation
Metric Used for the KPI (CNE)

PromQL: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[10m])) by (instance_identifier,pod)

/

(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[10m])) by (instance_identifier,pod)))

Metric Used for the KPI (OCI)

MQL: (ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m]{app=~"*adapter*"}.rate().groupBy(worker_group,app).sum()

/

ocnadd_egress_e2e_request_processing_latency_seconds_count[10m]{app=~"*adapter*"}.rate().groupBy(worker_group,app).sum())

Service Operation NA
Response Code NA

Table 15-20 ocnadd_egress_adapter_processing_avg_latency_by_3rdparty_per_adapter_pod

KPI Detail Total service processing average latency per third-party application per adapter POD in 10min aggregation
Metric Used for the KPI (CNE)

PromQL: (sum (irate(ocnadd_egress_service_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[10m])) by (instance_identifier,pod)

/

(sum (irate(ocnadd_egress_service_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[10m])) by (instance_identifier,pod)))

Metric Used for the KPI (OCI)

MQL: ocnadd_egress_service_request_processing_latency_seconds_sum[10m]{app=~"*adapter*"}.rate().groupby(app, worker_group).sum()

/

ocnadd_egress_service_request_processing_latency_seconds_count[10m]{app=~"*adapter*"}.rate().groupby(app, worker_group).sum()

Service Operation NA
Response Code NA

Table 15-21 ocnadd_egress_e2e_avg_latency_buckets

KPI Detail The latency buckets for the feed in a worker group namespace
Metric Used for the KPI (CNE) PromQL: sum(rate(ocnadd_egress_e2e_request_processing_latency_seconds_bucket{app=~".*adapter.*"}[10m])) by (le,namespace,service)
Metric Used for the KPI (OCI) MQL: (ocnadd_egress_e2e_request_processing_latency_seconds_bucket[10m]{app=~"*adapter*"}.rate().groupby(k8Namespace,app,le).sum())
Service Operation NA
Response Code NA
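
The same bucket metric can also be used to approximate latency percentiles. The following PromQL is an illustrative sketch (not a predefined KPI) that estimates the 95th percentile end-to-end latency per feed using the standard histogram_quantile() function:

  histogram_quantile(0.95, sum(rate(ocnadd_egress_e2e_request_processing_latency_seconds_bucket{app=~".*adapter.*"}[10m])) by (le,namespace,service))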

Table 15-22 ocnadd_ext_kafka_feed_record_total per external feed rate(MPS)

KPI Detail The rate of messages consumed per sec per external Kafka consumer, calculated over a period of 5min
Metric Used for the KPI (CNE) PromQL: sum(rate(ocnadd_ext_kafka_feed_record_total{namespace="$Namespace"}[5m])) by (feed_name)
Metric Used for the KPI (OCI) MQL: ocnadd_ext_kafka_feed_record_total[10m].rate().groupby(k8Namespace,feed_name).sum()
Service Operation NA
Response Code NA
Memory Usage per POD

This KPI should be used in the context of both the management group and the worker group. The namespaces may differ for the management and worker groups if there is no default worker group.

Table 15-23 Memory Usage per POD

KPI Detail Measures the memory usage per POD
Metric Used for the KPI (CNE) PromQL: sum(container_memory_working_set_bytes{namespace=~"$Namespace",image!=""}/(1024*1024*1024)) by (pod)
Metric Used for the KPI (OCI) MQL: (container_memory_working_set_bytes[10m]{container=~"*ocnadd*|*zookeeper*|*kafka*|*adapter*|corr*"}.groupby(namespace,pod).mean())/1000000
Service Operation NA
Response Code NA
CPU Usage per POD

This KPI should be used in the context of both the management group and the worker group. The namespaces may differ for the management and worker groups if there is no default worker group.

Table 15-24 CPU Usage per POD

KPI Detail Measures the CPU usage per POD
Metric Used for the KPI (CNE) PromQL: sum(rate(container_cpu_usage_seconds_total{namespace=~"$Namespace",image!=""}[2m])) by (pod) * 1000
Metric Used for the KPI (OCI) MQL: container_cpu_usage_seconds_total[10m]{pod=~"*ocnadd*|*kafka*|*zookeeper*|*adapter*|*corr*|*export*"}.rate().groupby(namespace,pod).sum()
Service Operation NA
Response Code NA
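
If the namespace contains pods that do not belong to the Data Director, the CNE query can be restricted by pod name, similar to the pod patterns used in the MQL query above. The following PromQL is an illustrative sketch; the pod name patterns are assumptions and should be adjusted to match the actual deployment:

  sum(rate(container_cpu_usage_seconds_total{namespace=~"$Namespace", pod=~".*ocnadd.*|.*kafka.*|.*zookeeper.*|.*adapter.*|.*corr.*|.*export.*", image!=""}[2m])) by (pod) * 1000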
Service Status

Table 15-25 Service Status

KPI Detail Provides the status of each Data Director service running in the specified namespace
Metric Used for the KPI (CNE) PromQL: up{namespace="$NAMESPACE"}
Metric Used for the KPI (OCI) MQL: podStatus[10m]{podOwner=~"*adapter*|*ocnadd*|*kafka*|*zookeeper*|corr*|*export*"}.groupby(clusterNamespace,podName).mean()
Service Operation NA
Response Code NA
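
To check the status of a single Data Director service, the same metric can be filtered by the service name. The following PromQL is an illustrative sketch using the configuration service as an example:

  up{namespace="$NAMESPACE", service="ocnaddconfiguration"}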

15.3 OCNADD Alerts

This section provides information on Oracle Communications Network Analytics Data Director (OCNADD) alerts and their configuration.

Alerts Interpretation

The following table defines the alerts severity interpretation based on the infrastructure.

Table 15-26 Alerts Interpretation

CNE OCI
Critical Critical
Major Error
Minor Error
Warning Warning
Info Info

Note:

Alert OIDs are deprecated for OCI deployments.

15.3.1 OCNADD Alert Configuration

This section describes how to configure alerts in OCNADD.

OCNADD on OCCNE

If OCNADD is deployed on the OCCNE setup, all services are monitored by Prometheus by default, and no modifications to the Helm charts are required. Update all Prometheus Alert Rules present in the Helm chart.

Note:

The label used to update the Prometheus Server is "role: cnc-alerting-rules," which is added by default in helm charts.

OCNADD on OCI

Alerts on OCI are made available by the OCI Alarm service. The monitoring service on OCI fetches metrics from OCNADD services, and the Alarm service triggers alarms when the defined threshold is breached. Metrics on OCI are fetched using MQL, and MQL queries are used in the Alarm template on OCI. Alarms can be created using the OCI GUI. OCNADD provides a Terraform script to create supported alarms on OCI:

  1. Extract the Terraform script provided in the OCNADD package under <release-name>/custom-templates/oci/terraform.
  2. Follow these steps:
    1. Log in to the OCI console.
    2. Click Hamburger menu and select Developer Services.
    3. Under Developer Services, select Resource Manager.
    4. Under Resource Manager, select Stacks.
    5. Click Create stack button.
    6. Select the default My Configuration radio button.
    7. Under Stack configuration, click on the folder radio button and upload the Terraform package <release-name>/custom-templates/oci/terraform.
    8. Enter the Name and Description and select the compartment.
    9. Click Next.
  3. Provide appropriate values for the parameters requested in the Terraform script for the following configuration sections:

    Tenancy Configuration

    Metric Namespace Configuration

    Notification Configuration

    Alerts Configuration

    Thresholds

OCNADD supports alarm subscription through email on OCI. Note the following important points when configuring alarms:

  1. Alarm Categories in OCI: Alarms in OCI are categorized into critical, warning, info, and error. Note that the error category is not available in Prometheus alert rules. Therefore, alarms with severity minor and major in Prometheus are converted to error in OCI. For more information, see OCI Alert Template.
  2. Notification and Topic Setup: During the execution of the Terraform script, notifications and topics for the alerts will be automatically created.
  3. User Modification/Deletion: If users need to create new alarms or modify and delete the alarms added through Terraform, they can perform these actions by editing the corresponding alarm definitions through the OCI Console.
  4. OCI Notification Reference: For more information on OCI Notification, see OCI Notification.

OCNADD Configuration When Prometheus is Deployed Without Operator

This section covers the steps to follow when Prometheus is deployed without Operator support (occne-nf-cnc-servicemonitor service), in order to receive all metrics on the OCNADD UI.

  1. Changes in Custom Values of Management Group:
    PROMETHEUS_API: http://<prometheus-service-name>.<prometheus-namespace>.svc.<cluster-domain>:80
    # Replace the placeholders with correct information. 
    # Example: PROMETHEUS_API: http://occne-kube-prom-stack-kube-prometheus.occne-infra.ocnadd:80
    
    DD_PROMETHEUS_PATH: /prometheus/api/v1/query_range
    # Replace the default DD_PROMETHEUS_PATH with this
    
  2. Add Prometheus Annotations in All Deployments and StatefulSets:
    Steps to Update Annotations in All Deployments and StatefulSets:
    1. Run: kubectl edit deployment <deployment-name> -n <namespace>

      Add the Prometheus annotations as shown below to the respective deployments:

      Edit Deployment Example for Adapter

      Before:
      ...
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: app-1-adapter
            role: adapter
      ...
      
      After Adding Annotations:
      ...
      template:
        metadata:
          annotations:                                  # Add these Prometheus annotations to charts
            prometheus.io/path: /actuator/prometheus
            prometheus.io/port: "9000"
            prometheus.io/scrape: "true"
          creationTimestamp: null
      ...
      
    2. Edit the chart files at the specified locations (paths mentioned in the table below) to include the same Prometheus annotations, ensuring changes persist during upgrades.
    3. Verification of Changes:
      1. Run the following to verify annotations are applied:
        kubectl describe deployments.apps -n <namespace> app-1-adapter | grep "prometheus"
        
        Expected Output:
        Annotations:      prometheus.io/path: /actuator/prometheus
                          prometheus.io/port: 9000
                          prometheus.io/scrape: true
        
      2. Verify metrics availability in Prometheus.

      3. Confirm "ACTIVE" status of feeds on the DD UI when traffic is successfully flowing.
        Confirm "ACTIVE" status of feeds on the DD UI when traffic is successfully flowing.

Chart paths for adding annotations manually:

Services Path
kafka ocnadd/charts/ocnaddkafka/templates/ocnaddkafkaBroker.yaml
zookeeper ocnadd/charts/ocnaddkafka/templates/ocnadd-zookeeper.yaml
admin svc ocnadd/charts/ocnaddadminsvc/templates/ocnaddadminservice.yaml
correlation svc ocnadd/charts/ocnaddadminsvc/templates/correlation-deploy.yaml
storage adapter ocnadd/charts/ocnaddadminsvc/templates/ocnaddstorageadapter-deploy.yaml
consumer adapter ocnadd/charts/ocnaddadminsvc/templates/ocnaddingressadapter-deploy.yaml
alarm svc ocnadd/charts/ocnaddalarm/templates/ocnadd-alarm.yaml
configuration svc ocnadd/charts/ocnaddconfiguration/templates/ocnadd-configuration.yaml
healthmonitoring svc ocnadd/charts/ocnaddhealthmonitoring/templates/ocnadd-health.yaml
aggregation svc ocnadd/charts/ocnaddaggregation/templates/ocnadd-<NF>aggregation.yaml (NF - scp,sepp,pcf,nrf,bsf)
export svc ocnadd/charts/ocnaddexport/templates/ocnadd-export.yaml

15.3.2 List of Alerts

This section provides detailed information about the alert rules defined for OCNADD.

15.3.2.1 System Level Alerts

This section lists the system level alerts for OCNADD.

Table 15-27 OCNADD_POD_CPU_USAGE_ALERT

Field Details
Triggering Condition POD CPU usage is above the set threshold (default 85%)
Severity Major
Description OCNADD Pod High CPU usage detected for a continuous period of 5min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % '

PromQL Expression:

expr:

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*aggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*6) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kraft.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*correlation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*filter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*configuration.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*admin.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*health.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*alarm.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ui.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*export.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*4) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*storageadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ingressadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3)

Alert Details OCI

Summary:

Alarm "OCNADD_POD_CPU_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>=85||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*85||pod_cpu_usage_seconds_total[10m]{pod=~"*corr*"}.rate().groupby(namespace,pod).sum()*100>=3*85||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85", with a trigger delay of 1 minute

where, X = FIRING/OK,

n = Different services that violated the rule.

MQL Expression:

pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>={{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"corr*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}

Note: The CPU Threshold value is assigned while executing the Terraform script.

OID 1.3.6.1.4.1.323.5.3.53.1.29.4002
Metric Used

container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the CPU utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd-custom-values.yaml file. If guidance is required, contact My Oracle Support.

Table 15-28 OCNADD_POD_MEMORY_USAGE_ALERT

Field Details
Triggering Condition POD Memory usage is above the set threshold (default 90%)
Severity Major
Description OCNADD Pod High Memory usage detected for a continuous period of 5min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % '

PromQL Expression:

expr:

(sum(container_memory_working_set_bytes{image!="" , pod=~".*aggregation.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kraft.*"}) by (pod,namespace) > 1*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*filter.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*adapter.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*correlation.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*configuration.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*admin.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*health.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*alarm.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ui.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*export.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*storageadapter.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ingressadapter.*"}) by (pod,namespace) > 8*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100)

Alert Details OCI

Summary:

Alarm "OCNADD_POD_MEMORY_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "container_memory_usage_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>=90", with a trigger delay of 1 minute

where, X = FIRING/OK,

n = Different services that violated the rule.

MQL Expression:

container_memory_usage_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>={{ Memory Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.4005
Metric Used

container_memory_working_set_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the memory utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd_custom_values.yaml file. If guidance is required, contact My Oracle Support.

Table 15-29 OCNADD_POD_RESTARTED

Field Details
Triggering Condition A POD has restarted
Severity Minor
Description A POD has restarted in the last 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted'

PromQL Expression:

expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"} > 1

Alert Details OCI

MQL Expression:

No MQL equivalent is available

OID 1.3.6.1.4.1.323.5.3.53.1.29.5006
Metric Used

kube_pod_container_status_restarts_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically if the specific pod is up.

Steps:

1. Check the application logs. Check for database related failures such as connectivity, Kubernetes secrets, and so on.

2. Run the following command to check orchestration logs for liveness or readiness probe failures:

kubectl get po -n <namespace>

Note the full name of the pod that is not running, and use it in the following command:

kubectl describe pod <desired full pod name> -n <namespace>

3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide".

4. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

15.3.2.2 Application Level Alerts

This section lists the application level alerts for OCNADD.

Table 15-30 OCNADD_CONFIG_SVC_DOWN

Field Details
Triggering Condition The configuration service went down or is not accessible
Severity Critical
Description OCNADD Configuration service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddconfiguration service is down'

PromQL Expression:

expr: up{service="ocnaddconfiguration"} != 1

Alert Details OCI

Summary:

Alarm "OCNADD_CONFIG_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddconfiguration"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddconfiguration"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.20.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Configuration service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-31 OCNADD_ALARM_SVC_DOWN

Field Details
Triggering Condition The alarm service went down or is not accessible
Severity Critical
Description OCNADD Alarm service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddalarm service is down'

PromQL Expression:

expr: up{service="ocnaddalarm"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_ALARM_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddalarm"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddalarm"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.24.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Alarm service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-32 OCNADD_HEALTH_MONITORING_SVC_DOWN

Field Details
Triggering Condition The health monitoring service went down or is not accessible
Severity Critical
Description OCNADD Health monitoring service not available for more than 2 min
Alert Details CNE

Summary:

summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddhealthmonitoring service is down'

PromQL Expression:

expr: up{service="ocnaddhealthmonitoring"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_HEALTH_MONITORING_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddhealthmonitoring"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddhealthmonitoring"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.28.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Health monitoring service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-33 OCNADD_SCP_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The SCP Aggregation service went down or is not accessible
Severity Critical
Description OCNADD SCP Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddscpaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddscpaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_SCP_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddscpaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddscpaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.22.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD SCP Aggregation service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-34 OCNADD_NRF_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The NRF Aggregation service went down or is not accessible
Severity Critical
Description OCNADD NRF Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnrfaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddnrfaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_NRF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddnrfaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddnrfaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.31.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD NRF Aggregation service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-35 OCNADD_SEPP_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The SEPP Aggregation service went down or is not accessible
Severity Critical
Description OCNADD SEPP Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddseppaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddseppaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_SEPP_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddseppaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddseppaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.32.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD SEPP Aggregation service becomes available again.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-36 OCNADD_BSF_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The BSF Aggregation service went down or not accessible
Severity Critical
Description OCNADD BSF Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddbsfaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddbsfaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_BSF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddbsfaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddbsfaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.40.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD BSF Aggregation service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-37 OCNADD_PCF_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The PCF Aggregation service went down or not accessible
Severity Critical
Description OCNADD PCF Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddpcfaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddpcfaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_PCF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddpcfaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddpcfaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.41.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD PCF Aggregation service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-38 OCNADD_NON_ORACLE_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The Non Oracle Aggregation service went down or not accessible
Severity Critical
Description OCNADD Non Oracle Aggregation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnonoracleaggregation service is down'

PromQL Expression:

expr: up{service="ocnaddnonoracleaggregation"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_NON_ORACLE_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddnonoracleaggregation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddnonoracleaggregation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.37.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Non Oracle Aggregation service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-39 OCNADD_ADMIN_SVC_DOWN

Field Details
Triggering Condition The OCNADD Admin service went down or not accessible
Severity Critical
Description OCNADD Admin service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddadminservice service is down'

PromQL Expression:

expr: up{service="ocnaddadminservice"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_ADMIN_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddadminservice"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddadminservice"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.30.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Admin service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-40 OCNADD_CONSUMER_ADAPTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Consumer Adapter service went down or not accessible
Severity Critical
Description OCNADD Consumer Adapter service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Consumer Adapter service is down'

PromQL Expression:

expr: up{service=~".*adapter.*", role="adapter"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_CONSUMER_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="correlation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner=~"adapter*"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.25.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Consumer Adapter service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-41 OCNADD_FILTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Filter service went down or not accessible
Severity Critical
Description OCNADD Filter service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Filter service is down'

PromQL Expression:

expr: up{service=~".*filter.*"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_FILTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddfilter"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ocnaddfilter"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.34.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Filter service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-42 OCNADD_CORRELATION_SVC_DOWN

Field Details
Triggering Condition The OCNADD Correlation service went down or not accessible
Severity Critical
Description OCNADD Correlation service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Correlation service is down'

PromQL Expression:

expr: up{service=~".*correlation.*"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_CORRELATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="correlation"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="correlation"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.33.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Correlation service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-43 OCNADD_EXPORT_SVC_DOWN

Field Details
Triggering Condition The OCNADD Export service went down or not accessible
Severity Critical
Description OCNADD Export service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Export service is down'

PromQL Expression:

expr: up{service=~".*export.*"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_EXPORT_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="export"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="export"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.39.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD export service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-44 OCNADD_STORAGE_ADAPTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Storage adapter service went down or not accessible
Severity Critical
Description OCNADD Storage adapter service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Storage adapter service is down'

PromQL Expression:

expr: up{service=~".*storage-adapter.*", role="storageAdapter"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_STORAGE_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="storageadapter"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="storageadapter"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.38.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Storage adapter service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-45 OCNADD_INGRESS_ADAPTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Ingress Adapter service went down or not accessible
Severity Critical
Description OCNADD Ingress Adapter service not available for more than 2 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress Adapter service is down'

PromQL Expression:

expr: up{service=~".*ingress-adapter.*", role="ingressadapter"} != 1
Alert Details OCI

Summary:

Alarm "OCNADD_INGRESS_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ingressadapter"}.mean()!=1", with a trigger delay of 1 minute

MQL Expression:

podStatus[10m]{podOwner="ingressadapter"}.mean()!=1

OID 1.3.6.1.4.1.323.5.3.53.1.36.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Ingress Adapter service starts becoming available.

Steps:

1. Check for service specific alerts which may be causing the issues with service exposure.

2. Run the following command to check if the pod’s status is in the “Running” state:

kubectl -n <namespace> get pod

If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If it is not in “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Table 15-46 OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the warning threshold of 80% of the supported MPS
Severity Warn
Description Total Ingress Message Rate is above the configured warning threshold (80%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.8*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.8*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.8*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5007
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the warning threshold level of 80%.
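To see how close the current ingress rate is to the 80%, 90%, 95%, and 100% tiers used by the OCNADD_MPS_*_INGRESS_THRESHOLD_CROSSED alerts, the documented PromQL expression can be evaluated directly against the Prometheus HTTP API. The following is a minimal sketch; the Prometheus URL is an assumption, and the comparison value must be the rendered {{ .Values.global.cluster.mps }} for your cluster.

# Sketch: query the ingress message rate behind the MPS ingress threshold alerts.
# PROM_URL is an assumption; point it at the Prometheus instance scraping OCNADD.
PROM_URL="http://prometheus:9090"
QUERY='sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
# Compare the returned per-namespace value with 0.8, 0.9, 0.95, and 1.0 times the
# configured cluster MPS ({{ .Values.global.cluster.mps }}).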

Table 15-47 OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the minor threshold alert level of 90% of the supported MPS
Severity Minor
Description Total Ingress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.9*{{ .Values.global.cluster.mps }}

Alert Details OCI

Summary:

Alarm "OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.9*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.9*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5008
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90%.

Table 15-48 OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the major threshold alert level of 95% of the supported MPS
Severity Major
Description Total Ingress Message Rate is above the configured major threshold alert (95%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }}

Alert Details OCI

Summary:

Alarm "OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>0.95*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>0.95*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5009
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95%.

Table 15-49 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity Critical
Description Total Ingress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.36.5010
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

Table 15-50 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity Critical
Description Total Ingress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5010
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

Table 15-51 OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the warning threshold alert level of 80% of the supported MPS
Severity Warn
Description The total Egress Message Rate is above the configured warning threshold alert (80%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80% of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.80*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.80*{{ MPS Threshold }}"", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.80*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5011
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the warning threshold alert level of 80% of supported MPS
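The egress-side thresholds can be checked in the same way using ocnadd_egress_requests_total. A minimal sketch follows, assuming Prometheus is reachable at the URL shown and that jq is available for flattening the API response; both are assumptions, not OCNADD requirements.

# Sketch: current egress message rate per namespace (compare with 0.80–1.0 x MPS).
PROM_URL="http://prometheus:9090"   # assumption
QUERY='sum(irate(ocnadd_egress_requests_total[5m])) by (namespace)'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[] | "\(.metric.namespace) \(.value[1])"'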

Table 15-52 OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the minor threshold alert level of 90% of the supported MPS
Severity Minor
Description The total Egress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90% of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.90*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.90*{{ MPS Threshold }}"", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.90*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5012
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90% of supported MPS

Table 15-53 OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the major threshold alert level of 95% of the supported MPS
Severity Major
Description The total Egress Message Rate is above the configured major threshold alert (95%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95% of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.95*{{ MPS Threshold }}"", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.95*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5013
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95% of supported MPS

Table 15-54 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity Critical
Description The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}"", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5014
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

Table 15-55 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER

Field Details
Triggering Condition The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS for a consumer
Severity Critical
Description The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum (rate(ocnadd_egress_requests_total[5m])) by (namespace, instance_identifier) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI

Summary:

Alarm "OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0 {{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5015
Metric Used ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

Table 15-56 OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured warning threshold alert level of 80%
Severity Warn
Description Average E2E Latency is above the configured warning threshold alert level (80%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 80% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

PromQL Expression:

expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .80*{{ .Values.global.cluster.max_latency }} <= .90*{{ .Values.global.cluster.max_latency }}

Alert Details OCI

Summary:

Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.040&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.045", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.040&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.045

Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05

OID 1.3.6.1.4.1.323.5.3.53.1.29.5016
Metric Used ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the warning threshold alert level of 80% of permissible latency
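The average end-to-end latency that the OCNADD_E2E_AVG_RECORD_LATENCY_* alerts compare against {{ .Values.global.cluster.max_latency }} is the ratio of the latency sum and count rates. The following is a minimal sketch of the same calculation against the Prometheus HTTP API; the URL is an assumption.

# Sketch: current average E2E latency in seconds, per namespace.
PROM_URL="http://prometheus:9090"   # assumption
QUERY='(sum(irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) / (sum(irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace))'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
# The warning, minor, major, and critical tiers fire at 80%, 90%, 95%, and 100% of max_latency.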

Table 15-57 OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured minor threshold alert level of 90%
Severity Minor
Description Average E2E Latency is above the configured minor threshold alert level (90%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 90% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

PromQL Expression:

expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .90*{{ .Values.global.cluster.max_latency }} <= 0.95*{{ .Values.global.cluster.max_latency }}

Alert Details OCI

Summary:

Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.045&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.0475", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.045&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.0475

Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05

OID 1.3.6.1.4.1.323.5.3.53.1.29.5017
Metric Used ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the minor threshold alert level of 90% of permissible latency

Table 15-58 OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured major threshold alert level of 95%
Severity Major
Description Average E2E Latency is above the configured major threshold alert level (95%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 95% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

PromQL Expression:

expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .95*{{ .Values.global.cluster.max_latency }} <= 1.0*{{ .Values.global.cluster.max_latency }}

Alert Details OCI

Summary:

Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.0475&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.05", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.0475&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.05

Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05

OID 1.3.6.1.4.1.323.5.3.53.1.29.5018
Metric Used ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the major threshold alert level of 95% of permissible latency

Table 15-59 OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured critical threshold alert level of 100%
Severity Critical
Description Average E2E Latency is above the configured critical threshold alert level (100%) for the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 100% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

PromQL Expression:

expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > 1.0*{{ .Values.global.cluster.max_latency }}

Alert Details OCI

Summary:

Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.05", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.05

Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05

OID 1.3.6.1.4.1.323.5.3.53.1.29.5019
Metric Used ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the critical threshold alert level of permissible latency

Table 15-60 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS

Field Details
Triggering Condition The packet drop rate in Kafka streams is above the configured major threshold of 1% of the total supported MPS
Severity Major
Description The packet drop rate in Kafka streams is above the configured major threshold of 1% of total supported MPS in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 1% threshold of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.01*{{ .Values.global.cluster.mps }}

Alert Details OCI

Summary:

Alarm "OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100> {{ MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100> {{ MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5020
Metric Used kafka_stream_task_dropped_records_total
Resolution The alert is cleared automatically when the packet drop rate goes below the major threshold (1%) alert level of supported MPS
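The dropped-record rate behind the OCNADD_KAFKA_PACKET_DROP_THRESHOLD_* alerts can be inspected with the same kind of API query; it should stay below 1% (major) and 10% (critical) of the supported MPS. A minimal sketch with an assumed Prometheus URL:

# Sketch: Kafka stream dropped-record rate, per namespace.
PROM_URL="http://prometheus:9090"   # assumption
QUERY='sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace)'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
# Compare the result with 0.01 and 0.1 times the configured cluster MPS.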

Table 15-61 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS

Field Details
Triggering Condition The packet drop rate in Kafka streams is above the configured critical threshold of 10% of the total supported MPS
Severity Critical
Description The packet drop rate in Kafka streams is above the configured critical threshold of 10% of total supported MPS in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 10% threshold of Max messages per second:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.1*{{ .Values.global.cluster.mps }}

Alert Details OCI

Summary:

Alarm "OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100>10*{{MPS Threshold }}", with a trigger delay of 1 minute

MQL Expression:

kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100>10*{{MPS Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.5021
Metric Used kafka_stream_task_dropped_records_total
Resolution The alert is cleared automatically when the packet drop rate goes below the critical threshold (10%) alert level of supported MPS

Table 15-62 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_0.1PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 0.1% of the total supported MPS
Severity Info
Description Egress external connection failure rate towards 3rd party application is crossing the info threshold of 0.1% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 0.1 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 0.1 < 1

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_01PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=0.1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<1", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=0.1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<1

OID 1.3.6.1.4.1.323.5.3.53.1.29.5022
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (0.1%) alert level of supported MPS
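The egress failure-rate tiers (0.1%, 1%, 10%, 25%, and 50%) all derive from the same percentage of failed versus total egress requests. The following is a minimal sketch that evaluates that percentage directly; the Prometheus URL is an assumption.

# Sketch: egress failure rate towards third-party consumers, in percent.
PROM_URL="http://prometheus:9090"   # assumption
QUERY='(sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace)) / (sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) * 100'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
# The result maps onto the Info/Warn/Minor/Major/Critical tiers in Tables 15-62 to 15-66.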

Table 15-63 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the third-party application is above the configured threshold of 1% of the total supported MPS
Severity Warn
Description Egress external connection failure rate towards 3rd party application is crossing the warning threshold of 1% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 1 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 1 < 10

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<10", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<10

OID 1.3.6.1.4.1.323.5.3.53.1.29.5023
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (1%) alert level of supported MPS

Table 15-64 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the third-party application is above the configured threshold of 10% of the total supported MPS
Severity Minor
Description Egress external connection failure rate towards 3rd party application is crossing a minor threshold of 10% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 10 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 10 < 25

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=10&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<25", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=10&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<25

OID 1.3.6.1.4.1.323.5.3.53.1.29.5024
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (10%) alert level of supported MPS

Table 15-65 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the third-party application is above the configured threshold of 25% of the total supported MPS
Severity Major
Description Egress external connection failure rate towards 3rd party application is crossing the major threshold of 25% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 25 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 25 < 50

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=25&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<50", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=25&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<50

OID 1.3.6.1.4.1.323.5.3.53.1.29.5025
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (25%) alert level of supported MPS

Table 15-66 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT

Field Details
Triggering Condition The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 50% of the total supported MPS
Severity Critical
Description Egress external connection failure rate towards 3rd party application is crossing the critical threshold of 50% in the period of 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 50 Percent of Total Egress external connections'

PromQL Expression:

expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 50

Alert Details OCI

Summary:

Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=50", with a trigger delay of 1 minute

MQL Expression:

ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=50

OID 1.3.6.1.4.1.323.5.3.53.1.29.5026
Metric Used ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution The alert is cleared automatically when the failure rate towards third-party consumers goes below the 50% threshold alert level of the supported MPS

Table 15-67 OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT

Field Details
Triggering Condition The ingress traffic increase is more than 10% of the supported MPS
Severity Major
Description The ingress traffic increase is more than 10% of the supported MPS in the last 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS increase is more than 10 Percent of current supported MPS'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) >= 1.1

Alert Details OCI Not Available
OID 1.3.6.1.4.1.323.5.3.53.1.29.5027
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the increase in MPS comes back to lower than 10% of the supported MPS
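
As an illustration of the expression above (the numbers are hypothetical): if the aggregation services processed an average of 110,000 messages per second over the last 5 minutes, while the average for the preceding 5-minute window (the "offset 5m" term) was 100,000 messages per second, the ratio evaluates to 1.1 and the alert is raised.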

Table 15-68 OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT

Field Details
Triggering Condition The ingress traffic decrease is more than 10% of the supported MPS
Severity Major
Description The ingress traffic decrease is more than 10% of the supported MPS in the last 5 min
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS decrease is more than 10 Percent of current supported MPS'

PromQL Expression:

expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) <= 0.9

Alert Details OCI Not Available
OID 1.3.6.1.4.1.323.5.3.53.1.29.5028
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the decrease in MPS comes back to lower than 10% of the supported MPS

Table 15-69 OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED

Field Details
Triggering Condition The total transaction success xDR rate has dropped below the critical threshold alert level of 90%
Severity Critical
Description The total transaction success xDR rate has dropped below the critical threshold alert level of 90% for a period of 5 minutes
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Transaction Success Rate is below 90% per hour:{{ .Values.global.cluster.mps }}'

PromQL Expression:

expr: sum(irate(ocnadd_total_transactions_total{service=~".*correlation.*",status="SUCCESS"}[5m]))by (namespace,service) / sum(irate(ocnadd_total_transactions_total{service=~".*correlation.*"}[5m]))by (namespace,service) *100 < 90

Alert Details OCI

Summary:

Alarm "OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_total_transactions_total[10m]{status="SUCCESS",app=~"*corr*"}.rate().groupBy(workername,app).sum()/ocnadd_total_transactions_total[10m]{app=~"*corr*"}.rate().groupBy(workername,app).sum()*100<90", with a trigger delay of 1 minute

MQL Expression:

ocnadd_total_transactions_total[10m]{status="SUCCESS",app=~"*corr*"}.rate().groupBy(workername,app).sum()/ocnadd_total_transactions_total[10m]{app=~"*corr*"}.rate().groupBy(workername,app).sum()*100<90

OID 1.3.6.1.4.1.323.5.3.53.1.33.5029
Metric Used ocnadd_total_transactions_total
Resolution The alert is cleared automatically when the transaction success rate goes above the critical threshold alert level of 90%

15.3.3 Adding SNMP Support

OCNADD forwards Prometheus alerts as Simple Network Management Protocol (SNMP) traps to the southbound SNMP servers. OCNADD uses two SNMP MIB files to generate the traps. The Alert Manager configuration is modified by updating the alertmanager.yaml file, in which alerts can be grouped based on podname, alertname, severity, namespace, and so on. The Prometheus Alert Manager is integrated with the Oracle Communications Cloud Native Core, Cloud Native Environment (CNE) snmp-notifier service, and the external SNMP servers are set up to receive the Prometheus alerts as SNMP traps. The operator must update the MIB files along with the Alert Manager file to receive the SNMP traps in their environment.

Note:

  • SNMP is not supported on OCI.
  • Only a user with admin privileges can perform the following procedures.

Alert Manager Configuration

  • Run the following command to obtain the Alert Manager Secret configuration from the Bastion Host and save it to a file:
    $ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -o yaml -n occne-infra > alertmanager-secret-k8s.yaml

    Sample output:

    apiVersion: v1
    data:
      alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCnJvdXRlOgogIGdyb3VwX2J5OgogIC0gam9iCiAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgZ3JvdXBfd2FpdDogMzBzCiAgcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbA==
    kind: Secret
    metadata:
      annotations:
        meta.helm.sh/release-name: occne-kube-prom-stack
        meta.helm.sh/release-namespace: occne-infra
      creationTimestamp: "2022-01-24T22:46:34Z"
      labels:
        app: kube-prometheus-stack-alertmanager
        app.kubernetes.io/instance: occne-kube-prom-stack
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/part-of: kube-prometheus-stack
        app.kubernetes.io/version: 18.0.1
        chart: kube-prometheus-stack-18.0.1
        heritage: Helm
        release: occne-kube-prom-stack
      name: alertmanager-occne-kube-prom-stack-kube-alertmanager
      namespace: occne-infra
      resourceVersion: "5175"
      uid: a38eb420-a4d0-4020-a375-ab87421defde
    type: Opaque
  • Extract the Alert Manager configuration. In the alertmanager-secret-k8s.yaml file saved in the previous step, the value of the alertmanager.yaml key (the third line of the file) contains the Alert Manager configuration encoded in Base64 format. To extract the configuration, decode this value by running the following command:
    echo 'Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCnJvdXRlOgogIGdyb3VwX2J5OgogIC0gam9iCiAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgZ3JvdXBfd2FpdDogMzBzCiAgcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbA=='  | base64 --decode
    

    Sample output:

    global:
      resolve_timeout: 5m
    receivers:
    - name: default-receiver
      webhook_configs:
      - url: http://occne-snmp-notifier:9464/alerts
    route:
      group_by:
      - job
      group_interval: 5m
      group_wait: 30s
      receiver: default-receiver
      repeat_interval: 12h
      routes:
      - match:
          alertname: Watchdog
        receiver: default-receiver
    templates:
    - /etc/alertmanager/config/*.tmpl
  • Update the alertmanager.yaml file. Alerts can be grouped based on the following:
    • podname
    • alertname
    • severity
    • namespace

    Save the changes to alertmanager.yaml file.

    For example:

    route:
      group_by: [podname, alertname, severity, namespace]
      group_interval: 5m
      group_wait: 30s
      receiver: default-receiver
      repeat_interval: 12h
  • Encode the updated alertmanager.yaml file by running the following command:
    $ cat alertmanager.yaml | base64 -w0
    Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCi0gbmFtZTogbmV3LXJlY2VpdmVyLTEKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyLTE6OTQ2NS9hbGVydHMKcm91dGU6CiAgZ3JvdXBfYnk6CiAgLSBqb2IKICBncm91cF9pbnRlcnZhbDogNW0KICBncm91cF93YWl0OiAzMHMKICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIHJlcGVhdF9pbnRlcnZhbDogMTJoCiAgcm91dGVzOgogIC0gcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICAgIGdyb3VwX3dhaXQ6IDMwcwogICAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIC0gcmVjZWl2ZXI6IG5ldy1yZWNlaXZlci0xCiAgICBncm91cF93YWl0OiAzMHMKICAgIGdyb3VwX2ludGVydmFsOiA1bQogICAgcmVwZWF0X2ludGVydmFsOiAxMmgKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIC0gbWF0Y2g6CiAgICAgIGFsZXJ0bmFtZTogV2F0Y2hkb2cKICAgIHJlY2VpdmVyOiBuZXctcmVjZWl2ZXItMQp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbAo=
    
  • Edit the alertmanager-secret-k8s.yaml file created in the first step. Replace the encoded alertmanager.yaml content with the output generated in the previous step.

    For example:

    $ vi alertmanager-secret-k8s.yaml
    apiVersion: v1
    data:
      alertmanager.yaml: <paste here the encoded content of alertmanager.yaml>
    kind: Secret
    metadata:
      annotations:
        meta.helm.sh/release-name: occne-kube-prom-stack
        meta.helm.sh/release-namespace: occne-infra
      creationTimestamp: "2023-02-16T09:44:58Z"
      labels:
        app: kube-prometheus-stack-alertmanager
        app.kubernetes.io/instance: occne-kube-prom-stack
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/part-of: kube-prometheus-stack
        app.kubernetes.io/version: 36.2.0
        chart: kube-prometheus-stack-36.2.0
        heritage: Helm
        release: occne-kube-prom-stack
      name: alertmanager-occne-kube-prom-stack-kube-alertmanager
      namespace: occne-infra
      resourceVersion: "8211"
      uid: 9b499b32-6ad2-4754-8691-70665f9daab4
    type: Opaque
  • Apply the updated secret by running the following command:
    $ kubectl apply -f alertmanager-secret-k8s.yaml -n occne-infra
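
    Optionally, to confirm that the updated configuration has been applied, decode the secret back and inspect it; if the Alert Manager amtool utility is available, it can also be used to validate the decoded file. This verification step is optional and not part of the configuration procedure:

    $ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -n occne-infra -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode > alertmanager-decoded.yaml
    $ amtool check-config alertmanager-decoded.yaml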

Integrate the Alert Manager with snmp-notifier Service

  • Update the SNMP client destination in the occne-snmp-notifier service with the SNMP destination client IP.

    Note:

    For a persistent client configuration, edit the values of the snmp-notifier in Helm charts and perform a Helm upgrade.

    Add "warn" to the alert severity list to receive warning alerts from OCNADD. Run the following command:

    $ kubectl edit deployment -n occne-infra occne-snmp-notifier
     
    1. Update the "--snmp.destination=<IP>:<port>" field inside the container args with the SNMP client destination IP.
       Example:
     
        spec:
          containers:
          - args:
            - --snmp.destination=10.20.30.40:162
     
    2. Add "warn" to the severity list, as some of the Data Director alerts are raised with severity "warn".
       Example:
     
        - --alert.severities=critical,major,minor,warning,info,clear,warn 
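
    After saving the deployment, the effective container arguments can be confirmed with a read-only query. This check is optional; the deployment and namespace names below follow the CNE defaults used in this procedure:

    $ kubectl get deployment occne-snmp-notifier -n occne-infra -o jsonpath='{.spec.template.spec.containers[0].args}'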

Verifying SNMP Notification

  • Update the SNMP client destination in the occne-snmp-notifier service with the SNMP destination client IP.

    Note:

    For a persistent client configuration, edit the values of the snmp-notifier in Helm charts and perform a Helm upgrade.
    Add the "alert.severities" parameter to the container arguments of the occne-snmp-notifier to receive alerts from OCNADD. Run the following command:
    $ kubectl edit deployment -n occne-infra occne-snmp-notifier
      
    1. Update the "--snmp.destination=<IP>:<port>" field inside the container args with the SNMP client destination IP.
       Example:
      
        spec:
          containers:
          - args:
            - --snmp.destination=10.20.30.40:162
      
    2. Add the "--alert.severities" parameter to the container arguments, as shown in the following line:
      --alert.severities=critical,major,minor,warning,info,clear,warn

       Example:
        spec:
          containers:
          - args:
            - --snmp.destination=10.20.30.40:162
            - --alert.severities=critical,major,minor,warning,info,clear,warn
  • To verify the SNMP notifications, check for new notifications in the pod logs of the occne-snmp-notifier. Run the following command to view the logs:
    $ kubectl logs -n occne-infra <occne-snmp-notifier-pod-name>

    Sample output:

    10.20.30.50 - - [26/Mar/2023:13:58:14 +0000] "POST /alerts HTTP/1.1" 200 0
    10.20.30.60 - - [26/Mar/2023:14:02:51 +0000] "POST /alerts HTTP/1.1" 200 0
    10.20.30.70 - - [26/Mar/2023:14:03:14 +0000] "POST /alerts HTTP/1.1" 200 0
    10.20.30.80 - - [26/Mar/2023:14:07:51 +0000] "POST /alerts HTTP/1.1" 200 0
    10.20.30.90 - - [26/Mar/2023:14:08:14 +0000] "POST /alerts HTTP/1.1" 200 0
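
    The pod logs above only confirm that the alerts reached the snmp-notifier service. To confirm that the traps also arrive at the configured SNMP destination, a packet capture can be taken on the trap port of the SNMP server host. The command below assumes that a standard packet-capture tool such as tcpdump is available on that host and that the default SNMP trap port (162) is used:

    $ tcpdump -ni any udp port 162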

OCNADD MIB Files

Two OCNADD MIB files are used to generate the traps. The operator must update the MIB files and the Alert Manager file to obtain the traps in their environment. The files are:

  • OCNADD-MIB-TC-25.1.200.mib: This is the top-level MIB file, where the objects and their data types are defined.
  • OCNADD-MIB-25.1.200.mib: This file fetches the objects from the top-level MIB file; based on the alert notification, the appropriate objects are selected for display.

Note:

MIB files are packaged along with OCNADD Custom Templates. Download the files from MOS. See Oracle Communications Cloud Native Core Network Analytics Data Director Installation, Upgrade, and Fault Recovery Guide for more information.
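
After the MIB files are downloaded, they can be sanity-checked on the SNMP manager host before they are loaded into the trap receiver. The following optional check uses the net-snmp snmptranslate utility and assumes that net-snmp is installed and that both .mib files are placed in the current directory; any parse errors in the MIB files are reported when the OID tree is printed:

$ snmptranslate -M +. -m ALL -Tp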