9 OCNADD KPIs
Note:
The "namespace" label in the KPI queries must be updated to match the namespace used in the OCNADD deployment. Where applicable, run the queries per worker group, for example for the ingress and egress MPS, failure or success rate, and packet drop KPIs. Use the "worker_group" label to filter the KPI queries by worker group name.
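
As an illustration of the note above, the following Python sketch substitutes a namespace into a KPI query and adds a "worker_group" matcher. The helper name `render_kpi_query`, the namespace `ocnadd-deploy`, and the worker group name `wg1` are hypothetical; the actual label values depend on the deployment.

```python
from typing import Optional


def render_kpi_query(template: str, namespace: str,
                     worker_group: Optional[str] = None) -> str:
    """Fill in $NAMESPACE and optionally add a worker_group matcher."""
    query = template.replace("$NAMESPACE", namespace)
    if worker_group is not None:
        # Add the matcher next to every namespace matcher in the query.
        ns_matcher = f'namespace="{namespace}"'
        query = query.replace(
            ns_matcher, f'{ns_matcher},worker_group="{worker_group}"')
    return query


# Query taken from the ocnadd_ingress_mps_5mAgg KPI.
INGRESS_MPS = ('sum(rate(kafka_stream_processor_node_process_total'
               '{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m]))')

# "ocnadd-deploy" and "wg1" are placeholder values, not product defaults.
print(render_kpi_query(INGRESS_MPS, "ocnadd-deploy", "wg1"))
```

The same substitution applies to any KPI query in this chapter that uses the `namespace="$NAMESPACE"` matcher.
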
The following KPIs are added in OCNADD 23.4.0.0.1.
Table 9-1 ocnadd_ingress_record_count_by_service
| KPI Detail | Measures the total ingress records in kafka source topics per aggregation service at the current time |
|---|---|
| Metric Used for the KPI | sum by (service)(kafka_stream_processor_node_process_total{namespace="$NAMESPACE", service=~".*aggregation.*"}) |
| Service Operation | NA |
| Response Code | NA |
Table 9-2 ocnadd_ingress_record_count_total
| KPI Detail | Measures the total ingress records in kafka source topics at the current time |
|---|---|
| Metric Used for the KPI | sum (kafka_stream_processor_node_process_total{namespace="$NAMESPACE", service=~".*aggregation.*"}) |
| Service Operation | NA |
| Response Code | NA |
Table 9-3 ocnadd_ingress_mps_per_service_5mAgg
| KPI Detail | Measures the ingress MPS per service aggregated over 5min |
|---|---|
| Metric Used for the KPI | sum by (service)(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m])) |
| Service Operation | NA |
| Response Code | NA |
Table 9-4 ocnadd_ingress_mps_5mAgg
| KPI Detail | Measures the ingress MPS aggregated over 5min |
|---|---|
| Metric Used for the KPI | sum(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m])) |
| Service Operation | NA |
| Response Code | NA |
Table 9-5 ocnadd_ingress_mps_per_service_5mAgg_last_24h
| KPI Detail | Measures the ingress MPS per service aggregated over 5min for last 24 hours |
|---|---|
| Metric Used for the KPI | sum by (service)(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m]))[24h:5m] |
| Service Operation | NA |
| Response Code | NA |
Table 9-6 ocnadd_ingress_record_count_per_service_5mAgg_last_24h
| KPI Detail | Measures the ingress messages per service aggregated over 5min for last 24 hours |
|---|---|
| Metric Used for the KPI | sum by (service)(increase(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m]))[24h:5m] |
| Service Operation | NA |
| Response Code | NA |
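
The MPS and record count KPIs over the last 24 hours differ only in their use of `rate` versus `increase`. In Prometheus, `increase` over a range equals the per-second `rate` over that range multiplied by the range length in seconds. A minimal sketch of that relationship, using a made-up traffic figure:

```python
WINDOW_SECONDS = 5 * 60  # the [5m] range used in the KPI queries


def increase_from_rate(rate_per_second: float,
                       window_seconds: int = WINDOW_SECONDS) -> float:
    """Convert a per-second rate (MPS) into a record count per window."""
    return rate_per_second * window_seconds


# Hypothetical example: a service sustaining 2000 MPS for a full window.
records_per_window = increase_from_rate(2000.0)
print(records_per_window)  # 600000.0
```
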
Table 9-7 ocnadd_kafka_ingress_record_drop_rate_5minAgg
| KPI Detail | Measures the total ingress message drop rate aggregated over 5min |
|---|---|
| Metric Used for the KPI | sum(rate(kafka_stream_task_dropped_records_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m])) |
| Service Operation | NA |
| Response Code | NA |
Table 9-8 ocnadd_kafka_ingress_record_drop_rate_per_service_5minAgg
| KPI Detail | Measures the total ingress message drop rate per service aggregated over 5min |
|---|---|
| Metric Used for the KPI | sum(rate(kafka_stream_task_dropped_records_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m])) by (service,pod) |
| Service Operation | NA |
| Response Code | NA |
Table 9-9 ocnadd_egress_request_count_total_by_3rdparty_destination_endpoint
| KPI Detail | Total egress requests per 3rd party application per destination endpoint |
|---|---|
| Metric Used for the KPI | sum by (instance_identifier,destination_endpoint)(ocnadd_egress_requests_total{namespace="$NAMESPACE"}) |
| Service Operation | POST |
| Response Code | NA |
Table 9-10 ocnadd_egress_response_count_total_by_3rdparty_destination_endpoint
| KPI Detail | Total egress responses per 3rd party application per destination endpoint |
|---|---|
| Metric Used for the KPI | sum by (instance_identifier,destination_endpoint)(ocnadd_egress_responses_total{namespace="$NAMESPACE"}) |
| Service Operation | POST |
| Response Code | NA |
Table 9-11 ocnadd_egress_failure_count_total_by_3rdparty_destination_endpoint
| KPI Detail | Total egress failure count per 3rd party application per destination endpoint |
|---|---|
| Metric Used for the KPI | sum by (destination_endpoint,instance_identifier)(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}) |
| Service Operation | POST |
| Response Code | NA |
Table 9-12 ocnadd_egress_request_rate_by_3rdparty_5mAgg
| KPI Detail | Total egress request rate per 3rd party application in 5min Aggregation |
|---|---|
| Metric Used for the KPI | sum by (instance_identifier)(rate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[5m])) |
| Service Operation | POST |
| Response Code | NA |
Table 9-13 ocnadd_egress_failure_rate_by_3rdparty_5mAgg
| KPI Detail | Total egress failure rate per 3rd party application in 5min Aggregation |
|---|---|
| Metric Used for the KPI | sum by (instance_identifier)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[5m])) / sum by (instance_identifier) (irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[5m])) |
| Service Operation | POST |
| Response Code | NA |
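
The failure rate KPI divides the failed request rate by the total request rate for each 3rd party application. The sketch below mimics that division on plain dictionaries keyed by `instance_identifier`; the application names and values are invented for illustration. Unlike PromQL, which returns NaN or +Inf when the denominator is zero, this sketch simply omits such entries.

```python
def failure_ratio(failed_rate: dict, request_rate: dict) -> dict:
    """Divide failed by total request rate for matching applications."""
    ratios = {}
    for app, total in request_rate.items():
        # Like PromQL vector matching, keep only keys present on both sides.
        if app in failed_rate and total > 0:
            ratios[app] = failed_rate[app] / total
    return ratios


# Hypothetical per-second rates per 3rd party application.
failed = {"app-a": 5.0}
requests = {"app-a": 1000.0, "app-b": 500.0}

ratios = failure_ratio(failed, requests)
print(ratios)  # {'app-a': 0.005}
```

Multiply the resulting ratio by 100 to express it as a percentage.
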
Table 9-14 ocnadd_egress_failure_rate_by_3rdparty_per_destination_endpoint_5mAgg
| KPI Detail | Total egress failure rate per 3rd party application per destination endpoint in 5min Aggregation |
|---|---|
| Metric Used for the KPI | sum by (instance_identifier, destination_endpoint)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[5m])) / sum by (instance_identifier, destination_endpoint) (irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[5m])) |
| Service Operation | POST |
| Response Code | NA |
Table 9-15 ocnadd_e2e_avg_record_latency_by_3rdparty
| KPI Detail | Total e2e average latency per 3rd party application in 5min Aggregation |
|---|---|
| Metric Used for the KPI | (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[5m])) by (instance_identifier) / (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[5m])) by (instance_identifier))) |
| Service Operation | POST |
| Response Code | NA |
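
The average latency KPIs divide the growth of the `_sum` series by the growth of the `_count` series. Because `irate` uses the last two samples in the range, the calculation can be sketched from two consecutive scrapes. The sample values below are invented, and the sketch ignores counter resets.

```python
from typing import Optional


def average_latency_seconds(sum_prev: float, sum_last: float,
                            count_prev: float, count_last: float
                            ) -> Optional[float]:
    """Average latency of the requests observed between two scrapes."""
    delta_count = count_last - count_prev
    if delta_count <= 0:
        # No new requests (or a counter reset): no meaningful average.
        return None
    return (sum_last - sum_prev) / delta_count


# Hypothetical scrapes: 300 new requests took 1.5 s in total.
avg = average_latency_seconds(120.0, 121.5, 40000, 40300)
print(avg)
```

With these numbers the average is about 0.005 seconds per request.
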
Table 9-16 ocnadd_e2e_avg_record_latency_by_3rdparty_per_adapter_pod
| KPI Detail | Total e2e average latency per 3rd party application per egress adapter POD in 5min Aggregation |
|---|---|
| Metric Used for the KPI | (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod) / (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod))) |
| Service Operation | POST |
| Response Code | NA |
Table 9-17 ocnadd_egress_adapter_processing_avg_record_latency_by_3rdparty_per_adapter_pod
| KPI Detail | Total service processing average latency per 3rd party application per adapter POD in 5min Aggregation |
|---|---|
| Metric Used for the KPI | (sum (irate(ocnadd_egress_service_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod) / (sum (irate(ocnadd_egress_service_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod))) |
| Service Operation | POST |
| Response Code | NA |
Table 9-18 ocnadd_egress_adapter_request_processing_avg_record_latency_by_3rdparty_per_adapter_pod
| KPI Detail | Total request processing average latency per 3rd party application per adapter POD in 5min Aggregation. This includes the network latency added by the response from the 3rd party application |
|---|---|
| Metric Used for the KPI | (sum (irate(ocnadd_egress_request_latency_seconds_sum{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod) / (sum (irate(ocnadd_egress_request_latency_seconds_count{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod))) |
| Service Operation | POST |
| Response Code | NA |
Table 9-19 ocnadd_egress_e2e_avg_latency_95percentile_for_a_given_egress_adapter
| KPI Detail | The 95th percentile of e2e latency in milliseconds for the egress adapter, calculated over a period of 5min |
|---|---|
| Metric Used for the KPI | histogram_quantile(0.95, sum(rate(ocnadd_egress_e2e_request_processing_latency_seconds_bucket{namespace="$namespaces",service="$servicename"}[5m])) by (le)) |
| Service Operation | POST |
| Response Code | NA |
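
To show how `histogram_quantile` derives a percentile from cumulative `le` buckets, the following simplified sketch applies linear interpolation inside the bucket that contains the requested rank. The bucket boundaries and counts are invented, and the function is an approximation of the documented Prometheus behaviour for classic histograms, not the OCNADD or Prometheus implementation. It assumes 0 < q <= 1.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The pairs must be sorted by upper bound and end with math.inf.
    """
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest
                # finite bucket boundary instead of interpolating.
                return buckets[index - 1][0]
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical per-second bucket rates, upper bounds in seconds.
sample = [(0.005, 600.0), (0.01, 900.0), (0.025, 990.0), (math.inf, 1000.0)]
p95 = histogram_quantile(0.95, sample)
print(round(p95, 4))
```

For this sample the 95th percentile falls in the 0.01 to 0.025 bucket and evaluates to roughly 0.0183. The accuracy of the estimate depends on how finely the buckets are spaced around the percentile of interest.
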
The following KPI should be used in the context of the management group and the worker group. The namespaces may differ for the management group and the worker group if there is no default worker group.
Table 9-20 Memory Usage per POD
| KPI Detail | Measures the memory usage per POD |
|---|---|
| Metric Used for the KPI | sum(container_memory_working_set_bytes{namespace=~"$Namespace",image!=""}/(1024*1024*1024)) by (pod) |
| Service Operation | NA |
| Response Code | NA |
The following KPI should be used in the context of the management group and the worker group. The namespaces may differ for the management group and the worker group if there is no default worker group.
Table 9-21 CPU Usage per POD
| KPI Detail | Measures the CPU usage per POD |
|---|---|
| Metric Used for the KPI | sum(rate(container_cpu_usage_seconds_total{namespace=~"$Namespace",image!=""}[2m])) by (pod) * 1000 |
| Service Operation | NA |
| Response Code | NA |
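
The memory and CPU queries scale their raw values: memory is divided by 1024*1024*1024 to give GiB, and the CPU rate (in cores) is multiplied by 1000 to give millicores. A small sketch of both conversions, with invented sample values:

```python
def bytes_to_gib(value_bytes: float) -> float:
    """Convert bytes to GiB, as the memory usage query does."""
    return value_bytes / (1024 * 1024 * 1024)


def cores_to_millicores(cores: float) -> float:
    """Convert CPU cores to millicores, as the CPU usage query does."""
    return cores * 1000


# Hypothetical pod using 1610612736 bytes and a quarter of a core.
print(bytes_to_gib(1610612736))   # 1.5
print(cores_to_millicores(0.25))  # 250.0
```
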
Table 9-22 Service Status
| KPI Detail | Provides the status of each Data Director service running in the specified namespace |
|---|---|
| Metric Used for the KPI | up{namespace="$NAMESPACE"} |
| Service Operation | NA |
| Response Code | NA |
Table 9-23 ocnadd_ext_kafka_feed_record_total per external feed rate(MPS)
| KPI Detail | The rate of messages consumed per second per external Kafka consumer, calculated over a period of 5min |
|---|---|
| Metric Used for the KPI | sum(irate(ocnadd_ext_kafka_feed_record_total{namespace="$Namespace"}[5m])) by (feed_name) |
| Service Operation | NA |
| Response Code | NA |