9 OCNADD KPIs
Note:
The "namespace" value in the KPI queries should be updated to reflect the namespace used in the OCNADD deployment. Wherever applicable, the queries should be run per worker group, for example for the ingress and egress MPS, failure or success rate, and packet drop KPIs. Use the "worker_group" label to filter the KPI queries by worker group name.
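The per-worker-group filtering described in the note can be sketched as a small Python helper that fills in the namespace placeholder and adds a worker_group matcher to a KPI query. This is only an illustration: the helper name, the namespace value "ocnadd-deploy", and the worker group name "wg1" are hypothetical. Only the $NAMESPACE placeholder, the worker_group label, and the query itself (Table 9-4) come from this chapter.

```python
from typing import Optional


def scope_query(query: str, namespace: str, worker_group: Optional[str] = None) -> str:
    """Fill $NAMESPACE and optionally add a worker_group matcher (illustrative helper)."""
    scoped = query.replace("$NAMESPACE", namespace)
    if worker_group is not None:
        # Append the worker_group matcher right after every namespace matcher.
        ns_matcher = f'namespace="{namespace}"'
        scoped = scoped.replace(ns_matcher, f'{ns_matcher},worker_group="{worker_group}"')
    return scoped


# Query from Table 9-4; the namespace and worker group values are made up.
template = (
    'sum(rate(kafka_stream_processor_node_process_total'
    '{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m]))'
)
print(scope_query(template, "ocnadd-deploy", "wg1"))
```

Calling the helper without a worker group only substitutes the namespace, which fits KPIs that are not scoped to a worker group.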
The following KPIs are added in OCNADD 23.4.0.0.1.
Table 9-1 ocnadd_ingress_record_count_by_service
KPI Detail | Measures the total ingress records in Kafka source topics per aggregation service at the current time |
---|---|
Metric Used for the KPI | sum by (service)(kafka_stream_processor_node_process_total{namespace="$NAMESPACE", service=~".*aggregation.*"}) |
Service Operation | NA |
Response Code | NA |
Table 9-2 ocnadd_ingress_record_count_total
KPI Detail | Measures the total ingress records in Kafka source topics at the current time |
---|---|
Metric Used for the KPI | sum (kafka_stream_processor_node_process_total{namespace="$NAMESPACE", service=~".*aggregation.*"}) |
Service Operation | NA |
Response Code | NA |
Table 9-3 ocnadd_ingress_mps_per_service_5mAgg
KPI Detail | Measures the ingress MPS per service aggregated over 5min |
---|---|
Metric Used for the KPI | sum by (service)(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m])) |
Service Operation | NA |
Response Code | NA |
Table 9-4 ocnadd_ingress_mps_5mAgg
KPI Detail | Measures the ingress MPS aggregated over 5min |
---|---|
Metric Used for the KPI | sum(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m])) |
Service Operation | NA |
Response Code | NA |
Table 9-5 ocnadd_ingress_mps_per_service_5mAgg_last_24h
KPI Detail | Measures the ingress MPS per service aggregated over 5min for last 24 hours |
---|---|
Metric Used for the KPI | sum by (service)(rate(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m]))[24h:5m] |
Service Operation | NA |
Response Code | NA |
Table 9-6 ocnadd_ingress_record_count_per_service_5mAgg_last_24h
KPI Detail | Measures the ingress messages per service aggregated over 5min for last 24 hours |
---|---|
Metric Used for the KPI | sum by (service)(increase(kafka_stream_processor_node_process_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m]))[24h:5m] |
Service Operation | NA |
Response Code | NA |
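The MPS KPIs above use rate() while the record count KPI in Table 9-6 uses increase() over the same 5-minute window. The following Python sketch shows how the two relate in the simplest case. It deliberately ignores the extrapolation and counter-reset handling that Prometheus applies, and the counter readings are invented for illustration.

```python
def simple_rate(first: float, last: float, window_seconds: float) -> float:
    """Per-second growth of a counter across the window (no resets, no extrapolation)."""
    return (last - first) / window_seconds


def simple_increase(first: float, last: float) -> float:
    """Total growth of a counter across the window."""
    return last - first


window = 5 * 60  # the [5m] range used by the KPIs, in seconds
first_sample, last_sample = 1_200_000.0, 1_500_000.0  # invented counter readings

mps = simple_rate(first_sample, last_sample, window)     # messages per second
records = simple_increase(first_sample, last_sample)     # messages in the window
print(mps, records)
```

In this simplified model the increase is the rate multiplied by the window length in seconds, which is why the two KPIs describe the same traffic in different units.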
Table 9-7 ocnadd_kafka_ingress_record_drop_rate_5minAgg
KPI Detail | Measures the total ingress message drop rate aggregated over 5min |
---|---|
Metric Used for the KPI | sum(rate(kafka_stream_task_dropped_records_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m])) |
Service Operation | NA |
Response Code | NA |
Table 9-8 ocnadd_kafka_ingress_record_drop_rate_per_service_5minAgg
KPI Detail | Measures the total ingress message drop rate per service and pod aggregated over 5min |
---|---|
Metric Used for the KPI | sum(rate(kafka_stream_task_dropped_records_total{namespace="$NAMESPACE",service=~".*aggregation.*"}[5m])) by (service,pod) |
Service Operation | NA |
Response Code | NA |
Table 9-9 ocnadd_egress_request_count_total_by_3rdparty_destination_endpoint
KPI Detail | Total egress requests per 3rd party application per destination endpoint |
---|---|
Metric Used for the KPI | sum by (instance_identifier,destination_endpoint)(ocnadd_egress_requests_total{namespace="$NAMESPACE"}) |
Service Operation | POST |
Response Code | NA |
Table 9-10 ocnadd_egress_response_count_total_by_3rdparty_destination_endpoint
KPI Detail | Total egress responses per 3rd party application per destination endpoint |
---|---|
Metric Used for the KPI | sum by (instance_identifier,destination_endpoint)(ocnadd_egress_responses_total{namespace="$NAMESPACE"}) |
Service Operation | POST |
Response Code | NA |
Table 9-11 ocnadd_egress_failure_count_total_by_3rdparty_destination_endpoint
KPI Detail | Total egress failure count per 3rd party application per destination endpoint |
---|---|
Metric Used for the KPI | sum by (destination_endpoint,instance_identifier)(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}) |
Service Operation | POST |
Response Code | NA |
Table 9-12 ocnadd_egress_request_rate_by_3rdparty_5mAgg
KPI Detail | Total egress request rate per 3rd party application in 5min Aggregation |
---|---|
Metric Used for the KPI | sum by (instance_identifier)(rate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[5m])) |
Service Operation | POST |
Response Code | NA |
Table 9-13 ocnadd_egress_failure_rate_by_3rdparty_5mAgg
KPI Detail | Total egress failure rate per 3rd party application in 5min Aggregation |
---|---|
Metric Used for the KPI | sum by (instance_identifier)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[5m])) / sum by (instance_identifier)(irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[5m])) |
Service Operation | POST |
Response Code | NA |
Table 9-14 ocnadd_egress_failure_rate_by_3rdparty_per_destination_endpoint_5mAgg
KPI Detail | Total egress failure rate per 3rd party application per destination endpoint in 5min Aggregation |
---|---|
Metric Used for the KPI | sum by (instance_identifier, destination_endpoint)(irate(ocnadd_egress_failed_request_total{namespace="$NAMESPACE"}[5m])) / sum by (instance_identifier, destination_endpoint)(irate(ocnadd_egress_requests_total{namespace="$NAMESPACE"}[5m])) |
Service Operation | POST |
Response Code | NA |
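The failure-rate KPIs in Tables 9-13 and 9-14 divide the failed-request rate by the total request rate for each label set. The Python sketch below mimics that division for series keyed by (instance_identifier, destination_endpoint). The function name, the label values, and the per-second rates are all invented for illustration. As in PromQL, a series with no match on the other side is dropped, and 0/0 is reported as NaN.

```python
import math
from typing import Dict, Tuple

Labels = Tuple[str, str]  # (instance_identifier, destination_endpoint)


def failure_ratio(failed: Dict[Labels, float], total: Dict[Labels, float]) -> Dict[Labels, float]:
    """Divide failed-request rates by total-request rates, label set by label set."""
    ratios = {}
    for labels, failed_rate in failed.items():
        if labels not in total:
            continue  # no matching series on the right-hand side: dropped
        total_rate = total[labels]
        # A zero request rate gives no meaningful ratio; report NaN as PromQL does for 0/0.
        ratios[labels] = failed_rate / total_rate if total_rate else math.nan
    return ratios


# Invented per-second rates for two destination endpoints of one application.
failed_rates = {("app-a", "ep-1"): 2.0, ("app-a", "ep-2"): 0.0}
total_rates = {("app-a", "ep-1"): 100.0, ("app-a", "ep-2"): 50.0}

ratios = failure_ratio(failed_rates, total_rates)
print(ratios)
```

The result is a fraction between 0 and 1 per label set; multiply by 100 if a percentage is preferred on a dashboard.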
Table 9-15 ocnadd_e2e_avg_record_latency_by_3rdparty
KPI Detail | Total e2e average latency per 3rd party application in 5min Aggregation |
---|---|
Metric Used for the KPI | (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[5m])) by (instance_identifier) / (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[5m])) by (instance_identifier))) |
Service Operation | POST |
Response Code | NA |
Table 9-16 ocnadd_e2e_avg_record_latency_by_3rdparty_per_adapter_pod
KPI Detail | Total e2e average latency per 3rd party application per egress adapter POD in 5min Aggregation |
---|---|
Metric Used for the KPI | (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod) / (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod))) |
Service Operation | POST |
Response Code | NA |
Table 9-17 ocnadd_egress_adapter_processing_avg_record_latency_by_3rdparty_per_adapter_pod
KPI Detail | Total service processing average latency per 3rd party application per adapter POD in 5min Aggregation |
---|---|
Metric Used for the KPI | (sum (irate(ocnadd_egress_service_request_processing_latency_seconds_sum{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod) / (sum (irate(ocnadd_egress_service_request_processing_latency_seconds_count{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod))) |
Service Operation | POST |
Response Code | NA |
Table 9-18 ocnadd_egress_adapter_request_processing_avg_record_latency_by_3rdparty_per_adapter_pod
KPI Detail | Total request processing average latency per 3rd party application per adapter POD in 5min Aggregation. This includes the network latency added by the response from the 3rd party application |
---|---|
Metric Used for the KPI | (sum (irate(ocnadd_egress_request_latency_seconds_sum{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod) / (sum (irate(ocnadd_egress_request_latency_seconds_count{namespace="$NAMESPACE"}[5m])) by (instance_identifier,pod))) |
Service Operation | POST |
Response Code | NA |
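The average latency KPIs in Tables 9-15 through 9-18 share one pattern: the rate of a latency "_sum" series divided by the rate of the matching "_count" series. Because both rates are taken over the same window, the window length cancels and the result is the mean latency per record. A minimal Python sketch of that arithmetic, with a hypothetical function name and invented sample values:

```python
def mean_latency_seconds(sum_delta: float, count_delta: float) -> float:
    """Mean latency per record from the growth of the _sum and _count series."""
    if count_delta == 0:
        return float("nan")  # no records in the window, so no meaningful average
    return sum_delta / count_delta


# Invented growth over one window: 12.5 s of accumulated latency across 500 records.
avg = mean_latency_seconds(sum_delta=12.5, count_delta=500)
print(avg)
```

Here the 500 records took 12.5 seconds in total, so the mean works out to 25 milliseconds per record.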
Table 9-19 ocnadd_egress_e2e_avg_latency_95percentile_for_a_given_egress_adapter
KPI Detail | The 95th percentile of e2e latency in milliseconds for the egress adapter, calculated over a period of 5min |
---|---|
Metric Used for the KPI | histogram_quantile(0.95, sum(rate(ocnadd_egress_e2e_request_processing_latency_seconds_bucket{namespace="$namespaces",service="$servicename"}[5m])) by (le)) |
Service Operation | POST |
Response Code | NA |
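To illustrate how a percentile is estimated from cumulative histogram buckets, as in Table 9-19, the following Python sketch applies linear interpolation within the bucket that contains the requested rank. It is a simplified model of the idea and not the Prometheus implementation. The function name and bucket values are invented, and the bucket bounds are assumed to be in seconds, as the "_seconds_bucket" metric name suggests.

```python
import math
from typing import List, Tuple


def quantile_from_buckets(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper bound, cumulative count) pairs sorted by bound.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Invented cumulative bucket counts; bounds in seconds.
sample_buckets = [(0.05, 60), (0.1, 90), (0.25, 98), (0.5, 100), (math.inf, 100)]
p95_seconds = quantile_from_buckets(0.95, sample_buckets)
p95_millis = p95_seconds * 1000  # convert only if the buckets really are in seconds
print(p95_seconds, p95_millis)
```

With these sample buckets the estimate lands between the 0.1 s and 0.25 s bounds, which shows that the accuracy of the KPI depends on how finely the buckets are spaced around the percentile of interest.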
The following KPI should be used in the context of the management group and the worker group. The namespaces may differ for the management group and the worker group if there is no default worker group.
Table 9-20 Memory Usage per POD
KPI Detail | Measures the memory usage per POD in GiB (working set bytes divided by 1024*1024*1024) |
---|---|
Metric Used for the KPI | sum(container_memory_working_set_bytes{namespace=~"$Namespace",image!=""}/(1024*1024*1024)) by (pod) |
Service Operation | NA |
Response Code | NA |
This KPI should be used in the context of the management group and the worker group. The namespaces may differ for the management group and the worker group if there is no default worker group.
Table 9-21 CPU Usage per POD
KPI Detail | Measures the CPU usage per POD in millicores (2-minute rate of CPU seconds multiplied by 1000) |
---|---|
Metric Used for the KPI | sum(rate(container_cpu_usage_seconds_total{namespace=~"$Namespace",image!=""}[2m])) by (pod) * 1000 |
Service Operation | NA |
Response Code | NA |
Table 9-22 Service Status
KPI Detail | Provides the status of each Data Director service running in the specified namespace |
---|---|
Metric Used for the KPI | up{namespace="$NAMESPACE"} |
Service Operation | NA |
Response Code | NA |
Table 9-23 ocnadd_ext_kafka_feed_record_total per external feed rate(MPS)
KPI Detail | The rate of messages consumed per second per external Kafka consumer, calculated over a period of 5min |
---|---|
Metric Used for the KPI | sum(irate(ocnadd_ext_kafka_feed_record_total{namespace="$Namespace"}[5m])) by (feed_name) |
Service Operation | NA |
Response Code | NA |