6 CAPIF Alerts

This chapter includes information about the following CAPIF alerts:

Note:

  • The performance and capacity of the CAPIF system may vary based on the call model, feature or interface configuration, and underlying CNE and hardware environment.
  • Due to the unavailability of metrics or MQL queries, the following alerts are not supported for OCI:
    • OccapifNfStatusUnavailable
    • OccapifPodsRestart
    • OccapifEgressGatewayServiceDown
    • OccapifIngressGatewayServiceDown
    • OccapifAfManagerServiceDown
    • OccapifAPIManagerServiceDown
    • OccapifEventManagerServiceDown

6.1 System Level Alerts

This section lists the system level alerts for CAPIF.

6.1.1 OccapifNfStatusUnavailable

Table 6-1 OccapifNfStatusUnavailable

Field Details
Description 'CAPIF services unavailable'
Summary "namespace: {{$labels.namespace}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : All OCCAPIF services are unavailable."
Severity Critical
Condition All the CAPIF services are unavailable.
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring.

If this metric is not available, use a similar metric as exposed by the monitoring system.

Recommended Actions The alert is cleared automatically when the CAPIF services restart.

Steps:

  1. Check for service-specific alerts which may be causing the issues with service exposure.
  2. To check the orchestration logs for liveness or readiness probe failures, do the following:
    1. Run the following command to check the pod status:
      $ kubectl get po -n <namespace>
    2. Run the following command to analyze the error condition of the pod that is not in the Running state:
      $ kubectl describe pod <pod name not in Running state> -n <namespace>

      Where <pod name not in Running state> indicates the pod that is not in the Running state.

  3. Refer to the application logs on Kibana and check for database related failures such as connectivity and invalid secrets. The logs can be filtered based on the services.
  4. Check for helm status to make sure there are no errors:
    $ helm status <helm release name of the desired NF> -n <namespace>

    If it is not in “STATUS: DEPLOYED”, then capture the logs and events.

  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information on the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
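For a quick manual check of the metric that drives this alert, the 'up' series can be queried from the Prometheus HTTP API. The following command is a sample only; the Prometheus address is a placeholder, and the namespace label depends on how the CAPIF pods are scraped:

  # <prometheus-host>:<port> is a placeholder; adjust label filters to your scrape configuration
  $ curl -s 'http://<prometheus-host>:<port>/api/v1/query' \
      --data-urlencode 'query=up{namespace="<namespace>"}'

A value of 0 for an instance indicates that the corresponding endpoint is not being scraped successfully; the alert fires when this is true for all CAPIF services.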

6.1.2 OccapifPodsRestart

Table 6-2 OccapifPodsRestart

Field Details
Description 'Pod <Pod Name> has restarted.'
Summary "namespace: {{$labels.namespace}}, podname: {{$labels.pod}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : A Pod has restarted"
Severity Major
Condition A pod belonging to any of the CAPIF services has restarted.
Metric Used kube_pod_container_status_restarts_total
Recommended Actions

The alert is cleared automatically if the specific pod is up.

Steps:

  1. Refer to the application logs on Kibana and filter based on pod name, check for database related failures such as connectivity and Kubernetes secrets.
  2. To check the orchestration logs for liveness or readiness probe failures, do the following:
    1. Run the following command to check the pod status:
      $ kubectl get po -n <namespace>
    2. Run the following command to analyze the error condition of the pod that is not in the Running state:
      $ kubectl describe pod <pod name not in Running state> -n <namespace>

      Where <pod name not in Running state> indicates the pod that is not in the Running state.

  3. Check the database status. For more information, see Oracle Communications Cloud Native Core, cnDBTier User Guide.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information on the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
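To identify the pods that restart most frequently and to inspect the logs of the previous container instance, commands such as the following can be used; the sort path assumes a single-container pod and may need adjusting:

  # the JSONPath below assumes the first container in the pod is the one of interest
  $ kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'
  $ kubectl logs <pod name> -n <namespace> --previous

The --previous option shows the logs of the terminated container, which usually contain the error that caused the restart.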

6.1.3 OccapifTotalExternalIngressTrafficRateAboveMinorThreshold

Table 6-3 OccapifTotalExternalIngressTrafficRateAboveMinorThreshold

Field Details
Description "OCCAPIF External Ingress traffic rate is above the configured minor threshold i.e. 800 TPS (current value is: {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic rate is above 80 percent of max TPS (1000)"
Severity Minor
Condition The total CAPIF External Ingress traffic rate has crossed the configured minor threshold of 800 TPS.

Default value of this alert trigger point in Occapif Alert.yaml is 80% of 1000 (maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.39.1.3.5003
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions The alert is cleared either when the External Ingress traffic rate falls below the minor threshold or when it crosses the major threshold, in which case the OccapifTotalExternalIngressTrafficRateAboveMajorThreshold alert is raised.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

Reassess why the CAPIF is receiving additional traffic. If this alert is unexpected, contact My Oracle Support.
Steps:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine the increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.
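The current ingress request rate can be cross-checked directly against the metric used by this alert. The query below is a sketch; it sums the rate over all pods in the namespace and does not distinguish external from network traffic, so adjust the label filters to match your Ingress Gateway configuration:

  # <prometheus-host>:<port> and the label filters are placeholders
  $ curl -s 'http://<prometheus-host>:<port>/api/v1/query' \
      --data-urlencode 'query=sum(rate(oc_ingressgateway_http_requests_total{namespace="<namespace>"}[5m]))'

If the value stays close to the configured maximum (1000 TPS by default), reduce the offered load or re-dimension CAPIF before the major and critical thresholds are crossed.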

6.1.4 OccapifTotalNetworkIngressTrafficRateAboveMinorThreshold

Table 6-4 OccapifTotalNetworkIngressTrafficRateAboveMinorThreshold

Field Details
Description "OCCAPIF Network Ingress traffic rate is above the configured minor threshold i.e. 800 TPS (current value is: {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic rate is above 80 percent of max TPS (1000)"
Severity Minor
Condition The total CAPIF Network Ingress traffic rate has crossed the configured minor threshold of 800 TPS.

Default value of this alert trigger point in Occapif Alert.yaml is 80% of 1000 (maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.39.1.3.5004
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions The alert is cleared either when the Network Ingress traffic rate falls below the minor threshold or when it crosses the major threshold, in which case the OccapifTotalNetworkIngressTrafficRateAboveMajorThreshold alert is raised.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

Reassess why the CAPIF is receiving additional traffic. If this alert is unexpected, contact My Oracle Support.
Steps:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine the increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.

6.1.5 OccapifTotalExternalIngressTrafficRateAboveMajorThreshold

Table 6-5 OccapifTotalExternalIngressTrafficRateAboveMajorThreshold

Field Details
Description "OCCAPIF External Ingress traffic rate is above the configured major threshold i.e. 900 TPS (current value is: {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic rate is above 90 percent of max TPS (1000)"
Severity Major
Condition The total CAPIF External Ingress traffic rate has crossed the configured major threshold of 900 TPS.

Default value of this alert trigger point in Occapif Alert.yaml is 90% of 1000 (maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.39.1.3.5005
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions The alert is cleared either when the External Ingress traffic rate falls below the major threshold or when it crosses the critical threshold, in which case the OccapifTotalExternalIngressTrafficRateAboveCriticalThreshold alert is raised.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

Reassess why the CAPIF is receiving additional traffic. If this alert is unexpected, contact My Oracle Support.
Steps:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine the increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.

6.1.6 OccapifTotalNetworkIngressTrafficRateAboveMajorThreshold

Table 6-6 OccapifTotalNetworkIngressTrafficRateAboveMajorThreshold

Field Details
Description "OCCAPIF Network Ingress traffic rate is above the configured major threshold i.e. 900 TPS (current value is: {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic Rate is above 90 percent of max TPS (1000)"
Severity Major
Condition The total CAPIF Network Ingress traffic rate has crossed the configured major threshold of 900 TPS.

Default value of this alert trigger point in Occapif Alert.yaml is 90% of 1000 (maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.39.1.3.5006
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions The alert is cleared either when the Network Ingress traffic rate falls below the major threshold or when it crosses the critical threshold, in which case the OccapifTotalNetworkIngressTrafficRateAboveCriticalThreshold alert is raised.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

Reassess why the CAPIF is receiving additional traffic. If this alert is unexpected, contact My Oracle Support.
Steps:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine the increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.

6.1.7 OccapifTotalExternalIngressTrafficRateAboveCriticalThreshold

Table 6-7 OccapifTotalExternalIngressTrafficRateAboveCriticalThreshold

Field Details
Description "OCCAPIF External Ingress traffic rate is above the configured critical threshold i.e. 950 TPS (current value is: {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic rate is above 95 percent of max TPS (1000)"
Severity Critical
Condition The total CAPIF External Ingress traffic rate has crossed the configured critical threshold of 950 TPS.

Default value of this alert trigger point in Occapif Alert.yaml is 95% of 1000 (maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.39.1.3.5007
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions The alert is cleared when the External Ingress traffic rate falls below the critical threshold.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

Reassess why the CAPIF is receiving additional traffic. If this alert is unexpected, contact My Oracle Support.
Steps:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine the increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.

6.1.8 OccapifTotalNetworkIngressTrafficRateAboveCriticalThreshold

Table 6-8 OccapifTotalNetworkIngressTrafficRateAboveCriticalThreshold

Field Details
Description "OCCAPIF Network Ingress traffic rate is above the configured critical threshold i.e. 950 TPS (current value is: {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic rate is above 95 percent of max TPS (1000)"
Severity Critical
Condition The total CAPIF Network Ingress traffic rate has crossed the configured critical threshold of 950 TPS.

Default value of this alert trigger point in Occapif Alert.yaml is 95% of 1000 (maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.39.1.3.5008
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions The alert is cleared when the Network Ingress traffic rate falls below the critical threshold.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

Reassess why the CAPIF is receiving additional traffic. If this alert is unexpected, contact My Oracle Support.
Steps:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine the increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.

6.1.9 OccapifExternalIngressTransactionErrorRateAboveZeroPointOnePercent

Table 6-9 OccapifExternalIngressTransactionErrorRateAboveZeroPointOnePercent

Field Details
Description "OCCAPIF External Ingress transaction error rate is above 0.1 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 0.1 percent of total transactions"
Severity Warning
Condition The number of failed External Ingress transactions is above 0.1 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5009
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed External Ingress transactions is below 0.1 percent of the total transactions or when the number of failed transactions crosses the 1 percent threshold, in which case the OccapifExternalIngressTransactionErrorRateAbove1Percent alert is raised.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.
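The error percentage can be approximated from the response counter used by this alert. The query below is a sketch; it assumes the counter carries a status-code label (shown here as response_code, which may be named differently in your deployment) that identifies 4xx and 5xx responses:

  # <prometheus-host>:<port> and the response_code label are assumptions; verify against your metric labels
  $ curl -s 'http://<prometheus-host>:<port>/api/v1/query' \
      --data-urlencode 'query=100 * sum(rate(oc_ingressgateway_http_responses_total{response_code=~"4..|5.."}[5m])) / sum(rate(oc_ingressgateway_http_responses_total[5m]))'

Grouping the same query by service or method labels, if they are present on the metric, helps to isolate the failing API.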

6.1.10 OccapifNetworkIngressTransactionErrorRateAboveZeroPointOnePercent

Table 6-10 OccapifNetworkIngressTransactionErrorRateAboveZeroPointOnePercent

Field Details
Description "OCCAPIF Network Ingress transaction error rate is above 0.1 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 0.1 percent of total transactions"
Severity Warning
Condition The number of failed Network Ingress transactions is above 0.1 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5010
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed Network Ingress transactions is below 0.1 percent of the total transactions or when the number of failed transactions crosses the 1 percent threshold, in which case the OccapifNetworkIngressTransactionErrorRateAbove1Percent alert is raised.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.11 OccapifExternalIngressTransactionErrorRateAbove1Percent

Table 6-11 OccapifExternalIngressTransactionErrorRateAbove1Percent

Field Details
Description "OCCAPIF External Ingress transaction error rate is above 1 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 1 percent of total transactions"
Severity Warning
Condition The number of failed External Ingress transactions is above 1 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5011
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed External Ingress transactions is below 1 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.12 OccapifNetworkIngressTransactionErrorRateAbove1Percent

Table 6-12 OccapifNetworkIngressTransactionErrorRateAbove1Percent

Field Details
Description "OCCAPIF Network Ingress transaction error rate is above 1 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 1 percent of total transactions"
Severity Warning
Condition The number of failed Network Ingress transactions is above 1 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5012
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed Network Ingress transactions is below 1 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.13 OccapifExternalIngressTransactionErrorRateAbove10Percent

Table 6-13 OccapifExternalIngressTransactionErrorRateAbove10Percent

Field Details
Description "OCCAPIF External Ingress transaction error rate is above 10 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 10 percent of total transactions"
Severity Minor
Condition The number of failed External Ingress transactions is above 10 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5013
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed External Ingress transactions is below 10 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.14 OccapifNetworkIngressTransactionErrorRateAbove10Percent

Table 6-14 OccapifNetworkIngressTransactionErrorRateAbove10Percent

Field Details
Description "OCCAPIF Network Ingress transaction error rate is above 10 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 10 percent of total transactions"
Severity Minor
Condition The number of failed Network Ingress transactions is above 10 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5014
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed Network Ingress transactions is below 10 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.15 OccapifExternalIngressTransactionErrorRateAbove25Percent

Table 6-15 OccapifExternalIngressTransactionErrorRateAbove25Percent

Field Details
Description "OCCAPIF External Ingress transaction error rate detected above 25 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 25 percent of total transactions"
Severity Major
Condition The number of failed External Ingress transactions is above 25 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5015
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed External Ingress transactions is below 25 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.16 OccapifNetworkIngressTransactionErrorRateAbove25Percent

Table 6-16 OccapifNetworkIngressTransactionErrorRateAbove25Percent

Field Details
Description "OCCAPIF Network Ingress transaction error rate detected above 25 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 25 percent of total transactions"
Severity Major
Condition The number of failed Network Ingress transactions is above 25 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5016
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed Network Ingress transactions is below 25 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.17 OccapifExternalIngressTransactionErrorRateAbove50Percent

Table 6-17 OccapifExternalIngressTransactionErrorRateAbove50Percent

Field Details
Description "OCCAPIF External Ingress transaction error rate detected above 50 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 50 percent of total transactions"
Severity Critical
Condition The number of failed External Ingress transactions is above 50 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5017
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed External Ingress transactions is below 50 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.18 OccapifNetworkIngressTransactionErrorRateAbove50Percent

Table 6-18 OccapifNetworkIngressTransactionErrorRateAbove50Percent

Field Details
Description "OCCAPIF Network Ingress transaction error rate detected above 50 percent of total transactions (current value is {{ $value }})"
Summary "timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction error rate detected above 50 percent of total transactions"
Severity Critical
Condition The number of failed Network Ingress transactions is above 50 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5018
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed Network Ingress transactions is below 50 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.
  2. Check the metrics per service and per method.
  3. If guidance is required, contact My Oracle Support.

6.1.19 OccapifEgressGatewayServiceDown

Table 6-19 OccapifEgressGatewayServiceDown

Field Details
Description "CAPIF Egress-Gateway service {{$labels.app_kubernetes_io_name}} is down"
Summary "kubernetes_namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Egress-Gateway service down"
Severity Critical
Condition None of the pods of the Egress Gateway microservice is available.
Metric Used 'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Recommended Actions

The alert is cleared when the Egress Gateway service is available.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

Steps:

  1. To check the orchestration logs of the Egress Gateway service for liveness or readiness probe failures, do the following:
    1. Run the following command to check the pod status:
      $ kubectl get po -n <namespace>
    2. Run the following command to analyze the error condition of the pod that is not in the Running state:
      $ kubectl describe pod <pod name not in Running state> -n <namespace>

      Where <pod name not in Running state> indicates the pod that is not in the Running state.

  2. Refer to the application logs on Kibana and filter based on Egress Gateway service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
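To confirm quickly whether any Egress Gateway pod is ready to serve traffic, the deployment and its endpoints can be inspected. The label selector below is an example and must match the labels applied by your Helm release:

  # the "egress" filter and the label selector are assumptions; adjust to your release names and labels
  $ kubectl get deploy,endpoints -n <namespace> | grep -i egress
  $ kubectl get pods -n <namespace> -l app.kubernetes.io/name=<egress-gateway-name> -o wide

An Endpoints object with no addresses indicates that no pod of the service is in the Ready state.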

6.1.20 OccapifMemoryUsageCrossedMinorThreshold

Table 6-20 OccapifMemoryUsageCrossedMinorThreshold

Field Details
Description "CAPIF Memory Usage for pod {{ $labels.pod }} has crossed the configured minor threshold (50%) (value={{ $value }}) of its limit."
Summary "namespace: {{$labels.namespace}}, podname: {{$labels.pod}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Memory Usage of pod exceeded 50% of its limit."
Severity Minor
Condition A pod has reached the configured minor threshold (50%) of its memory resource limits.
Metric Used

'container_memory_usage_bytes'

'container_spec_memory_limit_bytes'

Note: These are Kubernetes metrics used for instance availability monitoring. If the metrics are not available, use similar metrics as exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the minor threshold or crosses the major threshold, in which case the OccapifMemoryUsageCrossedMajorThreshold alert is raised.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note: Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
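The utilization that drives the memory alerts can be reproduced with kubectl (if Metrics Server is installed) or with a Prometheus query built from the two metrics listed above. The query is a sketch; label names such as container follow current cAdvisor conventions and may differ in your environment:

  # <prometheus-host>:<port> is a placeholder; the container!="" filter drops pod-level cgroup entries
  $ kubectl top pod -n <namespace>
  $ curl -s 'http://<prometheus-host>:<port>/api/v1/query' \
      --data-urlencode 'query=100 * container_memory_usage_bytes{namespace="<namespace>",container!=""} / container_spec_memory_limit_bytes{namespace="<namespace>",container!=""}'

A result above 50 for a CAPIF pod corresponds to the minor threshold; 60 and 70 correspond to the major and critical thresholds.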

6.1.21 OccapifMemoryUsageCrossedMajorThreshold

Table 6-21 OccapifMemoryUsageCrossedMajorThreshold

Field Details
Description "CAPIF Memory Usage for pod {{ $labels.pod }} has crossed the configured major threshold (60%) (value = {{ $value }}) of its limit."
Summary "namespace: {{$labels.namespace}}, podname: {{$labels.pod}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Memory Usage of pod exceeded 60% of its limit."
Severity Major
Condition A pod has reached the configured major threshold (60%) of its memory resource limits.
OID 1.3.6.1.4.1.323.5.3.39.1.3.5021
Metric Used

'container_memory_usage_bytes'

'container_spec_memory_limit_bytes'

Note: These are Kubernetes metrics used for instance availability monitoring. If the metrics are not available, use similar metrics as exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the major threshold or crosses the critical threshold, in which case the OccapifMemoryUsageCrossedCriticalThreshold alert is raised.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note: Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.1.22 OccapifMemoryUsageCrossedCriticalThreshold

Table 6-22 OccapifMemoryUsageCrossedCriticalThreshold

Field Details
Description "CAPIF Memory Usage for pod {{ $labels.pod }} has crossed the configured major threshold (70%) (value = {{ $value }}) of its limit."
Summary "namespace: {{$labels.namespace}}, podname: {{$labels.pod}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Memory Usage of pod exceeded 70% of its limit."
Severity Critical
Condition A pod has reached the configured critical threshold (70%) of its memory resource limits.
Metric Used

'container_memory_usage_bytes'

'container_spec_memory_limit_bytes'

Note: These are Kubernetes metrics used for instance availability monitoring. If the metrics are not available, use similar metrics as exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the critical threshold.

Note: The threshold is configurable in the Occapif Alert.yaml alert file.

In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note: Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.1.23 OccapifIngressGatewayServiceDown

Table 6-23 OccapifIngressGatewayServiceDown

Field Details
Description "CAPIF Ingress-Gateway service {{$labels.app_kubernetes_io_name}} is down"
Summary "kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ingress-gateway service down"
Severity Critical
Condition None of the pods of the Ingress-Gateway microservice is available.
Metric Used 'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Recommended Actions

The alert is cleared when the Ingress Gateway service is available.

Steps:

  1. To check the orchestration logs of the Ingress Gateway service for liveness or readiness probe failures, do the following:
    1. Run the following command to check the pod status:
      $ kubectl get po -n <namespace>
    2. Run the following command to analyze the error condition of the pod that is not in the Running state:
      $ kubectl describe pod <pod name not in Running state> -n <namespace>

      Where <pod name not in Running state> indicates the pod that is not in the Running state.

  2. Refer to the application logs on Kibana and filter based on Ingress Gateway service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.1.24 OccapifApiManagerServiceDown

Table 6-24 OccapifApiManagerServiceDown

Field Details
Description "CAPIF API Manager service {{$labels.app_kubernetes_io_name}} is down"
Summary "namespace: {{$labels.namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : AF Manager service down"
Severity Critical
Condition The API Manager service is down.
Metric Used 'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Recommended Actions

The alert is cleared when the CAPIF API Manager service is available.

Steps:

  1. To check the orchestration logs of the occapif_apimgr service for liveness or readiness probe failures, do the following:
    1. Run the following command to check the pod status:
      $ kubectl get pod -n <namespace>
    2. Run the following command to analyze the error condition of the pod that is not in the Running state:
      $ kubectl describe pod <pod name not in Running state> -n <namespace>

      Where <pod name not in Running state> indicates the pod that is not in the Running state.

  2. Refer to the application logs on Kibana and filter based on occapif_apimgr service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Check the DB status. For more information on how to check the DB status, see Oracle Communications Cloud Native Core, cnDBTier User Guide.
  4. Depending on the failure reason, take the resolution steps.
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.1.25 OccapifAfManagerServiceDown

Table 6-25 OccapifAfManagerServiceDown

Field Details
Description "CAPIF AF Manager service {{$labels.app_kubernetes_io_name}} is down"
Summary "kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : AF Manager service down"
Severity Critical
Condition The AF Manager service is down.
Metric Used 'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Recommended Actions

The alert is cleared when the CAPIF AF Manager service is available.

Steps:

  1. To check the orchestration logs of the occapif_afmgr service for liveness or readiness probe failures, do the following:
    1. Run the following command to check the pod status:
      $ kubectl get pod -n <namespace>
    2. Run the following command to analyze the error condition of the pod that is not in the Running state:
      $ kubectl describe pod <pod name not in Running state> -n <namespace>

      Where <pod name not in Running state> indicates the pod that is not in the Running state.

  2. Refer to the application logs on Kibana and filter based on occapif_afmgr service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Check the DB status. For more information on how to check the DB status, see Oracle Communications Cloud Native Core, cnDBTier User Guide.
  4. Depending on the failure reason, take the resolution steps.
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.1.26 OccapifEventManagerServiceDown

Table 6-26 OccapifEventManagerServiceDown

Field Details
Description "CAPIF API Manager service {{$labels.app_kubernetes_io_name}} is down"
Summary "kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : API Manager service down"
Severity Critical
Condition The Event Manager service is down.
Metric Used 'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.

Recommended Actions

The alert is cleared when the CAPIF Event Manager service is available.

Steps:

  1. To check the orchestration logs of the occapif_eventmanager service for liveness or readiness probe failures, do the following:
    1. Run the following command to check the pod status:
      $ kubectl get pod -n <namespace>
    2. Run the following command to analyze the error condition of the pod that is not in the Running state:
      $ kubectl describe pod <pod name not in Running state> -n <namespace>

      Where <pod name not in Running state> indicates the pod that is not in the Running state.

  2. Refer to the application logs on Kibana and filter based on occapif_eventmanager service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Check the DB status. For more information on how to check the DB status, see Oracle Communications Cloud Native Core, cnDBTier User Guide.
  4. Depending on the failure reason, take the resolution steps.
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.2 Application Level Alerts

This section lists the application level alerts for CAPIF.

6.2.1 AfMgrOnboardingOauthValidationFailureRateCrossedThreshold

Table 6-27 AfMgrOnboardingOauthValidationFailureRateCrossedThreshold

Field Details
Description "Failure Rate of AI Onboarding Oauth Validation Is Crossing the Threshold (10%)"
Summary "namespace: {{$labels.namespace}}, timestamp: {{ with query \"time()\" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Failure Rate Of Onboarding is above 10 percent of total requests."
Severity Error
Condition The failure rate of API Invoker onboarding has crossed the configured threshold value (10 percent).
Metric Used occapif_afmgr_resp_total
Recommended Actions

The alert is cleared when the failure rate of API invoker onboarding is below the threshold.

Steps:
  1. Check the pod logs on Kibana for ERROR and WARN logs.
  2. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
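The onboarding failure rate can be approximated from the occapif_afmgr_resp_total counter. The query below is only a sketch: the status_code label and its success values are assumptions, so verify the actual label set exposed by the AF Manager service before relying on it:

  # <prometheus-host>:<port> and the status_code label are assumptions; check the exposed metric labels first
  $ curl -s 'http://<prometheus-host>:<port>/api/v1/query' \
      --data-urlencode 'query=100 * sum(rate(occapif_afmgr_resp_total{status_code!~"2.."}[5m])) / sum(rate(occapif_afmgr_resp_total[5m]))'

A returned value above 10 corresponds to the condition that raises this alert.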