5 Alerts

This section provides information about alerts for Oracle Communications Cloud Native Core, Network Slice Selection Function (NSSF).

5.1 System Level Alerts

This section lists the system level alerts.

5.1.1 OcnssfNfStatusUnavailable

Table 5-1 OcnssfNfStatusUnavailable

Field Details
Description 'OCNSSF services unavailable'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : All OCNSSF services are unavailable.'
Severity Critical
Condition All the NSSF services are unavailable, either because the NSSF is being deployed or purged. The NSSF services considered are nssfselection, nssfsubscription, nssfavailability, nssfconfiguration, appinfo, ingressgateway, and egressgateway.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9001
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions The alert is cleared automatically when the NSSF services start becoming available.

Steps:

  1. Check for service-specific alerts that may be causing issues with service exposure.
  2. Run the following command to check if the pod’s status is in “Running” state:
    kubectl -n <namespace> get pod

    If it is not in running state, capture the pod logs and events.

    Run the following command to fetch the events:

    kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
  3. Refer to the application logs on Kibana and check for database-related failures such as connectivity issues, invalid secrets, and so on. The logs can be filtered based on the services.
  4. Run the following command to check Helm status and make sure there are no errors:
    helm status <helm release name of the desired NF> -n <namespace>

    If the status is not “DEPLOYED”, capture the logs and events again.

  5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
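As an illustration only, a Prometheus rule expressing this condition could be sketched as follows. This is not the rule shipped with NSSF; the group name, `service` label, and selector regex are assumptions that must be adapted to the actual scrape configuration:

```yaml
groups:
  - name: ocnssf-system-alerts
    rules:
      - alert: OcnssfNfStatusUnavailable
        # Fires when no NSSF service instance reports up == 1, that is,
        # all services are unavailable (deployment or purge in progress).
        # The 'service' label and regex are assumed, not standard.
        expr: sum(up{service=~"nssfselection|nssfsubscription|nssfavailability|nssfconfiguration|appinfo|ingressgateway|egressgateway"}) == 0
        labels:
          severity: critical
```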

5.1.2 OcnssfPodsRestart

Table 5-2 OcnssfPodsRestart

Field Details
Description 'Pod <Pod Name> has restarted.'
Summary 'kubernetes_namespace: {{$labels.namespace}}, podname: {{$labels.pod}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : A Pod has restarted'
Severity Major
Condition A pod belonging to any of the NSSF services has restarted.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9002
Metric Used 'kube_pod_container_status_restarts_total'

Note: This is a Kubernetes metric. If this metric is not available, use a similar metric exposed by the monitoring system.
Recommended Actions

The alert is cleared automatically if the specific pod is up.

Steps:

  1. Refer to the application logs on Kibana and filter based on the pod name. Check for database-related failures such as connectivity issues, invalid Kubernetes secrets, and so on.
  2. Run the following command to check orchestration logs for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <desired full pod name> -n <namespace>
  3. Check the database status. For more information, see "Oracle Communications Cloud Native Core, cnDBTier User Guide".
  4. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
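As an illustration, a restart-detection rule over this metric could be sketched like this. The 5-minute window is an assumption, and the namespace placeholder follows the document's convention; this is not the rule shipped with NSSF:

```yaml
- alert: OcnssfPodsRestart
  # Fires when any container in the NSSF namespace has restarted
  # within the last 5 minutes (window chosen for illustration).
  expr: increase(kube_pod_container_status_restarts_total{namespace="<namespace>"}[5m]) > 0
  labels:
    severity: major
```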

5.1.3 OcnssfSubscriptionServiceDown

Table 5-3 OcnssfSubscriptionServiceDown

Field Details
Description 'OCNSSF Subscription service <ocnssf-nssubscription> is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : NssfSubscriptionServiceDown service down'
Severity Critical
Condition The NssfSubscription service is unavailable.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9003
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions The alert is cleared when the NssfSubscription service is available.

Steps:

  1. Check whether NF service-specific alerts, such as OcnssfSubscriptionServiceDown, are generated to identify which service is down.

  2. Run the following command to check the orchestration logs of the nssfsubscription service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  3. Run the following command to check if the pod’s status is in “Running” state:
    kubectl -n <namespace> get pod

    If it is not in running state, capture the pod logs and events.

    Run the following command to fetch events:

    kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
  4. Refer to the application logs on Kibana and filter based on the above service names. Check for ERROR and WARNING logs for each of these services.
  5. Check the database status. For more information, see "Oracle Communications Cloud Native Core, cnDBTier User Guide".
  6. Refer to the application logs on Kibana and check for the service status of the nssfConfig service.
  7. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.4 OcnssfSelectionServiceDown

Table 5-4 OcnssfSelectionServiceDown

Field Details
Description 'OCNSSF Selection service <ocnssf-nsselection> is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : OcnssfSelectionServiceDown service down'
Severity Critical
Condition None of the pods of the NSSFSelection microservice is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9004
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions The alert is cleared when the ocnssf-nsselection service is available.

Steps:

  1. Run the following command to check the orchestration logs of ocnssf-nsselection service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on ocnssf-nsselection service names. Check for ERROR and WARNING logs.
  3. Check the database status. For more information, see "Oracle Communications Cloud Native Core, cnDBTier User Guide".
  4. Depending on the failure reason, take the resolution steps.
  5. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.5 OcnssfAvailabilityServiceDown

Table 5-5 OcnssfAvailabilityServiceDown

Field Details
Description 'Ocnssf Availability service ocnssf-nsavailability is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : NssfAvailability service down'
Severity Critical
Condition None of the pods of the NssfAvailability microservice is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9005
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions The alert is cleared when the ocnssf-nsavailability service is available.

Steps:

  1. Run the following command to check the orchestration logs of ocnssf-nsavailability service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on ocnssf-nsavailability service names. Check for ERROR and WARNING logs.
  3. Check the database status. For more information, see "Oracle Communications Cloud Native Core, cnDBTier User Guide".
  4. Depending on the failure reason, take the resolution steps.
  5. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.6 OcnssfConfigurationServiceDown

Table 5-6 OcnssfConfigurationServiceDown

Field Details
Description 'OCNSSF Config service nssfconfiguration is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : OcnssfConfigServiceDown service down'
Severity Critical
Condition None of the pods of the NssfConfiguration microservice is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9006
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the nssfconfiguration service is available.

Steps:

  1. Run the following command to check the orchestration logs of nssfconfiguration service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on nssfconfiguration service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Check the database status. For more information, see "Oracle Communications Cloud Native Core, cnDBTier User Guide".
  4. Depending on the reason of failure, take the resolution steps.
  5. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.7 OcnssfAppInfoServiceDown

Table 5-7 OcnssfAppInfoServiceDown

Field Details
Description 'OCNSSF Appinfo service appinfo is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Appinfo service down'
Severity Critical
Condition None of the pods of the App Info microservice is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9025
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the appinfo service is available.

Steps:

  1. Run the following command to check the orchestration logs of appinfo service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.8 OcnssfIngressGatewayServiceDown

Table 5-8 OcnssfIngressGatewayServiceDown

Field Details
Description 'Ocnssf Ingress-Gateway service ingressgateway is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : OcnssfIngressGwServiceDown service down'
Severity Critical
Condition None of the pods of the Ingress-Gateway microservice is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9007
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the ingressgateway service is available.

Steps:

  1. Run the following command to check the orchestration logs of ingress-gateway service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on ingress-gateway service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.9 OcnssfEgressGatewayServiceDown

Table 5-9 OcnssfEgressGatewayServiceDown

Field Details
Description 'OCNSSF Egress service egressgateway is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : OcnssfEgressGwServiceDown service down'
Severity Critical
Condition None of the pods of the Egress-Gateway microservice is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9008
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the egressgateway service is available.

Note: The threshold is configurable in the alerts.yaml

Steps:

  1. Run the following command to check the orchestration logs of egress-gateway service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on egress-gateway service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.10 OcnssfOcpmConfigServiceDown

Table 5-10 OcnssfOcpmConfigServiceDown

Field Details
Description 'OCNSSF OCPM Config service is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ocnssf OCPM Config service down'
Severity Critical
Condition None of the pods of the ConfigService is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9027
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the ConfigService is available.

Note: The threshold is configurable in the alerts.yaml

Steps:

  1. Run the following command to check the orchestration logs of ConfigService service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on ConfigService service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.11 OcnssfPerfInfoServiceDown

Table 5-11 OcnssfPerfInfoServiceDown

Field Details
Description 'OCNSSF PerfInfo service is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ocnssf PerfInfo service down'
Severity Critical
Condition None of the pods of the PerfInfo service is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9026
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the PerfInfo service is available.

Note: The threshold is configurable in the alerts.yaml

Steps:

  1. Run the following command to check the orchestration logs of PerfInfo service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on PerfInfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.12 OcnssfNrfClientManagementServiceDown

Table 5-12 OcnssfNrfClientManagementServiceDown

Field Details
Description 'OCNSSF NrfClient Management service is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ocnssf NrfClient Management service down'
Severity Critical
Condition None of the pods of the NrfClientManagement service is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9024
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the NrfClientManagement service is available.

Note: The threshold is configurable in the alerts.yaml

Steps:

  1. Run the following command to check the orchestration logs of NrfClientManagement service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on NrfClientManagement service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.13 OcnssfAlternateRouteServiceDown

Table 5-13 OcnssfAlternateRouteServiceDown

Field Details
Description 'OCNSSF Alternate Route service is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ocnssf Alternate Route service down'
Severity Critical
Condition None of the pods of the Alternate Route service is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9023
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the Alternate Route service is available.

Note: The threshold is configurable in the alerts.yaml

Steps:

  1. Run the following command to check the orchestration logs of Alternate Route service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on Alternate Route service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.14 OcnssfAuditorServiceDown

Table 5-14 OcnssfAuditorServiceDown

Field Details
Description 'OCNSSF NsAuditor service is down'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ocnssf NsAuditor service down'
Severity Critical
Condition None of the pods of the NsAuditor service is available.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9022
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions

The alert is cleared when the NsAuditor service is available.

Note: The threshold is configurable in the alerts.yaml

Steps:

  1. Run the following command to check the orchestration logs of NsAuditor service and check for liveness or readiness probe failures:
    kubectl get po -n <namespace>

    Note the full name of the pod that is not running, and use it in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>
  2. Refer to the application logs on Kibana and filter based on NsAuditor service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.1.15 OcnssfTotalIngressTrafficRateAboveMinorThreshold

Table 5-15 OcnssfTotalIngressTrafficRateAboveMinorThreshold

Field Details
Description 'Ingress traffic Rate is above the configured minor threshold i.e. 64000 requests per second (current value is: {{ $value }})'
Summary 'timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic Rate is above 80 Percent of Max requests per second(80000)'
Severity Minor
Condition

The total Ocnssf Ingress Message rate has crossed the configured minor threshold of 64000 TPS.

The default value of this alert trigger point in NssfAlertValues.yaml is when the Ocnssf Ingress Rate crosses 80% of 80000 (the maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.40.1.2.9009
Metric Used 'oc_ingressgateway_http_requests_total'
Recommended Actions

The alert is cleared either when the total Ingress traffic rate falls below the minor threshold or when the total traffic rate crosses the major threshold, in which case the OcnssfTotalIngressTrafficRateAboveMajorThreshold alert shall be raised.

Note: The threshold is configurable in the alerts.yaml

Steps:

Reassess the reason why the NSSF is receiving additional traffic, for example, the mated site NSSF is unavailable in the georedundancy scenario.

If this is unexpected, contact My Oracle Support.

  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine whether there is an increase in 4xx and 5xx error codes.
  3. Check Ingress Gateway logs on Kibana to determine the reason for the errors.
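The minor-threshold condition can be sketched as a Prometheus rule over the metric named above. This is a hedged illustration, not the shipped rule; the 2-minute rate window is an assumption:

```yaml
- alert: OcnssfTotalIngressTrafficRateAboveMinorThreshold
  # 64000 TPS is 80 percent of the 80000 TPS maximum ingress request rate.
  expr: sum(rate(oc_ingressgateway_http_requests_total[2m])) >= 64000
  labels:
    severity: minor
```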

5.1.16 OcnssfTotalIngressTrafficRateAboveMajorThreshold

Table 5-16 OcnssfTotalIngressTrafficRateAboveMajorThreshold

Field Details
Description 'Ingress traffic Rate is above the configured major threshold i.e. 72000 requests per second (current value is: {{ $value }})'
Summary 'timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic Rate is above 90 Percent of Max requests per second(80000)'
Severity Major
Condition

The total Ocnssf Ingress Message rate has crossed the configured major threshold of 72000 TPS.

The default value of this alert trigger point in NssfAlertValues.yaml is when the Ocnssf Ingress Rate crosses 90% of 80000 (the maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.40.1.2.9010
Metric Used 'oc_ingressgateway_http_requests_total'
Recommended Actions

The alert is cleared when the total Ingress traffic rate falls below the major threshold or when the total traffic rate crosses the critical threshold, in which case the OcnssfTotalIngressTrafficRateAboveCriticalThreshold alert shall be raised.

Note: The threshold is configurable in the alerts.yaml

Steps:

Reassess the reason why the NSSF is receiving additional traffic, for example, the mated site NSSF is unavailable in the georedundancy scenario.

If this is unexpected, contact My Oracle Support.

  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine whether there is an increase in 4xx and 5xx error codes.
  3. Check Ingress Gateway logs on Kibana to determine the reason for the errors.

5.1.17 OcnssfTotalIngressTrafficRateAboveCriticalThreshold

Table 5-17 OcnssfTotalIngressTrafficRateAboveCriticalThreshold

Field Details
Description 'Ingress traffic Rate is above the configured critical threshold i.e. 76000 requests per second (current value is: {{ $value }})'
Summary 'timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Traffic Rate is above 95 Percent of Max requests per second(80000)'
Severity Critical
Condition

The total Ocnssf Ingress Message rate has crossed the configured critical threshold of 76000 TPS.

The default value of this alert trigger point in NssfAlertValues.yaml is when the Ocnssf Ingress Rate crosses 95% of 80000 (the maximum ingress request rate).

OID 1.3.6.1.4.1.323.5.3.40.1.2.9011
Metric Used 'oc_ingressgateway_http_requests_total'
Recommended Actions

The alert is cleared when the Ingress traffic rate falls below the critical threshold.

Note: The threshold is configurable in the alerts.yaml

Steps:

Reassess the reason why the NSSF is receiving additional traffic, for example, the mated site NSSF is unavailable in the georedundancy scenario.

If this is unexpected, contact My Oracle Support.

  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine whether there is an increase in 4xx and 5xx error codes.
  3. Check Ingress Gateway logs on Kibana to determine the reason for the errors.
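Across the three traffic-rate alerts above, the minor, major, and critical trigger points are fixed percentages of the same 80000 TPS maximum ingress request rate. A minimal sketch of that relationship (the function name is illustrative, not part of the product):

```python
def thresholds(max_tps: int = 80000) -> dict:
    """Derive the default minor/major/critical ingress-rate trigger
    points as 80/90/95 percent of the maximum ingress request rate."""
    return {
        "minor": round(max_tps * 0.80),     # 64000 TPS
        "major": round(max_tps * 0.90),     # 72000 TPS
        "critical": round(max_tps * 0.95),  # 76000 TPS
    }

print(thresholds())
```

Changing the maximum rate in this sketch shifts all three trigger points proportionally, which mirrors how the configurable thresholds track the deployment's rated capacity.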

5.1.18 OcnssfTransactionErrorRateAbove1Percent

Table 5-18 OcnssfTransactionErrorRateAbove1Percent

Field Details
Description Transaction Error rate is above 1 Percent of Total Transactions
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Warning
Condition The number of failed transactions has crossed the warning threshold of 1 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9012
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below the 1% threshold of the total transactions or when the failed transactions cross the 10% threshold, in which case the OcnssfTransactionErrorRateAbove10Percent alert shall be raised.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    For example: ocnssf_nsselection_success_tx_total with statusCode != 2xx.

  2. Verify the metrics per service, per method.

    For example: Slice selection requests can be deduced from the following metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    NFServiceType="ocnssf-nsselection"

    Route_path="/nnssf-nsselection/v2/**"

    Status="503 SERVICE_UNAVAILABLE"

  3. If guidance is required, contact My Oracle Support.
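For reference, the 1 percent condition could be sketched as a ratio rule over the metric named above. This is an assumed illustration, not the shipped rule; the `Status` label name follows the example above, and the 5-minute window is an assumption:

```yaml
- alert: OcnssfTransactionErrorRateAbove1Percent
  # Ratio of non-2xx responses to all responses over a 5-minute window.
  expr: >
    sum(rate(oc_ingressgateway_http_responses_total{Status!~"2.*"}[5m]))
      /
    sum(rate(oc_ingressgateway_http_responses_total[5m])) > 0.01
  labels:
    severity: warning
```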

5.1.19 OcnssfTransactionErrorRateAbove10Percent

Table 5-19 OcnssfTransactionErrorRateAbove10Percent

Field Details
Description 'Transaction Error rate is above 10 Percent of Total Transactions (current value is {{ $value }})'
Summary 'timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction Error Rate detected above 10 Percent of Total Transactions'
Severity Minor
Condition The number of failed transactions has crossed the minor threshold of 10 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9013
Metric Used 'oc_ingressgateway_http_responses_total'
Recommended Actions

The alert is cleared when the number of failed transactions falls below the 10% threshold of the total transactions or when the failed transactions cross the 25% threshold, in which case the OcnssfTransactionErrorRateAbove25Percent alert shall be raised.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    For example: ocnssf_nsselection_success_tx_total with statusCode != 2xx.

  2. Verify the metrics per service, per method.

    For example: Slice selection requests can be deduced from the following metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    NFServiceType="ocnssf-nsselection"

    Route_path="/nnssf-nsselection/v2/**"

    Status="503 SERVICE_UNAVAILABLE"

  3. If guidance is required, contact My Oracle Support.

5.1.20 OcnssfTransactionErrorRateAbove25Percent

Table 5-20 OcnssfTransactionErrorRateAbove25Percent

Field Details
Description 'Transaction Error rate is above 25 Percent of Total Transactions (current value is {{ $value }})'
Summary 'timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction Error Rate detected above 25 Percent of Total Transactions'
Severity Major
Condition The number of failed transactions has crossed the major threshold of 25 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9014
Metric Used 'oc_ingressgateway_http_responses_total'
Recommended Actions

The alert is cleared when the number of failed transactions falls below the 25% threshold of the total transactions, or when the failed transactions cross the 50% threshold, in which case the OcnssfTransactionErrorRateAbove50Percent alert is raised.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    For example: ocnssf_nsselection_success_tx_total where statusCode does not match 2xx.

  2. Verify the metrics per service and per method.

    For example: Discovery requests can be deduced from the following metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    NFServiceType="ocnssf-nsselection"

    Route_path="/nnssf-nsselection/v2/**"

    Status="503 SERVICE_UNAVAILABLE"

  3. If guidance is required, contact My Oracle Support.

5.1.21 OcnssfTransactionErrorRateAbove50Percent

Table 5-21 OcnssfTransactionErrorRateAbove50Percent

Field Details
Description 'Transaction Error rate is above 50 Percent of Total Transactions (current value is {{ $value }})'
Summary 'timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}: Transaction Error Rate detected above 50 Percent of Total Transactions'
Severity Critical
Condition The number of failed transactions has crossed the critical threshold of 50 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9015
Metric Used 'oc_ingressgateway_http_responses_total'
Recommended Actions

The alert is cleared when the number of failed transactions is below 50 percent of the total transactions.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    For example: ocnssf_nsselection_success_tx_total where statusCode does not match 2xx.

  2. Verify the metrics per service and per method.

    For example: Discovery requests can be deduced from the following metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    NFServiceType="ocnssf-nsselection"

    Route_path="/nnssf-nsselection/v2/**"

    Status="503 SERVICE_UNAVAILABLE"

  3. If guidance is required, contact My Oracle Support.
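The 10%, 25%, and 50% alerts above form a single escalation ladder over the same metric. As a rough illustration of that classification (not product code; the function name and sample counts are invented for this sketch), the threshold logic can be expressed as:

```python
# Hypothetical sketch of the OcnssfTransactionErrorRate* escalation ladder.
# The function and the sample counts are illustrative; the real evaluation
# happens in Prometheus over oc_ingressgateway_http_responses_total.

def transaction_error_alert(total, failed):
    """Return the alert that would fire for the given transaction counts."""
    if total == 0:
        return None
    rate = failed / total * 100  # error rate as a percentage of all transactions
    if rate > 50:
        return "OcnssfTransactionErrorRateAbove50Percent"  # Critical
    if rate > 25:
        return "OcnssfTransactionErrorRateAbove25Percent"  # Major
    if rate > 10:
        return "OcnssfTransactionErrorRateAbove10Percent"  # Minor
    return None

print(transaction_error_alert(1000, 120))  # 12% error rate -> Minor-level alert
print(transaction_error_alert(1000, 600))  # 60% error rate -> Critical-level alert
```

Only the highest breached threshold is reported, which matches the clearing behavior described above: each lower alert clears when the rate either drops below its threshold or escalates past the next one.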

5.2 Application Level Alerts

This section lists the application level alerts.

5.2.1 OcnssfOverloadThresholdBreachedL1

Table 5-22 OcnssfOverloadThresholdBreachedL1

Field Details
Description 'Overload Level of {{$labels.app_kubernetes_io_name}} service is L1'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}: Overload Level of {{$labels.app_kubernetes_io_name}} service is L1'
Severity Warning
Condition NSSF services have breached their configured Level L1 threshold for any of the monitored metrics.

Thresholds are configured for CPU, svc_failure_count, svc_pending_count, and memory.

OID 1.3.6.1.4.1.323.5.3.40.1.2.9016
Metric Used load_level
Recommended Actions

The alert is cleared when the Ingress Traffic rate falls below the configured L1 threshold.

Note: The thresholds can be configured using REST API.

Steps:

Reassess the reasons leading to NSSF receiving additional traffic.

If this is unexpected, contact My Oracle Support.

1. Refer to the alert to determine which service is receiving high traffic. It may be due to a sudden spike in traffic.

For example: When one mated site goes down, the NFs move to the given site.

2. Check the service pod logs on Kibana to determine the reason for the errors.

3. If this is expected traffic, reevaluate the threshold levels as per the call rate and reconfigure them as described in Oracle Communications Cloud Native Core, Network Slice Selection Function REST Specification Guide.

5.2.2 OcnssfOverloadThresholdBreachedL2

Table 5-23 OcnssfOverloadThresholdBreachedL2

Field Details
Description 'Overload Level of {{$labels.app_kubernetes_io_name}} service is L2'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}: Overload Level of {{$labels.app_kubernetes_io_name}} service is L2'
Severity Minor
Condition NSSF services have breached their configured Level L2 threshold for any of the monitored metrics.

Thresholds are configured for CPU, svc_failure_count, svc_pending_count, and memory.

OID 1.3.6.1.4.1.323.5.3.40.1.2.9017
Metric Used load_level
Recommended Actions

The alert is cleared when the Ingress Traffic rate falls below the configured L2 threshold.

Note: The thresholds can be configured using REST API.

Steps:

Reassess the reasons leading to NSSF receiving additional traffic.

If this is unexpected, contact My Oracle Support.

1. Refer to the alert to determine which service is receiving high traffic. It may be due to a sudden spike in traffic.

For example: When one mated site goes down, the NFs move to the given site.

2. Check the service pod logs on Kibana to determine the reason for the errors.

3. If this is expected traffic, reevaluate the threshold levels as per the call rate and reconfigure them as described in Oracle Communications Cloud Native Core, Network Slice Selection Function REST Specification Guide.

5.2.3 OcnssfOverloadThresholdBreachedL3

Table 5-24 OcnssfOverloadThresholdBreachedL3

Field Details
Description 'Overload Level of {{$labels.app_kubernetes_io_name}} service is L3'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}: Overload Level of {{$labels.app_kubernetes_io_name}} service is L3'
Severity Major
Condition NSSF services have breached their configured Level L3 threshold for any of the monitored metrics.

Thresholds are configured for CPU, svc_failure_count, svc_pending_count, and memory.

OID 1.3.6.1.4.1.323.5.3.40.1.2.9018
Metric Used load_level
Recommended Actions

The alert is cleared when the Ingress Traffic rate falls below the configured L3 threshold.

Note: The thresholds can be configured using REST API.

Steps:

Reassess the reasons leading to NSSF receiving additional traffic.

If this is unexpected, contact My Oracle Support.

1. Refer to the alert to determine which service is receiving high traffic. It may be due to a sudden spike in traffic.

For example: When one mated site goes down, the NFs move to the given site.

2. Check the service pod logs on Kibana to determine the reason for the errors.

3. If this is expected traffic, reevaluate the threshold levels as per the call rate and reconfigure them as described in Oracle Communications Cloud Native Core, Network Slice Selection Function REST Specification Guide.

5.2.4 OcnssfOverloadThresholdBreachedL4

Table 5-25 OcnssfOverloadThresholdBreachedL4

Field Details
Description 'Overload Level of {{$labels.app_kubernetes_io_name}} service is L4'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}: Overload Level of {{$labels.app_kubernetes_io_name}} service is L4'
Severity Critical
Condition NSSF services have breached their configured Level L4 threshold for any of the monitored metrics.

Thresholds are configured for CPU, svc_failure_count, svc_pending_count, and memory.

OID 1.3.6.1.4.1.323.5.3.40.1.2.9019
Metric Used load_level
Recommended Actions

The alert is cleared when the Ingress Traffic rate falls below the configured L4 threshold.

Note: The thresholds can be configured using REST API.

Steps:

Reassess the reasons leading to NSSF receiving additional traffic.

If this is unexpected, contact My Oracle Support.

1. Refer to the alert to determine which service is receiving high traffic. It may be due to a sudden spike in traffic.

For example: When one mated site goes down, the NFs move to the given site.

2. Check the service pod logs on Kibana to determine the reason for the errors.

3. If this is expected traffic, reevaluate the threshold levels as per the call rate and reconfigure them as described in Oracle Communications Cloud Native Core, Network Slice Selection Function REST Specification Guide.
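The four overload alerts above differ only in the load_level value and severity. As a rough sketch (the mapping of metric value 1-4 to level L1-L4 is an assumption of this illustration, not a documented encoding), the correspondence can be written as:

```python
# Hypothetical sketch: map a load_level metric sample of a service to the
# corresponding overload alert and severity, mirroring the L1-L4 alert
# definitions above. The numeric encoding (1 -> L1 ... 4 -> L4) is assumed
# for illustration only.

OVERLOAD_ALERTS = {
    1: ("OcnssfOverloadThresholdBreachedL1", "Warning"),
    2: ("OcnssfOverloadThresholdBreachedL2", "Minor"),
    3: ("OcnssfOverloadThresholdBreachedL3", "Major"),
    4: ("OcnssfOverloadThresholdBreachedL4", "Critical"),
}

def overload_alert(load_level):
    """Return (alert_name, severity) for a load_level sample, or None."""
    return OVERLOAD_ALERTS.get(load_level)

print(overload_alert(3))  # the L3 alert with Major severity
```

Each alert clears when the traffic falls back below its own configured threshold, so at most one level is active per service at a time.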

5.2.5 OcnssfScpMarkedAsUnavailable

Table 5-26 OcnssfScpMarkedAsUnavailable

Field Details
Description 'An SCP has been marked unavailable'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : One of the SCP has been marked unavailable'
Severity Major
Condition One of the SCPs has been marked unhealthy.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9020
Metric Used 'oc_egressgateway_peer_health_status'
Recommended Actions This alert is cleared when the unavailable SCPs become available again.

5.2.6 OcnssfAllScpMarkedAsUnavailable

Table 5-27 OcnssfAllScpMarkedAsUnavailable

Field Details
Description 'All SCPs have been marked unavailable'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : All SCPs have been marked as unavailable'
Severity Critical
Condition All SCPs have been marked unavailable.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9021
Metric Used 'oc_egressgateway_peer_count and oc_egressgateway_peer_available_count'
Recommended Actions The NF clears the critical alarm when at least one SCP peer in a peer set becomes available, even if all other SCP or SEPP peers in the given peer set are still unavailable.
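The two SCP alerts above can be derived from the egress gateway peer counters named in their Metric Used fields. A minimal sketch, assuming the two metrics are simple gauges of configured and healthy peers (the function name and sample values are invented):

```python
# Hypothetical sketch: derive the SCP availability alerts from
# oc_egressgateway_peer_count (configured peers) and
# oc_egressgateway_peer_available_count (healthy peers).
# Sampling is simulated with plain integers for illustration.

def scp_alert(peer_count, available_count):
    """Return the SCP alert implied by the peer counters, or None."""
    if peer_count == 0:
        return None  # no SCP peers configured, nothing to alert on
    if available_count == 0:
        return "OcnssfAllScpMarkedAsUnavailable"  # Critical: all peers down
    if available_count < peer_count:
        return "OcnssfScpMarkedAsUnavailable"     # Major: at least one peer down
    return None

print(scp_alert(3, 0))  # all three peers down -> Critical alert
print(scp_alert(3, 2))  # one peer down -> Major alert
```

The Critical alert clears as soon as a single peer recovers, at which point the Major alert remains until every peer is available again.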

5.2.7 OcnssfTLSCertificateExpireMinor

Table 5-28 OcnssfTLSCertificateExpireMinor

Field Details
Description 'TLS certificate to expire in 6 months.'
Summary 'namespace: {{$labels.namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : TLS certificate to expire in 6 months'
Severity Minor
Condition This alert is raised when the TLS certificate is about to expire in six months.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9028
Metric Used security_cert_x509_expiration_seconds
Recommended Actions

The alert is cleared when the TLS certificate is renewed.

For more information about certificate renewal, see the "Creating Private Keys and Certificate" section in the Oracle Communications Cloud Native Core, Network Slice Selection Function Installation, Upgrade, and Fault Recovery Guide.

5.2.8 OcnssfTLSCertificateExpireMajor

Table 5-29 OcnssfTLSCertificateExpireMajor

Field Details
Description 'TLS certificate to expire in 3 months.'
Summary 'namespace: {{$labels.namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : TLS certificate to expire in 3 months'
Severity Major
Condition This alert is raised when the TLS certificate is about to expire in three months.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9029
Metric Used security_cert_x509_expiration_seconds
Recommended Actions

The alert is cleared when the TLS certificate is renewed.

For more information about certificate renewal, see the "Creating Private Keys and Certificate" section in the Oracle Communications Cloud Native Core, Network Slice Selection Function Installation, Upgrade, and Fault Recovery Guide.

5.2.9 OcnssfTLSCertificateExpireCritical

Table 5-30 OcnssfTLSCertificateExpireCritical

Field Details
Description 'TLS certificate to expire in one month.'
Summary 'namespace: {{$labels.namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : TLS certificate to expire in 1 month'
Severity Critical
Condition This alert is raised when the TLS certificate is about to expire in one month.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9030
Metric Used security_cert_x509_expiration_seconds
Recommended Actions

The alert is cleared when the TLS certificate is renewed.

For more information about certificate renewal, see the "Creating Private Keys and Certificate" section in the Oracle Communications Cloud Native Core, Network Slice Selection Function Installation, Upgrade, and Fault Recovery Guide.

5.2.10 OcnssfNrfInstancesInDownStateMajor

Table 5-31 OcnssfNrfInstancesInDownStateMajor

Field Details
Description 'When current operative status of any NRF Instance is unavailable/unhealthy'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Few of the NRF instances are in unavailable state'
Severity Major
Condition When the sum of the metric values across all NRF instances is greater than 0 but less than 3.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9032
Metric Used nrfclient_nrf_operative_status
Recommended Actions

This alert is cleared when operative status of all the NRF Instances is available/healthy.

Steps:

  1. Check the nrfclient_nrf_operative_status metric value of each NRF instance.
  2. The instances for which the metric value is '0' are down.
  3. Bring up the NRF instances that are down.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

5.2.11 OcnssfAllNrfInstancesInDownStateCritical

Table 5-32 OcnssfAllNrfInstancesInDownStateCritical

Field Details
Description 'When current operative status of all the NRF Instances is unavailable/unhealthy'
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : All the NRF instances are in unavailable state'
Severity Critical
Condition When the sum of the metric values across all NRF instances is equal to 0.
OID 1.3.6.1.4.1.323.5.3.40.1.2.9031
Metric Used nrfclient_nrf_operative_status
Recommended Actions

This alert is cleared when the current operative status of at least one NRF Instance is available/healthy.

Steps:

  1. Bring up at least one NRF Instance.
  2. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use Cloud Native Core Network Function Data Collector tool for capturing the logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
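The Major and Critical NRF alerts above both sum the per-instance nrfclient_nrf_operative_status values, where a value of 0 marks a down instance (as the steps above note). A minimal sketch of that decision (the function name and sample lists are invented for this illustration):

```python
# Hypothetical sketch: decide between the two NRF availability alerts from
# per-instance nrfclient_nrf_operative_status samples (1 = available,
# 0 = unavailable). The list of statuses simulates the metric scrape.

def nrf_alert(statuses):
    """Return the NRF alert implied by the per-instance statuses, or None."""
    if not statuses:
        return None  # no NRF instances reported
    up = sum(statuses)
    if up == 0:
        return "OcnssfAllNrfInstancesInDownStateCritical"  # every instance down
    if up < len(statuses):
        return "OcnssfNrfInstancesInDownStateMajor"        # some instances down
    return None

print(nrf_alert([0, 0, 0]))  # all three instances down -> Critical alert
print(nrf_alert([1, 0, 1]))  # one of three down -> Major alert
```

Bringing any single instance back up downgrades the Critical alert to Major; the Major alert clears only once every instance reports available.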