G Error Handling and Expected Outcomes for NRF Growth Feature

This section describes how NRF behaves in failure scenarios when the NRF growth feature is enabled. Each use case covers the scenario, expected system behavior, customer impact, and observability.

Unavailability of the Cache Data Microservice (CDS) of an NRF Instance

The CDS pods of an NRF instance can become unavailable for reasons such as unexpected traffic surges or an increase in the number of NFs beyond the supported capacity. As a result, requests to the CDS can time out or receive a non-2xx error response.

Expected behavior:

  • NRF will continue to support all service operations, with a limited segment view.
  • The affected NRF instance falls back to its database to continue providing services.
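
The fallback described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the NRF implementation: the function names (`query_cds`, `get_nf_instances`) and the callables passed in are hypothetical stand-ins for the CDS client and the local database reader.

```python
from typing import Callable, Dict, List, Tuple

Profile = Dict[str, str]
CdsFetch = Callable[[], Tuple[int, List[Profile]]]


class CdsUnavailable(Exception):
    """Raised when the CDS query times out or returns a non-2xx status."""


def query_cds(fetch: CdsFetch) -> List[Profile]:
    """Call the CDS and treat timeouts and non-2xx responses as failures."""
    try:
        status, profiles = fetch()
    except TimeoutError as exc:
        raise CdsUnavailable("CDS request timed out") from exc
    if not 200 <= status < 300:
        raise CdsUnavailable(f"CDS returned status {status}")
    return profiles


def get_nf_instances(
    fetch_from_cds: CdsFetch,
    read_local_db: Callable[[], List[Profile]],
) -> Tuple[List[Profile], str]:
    """Return NF profiles together with the source that served them."""
    try:
        return query_cds(fetch_from_cds), "cds"
    except CdsUnavailable:
        # Fallback path: the database holds the local set's data only,
        # so the caller gets a limited segment view.
        return read_local_db(), "database"
```

Returning the source alongside the data is a design choice of this sketch; it makes it easy to count fallbacks, which is the kind of condition a database-fallback alert would track.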

The following table describes the impact of different service operations in the affected NRF:

Table G-1 The CDS microservice of an NRF instance is down or overloaded

Service Operation | Impact
NFRegister, NFUpdate, NfHeartbeat, and NFDeregister | There is no impact on NF Consumer, and it remains unaffected.
NFProfileRetrieval/NFListRetrieval | Uses the local set nfInstances data only.
NFStatusSubscribe/NFStatusUnsubscribe | Uses the local set nfsubscription and nfInstances data only.
NFStatusNotify | Notification is generated for the subscription taken in the local NRF set only.
Nnrf_AccessToken_Get | Uses the local set nfInstances data only.
NFDiscover | Uses the local set nfInstances data and the last known remote set nfInstances data (if available in the in-memory cache of the discovery microservice).

Customer Impact: Customers may experience degraded service and higher service latencies. The impact varies by NRF service operation. For example, discovery may still return remote set data that is not the latest, whereas the responses for other service operations contain data from the local set (the set of the current NRF) only, so they may not reflect the latest nfProfile content.

Observability:

Alerts: The following alerts can be monitored to identify CDS microservice issues, including unavailability and overload conditions:

  • OcnrfCacheDataServiceDown – Triggered when the CDS microservice becomes unavailable due to pod restarts, database connectivity failure, and so on.
  • OcnrfDatabaseFallbackUsed – Triggered when the CDS microservice is overloaded or when the system switches to database fallback due to CDS unavailability.

The following metrics can be monitored for higher latency indication:

  • oc_ingressgateway_request_latency_seconds_[suffix]
  • oc_ingressgateway_backend_invocation_latency_seconds_[suffix]

Communication Between the Local and the Remote NRFs is Down

  • Communication Between the Local NRF and Any One of the Remote NRF Set is Down

    CDS pods communicate with the remote set NRFs to synchronize the state data between the sets. During this synchronization, communication may become unavailable between the local NRF and the remote primary NRF, or between the local NRF and the remote NRFs (primary or alternate NRFs).

    This can occur due to various scenarios. Some of them are as follows:
    • network issues
    • remote set CDS pods are unavailable or overloaded. In this case, the request may time out, or an error response may be received
    • remote NRF set is completely down

    Expected behavior:

    • For periodic state data synchronization requests, the NRF will receive a 4xx/5xx error response from the remote NRF of the remote NRF set when its CDS is down or unavailable.
    • When a synchronization request to a remote NRF fails, NRF attempts to synchronize the data with the alternate NRFs of the remote NRF set. (On failure, NRF attempts the remote set NRFs sequentially.)
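
The sequential attempt across the NRFs of one remote set can be sketched as follows. This is a hedged illustration: `sync_remote_set`, the endpoint strings, and the `send_sync` callable are hypothetical names, and the real request format is not shown in this section.

```python
from typing import Callable, Dict, List, Optional, Tuple

StateData = Dict[str, dict]
SyncCall = Callable[[str], Tuple[int, StateData]]


def sync_remote_set(
    nrf_endpoints: List[str], send_sync: SyncCall
) -> Tuple[Optional[str], Optional[StateData]]:
    """Try each NRF of one remote set in order; return the first 2xx result.

    Returns (None, None) when every NRF of the set fails.
    """
    for endpoint in nrf_endpoints:
        try:
            status, data = send_sync(endpoint)
        except (TimeoutError, ConnectionError):
            continue  # network issue or timeout: move on to the next NRF
        if 200 <= status < 300:
            return endpoint, data
        # 4xx/5xx: this NRF cannot serve the sync, so try the next one
    return None, None
```

Because all the NRFs of a remote set return the same state data in their synchronization response, the first successful answer is sufficient and the loop can stop there.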

    Note:

    All the NRFs of the remote set include the same state data in their synchronization response.
    The following table describes the impact of different service operations in the current NRF:

    Table G-2 Communication between the local NRF and any one of the remote NRF set is down

    Service Operation | Impact
    NFRegister, NFUpdate, NfHeartbeat, and NFDeregister | There is no impact on NF Consumer, and it remains unaffected.
    NFProfileRetrieval/NFListRetrieval | There is no impact on NF Consumer, and it remains unaffected. Uses the local set nfInstances data and the remote nfInstances sync from a non-primary NRF.
    NFStatusSubscribe/NFStatusUnsubscribe | There is no impact on NF Consumer, and it remains unaffected. Uses the local set nfsubscription and nfInstances data, and the remote nfInstances and nfsubscription sync from a non-primary NRF.
    NFStatusNotify | There is no impact on NF Consumer, and it remains unaffected. Notification is generated for the subscription taken in the local NRF set and the remote NRF set sync from a non-primary NRF.
    Nnrf_AccessToken_Get | There is no impact on NF Consumer, and it remains unaffected. Uses the local set nfInstances data and the remote nfInstances sync from a non-primary NRF.
    NFDiscover | There is no impact on NF Consumer, and it remains unaffected. Uses the local nfInstances and the remote nfInstances sync from a non-primary NRF.

    Customer Impact: As long as the NRF can synchronize the state data from any one of the NRFs of the remote NRF set, NF Consumers have no impact on any service operations.

    Observability:

    Alerts: OcnrfRemoteSetNrfSyncFailed - Monitors whether any failure occurs while synchronizing the remote NRF set data.

    Metrics:

    The following metrics can be monitored for higher latency indication:

    • oc_ingressgateway_request_latency_seconds_[suffix]
    • oc_ingressgateway_backend_invocation_latency_seconds_[suffix]
  • Communication Between Local NRF and All Remote NRF Sets is Down

    The communication between a local NRF and all remote NRFs is unavailable or down. This affects the ability of the NRF to synchronize the state data with the NRFs of the remote set.

    This can occur due to various scenarios. Some of them are as follows:
    • network issues
    • all the remote set CDS pods are unavailable or overloaded
    • all the NRFs in the remote NRF set are completely down

    Expected behavior:

    • The CDS of the local NRF uses the last known data of the remote sets (available in the in-memory cache) for all service operations, along with the latest data of the local NRF set, until communication with any of the NRFs of the remote set is restored.
    • If the remote NRF set is completely down, then the consumer NFs registered to that set may have a service outage, as there is no other NRF with a GR site to fall back to.
    • NRF continues to attempt to sync the data from the remote set until it is successful.
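
As a rough illustration of this behavior, the sketch below keeps the last successfully synchronized data per remote set and merges it with the latest local data. The class and method names (`RemoteSetCache`, `record_sync`, `merged_view`) are assumptions made for the example and do not come from the product.

```python
from typing import Dict, Optional

NfData = Dict[str, dict]


class RemoteSetCache:
    """In-memory holder for the last known data of each remote set."""

    def __init__(self) -> None:
        self._last_known: Dict[str, NfData] = {}
        self._stale: Dict[str, bool] = {}

    def record_sync(self, set_id: str, data: Optional[NfData]) -> None:
        """Store a sync result; data=None means the sync attempt failed."""
        if data is None:
            # Keep whatever was synchronized earlier and flag it as stale.
            if set_id in self._last_known:
                self._stale[set_id] = True
            return
        self._last_known[set_id] = dict(data)
        self._stale[set_id] = False

    def is_stale(self, set_id: str) -> bool:
        return self._stale.get(set_id, False)

    def merged_view(self, local: NfData) -> NfData:
        """Latest local set data plus last known data of every remote set."""
        view: NfData = {}
        for data in self._last_known.values():
            view.update(data)
        view.update(local)  # local data is always the latest
        return view
```

A failed sync never clears the cache in this sketch, which mirrors the statement that the last known remote data stays in use until a later sync succeeds.
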
    The following table describes the impact of different service operations in the current NRF:

    Table G-3 Communication Between Local NRF and All Remote NRF Sets is Down

    Service Operation | Impact
    NFRegister, NFUpdate, NfHeartbeat, and NFDeregister | There is no impact on Consumer NFs, and it remains unaffected.
    NFProfileRetrieval/NFListRetrieval | Uses the latest nfInstances from the current (local) NRF set and the last synchronized nfInstances from the remote set.
    NFStatusSubscribe/NFStatusUnsubscribe | Uses the latest nfsubscription and nfInstances from the current (local) NRF set and the last synchronized nfInstances and nfsubscription from the remote set.
    NFStatusNotify | Notification is generated for the latest subscription taken in the current (local) set NRF and the last synchronized remote set NRF.
    Nnrf_AccessToken_Get | Uses the latest nfInstances from the current (local) NRF set and the last synchronized nfInstances from the remote set.
    NFDiscover | Uses the local set nfInstances data and the last known remote set nfInstances data (if available in the in-memory cache of the CDS).

    Customer Impact: NF Consumers may experience degraded service till the communication with the remote set is restored, as the remote set's data will not be the latest.

    The NRF is considered to be working with limited capacity. In this state, the operator should explore diverting the traffic to an alternate NRF within the same NRF set, provided the alternate NRFs do not have a similar fault and have the necessary capacity to handle the additional traffic.

    Observability:

    Alerts: The following alerts can be monitored to check if the communication between the local NRF and all the remote NRF sets is down.

    Metrics:

    The following metrics can be monitored for higher latency indication:
    • oc_ingressgateway_request_latency_seconds_[suffix]
    • oc_ingressgateway_backend_invocation_latency_seconds_[suffix]

Unavailability of the Database for the Remote NRF Sets

  • The Database Service for One or More Remote NRF Sites is Unavailable or Down

    The database service for one or more remote NRF sites is unavailable or down. This affects the ability of the NRF to synchronize the data.

    Expected behavior:

    • The NRF site, whose DB service is down, marks itself as unavailable. When the database service of an NRF site is unavailable, the microservices like discovery, access token, registration, subscription, auditor, and CDS mark themselves as unavailable by moving their pod to "Not Ready" state.
    • The current NRF receives a 5xx response for a synchronization request that is sent to the NRF whose database service is down.
    • NRF sends the request to an alternate NRF of that set to synchronize the state data.

    The following table describes the impact of different service operations in the current NRF:

    Table G-4 The database service for one or more remote NRF sites is unavailable or down

    Service Operation | Impact
    NFRegister, NFUpdate, NfHeartbeat, and NFDeregister | There is no impact on NF Consumer, and it remains unaffected.
    NFProfileRetrieval/NFListRetrieval | There is no impact on NF Consumer, and it remains unaffected. Uses the local set nfInstances data and the remote nfInstances sync from a non-primary NRF.
    NFStatusSubscribe/NFStatusUnsubscribe | There is no impact on NF Consumer, and it remains unaffected. Uses the local set nfsubscription and nfInstances data, and the remote nfInstances and nfsubscription sync from a non-primary NRF.
    NFStatusNotify | There is no impact on NF Consumer, and it remains unaffected. Notification is generated for the subscription taken in the local set NRF and the remote set NRF sync from a non-primary NRF.
    Nnrf_AccessToken_Get | There is no impact on NF Consumer, and it remains unaffected. Uses the local set nfInstances data and the remote nfInstances sync from a non-primary NRF.
    NFDiscover | There is no impact on NF Consumer, and it remains unaffected. Uses the local nfInstances and the remote nfInstances sync from a non-primary NRF.

    Customer Impact: As long as the NRF is able to get the state data from any of the NRFs of the remote set, NF Consumers have no impact on any service operation.

    If all the NRFs of the remote set are unavailable, NF Consumers may experience degraded service until communication with the remote set is restored, as the remote set's data will not be the latest.

    Observability:

    Alerts: OcnrfRemoteSetNrfSyncFailed - Used to monitor whether the remote sync has failed or whether the database service for a remote NRF is unavailable or down.

  • Database Services of All the Remote NRF Sets are Down

    When the database services of all the remote NRF sets are down, NRF marks the service as unavailable.

    Expected behavior:

    • The NRF site, whose database service is down, marks itself as unavailable. When the database service of an NRF site is unavailable, the microservices like discovery, access token, registration, subscription, auditor, and CDS mark themselves as unavailable by moving their pod to "Not Ready" state.
    • All NRFs in the remote set declare themselves as service-unavailable.
    • NRF attempts to synchronize the data with each NRF of that set, and every attempt fails.
    • The CDS of the local NRF will use the last known data of the remote set.
    • NRF continues to attempt to sync the data from the remote set until it is successful.
    The following table describes the impact of different service operations in the current NRF:

    Table G-5 Database services of all the remote NRF sets are down

    Service Operation | Impact
    NFRegister, NFUpdate, NfHeartbeat, and NFDeregister | There is no impact on Consumer NFs, and it remains unaffected.
    NFProfileRetrieval/NFListRetrieval | Uses the latest nfInstances from the current (local) NRF set and the last synchronized nfInstances from the remote set.
    NFStatusSubscribe/NFStatusUnsubscribe | Uses the latest nfsubscription and nfInstances from the current (local) NRF set and the last synchronized nfInstances from the remote set.
    NFStatusNotify | Notification is generated for the subscription taken in the local NRF set and the last synchronized data from the remote NRF set.
    Nnrf_AccessToken_Get | Uses the latest nfInstances from the current (local) NRF set and the last synchronized nfInstances from the remote set.
    NFDiscover | Uses the local set nfInstances data and the last known remote set nfInstances data (if available in the in-memory cache of the CDS).

    Customer Impact: NF Consumers may experience degraded service till the communication with the remote set is restored, as the remote set's data will not be the latest.

    Observability:

    The following alerts can be monitored when the database services of all the remote NRF sets are down.

Remote NRF Instance’s Replication Channel

  • When the Primary NRF of the Remote Set has the Replication Down with One of its Mated Sites

    The replication channel is down between one or more NRFs of the remote NRF sets.

    Expected behavior:

    • The current NRF continues to synchronize with the remote NRF site or set even if the replication of the remote NRF is down with one or more of its peer NRFs.
    • The remote NRF has the latest NF state data for the peer NRFs whose replication is up, and only the last replicated data for the peer NRFs whose replication is down.
    • This results in the unavailability of the latest remote NRF set data being used for service operations.

    The following table describes the impact of different service operations in the current NRF:

    Table G-6 When the Primary NRF of the Remote set has the Replication down with one of its Mated Sites

    Service Operation | Impact
    NFRegister, NFUpdate, NfHeartbeat, and NFDeregister | There is no impact on NF Consumer, and it remains unaffected.
    NFProfileRetrieval/NFListRetrieval | Uses the local set nfInstances data and the remote nfInstances sync from a primary NRF which may not have the latest state data.
    NFStatusSubscribe/NFStatusUnsubscribe | Uses the local set nfsubscription and nfInstances data, and the remote nfInstances sync from a primary NRF which may not have the latest state data.
    NFStatusNotify | Notification is generated for the subscription taken in the local NRF set and the remote NRF set which may not have the latest state data.
    Nnrf_AccessToken_Get | Uses the local set nfInstances data and the remote nfInstances sync from a primary NRF which may not have the latest state data.
    NFDiscover | Uses the local nfInstances and the remote nfInstances sync from a primary NRF which may not have the latest state data.

    Customer Impact: NF Consumers may experience service impacts due to the unavailability of the latest remote NRF set data.

    Observability:

    There is no alert to monitor the current NRF behavior.

    OcnrfDbReplicationStatusInactive alert is used to monitor the target NRF behavior.

  • When the Primary NRF of the Remote Set has the Replication Down with All of its Mated Sites

    The replication channel is down between the Primary NRF and all of its georedundant NRFs.

    Expected behavior:

    • The current NRF continues to synchronize with the remote NRF site or set even if the replication of the remote NRF is down with one or more of its peer NRFs.
    • The remote NRF will have the last replicated data of the georedundant NRFs.
    • This results in the unavailability of the latest remote NRF set data being used for service operations.
    The following table describes the impact of different service operations in the current NRF:

    Table G-7 When the Primary NRF of the Remote set has the Replication down with all of its Mated sites

    Service Operation | Impact
    NFRegister, NFUpdate, NfHeartbeat, and NFDeregister | There is no impact on Consumer NFs, and it remains unaffected.
    NFProfileRetrieval/NFListRetrieval | There is no impact on Consumer NFs, and it remains unaffected. Uses the last synchronized nfInstances from the remote set and the latest nfInstances from the current NRF set.
    NFStatusSubscribe/NFStatusUnsubscribe | Uses the local set nfInstances data, the remote nfInstances data, and the nfSubscription data only.
    NFStatusNotify | Notification is generated for the subscription taken in the local NRF set and the remote NRF set.
    Nnrf_AccessToken_Get | There is no impact on Consumer NFs, and it remains unaffected. Uses the last synchronized nfInstances from the remote set and the latest nfInstances from the current NRF set.
    NFDiscover | There is no impact on Consumer NFs, and it remains unaffected. Uses the local set nfInstances data and the last known remote set nfInstances data (if available in the in-memory cache of the discovery microservice).

    Customer Impact: NF Consumers may experience service impact because the remote set's data will not be the latest.

    Observability: There are no alerts for this scenario.

Local Database Failure

  • Local Database Tier is Down

    The database of the local NRF is unavailable or down.

    Expected behavior:

    • NRF considers the database tier as a mandatory component and checks for the health of the Database periodically.
    • When the database is unavailable, all services, including CDS, mark their pods as "Not Ready". Any requests for these services will fail, and Egress Gateway will send an error response (5xx).
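
The link between database health and pod readiness can be sketched as a readiness check that reports failure whenever the database probe fails. This is only an illustration that assumes an HTTP-style readiness endpoint; the function name `readiness_status` and the `db_is_healthy` callable are hypothetical.

```python
from typing import Callable


def readiness_status(db_is_healthy: Callable[[], bool]) -> int:
    """Return the HTTP status a readiness endpoint would report.

    200 means the pod can receive traffic; 503 means "Not Ready".
    """
    try:
        healthy = db_is_healthy()
    except Exception:
        # Any error while probing the database counts as unhealthy.
        healthy = False
    return 200 if healthy else 503
```

Treating a probe error the same as an explicit unhealthy result keeps the pod out of service whenever the database state cannot be confirmed.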

    Customer Impact: On receiving the 5xx error, the NF Consumers are expected to move to an alternate NRF of the set. As long as NF Consumers can communicate with NRF, the NF Consumers will not have any service impact.

    Observability:

    Alerts:
    • OcnrfNfStatusUnavailable is used to check if the database service for a remote NRF is unavailable or down.
    • The database-related alerts should be monitored. For more details, see Oracle Communications Cloud Native Core, cnDBTier User Guide.
  • Local NRF’s Replication Failure
    • Total Replication Failure (to all remote NRF peers):

      The replication from the local NRF to all configured remote NRFs (primary and alternates) is down.

      Expected behavior:

      Remote NRFs retain only the last successfully replicated NF state, which becomes increasingly outdated until replication is restored and re-synchronization completes.

      Customer Impact: NF Consumers may experience service impact because the remote set's data will not be the latest.

      Observability: There are no alerts for this scenario.

    • Partial Replication Failure (to one or more remote NRF peers):

      Replication from the local NRF to some remote NRFs is down while replication to other peers continues.

      Expected behavior:

      Remote NRFs with replication up have the latest NF state; remote NRFs with replication down retain the last successfully replicated NF state until recovery.

      Customer Impact: NF Consumers may experience service impact because the affected remote NRF set(s) will not have the latest NF state data while other remote sets may still be up to date.

      Observability: There are no alerts for this scenario.

Periodic Refresh of Discovery and CDS PODs

  • Discovery in-memory cache

    Each discovery pod maintains an independent cache.

    During initialization or a pod restart, each discovery pod gets the local NRF and remote NRF data and retains it in the in-memory cache.
    • Every 2 seconds, it queries CDS to get the delta data.
    • Every X seconds, it reloads the complete NF profile from CDS.
    • Additionally, every 60 seconds, it reloads the local set data directly from the database.
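
The three refresh cadences above can be pictured with a small scheduling sketch. It is not the discovery microservice's code: the function and action names are hypothetical, and because the value of X is not given here, the full reload interval is passed in as a parameter.

```python
from typing import List

DELTA_INTERVAL_S = 2             # delta query to CDS
LOCAL_DB_RELOAD_INTERVAL_S = 60  # local set reload from the database


def due_actions(elapsed_s: int, full_reload_interval_s: int) -> List[str]:
    """List the refresh actions due at a whole second since the initial load.

    full_reload_interval_s stands for the "X seconds" interval, whose
    value is not specified in this section.
    """
    if elapsed_s <= 0:
        return []  # the initial load itself is handled separately
    actions: List[str] = []
    if elapsed_s % DELTA_INTERVAL_S == 0:
        actions.append("delta_from_cds")
    if elapsed_s % full_reload_interval_s == 0:
        actions.append("full_reload_from_cds")
    if elapsed_s % LOCAL_DB_RELOAD_INTERVAL_S == 0:
        actions.append("local_reload_from_db")
    return actions
```

For example, with a hypothetical full reload interval of 30 seconds, second 60 triggers all three actions, while second 2 triggers only the delta query.
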

    Note:

    During initialization, if CDS communication fails, discovery goes to the database and loads the local set data so that it can at least provide service with the local set data. It continues to query CDS every 2 seconds to get the remote set data until it is successful.

    Customer Impact: There is a possibility that NF Consumers may receive a different view of the topology if the underlying data is updated during the synchronization period.

    Observability: There are no alerts for this scenario.

  • CDS in-memory cache

    Each CDS pod maintains an independent cache.

    During pod initialization or pod restart, each CDS pod gets the local NRF data from the database and remote NRF data by querying the remote sets and retains it in the in-memory cache.

    After pod initialization, CDS sends periodic sync requests to the remote sets every 2 seconds to get the delta data (data that has not been synced within the last two seconds).

    Additionally, CDS performs a complete resync of the remote set data every 60 seconds.
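
The difference between a delta sync and a complete resync can be sketched as two cache operations. The delta format used here, where a value of `None` marks a removed NF, is an assumption made for the example and is not taken from the product.

```python
from typing import Dict, Optional

NfCache = Dict[str, dict]


def apply_delta(cache: NfCache, delta: Dict[str, Optional[dict]]) -> None:
    """Apply only the changed entries; None marks a removed NF (assumed format)."""
    for nf_id, profile in delta.items():
        if profile is None:
            cache.pop(nf_id, None)
        else:
            cache[nf_id] = profile


def apply_full_resync(cache: NfCache, snapshot: NfCache) -> None:
    """Replace the cached remote set data with a complete snapshot."""
    cache.clear()
    cache.update(snapshot)
```

In this sketch, a complete resync also corrects entries that a missed delta would have left behind, which is one plausible reason to run both kinds of sync.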

    Note:

    If CDS is unable to sync with the remote sets (for example, due to network issues or because the other set is down), CDS retains its last known data of the remote set in its cache until it is able to re-establish sync with the remote set.

    Customer Impact: There is a possibility that NF Consumers may receive a different view of the topology if the underlying data is updated during the synchronization period.

    Observability: OcnrfRemoteSetNrfSyncFailed - Indicates that synchronization with a specific local NRF instance in a remote set has failed.