4 Troubleshooting NRF
This chapter provides information to troubleshoot the common errors that can be encountered during the preinstall, installation, upgrade, and rollback procedures of Oracle Communications Cloud Native Core, Network Repository Function (NRF).
Following are the troubleshooting procedures:
- Helm Install Failure
- Custom Value File Parse Failure
- Helm Test Error Scenarios
- Upgrade or Rollback Failure
Note:
kubectl commands might vary based on the platform deployment. Replace kubectl with the Kubernetes environment-specific command line tool used to configure Kubernetes resources through the kube-api server. The instructions provided in this document are as per the Oracle Communications Cloud Native Core, Cloud Native Environment (CNE) version of the kube-api server.
Caution:
User, computer, application, and character encoding settings may cause issues when copying and pasting commands or any content from the PDF. The PDF reader version also affects the copy-paste functionality. It is recommended to verify the copied content, especially when hyphens or other special characters are part of it.
Note:
The performance and capacity of the NRF system may vary based on the call model, feature or interface configuration, and the underlying CNE and hardware environment.
4.1 Generic Checklist
The following sections provide a generic checklist for troubleshooting tips.
Deployment related tips
- Are NRF deployment, pods, and services created? Are NRF deployment, pods, and services running and available?
  Run the following command:
  # kubectl -n <namespace> get deployments,pods,svc
  Inspect the output and check the following columns:
  - AVAILABLE of deployment
  - READY, STATUS, and RESTARTS of a pod
  - PORT(S) of service
- Is the correct image used? Are the correct environment variables set in the deployment?
  Run the following command:
  # kubectl -n <namespace> get deployment <deployment-name> -o yaml
  Inspect the output and check the environment variables and image. For example:
  # kubectl -n nrf-svc get deployment ocnrf-nfregistration -o yaml
  apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"ocnrf-nfregistration","namespace":"nrf-svc"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"ocnrf-nfregistration"}},"template":{"metadata":{"labels":{"app":"ocnrf-nfregistration"}},"spec":{"containers":[{"env":[{"name":"MYSQL_HOST","value":"mysql"},{"name":"MYSQL_PORT","value":"3306"},{"name":"MYSQL_DATABASE","value":"nrfdb"},{"name":"NRF_REGISTRATION_ENDPOINT","value":"ocnrf-nfregistration"},{"name":"NRF_SUBSCRIPTION_ENDPOINT","value":"ocnrf-nfsubscription"},{"name":"NF_HEARTBEAT","value":"120"},{"name":"DISC_VALIDITY_PERIOD","value":"3600"}],"image":"dsr-master0:5000/ocnrf-nfregistration:latest","imagePullPolicy":"Always","name":"ocnrf-nfregistration","ports":[{"containerPort":8080,"name":"server"}]}]}}}}
    creationTimestamp: 2018-08-27T15:45:59Z
    generation: 1
    name: ocnrf-nfregistration
    namespace: nrf-svc
    resourceVersion: "2336498"
    selfLink: /apis/extensions/v1beta1/namespaces/nrf-svc/deployments/ocnrf-nfregistration
    uid: 4b82fe89-aa10-11e8-95fd-fa163f20f9e2
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: ocnrf-nfregistration
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: ocnrf-nfregistration
      spec:
        containers:
        - env:
          - name: MYSQL_HOST
            value: mysql
          - name: MYSQL_PORT
            value: "3306"
          - name: MYSQL_DATABASE
            value: nrfdb
          - name: NRF_REGISTRATION_ENDPOINT
            value: ocnrf-nfregistration
          - name: NRF_SUBSCRIPTION_ENDPOINT
            value: ocnrf-nfsubscription
          - name: NF_HEARTBEAT
            value: "120"
          - name: DISC_VALIDITY_PERIOD
            value: "3600"
          image: dsr-master0:5000/ocnrf-nfregistration:latest
          imagePullPolicy: Always
          name: ocnrf-nfregistration
          ports:
          - containerPort: 8080
            name: server
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: 2018-08-27T15:46:01Z
      lastUpdateTime: 2018-08-27T15:46:01Z
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: 2018-08-27T15:45:59Z
      lastUpdateTime: 2018-08-27T15:46:01Z
      message: ReplicaSet "ocnrf-nfregistration-7898d657d9" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
- Check if the microservices can access each other using the REST interface.
  Run the following command:
  # kubectl -n <namespace> exec <pod name> -- curl <uri>
  Example:
  # kubectl -n nrf-svc exec ocnrf-nfregistration-44f4d8f5d5-6q92i -- curl http://ocnrf-nfregistration:8080/nnrf-nfm/v1/nf-instances
Note:
These commands are in their simplest form and work only if there is a single nfregistration and nfsubscription pod deployed.
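If more than one replica is deployed, a specific pod can be selected using its app label instead of the full pod name. The following is a minimal sketch, assuming the app=ocnrf-nfregistration label shown in the deployment output above:
# POD=$(kubectl -n nrf-svc get pods -l app=ocnrf-nfregistration -o name | head -1)
# kubectl -n nrf-svc exec ${POD#pod/} -- curl http://ocnrf-nfregistration:8080/nnrf-nfm/v1/nf-instances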
Application related tips
- Check the application logs using the following command:
  # kubectl -n <namespace> logs -f <pod name>
  You can use '-f' to follow the logs or 'grep' for a specific pattern in the log output.
Example:
# kubectl -n nrf-svc logs -f $(kubectl -n nrf-svc get pods -o name|cut -d'/' -f2|grep nfr)
# kubectl -n nrf-svc logs -f $(kubectl -n nrf-svc get pods -o name|cut -d'/' -f2|grep nfs)
Note:
These commands are in their simplest form and display the logs only if there is a single nfregistration and nfsubscription pod deployed.
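If more than one replica is deployed, the label selector form of kubectl logs can be used instead of the grep-based commands above. A minimal sketch, assuming the app labels shown in the earlier deployment output:
# kubectl -n nrf-svc logs -f -l app=ocnrf-nfregistration
# kubectl -n nrf-svc logs -f -l app=ocnrf-nfsubscription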
4.2 Deployment Related Issues
This section describes the most common deployment related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue persists, contact My Oracle Support (MOS).
4.2.1 Installation
This section describes the common installation related issues and their resolution steps.
4.2.1.1 Helm Install Failure
The helm install command might fail. Following are some of the scenarios:
4.2.1.1.1 Incorrect image name in the ocnrf-custom-values file
Problem
helm install might fail if an incorrect image name is provided in the ocnrf-custom-values.yaml file.
Error Code/Error Message
When kubectl get pods -n <ocnrf_namespace> is run, the status of the pods might be ImagePullBackOff or ErrImagePull.
For example:
$ kubectl get pods -n ocnrf
NAME READY STATUS RESTARTS AGE
ocnrf-egressgateway-d6567bbdb-9jrsx 1/2 ImagePullBackOff 0 30h
ocnrf-egressgateway-d6567bbdb-ntn2v 2/2 Running 0 30h
ocnrf-ingressgateway-754d645984-h9vzq 2/2 Running 0 30h
ocnrf-ingressgateway-754d645984-njz4w 2/2 Running 0 30h
ocnrf-nfaccesstoken-59fb96494c-k8w9p 1/1 Running 0 30h
ocnrf-nfaccesstoken-49fb96494c-k8w9q 1/1 Running 0 30h
ocnrf-nfdiscovery-84965d4fb9-rjxg2 1/1 Running 0 30h
ocnrf-nfdiscovery-94965d4fb9-rjxg3 1/1 Running 0 30h
ocnrf-nfregistration-64f4d8f5d5-6q92j 1/1 Running 0 30h
ocnrf-nfregistration-44f4d8f5d5-6q92i 1/1 Running 0 30h
ocnrf-nfsubscription-5b6db965b9-gcvpf 1/1 Running 0 30h
ocnrf-nfsubscription-4b6db965b9-gcvpe 1/1 Running 0 30h
ocnrf-nrfauditor-67b676dd87-xktbm 1/1 Running 0 30h
ocnrf-nrfconfiguration-678fddc5f5-c5htj 1/1 Running 0 30h
ocnrf-appinfo-8b7879cdb-jds4r 1/1 Running 0 30h
Solution
- Check that the ocnrf-custom-values.yaml file has the release-specific image names and tags. For NRF image details, see "Customizing NRF" in Oracle Communications Cloud Native Core, Network Repository Function Installation, Upgrade, and Fault Recovery Guide.
  vi ocnrf-custom-values-<release-number>
- Edit the ocnrf-custom-values file in case the release-specific image names and tags must be modified.
- Save the file.
- Run the helm install command. For the helm install command, see the "Customizing NRF" section in Oracle Communications Cloud Native Core, Network Repository Function Installation, Upgrade, and Fault Recovery Guide.
- Run kubectl get pods -n <ocnrf_namespace> to verify that the status of all the pods is Running.
  For example:
  $ kubectl get pods -n ocnrf
  NAME READY STATUS RESTARTS AGE
  ocnrf-egressgateway-d6567bbdb-9jrsx 2/2 Running 0 30h
  ocnrf-egressgateway-d6567bbdb-ntn2v 2/2 Running 0 30h
  ocnrf-ingressgateway-754d645984-h9vzq 2/2 Running 0 30h
  ocnrf-ingressgateway-754d645984-njz4w 2/2 Running 0 30h
  ocnrf-nfaccesstoken-59fb96494c-k8w9p 1/1 Running 0 30h
  ocnrf-nfaccesstoken-49fb96494c-k8w9q 1/1 Running 0 30h
  ocnrf-nfdiscovery-84965d4fb9-rjxg2 1/1 Running 0 30h
  ocnrf-nfdiscovery-94965d4fb9-rjxg3 1/1 Running 0 30h
  ocnrf-nfregistration-64f4d8f5d5-6q92j 1/1 Running 0 30h
  ocnrf-nfregistration-44f4d8f5d5-6q92i 1/1 Running 0 30h
  ocnrf-nfsubscription-5b6db965b9-gcvpf 1/1 Running 0 30h
  ocnrf-nfsubscription-4b6db965b9-gcvpe 1/1 Running 0 30h
  ocnrf-nrfauditor-67b676dd87-xktbm 1/1 Running 0 30h
  ocnrf-nrfconfiguration-678fddc5f5-c5htj 1/1 Running 0 30h
  ocnrf-appinfo-8b7879cdb-jds4r 1/1 Running 0 30h
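To quickly confirm the image that each pod is actually running, a jsonpath query can be used. The following is a minimal sketch, assuming the ocnrf namespace used in the examples above:
# kubectl -n ocnrf get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'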
4.2.1.1.2 Docker registry is configured incorrectly
Problem
helm install might fail if the docker registry is not configured on all primary and secondary nodes.
Error Code or Error Message
When kubectl get pods -n <ocnrf_namespace> is run, the status of the pods might be ImagePullBackOff or ErrImagePull.
For example:
$ kubectl get pods -n ocnrf
NAME READY STATUS RESTARTS AGE
ocnrf-egressgateway-d6567bbdb-9jrsx 1/2 ImagePullBackOff 0 30h
ocnrf-egressgateway-d6567bbdb-ntn2v 2/2 Running 0 30h
ocnrf-ingressgateway-754d645984-h9vzq 2/2 Running 0 30h
ocnrf-ingressgateway-754d645984-njz4w 2/2 Running 0 30h
ocnrf-nfaccesstoken-59fb96494c-k8w9p 1/1 Running 0 30h
ocnrf-nfaccesstoken-49fb96494c-k8w9q 1/1 Running 0 30h
ocnrf-nfdiscovery-84965d4fb9-rjxg2 1/1 Running 0 30h
ocnrf-nfdiscovery-94965d4fb9-rjxg3 1/1 Running 0 30h
ocnrf-nfregistration-64f4d8f5d5-6q92j 1/1 Running 0 30h
ocnrf-nfregistration-44f4d8f5d5-6q92i 1/1 Running 0 30h
ocnrf-nfsubscription-5b6db965b9-gcvpf 1/1 Running 0 30h
ocnrf-nfsubscription-4b6db965b9-gcvpe 1/1 Running 0 30h
ocnrf-nrfauditor-67b676dd87-xktbm 1/1 Running 0 30h
ocnrf-nrfconfiguration-678fddc5f5-c5htj 1/1 Running 0 30h
ocnrf-appinfo-8b7879cdb-jds4r 1/1 Running 0 30h
Solution
Configure docker registry on all primary and secondary nodes. For more information on configuring the docker registry, see Oracle Communications Cloud Native Core, Network Repository Function Installation, Upgrade, and Fault Recovery Guide.
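The pod events usually identify the registry or image that cannot be pulled. The following is a minimal sketch, assuming the failing egress gateway pod from the example above and a docker-based runtime; the registry host, port, and tag are placeholders:
# kubectl -n ocnrf describe pod ocnrf-egressgateway-d6567bbdb-9jrsx | grep -A10 Events
$ docker pull <registry-host>:<port>/ocnrf-nfregistration:<tag>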
4.2.1.1.3 Continuous Restart of Pods
Problem
helm install might fail if the MySQL primary and secondary hosts are not configured properly in ocnrf-custom-values.yaml.
Error Code/Error Message
When kubectl get pods -n <ocnrf_namespace> is run, the pod restart count increases continuously.
For example:
$ kubectl get pods -n ocnrf
NAME READY STATUS RESTARTS AGE
ocnrf-egressgateway-d6567bbdb-9jrsx 2/2 Running 0 30h
ocnrf-egressgateway-d6567bbdb-ntn2v 2/2 Running 0 30h
ocnrf-ingressgateway-754d645984-h9vzq 2/2 Running 0 30h
ocnrf-ingressgateway-754d645984-njz4w 2/2 Running 2 30h
ocnrf-nfaccesstoken-59fb96494c-k8w9p 1/1 Running 0 30h
ocnrf-nfaccesstoken-49fb96494c-k8w9q 1/1 Running 0 30h
ocnrf-nfdiscovery-84965d4fb9-rjxg2 1/1 Running 0 30h
ocnrf-nfdiscovery-94965d4fb9-rjxg3 1/1 Running 0 30h
ocnrf-nfregistration-64f4d8f5d5-6q92j 1/1 Running 0 30h
ocnrf-nfregistration-44f4d8f5d5-6q92i 1/1 Running 0 30h
ocnrf-nfsubscription-5b6db965b9-gcvpf 1/1 Running 0 30h
ocnrf-nfsubscription-4b6db965b9-gcvpe 1/1 Running 0 30h
ocnrf-nrfauditor-67b676dd87-xktbm 1/1 Running 0 30h
ocnrf-nrfconfiguration-678fddc5f5-c5htj 1/1 Running 0 30h
ocnrf-appinfo-8b7879cdb-jds4r 1/1 Running 0 30h
Solution
The MySQL server(s) may not be configured properly according to the preinstallation steps. For configuring MySQL servers, see the "Configuring Database, Creating Users, and Granting Permissions" section in Oracle Communications Cloud Native Core, Network Repository Function Installation, Upgrade, and Fault Recovery Guide.
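The reason for the restarts can usually be confirmed from the logs of the previously terminated container. The following is a minimal sketch, assuming one of the restarting pods from the example above:
# kubectl -n ocnrf logs ocnrf-ingressgateway-754d645984-njz4w --previous | grep -i -E "mysql|connection|exception"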
4.2.1.2 Custom Value File Parse Failure
This section describes how to troubleshoot a parse failure of the ocnrf-custom-values.yaml file.
Problem
Unable to parse the ocnrf-custom-values-x.x.x.yaml file while running helm install.
Error Code/Error Message
Error: failed to parse ocnrf-custom-values-x.x.x.yaml: error converting YAML to JSON: yaml
Symptom
While creating the ocnrf-custom-values-x.x.x.yaml file, if the above error is received, it means that the file is not created properly: the YAML tree structure may not have been followed, or the file may contain tab characters.
Solution
- Download the latest NRF templates zip file from My Oracle Support. For more information, see the "Downloading NRF package" section in Oracle Communications Cloud Native Core, Network Repository Function Installation, Upgrade, and Fault Recovery Guide.
- Follow the steps mentioned in the "Installation Tasks" section in Oracle Communications Cloud Native Core, Network Repository Function Installation, Upgrade, and Fault Recovery Guide.
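Before rerunning the installation, the custom values file can be checked for YAML errors and tab characters. The following is a minimal sketch; the chart archive name is a placeholder:
# helm template ocnrf ocnrf-<release>.tgz -f ocnrf-custom-values-x.x.x.yaml > /dev/null
# grep -n "$(printf '\t')" ocnrf-custom-values-x.x.x.yaml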
4.2.2 Postinstallation
This section describes the common postinstallation related issues and their resolution steps.
4.2.2.1 Helm Test Error Scenarios
Following are the error scenarios that may be identified using helm test.
- Run the following command to get the Helm Test pod name:
  kubectl get pods -n <deployment-namespace>
- When a helm test is performed, a new helm test pod is created. Check for the Helm Test pod that is in an error state.
- Get the logs using the following command:
  kubectl logs <podname> -n <namespace>
  Example:
  kubectl logs <helm_test_pod> -n ocnrf
For further assistance, collect the logs and contact MOS.
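After fixing the underlying issue, the Helm tests can be rerun with the standard Helm 3 command. A sketch, assuming the release name is ocnrf:
# helm test ocnrf -n ocnrf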
4.3 Upgrade or Rollback Failure
When NRF upgrade or rollback fails, perform the following procedure.
- Check the preupgrade or postupgrade logs, or the rollback hook logs, in Kibana as applicable.
  Users can filter upgrade or rollback logs using the following filters:
  - For upgrade: lifeCycleEvent=9001
  - For rollback: lifeCycleEvent=9002
  For example:
  {
    "time_stamp": "2021-08-23 06:45:57.698+0000",
    "thread": "main",
    "level": "INFO",
    "logger": "com.oracle.cgbu.cne.ocnrf.hooks.releases.ReleaseHelmHook_1_14_1",
    "message": "{logMsg=Starting Pre-Upgrade hook Execution, lifeCycleEvent=9001 | Upgrade, sourceRelease=101400, targetRelease=101401}",
    "loc": "com.oracle.cgbu.ocnrf.common.utils.EventSpecificLogger.submit(EventSpecificLogger.java:94)"
  }
- Check the pod logs in Kibana to analyze the cause of failure.
- After detecting the cause of failure, do the following:
- For upgrade failure:
  - If the cause of the upgrade failure is a database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the upgrade command.
  - If the failure occurs during the preupgrade phase, do not perform a rollback.
  - If the failure occurs during the postupgrade phase, for example, a postupgrade hook failure due to a target release pod not moving to the ready state, then perform a rollback.
- For rollback failure: If the cause of the rollback failure is a database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the rollback command.
- If the issue persists, contact My Oracle Support.
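The failed revision and a rollback target can be identified with standard Helm commands before rerunning the upgrade or rollback. A sketch, assuming the release name is ocnrf:
# helm history ocnrf -n ocnrf
# helm rollback ocnrf <revision> -n ocnrf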
4.4 Troubleshooting CDS
Service operation responses do not contain remote NRF set data
CDS is down
- When the CDS is down, the OcnrfCacheDataServiceDown alert is raised. All the NRF core microservices fall back to cnDBTier for serving the requests. In this case, the NRF instance has the local set georeplicated view and not the segment-level view.
- Check the resolution steps to resolve the OcnrfCacheDataServiceDown alert.
- Once the alert is cleared and CDS is in the Running state, the NRF core microservices connect to CDS to serve the requests.
- In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
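The CDS pod state can also be confirmed directly from the command line. A sketch, assuming the CDS pod name contains the string cache-data-service (a hypothetical name; check the actual pod name in your deployment):
# kubectl -n ocnrf get pods | grep cache-data-service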
CDS unable to synchronize with the remote NRF set
- If the CDS from a set is unable to synchronize the in-memory cache from the remote NRF’s CDS, then the CDS attempts to reach healthy remote NRFs to synchronize the in-memory cache.
- The retry attempt to the same remote NRF is performed based on the configuration in Egress Gateway.
- The reroute from local NRF is based on the NRF Growth feature configuration. For more information about the feature configuration, see Oracle Communications Cloud Native Core, Network Repository Function REST Specification Guide.
- If none of the remote NRFs are reachable, then the CDS uses the last known data from the remote set to serve the service requests.
Incorrect Feature Configuration
- If the CDS from a set is unable to synchronize the in-memory cache from the remote NRF’s CDS, then the CDS attempts to reach healthy remote NRFs to synchronize the in-memory cache.
- Check the NRF Growth feature configuration as mentioned in the REST configuration. For more information about the feature configuration, see Oracle Communications Cloud Native Core, Network Repository Function REST Specification Guide.
- In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
CDS Unreachable
- Check for the OcnrfDatabaseFallbackUsed alert.
If present, wait for 30 seconds to 1 minute and retry until the alerts are cleared. If the alerts are not cleared, see the alert details for resolution steps.
- In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
CDS unable to synchronize with the local cnDBTier
- If the CDS is unable to synchronize the data with the local cnDBTier, then the CDS marks itself as not ready.
- With the CDS not being ready, the NRF core services mark themselves as not ready, forcing the NF consumers and producers to move to mated and healthy NRFs.
- The CDS-to-CDS synchronization request also fails so that the NRFs in the peer set move to healthy NRFs for updated data synchronization.
NF Records present in NRF after Deregistration
- Check for the following alerts:
- OcnrfRemoteSetNrfSyncFailed
- OcnrfSyncFailureFromAllNrfsOfAnyRemoteSet
- OcnrfSyncFailureFromAllNrfsOfAllRemoteSets
If present, wait for 30 seconds to 1 minute and retry until the alerts are cleared. If the alerts are not cleared, see the alert details for resolution steps.
- Check the nrfHostConfigList configuration in the local NRF set.
- In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
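The active alerts can also be checked from the Prometheus API instead of the dashboard. A sketch, assuming a reachable Prometheus endpoint and the jq tool; the host and port are placeholders:
# curl -s http://<prometheus-host>:<port>/api/v1/alerts | jq -r '.data.alerts[].labels.alertname' | grep -i ocnrf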
4.5 TLS Connection Failure
This section describes the TLS related issues and their resolution steps. It is recommended to attempt the resolution steps provided in this guide before contacting Oracle Support.
Problem: Handshake is not established between NRFs.
Scenario: When the client version is TLS 1.2 and the server version is TLS 1.3
Server Error Message
The client supported protocol versions [TLSv1.2] are not accepted by server preferences [TLSv1.3]
Client Error Message
Received fatal alert: protocol_version
Scenario: When the client version is TLS 1.3 and the server version is TLS 1.2
Server Error Message
The client supported protocol versions [TLSv1.3] are not accepted by server preferences [TLSv1.2]
Client Error Message
Received fatal alert: protocol_version
Solution:
If the error logs have the SSL exception, do the following:
Check the TLS version of both NRFs. If each NRF supports a different, single TLS version (that is, NRF1 supports TLS 1.2 only and NRF2 supports TLS 1.3 only, or vice versa), the handshake fails. Ensure that the TLS version is the same for both NRFs, or revert to the default configuration for both NRFs. The supported TLS version combinations are listed in the following table.
Table 4-1 TLS Version Used
Client TLS Version | Server TLS Version | TLS Version Used
---|---|---
TLS 1.2, TLS 1.3 | TLS 1.2, TLS 1.3 | TLS 1.3
TLS 1.3 | TLS 1.3 | TLS 1.3
TLS 1.3 | TLS 1.2, TLS 1.3 | TLS 1.3
TLS 1.2, TLS 1.3 | TLS 1.3 | TLS 1.3
TLS 1.2 | TLS 1.2, TLS 1.3 | TLS 1.2
TLS 1.2, TLS 1.3 | TLS 1.2 | TLS 1.2
Check the cipher suites supported by both NRFs; they must either be the same or have at least one cipher suite in common. If not, revert to the default configuration.
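The TLS versions and cipher suites that a peer actually accepts can be probed with openssl (version 1.1.1 or later is needed for TLS 1.3). A sketch; the host and port are placeholders:
# openssl s_client -connect <nrf-host>:<port> -tls1_3 </dev/null
# openssl s_client -connect <nrf-host>:<port> -tls1_2 </dev/null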
Problem: Pods not coming up after populating the clientDisabledExtension or serverDisabledExtension parameter.
Solution:
- Check the values given in the Helm parameters. The following values must not be added to these parameters:
- supported_versions
- key_share
- supported_groups
- signature_algorithms
- pre_shared_key
Problem: Pods not coming up after populating the clientSignatureSchemes parameter.
Solution:
- Check the values given in the Helm parameters.
- The following values must not be removed from this parameter:
- rsa_pkcs1_sha512
- rsa_pkcs1_sha384
- rsa_pkcs1_sha256
Problem: Connection Failure Due to Cipher Mismatch Between the NRF Client and the Producer Server for TLS 1.3
Scenario: The NRF client is configured to request a connection using TLS 1.3 with specific ciphers that are not supported by the producer server. As a result, the connection fails due to the cipher mismatch, preventing secure communication between the client and server.
Error Messages
No appropriate protocol (protocol is disabled or cipher suites are inappropriate)
Received fatal alert: handshake_failure
Solution:
- Ensure that the following cipher suites are configured for the NRF client to use with TLS 1.3:
- TLS_AES_128_GCM_SHA256
- TLS_AES_256_GCM_SHA384
- TLS_CHACHA20_POLY1305_SHA256
- Verify TLS 1.3 secure communication between the NRF client and the producer server to ensure that the issue has been resolved.
Problem: Connection Failure for TLS 1.3 Due to Expired Certificates
Scenario: The NRF client is attempting to establish a connection using TLS 1.3, but the connection fails due to expired certificates. Specifically, the NRF client is presenting TLS 1.3 certificates that have passed their validity period, which causes the producer server to reject the connection.
Error Messages
Service Unavailable for producer due to Certificate Expired
Received fatal alert: handshake_failure
Solution:
- Verify the validity of the current certificate (see the sketch after this list).
- If the certificate has expired, renew it or extend its validity.
- Attempt to establish a connection between the NRF client and the producer server to confirm that the issue has been resolved.
- Verify TLS 1.3 secure communication.
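Certificate validity can be checked with openssl, either on the certificate file or against the live endpoint. A minimal sketch; the file name, host, and port are placeholders:
# openssl x509 -in <client-cert>.pem -noout -enddate
# openssl s_client -connect <nrf-host>:<port> </dev/null 2>/dev/null | openssl x509 -noout -enddate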