4.3 Deployment Related Issues

This section describes the most common deployment related issues and their resolution steps. It is recommended that you attempt the resolutions provided in this section before contacting Oracle Support.

4.3.1 Debugging Pre-Installation Related Issues

As of now, there are no known preinstallation related issues that you may encounter before installing UDR. However, it is recommended to see the Prerequisites and PreInstallation Tasks section in the Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide to prepare for UDR installation.

4.3.2 Debugging Installation Related Issues

This section describes how to troubleshoot installation related issues. In addition to the information shared in this section, it is recommended to see the Generic Checklist section.

4.3.2.1 Debugging Pod Creation Failure

Pod creation can fail for various reasons. Some of the possible scenarios are as follows:

Verifying Pod Image Correctness

To verify pod image:

  • Check whether any of the pods is in the ImagePullBackOff state.
  • Check whether the image name used for any pod is incorrect. Verify the following values in the custom-values.yaml file.
    global:
      dockerRegistry: ocudr-registry.us.oracle.com:5000/ocudr
     
    nudr-drservice:
      image:
        name: nudr_datarepository_service
        tag: 25.2.100
     
    nudr-dr-provservice:
      image:
        name: nudr_datarepository_service
        tag: 25.2.100
     
     
    nudr-notify-service:
      image:
        name: nudr_notify_service
        tag: 25.2.100
     
    nudr-config:
      image:
        name: nudr_config
        tag: 25.2.100
     
    config-server:
      # Image details
      image: ocpm_config_server
      imageTag: 25.2.102
      pullPolicy: IfNotPresent
      
    ingressgateway-sig:
      image:
        name: ocingress_gateway
        tag: 25.2.104
      initContainersImage:
        name: configurationinit
        tag: 25.2.104
      updateContainersImage:
        name: configurationupdate
        tag: 25.2.104
      dbHookImage:
        name: common_config_hook
        tag: 25.2.104
        pullPolicy: IfNotPresent
     
    ingressgateway-prov:
      image:
        name: ocingress_gateway
        tag: 25.2.104
      initContainersImage:
        name: configurationinit
        tag: 25.2.104
      updateContainersImage:
        name: configurationupdate
        tag: 25.2.104
      dbHookImage:
        name: common_config_hook
        tag: 25.2.104
        pullPolicy: IfNotPresent
     
    egressgateway:
      image:
        name: ocegress_gateway
        tag: 25.2.104
      initContainersImage:
        name: configurationinit
        tag: 25.2.104
      updateContainersImage:
        name: configurationupdate
        tag: 25.2.104
      dbHookImage:
        name: common_config_hook
        tag: 25.2.104
        pullPolicy: IfNotPresent
     
    nudr-diameterproxy:
      image:
        name: nudr_diameterproxy
        tag: 25.2.100
         
    nudr-ondemandmigration:
      image:
        name: nudr_ondemandmigration
        tag: 25.2.100
     
    alternate-route:
      deploymentDnsSrv:
        image: alternate_route
        tag: 25.2.104
        pullPolicy: IfNotPresent
      dbHookImage:
        name: common_config_hook
        tag: 25.2.104
        pullPolicy: IfNotPresent
     
    perf-info:
      image: perf-info
      imageTag: 25.2.102
      imagepullPolicy: Always
     
      dbHookImage:
        name: common_config_hook
        tag: 25.2.104
        pullPolicy: IfNotPresent
     
    app-info:
      image: app-info
      imageTag: 25.2.102
      imagepullPolicy: Always
     
      dbHookImage:
        name: common_config_hook
        tag: 25.2.104
        pullPolicy: IfNotPresent
     
    nrf-client:
      image: nrf-client
      imageTag: 25.2.102
      imagepullPolicy: Always
     
      dbHookImage:
        name: common_config_hook
        tag: 25.2.104
        pullPolicy: IfNotPresent
     
    nudr-dbcr-auditor-service:
      image:
        name: nudr_dbcr_auditor_service
        tag: 25.2.100
        pullPolicy: IfNotPresent
  • After updating the values.yaml file, run the following command for helm upgrade:

    helm upgrade <release> <helm chart> [--version <OCUDR version>] --namespace <ocudr-namespace> -f <ocudr_values.yaml>

  • If the helm install command is stuck for a long time or fails with a timeout error, verify whether the pre-install hooks have come up and whether any of the hook pods is in the ImagePullBackOff state. Check the hook image details as follows.

    hookImageDetails

    global:
      dockerRegistry: ocudr-registry.us.oracle.com:5000/ocudr
     
      preInstall:
        image:
          name: nudr_common_hooks
          tag: 25.2.100
       
      preUpgrade:
        image:
          name: nudr_common_hooks
          tag: 25.2.100
     
      postUpgrade:
        image:
          name: nudr_common_hooks
          tag: 25.2.100
     
      postInstall:
        image:
          name: nudr_common_hooks
          tag: 25.2.100
     
      preRollback:
        image:
          name: nudr_common_hooks
          tag: 25.2.100
     
      postRollback:
        image:
          name: nudr_common_hooks
          tag: 25.2.100
     
      test:
        image:
          name: nf_test
          tag: 25.2.102

    After updating these values, you can purge the deployment and run the Helm installation again. A command sketch for these image checks follows this list.
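
The following is a minimal command sketch for locating image pull failures and confirming which image a pod actually uses; the namespace and pod names are placeholders that must be adjusted to your deployment.

# List pods and look for ImagePullBackOff or ErrImagePull states
kubectl get pods -n <ocudr-namespace>

# Inspect the image name and tag configured on a failing pod
kubectl describe pod <failing-pod-name> -n <ocudr-namespace> | grep -i "image:"

# Check recent events for pull errors (registry, repository, or tag mismatches)
kubectl get events -n <ocudr-namespace> --sort-by=.lastTimestamp | grep -i "pull"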

Verifying Resource Allocation Failure

To verify any resource allocation failure:

  • Run the following command to check the details of any pod that is in the Pending state:

    kubectl describe pod <nudr-drservice pod id> -n <ocudr-namespace>

  • Verify whether any warning about insufficient CPU exists in the describe output of the respective pod. If it exists, there are insufficient CPUs for the pods to start. Address this hardware issue (see the command sketch after this list).
  • If any preinstall hooks are in the Pending state, check the resources allocated for the hooks. Do not allocate higher values for hooks. If hooks with lower CPU or memory are going to the Pending state, then there is an issue with the available resources on the cluster. Check the resources and reduce the number of CPUs allotted to the pod in the values.yaml file.

    hookresources

    global:
      hookJobResources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi

    resources

    nudr-drservice:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
     
    nudr-dr-provservice:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
     
    nudr-notify-service:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
     
    nudr-config:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
     
    config-server:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 512Mi
     
    nudr-client:
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 1
          memory: 512Mi    
     
    nudr-diameterproxy:
      resources:
        limits:
          cpu: 3
          memory: 4Gi
        requests:
          cpu: 3
          memory: 4Gi   
     
    nudr-ondemand-migration:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi      
     
    ingressgateway:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
          initServiceCpu: 1
          initServiceMemory: 1Gi
          updateServiceCpu: 1
          updateServiceMemory: 1Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
        requests:
          cpu: 2
          memory: 2Gi
          initServiceCpu: 1
          initServiceMemory: 1Gi
          updateServiceCpu: 1
          updateServiceMemory: 1Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
     
    egressgateway:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
          initServiceCpu: 1
          initServiceMemory: 1Gi
          updateServiceCpu: 1
          updateServiceMemory: 1Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
        requests:
          cpu: 2
          memory: 2Gi
          initServiceCpu: 1
          initServiceMemory: 1Gi
          updateServiceCpu: 1
          updateServiceMemory: 1Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
     
    alternate-route:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
        requests:
          cpu: 2
          memory: 2Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
     
    perf-info:
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
     
    app-info:
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
  • Run the following helm upgrade command after updating the values.yaml file.

    helm upgrade <release> <helm chart> [--version <OCUDR version>] --namespace <ocudr-namespace> -f <ocudr_values.yaml>
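
The following is a minimal sketch of the pending-pod checks described in this list; the pod and namespace names are placeholders.

# Find pods stuck in Pending and inspect their scheduling events
kubectl get pods -n <ocudr-namespace> --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name> -n <ocudr-namespace> | grep -i "insufficient"

# Compare requested resources against what the worker nodes can still offer
kubectl describe nodes | grep -A5 "Allocated resources"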

Verifying Resource Allocation Issues on Webscale Environment

The webscale environment has OpenShift Container Platform installed. There can be cases where:

  • Pods do not scale after you run the installation command and the helm install command fails with a timeout error. In this case, check for preinstall hook failures. Run the oc get job command to list the jobs (see the command sketch after this list). Describe the job for which the pods are not getting scaled and check whether there are quota limit exceeded errors for CPU or memory.
  • Any of the actual microservice pods do not scale after the hooks complete. In this case, run the oc get rs command to get the list of ReplicaSets created for the NF deployment. Then, describe the ReplicaSet for which the pods are not getting scaled and check for resource quota limit exceeded errors for CPU or memory.
  • The helm install command times out after all the microservice pods are scaled as expected with the expected number of replicas. In this case, check for post-install hook failures. Run the oc get job command to get the post-install jobs, describe the job for which the pods are not getting scaled, and check whether there are quota limit exceeded errors for CPU or memory.
  • The resource quota is exceeded beyond its limits.
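
The following is a minimal sketch of the OpenShift checks described above; the namespace, job, and ReplicaSet names are placeholders.

# Pre-install or post-install hook jobs: look for quota errors in the events
oc get job -n <ocudr-namespace>
oc describe job <hook-job-name> -n <ocudr-namespace> | grep -i "exceeded quota"

# Microservice pods that are not scaling: inspect the owning ReplicaSet
oc get rs -n <ocudr-namespace>
oc describe rs <replicaset-name> -n <ocudr-namespace> | grep -i "exceeded quota"

# Review the configured quotas against current usage
oc get resourcequota -n <ocudr-namespace> -o yaml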

4.3.2.2 Debugging Pod Startup Failure

Follow the guidelines shared below to debug pod startup failure and liveness check issues:
  • If dr-service, diameter-proxy, and diam-gateway services are stuck in the CrashLoopBackOff state, then the reason could be that config-server is not yet up. A sample log on these services is as follows:
    "Config Server is Not yet Up, Wait For config server to be up."

    To resolve this, make sure that the dependent services nudr-config and nudr-config-server are up; otherwise, the startup probe restarts the pod at the configured interval.

  • If the notify-service and on-demand migration service are stuck in the Init state, then the reason could be that the dr-service is not yet up. A sample log on these services is as follows:
    "DR Service is Not yet Up, Wait For dr service to be up."

    To resolve this, check for failures on the dr-service; otherwise, the startup probe restarts the pod at the configured interval.

  • If the microservices connecting to the MySQL database are stuck in the CrashLoopBackOff state, check for MySQL exceptions in the logs and fix them accordingly. If you receive the error messages The database secret is empty or Invalid data present in the secret in the main service container logs, make sure that the secret is created as mentioned in the document and check the case sensitivity of the keys in the secret, for example, encryptionKey, dsusername, and so on. A log-check sketch follows.
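
The following is a minimal sketch of these startup checks; the pod, secret, and namespace names are placeholders, and the secret name is assumed to match the dbCredSecretName configured in the custom values file.

# Look for the dependency wait messages in the affected pod
kubectl logs <pod-name> -n <ocudr-namespace> | grep -i "not yet up"

# Confirm that the database secret exists and inspect its keys (values stay base64 encoded)
kubectl get secret <ocudr-db-secret-name> -n <ocudr-namespace> -o jsonpath='{.data}'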

4.3.2.3 Debugging UDR with Service Mesh Failure

There are some known failure scenarios that you can encounter while installing UDR with service mesh. The scenarios along with their solutions are as follows:
  • Istio-Proxy sidecar container not attached to the pod: This failure arises when istio injection is not enabled on the namespace where the NF is installed. Run the following command to verify it (a combined check sketch is also provided at the end of this list):

    kubectl get namespace -L istio-injection

    Figure 4-6 Verifying Istio-Proxy


    To enable the istio injection, run the following command:

    kubectl label --overwrite namespace <nf-namespace> istio-injection=enabled

  • If any of the hook pods is stuck in the 'NotReady' state and is not cleared after completion, check whether the following configuration is set to 'true' under the global section. Also, ensure that the URL configured for istioSidecarQuitUrl is correct.

    Figure 4-7 When Hook Pod is NotReady

  • When Prometheus does not scrape metrics from nudr-nrf-client-service, see if the following annotation is present under nudr-nrf-client-service:

    Figure 4-8 nrf-client service

  • If there are issues in viewing UDR metrics on OSO Prometheus, you need to ensure that the following highlighted annotation is added to all deployments for the NF.

    Figure 4-9 Issues in Viewing UDR Metrics - Add Annotation

  • When vDBTier is used as the backend and there are connectivity issues when nudr-preinstall communicates with the database, which can be seen from the error logs on the preinstall hook pod, create the destination rule and service entry for mysql-connectivity-service in the occne-infra namespace.
  • When installed on ASM, if the ingressgateway, egressgateway, or alternate-route services go into CrashLoopBackOff, check whether the coherence ports are excluded for inbound and outbound traffic on the istio-proxy.
  • On the latest F5 versions, if the default istio-proxy resources assigned are too low, make sure that you assign a minimum of one CPU and one GB RAM for all UDR services. The traffic handling services must be the same as mentioned in the resource profile. If the pods crash due to insufficient memory, check the configuration. You can refer to the following annotations in the custom values file.
    deployment:
      # Replica count for deployment
      replicaCount: 2
      # Microservice specific annotations for deployment
      customExtension:
        labels: {}
        annotations:
          sidecar.istio.io/proxyCPU: "1000m"
          sidecar.istio.io/proxyCPULimit: "1000m"
          sidecar.istio.io/proxyMemory: "1Gi"
          sidecar.istio.io/proxyMemoryLimit: "1Gi"
          proxy.istio.io/config: |
            terminationDrainDuration: 60s
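
The following is a minimal sketch of the sidecar checks described in this list; the namespace and pod names are placeholders.

# Confirm that istio injection is enabled on the NF namespace
kubectl get namespace <nf-namespace> -L istio-injection

# Enable injection if it is missing
kubectl label --overwrite namespace <nf-namespace> istio-injection=enabled

# Verify that the istio-proxy sidecar is present in a pod
kubectl get pod <pod-name> -n <nf-namespace> -o jsonpath='{.spec.containers[*].name}'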

4.3.2.4 Debugging SLF Default Group ID related Issues

SLF default group ID is added to the SLF_GROUP_NAME table through Helm hooks during UDR installation or upgrade. If a subscriber is not found and the default group ID is enabled, then a response with the default group ID is sent. If the default group ID is not found in the response, use the following API to add the default group ID (this is similar to the PUT operation for other SLF groups; a curl sketch follows the payload).
http://localhost:8080/slf-group-prov/v1/slf-group
{
	"slfGroupName": "DefaultGrp",
	"slfGroupType": "LteHss",
	"nfGroupIDs": {
		"NEF": "nef-group-default",
		"UDM": "udm-group-default",
		"PCF": "pcf-group-default",
		"AUSF": "ausf-group-default",
		"CHF": "chf-group-default"
	}
}
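
The following is a minimal curl sketch of the above provisioning call; the endpoint shown assumes the request is sent to the provisioning service reachable at localhost:8080 as in the URL above, so adjust the host, port, and path to match your deployment and the provisioning API documentation.

curl -X PUT "http://localhost:8080/slf-group-prov/v1/slf-group" \
  -H "Content-Type: application/json" \
  -d '{
        "slfGroupName": "DefaultGrp",
        "slfGroupType": "LteHss",
        "nfGroupIDs": {
          "NEF": "nef-group-default",
          "UDM": "udm-group-default",
          "PCF": "pcf-group-default",
          "AUSF": "ausf-group-default",
          "CHF": "chf-group-default"
        }
      }'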

The default group name is dynamically editable through CNCC. If the user changes the default group name on CNCC and does not add the same to SLF_GROUP_NAME, then the default group name can be added through the API as mentioned above.

4.3.2.5 Debugging Subscriber Activity Logging

This section describes how to troubleshoot the subscriber activity logging related issues.
  • If subscriber activity logging is not enabled, check the subscriberActivityEnabled flag and subscriberIdentifiers keys in the Global Configurations Parameters. For more information, see Oracle Communications Cloud Native Core, Unified Data Repository User Guide.
  • If you are not getting the subscriber logs after enabling the flag, then make sure that the subscriber identifiers mentioned in the configuration API contain the same key values as the testing subscriber identifiers.
  • Each subscriber identifier can be configured with up to 100 keys using CNC Console or REST API.
  • You can remove a subscriber from this feature by removing the subscriber identifier keys from the Global Configurations Parameters as shown below:
    "subscriberActivityEnabled": true,
    "subscriberIdentifiers": {
      "nai": [],
      "imsi": [
        "1111111127,1111111128"
      ],
      "extid": [],
      "msisdn": [
        "1111111129,1111111130"
      ]
    }

4.3.2.6 Debugging Subscriber Bulk Import Tool Related Issues

Subscriber Bulk Import tool pod can run into the Pending state during installation. The reasons could be as follows:
  • Resources are not available. In this case, allocate more resources for the namespace. The "kubectl describe pod" output gives more details on this issue (see the command sketch at the end of this list).
  • PVC allocation failed for the Subscriber Bulk Import tool. During reinstallation, there can be a case where the existing PVC is not linked to the Subscriber Bulk Import tool. You can debug the issue based on the details from the describe pod output.
  • The storage class is not configured correctly. In this case, check the correctness of the configuration as shown below.

    Figure 4-10 Bulk Import Persistent Claim



  • When the Subscriber Bulk Import tool installation is complete, there can be a case where the REST API configuration is not working. In this case, make sure that the following configuration is updated in the custom-values.yaml file.

    Figure 4-11 OCUDR Release name



  • If the transfer-in and transfer-out functionality does not work after the remote host transfer is enabled, make sure that the following conditions are met to resolve the issue:
    • The kubernetes secrets created are correct for the private and public keys.
    • The remote host configured and secrets created are of the same remote host.
    • The remote path is correct.
    • The space remaining in the remote host is within the limits of the file size that you are transferring.
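
The following is a minimal sketch of the pod and PVC checks mentioned in this list; the pod, PVC, and namespace names are placeholders.

# Inspect why the tool pod is Pending (resources, PVC binding, storage class)
kubectl describe pod <nudr-bulk-import-pod-name> -n <ocudr-namespace>

# Check the PVC state and the storage class it requests
kubectl get pvc -n <ocudr-namespace>
kubectl describe pvc <bulk-import-pvc-name> -n <ocudr-namespace>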

4.3.2.7 Debugging NF Scoring for a Site

If there are issues related to NF Scoring, then perform the following steps:
  • Perform a GET request on http://<nudr-config-host>:<nudr-config-port>/udr/nf-common-component/v1/app-info/nfScoring to check whether the NF Scoring feature is enabled (a curl sketch follows this list). If the feature is disabled, the request returns an "ERROR feature not enabled" response. To enable the feature, set the feature flag to true using the configuration API, and then fetch the NF score.
  • To get the detailed information on provisioning and signaling, multiple ingress gateways must be set to true for UDR and SLF.
  • If the Custom Criteria is enabled and the calculation of NF score for custom criteria fails, you must check the name of the metric and other configured details in custom criteria.
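
The following is a minimal curl sketch for the check described in the first item of this list; the host and port placeholders must match your nudr-config service.

# Returns the NF score details when the feature is enabled
curl -X GET "http://<nudr-config-host>:<nudr-config-port>/udr/nf-common-component/v1/app-info/nfScoring"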

4.3.2.8 Debugging Subscriber Export Tool Related Issues

Subscriber Export tool pod can run into the Pending state during installation. The reasons could be as follows:
  • Resources are not available. In this case, allocate more resources for the namespace. The "kubectl describe pod" output gives more details on this issue.
  • PVC allocation failed for the Subscriber Export tool. During reinstallation, there can be a case where the existing PVC is not linked to the Subscriber Export tool. You can debug the issue based on the details from the describe pod output.
  • The storage class is not configured correctly. In this case, check the correctness of the configuration as shown below.

    Figure 4-12 Export Tool Persistent Claim



  • When the Subscriber Export tool installation is complete, there can be a case where the REST API configuration is not working. In this case, make sure that the following configuration is updated in the custom-values.yaml file.

    Figure 4-13 OCUDR Release name



  • If the export dump is not generated, check the logs for more details and verify that the configuration is updated correctly as shown below.

    Figure 4-14 Export Tool Persistent Claim Standard



  • If the transfer-in and transfer-out functionality does not work after the remote host transfer is enabled, make sure that the following conditions are met to resolve the issue:
    • The kubernetes secrets created are correct for the private and public keys.
    • The remote host configured and secrets created are of the same remote host.
    • The remote path is correct.
    • The space remaining in the remote host is within the limits of the file size that you are transferring.

4.3.2.9 Debugging Controlled Shutdown Related Issues

If there are issues related to controlled shutdown, then perform the following steps:
  • If controlled shutdown is not enabled, check the REST API GLOBAL configuration section. Make sure that the enableControlledShutdown parameter is set to true to enable the feature.
  • Once the flag is set to true, you can send a PUT request to udr/nf-common-component/v1/operationalState (see the curl sketch after this list). The PUT request throws an error if the flag is disabled.
  • When the operational state is set to COMPLETE_SHUTDOWN, all the ingress gateway requests are rejected with the configured error codes. If the requests are not rejected, check whether the feature flag is enabled and send a GET request to udr/nf-common-component/v1/operationalState.
  • The subscriber export tool and subscriber import tool reject all new requests that are queued for processing.
  • When the operational state is COMPLETE_SHUTDOWN, the NF status is updated as SUSPENDED at the NRF. Check the app-info logs if the status is not updated to SUSPENDED. The logs contain the operational state of COMPLETE_SHUTDOWN.
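
The following is a minimal curl sketch for checking and changing the operational state; the host and port placeholders are assumed to point to the service exposing this API, and the PUT request body shown is illustrative, so refer to the UDR REST API documentation for the exact schema.

# Read the current operational state
curl -X GET "http://<nudr-config-host>:<nudr-config-port>/udr/nf-common-component/v1/operationalState"

# Set the operational state (body format is illustrative; check the REST API guide)
curl -X PUT "http://<nudr-config-host>:<nudr-config-port>/udr/nf-common-component/v1/operationalState" \
  -H "Content-Type: application/json" \
  -d '{"operationalState": "COMPLETE_SHUTDOWN"}'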

4.3.2.10 Debug Readiness Failure

During the lifecycle of a pod, if the pod containers are in the NotReady state, the reasons could be as follows (a log-check sketch is provided after this list):
  • Make sure that the dependent services are up. Check the logs for the following content:

    Dependent services down, Set readiness state to REFUSING_TRAFFIC

  • Make sure that the database is available and that app-info is ready to monitor the database. Check the logs for the following content:

    DB connection down, Set readiness state to REFUSING_TRAFFIC

  • The readiness failure can occur if the resource map or key map tables in the database do not have proper content. Check the logs for the following content:

    ReourceMap/KeyMap Entries missing, Set readiness state to REFUSING_TRAFFIC
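
The following is a minimal sketch for locating these readiness messages; the pod and namespace names are placeholders.

# Search the affected pod for readiness-related messages
kubectl logs <pod-name> -n <ocudr-namespace> | grep "REFUSING_TRAFFIC"

# Inspect the readiness probe status and recent events for the pod
kubectl describe pod <pod-name> -n <ocudr-namespace> | grep -A5 -i "readiness"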

4.3.2.11 Enable cnDBTier Metrics with OSO Prometheus

The following YAML file must be applied to the cnDBTier setup. For example:
kubectl create -f <.yaml> -n <nsdbtier>
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cndb-to-mysql-external-se
spec:
  exportTo:
  - "."
  hosts:
  - mysql-connectivity-service
  location: MESH_EXTERNAL
  ports:
  - number: 3306
    name: mysql2
    protocol: MySQL
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: nf-to-nf
spec:
  exportTo:
  - "."
  hosts:
  - "*.$DOMAIN_NAME"   # DOMAIN_NAME must be replaced with the deployed CNE Domain name
  location: MESH_EXTERNAL
  ports:
  - number: 80
    name: HTTP2-80
    protocol: TCP
  - number: 8080
    name: HTTP2-8080
    protocol: TCP
  - number: 3306
    name: TCP-3306
    protocol: TCP
  - number: 1186
    name: TCP-1186
    protocol: TCP
  - number: 2202
    name: TCP-2202
    protocol: TCP
  resolution: NONE
---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: PERMISSIVE
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: nf-to-kube-api-server
spec:
  hosts:
  - kubernetes.default.svc.$DOMAIN_NAME  # DOMAIN_NAME must be replaced with the deployed CNE Domain name
  exportTo:
  - "."
  addresses:
  - 172.16.13.4
  location: MESH_INTERNAL
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  - number: 6443
    name: https-1
    protocol: HTTPS
  resolution: NONE
---
Install Operations Services Overlay (OSO) Prometheus with the cnDBTier namespace configured in the OSO custom-values.yaml file to retrieve the cnDBTier metrics. The sample yaml file is given below:
##################################################################################
#                                                                                #
# Copyright (c) 2022 Oracle and/or its affiliates. All rights reserved.          #
#                                                                                #
##################################################################################
 
nameOverride: prom
 
## Helm-Test (Optional)
# Values needed for helm-test, Comment the entire Section if Helm-test not needed.
helmtestimage: occne-repo-host:5000/k8s.gcr.io/ingress-nginx/controller:v1.3.1
useasm: false
namespace: ocudr-ns
clustername: cne-23-1-rc2
resources:
  limits:
    cpu: 10m
    memory: 32Mi
  requests:
    cpu: 10m
    memory: 32Mi
promsvcname: oso-prom-svr
almsvcname: oso-prom-alm
prometheushealthyurl: /prometheus/-/healthy
prometheusreadyurl: /prometheus/-/ready
 
# Note: There are 3 types of label definitions provided in this custom values file
# TYPE1: Global(allResources)
# TYPE2: lb & nonlb TYPE label only
# TYPE3: service specific label
## NOTE: POD level labels can be inserted using the specific pod label sections, every pod/container has this label defined below in all components sections.
# ******** Custom Extension Global Parameters ********
#**************************************************************************
 
global_oso:
# Prefix & Suffix that will be added to containers
  k8Resource:
    container:
      prefix:
      suffix:
 
# Service account for Prometheus, Alertmanagers
  serviceAccountNamePromSvr: ""
  serviceAccountNameAlertMgr: ""
 
  customExtension:
# TYPE1 Label
    allResources:
      labels: {}
 
# TYPE2 Labels
    lbServices:
      labels: {}
 
    nonlbServices:
      labels: {}
 
    lbDeployments:
      labels: {}
 
    nonlbDeployments:
      labels: {}
 
    lbStatefulSets:
      labels: {}
 
# Add annotations for disabling sidecar injections into oso pods here
# eg: annotations:
#       - sidecar.istio.io/inject: "false"
annotations:
  - sidecar.istio.io/inject: "false"
 
## Setting this parameter to false will disable creation of all default clusterrole, clusterrolebinding, role, rolebindings for the components that are packaged in this csar.
rbac:
  create: true
 
podSecurityPolicy:
  enabled: false
 
## Define serviceAccount names for components. Defaults to component's fully qualified name.
##
serviceAccounts:
  alertmanager:
    create: true
    name:
    annotations: {}
  nodeExporter:
    create: false
    name:
    annotations: {}
  pushgateway:
    create: false
    name:
    annotations: {}
  server:
    create: true
    name:
    annotations: {}
 
alertmanager:
  enabled: true
 
  ## Use a ClusterRole (and ClusterRoleBinding)
  ## - If set to false - Define a Role and RoleBinding in the defined namespaces ONLY
  ## This makes alertmanager work - for users who do not have ClusterAdmin privs, but wants alertmanager to operate on their own namespaces, instead of clusterwide.
  useClusterRole: false
 
  ## Set to a rolename to use existing role - skipping role creating - but still doing serviceaccount and rolebinding to the rolename set here.
  useExistingRole: false
 
  ## alertmanager resources name
  name: alm
  image:
    repository: occne-repo-host:5000/quay.io/prometheus/alertmanager
    tag: v0.24.0
    pullPolicy: IfNotPresent
  extraArgs:
    data.retention: 120h
  prefixURL: /cne-23-1-rc2/alertmanager
  baseURL: "http://localhost/cne-23-1-rc2/alertmanager"
  configFileName: alertmanager.yml
  nodeSelector: {}
  affinity: {}
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
  persistentVolume:
    enabled: true
    accessModes:
      - ReadWriteOnce
    annotations: {}
    ## alertmanager data Persistent Volume existing claim name
    ## Requires alertmanager.persistentVolume.enabled: true
    ## If defined, PVC must be created manually before volume will be bound
    existingClaim: ""
    mountPath: /data
    size: 2Gi
    storageClass: "standard"
 
  ## Annotations to be added to alertmanager pods
  ##
  podAnnotations: {}
 
  ## Labels to be added to Prometheus AlertManager pods
  ##
  podLabels: {}
 
  replicaCount: 2
 
  ## Annotations to be added to deployment
  ##
  deploymentAnnotations: {}
 
  statefulSet:
    ## If true, use a statefulset instead of a deployment for pod management.
    ## This allows to scale replicas to more than 1 pod
    ##
    enabled: true
    annotations: {}
    labels: {}
    podManagementPolicy: OrderedReady
 
    ## Alertmanager headless service to use for the statefulset
    ##
    headless:
      annotations: {}
      labels: {}
 
      ## Enabling peer mesh service end points for enabling the HA alert manager
      ## Ref: https://github.com/prometheus/alertmanager/blob/master/README.md
      enableMeshPeer: true
 
      servicePort: 80
 
  ## alertmanager resource requests and limits
  ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
  ##
  resources:
    limits:
      cpu: 20m
      memory: 64Mi
    requests:
      cpu: 20m
      memory: 64Mi
 
  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 80
    # nodePort: 30000
    sessionAffinity: None
    type: ClusterIP
 
## Monitors ConfigMap changes and POSTs to a URL
## Ref: https://github.com/jimmidyson/configmap-reload
##
configmapReload:
  prometheus:
    enabled: true
    ## configmap-reload container name
    ##
    name: configmap-reload
    image:
      repository: occne-repo-host:5000/docker.io/jimmidyson/configmap-reload
      tag: v0.8.0
      pullPolicy: IfNotPresent
 
    # containerPort: 9533
 
    ## Additional configmap-reload mounts
    ##
    extraConfigmapMounts: []
 
    ## Security context to be added to configmap-reload container
    containerSecurityContext: {}
 
    ## configmap-reload resource requests and limits
    ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
    ##
    resources:
      limits:
        cpu: 10m
        memory: 32Mi
      requests:
        cpu: 10m
        memory: 32Mi
 
  alertmanager:
    enabled: true
    name: configmap-reload
    image:
      repository: occne-repo-host:5000/docker.io/jimmidyson/configmap-reload
      tag: v0.8.0
      pullPolicy: IfNotPresent
 
    # containerPort: 9533
    ## Additional configmap-reload mounts
    ##
    extraConfigmapMounts: []
      # - name: prometheus-alerts
      #   mountPath: /etc/alerts.d
      #   subPath: ""
      #   configMap: prometheus-alerts
      #   readOnly: true
 
    resources:
      limits:
        cpu: 10m
        memory: 32Mi
      requests:
        cpu: 10m
        memory: 32Mi
 
kubeStateMetrics:
  enabled: false
 
nodeExporter:
  enabled: false
 
server:
  enabled: true
  ## namespaces to monitor (instead of monitoring all - clusterwide). Needed if you want to run without Cluster-admin privileges.
  namespaces: []
  #  - ocudr-ns
  name: svr
  image:
    repository: occne-repo-host:5000/quay.io/prometheus/prometheus
    tag: v2.39.1
    pullPolicy: IfNotPresent
  prefixURL: /cne-23-1-rc2/prometheus
  baseURL: "http://localhost/cne-23-1-rc2/prometheus"
 
  ## Additional server container environment variables
  env: []
 
  # List of flags to override default parameters, e.g:
  # - --enable-feature=agent
  # - --storage.agent.retention.max-time=30m
  defaultFlagsOverride: []
 
  extraFlags:
    - web.enable-lifecycle
    ## web.enable-admin-api flag controls access to the administrative HTTP API which includes functionality such as
    ## deleting time series. This is disabled by default.
    # - web.enable-admin-api
    ##
    ## storage.tsdb.no-lockfile flag controls DB locking
    # - storage.tsdb.no-lockfile
    ##
    ## storage.tsdb.wal-compression flag enables compression of the write-ahead log (WAL)
    # - storage.tsdb.wal-compression
 
  ## Path to a configuration file on prometheus server container FS
  configPath: /etc/config/prometheus.yml
 
  global:
    scrape_interval: 1m
    scrape_timeout: 30s
    evaluation_interval: 1m
  #remoteWrite:
    #- url OSO_CORTEX_URL
    # remote_timout (default = 30s)
      #remote_timeout: OSO_REMOTE_WRITE_TIMEOUT
    # bearer_token for cortex server to be configured
    #      bearer_token: BEARER_TOKEN
 
  extraArgs:
    storage.tsdb.retention.size: 1GB
  ## Additional Prometheus server Volume mounts
  ##
  extraVolumeMounts: []
 
  ## Additional Prometheus server Volumes
  ##
  extraVolumes: []
 
  ## Additional Prometheus server hostPath mounts
  ##
  extraHostPathMounts: []
    # - name: certs-dir
    #   mountPath: /etc/kubernetes/certs
    #   subPath: ""
    #   hostPath: /etc/kubernetes/certs
    #   readOnly: true
 
  extraConfigmapMounts: []
 
  nodeSelector: {}
  affinity: {}
  podDisruptionBudget:
    enabled: false
    maxUnavailable: 1
 
  persistentVolume:
    enabled: true
    accessModes:
      - ReadWriteOnce
    annotations: {}
 
    ## Prometheus server data Persistent Volume existing claim name
    ## Requires server.persistentVolume.enabled: true
    ## If defined, PVC must be created manually before volume will be bound
    existingClaim: ""
    size: 2Gi
    storageClass: "standard"
 
  ## Annotations to be added to Prometheus server pods
  ##
  podAnnotations: {}
 
  ## Labels to be added to Prometheus server pods
  ##
  podLabels: {}
 
  ## Prometheus AlertManager configuration
  ##
  alertmanagers:
  - kubernetes_sd_configs:
      - role: pod
    # Namespace to be configured
        namespaces:
          names:
          - ocudr-ns
          - dbtier-ns
    path_prefix: cne-23-1-rc2/alertmanager
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
    # namespace to be configured
      regex: ocudr-ns
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: prom
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_component]
      regex: alm
      action: keep
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
      regex: .*
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex:
      action: drop
 
  ## Use a StatefulSet if replicaCount needs to be greater than 1 (see below)
  ##
  replicaCount: 1
 
  ## Annotations to be added to deployment
  ##
  deploymentAnnotations: {}
 
  statefulSet:
    ## If true, use a statefulset instead of a deployment for pod management.
    ## This allows to scale replicas to more than 1 pod
    ##
    enabled: false
 
    annotations: {}
    labels: {}
    podManagementPolicy: OrderedReady
 
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 2
      memory: 4Gi
 
  service:
    enabled: true
    annotations: {}
    labels: {}
    clusterIP: ""
    externalIPs: []
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 80
    sessionAffinity: None
    type: NodePort
 
    ## If using a statefulSet (statefulSet.enabled=true), configure the
    ## service to connect to a specific replica to have a consistent view
    ## of the data.
    statefulsetReplica:
      enabled: false
      replica: 0
  retention: "7d"
 
pushgateway:
  ## If false, pushgateway will not be installed
  ##
  enabled: false
 
## alertmanager ConfigMap entries
##
alertmanagerFiles:
  alertmanager.yml:
    global: {}
      # slack_api_url: ''
 
    receivers:
      - name: default-receiver
        # slack_configs:
        #  - channel: '@you'
        #    send_resolved: true
 
    route:
      group_wait: 10s
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
 
## Prometheus server ConfigMap entries for rule files (allow prometheus labels interpolation)
ruleFiles: {}
 
## Prometheus server ConfigMap entries
##
serverFiles:
 
  ## Alerts configuration
  ## Ref: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
  alerting_rules.yml: {}
 
  ## DEPRECATED DEFAULT VALUE, unless explicitly naming your files, please use alerting_rules.yml
  alerts: {}
 
  ## Records configuration
  ## Ref: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
  recording_rules.yml: {}
  ## DEPRECATED DEFAULT VALUE, unless explicitly naming your files, please use recording_rules.yml
  rules: {}
 
  prometheus.yml:
    rule_files:
      - /etc/config/recording_rules.yml
      - /etc/config/alerting_rules.yml
    ## Below two files are DEPRECATED will be removed from this default values file
      - /etc/config/rules
      - /etc/config/alerts
 
    scrape_configs:
      - job_name: prometheus
        metrics_path: cne-23-1-rc2/prometheus/metrics
        static_configs:
          - targets:
            - localhost:9090
 
extraScrapeConfigs: |
 
  - job_name: 'oracle-cnc-service'
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
          - ocudr-ns
          - dbtier-ns
        #  - ns2
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_oracle_com_cnc]
        regex: true
        action: keep
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_service_name
 
  - job_name: 'oracle-cnc-pod'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - ocudr-ns
          - dbtier-ns
        #  - ns2
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_oracle_com_cnc]
        regex: true
        action: keep
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
 
  - job_name: 'oracle-cnc-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
        #namespaces:
        #  names:
        #  - ns1
        #  - ns2
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_oracle_com_cnc]
        regex: true
        action: keep
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
 
  - job_name: 'oracle-cnc-ingress'
    kubernetes_sd_configs:
      - role: ingress
        #namespaces:
        #  names:
        #  - ns1
        #  - ns2
    relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_oracle_com_cnc]
        regex: true
        action: keep

4.3.2.12 Debugging ndbmysqld Pods Restart During cnDBTier Installation or Upgrade

During cnDBTier installation or upgrade, the readiness probe fails as the ndbmysqld pods wait for the data nodes to be up and running. This causes the ndbmysqld pods to restart with the Reason: Error and Exit Code: 1 error. If the data nodes take time to come up for any reason, such as slowness of the cluster, the ndbmysqld pods restart. The ndbmysqld pods stabilize when the data nodes come up.

4.3.2.13 Debugging Error Logging Related Issues

If there are issues related to Error Logging, then perform the following steps:
  • You must set the additionalErrorLogging parameter to ENABLED per microservice for the Error Logging feature to work. This feature is "DISABLED" by default and it can be ENABLED or DISABLED using REST APIs, CNC console, or by changing the values in custom values yaml file during installation.
  • For logging subscriber information in the logs, you must set the logSubscriberInfo parameter to "ENABLED" per microservice. The parameter can be ENABLED or DISABLED using REST APIs, CNC console, or by changing the values in custom values yaml file during installation.

4.3.2.14 Debugging Suppress Notification Related Issues

If there are issues related to Suppress Notification, then perform the following steps:
  • You must set the suppressNotificationEnabled parameter to true in the global section of the custom values yaml file for the Suppress Notification feature to work. This feature is enabled by default and it can be enabled or disabled using REST APIs, CNC console, or by changing the values in custom values yaml file during installation.
  • If you observe unexpected notification then check if the feature is enabled from the global configuration using configuration REST APIs.
  • If the feature is enabled and you observe unexpected notifications for update requests then compare the User-Agent received in the request header with the User-Agent received in the subscription request.
  • This feature is applicable only for signaling requests. For provisioning requests, the notification generation behavior remains the same as earlier.
  • This feature does not work with the subscription created in the previous release versions. You must create new subscriptions with the feature enabled for Suppress Notification feature to work.

4.3.2.15 Debugging Diameter S13 Interface Related Issues

If there are issues related to Diameter S13 Interface, then perform the following steps:
  • If global.s13InterfaceEnable flag is set to true and if the helm installation is throwing errors, you must enable the following parameters in ocudr-custom-values.yaml file:
    • global.diamGatewayEnable
    • nudr-diameterproxy.enabled
  • If the NRF client heartbeats do not consider the diameter services when diameter S13 interface is enabled, you must check the following configuration in the ocudr-custom-values.yaml file.

    Note:

    The NRF client does not register the EIR with NRF if the diameter services are down.
    #eir mode
    eir: &eir
    - '{{ .Release.Name }}-nudr-drservice'
    - '{{ .Release.Name }}-egressgateway'
    - '{{ .Release.Name }}-ingressgateway-sig'
    - '{{ .Release.Name }}-ingressgateway-prov'
    - '{{ .Release.Name }}-nudr-config' #uncomment only if config service enabled
    - '{{ .Release.Name }}-nudr-config-server' #uncomment only if config service enabled
    - '{{ .Release.Name }}-alternate-route' #uncomment if alternate route enabled
    - '{{ .Release.Name }}-nudr-dr-provservice' # uncomment only if drProvisioningEnabled is enabled
    - '{{ .Release.Name }}-nudr-diameterproxy' # uncomment only if s13InterfaceEnable is enabled
    - '{{ .Release.Name }}-nudr-diam-gateway' # uncomment only if s13InterfaceEnable is enabled
  • If the diameter gateway answers the CEA message with DIAMETER_UNKNOWN_PEER, then the client peer configuration is incorrect. You must perform the configuration in the allowedClientNodes section of the diameter gateway service configuration using the REST API for the client to connect to EIR and send an ECR request.
  • If the diameter gateway answers the CEA message with success and other diameter messages respond with DIAMETER_UNABLE_TO_COMPLY or DIAMETER_MISSING_AVP, then the issue could be in the diameter message request.
  • If there are error logs in the diameter gateway microservice stating that the connection is refused, along with IP and port numbers, then the configured peer node was not able to accept the CER request from the diameter gateway. The diameter gateway retries connecting to that peer multiple times.
  • If you are getting the DIAMETER_UNABLE_TO_DELIVER error message, then the diameterproxy microservice is down.
  • If the diam-gateway goes into the CrashLoopBackOff state, it could be due to an incorrect peer node configuration.
  • Active connections to the existing peer nodes can be verified using the ocudr_diam_conn_network metric, as shown in the sketch below.
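
The following is a minimal sketch for querying that metric from a Prometheus instance that scrapes UDR; the Prometheus host and port are placeholders.

# Query the current value of the diameter connection metric
curl -s "http://<prometheus-host>:<prometheus-port>/api/v1/query?query=ocudr_diam_conn_network"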

4.3.2.16 Debugging TLS Related Issues

This section describes the TLS related issues and their resolution steps. It is recommended to attempt the resolution steps provided in this guide before contacting Oracle Support.

Problem: Handshake is not established between UDRs.

Scenario: When the client version is TLSv1.2 and the server version is TLSv1.3.

Server Error Message

The client supported protocol versions[TLSv1.2] are not accepted by server preferences [TLSv1.3]

Client Error Message

Received fatal alert: protocol_version

Scenario: When the client version is TLSv1.3 and the server version is TLSv1.2.

Server Error Message

The client supported protocol versions[TLSv1.3]are not accepted by server preferences [TLSv1.2]

Client Error Message

Received fatal alert: protocol_version

Solution:

If the error logs have the SSL exception, do the following:

Check the TLS version of both UDRs. If each supports a different single TLS version (that is, UDR 1 supports only TLS 1.2 and UDR 2 supports only TLS 1.3, or vice versa), the handshake fails. Ensure that the TLS version is the same for both UDRs or revert to the default configuration for both UDRs. The supported TLS version combinations are:

Table 4-1 TLS Version Used

Client TLS Version | Server TLS Version | TLS Version Used
TLSv1.2, TLSv1.3   | TLSv1.2, TLSv1.3   | TLSv1.3
TLSv1.3            | TLSv1.3            | TLSv1.3
TLSv1.3            | TLSv1.2, TLSv1.3   | TLSv1.3
TLSv1.2, TLSv1.3   | TLSv1.3            | TLSv1.3
TLSv1.2            | TLSv1.2, TLSv1.3   | TLSv1.2
TLSv1.2, TLSv1.3   | TLSv1.2            | TLSv1.2

Check the cipher suites supported by both UDRs; they should either be the same or have common cipher suites present. If not, revert to the default configuration. A quick way to check the negotiated TLS version with openssl is sketched below.
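
The following is a minimal sketch for checking which TLS versions an endpoint accepts, assuming the peer UDR's TLS endpoint is directly reachable; the host and port are placeholders.

# Attempt a TLS 1.2-only handshake against the peer endpoint
openssl s_client -connect <peer-udr-host>:<tls-port> -tls1_2 </dev/null 2>/dev/null | grep -E "Protocol|Cipher"

# Attempt a TLS 1.3-only handshake against the peer endpoint
openssl s_client -connect <peer-udr-host>:<tls-port> -tls1_3 </dev/null 2>/dev/null | grep -E "Protocol|Cipher"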

Problem: Pods are not coming up after populating the clientDisabledExtension or serverDisabledExtension Helm parameter.

Solution:

  • Check the value of the clientDisabledExtension or serverDisabledExtension parameters. The following extensions should not be present for these parameters:
    • supported_versions
    • key_share
    • supported_groups
    • signature_algorithms
    • pre_shared_key

If any of the above values is present, remove them or revert to default configuration for the pod to come up.

Problem: Pods are not coming up after populating the clientSignatureSchemes Helm parameter.

Solution:

  • Check the value of the clientSignatureSchemes parameter.
  • The following values should be present for this parameter:
    • rsa_pkcs1_sha512
    • rsa_pkcs1_sha384
    • rsa_pkcs1_sha256
    If any of the above values is not present, add them or revert to default configuration for the pod to come up.

4.3.2.17 Debugging Dual Stack Related Issues

With this feature, cnUDR can be deployed on a dual stack Kubernetes infrastructure. Using the dual stack mechanism, cnUDR establishes and accepts connections with the pods and services in a Kubernetes cluster using IPv4 or IPv6. You can configure the feature by setting the global.deploymentMode parameter, which indicates the deployment mode of cnUDR, in the global section of the ocudr-custom values.yaml file. The default value is ClusterPreferred, and the value can be changed in the ocudr-custom values.yaml file during installation. Note the following points; the Helm configuration snippet is shown after this list:
  • To use this feature, cnUDR must be deployed on a dual stack Kubernetes infrastructure either in IPv4 preferred CNE or IPv6 preferred CNE.
  • If the global.deploymentMode parameter is set to 'IPv6_IPv4', then, when all the pods are running, services such as ingressgateway-prov and ingressgateway-sig must have both IPv6 and IPv4 addresses assigned, with IPv6 as the default address. The IP family policy must be set to RequireDualStack. The assigned load balancer must have both IPv4 and IPv6 addresses. You can verify this as shown in the sketch after the Helm snippet below.
  • All internal services must be single stack and must have only IPv6 as the IP family. All the pods must have both IPv4 and IPv6 addresses.
  • This feature does not work after upgrade since the upgrade path is not identified for this feature. The operators must perform a fresh installation of the NF to enable the Dual Stack functionality.
# Possible values: IPv4, IPv6, IPv4_IPv6, IPv6_IPv4, ClusterPreferred
global:
  deploymentMode: ClusterPreferred
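
The following is a minimal sketch to confirm the IP family configuration after deployment; the service, pod, and namespace names are placeholders.

# Check the IP family policy and address families assigned to an external service
kubectl get svc <release-name>-ingressgateway-sig -n <ocudr-namespace> \
  -o jsonpath='{.spec.ipFamilyPolicy} {.spec.ipFamilies}{"\n"}'

# Check the pod IPs (dual stack pods list both IPv4 and IPv6 entries)
kubectl get pod <pod-name> -n <ocudr-namespace> -o jsonpath='{.status.podIPs}{"\n"}'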

4.3.2.18 Debugging Lifecycle Management (LCM) Based Automation Issues

Perform the following steps if there are issues related to Lifecycle Management (LCM) Based Automation feature:
  • Make sure that the autoCreateResources.enabled and autoCreateResources.serviceaccounts.enabled flags are enabled.
  • During upgrade, if a new service account name is provided in the serviceAccountName parameter with autoCreateResources.enabled and autoCreateResources.serviceaccounts.enabled flags enabled, then a new service account name is created. If the service account name is not created, then you must check the configuration and the flags again.
  • During upgrade, you must use a different service account name when upgrading from manual to automation. If you use the same service account name when upgrading from manual to automation, Helm does not allow the upgrade due to ownership issues.
  • To use the OSO alerts automation feature, you must follow the installation steps of the oso-alr-config Helm chart by providing the alert file that needs to be applied to the namespace. For more information, see Oracle Communications Cloud Native Core, Operations Services Overlay Installation and Upgrade Guide. If the alert file is not provided during Helm installation, you can provide it during the upgrade procedure. The alert file can be applied to the namespace during the Helm installation or upgrade procedure.
  • If an incorrect data is added to the alert file, you can clear the entire data in the alert file by providing the empty alert file (ocslf_alertrules_empty_<version>.yaml). For more information, see "OSO Alerts Automation" section in Oracle Communications Cloud Native Core, Unified Data Repository User Guide.
  • If you provide the service account name during the upgrade but the feature is disabled, the nudr-pre-upgrade hook fails because it cannot find the service account. If the upgrade fails, the rollback to the previous version is unsuccessful due to the missing alternate route service account, resulting in an error message indicating that the service account for the alternate route service is not found. To address this issue, manually create the service account after the initial upgrade failure (as sketched below) and then continue with the upgrade; this also ensures a successful rollback.
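
The following is a minimal sketch for creating the missing service account manually before retrying the upgrade; the service account and namespace names are placeholders.

# Create the service account expected by the hooks, then retry the upgrade
kubectl create serviceaccount <alternate-route-serviceaccount-name> -n <ocudr-namespace>
kubectl get serviceaccount -n <ocudr-namespace>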

4.3.2.19 Troubleshooting TLS 1.3 Support for Kubernetes API Server

If you enable the TLS 1.3 Support for Kubernetes API Server feature by setting tlsEnableForKubeApiServer to true and if there is a configuration mismatch, the Helm installation and upgrade fails. For example, if you configure global.tlsVersionForKubeApiServer and global.cipherSuitesForKubeApiServer incorrectly, as shown in the following example:

tlsEnableForKubeApiServer: &tlsEnableForKubeApiServer true
tlsVersionForKubeApiServer: &tlsVersionForKubeApiServer TLSv1.1
cipherSuitesForKubeApiServer: &cipherSuitesForKubeApiServer
  - TLS_AES_256_GCM_SHA384
  - TLS_AES_128_GCM_SHA256
  - TLS_CHACHA20_POLY1305_SHA256
featureSecretsForKubeApiServer: &featureSecretsForKubeApiServer
  - ocudr-gateway-secret
The Helm installation and upgrade fail with the following error:
Error: INSTALLATION FAILED: execution error at (ocudr/charts/ingressgateway-sig/templates/gateway.yml:181:28): Invalid ciphers configured for the configured kube-api-server tls version

In this example, the configured cipher suites are TLS 1.3 cipher suites, so they are not valid for the configured TLSv1.1 version. To resolve the mismatch, set tlsVersionForKubeApiServer to a version that matches the configured cipher suites, or configure cipher suites that are valid for the selected TLS version.

4.3.3 Debugging Post Installation Related Issues

This section describes how to troubleshoot the post installation related issues.

4.3.3.1 Debugging Helm Test Issues

To debug the Helm Test issues:
  • Run the following command to get the Helm Test pod name.

    kubectl get pods -n <deployment-namespace>

  • Check for the Helm Test pod that is in error state.

    Figure 4-15 Helm Test Pod

  • Run the following command to check the Helm Test pod:

    kubectl logs <helm_test_pod_name> -n <deployment_namespace>

    In the logs, concentrate on ERROR and WARN level logs. There can be multiple reasons for failure. Some of them are shown below:
    • Figure 4-16 Helm Test in Pending State


      In this case, check for CPU and Memory availability in the Kubernetes cluster.

    • Figure 4-17 Pod Readiness Failed


      In this case, check for the correctness of the readiness probe URL in the particular microservice Helm charts under the charts folder. In the above case, check the charts of the notify service, or, if the readiness probe URL configured is correct, check whether the pod is crashing for some other reason.

    • There are a few other cases where the httpGet parameter is not configured for the readiness probe. In this case, Helm Test is considered a success for that pod. If the Pod or PVC list is fetched based on namespace and the labelSelector is empty, then the helm test is considered a success.

The Helm test logs generate the following error:

Figure 4-18 Helm Test Log Error



  • Check whether the required permission for the resource of the group is missing in the deployment-rbac.yaml file. The above sample shows that the permissions are missing for the persistent volume claims.
  • Give the appropriate permissions and redeploy.

    Check if the following error appears while running the helm test:

    14:35:57.732 [main] WARN  org.springframework.context.annotation.AnnotationConfigApplicationContext - Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'k8SFabricClient': Invocation of init method failed; nested exception is java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
    { {} }
  • Check the custom values file that is used to create the deployment. The resources should be mentioned in the form of an array under the resources section in the following format: <k8ResourceName>/<maxNFVersion>. A sketch of the basic Helm test workflow is provided below.
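
The following is a minimal sketch of the Helm test workflow described in this section; the release, pod, and namespace names are placeholders.

# Run the Helm tests for the release and list the resulting test pods
helm test <release-name> -n <deployment-namespace>
kubectl get pods -n <deployment-namespace>

# Inspect the test pod logs, focusing on ERROR and WARN entries
kubectl logs <helm_test_pod_name> -n <deployment-namespace> | grep -E "ERROR|WARN"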

4.3.3.2 Debugging Horizontal Pod Autoscaler Issues

There can be scenarios where Horizontal Pod Autoscaler (HPA) running on nudr-drservice deployment and nudr_notify_service might not get the CPU metrics successfully from the pods. Run the following command to view the HPA details:

kubectl get hpa

In this scenario, you need to check the following:
  • Check whether the metrics server is running on the Kubernetes cluster (a command sketch is provided at the end of this section). If the server is running and the pod CPU usage is still not accessible, check the metrics-server values.yaml file for the arguments passed, as shown below:

    Figure 4-19 metrics-server yaml file

  • If any changes are required, make them, restart the metrics server pod, and check for correctness. Wait a couple of minutes after the metrics server starts to see the CPU usage update when you run the kubectl get hpa command.

    Figure 4-20 Debugging HPA Issues

    Debugging HPA Issues
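For reference, the following is a hedged sketch of commonly used metrics-server container arguments that affect metric collection; the exact arguments, and the key under which they appear in the values.yaml file, depend on your cluster and metrics-server version.

    # Illustrative metrics-server arguments; verify against your
    # metrics-server chart version before applying.
    args:
      - --kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP
      - --kubelet-insecure-tls
      - --metric-resolution=15s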

4.3.3.3 Debugging HTTPS Support Related Issues

UDR supports HTTPS and its validations at the UDR Ingress Gateway. You may encounter issues related to HTTPS when:
  • HTTPS port is not exposed: Run the following command to verify if the HTTPS port is exposed:

    kubectl get svc -n <ocudr-namespace>

    Figure 4-21 HTTPS Port Exposed

    HTTPS Port Exposed

    Note:

    In the above figure, the secure port is 443.
    If the HTTPS port is not exposed, enable the configuration highlighted in the following figure under the ingressgateway section of the values.yaml file.

    Figure 4-22 Configuration Info under Ingressgateway

    Configuration Info under Ingressgateway
  • Ingress Gateway Container is stuck in Init State/Failed: The Ingress Gateway container may stop responding due to any one of the following reasons:
    • When the initssl configuration is enabled under the ingressgateway section of the values.yaml file.

      Figure 4-23 config initssl

      config initssl
    • If initssl is enabled, check whether the secrets are created with all the required certificates. The following figure shows the commands that you need to run to check whether the secrets are present and contain all the required data (see the sketch after this list).

      Figure 4-24 Commands to Check Secrets

      Commands to Check Secrets
  • Config-Server Container not responding in Hooks Init State: UDR does not respond in the Hooks Init state when there is a database connection failure.

    Figure 4-25 Config Server Container Status

    Config Server Container Status

    In this case, run the kubectl describe pod command on the above pod. In most cases, the failure is due to secrets not being found.

    Also, verify the following configuration to ensure that the config-server deployment refers to the correct secret values.
    global:
      dbCredSecretName: 'ocudr-secrets'
  • Config-Server post-upgrade hooks fail with the following error:

    When more than one UDR is installed with the same nfInstanceId and in the same udrDB, installation does not report any issue or error. However, for the second UDR, there are no config-server related tables in the udrConfigDB. Therefore, when an upgrade is performed on the second UDR setup, the following error occurs.

    Figure 4-26 Config-Server Post Upgrade Hooks Error

    Config-Server Post Upgrade Hooks error
  • OAuth2 Related Issues:
    If the OAuth secret name and namespace are not specified correctly, or if the public key certificate in the secret is not in the correct format, then the Ingress Gateway crashes.

    Figure 4-27 Ingress Gateway Crashed

    Ingress Gateway Crashed
    Other scenarios are:
    • The secret name in which the public key certificate is stored is incorrect: In this scenario, check the pod logs for the message "cannot retrieve secret from api server".
    • The public key certificate stored in the secret is not in the proper format: The public key file name format is {nrfInstanceId}_RS256.crt (for example, 6faf1bbc-6e4a-4454-a507-a14ef8e1bc5c_RS256.crt).

      If the public key is not stored in this format, check the pod logs for the message "Malformed entry in NRF PublicKey Secret with key ecdsa.crt". Here, ecdsa.crt is the public key certificate in oauthsecret.

    To resolve these issues, store the public key certificate in the required format and correct the fields with the proper secret name and namespace (see the sketch after this list).
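The following is a minimal sketch of commands for inspecting and recreating the OAuth secret; the secret name and namespace are placeholders, and the certificate file name only illustrates the naming convention described above.

    # Check that the secret exists and list the keys it contains.
    kubectl get secrets -n <ocudr-namespace>
    kubectl describe secret <oauth-secret-name> -n <ocudr-namespace>

    # Recreate the secret with the public key certificate named as
    # {nrfInstanceId}_RS256.crt (illustrative values only).
    kubectl delete secret <oauth-secret-name> -n <ocudr-namespace>
    kubectl create secret generic <oauth-secret-name> \
      --from-file=6faf1bbc-6e4a-4454-a507-a14ef8e1bc5c_RS256.crt \
      -n <ocudr-namespace>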

4.3.3.4 Debugging PodDisruptionBudget Related Issues

A pod can face voluntary or involuntary disruptions at any given time. Voluntary disruptions are initiated by the application owner or the cluster administrator, for example, deleting the deployment or the controller that manages the pod, updating a deployment pod template, or accidentally deleting a pod. Involuntary disruptions are unavoidable and can be caused by one or more of the following:
  • Disappearance of a node from the cluster due to cluster network partition
  • Accidentally deleting a virtual machine instance
  • Eviction of a pod when a node runs out of resources
To handle a voluntary disruption, you can set the PodDisruptionBudget value, which determines the number of replicas of the application that must be running at any given time. To configure the PodDisruptionBudget:
  1. Run the following command to check the pods running on different nodes:

    kubectl get pods -o wide -n ocudr

    Figure 4-28 Pods Running on Different Nodes

    Pods Running on Different Nodes
  2. Run the following set of commands to unschedule the nodes:
    kubectl cordon cne-180-dev2-k8s-node-4
    kubectl cordon cne-180-dev2-k8s-node-7
    kubectl cordon cne-180-dev2-k8s-node-5
    kubectl cordon cne-180-dev2-k8s-node-8
    kubectl cordon cne-180-dev2-k8s-node-1
    After unscheduling the nodes, the state of the nodes changes to 'Ready,SchedulingDisabled' as follows:

    Figure 4-29 After Nodes are Unscheduled

    After Nodes are Unscheduled
  3. Run the following set of commands to drain the nodes:
    kubectl drain cne-180-dev2-k8s-node-1 --ignore-daemonsets --delete-local-data
    kubectl drain cne-180-dev2-k8s-node-8 --ignore-daemonsets --delete-local-data
    kubectl drain cne-180-dev2-k8s-node-5 --ignore-daemonsets --delete-local-data
    kubectl drain cne-180-dev2-k8s-node-4 --ignore-daemonsets --delete-local-data
    kubectl drain cne-180-dev2-k8s-node-7 --ignore-daemonsets --delete-local-data
  4. If you are required to drain the nodes or evict the pods, ensure that the minimum number of pods are in the ready state to serve application requests. To configure this minimum, set the minAvailable parameter in the Helm charts of the individual microservices (see the sketch after this procedure). This ensures that a minimum number of pods remain available and are not evicted. You can check the logs while draining the nodes as follows:

    Figure 4-30 Logs When Trying to Evict Pod

    Logs When Trying to Evict Pod
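For illustration, the following is a hedged sketch of the PodDisruptionBudget object that results from setting minAvailable for a microservice; the names, labels, and values are placeholders, and the actual manifests are generated by the UDR Helm charts.

    # Illustrative PodDisruptionBudget; names and labels are placeholders.
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: nudr-drservice-pdb         # placeholder name
    spec:
      minAvailable: 1                  # minimum pods that must stay available
      selector:
        matchLabels:
          app: nudr-drservice          # placeholder label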

4.3.3.5 Debugging Pod Eviction Issues

During heavy traffic, any UDR or Provisioning Gateway pod can run into the Evicted state. To handle eviction issues, increase the ephemeral storage allocation of the pods under the global section. Update the containerLogStorage configuration under the global.ephemeralStorage.limits section to '5000'.

Figure 4-31 Configuring Container Log Storage

Configuring Container Log Storage
After making the above changes, perform a Helm upgrade. If the Ingress Gateway or Egress Gateway pods are running into the Evicted state, update the ephemeralStorageLimit configuration to '5120' and perform a Helm upgrade (see the sketch at the end of this section).

Figure 4-32 Ingress Gateway or Egress Gateway - Evicted State

Ingress Gateway or Egress Gateway - Evicted State
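The following is a hedged sketch of how these values might look in the custom values file; the exact keys and their placement can differ between releases, so verify them against your chart's values.yaml before applying.

    # Illustrative placement only; parameter names follow the section text.
    global:
      ephemeralStorage:
        limits:
          containerLogStorage: 5000    # value described above for UDR pods

    ingressgateway-sig:
      ephemeralStorageLimit: 5120      # value described above for gateway pods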

4.3.3.6 Debugging Taints or Tolerations Misconfigurations

The following points should be considered when the Node Selector and Taints or Toleration feature is used on UDR or Provisioning Gateway:

  • If any of the pods are going to the Pending state, ensure that the node selector points to the correct worker node name and that the node has enough resources to schedule the pod.

    Figure 4-33 Global Configuration


    Global Configuration

  • Use the following configuration to configure toleration settings for the tainted nodes. Update the global section configurations for the settings to apply to all the services (see the sketch after this list).

    Figure 4-34 Global Configuration


    Global Configuration

  • For Node Selector and Tolerations, configuration at the microservice level takes priority over configuration at the global level.

    Figure 4-35 Toleration and Node Selector


    Toleration and Node Selector
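As a reference, the following is a hedged sketch of global-level node selector and toleration settings; the key names shown here are illustrative, and the actual parameter names are defined in the UDR custom values file for your release.

    # Illustrative structure only; use the parameter names from your chart.
    global:
      nodeSelector:
        node-role.kubernetes.io/worker: 'udr'    # placeholder label
      tolerations:
        - key: 'dedicated'
          operator: 'Equal'
          value: 'udr'
          effect: 'NoSchedule'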

4.3.3.7 Debugging UDR Registration with NRF Failure

UDR registration with NRF may fail due to various reasons. Some of the possible scenarios are as follows:
  • Confirm whether registration was successful from the nrf-client-service pod.
  • Check the ocudr-nudr-nrf-client-nfmanagement logs (see the sketch after this list). If the logs contain "UDR is Deregistration", then:
    • Check whether all the services mentioned under allorudr/slf (depending on the UDR mode) in the values.yaml file are spelled the same as the service names and are enabled.
    • Once all the services are up, UDR registers with NRF.
  • If you see a log for SERVICE_UNAVAILABLE(503), check whether the primary and secondary NRF configurations (primaryNrfApiRoot/secondaryNrfApiRoot) are correct and whether the NRFs are up and running.
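A minimal sketch of checking the registration status from the logs; the pod name and namespace are placeholders, and the grep filter is only illustrative.

    kubectl get pods -n <ocudr-namespace> | grep nrf-client
    kubectl logs <ocudr-nudr-nrf-client-nfmanagement-pod> -n <ocudr-namespace> | grep -i -E 'regist|503'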

4.3.3.8 Debugging User Agent Header Related Issues

If there are issues related to the user agent header, perform the following steps:
  • Under the ingressgateway section, set the user-agent flag to true.
  • If there are issues in the consumer NF type validations for NFs, check the NF types in the configurations present under the ingressgateway section.

    Figure 4-36 Enabling User Agent Header

    Enabling User Agent Header
  • If userAgentHeaderValidationConfgMode is set to REST, the custom-values.yaml configurations are ignored; the configuration is loaded according to the mode that userAgentHeaderValidationConfgMode is set to (see the sketch below).
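The following is a hedged sketch of how these settings might look under the ingressgateway section; apart from userAgentHeaderValidationConfgMode, which is referenced above, the parameter names and values are illustrative placeholders, so verify them against your chart's values.yaml.

    ingressgateway-sig:
      # Illustrative structure only; actual parameter names are defined
      # in the UDR custom-values.yaml for your release.
      userAgentHeader:
        enabled: true                              # the user-agent flag described above
        consumerNfTypes: 'UDM,PCF'                 # placeholder NF type list
      userAgentHeaderValidationConfgMode: HELM     # with REST, the Helm values are ignored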

Note:

From the UDR 24.1.0 release onwards, this feature is supported using the REST API mode. If the feature does not work, make sure that it is enabled and configured after UDR is upgraded. For more information, see the Postupgrade Tasks section in Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.

4.3.3.9 Debugging LCI and OCI Header Related Issues

If there are issues related to the LCI and OCI Header feature, perform the following:

Under the ingressgateway-sig section, set the lciHeaderConfig.enabled and ociHeaderConfig.enabled parameters to true, as shown in the sketch below.
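A minimal sketch of these settings, assuming the parameter paths named above; all other LCI and OCI options are left at their defaults.

    ingressgateway-sig:
      lciHeaderConfig:
        enabled: true
      ociHeaderConfig:
        enabled: true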

Note:

  • Configure the header names to be the same as in the default configuration.
  • Wait up to the validity period for the LCI and OCI headers to be reported in the next response.
  • For OCI, check the overloadconfigrange and reduction metrics, based on which OCI is reported.

4.3.3.10 Debugging Conflict Resolution Feature

Perform the following steps if the conflict resolution feature does not work:
  • If the exception tables or UPDATE_TIME column are missing from the UDR subscriber database, perform the following steps:
    • Ensure that you run the SQL commands from the SQL files on the database ndbappsql node (see the sketch after this list).

      Note:

      The following SQL files are available in the Custom_Templates file:
      • ALL_mode_ndb_replication_insert.sql
      • SLF_mode_ndb_replication_insert.sql
      • EIR_mode_ndb_replication_insert.sql
      • ALL_mode_ndb_replication_insert_UPGRADE.sql
      • SLF_mode_ndb_replication_insert_UPGRADE.sql
      • EIR_mode_ndb_replication_insert_UPGRADE.sql
      For more information on how to download the package, see Downloading Unified Data Repository Package section in the Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.
    • After setting the global.dbConflictResolutionEnabled parameter to true in the ocudr_custom_values.yaml file, if the UPDATE_TIME column is still updated as 0, run the REST APIs to fetch the global configurations and check whether global.dbConflictResolutionEnabled is set to true. If the parameter is not set to true, perform a PUT operation on the global configuration to set it to true.
    • If the nudr_dbcr_auditor service is not enabled, make sure to enable the global.dbConflictResolutionEnabled parameter and perform a Helm upgrade.
    • If the nudr_dbcr_auditor service is not clearing the exception tables or fixing data conflicts, make sure that the database replication is running.
    • If the nudr_dbcr_auditor service is not clearing exceptions on the IDXTODATA$EX exception tables, check whether the error log "Communication failed to Mate site during audit" is present. If you get this error, make sure that the following configuration in the custom values file is configured correctly.
      # Provide MateSite IGW IP List, Comma separated values. Provide fqdn or IP with port
      mateSitesIgwIPList: 'http://ocudr-ingressgateway-prov.myudr2:80,http://ocudr-ingressgateway-prov.myudr3:80'
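A minimal sketch of running one of the listed SQL files on the ndbappsql node; the host, credentials, and file path are placeholders.

    # Run the SQL file appropriate for your UDR mode against the ndbappsql node.
    mysql -h <ndbappsql-host> -u <db-user> -p < ALL_mode_ndb_replication_insert.sql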

4.3.3.11 Debugging UDR Error Responses Using App Error Code

From the UDR release 24.1.0 onwards, additional information is provided in the ProblemDetails.detail parameter as part of the error responses feature. This is applicable for all UDR mode deployments, which includes signaling and provisioning error responses from UDR.

Note:

The responses from the nudr-config service for the REST API configurations remain the same.

A sample ProblemDetails.detail is as follows:

<nfFqdn>: <service-name>: <Readable Error details>: <App-Error-Code>, for example:

slf01.abc.com: Nudr_GroupIDmap: Request header Issue, Unsupported Media Type: OSLF-DRS-HDRVLD-E001

Table 4-2 Parameters of the Details Field of the Payload

nfFqdn
  Indicates the NF FQDN. It is obtained from the nfFqdn Helm chart parameter.
  Sample Value: slf01.abc.com

service-name
  Indicates the microservice name. It is the originator of the error response. This value is static and cannot be configured.
  Sample Value: Nudr_GroupIDmap

Readable Error details
  Provides a short description of the error.
  Sample Value: Group Not Found

App-Error-Code
  Indicates the application error code in the format <nftype>-<serviceId>-<category>-E<XXX>.
  Sample Value: OSLF-DRS-SIG-E302, where:
  • OSLF is the vendor NF type
  • DRS is the microservice ID
  • SIG is the category
  • E302 is the app error code

nftype
  Indicates the vendor or NF type. This parameter is prefixed with “O”, which indicates Oracle. For example, if the NF type is SLF, the vendor name becomes OSLF. It is obtained from the nfType Helm chart parameter.
  Sample Value: OSLF

serviceId
  It is either DRS (nudr-drservice) or DRP (nudr-dr-provservice). This value is set based on the container name.

Category
  The category is fetched from the error catalog. Errors are classified into categories based on the serviceId. The categories are:
  • SIG
  • PROV
  • URIVLD
  • HDRVLD
  • REQVLD
  • DB
  • INTRNL

4.3.3.12 Debugging Provisioning Logs Related Issues

If there are issues related to provisioning logs, perform the following steps:
  • You can enable or disable the provisioning log feature by setting the provLogsEnabled parameter to true using the REST APIs, CNC Console, or by changing the values in the custom.yaml file. By default, the provisioning logging feature is disabled.
  • You can set the provisioning API names that are supported for provision logging by changing the provLogsApiNames configuration field to the required value. The default value is nudr-dr-prov. The accepted values are as follows:
    • nudr-dr-prov
    • nudr-group-id-map-prov
    • slf-group-prov
    • n5g-eir-prov
  • If the provLogsEnabled flag is set to true, it is recommended to change the value of logStorage to 4000 MB (approximately 4 GB) for the nudr-dr-prov pods to store the provisioning log files. If the values are not updated, the nudr-dr-prov pods crash when the ephemeral storage is full. A hedged configuration sketch follows this list.
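The following is a hedged sketch of the related settings, assuming the parameter names referenced above; their exact placement in the custom values file may differ per release.

    # Illustrative placement only; parameter names follow the section text.
    provLogsEnabled: true
    provLogsApiNames: 'nudr-dr-prov,slf-group-prov'    # accepted values listed above
    logStorage: 4000                                   # MB, for nudr-dr-prov pods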

4.3.4 Debugging Upgrade or Rollback Failure

When Unified Data Repository (UDR) upgrade or rollback fails, perform the following steps:

  1. Run the following command to check the pre or post upgrade or rollback hook logs:
    kubectl logs <pod name> -n <namespace>
  2. After detecting the cause of failure, do the following:
    • For upgrade failure:
      • If the cause of the upgrade failure is a database or network connectivity issue, resolve the issue and rerun the upgrade command.
      • If the cause of failure is not related to a database or network connectivity issue and is observed during the preupgrade phase, do not perform a rollback because the UDR deployment remains in the source (previous) release.
      • If the upgrade failure occurs during the postupgrade phase, for example, a postupgrade hook failure because the target release pod does not move to the ready state, then perform a rollback (see the sketch after this procedure).
    • For rollback failure: If the cause of the rollback failure is a database or network connectivity issue, resolve the issue and rerun the rollback command.
  3. If the issue persists, contact My Oracle Support.
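
For reference, a minimal sketch of the rollback step mentioned above; the release name, revision, and namespace are placeholders.

    # List revisions, then roll back to the last known good revision.
    helm history <release-name> -n <namespace>
    helm rollback <release-name> <revision> -n <namespace>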