6 cnDBTier Alerts

cnDBTier generates alerts when a specified condition is met. You can view these alerts on the Prometheus dashboard and take the necessary actions. Prometheus is installed as part of the common services during the vCNE installation. This section provides details about the available cnDBTier alerts.
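
For example, you can query the Prometheus HTTP API to check whether a specific cnDBTier alert is currently firing. This is a minimal sketch; the Prometheus host, port, and access method are placeholders that depend on your CNE deployment, and it assumes the alert rule name matches the table name used in this section.

    # List firing instances of a given cnDBTier alert using the built-in ALERTS metric.
    curl -sG 'http://<prometheus-host>:<prometheus-port>/api/v1/query' \
      --data-urlencode 'query=ALERTS{alertname="REMOTE_SERVER_BACKUP_TRANSFER_FAILED",alertstate="firing"}'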

6.1 cnDBTier Remote Server Backup Transfer Status Alerts

This section provides details about the cnDBTier remote server backup transfer status alerts.

Table 6-1 REMOTE_SERVER_BACKUP_TRANSFER_FAILED

Field Details
Description This alert is triggered with major severity when the transfer of backup to a remote server fails.
Summary Secure transfer of backup to remote server failed on cnDBTier site {{ $labels.site_name }}
Severity major
Condition db_tier_remote_server_backup_transfer_status == 1
Expression Validity NA
SNMP Trap ID 2031
Affects Service (Y/N) N
Recommended Action

Cause: The transfer of backup to remote server failed.

Diagnostic Information: Check the status of the db_tier_remote_server_backup_transfer_status metric (Table 5-4).

Recommended Actions:

This alert is cleared automatically when the backup is transferred to the remote server successfully and the backup transfer status is updated to success.
  1. Check if the data nodes are able to copy the backups to the db replication service pod.

    Log in to one of the data nodes:

    kubectl -n <namespace> exec -it ndbmtd-0 -c db-backup-executor-svc -- bash
    ssh -i /home/mysql/.ssh/id_rsa mysql@<db-replication-service-svc> -p 2022

    sftp -i /home/mysql/.ssh/id_rsa -P 2022 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null mysql@<leader_db_Replication_service_svc>:/var/occnedb/ <<< $'put -r <file/dir to be copied>'
  2. Check whether the remote server IP address is reachable from the Kubernetes cluster worker nodes.

    Log in to the worker node:

    ping <remoteserveripaddress>
  3. Verify whether the DB replication service pod can successfully connect to the remote server.

    Log in to the db replication service pod where PVC is attached.

    kubectl -n <namespace> exec -it <leader_db_replication_service_pod> -- bash
    ssh -i /home/mysql/.ssh/remoteserver_id_rsa remoteuser@remoteServerIp
  4. Verify that there is adequate disk space on the remote server for storing backups.

    Log in to the db replication service pod where PVC is attached.

    kubectl -n <namespace> exec -it <leader_db_replication_service_pod> -- bash
    ssh -i /home/mysql/.ssh/remoteserver_id_rsa remoteuser@remoteServerIp
    df -kh
  5. Manually copy large backup files to the remote server using SFTP commands from the db_replication_service pod to ensure backups can be transferred successfully.

    Log in to the db replication service pod where PVC is attached.

    kubectl -n <namespace> exec -it <leader_db_replication_service_pod> -- bash
    sftp -i /home/mysql/.ssh/remoteserver_id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null <remoteserveruser>@<remote_server_ip_address>:<remote_server_path> <<< $'put -r <file/dir to be copied>'
  6. In case the issue persists, contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.2 cnDBTier Backup Transfer Status Alerts

This section provides details about the cnDBTier backup transfer status alerts.

Table 6-2 BACKUP_TRANSFER_LOCAL_FAILED

Field Details
Description This alert is triggered with major severity when the system fails to transfer the backup from the data node to the replication service pod on the cnDBTier site (db_tier_backup_transfer_status metric value is 2).
Summary Failed to transfer backup from data node to replication service pod on cnDBTier site {{ $labels.site_name }}
Severity major
Condition db_tier_backup_transfer_status == 2
Expression Validity NA
SNMP Trap ID 2026
Affects Service (Y/N) Y
Recommended Action

Cause: The system failed to transfer a backup from the data node to the replication service pod on a cnDBTier site.

Diagnostic Information: The db_tier_backup_transfer_status metric (Table 5-5) provides information about the backup transfer status.

Recommended Actions:
  1. Ensure all data pods and db-replication-svc pods are up.
    kubectl get pods -n <namespace> -l dbtierapp=ndbmtd --no-headers
    kubectl get pods -n <namespace> -l dbtierapp=dbreplicationsvc --no-headers
  2. Check whether the local db-replication-svc pod can be reached from the data nodes.
    kubectl -n <namespace> exec -it ndbmtd-0 -c db-backup-executor-svc -- bash
    ssh -i /home/mysql/.ssh/id_rsa mysql@<db-replication-service-svc> -p 2023
  3. Verify the number of data nodes and the DataMemory value configured in the ndbmtd configuration and ensure that the PVC configured in the data nodes and db-replication-svc aligns with the cnDBTier dimensions.
    1. Verify the data memory configured in the custom_values.yaml file.
      ndb:
        ndbdisksize: 60Gi
        ndbbackupdisksize: 100Gi
        datamemory: 12G

      db-replication-svc:
        dbreplsvcdeployments:
            pvc:
              name: pvc-cluster1-cluster2-replication-svc
              disksize: 100Gi
    2. Log in to one of the Bastion hosts and get the PVCs for the data nodes and db replication service.
      kubectl get pvc -n <namespace> | grep ndbmtd
      kubectl get pvc -n <namespace> | grep repl
  4. If the db replication service PVC is not configured as per the cnDBTier dimensions, then increase the PVC by following the scaling procedure provided in Scaling ndbmtd Pods (see the sketch after this list for an illustrative PVC resize), and get the PVC configurations reviewed by the NF team before performing the scaling.
  5. In case the issue persists, contact My Oracle Support.
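
The following is a minimal, illustrative sketch of how a PVC size can be inspected and expanded with kubectl, assuming the underlying StorageClass supports volume expansion. The PVC name and new size are placeholders; the authoritative steps are in the Scaling ndbmtd Pods procedure and must be reviewed by the NF team first.

    # Inspect the current size of the replication service PVC (name is a placeholder).
    kubectl get pvc <replication-svc-pvc-name> -n <namespace> -o jsonpath='{.spec.resources.requests.storage}'
    # Request a larger size; this only works if the StorageClass allows expansion.
    kubectl patch pvc <replication-svc-pvc-name> -n <namespace> \
      -p '{"spec":{"resources":{"requests":{"storage":"<new_disk_size>"}}}}'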

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-3 BACKUP_TRANSFER_FAILED

Field Details
Description This alert is triggered with major severity when the system fails to transfer the backup from the cnDBTier site to the remote site (db_tier_backup_transfer_status metric value is 3).
Summary Failed to transfer backup to remote site from cnDBTier site {{ $labels.site_name }}
Severity major
Condition db_tier_backup_transfer_status == 3
Expression Validity NA
SNMP Trap ID 2027
Affects Service (Y/N) Y
Recommended Action

Cause: The system failed to transfer a backup from the cnDBTier site to a remote site.

Diagnostic Information: The db_tier_backup_transfer_status metric (Table 5-5) provides information about the backup transfer status.

Recommended Actions:
  1. Check whether all db-replication-svc pods of the current site and the mate site are up.
    kubectl get pods -n <namespace> -l dbtierapp=dbreplicationsvc --no-headers
  2. Check whether the current site leader db-replication-svc can connect to the mate site leader db-replication-svc.

    Log in to the db replication service pod where PVC is attached.

    kubectl -n <namespace> exec -it <leader_db_replication_service_pod> -- bash
    ssh -i /home/mysql/.ssh/id_rsa mysql@<leader_db_replication_service_site2_pod> -p 2022
  3. Verify the number of data nodes and the DataMemory value configured in the ndbmtd configuration and ensure that the PVC configured in the data nodes and db-replication-svc aligns with the cnDBTier dimensions.
    1. Verify the data memory configured in the custom_values.yaml file.
      ndb:
        ndbdisksize: 60Gi
        ndbbackupdisksize: 100Gi
        datamemory: 12G

      db-replication-svc:
        dbreplsvcdeployments:
            pvc:
              name: pvc-cluster1-cluster2-replication-svc
              disksize: 100Gi
    2. Log in to one of the Bastion hosts and get the PVC for data nodes and db replication service.
      kubectl get pvc -n <namespace> | grep ndbmtd
      kubectl get pvc -n <namespace> | grep repl
  4. If the db replication service PVC is not configured as per the cnDBTier dimensions, then increase the PVC by following the scaling procedure provided in Scaling ndbmtd Pods, and get the PVC configurations reviewed by the NF team before performing the scaling.
  5. In case the issue persists, contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-4 BACKUP_TRANSFER_IN_PROGRESS

Field Details
Description This alert is triggered with info severity when the backup transfer is in progress on the cnDBTier site (db_tier_backup_transfer_status metric value is 1).
Summary Backup Transfer is In Progress on cnDBTier site {{ $labels.site_name }}
Severity info
Condition db_tier_backup_transfer_status == 1
Expression Validity NA
SNMP Trap ID 2028
Affects Service (Y/N) N
Recommended Action

Cause: Backup transfer is in progress on the cnDBTier site.

Diagnostic Information: The db_tier_backup_transfer_status metric (Table 5-5) provides information about the backup transfer status.

Recommended Actions:

  1. Wait for the backup transfer to complete. The alert should automatically clear once the transfer is finished.
  2. If the backup transfer takes longer than expected, allow it to continue for an additional 15 minutes, as temporary delays may occur.
  3. If the transfer still does not complete within this time frame:
    • Check the CPU and memory resources allocated to the db-backup-executor-svc container within the ndbmtd pods (see the sketch after this list).
    • Verify the configuration of the leader db-replication-service in both current and remote sites.
    • Ensure that all these components are configured in accordance with the cnDBTier dimension sheet.
  4. In case the issue persists, contact My Oracle Support.
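
The container resource checks mentioned above can be performed with standard kubectl commands. This is a minimal sketch; the pod and container names follow the defaults used elsewhere in this document and may differ in your deployment.

    # Per-container CPU and memory usage of the data pods (requires metrics-server).
    kubectl top pod -n <namespace> -l dbtierapp=ndbmtd --containers
    # Configured requests and limits of the db-backup-executor-svc container.
    kubectl get pod ndbmtd-0 -n <namespace> \
      -o jsonpath='{.spec.containers[?(@.name=="db-backup-executor-svc")].resources}'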

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.3 cnDBTier Heartbeat Alerts

This section provides details about cnDBTier heartbeat alerts.

Table 6-5 HEARTBEAT_FAILED

Field Details
Description This alert is triggered with critical severity when HeartBeat fails on a remote site.
Summary HeartBeat failed on cnDBTier site {{ $labels.site_name }} connected to mate site {{ $labels.mate_site_name }} on replication channel group id {{ $labels.replchannel_group_id }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_heartbeat_failure == 1
Expression Validity NA
SNMP Trap ID 2025
Affects Service (Y/N) Y
Recommended Action

Cause: The system is unable to connect to the remote site and the heartbeat failed.

Diagnostic Information: The db_tier_heartbeat_failure metric (Table 5-8) provides information about the heartbeat status and indicates whether the remote site is reachable or not.

Recommended Actions:

  1. Check the logs of the db-replication-svc service.
    $ kubectl logs -n <cnDBTier Namespace> <replication service pod name> -f
  2. Check if Geo Replication Recovery (GRR) is in progress. During GRR, a HEARTBEAT_FAILED alert may be triggered.
    1. Get the replication service LoadBalancer IP for the site on which restore is being done.
      $ export IP=$(kubectl get svc -n <namespace of failed site> -l dbtierapp=dbreplicationsvc,servicetype=external --no-headers | awk '{print $4}' | head -n 1 )
    2. Get the replication service LoadBalancer Port for the site on which restore is being done.
      $ export PORT=$(kubectl get svc -n <namespace of failed site> -l dbtierapp=dbreplicationsvc,servicetype=external --no-headers | awk '{print $5}' | cut -d '/' -f 1 | cut -d ':' -f 1 | head -n 1)
    3. Run the following command to get the geo replication restore status.
      $ curl -X GET http://$IP:$PORT/db-tier/gr-recovery/site/{sitename}/status
    4. If HTTPS is enabled, run the following commands:
      $ kubectl exec -it -n <cnDBTier namespace> <current_site_db_replication_svc_pod> -- bash
      $ curl --cert client-cert.pem --cert-type PEM --key client-key.pem --key-type PEM --cacert combine-ca.pem -X GET https://$IP:$PORT/db-tier/gr-recovery/site/{sitename}/status
  3. Check if the mate site IP and port are configured correctly:
    replication:
        # Local site replication service LoadBalancer IP can be configured.
        localsiteip: ""
        localsiteport: "80"
        channelgroupid: "1"
        matesitename: "<${OCCNE_MATE_SITE_NAME}>"
        preferredIpFamily: "IPv4"
        remotesiteip: "<${OCCNE_MATE_REPLICATION_SVC}>"
        remotesiteport: "80"

    service:
        type: LoadBalancer
        loadBalancerIP: ""
        httpport: 80
        httpsport: 443
  4. If HTTPS is enabled, then check whether the certificates are configured correctly. Check the db-replication-svc pod logs for SSL handshake related errors.
    $ kubectl logs -n <cnDBTier Namespace> <replication service pod name> -f
  5. Check if the connectivity is established between the current and remote db-replication-svc service in both directions.
    1. To check the curl connection, get the replication service LoadBalancer IP for cluster1 using the following command:

      $ IP=$(kubectl get svc -n <cnDBTier namespace> -l dbtierapp=dbreplicationsvc,servicetype=external --no-headers | awk '{print $4}' | head -n 1 )
    2. Get the replication service LoadBalancer Port for cluster1 and check the connection.
      $ PORT=$(kubectl get svc -n <cnDBTier namespace> -l dbtierapp=dbreplicationsvc,servicetype=external --no-headers | awk '{print $5}' | cut -d '/' -f 1 | cut -d ':' -f 1 | head -n 1)
      $ kubectl exec -it -n <cnDBTier namespace> <current_site_db_replication_svc_pod> -- bash
      $ curl -X GET http://$IP:$PORT/db-tier/health/db-replication-svc/status
    3. If HTTPS is enabled:
      $ kubectl exec -it -n <cnDBTier namespace> <current_site_db_replication_svc_pod> -- bash
      $ curl --cert client-cert.pem --cert-type PEM --key client-key.pem --key-type PEM --cacert combine-ca.pem -X GET https://$IP:$PORT/db-tier/health/db-replication-svc/status
  6. In case the issue persists, contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.4 cnDBTier BinLog Injector Thread Alerts

This section provides details about cnDBTier BinLog injector alerts.

Table 6-6 BINLOG_INJECTOR_STOPPED

Field Details
Description This alert is triggered with critical severity when Bin Log Injector stops working.
The value of db_tier_binlog_injector_thread or db_tier_binlog_injector_thread_latest_epoch indicates the status of Bin Log Injector:
  • 0: indicates that the Bin Log Injector thread is not stopped for the specified node ID
  • 1: indicates that the Bin Log Injector thread is stopped for the specified node ID
Summary BinLog Injector Thread is stopped for MySQL node having node id {{ $labels.node_id }} on cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition

db_tier_binlog_injector_thread_latest_epoch == 1

or

db_tier_binlog_injector_thread == 1

Expression Validity NA
SNMP Trap ID 2024
Affects Service (Y/N) Y
Recommended Action

Cause: Bin Log Injector thread stalled for the replication SQL node.

Diagnostic Information: The db_tier_binlog_injector_thread_latest_epoch or db_tier_binlog_injector_thread metrics (Table 5-84 or Table 5-83) provide information whether the Bin Log Injector thread is stalled or not.

Recommended Actions:

  1. Check the engine NDB status by connecting to the MySQL server running inside the ndbmysqld pod and running SHOW ENGINE NDB STATUS.

    Check if the latest_epoch in ndbappmysqld and ndbmysqld pod is changing. Check the value of the latest_applied_binlog_epoch in ndbmysqld pod using the following command.

    $ kubectl -n <namespace> exec -it ndbmysqld-0 -- mysql -h127.0.0.1 -uroot -pNextGenCne
    mysql> SHOW ENGINE NDB STATUS;
    Run this command a few times and monitor the values of latest_epoch and latest_applied_binlog_epoch in the ndbmysqld/ndbappmysqld pod (see the sketch after this list for a simple polling loop). If these values do not change, the binlog injector thread is stalled.
  2. If the alert is not getting cleared, then restart the ndbmysqld/ndbappmysqld pod where the binlog injector thread is stalled.
    $ kubectl -n <namespace> delete pod <ndbmysqld/ndbappmysqld POD NAME>
  3. In case the issue persists, contact My Oracle Support.
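
A simple shell loop can be used to poll the epoch counters from outside the pod. This is a minimal sketch; the MySQL password placeholder and the 10-second interval are assumptions and should be adapted to your environment.

    # Print the NDB status lines (which contain latest_epoch and latest_applied_binlog_epoch) three times, 10 seconds apart.
    for i in 1 2 3; do
      kubectl -n <namespace> exec ndbmysqld-0 -- \
        mysql -h127.0.0.1 -uroot -p<PASSWORD> -e 'SHOW ENGINE NDB STATUS\G' 2>/dev/null | \
        grep -E 'latest_epoch|latest_applied_binlog_epoch'
      sleep 10
    done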

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.5 cnDBTier Replication Error Skip Alerts

This section provides details about the cnDBTier replication error skip alerts.

Table 6-7 REPLICATION_SWITCHOVER_DUE_CLUSTERDISCONNECT

Field Details
Description This alert is triggered when a switchover occurs on an API node due to a configured cluster disconnect error, if skip replication error is enabled.
Summary Replication channel on SQL node with node ID {{ $labels.node_id }} had switchover due to cluster disconnect error number {{ $labels.error_number }}
Severity info
Condition db_tier_replication_switchover_due_to_clusterdisconnect == 1
Expression Validity NA
SNMP Trap ID 2019
Affects Service (Y/N) N
Recommended Action

Cause: Skip replication error is enabled on an API node and a switchover occurred on the node due to a configured cluster disconnect error.

Diagnostic Information: The db_tier_replication_switchover_due_to_clusterdisconnect metric (Table 5-82) provides information whether a switchover occurred on an API node.

Recommended Actions:

  1. Check cluster status and pod status in the remote site.

    If multiple pods restarted in the remote site, then capture all the outputs of the following checks and contact My Oracle Support.

    1. Check the cluster status by running the following command:
      kubectl -n <cnDBTier remote site Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    2. Check the pod status by running the following command:
      kubectl get pod -n <cnDBTier remote site Namespace>
  2. Verify if the cnDBTier pods (ndbmtd, ndbmysqld) are configured with resources (CPUs and memory) as per cnDBTier dimensions in the remote site.
  3. If CPU and memory for the pods are not configured as per the cnDBTier dimensions, then increase the CPU and memory by following the Scaling ndbmtd Pods procedure and ensure that CPU and memory configurations are reviewed by the NF team before upgrading.
  4. Check for resource pressure on the pods of the remote site for which the replication is switched over.
    kubectl top pod -n <cnDBTier remote site Namespace>
  5. Check the db_tier_replication_switchover_due_to_clusterdisconnect metric to get the error number that resulted in the switchover (see the sketch after this list for an example Prometheus query).

    If the replication switchover occurs because of replication error 13119 or 1296, then the cluster nodes may have been disconnected and restarted, or both replicas in the replication group may have been disconnected from the cluster. In this case, collect the logs and contact My Oracle Support.

  6. Check the ndbmysqld and ndbmgmd logs of the remote site and try to find the cause of the error number obtained above (for example, network issues, heartbeat failures, cluster disconnection, or infrastructure events such as network delays).
    1. Check the current and previous logs of ndbmtd pods:
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> -f
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> --previous
    2. Check the logs of the ndbmysql pods:
      1. Log in to ndbmysqld pod.
        kubectl -n  <cnDBTier remote site Namespace> exec -it <ndbmysqld pod> -- bash
      2. Change to /var/occnedb/mysql/ directory.
        cd /var/occnedb/mysql/
      3. Open the mysqld.log file to check the logs.
        vim mysqld.log
  7. If the geo replication with the remote site is down, the issue persists, and the alert does not clear automatically within the configured threshold time (1 hour), capture all the outputs of the above steps and contact My Oracle Support.
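
If direct access to the Prometheus dashboard is inconvenient, the metric can also be read through the Prometheus HTTP API. This is a minimal sketch; the Prometheus host and port are placeholders for your environment.

    # Read the switchover metric; the error_number label carries the error that caused the switchover.
    curl -sG 'http://<prometheus-host>:<prometheus-port>/api/v1/query' \
      --data-urlencode 'query=db_tier_replication_switchover_due_to_clusterdisconnect'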

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-8 REPLICATION_TOO_MANY_EPOCHS_LOST

Field Details
Description This alert is triggered when the number of epochs lost due to skip errors is greater than 10000 and less than or equal to 80000.

This alert is cleared one hour after the event.

Summary Too many epochs are lost for skipping replication errors
Severity major
Condition (db_tier_epochs_lost_due_to_skiperror > 10000) and (db_tier_epochs_lost_due_to_skiperror <= 80000)
Expression Validity NA
SNMP Trap ID 2020
Affects Service (Y/N) N
Recommended Action

Cause: Between 10000 and 80000 epochs are lost due to skip errors.

Diagnostic Information: The db_tier_epochs_lost_due_to_skiperror metric (Table 5-81) provides information about the number of epochs lost due to skip errors.

Recommended Actions:

  1. Check cluster status and pod status in the remote site.

    If multiple pods restarted in the remote site, then capture all the outputs of the following checks and contact My Oracle Support.

    1. Check the cluster status by running the following command:
      kubectl -n <cnDBTier remote site Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    2. Check the pod status by running the following command:
      kubectl get pod -n <cnDBTier remote site Namespace>
  2. Verify if the cnDBTier pods (ndbmtd, ndbmysqld) are configured with resources (CPUs and memory) as per cnDBTier dimensions in the remote site.
  3. If CPU and memory for the pods are not configured as per the cnDBTier dimensions, then increase the CPU and memory by following the Scaling ndbmtd Pods procedure and ensure that CPU and memory configurations are reviewed by the NF team before upgrading.
  4. Check for resource pressure on the pods of the remote site for which the replication is switched over.
    kubectl top pod -n <cnDBTier remote site Namespace>
  5. Check the db_tier_replication_switchover_due_to_clusterdisconnect metric to get the error number that resulted in the switchover.

    If the replication switchover occurs because of replication error 13119 or 1296, then the cluster nodes may have been disconnected and restarted, or both replicas in the replication group may have been disconnected from the cluster. In this case, collect the logs and contact My Oracle Support.

  6. Check the ndbmysqld and ndbmgmd logs of the remote site and try to find the cause of the error number obtained above (for example, network issues, heartbeat failures, cluster disconnection, or infrastructure events such as network delays).
    1. Check the current and previous logs of ndbmtd pods:
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> -f
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> --previous
    2. Check the logs of the ndbmysql pods:
      1. Log in to ndbmysqld pod.
        kubectl -n  <cnDBTier remote site Namespace> exec -it <ndbmysqld pod> -- bash
      2. Change to /var/occnedb/mysql/ directory.
        cd /var/occnedb/mysql/
      3. Open the mysqld.log file to check the logs.
        vim mysqld.log
  7. If the geo replication with the remote site is down, the issue persists, and the alert does not clear automatically within the configured threshold time (1 hour), capture all the outputs of the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-9 REPLICATION_SKIP_ERRORS_LOW

Field Details
Description This alert is triggered when replication is halted due to a skip error count greater than 0 and less than or equal to 5.

This alert is cleared one hour after the event.

Summary Cross-site replication errors are skipped
Severity minor
Condition (db_tier_replication_halted_due_to_skiperror > 0) and (db_tier_replication_halted_due_to_skiperror <= 5)
Expression Validity NA
SNMP Trap ID 2021
Affects Service (Y/N) N
Recommended Action

Cause: Replication halted due to five or fewer skip errors.

Diagnostic Information: The db_tier_replication_halted_due_to_skiperror metric (Table 5-80) provides information about the number of skip errors due to which the replication halted.

Recommended Actions:

  1. Monitor the alert. If new skip errors accumulate or replication halts again, it may escalate to REPLICATION_SKIP_ERRORS_HIGH.
  2. Check cluster status and pod status in the remote site.

    If multiple pods restarted in the remote site, then capture all the outputs of the following checks and contact My Oracle Support.

    1. Check the cluster status by running the following command:
      kubectl -n <cnDBTier remote site Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    2. Check the pod status by running the following command:
      kubectl get pod -n <cnDBTier remote site Namespace>
  3. Check for resource pressure on the pods on the remote site.
    kubectl top pod -n <cnDBTier remote site Namespace>
  4. Check the db_tier_replication_switchover_due_to_skiperror metric to get the error number that is raising the alert.

    If the replication switchover occurs because of replication error 13119 or 1296, then the cluster nodes may have been disconnected and restarted, or both replicas in the replication group may have been disconnected from the cluster. In this case, collect the logs and contact My Oracle Support.

  5. Check the ndbmysqld and ndbmgmd logs of the remote site and try to find the cause of the error number obtained above (for example, network issues, heartbeat failures, cluster disconnection, or infrastructure events such as network delays).
    1. Check the current and previous logs of ndbmtd pods:
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> -f
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> --previous
    2. Check the logs of the ndbmysql pods:
      1. Log in to ndbmysqld pod.
        kubectl -n  <cnDBTier remote site Namespace> exec -it <ndbmysqld pod> -- bash
      2. Change to /var/occnedb/mysql/ directory.
        cd /var/occnedb/mysql/
      3. Open the mysqld.log file to check the logs.
        vim mysqld.log
  6. If the geo replication with the remote site is down, the issue persists, and the alert does not clear automatically within the configured threshold time (1 hour), capture all the outputs of the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-10 REPLICATION_SKIP_ERRORS_HIGH

Field Details
Description This alert is triggered when replication is halted due to a skip error count greater than 5.

This alert is cleared one hour after the event.

Summary Cross-site replication errors skipped are high
Severity major
Condition db_tier_replication_halted_due_to_skiperror > 5
Expression Validity NA
SNMP Trap ID 2022
Affects Service (Y/N) N
Recommended Action

Cause: Replication halted due to more than five skip errors.

Diagnostic Information: The db_tier_replication_halted_due_to_skiperror metric (Table 5-80) provides information about the number of skip errors due to which the replication halted.

Recommended Actions:

  1. Check cluster status and pod status in the remote site.

    If multiple pods restarted in the remote site, then capture all the outputs of the following checks and contact My Oracle Support.

    1. Check the cluster status by running the following command:
      kubectl -n <cnDBTier remote site Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    2. Check the pod status by running the following command:
      kubectl get pod -n <cnDBTier remote site Namespace>
  2. Check for resource pressure on the pods on the remote site.
    kubectl top pod -n <cnDBTier remote site Namespace>
  3. Check the db_tier_replication_switchover_due_to_skiperror metric to get the error number that is raising the alert.

    If the replication switchover occurs because of replication error 13119 or 1296, then the cluster nodes may have been disconnected and restarted, or both replicas in the replication group may have been disconnected from the cluster. In this case, collect the logs and contact My Oracle Support.

  4. Check the ndbmysqld and ndbmgmd logs of the remote site and try to find the cause of the error number obtained above (for example, network issues, heartbeat failures, cluster disconnection, or infrastructure events such as network delays).
    1. Check the current and previous logs of ndbmtd pods:
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> -f
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> --previous
    2. Check the logs of the ndbmysql pods:
      1. Log in to ndbmysqld pod.
        kubectl -n  <cnDBTier remote site Namespace> exec -it <ndbmysqld pod> -- bash
      2. Change to /var/occnedb/mysql/ directory.
        cd /var/occnedb/mysql/
      3. Open the mysqld.log file to check the logs.
        vim mysqld.log
  5. If the geo replication with the remote site is down, the issue persists, and the alert does not clear automatically within the configured threshold time (1 hour), capture all the outputs of the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-11 REPLICATION_EPOCHS_LOST

Field Details
Description This alert is triggered when the number of epochs lost due to skip errors is greater than 0 and less than 2000.

This alert is cleared one hour after the event.

Summary Epochs are lost for skipping replication errors
Severity info
Condition db_tier_epochs_lost_due_to_skiperror > 0 and db_tier_epochs_lost_due_to_skiperror <= <Configured epoch interval lower threshold>
Expression Validity NA
SNMP Trap ID 2023
Affects Service (Y/N) N
Recommended Action

Cause: Less than 2000 epochs are lost due to skip errors.

Diagnostic Information: The db_tier_epochs_lost_due_to_skiperror metric (Table 5-81) provides information about the number of epochs lost due to skip errors.

Recommended Actions:

  1. Monitor the alert. If new skip errors accumulate or replication halts again, it may escalate to REPLICATION_TOO_MANY_EPOCHS_LOST.
  2. Check cluster status and pod status in the remote site.

    If multiple pods restarted in the remote site, then capture all the outputs of the following checks and contact My Oracle Support.

    1. Check the cluster status by running the following command:
      kubectl -n <cnDBTier remote site Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    2. Check the pod status by running the following command:
      kubectl get pod -n <cnDBTier remote site Namespace>
  3. Check for resource pressure on the pods on the remote site.
    kubectl top pod -n <cnDBTier remote site Namespace>
  4. Check the db_tier_epochs_lost_due_to_skiperror metric to get the error number that is raising the alert.

    If the replication switchover occurs because of replication error 13119 or 1296, then the cluster nodes may have been disconnected and restarted, or both replicas in the replication group may have been disconnected from the cluster. In this case, collect the logs and contact My Oracle Support.

  5. Check the ndbmysqld and ndbmgmd logs of the remote site and try to find the cause of the error number obtained above (for example, network issues, heartbeat failures, cluster disconnection, or infrastructure events such as network delays).
    1. Check the current and previous logs of ndbmtd pods:
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> -f
      kubectl logs -n <cnDBTier remote site Namespace> <ndbmtd pod> --previous
    2. Check the logs of the ndbmysql pods:
      1. Log in to ndbmysqld pod.
        kubectl -n  <cnDBTier remote site Namespace> exec -it <ndbmysqld pod> -- bash
      2. Change to /var/occnedb/mysql/ directory.
        cd /var/occnedb/mysql/
      3. Open the mysqld.log file to check the logs.
        vim mysqld.log
  6. If the geo replication with the remote site is down, the issue persists, and the alert does not clear automatically within the configured threshold time (1 hour), capture all the outputs of the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.6 cnDBTier Georeplication Recovery Status Alerts

This section provides details about the cnDBTier georeplication recovery status alerts.

Table 6-12 GEOREPLICATION_RECOVERY_IN_PROGRESS

Field Details
Description This alert is triggered with critical severity when the georeplication recovery is in progress and the alert is cleared when georeplication recovery is complete.
Summary Identified cnDBTier Site {{ $labels.site_name }} georeplication recovery is in progress for kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_georeplication_recovery_state == 1
Expression Validity 1m
SNMP Trap ID 2018
Affects Service (Y/N) Y
Recommended Action

Cause: Georeplication recovery is in progress, that is, a failed site is being recovered from a healthy site.

Diagnostic Information: The db_tier_georeplication_recovery_state metric (Table 5-36) provides information whether georeplication recovery is in progress.

Recommended Actions:
  1. Since the time required to complete the GRR process depends on the database size, wait for the process to complete. The alert will be automatically cleared once the GRR process is successfully completed.
  2. Monitor the status of GRR using the GRR Status REST API.
    1. Get the replication service LoadBalancer IP for the site on which restore is being done:
      $ export IP=$(kubectl get svc -n <namespace of failed site> -l dbtierapp=dbreplicationsvc,servicetype=external --no-headers | awk '{print $4}' | head -n 1 )
    2. Get the replication service LoadBalancer Port for the site on which restore is being done:
      $ export PORT=$(kubectl get svc -n <namespace of failed site> -l dbtierapp=dbreplicationsvc,servicetype=external --no-headers | awk '{print $5}' | cut -d '/' -f 1 | cut -d ':' -f 1 | head -n 1)
    3. Run the following command to get the geo replication restore status:
      $ curl -X GET http://$IP:$PORT/db-tier/gr-recovery/site/{sitename}/status

      If gr_state is "COMPLETED", then database restoration and re-establishing the replication channels are complete.
  3. Monitor the GRR states and check whether any other backup transfer failure alerts are raised (see the sketch after this list for a simple polling loop).
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
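
A simple loop can be used to poll the GRR status endpoint until gr_state changes. This is a minimal sketch built on the IP and PORT variables exported above; the 60-second interval is an assumption.

    # Poll the GRR status once per minute until it is interrupted with Ctrl+C.
    while true; do
      curl -s -X GET http://$IP:$PORT/db-tier/gr-recovery/site/{sitename}/status
      echo
      sleep 60
    done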

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.7 cnDBTier Cluster Status Alerts

This section provides details about cnDBTier cluster status alerts.

Table 6-13 CLUSTER_DOWN

Field Details
Description This alert is triggered with critical severity when the cnDBTier NDB cluster is not UP.
Summary MySQL Cluster is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_cluster_status == 0
Expression Validity 1m
SNMP Trap ID 2017
Affects Service (Y/N) Y
Recommended Action
Cause:
  • When a pod restarts due to Kubernetes liveness or readiness probe failures.
  • When cnDBTier application restarts or fails to start.
Diagnostic Information:
  • Run the following command to check the status of cnDBTier namespace:
    kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    The cluster is down if:
    • the ndbappmysqld pods are down, not running, and not connected
    • the remaining pods are not running and not connected
  • Check Kubernetes events for probe failures in the platform logs.
  • Check if any exception is reported in the cnDBTier application logs.

Recommended Actions:

  1. If the GEOREPLICATION_RECOVERY_IN_PROGRESS alert is raised along with the CLUSTER_DOWN alert, wait until the GRR process is completed.
    1. Get the replication service LoadBalancer IP for the site on which restore is being done.
      $ export IP=$(kubectl get svc -n <namespace of failed site> -l dbtierapp=dbreplicationsvc,servicetype=external --no-headers | awk '{print $4}' | head -n 1 )
    2. Get the replication service LoadBalancer Port for the site on which restore is being done:
      $ export PORT=$(kubectl get svc -n <namespace of failed site> -l dbtierapp=dbreplicationsvc,servicetype=external --no-headers | awk '{print $5}' | cut -d '/' -f 1 | cut -d ':' -f 1 | head -n 1)
    3. Run the following command to get Georeplication restore status:
      $ curl -X GET http://$IP:$PORT/db-tier/gr-recovery/site/{sitename}/status
  2. Run the following command to check the node status in the cnDBTier namespace:
    $ kubectl -n ${OCCNE_NAMESPACE} exec -it ndbmgmd-0 -- ndb_mgm -e show
  3. Check the management pod logs to see whether any frequent missed heartbeat warnings or other alerts are logged for the pod.

    $ kubectl -n <namespace> exec -it ndbmgmd-0 -- bash
    $ tail -f /var/occnedb/mysqlndbcluster/ndbmgmd_cluster.log
  4. Check the pod status. If the pod is not coming up, analyze the previous container logs of the pod to see the error information.
    $ kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname> --previous
  5. Verify if any network or infrastructure-related events occurred that might have disrupted communication between pods (see the sketch after this list).
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
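
Recent Kubernetes events often reveal probe failures, evictions, or node issues that disrupted the pods. This is a minimal sketch; the field selector can be adjusted as needed.

    # List recent events in the cnDBTier namespace, newest last.
    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    # Show only warning events to highlight probe failures and restarts.
    kubectl get events -n <namespace> --field-selector type=Warning --sort-by=.lastTimestamp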

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-14 MYSQL_NDB_CLUSTER_DISCONNECT

Field Details
Description This alert is triggered with critical severity when the cnDBTier NDB cluster gets disconnected.
Summary MySQL NDB Cluster Disconnected {{ $value }} times for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_cluster_disconnect > 0
Expression Validity 1m
SNMP Trap ID 2034
Affects Service (Y/N) Y
Recommended Action
Cause:
  • When all ndbmtd pods or all ndbmtd pods of the same node group restart due to Kubernetes probe failures or infrastructure related issues.
Diagnostic Information:
  • Run the following command to check the status of cnDBTier namespace:
    kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show

    The cluster is down if all data nodes, or all data nodes of the same node group, are not connected.

  • Check Kubernetes events for probe failures in the platform logs.
  • Check if there is any network fluctuation or platform related issue which can cause the ndbmtd pods to restart.
  • Check if any exception is reported in the cnDBTier application logs.

Recommended Actions:

  1. Run the following command to check the node status in the cnDBTier namespace:
    $ kubectl -n ${OCCNE_NAMESPACE} exec -it ndbmgmd-0 -- ndb_mgm -e show
  2. Check the management pod logs to see whether any frequent missed heartbeat warnings or other alerts are logged for the pod.

    $ kubectl -n <namespace> exec -it ndbmgmd-0 -- bash
    $ tail -f /var/occnedb/mysqlndbcluster/ndbmgmd_cluster.log
  3. Check the pod status. If the pod is not coming up, analyze the previous container logs of the pod to see the error information.
    $ kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname> --previous
  4. Verify if any network or infrastructure-related events occurred that might have disrupted communication between pods.
  5. Check if any exception is reported in the cnDBTier application logs:
    $ kubectl logs -n <cnDBTier Namespace> <cnDBTier Pod name> -f
  6. If only that pod is down and all the other pods are up and running, then delete the pod and its PVC so that the pod can restart by reinitializing (see the sketch after this list). Before doing so, verify the cluster status from within the pod:
    $ kubectl -n <namespace> exec -it <ndbmtd/ndbmysqld/ndbappmysqld podname> -c db-infra-monitor-svc -- bash
    $ ndb_mgm -c ndbmgmd-0.ndbmgmdsvc:1186 -e show
  7. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
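
The following is a minimal, illustrative sketch of deleting a pod and its PVC, assuming the pod is managed by a StatefulSet with volume claim templates and that the pod and PVC names are confirmed first. PVC deletion remains pending until the pod that uses it is removed.

    # Identify the PVC bound to the affected pod (names are placeholders).
    kubectl -n <namespace> get pvc | grep <podname>
    # Delete the PVC; it stays in Terminating state until the pod is gone.
    kubectl -n <namespace> delete pvc <pvc-name-of-pod>
    # Delete the pod so that it is recreated and reinitializes its storage.
    kubectl -n <namespace> delete pod <podname>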

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.8 cnDBTier Automated Backup Alerts

This section provides details about the cnDBTier automated backup alerts.

Table 6-15 BACKUP_FAILED

Field Details
Description This alert is triggered with minor severity when the backup service fails to complete the backup successfully.
Summary Could not backup database for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition db_tier_backup{status='FAILED'}
Expression Validity N/A
SNMP Trap ID 2011
Affects Service (Y/N) N
Recommended Action
Cause:
  • When backup service fails to complete the backup successfully.
  • When PVC size is not enough and as a result the backup fails.

Diagnostic Information: The db_tier_backup metric (Table 5-34) provides information if the backup failed or not.

Recommended Actions:

  1. Run the following command to check if all data pods are up and running:
    kubectl get pods -n <namespace> -l dbtierapp=ndbmtd --no-headers
  2. Check the data pod status. If a pod is not coming up, analyze the previous container logs of the pod to see the error information.
    kubectl -n <namespace> logs <ndbmtd pod name> --previous
  3. Verify the number of data nodes and the DataMemory value configured in the ndbmtd configuration and ensure that the PVC configured in data nodes aligns with the cnDBTier dimensions.
    1. Verify the data memory configured in custom_values.yaml.
       ndb:
         ndbdisksize: 60Gi
         ndbbackupdisksize: 100Gi
         datamemory: 12G
    2. Log in to one of the Bastion hosts and get the pvc for data nodes and db replication service:
      kubectl get pvc -n <namespace> | grep ndbmtd
  4. If the PVC is not configured as per cnDBTier dimensions, then increase the database capacity by following the scaling procedures below, and get the PVC resources configuration reviewed by the NF team before performing the scaling:
    1. Horizontal scaling by adding more data nodes.
    2. Vertical scaling by increasing the PVC of data nodes.
  5. Validate the network connectivity between the data nodes and management nodes and make sure that the ports and nodes are reachable (see the sketch after this list).
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
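
Connectivity from a data node to the management node can be checked with the ndb_mgm client from the db-infra-monitor-svc container, as used elsewhere in this document. This is a minimal sketch; the management service name and port follow the defaults shown in this guide.

    # Connect to the management node from a data pod and show the cluster state.
    kubectl -n <namespace> exec -it ndbmtd-0 -c db-infra-monitor-svc -- \
      ndb_mgm -c ndbmgmd-0.ndbmgmdsvc:1186 -e show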

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-16 BACKUP_PURGED_EARLY

Field Details
Description This alert is triggered with minor severity when the backup service purges old backups earlier than expected to create space for new backup.
Summary A backup was deleted prematurely for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition db_tier_backup{status='PURGED_EARLY'}
Expression Validity N/A
SNMP Trap ID 2012
Affects Service (Y/N) N
Recommended Action

Cause: When the backup service purges the old backups earlier than the expected time, to create space for a new backup.

Diagnostic Information: The db_tier_backup metric (Table 5-34) provides information if the backup is purged earlier than expected.

Recommended Actions:

  1. Verify if there is sufficient disk space available on data pods:
    kubectl exec -it ndbmtd-0 -c mysqlndbcluster -n <namespace> -- df -kh
  2. Validate that DBTIER_BACKUP_INFO retains the correct number of backups (retainbackupno) and purges the older ones.
    1. Check the retainbackupno of the cnDBTier cluster:
      $ kubectl exec -it <db-backup-manager-svc pod> -n <namespace> -- printenv RETAIN_BACKUP_NO
    2. Check the backup records using the following query:
      $ kubectl -n <namespace> exec -it ndbappmysqld-0 -- mysql -h 127.0.0.1 -uroot -p<PASSWORD>
      mysql> SELECT *  FROM backup_info.DBTIER_BACKUP_INFO  ORDER BY creation_ts DESC;
  3. Verify the number of data nodes and the DataMemory value configured in the ndbmtd configuration and ensure that the PVC configured in data nodes aligns with the cnDBTier dimensions.
    1. Verify the data memory configured in custom_values.yaml.
       ndb:
         ndbdisksize: 60Gi
         ndbbackupdisksize: 100Gi
         datamemory: 12G
    2. Log in to one of the Bastion hosts and get the pvc for data nodes and db replication service:
      kubectl get pvc -n <namespace> | grep ndbmtd
  4. If the PVC is not configured as per cnDBTier dimensions, then increase the database capacity by following the scaling procedures below, and get the PVC resources configuration reviewed by the NF team before performing the scaling:
    1. Horizontal scaling by adding more data nodes.
    2. Vertical scaling by increasing the PVC of data nodes.
  5. Validate the network connectivity between data nodes and management nodes and make sure that ports and nodes are reachable.
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-17 BACKUP_SIZE_GROWTH

Field Details
Description This alert is triggered with minor severity whenever the current backup size exceeds the average size of the previous backups by 20%.
Summary Backup size exceeded expected size for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition (db_tier_backup_used_disk_percentage/(avg_over_time(db_tier_backup_used_disk_percentage[5d])))>1.05
Expression Validity N/A
SNMP Trap ID 2013
Affects Service (Y/N) N
Recommended Action

Cause: When the current backup size exceeds the average size of the previous backups by 20%.

Diagnostic Information: The db_tier_backup_used_disk_percentage metric (Table 5-32) provides information on whether the current backup size exceeds the average of the previous backups by 20%.

Recommended Actions:

  1. This alert detects a spike in disk usage by comparing the current value of db_tier_backup_used_disk_percentage to the five-day average. If the spike exceeds the configured backupSizeGrowthAlertThreshold, this alert is triggered (see the sketch after this list for a query that inspects the ratio).
  2. The alert clears automatically when the size of the current backup no longer exceeds the average size of the backups taken over the last five days by more than 20%.
  3. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
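
The same ratio that drives this alert can be inspected directly through the Prometheus HTTP API. This is a minimal sketch; the Prometheus host and port are placeholders for your environment.

    # Compare the current used-disk percentage against its five-day average; values above the threshold trigger the alert.
    curl -sG 'http://<prometheus-host>:<prometheus-port>/api/v1/query' \
      --data-urlencode 'query=db_tier_backup_used_disk_percentage / avg_over_time(db_tier_backup_used_disk_percentage[5d])'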

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-18 BACKUP_STORAGE_LOW

Field Details
Description This alert is triggered with minor severity when the total backup size of the data node is >= 70% and < 80% of the total data node disk size.
Summary Disk storage on DATA node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition (avg_over_time(db_tier_backup_used_disk_percentage[5m])>=70) and (avg_over_time(db_tier_backup_used_disk_percentage[5m])<80)
Expression Validity N/A
SNMP Trap ID 2014
Affects Service (Y/N) N
Recommended Action

Cause: When the total backup size of the data node is >= 70% and < 80% of the total data node disk size.

Diagnostic Information: The db_tier_backup_used_disk_percentage metric (Table 5-32) provides information if the current backup size is >= 70% and < 80% of the total data node disk size.

Recommended Actions:

  1. Verify the number of data nodes and the DataMemory value configured in the ndbmtd configuration and ensure that the PVC configured in data nodes aligns with the cnDBTier dimensions.
    1. Verify the data memory configured in custom_values.yaml.
       ndb:
         ndbdisksize: 60Gi
         ndbbackupdisksize: 100Gi
         datamemory: 12G
    2. Log in to one of the Bastion hosts and get the pvc for data nodes and db replication service:
      kubectl get pvc -n <namespace> | grep ndbmtd
  2. If the PVC is not configured as per cnDBTier dimensions, then increase the database capacity by following the scaling procedures below, and get the PVC resources configuration reviewed by the NF team before performing the scaling:
    1. Horizontal scaling by adding more data nodes.
    2. Vertical scaling by increasing the PVC of data nodes.
  3. The alert will be cleared once the db_tier_backup_used_disk_percentage is less than 70% or greater than or equal to 80%.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-19 BACKUP_STORAGE_LOW

Field Details
Description This alert is triggered with major severity when the total backup size of the data node is >= 80% and < 95% of the total data node disk size.
Summary Disk storage on DATA node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition (avg_over_time(db_tier_backup_used_disk_percentage[5m])>=80) and (avg_over_time(db_tier_backup_used_disk_percentage[5m])<95)
Expression Validity N/A
SNMP Trap ID 2015
Affects Service (Y/N) N
Recommended Action

Cause: When the total backup size of the data node is >= 80% and < 95% of the total data node disk size.

Diagnostic Information: The db_tier_backup_used_disk_percentage metric (Table 5-32) provides information if the current backup size is >= 80% and < 95% of the total data node disk size.

Recommended Actions:

  1. Verify the number of data nodes and the DataMemory value configured in the ndbmtd configuration and ensure that the PVC configured in data nodes aligns with the cnDBTier dimensions.
    1. Verify the data memory configured in custom_values.yaml.
       ndb:
         ndbdisksize: 60Gi
         ndbbackupdisksize: 100Gi
         datamemory: 12G
    2. Log in to one of the Bastion hosts and get the pvc for data nodes and db replication service:
      kubectl get pvc -n <namespace> | grep ndbmtd
  2. If the PVC is not configured as per cnDBTier dimensions, then increase the database capacity by following the scaling procedures below, and get the PVC resources configuration reviewed by the NF team before performing the scaling:
    1. Horizontal scaling by adding more data nodes.
    2. Vertical scaling by increasing the PVC of data nodes.
  3. The alert will be cleared once the db_tier_backup_used_disk_percentage is less than 80% or greater than or equal to 95%.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-20 BACKUP_STORAGE_FULL

Field Details
Description This alert is triggered with critical severity when the total backup size of the data node is >= 95% of the total data node disk size.
Summary Disk storage on DATA node with node ID {{ $labels.node_id }} is full for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (avg_over_time(db_tier_backup_used_disk_percentage[5m])>=95)
Expression Validity N/A
SNMP Trap ID 2016
Affects Service (Y/N) N
Recommended Action

Cause: When the total backup size of the data node is >= 95% of the total data node disk size.

Diagnostic Information: The db_tier_backup_used_disk_percentage metric (Table 5-32) provides information if the current backup size is >= 95% of the total data node disk size.

Recommended Actions:

  1. Verify the number of data nodes and the DataMemory value configured in the ndbmtd configuration and ensure that the PVC configured in data nodes aligns with the cnDBTier dimensions.
    1. Verify the data memory configured in custom_values.yaml.
       ndb:
         ndbdisksize: 60Gi
         ndbbackupdisksize: 100Gi
         datamemory: 12G
    2. Log in to one of the Bastion hosts and get the pvc for data nodes and db replication service:
      kubectl get pvc -n <namespace> | grep ndbmtd
  2. If the PVC is not configured as per cnDBTier dimensions, then increase the database capacity by following the scaling procedures below, and get the PVC resources configuration reviewed by the NF team before performing the scaling:
    1. Horizontal scaling by adding more data nodes.
    2. Vertical scaling by increasing the PVC of data nodes.
  3. The alert will be cleared once the db_tier_backup_used_disk_percentage is less than 95%.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-21 DB_TIER_NDB_BACKUP_IN_PROGRESS

Field Details
Description This alert is triggered with minor severity when a data node backup is in progress in the current site.
Summary Indicates that a data node backup process is in progress in the current site.
Severity minor
Condition db_tier_ndb_backup_in_progress == 1
Expression Validity N/A
SNMP Trap ID 2037
Affects Service (Y/N) N
Recommended Action

Cause: When a data node backup is in progress in the current site.

Diagnostic Information: The db_tier_ndb_backup_in_progress metric (Table 5-35) provides information if a data node backup is in progress or not. Ensure that you don't make any schema changes until the backup completes.

Recommended Actions:

  1. Wait for the backup to complete. Once completed, the alert should automatically clear.
  2. Check if all data pods are up and running by running the following command:
    kubectl get pods -n <namespace> -l dbtierapp=ndbmtd --no-headers
  3. Check the logs of the db-backup-manager-svc service:
    $ kubectl logs -n <cnDBTier Namespace> <backup manager service pod name> -f
  4. Verify whether any backup is in the in-progress state:
    $ kubectl -n cluster1 exec -it mysql-cluster-db-backup-manager-svc-b49488f8f-lbpbb -- bash
    $ curl -X GET http://mysql-cluster-db-backup-manager-svc:8080/db-tier/backup/status
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.9 cnDBTier Bin Log Usage Alerts

This section provides details about the cnDBTier binlog usage alerts.

Table 6-22 BINLOG_STORAGE_LOW

Field Details
Description This alert is triggered with minor severity when the total BinLog size of the SQL node is >= 70% and < 80% of the total SQL node disk size.
Summary Disk storage on SQL node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition (avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 70) and (avg_over_time( db_tier_binlog_used_bytes_percentage[5m] ) < 80)
Expression Validity 5m
SNMP Trap ID 2007
Affects Service (Y/N) N
Recommended Action

Cause: When the total BinLog size of the SQL node is >= 70% and < 80% of the total SQL node disk size.

Diagnostic Information: The db_tier_binlog_used_bytes_percentage metric (Table 5-29) provides information if the total BinLog size of the SQL node is >=70% and <80% of total SQL node disk size.

Recommended Actions:
  1. Verify if the PVC configured for the ndbmysqld pods is as per the cnDBTier dimensions. If the PVC is not configured as per the cnDBTier dimensions, then increase the PVC by following the scaling procedures below, and get the PVC configurations reviewed by the NF team before performing the scaling:

    For more information, see the Scaling cnDBTier Pods section to scale the cnDBTier pods and increase the PVC allocation.

  2. Monitor BinLog growth closely over the next few hours and check if binlogs are getting purged.
    1. Run the following command inside the ndbmysqld pod:
      $ kubectl exec -it -n <cnDBTier Namespace> ndbmysqld-0 -- mysql -h127.0.0.1 -uroot -pNextGenCne
    2. Run the following command to monitor BinLog growth (the example after these steps also shows how to check the purge settings):
      mysql> show binary logs;
  3. Check if the relay logs are increasing continuously in the ndbmysqld pod.

    Run the following commands inside the ndbmysqld pod and check all the relay logs. There should not be more than 3 or 4 relay log files:

    $ kubectl exec -it -n <cnDBTier Namespace> ndbmysqld-0 -- bash
    $ cd /var/occnedb/mysld/
    $ ls -lrth | grep 'relay'
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
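
As part of step 2, you can also check the settings that typically control automatic binary log purging from the same mysql session. This is a minimal sketch; the values returned depend on your cnDBTier configuration.

    mysql> SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
    mysql> SHOW VARIABLES LIKE 'max_binlog_size';

If binary logs older than the configured expiry period are still listed by SHOW BINARY LOGS, capture that output for My Oracle Support, as it indicates that purging is not keeping up with the BinLog growth.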

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-23 BINLOG_STORAGE_LOW

Field Details
Description This alert is triggered with major severity when the total BinLog size of the SQL node is >=80% and <95% of the total SQL node disk size.
Summary Disk storage on SQL node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition (avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 80) and (avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) < 95)
Expression Validity 5m
SNMP Trap ID 2036
Affects Service (Y/N) N
Recommended Action

Cause: When the total BinLog size of the SQL node is >=80% and <95% of the total SQL node disk size.

Diagnostic Information: The db_tier_binlog_used_bytes_percentage metric (Table 5-29) provides information if the total BinLog size of the SQL node is >=80% and <95% of total SQL node disk size.

Recommended Actions:
  1. Verify if the PVC configured for the ndbmysqld pods is as per the cnDBTier dimensions. If the PVC is not configured as per the cnDBTier dimensions, then increase the PVC by following the scaling procedures below, and get the PVC configurations reviewed by the NF team before performing the scaling:

    For more information, see the Scaling cnDBTier Pods section to scale the cnDBTier pods and increase the PVC allocation.

  2. Monitor BinLog growth closely over the next few hours and check if binlogs are getting purged.
    1. Run the following command inside the ndbmysqld pod:
      $ kubectl exec -it -n <cnDBTier Namespace> ndbmysqld-0 -- mysql -h127.0.0.1 -uroot -pNextGenCne
    2. Run the following command to monitor BinLog growth:
      mysql> show binary logs;
  3. Check if the relay logs are increasing continuously in the ndbmysqld pod.

    Run the following commands inside the ndbmysqld pod and check all the relay logs. There should not be more than 3 or 4 relay log files:

    $ kubectl exec -it -n <cnDBTier Namespace> ndbmysqld-0 -- bash
    $ cd /var/occnedb/mysld/
    $ ls -lrth | grep 'relay'
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-24 BINLOG_STORAGE_FULL

Field Details
Description This alert is triggered with critical severity when the total BinLog size of the SQL node is >= 95% of the total SQL node disk size.
Summary Disk storage on SQL node with node ID {{ $labels.node_id }} is full for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 95
Expression Validity N/A
SNMP Trap ID 2008
Affects Service (Y/N) Y
Recommended Action

Cause: When the total BinLog size of the SQL node is >= 95% of the total SQL node disk size.

Diagnostic Information: The db_tier_binlog_used_bytes_percentage metric (Table 5-29) provides information if the total BinLog size of the SQL node is >= 95% of total SQL node disk size.

Recommended Actions:
  1. Verify if the PVC configured for the ndbmysqld pods is as per the cnDBTier dimensions. If the PVC is not configured as per the cnDBTier dimensions, then increase the PVC by following the scaling procedures below, and get the PVC configurations reviewed by the NF team before performing the scaling:

    For more information, see the Scaling cnDBTier Pods section to scale the cnDBTier pods and increase the PVC allocation.

  2. Monitor BinLog growth closely over the next few hours and check if binlogs are getting purged.
    1. Run the following command inside the ndbmysqld pod:
      $ kubectl exec -it -n <cnDBTier Namespace> ndbmysqld-0 -- mysql -h127.0.0.1 -uroot -pNextGenCne
    2. Run the following command to monitor BinLog growth:
      mysql> show binary logs;
  3. Check if the relay logs are increasing continuously in the ndbmysqld pod.

    Run the following commands inside the ndbmysqld pod and check all the relay logs. There should not be more than 3 or 4 relay log files:

    $ kubectl exec -it -n <cnDBTier Namespace> ndbmysqld-0 -- bash
    $ cd /var/occnedb/mysld/
    $ ls -lrth | grep 'relay'
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.10 cnDBTier Replication Alerts

This section provides details about cnDBTier replication alerts.

Table 6-25 REPLICATION_CHANNEL_DOWN

Field Details
Description This alert is triggered with major severity when an ACTIVE channel goes to the FAILED state.
Summary Cross-site replication is down on node {{ $labels.node_id }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition (db_tier_replication_status{role="failed"} == 0) or (db_tier_replication_status{role="active"} == 0)
Expression Validity N/A
SNMP Trap ID 2005
Affects Service (Y/N) N
Recommended Action

Cause: When an ACTIVE channel goes to the FAILED state because cross-site replication is down on a node.

Diagnostic Information: The following metrics provide information if the replication channel is down:
  • db_tier_replication_status{role="failed"} == 0 (Table 5-30)
  • db_tier_replication_status{role="active"} == 0 (Table 5-30)

Recommended Actions:

  1. Verify the replication status of all the ndbmysqld pods using the 'SHOW REPLICA STATUS' command (see the filtered example after these steps):
    kubectl exec -it ndbmysqld-0 --namespace=<namespace> -- mysql -h127.0.0.1 -uroot -p<PASSWORD> -e "SHOW REPLICA STATUS\G"
  2. Check the pod status. If the pod is not coming up, analyze the previous container logs of the pod for error information.
    $ kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname> --previous
  3. Check the channel details for all cnDBTier sites.

    Retrieve the replication channel configuration using the following query:

    $ kubectl -n <namespace> exec -it ndbappmysqld-0 -- mysql -h 127.0.0.1 -uroot -p<PASSWORD>
    mysql> SELECT remote_site_name,channel_id,role,replchannel_group_id FROM replication_info.DBTIER_REPLICATION_CHANNEL_INFO;
    
  4. Also check the management pod logs for any frequent heartbeat missed warnings or other alerts logged for the pod.
    $ kubectl -n <namespace> exec -it ndbmgmd-0 -- bash
    $ tail -f /var/occnedb/mysqlndbcluster/ndbmgmd_cluster.log
  5. Validate the network connectivity between the cluster nodes and make sure that pods and nodes are reachable.
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
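
When reviewing the output of step 1, the following variant filters the SHOW REPLICA STATUS output down to the fields that are most relevant for a failed channel. This is a minimal sketch; the field names are the standard MySQL 8.0 replica status fields.

    kubectl exec -it ndbmysqld-0 --namespace=<namespace> -- mysql -h127.0.0.1 -uroot -p<PASSWORD> -e "SHOW REPLICA STATUS\G" | grep -E 'Replica_IO_Running|Replica_SQL_Running|Last_IO_Error|Last_SQL_Error|Seconds_Behind_Source'

A channel is healthy only when both Replica_IO_Running and Replica_SQL_Running report Yes; any Last_IO_Error or Last_SQL_Error text points to the failure reason.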

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-26 REPLICATION_FAILED

Field Details
Description This alert is triggered with critical severity when all the channels are in the STANDBY or FAILED state.
Summary Cross-site replication is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (count by (site_name, namespace, replchannel_group_id) (db_tier_replication_status) == count by (site_name, namespace, replchannel_group_id) (db_tier_replication_status{role="standby"})) or (count by (namespace, replchannel_group_id, site_name) (db_tier_replication_status) == count by (namespace, replchannel_group_id, site_name) (db_tier_replication_status{role="failed"}))
Expression Validity N/A
SNMP Trap ID 2006
Affects Service (Y/N) Y
Recommended Action

Cause: When all the channels are in the STANDBY or FAILED state as the cross-site replication is down for the cnDBTier site.

Diagnostic Information: The db_tier_replication_status metric (Table 5-30) provides information if the replication failed.

Recommended Actions:

  1. Verify the replication status of all the ndbmysqld pods using the 'SHOW REPLICA STATUS' command:
    kubectl exec -it ndbmysqld-0 --namespace=<namespace> -- mysql -h127.0.0.1 -uroot -p<PASSWORD> -e "SHOW REPLICA STATUS\G"
  2. Check the pod status. If the pod is not coming up, analyze the previous container logs of the pod for error information.
    $ kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname> --previous
  3. Check the channel details for all cnDBTier sites.

    Retrieve the replication channel configuration:

    $ kubectl -n <namespace> exec -it ndbappmysqld-0 -- mysql -h 127.0.0.1 -uroot -p<PASSWORD>
    mysql> SELECT remote_site_name,channel_id,role,replchannel_group_id FROM replication_info.DBTIER_REPLICATION_CHANNEL_INFO;
    
  4. Also check the management pod logs for any frequent heartbeat missed warnings or other alerts logged for the pod.
    $ kubectl -n <namespace> exec -it ndbmgmd-0 -- bash
    $ tail -f /var/occnedb/mysqlndbcluster/ndbmgmd_cluster.log
  5. Validate the network connectivity between the cluster nodes and make sure that pods and nodes are reachable.
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-27 REPLICA_REPLICATION_DELAY_HIGH

Field Details
Description This alert is triggered when the last record read by the replica is more than five minutes behind the latest record written by the source.
Summary Replica replication on SQL node at {{ $labels.replica_node_ip }} is {{ $value }} seconds behind the source for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition avg(avg_over_time(db_tier_replication_replica_delay[5m])) by (source_node_ip,replica_node_ip) >= 300 and avg(avg_over_time(db_tier_replication_replica_delay[5m])) by (source_node_ip,replica_node_ip) < 48*3600
Expression Validity 1m
SNMP Trap ID 2009
Affects Service (Y/N) N
Recommended Action

Cause: When the last record read by the replica is more than 5 minutes and less than 48 hours behind the latest record written by the source.

Diagnostic Information: The db_tier_replication_replica_delay metric (Table 5-31) provides information if there is a delay in the replica replication.

Recommended Actions:

  1. Verify the replication status of all the ndbmysqld pods using the 'SHOW REPLICA STATUS' command:
    kubectl exec -it ndbmysqld-0 --namespace=<namespace> -- mysql -h127.0.0.1 -uroot -p<PASSWORD> -e "SHOW REPLICA STATUS\G"
  2. Also check the management pod logs for any frequent heartbeat missed warnings or other alerts logged for the pod.
    $ kubectl -n <namespace> exec -it ndbmgmd-0 -- bash
    $ tail -f /var/occnedb/mysqlndbcluster/ndbmgmd_cluster.log
  3. Validate the network connectivity between the cluster nodes and make sure that pods and nodes are reachable.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-28 REPLICA_REPLICATION_FAILED

Field Details
Description This alert is triggered when the last record read by the replica is more than 48 hours behind the latest record written by the source.
Summary Replica replication has fallen more than 48 hours behind the source. Manual restore from backup may be required for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition avg(avg_over_time(db_tier_replication_replica_delay[5m])) by (source_node_ip,replica_node_ip) >= 48*3600
Expression Validity 1m
SNMP Trap ID 2010
Affects Service (Y/N) Y
Recommended Action

Cause: When the last record read by the replica is more than 48 hours behind the latest record written by the source.

Diagnostic Information: The db_tier_replication_replica_delay metric (Table 5-31) provides information if the replica replication failed.

Recommended Actions:

  1. Verify the replication status of all the ndbmysqld pods using the 'SHOW REPLICA STATUS' command:
    kubectl exec -it ndbmysqld-0 --namespace=<namespace> -- mysql -h127.0.0.1 -uroot -p<PASSWORD> -e "SHOW REPLICA STATUS\G"
  2. Check the channel details for all cnDBTier sites.

    Retrieve the replication channel configuration:

    $ kubectl -n <namespace> exec -it ndbappmysqld-0 -- mysql -h 127.0.0.1 -uroot -p<PASSWORD>
    mysql> SELECT remote_site_name,channel_id,role,replchannel_group_id FROM replication_info.DBTIER_REPLICATION_CHANNEL_INFO;
    
  3. Also check the management pod logs for any frequent heartbeat missed warnings or other alerts logged for the pod.
    $ kubectl -n <namespace> exec -it ndbmgmd-0 -- bash
    $ tail -f /var/occnedb/mysqlndbcluster/ndbmgmd_cluster.log
  4. Check the pod status. If the pod is not coming up, analyze the previous container logs of the pod for error information.
    $ kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname> --previous
  5. Validate the network connectivity between the cluster nodes and make sure that pods and nodes are reachable.
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-29 GEOREPLICATION_RECOVERY_FAILED

Field Details
Description This alert is triggered with critical severity when georeplication recovery fails on an unhealthy site where georeplication recovery was started.
Summary Georeplication recovery has failed on cnDBTier Site {{ $labels.site_name }} from kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_georeplication_recovery_state == 2
Expression Validity NA
SNMP Trap ID 2033
Affects Service (Y/N) Y
Recommended Action

Cause: Incorrect disk size, incorrect SSH key configurations, or other similar reasons.

Diagnostic Information: This alert indicates that georeplication recovery failed on an unhealthy site and replication couldn't be reestablished using the georeplication recovery procedure. This alert requires immediate attention.

Recommended Actions:
  1. Verify the replication status of all the ndbmysqld pods using the 'SHOW REPLICA STATUS' command:
    kubectl exec -it ndbmysqld-0 --namespace=<namespace> -- mysql -h127.0.0.1 -uroot -p<PASSWORD> -e "SHOW REPLICA STATUS\G"
  2. Check the pod status. If the pod is not coming up, analyze the previous container logs of the pod for error information.
    $ kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname> --previous
  3. Check the channel details for all cnDBTier sites.

    Retrieve the replication channel configuration:

    $ kubectl -n <namespace> exec -it ndbappmysqld-0 -- mysql -h 127.0.0.1 -uroot -p<PASSWORD>
    mysql> SELECT remote_site_name,channel_id,role,replchannel_group_id FROM replication_info.DBTIER_REPLICATION_CHANNEL_INFO;
    
  4. Also check the management pod logs for any frequent heartbeat missed warnings or other alerts logged for the pod.
    $ kubectl -n <namespace> exec -it ndbmgmd-0 -- bash
    $ tail -f /var/occnedb/mysqlndbcluster/ndbmgmd_cluster.log
  5. Monitor the georeplication recovery (GRR) state during the execution of the georeplication recovery procedure and identify the specific state after which GRR fails (see the example query after these steps).
  6. Check if backups can be successfully transferred from a healthy site to the recovering site.

    Log in to the leader replication service pod of the healthy site:

    $ kubectl -n <namespace> exec -it <leader-replication-pod> -- bash
    $ ssh -i /home/mysql/.ssh/id_rsa mysql@<db-replication-service-svc> -p 2022
    $ sftp -i /home/mysql/.ssh/id_rsa -P 2022 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null mysql@<leader_db_Replication_service_svc>:/var/occnedb/ <<< $'put -r <file/dir to be copied>'
  7. Validate the network connectivity between the cluster nodes and make sure that pods and nodes are reachable.
  8. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
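
To monitor the georeplication recovery state mentioned in step 5, you can query the db_tier_georeplication_recovery_state metric (used in the alert condition) through the Prometheus HTTP API. This is a minimal sketch; the Prometheus service name and port are placeholders, and only the value 2 (recovery failed) is defined by this alert's condition.

    curl -s 'http://<prometheus-service>:<prometheus-port>/api/v1/query' --data-urlencode 'query=db_tier_georeplication_recovery_state'

Track how the value changes per site while georeplication recovery runs to identify the state at which it fails.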

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.11 cnDBTier Memory Usage Alerts

This section provides details about the cnDBTier memory usage alerts.

Table 6-30 LOW_MEMORY

Field Details
Description This alert is triggered when the RAM usage of any node is greater than or equal to 80%.
Summary Node ID {{ $labels.node_id }}, memory utilization at {{ $value }} percent for memory type {{ $labels.memory_type }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition ((avg_over_time(db_tier_memory_used_bytes{memory_type="Data memory"}[1m]) / avg_over_time(db_tier_memory_total_bytes{memory_type="Data memory"}[1m])) * 100) >= 80
Expression Validity 1m
SNMP Trap ID 2003
Affects Service (Y/N) N
Recommended Action

Cause: When the RAM or memory usage of any node reaches the major level of threshold value.

Diagnostic Information: Check whether the data memory usage reported by the db_tier_memory_used_bytes and db_tier_memory_total_bytes metrics (used in the alert condition) is too high.

Recommended Actions:

  1. Check the memory (Data and Index) usage of the data nodes.
    $ kubectl -n <NAMESPACE> exec -it ndbmgmd-0 -- ndb_mgm -e "ALL REPORT memoryusage"
  2. Check if the exceptions in the exception tables are continuously increasing, and if the record count is high, involve the NF team to clean up the entries and investigate the root cause.
    1. Identify the exception tables using the following query:
      $ kubectl -n <namespace> exec -it ndbappmysqld-0 -- mysql -h 127.0.0.1 -uroot -p<PASSWORD>
      mysql> select TABLE_SCHEMA, TABLE_NAME from information_schema.TABLES where TABLE_NAME LIKE '%$EX';
    2. Get the count in the exception table from the exception table lists:
      mysql> select count(*) AS COUNT from <ABOVE_TABLE_SCHEMA>.<ABOVE_TABLE_NAME>;
    3. If the number of subscribers continues to increase, scale the database capacity by following the procedures below. Before proceeding with the scaling, ensure that the memory resources and DataMemory configuration are reviewed and approved by the NF team:
      1. Horizontal scaling by adding more data nodes.
      2. Vertical scaling by increasing the memory allocated to ndbmtd pods and updating the DataMemory setting accordingly.
  3. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
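
For step 1, the memory usage report can be filtered to just the data and index memory lines, which makes it easier to watch the trend while the alert is active. This is a minimal sketch based on the ndb_mgm command shown in step 1; the exact report wording can vary between cluster versions.

    $ kubectl -n <NAMESPACE> exec -it ndbmgmd-0 -- ndb_mgm -e "ALL REPORT memoryusage" | grep -i usage

Compare the reported percentages against the 80% (major) and 90% (critical) thresholds used by the LOW_MEMORY and OUT_OF_MEMORY alerts.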

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-31 OUT_OF_MEMORY

Field Details
Description This alert is triggered with critical severity when the RAM usage of any node is greater than or equal to 90%.
Summary Node ID {{ $labels.node_id }} out of memory for memory type {{ $labels.memory_type }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition ((avg_over_time(db_tier_memory_used_bytes{memory_type="Data memory"}[1m]) / avg_over_time(db_tier_memory_total_bytes{memory_type="Data memory"}[1m])) * 100) >= 90
Expression Validity 1m
SNMP Trap ID 2004
Affects Service (Y/N) Y
Recommended Action

Cause: When the RAM or memory usage of any node reaches the critical level of threshold value.

Diagnostic Information: Check whether the data memory usage reported by the db_tier_memory_used_bytes and db_tier_memory_total_bytes metrics (used in the alert condition) is too high.

Recommended Actions:

  1. Check the memory (Data and Index) usage of the data nodes.
    $ kubectl -n <NAMESPACE> exec -it ndbmgmd-0 -- ndb_mgm -e "ALL REPORT memoryusage"
  2. Check if the exceptions in the exception tables are continuously increasing, and if the record count is high, involve the NF team to clean up the entries and investigate the root cause.
    1. Identify the exception tables using the following query:
      $ kubectl -n <namespace> exec -it ndbappmysqld-0 -- mysql -h 127.0.0.1 -uroot -p<PASSWORD>
      mysql> select TABLE_SCHEMA, TABLE_NAME from information_schema.TABLES where TABLE_NAME LIKE '%$EX';
    2. Get the count in the exception table from the exception table lists:
      mysql> select count(*) AS COUNT from <ABOVE_TABLE_SCHEMA>.<ABOVE_TABLE_NAME>;
    3. If the number of subscribers continues to increase, scale the database capacity by following the procedures below. Before proceeding with the scaling, ensure that the memory resources and DataMemory configuration are reviewed and approved by the NF team:
      1. Horizontal scaling by adding more data nodes.
      2. Vertical scaling by increasing the memory allocated to ndbmtd pods and updating the DataMemory setting accordingly.
  3. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.12 cnDBTier CPU Usage Alerts

This section provides details about cnDBTier CPU usage alerts.

Table 6-32 HIGH_CPU

Field Details
Description This alert is triggered with major severity when the CPU usage of any data node is greater than or equal to 80%, and less than 90%.
Summary Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition ((100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) >= 80) and ((100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) < 90)
Expression Validity 1m
SNMP Trap ID 2002
Affects Service (Y/N) N
Recommended Action

Cause: When the CPU utilization of any data node is greater than or equal to 80%, and less than 90%.

Diagnostic Information: Check the CPU threshold level status from the cnDBTier worker pod logs.

Recommended Actions:

  1. Monitor the CPU utilization of the pods.
  2. If CPU is not configured as per the cnDBTier dimensions, increase the CPU allocation by following the scaling procedures below. Ensure that the updated CPU configuration is reviewed and approved by the NF team before proceeding with the scaling operation. For more information, see the Scaling ndbmtd Pods section to scale the cnDBTier pods and increase the CPU allocation.
  3. Check if the traffic is exceeding the configured cnDBTier capacity as per the cnDBTier dimensions.
  4. Check the logs of the specific node where CPU utilization is high.
    kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname>
  5. Verify if any worker node, network, or infrastructure-related events occurred that could be causing CPU starvation for the pods.
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
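
For step 1, you can check the current CPU consumption of the cnDBTier pods with kubectl top, provided the Kubernetes metrics server is available in the cluster. This is a minimal sketch; the dbtierapp=ndbmtd label selector matches the data node pods as used elsewhere in this guide.

    $ kubectl top pods -n <namespace> -l dbtierapp=ndbmtd
    $ kubectl top pods -n <namespace> | grep -E 'ndbmysqld|ndbappmysqld'

Compare the reported CPU values against the CPU requests and limits configured for the pods to confirm whether the allocation matches the cnDBTier dimensions.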

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-33 HIGH_CPU

Field Details
Description This alert is triggered with critical severity when the CPU usage of any data node is greater than or equal to 90%.
Summary Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) >= 90
Expression Validity 1m
SNMP Trap ID 2035
Affects Service (Y/N) N
Recommended Action

Cause: When the CPU utilization of any data node is greater than or equal to 90%.

Diagnostic Information: Check the CPU threshold level status from the cnDBTier worker pod logs.

Recommended Actions:

  1. Monitor the CPU utilization of the pods.
  2. If CPU is not configured as per the cnDBTier dimensions, increase the CPU allocation by following the scaling procedures below. Ensure that the updated CPU configuration is reviewed and approved by the NF team before proceeding with the scaling operation. For more information, see the Scaling ndbmtd Pods section to scale the cnDBTier pods and increase the CPU allocation.
  3. Check if the traffic is exceeding the configured cnDBTier capacity as per the cnDBTier dimensions.
  4. Check the logs of the specific node where CPU utilization is high.
    kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname>
  5. Verify if any worker node, network, or infrastructure-related events occurred that could be causing CPU starvation for the pods.
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.13 cnDBTier Node Status Alerts

This section provides details about cnDBTier node status alerts.

Table 6-34 NODE_DOWN

Field Details
Description This alert is raised with critical severity when the data node is down. db_tier_node_status value:
  • 0: indicates that a node is DOWN
  • 1: indicates that the node is UP
Summary MySQL {{ $labels.node_type }} node having node id {{ $labels.node_id }} is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_node_status == 0
Expression Validity N/A
SNMP Trap ID 2001
Affects Service (Y/N) Y
Recommended Action
Cause:
  • When the pod restarts due to Kubernetes liveness or readiness probe failures
  • When cnDBTier application restarts or fails to start
Diagnostic Information:
  • Run the following command to check the status of the cnDBTier cluster nodes:
    kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
  • Check the Kubernetes events for probe failures in the platform logs.
  • Check if any exception is reported in the cnDBTier application logs.

Recommended Actions:

  1. Refer to the application logs on Kibana and filter based on pod name.
  2. Check the pod status. If the pod is not coming up, analyze the previous container logs of the pod for error information.
    $ kubectl -n <namespace> logs <ndbmtd/ndbmysqld/ndbappmysqld podname> --previous
  3. Also check the management pod logs for any frequent heartbeat missed warnings logged for the pod.
    $ kubectl -n <namespace> exec -it ndbmgmd-0 -- bash
    $ tail -f /var/occnedb/mysqlndbcluster/ndbmgmd_cluster.log
  4. Check if the pod can connect to the management pods and to the other pods.
  5. If only this pod is down and all the other pods are up and running, delete the pod and its PVC so that the pod restarts and reinitializes.
    $ kubectl -n <namespace> exec -it <ndbmtd/ndbmysqld/ndbappmysqld podname> -c db-infra-monitor-svc -- bash
    $ ndb_mgm -c ndbmgmd-0.ndbmgmdsvc:1186 -e show
    Follow the Disaster Recovery guide for restoring the single node.
  6. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
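
To check the Kubernetes events for probe failures mentioned in the diagnostic information, you can list the events recorded for the affected pod. This is a minimal sketch; the pod name is a placeholder.

    $ kubectl -n <namespace> get events --field-selector involvedObject.name=<pod name> --sort-by=.lastTimestamp

Liveness or readiness probe failures, OOM kills, and scheduling problems appear here with timestamps, which helps correlate the restart with the application logs collected in step 1.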

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.14 cnDBTier Node Data Volume Alerts

This section provides details about cnDBTier node data volume alerts.

Table 6-35 DB_TIER_API_SEND_NODE_DATA_VOLUME_LOW

Field Details
Description This alert is triggered when any NDB application node sends less data to NDB when compared to the other NDB application nodes.
Summary Send Node Data Volume Low for API Node ID {{ $labels.remote_node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (((sum by (remote_node_id,namespace) (avg_over_time(rate(db_tier_node_transporter_bytes_received{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))/scalar(sum (avg_over_time(rate(db_tier_node_transporter_bytes_received{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m]))))*100) < (100/(scalar(count(count by (remote_node_id) (db_tier_node_transporter_bytes_received{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"})))*1.6))
Expression Validity NA
SNMP Trap ID 3001
Affects Service (Y/N) Y
Recommended Action

Cause: When an NDB application node sends less data to NDB when compared to the other NDB application nodes.

Diagnostic Information: The alert indicates that the NDB application node is slow; therefore, check the underlying infrastructure.

Recommended Actions:

  1. Identify the worker node on which the affected pod is running (see the example after these steps).
    $ kubectl get nodes -o wide
  2. Check if an SSH connection can be successfully established to the worker node.
  3. Capture a PCAP file and analyze it to identify any network disconnects or transmission activity.
  4. Check the file system and underlying hardware for any issues that might be causing slowness.
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
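
For steps 1 and 3, the following commands show how to find the worker node that hosts the affected pod and how to capture traffic on that node for analysis. This is a minimal sketch; it assumes tcpdump is installed on the worker node, and the capture file path and host filter are placeholders.

    $ kubectl -n <namespace> get pods -o wide | grep <pod name>
    $ sudo tcpdump -i any -w /tmp/<capture file>.pcap host <pod IP or peer IP>

Run tcpdump on the worker node identified by the first command, let it run while the alert is active, and analyze the capture for retransmissions or disconnects.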

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-36 DB_TIER_API_RECEIVE_NODE_DATA_VOLUME_LOW

Field Details
Description This alert is triggered when any NDB sends less data to any specific NDB application node when compared to the other NDB application nodes.
Summary Receive Node Data Volume Low for API Node ID {{ $labels.remote_node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (((sum by (remote_node_id,namespace) (avg_over_time(rate(db_tier_node_transporter_bytes_sent{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))/scalar(sum (avg_over_time(rate(db_tier_node_transporter_bytes_sent{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m]))))*100) < (100/(scalar(count(count by (remote_node_id) (db_tier_node_transporter_bytes_sent{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"})))*1.6))
Expression Validity NA
SNMP Trap ID 3002
Affects Service (Y/N) Y
Recommended Action

Cause: When NDB sends less data to a specific NDB application node compared to the other NDB application nodes because the communication is intermittent, has stopped completely, or the underlying platform is slow.

Diagnostic Information: The alert indicates that the NDB application node or the underlying infrastructure is slow, or that the communication is intermittent or has stopped; therefore, check the underlying infrastructure.

Recommended Actions:

  1. Identify the worker node.
    $ kubectl get nodes -o wide
  2. Check if an SSH connection can be successfully established to the worker node.
  3. Capture a PCAP file and analyze it to identify any network disconnects or transmission activity.
  4. Check the file system and underlying hardware for any issues that might be causing the slowness.
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-37 DB_TIER_SEND_DATA_NODE_DATA_VOLUME_LOW

Field Details
Description This alert is triggered when any data node doesn't send the traffic data at the required speed or when it is slower compared to the other data nodes.
Summary Send Data Node Data Volume Low for DATA Node ID {{ $labels.node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition ((sum by (node_id,namespace) (avg_over_time(rate(db_tier_node_transporter_bytes_sent{namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m]))/scalar(sum (avg_over_time(rate(db_tier_node_transporter_bytes_sent{namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))) * 100) < (100/(scalar(count(count by (node_id) (db_tier_node_transporter_bytes_sent{namespace="<${CNDBTIER_NAMESPACE}>"})))*1.6))
Expression Validity NA
SNMP Trap ID 3003
Affects Service (Y/N) Y
Recommended Action

Cause: When any data node doesn't send the traffic data at the required speed or when it is slower compared to the other data nodes.

Diagnostic Information: The alert indicates that the data node is slow; therefore, check the underlying infrastructure.

Recommended Actions:

  1. Identify the worker node.
    $ kubectl get nodes -o wide
  2. Check if an SSH connection can be successfully established to the worker node.
  3. Capture a PCAP file and analyze it to identify any network disconnects or transmission activity.
  4. Check the file system and underlying hardware for any issues that might be causing the slowness.
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-38 DB_TIER_RECEIVE_DATA_NODE_DATA_VOLUME_LOW

Field Details
Description This alert is triggered when any data node doesn't receive the traffic data at the required speed or when it is slower compared to the other data nodes.
Summary Receive Data Node Data Volume Low for DATA Node ID {{ $labels.node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition ((sum by (node_id,namespace) (avg_over_time(rate(db_tier_node_transporter_bytes_received{namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m]))/scalar(sum (avg_over_time(rate(db_tier_node_transporter_bytes_received{namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))) * 100)< (100/(scalar(count(count by (node_id) (db_tier_node_transporter_bytes_received{namespace="<${CNDBTIER_NAMESPACE}>"})))*1.6))
Expression Validity NA
SNMP Trap ID 3004
Affects Service (Y/N) Y
Recommended Action

Cause: When any data node doesn't receive the traffic data at the required speed or when it is slower compared to the other data nodes.

Diagnostic Information: The alert indicates that the data node is slow; therefore, check the underlying infrastructure.

Recommended Actions:
  1. Identify the worker node.
    $ kubectl get nodes -o wide
  2. Check if an SSH connection can be successfully established to the worker node.
  3. Capture a PCAP file and analyze it to identify any network disconnects or transmission activity.
  4. Check the file system and underlying hardware for any issues that might be causing slow response.
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

For any assistance, contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-39 DB_TIER_DATA_NODE_SCAN_FRAGMENT_SLOW

Field Details
Description This alert is triggered when any data node scan fragment is slow when compared with other data nodes.
Summary Scan Fragment is Slow for DATA Node ID {{ $labels.node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (((sum by (node_id,comm_node_id,namespace) (rate(db_tier_tc_time_track_stats_total_scan_fragments_time{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}[15m]))/sum by (node_id,comm_node_id,namespace) (rate(db_tier_tc_time_track_stats_total_scan_fragments_count{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}[15m])))/ (scalar(sum(sum by (node_id,comm_node_id,namespace) (rate(db_tier_tc_time_track_stats_total_scan_fragments_time{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}[15m]))/sum by (node_id,comm_node_id,namespace) (rate(db_tier_tc_time_track_stats_total_scan_fragments_count{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}[15m]))))))*100) > ((100/scalar(count(sum by (node_id,comm_node_id,namespace)(db_tier_tc_time_track_stats_total_scan_fragments_time{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}))))*1.6)
Expression Validity NA
SNMP Trap ID 3005
Affects Service (Y/N) Y
Recommended Action

Cause: When the scan fragment for any particular data node is slow.

Diagnostic Information: The alert indicates that the data node is slow; therefore, check the underlying infrastructure.

Recommended Actions:

  1. Identify the worker node.
    $ kubectl get nodes -o wide
  2. Check if an SSH connection can be successfully established to the worker node.
  3. Capture a PCAP file and analyze it to identify any network disconnects or transmission activity.
  4. Check the file system and underlying hardware for any issues that might be causing slow response.
  5. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.15 cnDBTier Certificate Expiry Alerts

This section provides details about cnDBTier certificate expiry alerts.

Table 6-40 DBTIER_CERTIFICATE_EXPIRY_INFO

Field Details
Description This alert is triggered with info severity whenever the certificate for a cnDBTier is set to expire within the next 90 days.
Summary dbtier Certificate {{ $labels.certType }} for {{ $labels.hostname }} is expiring within 90 days for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity info
Condition (db_tier_cert_expiry / 1000 - time()) > 2592000 and (db_tier_cert_expiry / 1000 - time()) <= 7776000
Expression Validity NA
OID 1.3.6.1.4.1.323.5.3.50.1.2.2045
Metric Used db_tier_cert_expiry
Affects Service (Y/N) N
Recommended Action

Cause: This alert is triggered when any cnDBTier certificate is going to expire in next 90 days.

Diagnostic Information: This alert is triggered with info severity whenever the certificate for a cnDBTier is set to expire within the next 90 days.

Recommended Actions:

  1. Update the cnDBTier certificates by following the steps provided in the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide.
  2. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note:

    Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
Available in OCI Yes
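
If you need to confirm the actual expiry date of a certificate flagged by this alert (or by the related major and critical alerts that follow), you can decode it with openssl. This is a minimal sketch; the secret name and the tls.crt key are placeholders and depend on how the certificates are stored in your deployment.

    $ kubectl -n <namespace> get secret <certificate secret name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -enddate

Compare the notAfter date in the output with the 90, 30, and 7 day windows used by the certificate expiry alerts, and renew the certificate by following the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide.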

Table 6-41 DBTIER_CERTIFICATE_EXPIRY_MAJOR

Field Details
Description This alert is triggered with major severity whenever the certificate for a cnDBTier is set to expire within the next 30 days.
Summary dbtier Certificate {{ $labels.certType }} for {{ $labels.hostname }} is expiring within 30 days for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition (db_tier_cert_expiry / 1000 - time()) > 604800 and (db_tier_cert_expiry / 1000 - time()) <= 2592000
OID 1.3.6.1.4.1.323.5.3.50.1.2.2040
Metric Used db_tier_cert_expiry
Expression Validity NA
Affects Service (Y/N) N
Recommended Action

Cause: This alert is triggered when any cnDBTier certificate is going to expire in next 30 days.

Diagnostic Information: This alert is triggered with major severity whenever a cnDBTier certificate is set to expire within the next 30 days.

Recommended actions:

  1. Update the cnDBTier certificates by following the steps provided in the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide.

    Depending on the certificate type indicated in the alert, follow the appropriate procedure.

  2. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note:

    Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
Available in OCI Yes

Table 6-42 DBTIER_CERTIFICATE_EXPIRY_CRITICAL

Field Details
Description This alert is triggered with critical severity whenever the certificate for a cnDBTier is set to expire within the next 7 days.
Summary dbtier Certificate {{ $labels.certType }} for {{ $labels.hostname }} is expiring within 7 days for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (db_tier_cert_expiry / 1000 - time()) > 0 and (db_tier_cert_expiry / 1000 - time()) <= 604800
OID 1.3.6.1.4.1.323.5.3.50.1.2.2041
Metric Used db_tier_cert_expiry
Expression Validity NA
Affects Service (Y/N) Y
Recommended Action

Cause: This alert is triggered when any cnDBTier certificate is going to expire in next 7 days.

Diagnostic Information: This alert is triggered with critical severity whenever the certificate for a cnDBTier is set to expire within the next 7 days.

Recommended actions:

  1. Update the cnDBTier certificates by following the steps provided in the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide.

    Depending on the certificate type indicated in the alert, follow the appropriate procedure.

  2. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note:

    Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
Available in OCI Yes

Table 6-43 DBTIER_CERTIFICATE_EXPIRED

Field Details
Description This alert is triggered with critical severity when any cnDBTier certificate has expired.
Summary dbtier Certificate {{ $labels.certType }} for {{ $labels.hostname }} is expired for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (db_tier_cert_expiry / 1000 - time()) <= 0
OID 1.3.6.1.4.1.323.5.3.50.1.2.2041
Metric Used db_tier_cert_expiry
Expression Validity NA
Affects Service (Y/N) Y
Recommended Action

Cause: This alert is triggered when any cnDBTier certificate has expired.

Diagnostic Information: This alert is triggered with critical severity when any cnDBTier certificate has expired.

Recommended actions:

  1. Update the cnDBTier certificates by following the steps provided in the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide.

    Depending on the certificate type indicated in the alert, follow the appropriate procedure.

  2. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note:

    Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
Available in OCI Yes

6.16 cnDBTier PVC Health Alerts

This section provides details about cnDBTier PVC health related alerts.

Table 6-44 PVC_NOT_ACCESSIBLE

Field Details
Description This alert is triggered with critical severity when the db_tier_pvc_is_accesible metric value is zero.
  • If the value of db_tier_pvc_is_accesible is 0, it indicates that the PVC is not accessible.
  • If the value of db_tier_pvc_is_accesible is 1, it indicates that the PVC is accessible.
Summary PVC is not accessible on cnDBTier site {{ $labels.site_name }}
Severity critical
Condition db_tier_pvc_is_accesible == 0
OID 1.3.6.1.4.1.323.5.3.50.1.2.2029
Metric Used db_tier_pvc_is_accesible
Expression Validity 1m
Affects Service (Y/N) Y
Recommended Action

Cause: When PVC is not accessible for read or write operation.

Diagnostic Information: The db_tier_pvc_is_accesible metric provides information about whether the PVC is accessible.

Recommended Actions:
  1. Verify the Cluster and Pod status.

    Run the following command to check the cluster status:
    kubectl -n <cnDBTier Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    Run the following command to check the pod status:
    kubectl get pod -n <cnDBTier Namespace>
  2. Retrieve the name of the pod whose PVC is not accessible from the db_tier_pvc_is_accessible metric; the pod name is provided in the hostname attribute.
  3. After retrieving the name of the pod, get the PVC associated with it and describe both the pod and the PVC. Look for the PVC bound status, any mount errors, and events indicating mount failure or volume timeout.

    To describe the pod, run the following command:

    kubectl -n <cnDBTier Namespace> describe pod <pod name>
    To retrieve the PVC associated with the Pod, run the following command:
    kubectl -n <cnDBTier Namespace> get pvc

    To describe the PVC, run the following command:

    kubectl -n <cnDBTier Namespace> describe pvc <pvc name>
  4. Check the PVC mounting inside the pod. Get mount_path from the db_tier_pvc_is_accesible metric. If mount_path is missing or empty, it confirms that the PVC isn't mounted properly (see also the write test after these steps).

    Run the following commands to check the mount_path by logging in to the pod:

    kubectl -n <cnDBTier Namespace> exec -it <pod name> -- bash
    ls -l /var/occnedb
    df -h | grep occnedb
  5. Restart the pod. Sometimes a simple restart remounts the PVC correctly.
    kubectl -n <cnDBTier Namespace> delete pod <pod name>
  6. Check the logs of the pod.

    Run the following command to check the logs of the main container:

    kubectl -n <cnDBTier Namespace> logs <pod name> -c <main container name>

    Run the following command to check logs of the infra monitor svc container:

    kubectl -n <cnDBTier Namespace> logs <pod name> -c <db-infra-monitor-svc container name>
  7. This alert is cleared automatically when the db_tier_pvc_failure_count metric becomes zero. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
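
As a follow-up to step 4, a quick way to confirm whether the PVC mount accepts writes is to create and remove a small test file on the mount path. This is a minimal sketch; the test file name is a placeholder and /var/occnedb is the mount path used in step 4.

    kubectl -n <cnDBTier Namespace> exec -it <pod name> -- touch /var/occnedb/pvc_write_test
    kubectl -n <cnDBTier Namespace> exec -it <pod name> -- rm /var/occnedb/pvc_write_test

If either command fails with an I/O or read-only filesystem error, capture the error message along with the outputs of the previous steps before contacting My Oracle Support.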

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-45 PVC_STORAGE_FULL

Field Details
Description The PVC_STORAGE_FULL alert is triggered with critical severity when a pod's PVC reaches full capacity.
Summary PVC is not accessible on cnDBTier site {{ $labels.site_name }}
Severity critical
Condition db_tier_pvc_is_accesible == 0
OID 1.3.6.1.4.1.323.5.3.50.1.2.2029
Metric Used db_tier_pvc_is_accesible
Expression Validity 1m
Affects Service (Y/N) Y
Recommended Action

Cause: This alert is triggered when the PVC reaches full capacity, preventing further write operations.

Diagnostic Information: The system detects that the PVC has no available space, leading to storage-related failures.

Recommended Actions:
  1. Verify the Cluster and Pod status.

    Run the following command to check the cluster status:
    kubectl -n <cnDBTier Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    Run the following command to check the pod status:
    kubectl get pod -n <cnDBTier Namespace>
  2. From the db_tier_volume_stats_used_bytes metric, get the name of the pod whose PVC storage is full; the pod name is provided in the hostname attribute.
  3. Verify that the cnDBTier pods are configured with the resources (CPUs, Memory and PVC size) as per cnDBTier dimensions.
  4. If the PVC is not configured as per the cnDBTier dimensions, then increase the database capacity by following the scaling procedures, and get the PVC resources configuration reviewed by the NF team before performing the scaling (see the PVC capacity listing after these steps).
  5. Ensure that the application managing the PVC properly handles storage utilization. This alert is cleared automatically when sufficient space becomes available. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
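
For step 3, the configured PVC sizes can be listed alongside their bound status so that they can be compared with the cnDBTier dimensions. This is a minimal sketch using standard kubectl output formatting.

    kubectl -n <cnDBTier Namespace> get pvc -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage,STATUS:.status.phase

If the capacities are smaller than the dimensioned values, follow the vertical scaling procedure for the affected pods after the NF team has reviewed the configuration.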

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.17 cnDBTier Backup Manager Svc Down Alerts

This section provides details about cnDBTier backup manager Svc down alerts.

Table 6-46 DB_BACKUP_MANAGER_SVC_DOWN

Field Details
Description This alert is triggered with critical severity when db_backup_manager_svc pod is down.
Summary PVC is not accessible on cnDBTier site {{ $labels.site_name }}
Severity critical
Condition kube_deployment_status_replicas_available{deployment=~".*db-backup-manager-svc.*"} == 0
OID 1.3.6.1.4.1.323.5.3.50.1.2.2039
Metric Used
kube_deployment_status_replicas_available{deployment=~".*db-backup-manager-svc.*"}
Expression Validity 1m
Affects Service (Y/N) Y
Recommended Action

Cause: When db_backup_manager_svc pod is down.

Diagnostic Information: The system detects that the db_backup_manager_svc pod is not up or is unable to connect to the database.

Recommended Actions:
  1. Check the db-backup-manager service pod status. Look for CrashLoopBackOff, ImagePullBackOff, OOMKilled, or Init container failures.

    Run the following command to check the pod status:
    kubectl -n <cnDBTier namespace> get pods | grep "db-backup-manager-svc"
    Run the following command to describe the pod:
    kubectl -n <cnDBTier namespace> describe pod <backup-manager-pod name>
  2. Check the deployment status.

    Look for Available replicas: 0 and check the Events section for scheduling or image pull errors (see also the rollout status example after these steps).
    Run the following command to get the db-backup-manager-svc deployment:
    kubectl -n <cnDBTier namespace> get deployment | grep "db-backup-manager-svc"
    Run the following command to describe the deployment:
    kubectl -n <cnDBTier namespace> describe deployment <backup-manager-svc deployment>
  3. Check the DB Backup Manager service pod logs to see if there are a lot of database connectivity issues.
    Run the following command to get the backup-manager-svc pod:
    kubectl -n <cnDBTier namespace> get pods | grep "db-backup-manager-svc"

    Run the following command to check the logs of the backup-manager-svc pod:

    kubectl -n <cnDBTier namespace> logs <db-backup-manager-svc pod name>
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
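
In addition to step 2, the deployment rollout status and the logs of the previously terminated container often show the immediate failure reason. This is a minimal sketch; the deployment and pod names are placeholders.

    kubectl -n <cnDBTier namespace> rollout status deployment/<db-backup-manager-svc deployment>
    kubectl -n <cnDBTier namespace> logs <db-backup-manager-svc pod name> --previous

The --previous option prints the logs of the last terminated container instance, which is useful when the pod is stuck in CrashLoopBackOff.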

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.18 cnDBTier Forced Switchover Disabled Alerts

This section provides details about cnDBTier forced switchover disabled alerts.

Table 6-47 DB_TIER_FORCED_SWITCHOVER_DISABLED

Field Details
Description This alert is triggered with critical severity when switchover is disabled forcefully.
Summary dbtier switchover is disabled forcefully for cnDBTier {{ $labels.site_name }}and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition kube_deployment_status_replicas_available{deployment=~".*db-backup-manager-svc.*"} == 0
OID 1.3.6.1.4.1.323.5.3.50.1.2.2039
Metric Used
kube_deployment_status_replicas_available{deployment=~".*db-backup-manager-svc.*"}
Expression Validity 1m
Affects Service (Y/N) Y
Recommended Action

Cause: When switchover is disabled forcefully.

Diagnostic Information: The alert informs the operator that switchover is currently disabled and needs to be updated.

Recommended Actions:
  1. Check the DBTIER_REPL_SITE_INFO table for the stop_repl_switchover value. If the value of stop_repl_switchover for the current site is 1, it means switchover is disabled forcefully. Get the site_name, mate_site_name and replchannel_group_id from the db_tier_stop_repl_switchover metric's attribute.

    Run the following command to connect to the database and check the replication_info.DBTIER_REPL_SITE_INFO table (a single-command variant is shown after these steps):
    kubectl -n <cnDBTier namespace> exec -it ndbmysqld-0 -- mysql -h127.0.0.1 -uroot -p<root user password>
    Run the following query to get the value of the column stop_repl_switchover:
    select stop_repl_switchover from replication_info.DBTIER_REPL_SITE_INFO where site_name='<site name>' and mate_site_name='<mate site name>' and replchannel_group_id=<replication channel grp id>;
  2. The alert will automatically clear once the switchover is enabled. If the stop_repl_switchover value retrieved from the above step is 1 and if you wish to re-enable the switchover, then call the API mentioned in the cnDBTier Switchover APIs section to re-enable the switchover.
  3. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.
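
The check in step 1 can also be run as a single command without an interactive mysql session. This is a minimal sketch; the root password placeholder matches the one used in step 1.

    kubectl -n <cnDBTier namespace> exec -it ndbmysqld-0 -- mysql -h127.0.0.1 -uroot -p<root user password> -e "select site_name, mate_site_name, replchannel_group_id, stop_repl_switchover from replication_info.DBTIER_REPL_SITE_INFO;"

A stop_repl_switchover value of 1 for the current site confirms that switchover is disabled forcefully; re-enable it using the API described in the cnDBTier Switchover APIs section.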

Note:

Use CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.