6 cnDBTier Alerts

cnDBTier generates an alert when a specified condition is met. You can view the alerts on the Prometheus dashboard and take the necessary action. Prometheus is installed as part of the common services during the vCNE installation. This section describes the available cnDBTier alerts.

6.1 cnDBTier Remote Server Backup Transfer Status Alerts

This section provides details about the cnDBTier remote server backup transfer status alerts.

Table 6-1 REMOTE_SERVER_BACKUP_TRANSFER_FAILED

Field Details
Description This alert is triggered with major severity when the transfer of backup to a remote server fails.
Summary Secure transfer of backup to remote server failed on cnDBTier site {{ $labels.site_name }}
Severity major
Condition db_tier_remote_server_backup_transfer_status == 1
Expression Validity NA
SNMP Trap ID 2031
Affects Service (Y/N) N
Recommended Action

Cause: The transfer of the backup to the remote server failed.

Diagnostic Information: Check the status of the db_tier_remote_server_backup_transfer_status metric (Table 5-7).

Recovery: This alert is cleared automatically when the backup transfer status is updated from the failed state to other states, that is, the db_tier_remote_server_backup_transfer_status metric value is updated to any value other than 1.

To recover from this issue:
  • check if the remote server is in a healthy state
  • check the network
  • check if enough space is available

For any assistance, collect the logs and contact My Oracle Support.

6.2 cnDBTier PVC Health Monitoring Alerts

This section provides details about the cnDBTier PVC health monitoring alerts.

Table 6-2 PVC_NOT_ACCESSIBLE

Field Details
Description This alert is triggered with a critical severity when PVC is not accessible (db_tier_pvc_is_accesible metric is 0).
Summary PVC is not accessible on cnDBTier site {{ $labels.site_name }}
Severity critical
Condition db_tier_pvc_is_accesible == 0
Expression Validity 1m
SNMP Trap ID 2029
Affects Service (Y/N) Y
Recommended Action

Cause: The system is unable to access PVC for read or write operation.

Diagnostic Information: The db_tier_pvc_is_accesible metric (Table 5-10) provides information if the PVC is accessible or not.

Recovery: This alert is cleared automatically when the PVC is accessible.

Check the PVC and ensure that the PVC is accessible and doesn't have any errors.

For any assistance, contact My Oracle Support.

Table 6-3 PVC_FAILURE_COUNT

Field Details
Description This alert is triggered with info severity when the value of db_tier_pvc_failure_count was greater than 0 during the last scrape. The value of db_tier_pvc_failure_count metric indicates the number of times the PVC was not accessible.
Summary PVC of {{ $labels.hostname }} was unavailable for {{ $value }} times in the last 10m on cnDBTier site {{ $labels.site_name }}
Severity info
Condition sum(sum_over_time(db_tier_pvc_failure_count[10m])) by (hostname, site_name, namespace, mount_path) > 0
Expression Validity 5m
SNMP Trap ID 2030
Affects Service (Y/N) N
Recommended Action

Cause: The system is unable to access the PVC for read or write operation.

Diagnostic Information: The db_tier_pvc_failure_count metric (Table 5-11) provides information about the number of times the PVC was not accessible during the last scrape.

Recovery: This alert is cleared automatically when the db_tier_pvc_failure_count metric is zero.

Check the PVC and ensure that the PVC is accessible and doesn't have any errors.

For any assistance, contact My Oracle Support.
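The Condition row above sums every db_tier_pvc_failure_count sample from the last 10 minutes for each (hostname, site_name, namespace, mount_path) group and fires when the sum is greater than 0. The following is a minimal Python sketch of that aggregation. It assumes hypothetical in-memory samples instead of a real Prometheus query, and the function name and sample layout are illustrative only.

```python
from collections import defaultdict

# Labels used for grouping, taken from the "by (...)" clause of the Condition row.
GROUP_KEYS = ("hostname", "site_name", "namespace", "mount_path")


def pvc_failures_in_window(samples, now, window_seconds=600):
    """Sum samples per label group over the window and keep groups whose sum is > 0.

    samples: iterable of (timestamp_seconds, labels_dict, value) tuples (hypothetical layout).
    """
    totals = defaultdict(float)
    for timestamp, labels, value in samples:
        # Only samples inside the look-back window contribute to the sum.
        if now - window_seconds < timestamp <= now:
            key = tuple(labels.get(k, "") for k in GROUP_KEYS)
            totals[key] += value
    return {key: total for key, total in totals.items() if total > 0}
```

A group that returns here corresponds to one firing PVC_FAILURE_COUNT alert, and the summed value plays the role of {{ $value }} in the Summary row.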

6.3 cnDBTier Backup Transfer Status Alerts

This section provides details about the cnDBTier backup transfer status alerts.

Table 6-4 BACKUP_TRANSFER_LOCAL_FAILED

Field Details
Description This alert is triggered with major severity when the system fails to transfer the backup from the data node to the replication service pod on the cnDBTier site (db_tier_backup_transfer_status metric value is 2).
Summary Failed to transfer backup from data node to replication service pod on cnDBTier site {{ $labels.site_name }}
Severity major
Condition db_tier_backup_transfer_status == 2
Expression Validity NA
SNMP Trap ID 2026
Affects Service (Y/N) Y
Recommended Action

Cause: The system failed to transfer a backup from the data node to the replication service pod on a cnDBTier site.

Diagnostic Information: The db_tier_backup_transfer_status metric (Table 5-8) provides information about the backup transfer status.

Recovery: This alert is cleared automatically when the db_tier_backup_transfer_status metric is updated to a value other than 2.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-5 BACKUP_TRANSFER_FAILED

Field Details
Description This alert is triggered with major severity when the system fails to transfer the backup from the cnDBTier site to the remote site (db_tier_backup_transfer_status metric value is 3).
Summary Failed to transfer backup to remote site from cnDBTier site {{ $labels.site_name }}
Severity major
Condition db_tier_backup_transfer_status == 3
Expression Validity NA
SNMP Trap ID 2027
Affects Service (Y/N) Y
Recommended Action

Cause: The system failed to transfer a backup from the cnDBTier site to a remote site.

Diagnostic Information: The db_tier_backup_transfer_status metric (Table 5-8) provides information about the backup transfer status.

Recovery: This alert is cleared automatically when the db_tier_backup_transfer_status metric is updated to a value other than 3.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-6 BACKUP_TRANSFER_IN_PROGRESS

Field Details
Description This alert is triggered with info severity when the backup transfer is in progress on the cnDBTier site (db_tier_backup_transfer_status metric value is 1).
Summary Backup Transfer is In Progress on cnDBTier site {{ $labels.site_name }}
Severity info
Condition db_tier_backup_transfer_status == 1
Expression Validity NA
SNMP Trap ID 2028
Affects Service (Y/N) N
Recommended Action

Cause: Backup transfer is in progress on the cnDBTier site.

Diagnostic Information: The db_tier_backup_transfer_status metric (Table 5-8) provides information about the backup transfer status.

Recovery: This alert is cleared automatically when the db_tier_backup_transfer_status metric is updated to a value other than 1.

For any assistance, collect the logs and contact My Oracle Support.
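Tables 6-4 to 6-6 all evaluate the same db_tier_backup_transfer_status metric and differ only in the value they match. The following Python sketch summarizes that mapping. It covers only the values 1, 2, and 3 documented in this section and treats every other value as raising no alert, which is an assumption made for illustration.

```python
# Metric values and severities as listed in Tables 6-4, 6-5, and 6-6.
BACKUP_TRANSFER_ALERTS = {
    1: ("BACKUP_TRANSFER_IN_PROGRESS", "info"),
    2: ("BACKUP_TRANSFER_LOCAL_FAILED", "major"),
    3: ("BACKUP_TRANSFER_FAILED", "major"),
}


def backup_transfer_alert(status_value):
    """Return (alert_name, severity) for a metric value, or None if no alert matches."""
    return BACKUP_TRANSFER_ALERTS.get(int(status_value))
```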

6.4 cnDBTier Heartbeat Alerts

This section provides details about cnDBTier heartbeat alerts.

Table 6-7 HEARTBEAT_FAILED

Field Details
Description This alert is triggered with critical severity when HeartBeat fails on a remote site.
Summary HeartBeat failed on cnDBTier site {{ $labels.site_name }} connected to mate site {{ $labels.mate_site_name }} on replication channel group id {{ $labels.replchannel_group_id }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_heartbeat_failure == 1
Expression Validity NA
SNMP Trap ID 2025
Affects Service (Y/N) Y
Recommended Action

Cause: The system is unable to connect to the remote site, so the heartbeat failed.

Diagnostic Information: The db_tier_heartbeat_failure metric (Table 5-12) provides information about the heartbeat status and indicates whether the remote site is reachable or not.

Recovery: This alert is cleared automatically when the db_tier_heartbeat_failure metric is 0.

For any assistance, collect the logs and contact My Oracle Support.

6.5 cnDBTier BinLog Injector Thread Alerts

This section provides details about cnDBTier BinLog injector alerts.

Table 6-8 BINLOG_INJECTOR_STOPPED

Field Details
Description This alert is triggered with critical severity when Bin Log Injector stops working.
The value of db_tier_binlog_injector_thread or db_tier_binlog_injector_thread_latest_epoch indicates the status of Bin Log Injector:
  • 0: indicates that the Bin Log Injector thread is not stopped for the specified node ID
  • 1: indicates that the Bin Log Injector thread is stopped for the specified node ID
Summary BinLog Injector Thread is stopped for MySQL node having node id {{ $labels.node_id }} on cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition

db_tier_binlog_injector_thread_latest_epoch == 1

or

db_tier_binlog_injector_thread == 1

Expression Validity NA
SNMP Trap ID 2024
Affects Service (Y/N) Y
Recommended Action

Cause: Bin Log Injector thread stalled for the replication SQL node.

Diagnostic Information: The db_tier_binlog_injector_thread_latest_epoch or db_tier_binlog_injector_thread metrics (Table 5-85 or Table 5-84) provide information whether the Bin Log Injector thread is stalled or not.

Recovery: This alert is cleared automatically when the db_tier_binlog_injector_thread_latest_epoch or db_tier_binlog_injector_thread metric is 0.

For any assistance, collect the logs and contact My Oracle Support.

6.6 cnDBTier Replication Error Skip Alerts

This section provides details about the cnDBTier replication error skip alerts.

Table 6-9 REPLICATION_SWITCHOVER_DUE_CLUSTERDISCONNECT

Field Details
Description This alert is triggered when a switchover occurs on an API node due to a configured cluster disconnect error while skip replication error is enabled.
Summary Replication channel on SQL node with node ID {{ $labels.node_id }} had switchover due to cluster disconnect error number {{ $labels.error_number }}
Severity info
Condition db_tier_replication_switchover_due_to_clusterdisconnect == 1
Expression Validity NA
SNMP Trap ID 2019
Affects Service (Y/N) N
Recommended Action

Cause: Skip replication error is enabled on an API node and a switchover occurred on the node due to a configured cluster disconnect error.

Diagnostic Information: The db_tier_replication_switchover_due_to_clusterdisconnect metric (Table 5-83) provides information whether a switchover occurred on an API node.

Recovery: This alert is cleared automatically one hour after the event.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-10 REPLICATION_TOO_MANY_EPOCHS_LOST

Field Details
Description This alert is triggered when the epochs lost due to skip error is greater than 10000 and less than or equal to 80000.

This alert is cleared one hour after the event.

Summary Too many epochs are lost for skipping replication errors
Severity major
Condition (db_tier_epochs_lost_due_to_skiperror > 10000) and (db_tier_epochs_lost_due_to_skiperror <= 80000)
Expression Validity NA
SNMP Trap ID 2020
Affects Service (Y/N) N
Recommended Action

Cause: Between 10000 and 80000 epochs are lost due to skip errors.

Diagnostic Information: The db_tier_epochs_lost_due_to_skiperror metric (Table 5-82) provides information about the number of epochs lost due to skip errors.

Recovery: This alert is cleared automatically one hour after the event.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-11 REPLICATION_SKIP_ERRORS_LOW

Field Details
Description This alert is triggered when the replication is halted due to skip error count less than or equal to 5.

This alert is cleared one hour after the event.

Summary Cross-site replication errors are skipped
Severity minor
Condition (db_tier_replication_halted_due_to_skiperror > 0) and (db_tier_replication_halted_due_to_skiperror <= 5)
Expression Validity NA
SNMP Trap ID 2021
Affects Service (Y/N) N
Recommended Action

Cause: Replication halted due to five or fewer skip errors.

Diagnostic Information: The db_tier_replication_halted_due_to_skiperror metric (Table 5-81) provides information about the number of skip errors due to which the replication halted.

Recovery: This alert is cleared automatically one hour after the event.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-12 REPLICATION_SKIP_ERRORS_HIGH

Field Details
Description This alert is triggered when the replication is halted due to skip error counts greater than 5.

This alert is cleared one hour after the event.

Summary Cross-site replication errors skipped are high
Severity major
Condition db_tier_replication_halted_due_to_skiperror > 5
Expression Validity NA
SNMP Trap ID 2022
Affects Service (Y/N) N
Recommended Action

Cause: Replication halted due to more than five skip errors.

Diagnostic Information: The db_tier_replication_halted_due_to_skiperror metric (Table 5-81) provides information about the number of skip errors due to which the replication halted.

Recovery: This alert is cleared automatically one hour after the event.

For any assistance, collect the logs and contact My Oracle Support.
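Tables 6-11 and 6-12 split the db_tier_replication_halted_due_to_skiperror metric into two bands at a count of 5. The following Python sketch shows that split with the thresholds taken from the Condition rows. The function name is hypothetical.

```python
def skip_error_alert(halted_skip_errors):
    """Classify db_tier_replication_halted_due_to_skiperror per Tables 6-11 and 6-12."""
    if halted_skip_errors > 5:
        return ("REPLICATION_SKIP_ERRORS_HIGH", "major")
    if halted_skip_errors > 0:
        # Covers 1 through 5 inclusive, matching "> 0 and <= 5".
        return ("REPLICATION_SKIP_ERRORS_LOW", "minor")
    return None
```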

Table 6-13 REPLICATION_EPOCHS_LOST

Field Details
Description This alert is triggered when the epochs lost due to skip error is greater than 0 and less than 2000.

This alert is cleared one hour after the event.

Summary Epochs are lost for skipping replication errors
Severity info
Condition db_tier_epochs_lost_due_to_skiperror > 0 and db_tier_epochs_lost_due_to_skiperror <= <Configured epoch interval lower threshold>
Expression Validity NA
SNMP Trap ID 2023
Affects Service (Y/N) N
Recommended Action

Cause: Less than 2000 epochs are lost due to skip errors.

Diagnostic Information: The db_tier_epochs_lost_due_to_skiperror metric (Table 5-82) provides information about the number of epochs lost due to skip errors.

Recovery: This alert is cleared automatically one hour after the event.

For any assistance, collect the logs and contact My Oracle Support.

6.7 cnDBTier Georeplication Recovery Status Alerts

This section provides details about the cnDBTier georeplication recovery status alerts.

Table 6-14 GEOREPLICATION_RECOVERY_IN_PROGRESS

Field Details
Description This alert is triggered with critical severity when the georeplication recovery is in progress and the alert is cleared when georeplication recovery is complete.
Summary Identified cnDBTier Site {{ $labels.site_name }} georeplication recovery is in progress for kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_georeplication_recovery_state == 1
Expression Validity 1m
SNMP Trap ID 2018
Affects Service (Y/N) Y
Recommended Action

Cause: Georeplication recovery is being performed to recover a failed site from a healthy site.

Diagnostic Information: The db_tier_georeplication_recovery_state metric (Table 5-39) provides information whether georeplication recovery is in progress.

Recovery: This alert is cleared automatically when the georeplication recovery is complete.

For any assistance, collect the logs and contact My Oracle Support.

6.8 cnDBTier Cluster Status Alerts

This section provides details about cnDBTier cluster status alerts.

Table 6-15 CLUSTER_DOWN

Field Details
Description This alert is triggered with critical severity when cnDBTier NDB cluster is not UP.
Summary MySQL Cluster is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_cluster_status == 0
Expression Validity 1m
SNMP Trap ID 2017
Affects Service (Y/N) Y
Recommended Action
Cause:
  • When pod restarts due to Kubernetes liveness or readiness probe failures.
  • When cnDBTier application restarts or fails to start.
Diagnostic Information:
  • Run the following command to check the status of cnDBTier namespace:
    kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    The cluster is down if:
    • the ndbappmysqld pods are down, not running, and not connected
    • the remaining pods are not running and not connected
  • Check Kubernetes events for probe failures in the platform logs.
  • Check if any exception is reported in the cnDBTier application logs.

Recovery: This alert is cleared automatically when the inactive pod is active.

For any assistance, collect the logs and contact My Oracle Support.
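The diagnostic command above can also be run from a script to list the nodes that the management node reports as not connected. The following Python sketch assumes that kubectl is on the PATH, that the management pod is named ndbmgmd-0 as in the command above, and that disconnected nodes appear in the ndb_mgm show output as lines of the form "id=<n> (not connected, ...)". The function names are hypothetical, and the -it flags are dropped because no interactive terminal is needed when capturing output.

```python
import subprocess


def ndb_show_command(namespace, mgm_pod="ndbmgmd-0"):
    """Build the kubectl command from the diagnostic step as an argument list."""
    return ["kubectl", "-n", namespace, "exec", mgm_pod, "--", "ndb_mgm", "-e", "show"]


def disconnected_node_ids(show_output):
    """Return the IDs of nodes that the show output marks as 'not connected'."""
    ids = []
    for line in show_output.splitlines():
        line = line.strip()
        if line.startswith("id=") and "not connected" in line:
            # The node ID is the first token after "id=".
            ids.append(int(line[3:].split()[0]))
    return ids


def check_cluster(namespace):
    """Run the command and return the disconnected node IDs (needs cluster access)."""
    result = subprocess.run(
        ndb_show_command(namespace), capture_output=True, text=True, check=True
    )
    return disconnected_node_ids(result.stdout)
```

An empty list from check_cluster suggests that all nodes are connected. Whether the cluster counts as down still depends on the criteria listed above.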

Table 6-16 MYSQL_NDB_CLUSTER_DISCONNECT

Field Details
Description This alert is triggered with critical severity when cnDBTier NDB cluster is not UP.
Summary MySQL NDB cluster disconnect for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition sum_over_time(db_tier_cluster_disconnect[5m]) > 0
Expression Validity 1m
SNMP Trap ID 2034
Affects Service (Y/N) Y
Recommended Action
Cause:
  • When pod restarts due to Kubernetes liveness or readiness probe failures.
  • When cnDBTier application restarts or fails to start.
Diagnostic Information:
  • Run the following command to check the status of cnDBTier namespace:
    kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
    The cluster is down if:
    • the ndbappmysqld pods are down, not running, and not connected
    • the remaining pods are not running and not connected
  • Check Kubernetes events for probe failures in the platform logs.
  • Check if any exception is reported in the cnDBTier application logs.

Recovery: This alert is cleared automatically when the inactive pod is active.

For any assistance, collect the logs and contact My Oracle Support.

6.9 cnDBTier Automated Backup Alerts

This section provides details about the cnDBTier automated backup alerts.

Table 6-17 BACKUP_FAILED

Field Details
Description This alert is triggered with minor severity when the backup service fails to complete the backup successfully.
Summary Could not backup database for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition db_tier_backup{status='FAILED'}
Expression Validity N/A
SNMP Trap ID 2011
Affects Service (Y/N) N
Recommended Action
Cause:
  • When backup service fails to complete the backup successfully.
  • When PVC size is not enough and as a result the backup fails.

Diagnostic Information: The db_tier_backup metric (Table 5-38) provides information if the backup failed or not.

Recovery: This alert is cleared automatically when the db_tier_backup metric status changes from the FAILED state to other states.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-18 BACKUP_PURGED_EARLY

Field Details
Description This alert is triggered with minor severity when the backup service purges old backups earlier than expected to create space for new backup.
Summary A backup was deleted prematurely for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition db_tier_backup{status='PURGED_EARLY'}
Expression Validity N/A
SNMP Trap ID 2012
Affects Service (Y/N) N
Recommended Action

Cause: When the backup service purges the old backups earlier than the expected time, to create space for a new backup.

Diagnostic Information: The db_tier_backup metric (Table 5-38) provides information if the backup is purged earlier than expected.

Recovery: This alert is cleared automatically when the db_tier_backup metric status changes from the PURGED_EARLY state to other states.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-19 BACKUP_SIZE_GROWTH

Field Details
Description This alert is triggered with minor severity whenever the current backup size exceeds the average of the previous backups by more than 5%.
Summary Backup size exceeded expected size for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition (db_tier_backup_used_disk_percentage/(avg_over_time(db_tier_backup_used_disk_percentage[5d])))>1.05
Expression Validity N/A
SNMP Trap ID 2013
Affects Service (Y/N) N
Recommended Action

Cause: When the current backup size exceeds the average of the previous backups by more than 5%.

Diagnostic Information: The db_tier_backup_used_disk_percentage metric (Table 5-36) provides information if the current backup size exceeds the average by more than 5%.

Recovery: This alert is cleared automatically when the db_tier_backup_used_disk_percentage metric value is reduced to the threshold percentage.

For any assistance, collect the logs and contact My Oracle Support.
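The Condition row above divides the current db_tier_backup_used_disk_percentage value by its average over the last five days and fires when the ratio is greater than 1.05. The following Python sketch reproduces that arithmetic. The history_pct argument stands in for the samples of the five-day window, and returning False for a zero baseline is an assumption of this sketch, not documented behavior.

```python
from statistics import mean


def backup_size_growth(current_pct, history_pct, ratio_threshold=1.05):
    """Mirror current / avg_over_time(history) > 1.05 from the Condition row."""
    baseline = mean(history_pct)
    if baseline == 0:
        # Assumption: skip the check when there is no usable baseline.
        return False
    return current_pct / baseline > ratio_threshold
```

For example, with a five-day average of 40 percent, a current value of 43 percent gives a ratio of 1.075 and matches the condition, while 41 percent gives 1.025 and does not.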

Table 6-20 BACKUP_STORAGE_LOW

Field Details
Description This alert is triggered with minor severity when the total backup size of the data node is >= 70% and < 80% of the total data node disk size.
Summary Disk storage on DATA node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition (avg_over_time(db_tier_backup_used_disk_percentage[5m])>=70) and (avg_over_time(db_tier_backup_used_disk_percentage[5m])<80)
Expression Validity N/A
SNMP Trap ID 2014
Affects Service (Y/N) N
Recommended Action

Cause: When the total backup size of the data node is >= 70% and < 80% of the total data node disk size.

Diagnostic Information: The db_tier_backup_used_disk_percentage metric (Table 5-36) provides information if the current backup size is >= 70% and < 80% of the total data node disk size.

Recovery:
  • This alert is cleared automatically when the db_tier_backup_used_disk_percentage metric value is reduced to the threshold percentage.
  • Increase the disk size by performing the scaling procedure.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-21 BACKUP_STORAGE_LOW

Field Details
Description This alert is triggered with major severity when the total backup size of the data node is >= 80% and < 95% of the total data node disk size.
Summary Disk storage on DATA node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition (avg_over_time(db_tier_backup_used_disk_percentage[5m])>=80) and (avg_over_time(db_tier_backup_used_disk_percentage[5m])<95)
Expression Validity N/A
SNMP Trap ID 2015
Affects Service (Y/N) N
Recommended Action

Cause: When the total backup size of the data node is >= 80% and < 95% of the total data node disk size.

Diagnostic Information: The db_tier_backup_used_disk_percentage metric (Table 5-36) provides information if the current backup size is >= 80% and < 95% of the total data node disk size.

Recovery:
  • This alert is cleared automatically when the db_tier_backup_used_disk_percentage metric value is reduced to the threshold percentage.
  • Increase the disk size by performing the scaling procedure.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-22 BACKUP_STORAGE_FULL

Field Details
Description This alert is triggered with critical severity when the total backup size of the data node is >= 95% of the total data node disk size.
Summary Disk storage on DATA node with node ID {{ $labels.node_id }} is full for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (avg_over_time(db_tier_backup_used_disk_percentage[5m])>=95)
Expression Validity N/A
SNMP Trap ID 2016
Affects Service (Y/N) N
Recommended Action

Cause: When the total backup size of the data node is >= 95% of the total data node disk size.

Diagnostic Information: The db_tier_backup_used_disk_percentage metric (Table 5-36) provides information if the current backup size is >= 95% of the total data node disk size.

Recovery:

  • This alert is cleared automatically when the db_tier_backup_used_disk_percentage metric value is reduced to the threshold percentage.
  • Increase the disk size by performing the scaling procedure.

Note: Take immediate action to avoid the cnDBTier cluster going out of service.

For any assistance, collect the logs and contact My Oracle Support.
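Tables 6-20 to 6-22 divide the 5-minute average of db_tier_backup_used_disk_percentage into three bands. The following Python sketch shows how a value maps to an alert, with the thresholds taken from the Condition rows. The function name is hypothetical.

```python
def backup_storage_alert(avg_used_pct):
    """Classify the 5-minute average of db_tier_backup_used_disk_percentage."""
    if avg_used_pct >= 95:
        return ("BACKUP_STORAGE_FULL", "critical")
    if avg_used_pct >= 80:
        return ("BACKUP_STORAGE_LOW", "major")
    if avg_used_pct >= 70:
        return ("BACKUP_STORAGE_LOW", "minor")
    return None
```

Because the bands do not overlap, at most one of the three alerts matches a given data node at a time.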

6.10 cnDBTier Bin Log Usage Alerts

This section provides details about the cnDBTier binlog usage alerts.

Table 6-23 BINLOG_STORAGE_LOW

Field Details
Description This alert is triggered with a minor severity when the total BinLog size of the SQL node is >= 70% and < 80% of the total SQL node disk size.
Summary Disk storage on SQL node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity minor
Condition (avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 70) and (avg_over_time( db_tier_binlog_used_bytes_percentage[5m] ) < 80)
Expression Validity 5m
SNMP Trap ID 2007
Affects Service (Y/N) N
Recommended Action

Cause: When the total BinLog size of the SQL node is >= 70% and < 80% of the total SQL node disk size.

Diagnostic Information: The db_tier_binlog_used_bytes_percentage metric (Table 5-33) provides information if the total BinLog size of the SQL node is >=70% and <80% of total SQL node disk size.

Recovery: This alert is cleared automatically when the value of the db_tier_binlog_used_bytes_percentage metric is reduced to the threshold value.

For any assistance, contact My Oracle Support.

Table 6-24 BINLOG_STORAGE_LOW

Field Details
Description This alert is triggered with major severity when the total BinLog size of the SQL node is >=80% and <95% of the total SQL node disk size.
Summary Disk storage on SQL node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition (avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 80) and (avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) < 95)
Expression Validity 5m
SNMP Trap ID 2036
Affects Service (Y/N) N
Recommended Action

Cause: When the total BinLog size of the SQL node is >=80% and <95% of the total SQL node disk size.

Diagnostic Information: The db_tier_binlog_used_bytes_percentage metric (Table 5-33) provides information if the total BinLog size of the SQL node is >=80% and <95% of total SQL node disk size.

Recovery: This alert is cleared automatically when the value of the db_tier_binlog_used_bytes_percentage metric is reduced to the threshold value.

For any assistance, contact My Oracle Support.

Table 6-25 BINLOG_STORAGE_FULL

Field Details
Description This alert is triggered with critical severity when the total BinLog size of the SQL node is >= 95% of the total SQL node disk size.
Summary Disk storage on SQL node with node ID {{ $labels.node_id }} is full for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 95
Expression Validity N/A
SNMP Trap ID 2008
Affects Service (Y/N) Y
Recommended Action

Cause: When the total BinLog size of the SQL node is >= 95% of the total SQL node disk size.

Diagnostic Information: The db_tier_binlog_used_bytes_percentage metric (Table 5-33) provides information if the total BinLog size of the SQL node is >= 95% of total SQL node disk size.

Recovery:

This alert is cleared automatically when the value of the db_tier_binlog_used_bytes_percentage metric is reduced to the threshold value.

Note: Take immediate action to avoid the SQL node going into the CrashLoopBackOff state and becoming inaccessible.

For any assistance, contact My Oracle Support.

6.11 cnDBTier Replication Alerts

This section provides details about cnDBTier replication alerts.

Table 6-26 REPLICATION_CHANNEL_DOWN

Field Details
Description This alert is triggered with major severity when an ACTIVE channel goes to the FAILED state.
Summary Cross-site replication is down on node {{ $labels.node_id }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition (db_tier_replication_status{role="failed"} == 0) or

(db_tier_replication_status{role="active"} == 0)

Expression Validity N/A
SNMP Trap ID 2005
Affects Service (Y/N) N
Recommended Action

Cause: When any ACTIVE channel goes to the FAILED state because the cross-site replication is down on a node.

Diagnostic Information: The following metrics provide information if the replication channel is down:
  • db_tier_replication_status{role="failed"} == 0 (Table 5-34)
  • db_tier_replication_status{role="active"} == 0 (Table 5-34)

Recovery:

This alert is cleared automatically when the cross-site replication is UP on the node and ACTIVE.

Note: Take immediate action to avoid the cnDBTier cluster going out of service.

For any assistance, contact My Oracle Support.

Table 6-27 REPLICATION_FAILED

Field Details
Description This alert is triggered with critical severity when all the channels are in the STANDBY or FAILED state.
Summary Cross-site replication is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (count by (site_name, namespace, replchannel_group_id) (db_tier_replication_status) == count by (site_name, namespace, replchannel_group_id) (db_tier_replication_status{role="standby"})) or (count by (namespace, replchannel_group_id, site_name) (db_tier_replication_status) == count by (namespace, replchannel_group_id, site_name) (db_tier_replication_status{role="failed"}))
Expression Validity N/A
SNMP Trap ID 2006
Affects Service (Y/N) Y
Recommended Action

Cause: When all the channels are in the STANDBY or FAILED state as the cross-site replication is down for the cnDBTier site.

Diagnostic Information: The db_tier_replication_status metric (Table 5-34) provides information if the replication failed.

Recovery:

This alert is cleared automatically when the cross-site replication of the cnDBTier site is ACTIVE.

For any assistance, contact My Oracle Support.
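The Condition row above compares, for each (site_name, namespace, replchannel_group_id) group, the total number of db_tier_replication_status series with the number of series whose role is standby, and separately with the number whose role is failed. The following Python sketch applies the same comparison to a hypothetical list of channel records. The record layout and function name are assumptions made for illustration.

```python
from collections import defaultdict


def replication_failed_groups(channels):
    """Return groups in which every channel is standby, or every channel is failed.

    channels: iterable of dicts with the keys site_name, namespace,
    replchannel_group_id, and role (hypothetical layout).
    """
    roles_by_group = defaultdict(list)
    for channel in channels:
        key = (channel["site_name"], channel["namespace"], channel["replchannel_group_id"])
        roles_by_group[key].append(channel["role"])
    return sorted(
        key
        for key, roles in roles_by_group.items()
        if all(role == "standby" for role in roles) or all(role == "failed" for role in roles)
    )
```

As the expression is written, a group with one standby channel and one failed channel matches neither comparison, so this sketch does not report it.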

Table 6-28 SLAVE_REPLICATION_DELAY_HIGH

Field Details
Description This alert is triggered when the last record read by the slave is more than five minutes behind the latest record written by the master.
Summary Slave replication on SQL node at {{ $labels.slave_node_ip }} is {{ $value }} seconds behind the master for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition avg(avg_over_time(db_tier_replication_slave_delay[5m])) by (master_node_ip,slave_node_ip) >= 300 and avg(avg_over_time(db_tier_replication_slave_delay[5m])) by (master_node_ip,slave_node_ip) < 48*3600
Expression Validity 1m
SNMP Trap ID 2009
Affects Service (Y/N) N
Recommended Action

Cause: When the last record read by the worker node is more than 5 minutes and less than 48 hours behind the latest record written by the controller.

Diagnostic Information: The db_tier_replication_slave_delay metric (Table 5-35) provides information if there is a delay in the worker node replication.

Recovery:

This alert is cleared automatically when the db_tier_replication_slave_delay metric value is reduced below the defined threshold value.

For any assistance, collect the logs and contact My Oracle Support.

Table 6-29 SLAVE_REPLICATION_FAILED

Field Details
Description This alert is triggered when the last record read by the slave is more than 48 hours behind the latest record written by the master.
Summary Slave replication has fallen more than 48 hours behind the master. Manual restore from backup may be required for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition avg(avg_over_time(db_tier_replication_slave_delay[5m])) by (master_node_ip,slave_node_ip) >= 48*3600
Expression Validity 1m
SNMP Trap ID 2010
Affects Service (Y/N) Y
Recommended Action

Cause: When the last record read by the worker node is more than 48 hours behind the latest record written by the controller.

Diagnostic Information: The db_tier_replication_slave_delay metric (Table 5-35) provides information if the worker node replication failed.

Recovery: Perform georeplication recovery for the DB sync. For procedures, see Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide.

This alert is cleared automatically when the db_tier_replication_slave_delay metric value is reduced below the defined threshold value.

For any assistance, collect the logs and contact My Oracle Support.
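Tables 6-28 and 6-29 both evaluate the 5-minute average of db_tier_replication_slave_delay and differ only in the delay band. The following Python sketch shows the two bands, with the thresholds taken from the Condition rows. The function and constant names are hypothetical, and the input list stands in for the samples of one master and slave node pair.

```python
from statistics import mean

DELAY_HIGH_SECONDS = 300          # five minutes
DELAY_FAILED_SECONDS = 48 * 3600  # 48 hours


def replication_delay_alert(delay_samples_seconds):
    """Classify the average replication delay of one 5-minute window."""
    avg_delay = mean(delay_samples_seconds)
    if avg_delay >= DELAY_FAILED_SECONDS:
        return ("SLAVE_REPLICATION_FAILED", "critical")
    if avg_delay >= DELAY_HIGH_SECONDS:
        return ("SLAVE_REPLICATION_DELAY_HIGH", "major")
    return None
```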

Table 6-30 REPLICATION_SVC_STORAGE_FULL

Field Details
Description This alert is triggered with critical severity whenever the PVC consumption of replication service is more than 90% of the overall storage of replication service.
Summary Disk storage of replication service PVC {{ $labels.persistentvolumeclaim }} is full on kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*replication.*"}/kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*replication.*"}) * 100 > 90
Expression Validity NA
SNMP Trap ID 2032
Affects Service (Y/N) Y
Recommended Action

Cause: The replication service PVC is filled to more than 90% of its overall storage. A full PVC can prevent the replication service from functioning properly.

Diagnostic Information: This alert indicates that PVC of replication service is almost full and requires immediate attention to address the storage issue.

Recovery: Free up storage so that the replication service can function properly, or scale the PVC to accommodate future data.
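
The condition above divides used bytes by capacity bytes and compares the result with 90. A minimal Python sketch of the same arithmetic follows, assuming the two kubelet values for one PVC have already been read; the function names are hypothetical.

```python
def pvc_usage_percent(used_bytes, capacity_bytes):
    """Return PVC usage as a percentage of capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100 * used_bytes / capacity_bytes


def replication_pvc_is_full(used_bytes, capacity_bytes, threshold_percent=90):
    """True when usage is strictly above the threshold, as in the alert condition."""
    return pvc_usage_percent(used_bytes, capacity_bytes) > threshold_percent
```

The comparison is strict, so a PVC at exactly 90% usage does not satisfy the condition.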

Table 6-31 GEOREPLICATION_RECOVERY_FAILED

Field Details
Description This alert is triggered with critical severity when georeplication recovery fails on an unhealthy site where georeplication recovery was started.
Summary Georeplication recovery has failed on cnDBTier Site {{ $labels.site_name }} from kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_georeplication_recovery_state == 2
Expression Validity NA
SNMP Trap ID 2033
Affects Service (Y/N) Y
Recommended Action

Cause: Incorrect disk size, incorrect SSH key configurations, or other similar reasons.

Diagnostic Information: This alert indicates that georeplication recovery failed on an unhealthy site and replication could not be reestablished using the georeplication recovery procedure. This alert requires immediate attention.

Recovery: Check the configurations like SSH key or disk size. Contact My Oracle Support for additional support.

6.12 cnDBTier Memory Usage Alerts

This section provides details about the cnDBTier memory usage alerts.

Table 6-32 LOW_MEMORY

Field Details
Description This alert is triggered with major severity when the RAM usage of any node is greater than or equal to 80%.
Summary Node ID {{ $labels.node_id }}, memory utilization at {{ $value }} percent for memory type {{ $labels.memory_type }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition ((avg_over_time(db_tier_memory_used_bytes{memory_type="Data memory"}[1m]) / avg_over_time(db_tier_memory_total_bytes{memory_type="Data memory"}[1m])) * 100) >= 80
Expression Validity 1m
SNMP Trap ID 2003
Affects Service (Y/N) N
Recommended Action

Cause: When the RAM or memory usage of any node reaches the major level of threshold value.

Diagnostic Information: Check if the memory usage of the following metrics are too high:

Recovery:

Reduce the incoming service request rate. This alert is cleared automatically when the memory usage of the cnDBTier worker pod is reduced below the defined threshold value.

For any assistance, contact My Oracle Support.

Table 6-33 OUT_OF_MEMORY

Field Details
Description This alert is triggered with critical severity when the RAM usage of any node is greater than or equal to 90%.
Summary Node ID {{ $labels.node_id }} out of memory for memory type {{ $labels.memory_type }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition ((avg_over_time(db_tier_memory_used_bytes{memory_type="Data memory"}[1m]) / avg_over_time(db_tier_memory_total_bytes{memory_type="Data memory"}[1m])) * 100) >= 90
Expression Validity 1m
SNMP Trap ID 2004
Affects Service (Y/N) Y
Recommended Action

Cause: When the RAM or memory usage of any node reaches the critical level of threshold value.

Diagnostic Information: Check if the memory usage of the following metrics are too high:

Recovery:

Reduce the incoming service request rate. This alert is cleared automatically when the memory usage of the cnDBTier worker pod is reduced below the defined threshold value.

Note: Take immediate action to avoid the cnDBTier cluster going out of service.

For any assistance, contact My Oracle Support.
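
The two memory conditions above use the same ratio with different thresholds, and the LOW_MEMORY condition has no upper bound. The Python sketch below illustrates which conditions a given reading satisfies. It assumes the averaged used and total byte values for the "Data memory" type are already available; the function name is hypothetical.

```python
# Thresholds in percent, taken from the LOW_MEMORY and OUT_OF_MEMORY conditions.
LOW_MEMORY_PERCENT = 80
OUT_OF_MEMORY_PERCENT = 90


def memory_alerts(used_bytes, total_bytes):
    """Return the names of the memory alerts whose condition is satisfied."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    usage_percent = 100 * used_bytes / total_bytes
    alerts = []
    if usage_percent >= LOW_MEMORY_PERCENT:
        alerts.append("LOW_MEMORY")
    if usage_percent >= OUT_OF_MEMORY_PERCENT:
        alerts.append("OUT_OF_MEMORY")
    return alerts
```

As written in this section, a node at or above 90% satisfies both conditions, because the LOW_MEMORY expression does not exclude values of 90% and higher.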

6.13 cnDBTier CPU Usage Alerts

This section provides details about cnDBTier CPU usage alerts.

Table 6-34 HIGH_CPU

Field Details
Description This alert is triggered with major severity when the CPU usage of any node is greater than or equal to 80%, and less than 90%.
Summary Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity major
Condition ((100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) >= 80) and ((100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) < 90)
Expression Validity 1m
SNMP Trap ID 2002
Affects Service (Y/N) N
Recommended Action

Cause: When the CPU utilization of any node is greater than or equal to 80%, and less than 90%.

Diagnostic Information: Check the CPU threshold level status from the cnDBTier worker pod logs.

Recovery:

Reduce the incoming service request rate. This alert is cleared automatically when the CPU utilization is reduced below the threshold value.

For any assistance, contact My Oracle Support.

Table 6-35 HIGH_CPU

Field Details
Description This alert is triggered with critical severity when the CPU usage of any node is greater than or equal to 90%.
Summary Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition (100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) >= 90
Expression Validity 1m
SNMP Trap ID 2035
Affects Service (Y/N) N
Recommended Action

Cause: When the CPU utilization of any node is greater than or equal to 90%.

Diagnostic Information: Check the CPU threshold level status from the cnDBTier worker pod logs.

Recovery:

Reduce the incoming service request rate. This alert is cleared automatically when the CPU utilization is reduced below the threshold value.

Note: Take immediate action to avoid the cnDBTier cluster going out of service.

For any assistance, contact My Oracle Support.
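
Both CPU conditions above derive utilization as 100 minus the averaged idle percentage. The following Python sketch shows that derivation and the resulting severity bands. It assumes a list of db_tier_cpu_os_idle samples for one node; the function names are hypothetical.

```python
def cpu_utilization_percent(idle_samples):
    """Return 100 minus the average idle percentage over the window."""
    if not idle_samples:
        raise ValueError("at least one sample is required")
    return 100 - sum(idle_samples) / len(idle_samples)


def cpu_alert_severity(idle_samples):
    """Return "major", "critical", or None according to the conditions above."""
    utilization = cpu_utilization_percent(idle_samples)
    if utilization >= 90:
        return "critical"
    if utilization >= 80:
        return "major"       # at least 80% and less than 90%
    return None
```

Unlike the memory conditions, the major CPU condition is bounded above at 90%, so a node matches at most one of the two severities.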

6.14 cnDBTier Node Status Alerts

This section provides details about cnDBTier node status alerts.

Table 6-36 NODE_DOWN

Field Details
Description This alert is raised with critical severity when the data node is down. db_tier_node_status value:
  • 0: indicates that a node is DOWN
  • 1: indicates that the node is UP
Summary MySQL {{ $labels.node_type }} node having node id {{ $labels.node_id }} is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity critical
Condition db_tier_node_status == 0
Expression Validity NA
SNMP Trap ID 2001
Affects Service (Y/N) Y
Recommended Action
Cause:
  • The pod restarts due to Kubernetes liveness or readiness probe failures.
  • The cnDBTier application restarts or fails to start.
Diagnostic Information:
  • Run the following command to check the status of cnDBTier namespace:
    kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show
  • Check the Kubernetes events for probe failures in the platform logs.
  • Check if any exception is reported in the cnDBTier application logs.

Recovery:

This alert is cleared automatically when the inactive pod becomes active.

For any assistance, collect the application logs and contact My Oracle Support.
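
If the db_tier_node_status metric is queried through the Prometheus HTTP API, the nodes that satisfy the NODE_DOWN condition can be picked out of the instant-query response. The Python sketch below assumes the standard Prometheus instant-vector response shape and the node_type and node_id labels used in the alert summary. The sample response and its label values are hypothetical and serve only as illustration.

```python
def down_nodes(query_response):
    """Return (node_type, node_id) pairs whose db_tier_node_status sample is 0."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in query_response["data"]["result"]:
        _timestamp, value = series["value"]   # value is returned as a string
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("node_type"), labels.get("node_id")))
    return down


# Hypothetical response with one node up and one node down.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "db_tier_node_status",
                        "node_type": "ndb", "node_id": "1"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "db_tier_node_status",
                        "node_type": "ndb", "node_id": "2"},
             "value": [1700000000, "0"]},
        ],
    },
}

print(down_nodes(SAMPLE_RESPONSE))
```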