cnDBTier Alerts

6.1 cnDBTier Remote Server Backup Transfer Status Alerts

This section provides details about the cnDBTier remote server backup transfer status alerts.

Table 6-1 REMOTE_SERVER_BACKUP_TRANSFER_FAILED

Field	Details
Description	This alert is triggered with major severity when the transfer of backup to a remote server fails.
Summary	Secure transfer of backup to remote server failed on cnDBTier site {{ $labels.site_name }}
Severity	major
Condition	db_tier_remote_server_backup_transfer_status == 1
Expression Validity	NA
SNMP Trap ID	2031
Affects Service (Y/N)	N
Recommended Action	Cause: The transfer of backup to remote server failed. Diagnostic Information: Check the status of the `db_tier_remote_server_backup_transfer_status` metric (Table 5-4). Recovery: This alert is cleared automatically when the backup transfer status is updated from the failed state to other states, that is, the `db_tier_remote_server_backup_transfer_status` metric value is updated to any value other than 1. To recover from this issue: check if the remote server is in a healthy state; check the network; check if enough space is available. For any assistance, collect the logs and contact My Oracle Support.

6.2 cnDBTier Backup Transfer Status Alerts

This section provides details about the cnDBTier backup transfer status alerts.

Table 6-2 BACKUP_TRANSFER_LOCAL_FAILED

Field	Details
Description	This alert is triggered with major severity when the system fails to transfer the backup from the data node to the replication service pod on the cnDBTier site (`db_tier_backup_transfer_status` metric value is 2).
Summary	Failed to transfer backup from data node to replication service pod on cnDBTier site {{ $labels.site_name }}
Severity	major
Condition	db_tier_backup_transfer_status == 2
Expression Validity	NA
SNMP Trap ID	2026
Affects Service (Y/N)	Y
Recommended Action	Cause: The system failed to transfer a backup from the data node to the replication service pod on a cnDBTier site. Diagnostic Information: The `db_tier_backup_transfer_status` metric (Table 5-5) provides information about the backup transfer status. Recovery: This alert is cleared automatically when the `db_tier_backup_transfer_status` metric is updated to a value other than 2. For any assistance, collect the logs and contact My Oracle Support.

Table 6-3 BACKUP_TRANSFER_FAILED

Field	Details
Description	This alert is triggered with major severity when the system fails to transfer the backup from the cnDBTier site to the remote site (`db_tier_backup_transfer_status` metric value is 3).
Summary	Failed to transfer backup to remote site from cnDBTier site {{ $labels.site_name }}
Severity	major
Condition	db_tier_backup_transfer_status == 3
Expression Validity	NA
SNMP Trap ID	2027
Affects Service (Y/N)	Y
Recommended Action	Cause: The system failed to transfer a backup from the cnDBTier site to a remote site. Diagnostic Information: The `db_tier_backup_transfer_status` metric (Table 5-5) provides information about the backup transfer status. Recovery: This alert is cleared automatically when the `db_tier_backup_transfer_status` metric is updated to a value other than 3. For any assistance, collect the logs and contact My Oracle Support.

Table 6-4 BACKUP_TRANSFER_IN_PROGRESS

Field	Details
Description	This alert is triggered with info severity when the backup transfer is in progress on the cnDBTier site (`db_tier_backup_transfer_status` metric value is 1).
Summary	Backup Transfer is In Progress on cnDBTier site {{ $labels.site_name }}
Severity	info
Condition	db_tier_backup_transfer_status == 1
Expression Validity	NA
SNMP Trap ID	2028
Affects Service (Y/N)	N
Recommended Action	Cause: Backup transfer is in progress on the cnDBTier site. Diagnostic Information: The `db_tier_backup_transfer_status` metric (Table 5-5) provides information about the backup transfer status. Recovery: This alert is cleared automatically when the `db_tier_backup_transfer_status` metric is updated to a value other than 1. For any assistance, collect the logs and contact My Oracle Support.
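
The three alerts in this section are driven by a single metric. As a minimal sketch, the Condition rows of Tables 6-2 to 6-4 can be read as a lookup from the `db_tier_backup_transfer_status` value to an alert name and severity. The function name `backup_transfer_alert` is hypothetical, and values other than 1, 2, and 3 are treated as "no alert" because this section does not state their meaning.

```python
# Mapping taken from the Condition and Severity rows of Tables 6-2 to 6-4.
# Any other metric value is treated as "no alert" (assumption of this sketch).
BACKUP_TRANSFER_ALERTS = {
    1: ("BACKUP_TRANSFER_IN_PROGRESS", "info"),
    2: ("BACKUP_TRANSFER_LOCAL_FAILED", "major"),
    3: ("BACKUP_TRANSFER_FAILED", "major"),
}


def backup_transfer_alert(status):
    """Return (alert_name, severity) for a metric value, or None if no alert matches."""
    return BACKUP_TRANSFER_ALERTS.get(status)
```

For example, `backup_transfer_alert(2)` returns the local transfer failure alert with major severity, while a value such as 0 returns `None`.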

6.3 cnDBTier Heartbeat Alerts

This section provides details about cnDBTier heartbeat alerts.

Table 6-5 HEARTBEAT_FAILED

Field	Details
Description	This alert is triggered with critical severity when HeartBeat fails on a remote site.
Summary	HeartBeat failed on cnDBTier site {{ $labels.site_name }} connected to mate site {{ $labels.mate_site_name }} on replication channel group id {{ $labels.replchannel_group_id }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	db_tier_heartbeat_failure == 1
Expression Validity	NA
SNMP Trap ID	2025
Affects Service (Y/N)	Y
Recommended Action	Cause: The system is unable to connect to the remote site and the heartbeat failed. Diagnostic Information: The `db_tier_heartbeat_failure` metric (Table 5-8) provides information about the heartbeat status and indicates whether the remote site is reachable or not. Recovery: This alert is cleared automatically when the `db_tier_heartbeat_failure` metric is 0. For any assistance, collect the logs and contact My Oracle Support.

6.4 cnDBTier BinLog Injector Thread Alerts

This section provides details about cnDBTier BinLog injector alerts.

Table 6-6 BINLOG_INJECTOR_STOPPED

Field	Details
Description	This alert is triggered with critical severity when Bin Log Injector stops working. The value of `db_tier_binlog_injector_thread` or `db_tier_binlog_injector_thread_latest_epoch` indicates the status of Bin Log Injector: 0 indicates that the Bin Log Injector thread is not stopped for the specified node ID; 1 indicates that the Bin Log Injector thread is stopped for the specified node ID.
Summary	BinLog Injector Thread is stopped for MySQL node having node id {{ $labels.node_id }} on cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	db_tier_binlog_injector_thread_latest_epoch == 1 or db_tier_binlog_injector_thread == 1
Expression Validity	NA
SNMP Trap ID	2024
Affects Service (Y/N)	Y
Recommended Action	Cause: Bin Log Injector thread stalled for the replication SQL node. Diagnostic Information: The `db_tier_binlog_injector_thread_latest_epoch` or `db_tier_binlog_injector_thread` metrics (Table 5-84 or Table 5-83) indicate whether the Bin Log Injector thread is stalled or not. Recovery: This alert is cleared automatically when the `db_tier_binlog_injector_thread_latest_epoch` or `db_tier_binlog_injector_thread` metric is 0. For any assistance, collect the logs and contact My Oracle Support.

6.5 cnDBTier Replication Error Skip Alerts

This section provides details about the cnDBTier replication error skip alerts.

Table 6-7 REPLICATION_SWITCHOVER_DUE_CLUSTERDISCONNECT

Field	Details
Description	This alert is triggered when a switchover occurs on an API node due to a configured cluster disconnect error while skip replication error is enabled.
Summary	Replication channel on SQL node with node ID {{ $labels.node_id }} had switchover due to cluster disconnect error number {{ $labels.error_number }}
Severity	info
Condition	db_tier_replication_switchover_due_to_clusterdisconnect == 1
Expression Validity	NA
SNMP Trap ID	2019
Affects Service (Y/N)	N
Recommended Action	Cause: Skip replication error is enabled on an API node and a switchover occurred on the node due to a configured cluster disconnect error. Diagnostic Information: The `db_tier_replication_switchover_due_to_clusterdisconnect` metric (Table 5-82) indicates whether a switchover occurred on an API node. Recovery: This alert is cleared automatically one hour after the event. For any assistance, collect the logs and contact My Oracle Support.

Table 6-8 REPLICATION_TOO_MANY_EPOCHS_LOST

Field	Details
Description	This alert is triggered when the epochs lost due to skip error is greater than 10000 and less than or equal to 80000. This alert is cleared one hour after the event.
Summary	Too many epochs are lost for skipping replication errors
Severity	major
Condition	(db_tier_epochs_lost_due_to_skiperror > 10000) and (db_tier_epochs_lost_due_to_skiperror <= 80000)
Expression Validity	NA
SNMP Trap ID	2020
Affects Service (Y/N)	N
Recommended Action	Cause: Between 10000 and 80000 epochs are lost due to skip errors. Diagnostic Information: The `db_tier_epochs_lost_due_to_skiperror` metric (Table 5-81) provides information about the number of epochs lost due to skip errors. Recovery: This alert is cleared automatically one hour after the event. For any assistance, collect the logs and contact My Oracle Support.

Table 6-9 REPLICATION_SKIP_ERRORS_LOW

Field	Details
Description	This alert is triggered when the replication is halted due to skip error count less than or equal to 5. This alert is cleared one hour after the event.
Summary	Cross-site replication errors are skipped
Severity	minor
Condition	(db_tier_replication_halted_due_to_skiperror > 0) and (db_tier_replication_halted_due_to_skiperror <= 5)
Expression Validity	NA
SNMP Trap ID	2021
Affects Service (Y/N)	N
Recommended Action	Cause: Replication halted due to five or fewer skip errors. Diagnostic Information: The `db_tier_replication_halted_due_to_skiperror` metric (Table 5-80) provides information about the number of skip errors due to which the replication halted. Recovery: This alert is cleared automatically one hour after the event. For any assistance, collect the logs and contact My Oracle Support.

Table 6-10 REPLICATION_SKIP_ERRORS_HIGH

Field	Details
Description	This alert is triggered when the replication is halted due to skip error counts greater than 5. This alert is cleared one hour after the event.
Summary	Cross-site replication errors skipped are high
Severity	major
Condition	db_tier_replication_halted_due_to_skiperror > 5
Expression Validity	NA
SNMP Trap ID	2022
Affects Service (Y/N)	N
Recommended Action	Cause: Replication halted due to more than five skip errors. Diagnostic Information: The `db_tier_replication_halted_due_to_skiperror` metric (Table 5-80) provides information about the number of skip errors due to which the replication halted. Recovery: This alert is cleared automatically one hour after the event. For any assistance, collect the logs and contact My Oracle Support.

Table 6-11 REPLICATION_EPOCHS_LOST

Field	Details
Description	This alert is triggered when the epochs lost due to skip error is greater than 0 and less than 2000. This alert is cleared one hour after the event.
Summary	Epochs are lost for skipping replication errors
Severity	info
Condition	db_tier_epochs_lost_due_to_skiperror > 0 and db_tier_epochs_lost_due_to_skiperror <= <Configured epoch interval lower threshold>
Expression Validity	NA
SNMP Trap ID	2023
Affects Service (Y/N)	N
Recommended Action	Cause: Less than 2000 epochs are lost due to skip errors. Diagnostic Information: The `db_tier_epochs_lost_due_to_skiperror` metric (Table 5-81) provides information about the number of epochs lost due to skip errors. Recovery: This alert is cleared automatically one hour after the event. For any assistance, collect the logs and contact My Oracle Support.
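
The threshold bands in Tables 6-8 to 6-11 can be sketched as two small classifiers that follow the Condition rows. The function names are hypothetical, and `lower_threshold` stands in for the configured epoch interval lower threshold, whose actual value depends on the deployment. Each function returns a list because the conditions are evaluated independently.

```python
def skip_error_alerts(halted_count):
    """Alerts matching db_tier_replication_halted_due_to_skiperror (Tables 6-9, 6-10)."""
    alerts = []
    if 0 < halted_count <= 5:
        alerts.append("REPLICATION_SKIP_ERRORS_LOW")
    if halted_count > 5:
        alerts.append("REPLICATION_SKIP_ERRORS_HIGH")
    return alerts


def epochs_lost_alerts(epochs_lost, lower_threshold):
    """Alerts matching db_tier_epochs_lost_due_to_skiperror (Tables 6-8, 6-11).

    lower_threshold is the configured epoch interval lower threshold
    (a deployment setting; the caller must supply it).
    """
    alerts = []
    if 0 < epochs_lost <= lower_threshold:
        alerts.append("REPLICATION_EPOCHS_LOST")
    if 10000 < epochs_lost <= 80000:
        alerts.append("REPLICATION_TOO_MANY_EPOCHS_LOST")
    return alerts
```

With a hypothetical lower threshold of 2000, a value such as 5000 lost epochs matches neither condition, which the sketch reports as an empty list.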

6.6 cnDBTier Georeplication Recovery Status Alerts

This section provides details about the cnDBTier georeplication recovery status alerts.

Table 6-12 GEOREPLICATION_RECOVERY_IN_PROGRESS

Field	Details
Description	This alert is triggered with critical severity when the georeplication recovery is in progress and the alert is cleared when georeplication recovery is complete.
Summary	Identified cnDBTier Site {{ $labels.site_name }} georeplication recovery is in progress for kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	db_tier_georeplication_recovery_state == 1
Expression Validity	1m
SNMP Trap ID	2018
Affects Service (Y/N)	Y
Recommended Action	Cause: Georeplication recovery is in progress, that is, a failed site is being recovered from a healthy site. Diagnostic Information: The `db_tier_georeplication_recovery_state` metric (Table 5-36) indicates whether georeplication recovery is in progress. Recovery: This alert is cleared automatically when the georeplication recovery is complete. For any assistance, collect the logs and contact My Oracle Support.

6.7 cnDBTier Cluster Status Alerts

This section provides details about cnDBTier cluster status alerts.

Table 6-13 CLUSTER_DOWN

Field	Details
Description	This alert is triggered with critical severity when cnDBTier NDB cluster is not UP.
Summary	MySQL Cluster is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	db_tier_cluster_status == 0
Expression Validity	1m
SNMP Trap ID	2017
Affects Service (Y/N)	Y
Recommended Action	Cause: A pod restarts due to Kubernetes liveness or readiness probe failures, or the cnDBTier application restarts or fails to start. Diagnostic Information: Run the following command to check the status of cnDBTier namespace: `kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show`. The cluster is down if: the ndbappmysqld pods are down, not running, and not connected; the remaining pods are not running and not connected. Check Kubernetes events for probe failures in the platform logs. Check if any exception is reported in the cnDBTier application logs. Recovery: This alert is cleared automatically when the inactive pod is active. For any assistance, collect the logs and contact My Oracle Support.

Table 6-14 MYSQL_NDB_CLUSTER_DISCONNECT

Field	Details
Description	This alert is triggered with critical severity when the cnDBTier NDB cluster disconnects, that is, when the `db_tier_cluster_disconnect` metric value is greater than 0.
Summary	MySQL NDB Cluster Disconnected {{ $value }} times for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	db_tier_cluster_disconnect > 0
Expression Validity	1m
SNMP Trap ID	2034
Affects Service (Y/N)	Y
Recommended Action	Cause: When all ndbmtd pods or all ndbmtd pods of the same node group restart due to Kubernetes probe failures or infrastructure related issues. Diagnostic Information: Run the following command to check the status of cnDBTier namespace: `kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show` The cluster is down if: all data nodes or all data nodes of the same node group are not connected. Check Kubernetes events for probe failures in the platform logs. Check if there is any network fluctuation or platform related issue which can cause the ndbmtd pods to restart. Check if any exception is reported in the cnDBTier application logs. Recovery: This alert can be cleared by calling the `/db-tier/reset/parameter/cluster_restart_disconnect` REST API. For more information about this API, see cnDBTier Cluster Events APIs. For any assistance, collect the logs and contact My Oracle Support.

6.8 cnDBTier Automated Backup Alerts

This section provides details about the cnDBTier automated backup alerts.

Table 6-15 BACKUP_FAILED

Field	Details
Description	This alert is triggered with minor severity when the backup service fails to complete the backup successfully.
Summary	Could not backup database for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	minor
Condition	db_tier_backup{status='FAILED'}
Expression Validity	N/A
SNMP Trap ID	2011
Affects Service (Y/N)	N
Recommended Action	Cause: When backup service fails to complete the backup successfully. When PVC size is not enough and as a result the backup fails. Diagnostic Information: The `db_tier_backup` metric (Table 5-34) provides information if the backup failed or not. Recovery: This alert is cleared automatically when the `db_tier_backup` metric status changes from the FAILED state to other states. For any assistance, collect the logs and contact My Oracle Support.

Table 6-16 BACKUP_PURGED_EARLY

Field	Details
Description	This alert is triggered with minor severity when the backup service purges old backups earlier than expected to create space for new backup.
Summary	A backup was deleted prematurely for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	minor
Condition	db_tier_backup{status='PURGED_EARLY'}
Expression Validity	N/A
SNMP Trap ID	2012
Affects Service (Y/N)	N
Recommended Action	Cause: When the backup service purges the old backups earlier than the expected time, to create space for a new backup. Diagnostic Information: The `db_tier_backup` metric (Table 5-34) provides information if the backup is purged earlier than expected. Recovery: This alert is cleared automatically when the `db_tier_backup` metric status changes from the PURGED_EARLY state to other states. For any assistance, collect the logs and contact My Oracle Support.

Table 6-17 BACKUP_SIZE_GROWTH

Field	Details
Description	This alert is triggered with minor severity whenever the current backup size exceeds 20% of the average of the previous backups.
Summary	Backup size exceeded expected size for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	minor
Condition	(db_tier_backup_used_disk_percentage/(avg_over_time(db_tier_backup_used_disk_percentage[5d])))>1.05
Expression Validity	N/A
SNMP Trap ID	2013
Affects Service (Y/N)	N
Recommended Action	Cause: When the current backup size exceeds 20% of the average of the previous backups. Diagnostic Information: The `db_tier_backup_used_disk_percentage` metric (Table 5-32) provides information if the current backup size exceeds 20% of the average. Recovery: This alert is cleared automatically when the `db_tier_backup_used_disk_percentage` metric value is reduced to the threshold percentage. For any assistance, collect the logs and contact My Oracle Support.
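
The Condition row of Table 6-17 divides the current value of `db_tier_backup_used_disk_percentage` by its five-day average and compares the ratio against a factor. A rough sketch of that arithmetic follows, using the factor exactly as written in the Condition row. The function name and the `history_pct` list of samples are hypothetical, and the handling of an empty or zero history is a choice made for this sketch only.

```python
from statistics import mean

GROWTH_FACTOR = 1.05  # factor as written in the Condition row of Table 6-17


def backup_size_growth(current_pct, history_pct):
    """Return True when current usage divided by the historical average exceeds the factor.

    history_pct is a hypothetical list of db_tier_backup_used_disk_percentage
    samples covering the last five days.
    """
    if not history_pct:
        return False  # sketch choice: no history, nothing to compare against
    average = mean(history_pct)
    if average == 0:
        return False  # sketch choice: avoid dividing by zero
    return current_pct / average > GROWTH_FACTOR
```

For instance, a current value of 53 against a history averaging 50 gives a ratio of 1.06, which is above the factor.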

Table 6-18 BACKUP_STORAGE_LOW

Field	Details
Description	This alert is triggered with minor severity when the total backup size of the data node is >= 70% and < 80% of the total data node disk size.
Summary	Disk storage on DATA node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	minor
Condition	(avg_over_time(db_tier_backup_used_disk_percentage[5m])>=70) and (avg_over_time(db_tier_backup_used_disk_percentage[5m])<80)
Expression Validity	N/A
SNMP Trap ID	2014
Affects Service (Y/N)	N
Recommended Action	Cause: When the total backup size of the data node is >= 70% and < 80% of the total data node disk size. Diagnostic Information: The `db_tier_backup_used_disk_percentage` metric (Table 5-32) provides information if the current backup size is >= 70% and < 80% of the total data node disk size. Recovery: This alert is cleared automatically when the `db_tier_backup_used_disk_percentage` metric value is reduced to the threshold percentage. Increase the disk size by performing the scaling procedure. For any assistance, collect the logs and contact My Oracle Support.

Table 6-19 BACKUP_STORAGE_LOW

Field	Details
Description	This alert is triggered with major severity when the total backup size of the data node is >= 80% and < 95% of the total data node disk size.
Summary	Disk storage on DATA node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	major
Condition	(avg_over_time(db_tier_backup_used_disk_percentage[5m])>=80) and (avg_over_time(db_tier_backup_used_disk_percentage[5m])<95)
Expression Validity	N/A
SNMP Trap ID	2015
Affects Service (Y/N)	N
Recommended Action	Cause: When the total backup size of the data node is >= 80% and < 95% of the total data node disk size. Diagnostic Information: The `db_tier_backup_used_disk_percentage` metric (Table 5-32) provides information if the current backup size is >= 80% and < 95% of the total data node disk size. Recovery: This alert is cleared automatically when the `db_tier_backup_used_disk_percentage` metric value is reduced to the threshold percentage. Increase the disk size by performing the scaling procedure. For any assistance, collect the logs and contact My Oracle Support.

Table 6-20 BACKUP_STORAGE_FULL

Field	Details
Description	This alert is triggered with critical severity when the total backup size of the data node is >= 95% of the total data node disk size.
Summary	Disk storage on DATA node with node ID {{ $labels.node_id }} is full for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(avg_over_time(db_tier_backup_used_disk_percentage[5m])>=95)
Expression Validity	N/A
SNMP Trap ID	2016
Affects Service (Y/N)	N
Recommended Action	Cause: When the total backup size of the data node is >= 95% of the total data node disk size. Diagnostic Information: The `db_tier_backup_used_disk_percentage` metric (Table 5-32) provides information if the current backup size is >= 95% of the total data node disk size. Recovery: This alert is cleared automatically when the `db_tier_backup_used_disk_percentage` metric value is reduced to the threshold percentage. Increase the disk size by performing the scaling procedure. Note: Take immediate action to avoid the cnDBTier cluster going out of service. For any assistance, collect the logs and contact My Oracle Support.
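
Tables 6-18 to 6-20 split the five-minute average of `db_tier_backup_used_disk_percentage` into three severity bands. A minimal sketch of those bands is shown here; the function name is hypothetical and the input is assumed to be the already averaged percentage.

```python
def backup_storage_severity(avg_used_pct):
    """Map the 5-minute average disk usage percentage to a severity.

    Bands follow the Condition rows of Tables 6-18 to 6-20:
    >= 70 and < 80 is minor, >= 80 and < 95 is major, >= 95 is critical.
    Returns None below 70, where no alert condition matches.
    """
    if avg_used_pct >= 95:
        return "critical"
    if avg_used_pct >= 80:
        return "major"
    if avg_used_pct >= 70:
        return "minor"
    return None
```

Checking the highest band first keeps each comparison simple and makes the lower bounds inclusive, as in the Condition rows.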

Table 6-21 DB_TIER_NDB_BACKUP_IN_PROGRESS

Field	Details
Description	This alert is triggered with minor severity when a data node backup is in progress in the current site.
Summary	Indicates that a data node backup process is in progress in the current site.
Severity	minor
Condition	db_tier_ndb_backup_in_progress == 1
Expression Validity	N/A
SNMP Trap ID	2037
Affects Service (Y/N)	N
Recommended Action	Cause: When a data node backup is in progress in the current site. Diagnostic Information: The `db_tier_ndb_backup_in_progress` metric (Table 5-35) provides information if a data node backup is in progress or not. Ensure that you don't make any schema changes until the backup completes. Recovery: This alert is cleared automatically when the backup completes in the MySQL NDB cluster. For any assistance, collect the logs and contact My Oracle Support.

6.9 cnDBTier Bin Log Usage Alerts

This section provides details about the cnDBTier binlog usage alerts.

Table 6-22 BINLOG_STORAGE_LOW

Field	Details
Description	This alert is triggered with minor severity when the total BinLog size of the SQL node is >= 70% and < 80% of the total SQL node disk size.
Summary	Disk storage on SQL node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	minor
Condition	(avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 70) and (avg_over_time( db_tier_binlog_used_bytes_percentage[5m] ) < 80)
Expression Validity	5m
SNMP Trap ID	2007
Affects Service (Y/N)	N
Recommended Action	Cause: When the total BinLog size of the SQL node is >= 70% and < 80% of the total SQL node disk size. Diagnostic Information: The `db_tier_binlog_used_bytes_percentage` metric (Table 5-29) provides information if the total BinLog size of the SQL node is >=70% and <80% of total SQL node disk size. Recovery: This alert is cleared automatically when the value of the `db_tier_binlog_used_bytes_percentage` metric is reduced to the threshold value. For any assistance, contact My Oracle Support.

Table 6-23 BINLOG_STORAGE_LOW

Field	Details
Description	This alert is triggered with major severity when the total BinLog size of the SQL node is >=80% and <95% of the total SQL node disk size.
Summary	Disk storage on SQL node with node ID {{ $labels.node_id }} at {{ $value }} percent for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	major
Condition	(avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 80) and (avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) < 95)
Expression Validity	5m
SNMP Trap ID	2036
Affects Service (Y/N)	N
Recommended Action	Cause: When the total BinLog size of the SQL node is >=80% and <95% of the total SQL node disk size. Diagnostic Information: The `db_tier_binlog_used_bytes_percentage` metric (Table 5-29) provides information if the total BinLog size of the SQL node is >=80% and <95% of total SQL node disk size. Recovery: This alert is cleared automatically when the value of the `db_tier_binlog_used_bytes_percentage` metric is reduced to the threshold value. For any assistance, contact My Oracle Support.

Table 6-24 BINLOG_STORAGE_FULL

Field	Details
Description	This alert is triggered with critical severity when the total BinLog size of the SQL node is >= 95% of the total SQL node disk size.
Summary	Disk storage on SQL node with node ID {{ $labels.node_id }} is full for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	avg_over_time( db_tier_binlog_used_bytes_percentage[5m]) >= 95
Expression Validity	N/A
SNMP Trap ID	2008
Affects Service (Y/N)	Y
Recommended Action	Cause: When the total BinLog size of the SQL node is >= 95% of the total SQL node disk size. Diagnostic Information: The `db_tier_binlog_used_bytes_percentage` metric (Table 5-29) provides information if the total BinLog size of the SQL node is >= 95% of total SQL node disk size. Recovery: This alert is cleared automatically when the value of the `db_tier_binlog_used_bytes_percentage` metric is reduced to the threshold value. Note: Take immediate action to avoid the SQL node going into the CrashLoopBackOff state and becoming inaccessible. For any assistance, contact My Oracle Support.

6.10 cnDBTier Replication Alerts

This section provides details about cnDBTier replication alerts.

Table 6-25 REPLICATION_CHANNEL_DOWN

Field	Details
Description	This alert is triggered with major severity when an ACTIVE channel goes to the FAILED state.
Summary	Cross-site replication is down on node {{ $labels.node_id }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	major
Condition	(db_tier_replication_status{role="failed"} == 0) or (db_tier_replication_status{role="active"} == 0)
Expression Validity	N/A
SNMP Trap ID	2005
Affects Service (Y/N)	N
Recommended Action	Cause: When any ACTIVE channel goes to the FAILED state as the cross-site replication is down on a node. Diagnostic Information: The following metrics provide information if the replication channel is down: `db_tier_replication_status{role="failed"} == 0` (Table 5-30); `db_tier_replication_status{role="active"} == 0` (Table 5-30). Recovery: This alert is cleared automatically when the cross-site replication is UP on the node and ACTIVE. Note: Take immediate action to avoid the cnDBTier cluster going out of service. For any assistance, contact My Oracle Support.

Table 6-26 REPLICATION_FAILED

Field	Details
Description	This alert is triggered with critical severity when all the channels are in the STANDBY or FAILED state.
Summary	Cross-site replication is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(count by (site_name, namespace, replchannel_group_id) (db_tier_replication_status) == count by (site_name, namespace, replchannel_group_id) (db_tier_replication_status{role="standby"})) or (count by (namespace, replchannel_group_id, site_name) (db_tier_replication_status) == count by (namespace, replchannel_group_id, site_name) (db_tier_replication_status{role="failed"}))
Expression Validity	N/A
SNMP Trap ID	2006
Affects Service (Y/N)	Y
Recommended Action	Cause: When all the channels are in the STANDBY or FAILED state as the cross-site replication is down for the cnDBTier site. Diagnostic Information: The `db_tier_replication_status` metric (Table 5-30) provides information if the replication failed. Recovery: This alert is cleared automatically when the cross-site replication of the cnDBTier site is ACTIVE. For any assistance, contact My Oracle Support.
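
The Condition row of Table 6-26 compares, per replication channel group, the total number of `db_tier_replication_status` series with the number that carry the `standby` role, and separately with the number that carry the `failed` role. The grouping logic can be sketched as follows, assuming each series is represented as a dictionary of its labels. The function name and the input shape are hypothetical.

```python
from collections import defaultdict


def replication_failed_groups(series):
    """Return the (site_name, namespace, replchannel_group_id) groups that match.

    A group matches when every series in it has role "standby", or every
    series in it has role "failed", mirroring the two count comparisons
    in the Condition row.
    """
    roles_by_group = defaultdict(list)
    for labels in series:
        key = (labels["site_name"], labels["namespace"], labels["replchannel_group_id"])
        roles_by_group[key].append(labels["role"])

    matching = []
    for key, roles in roles_by_group.items():
        total = len(roles)
        if roles.count("standby") == total or roles.count("failed") == total:
            matching.append(key)
    return sorted(matching)
```

Note that, read literally, the expression does not match a group in which some channels are `standby` and the others are `failed`; the sketch preserves that behavior.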

Table 6-27 REPLICA_REPLICATION_DELAY_HIGH

Field	Details
Description	This alert is triggered when the last record read by the replica is more than five minutes behind the latest record written by the source.
Summary	Replica replication on SQL node at {{ $labels.replica_node_ip }} is {{ $value }} seconds behind the source for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	major
Condition	avg(avg_over_time(db_tier_replication_replica_delay[5m])) by (source_node_ip,replica_node_ip) >= 300 and avg(avg_over_time(db_tier_replication_replica_delay[5m])) by (source_node_ip,replica_node_ip) < 48*3600
Expression Validity	1m
SNMP Trap ID	2009
Affects Service (Y/N)	N
Recommended Action	Cause: When the last record read by the worker node is more than 5 minutes and less than 48 hours behind the latest record written by the controller. Diagnostic Information: The `db_tier_replication_replica_delay` metric (Table 5-31) provides information if there is a delay in the worker node replication. Recovery: This alert is cleared automatically when the `db_tier_replication_replica_delay` metric value is reduced below the defined threshold value. For any assistance, collect the logs and contact My Oracle Support.

Table 6-28 REPLICA_REPLICATION_FAILED

Field	Details
Description	This alert is triggered when the last record read by the replica is more than 48 hours behind the latest record written by the source.
Summary	Replica replication has fallen more than 48 hours behind the source. Manual restore from backup may be required for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	avg(avg_over_time(db_tier_replication_replica_delay[5m])) by (source_node_ip,replica_node_ip) >= 48*3600
Expression Validity	1m
SNMP Trap ID	2010
Affects Service (Y/N)	Y
Recommended Action	Cause: When the last record read by the worker node is more than 48 hours behind the latest record written by the controller. Diagnostic Information: The `db_tier_replication_replica_delay` metric (Table 5-31) provides information if the worker node replication failed. Recovery: Perform georeplication recovery for the DB sync. For procedures, see Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide. This alert is cleared automatically when the `db_tier_replication_replica_delay` metric value is reduced below the defined threshold value. For any assistance, collect the logs and contact My Oracle Support.
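
The two replica delay alerts in Tables 6-27 and 6-28 share one metric and differ only in the delay band. A minimal sketch of the bands, taking the averaged delay in seconds as input, is shown below. The function name is hypothetical.

```python
FIVE_MINUTES = 300           # seconds, lower bound from Table 6-27
FORTY_EIGHT_HOURS = 48 * 3600  # seconds, bound shared by Tables 6-27 and 6-28


def replica_delay_alert(avg_delay_seconds):
    """Map the averaged db_tier_replication_replica_delay value to an alert.

    >= 300 and < 48*3600 seconds matches REPLICA_REPLICATION_DELAY_HIGH (major);
    >= 48*3600 seconds matches REPLICA_REPLICATION_FAILED (critical).
    Returns None below 300 seconds.
    """
    if avg_delay_seconds >= FORTY_EIGHT_HOURS:
        return ("REPLICA_REPLICATION_FAILED", "critical")
    if avg_delay_seconds >= FIVE_MINUTES:
        return ("REPLICA_REPLICATION_DELAY_HIGH", "major")
    return None
```

Because the upper bound of the first band equals the lower bound of the second, exactly one of the two alerts matches for any delay of 300 seconds or more.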

Table 6-29 REPLICATION_SVC_STORAGE_FULL

Field	Details
Description	This alert is triggered with critical severity whenever the PVC consumption of replication service is more than 90% of the overall storage of replication service.
Summary	Disk storage of replication service PVC {{ $labels.persistentvolumeclaim }} is full on kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*replication.*"}/kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*replication.*"}) * 100 > 90
Expression Validity	NA
SNMP Trap ID	2032
Affects Service (Y/N)	Y
Recommended Action	Cause: When the replication service PVC is filled more than 90% of overall PVC storage. A PVC fill can lead to replication service not functioning properly. Diagnostic Information: This alert indicates that PVC of replication service is almost full and requires immediate attention to address the storage issue. Recovery: Release storage for the replication service to function properly or scale PVC to accommodate any future data as the PVC is almost full.
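
The Condition row of Table 6-29 expresses PVC consumption as used bytes over capacity bytes, scaled to a percentage. The arithmetic can be sketched as below. The function names are hypothetical, and the name check assumes that the selector is meant to pick any PVC whose name contains "replication".

```python
def pvc_used_percent(used_bytes, capacity_bytes):
    """Return PVC usage as a percentage of capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes * 100 / capacity_bytes


def replication_pvc_full(pvc_name, used_bytes, capacity_bytes):
    """Return True when a replication service PVC is more than 90% used.

    Assumption of this sketch: a PVC belongs to the replication service
    when its name contains "replication".
    """
    if "replication" not in pvc_name:
        return False
    return pvc_used_percent(used_bytes, capacity_bytes) > 90
```

A PVC named, for example, `pvc-replication-svc` with 950 of 1000 bytes used is reported as full, while the same usage on an unrelated PVC is ignored.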

Table 6-30 GEOREPLICATION_RECOVERY_FAILED

Field	Details
Description	This alert is triggered with critical severity when georeplication recovery fails on an unhealthy site where georeplication recovery was started.
Summary	Georeplication recovery has failed on cnDBTier Site {{ $labels.site_name }} from kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	db_tier_georeplication_recovery_state == 2
Expression Validity	NA
SNMP Trap ID	2033
Affects Service (Y/N)	Y
Recommended Action	Cause: Incorrect disk size, incorrect SSH key configurations, or other similar reasons. Diagnostic Information: This alert indicates that georeplication recovery failed on an unhealthy site and replication could not be reestablished using the georeplication recovery procedure. This alert requires immediate attention. Recovery: Check configurations such as the SSH key or disk size. Contact My Oracle Support for additional support.

6.11 cnDBTier Memory Usage Alerts

This section provides details about the cnDBTier memory usage alerts.

Table 6-31 LOW_MEMORY

Field	Details
Description	This alert is triggered when the RAM usage of any node is greater than or equal to 80%.
Summary	Node ID {{ $labels.node_id }}, memory utilization at {{ $value }} percent for memory type {{ $labels.memory_type }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	major
Condition	((avg_over_time(db_tier_memory_used_bytes{memory_type="Data memory"}[1m]) / avg_over_time(db_tier_memory_total_bytes{memory_type="Data memory"}[1m])) * 100) >= 80
Expression Validity	1m
SNMP Trap ID	2003
Affects Service (Y/N)	N
Recommended Action	Cause: When the RAM or memory usage of any node reaches the major threshold value. Diagnostic Information: Check whether the values of the following metrics are too high: `db_tier_memory_used_bytes` (Table 5-27) and `db_tier_memory_total_bytes` (Table 5-28). Recovery: Reduce the incoming service request rate. This alert is cleared automatically when the memory usage of the cnDBTier worker pod is reduced below the defined threshold value. For any assistance, contact My Oracle Support.

Table 6-32 OUT_OF_MEMORY

Field	Details
Description	This alert is triggered with critical severity when the RAM usage of any node is greater than or equal to 90%.
Summary	Node ID {{ $labels.node_id }} out of memory for memory type {{ $labels.memory_type }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	((avg_over_time(db_tier_memory_used_bytes{memory_type="Data memory"}[1m]) / avg_over_time(db_tier_memory_total_bytes{memory_type="Data memory"}[1m])) * 100) >= 90
Expression Validity	1m
SNMP Trap ID	2004
Affects Service (Y/N)	Y
Recommended Action	Cause: When the RAM or memory usage of any node reaches the critical threshold value. Diagnostic Information: Check whether the values of the following metrics are too high: `db_tier_memory_used_bytes` (Table 5-27) and `db_tier_memory_total_bytes` (Table 5-28). Recovery: Reduce the incoming service request rate. This alert is cleared automatically when the memory usage of the cnDBTier worker pod is reduced below the defined threshold value. Note: Take immediate action to avoid the cnDBTier cluster going out of service. For any assistance, contact My Oracle Support.
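
As a rough illustration of how the two memory conditions relate, the following Python sketch evaluates both thresholds for one node. It assumes the inputs are already the one-minute averages of the used and total "Data memory" bytes; the function name and the sample values are hypothetical.

```python
def memory_alerts(avg_used_bytes: float, avg_total_bytes: float) -> list[str]:
    """Return the names of the memory alerts whose conditions are true."""
    usage_percent = avg_used_bytes / avg_total_bytes * 100
    fired = []
    if usage_percent >= 80:  # LOW_MEMORY condition
        fired.append("LOW_MEMORY")
    if usage_percent >= 90:  # OUT_OF_MEMORY condition
        fired.append("OUT_OF_MEMORY")
    return fired


print(memory_alerts(750, 1000))  # []
print(memory_alerts(850, 1000))  # ['LOW_MEMORY']
print(memory_alerts(950, 1000))  # ['LOW_MEMORY', 'OUT_OF_MEMORY']
```

As written in the tables, the LOW_MEMORY expression has no upper bound, so both expressions evaluate to true at 90% and above. Whether both alerts are actually delivered depends on the alerting configuration, which this sketch does not model.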

6.12 cnDBTier CPU Usage Alerts

This section provides details about cnDBTier CPU usage alerts.

Table 6-33 HIGH_CPU

Field	Details
Description	This alert is triggered with major severity when the CPU usage of any data node is greater than or equal to 80%, and less than 90%.
Summary	Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	major
Condition	((100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) >= 80) and ((100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) < 90)
Expression Validity	1m
SNMP Trap ID	2002
Affects Service (Y/N)	N
Recommended Action	Cause: When the CPU utilization of any data node is greater than or equal to 80%, and less than 90%. Diagnostic Information: Check the CPU threshold level status from the cnDBTier worker pod logs. Recovery: Reduce the incoming service request rate. This alert is cleared automatically when the CPU utilization is reduced below the threshold value. For any assistance, contact My Oracle Support.

Table 6-34 HIGH_CPU

Field	Details
Description	This alert is triggered with critical severity when the CPU usage of any data node is greater than or equal to 90%.
Summary	Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) >= 90
Expression Validity	1m
SNMP Trap ID	2035
Affects Service (Y/N)	N
Recommended Action	Cause: When the CPU utilization of any data node is greater than or equal to 90%. Diagnostic Information: Check the CPU threshold level status from the cnDBTier worker pod logs. Recovery: Reduce the incoming service request rate. This alert is cleared automatically when the CPU utilization is reduced below the threshold value. Note: Take immediate action to avoid the cnDBTier cluster going out of service. For any assistance, contact My Oracle Support.
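
The two CPU conditions derive utilization from the idle percentage and then split it into a major band and a critical band. The Python sketch below illustrates that classification for a single node. It simplifies the expressions by averaging one list of idle samples, and the function name and sample values are hypothetical.

```python
from statistics import mean


def cpu_alert_severity(idle_percent_samples):
    """Classify CPU utilization (100 - average idle) for one data node."""
    utilization = 100 - mean(idle_percent_samples)
    if utilization >= 90:
        return "critical"  # condition in Table 6-34
    if utilization >= 80:
        return "major"     # condition in Table 6-33: >= 80 and < 90
    return None            # neither condition is true


print(cpu_alert_severity([15, 18, 12]))  # average idle 15 -> major
print(cpu_alert_severity([5, 8, 5]))     # average idle 6 -> critical
print(cpu_alert_severity([40, 50]))      # average idle 45 -> None
```

Unlike the memory alerts, the major CPU condition is capped below 90, so at most one of the two conditions is true for a node at any time.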

6.13 cnDBTier Node Status Alerts

This section provides details about cnDBTier node status alerts.

Table 6-35 NODE_DOWN

Field	Details
Description	This alert is raised with critical severity when the data node is down. The `db_tier_node_status` metric value 0 indicates that the node is DOWN, and the value 1 indicates that the node is UP.
Summary	MySQL {{ $labels.node_type }} node having node id {{ $labels.node_id }} is down for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	db_tier_node_status == 0
Expression Validity	NA
SNMP Trap ID	2001
Affects Service (Y/N)	Y
Recommended Action	Cause: When the pod restarts due to Kubernetes liveness or readiness probe failures, or when the cnDBTier application restarts or fails to start. Diagnostic Information: Run the following command to check the status of the cnDBTier namespace: `kubectl -n <namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show` Check the Kubernetes events for probe failures in the platform logs. Check if any exception is reported in the cnDBTier application logs. Recovery: This alert is cleared automatically when the inactive pod becomes active. For any assistance, collect the application logs and contact My Oracle Support.

6.14 cnDBTier Node Data Volume Alerts

This section provides details about cnDBTier node data volume alerts.

Table 6-36 DB_TIER_API_SEND_NODE_DATA_VOLUME_LOW

Field	Details
Description	This alert is triggered when any NDB application node sends less data to NDB when compared to the other NDB application nodes.
Summary	Send Node Data Volume Low for API Node ID {{ $labels.remote_node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(((sum by (remote_node_id,namespace) (avg_over_time(rate(db_tier_node_transporter_bytes_received{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))/scalar(sum (avg_over_time(rate(db_tier_node_transporter_bytes_received{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))) * 100) < (100/(scalar(count(count by (remote_node_id) (db_tier_node_transporter_bytes_received{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}))) * 1.6))
Expression Validity	NA
SNMP Trap ID	3001
Affects Service (Y/N)	Y
Recommended Action	Cause: When an NDB application node sends less data to NDB when compared to the other NDB application nodes. Diagnostic Information: The alert indicates that the NDB application node is slow. Therefore, check the underlying infrastructure. Recovery: This alert is cleared automatically when the NDB application node sends data at the same rate as the other NDB application nodes. For any assistance, contact My Oracle Support.

Table 6-37 DB_TIER_API_RECEIVE_NODE_DATA_VOLUME_LOW

Field	Details
Description	This alert is triggered when any NDB sends less data to any specific NDB application node when compared to the other NDB application nodes.
Summary	Receive Node Data Volume Low for API Node ID {{ $labels.remote_node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(((sum by (remote_node_id,namespace) (avg_over_time(rate(db_tier_node_transporter_bytes_sent{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))/scalar(sum (avg_over_time(rate(db_tier_node_transporter_bytes_sent{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))) * 100) < (100/(scalar(count(count by (remote_node_id) (db_tier_node_transporter_bytes_sent{node_type=~"ndbapp_node",namespace="<${CNDBTIER_NAMESPACE}>"}))) * 1.6))
Expression Validity	NA
SNMP Trap ID	3002
Affects Service (Y/N)	Y
Recommended Action	Cause: When NDB sends less data to any specific NDB application node when compared to the other NDB application nodes. Diagnostic Information: The alert indicates that the NDB application node is slow. Therefore, check the underlying infrastructure. Recovery: This alert is cleared automatically when the NDB application node receives data at the same rate as the other NDB application nodes. For any assistance, contact My Oracle Support.

Table 6-38 DB_TIER_SEND_DATA_NODE_DATA_VOLUME_LOW

Field	Details
Description	This alert is triggered when any NDB data node doesn't send traffic data at the required speed or when its speed is slower than that of the other data nodes.
Summary	Send Data Node Data Volume Low for DATA Node ID {{ $labels.node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	((sum by (node_id,namespace) (avg_over_time(rate(db_tier_node_transporter_bytes_sent{namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m]))/scalar(sum (avg_over_time(rate(db_tier_node_transporter_bytes_sent{namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))) * 100) < (100/(scalar(count(count by (node_id) (db_tier_node_transporter_bytes_sent{namespace="<${CNDBTIER_NAMESPACE}>"})))*1.6))
Expression Validity	NA
SNMP Trap ID	3003
Affects Service (Y/N)	Y
Recommended Action	Cause: When any NDB data node doesn't send traffic data at the required speed or when its speed is slower than that of the other data nodes. Diagnostic Information: The alert indicates that the node is slow. Therefore, check the underlying infrastructure. Recovery: This alert is cleared automatically when the node sends data at the same rate as the other data nodes. For any assistance, contact My Oracle Support.

Table 6-39 DB_TIER_RECEIVE_DATA_NODE_DATA_VOLUME_LOW

Field	Details
Description	This alert is triggered when any NDB data node doesn't receive traffic data at the required speed or when its speed is slower than that of the other data nodes.
Summary	Receive Data Node Data Volume Low for DATA Node ID {{ $labels.node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	((sum by (node_id,namespace) (avg_over_time(rate(db_tier_node_transporter_bytes_received{namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m]))/scalar(sum (avg_over_time(rate(db_tier_node_transporter_bytes_received{namespace="<${CNDBTIER_NAMESPACE}>"}[5m])[15m:5m])))) * 100)< (100/(scalar(count(count by (node_id) (db_tier_node_transporter_bytes_received{namespace="<${CNDBTIER_NAMESPACE}>"})))*1.6))
Expression Validity	NA
SNMP Trap ID	3004
Affects Service (Y/N)	Y
Recommended Action	Cause: When any NDB data node doesn't receive traffic data at the required speed or when its speed is slower than that of the other data nodes. Diagnostic Information: The alert indicates that the node is slow. Therefore, check the underlying infrastructure. Recovery: This alert is cleared automatically when the node receives data at the same rate as the other data nodes. For any assistance, contact My Oracle Support.
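
The conditions in Table 6-38 and Table 6-39 compare each node's share of the total traffic rate with a threshold of 100/(n * 1.6) percent, where n is the number of nodes. In other words, a node is flagged when its share falls below 62.5% of an equal share. The Python sketch below works through that comparison. The function name, node IDs, and byte rates are hypothetical.

```python
def low_volume_nodes(rates: dict[str, float], factor: float = 1.6) -> list[str]:
    """Return the nodes whose share of the total rate is below 100 / (n * factor)."""
    total = sum(rates.values())
    if total == 0:
        return []  # no traffic at all, so no share can be computed
    threshold_percent = 100 / (len(rates) * factor)
    return [
        node
        for node, rate in rates.items()
        if rate / total * 100 < threshold_percent
    ]


# Four hypothetical data nodes with averaged byte rates.
rates = {"1": 400, "2": 380, "3": 420, "4": 100}
print(low_volume_nodes(rates))  # ['4']
```

With four nodes the equal share is 25% and the threshold is 15.625%. Node 4 carries about 7.7% of the traffic, so it is the only node below the threshold.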

Table 6-40 DB_TIER_DATA_NODE_SCAN_FRAGMENT_SLOW

Field	Details
Description	This alert is triggered when any data node scan fragment is slow when compared with other data nodes.
Summary	Scan Fragment is Slow for DATA Node ID {{ $labels.node_id }} at kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(((sum by (node_id,comm_node_id,namespace) (rate(db_tier_tc_time_track_stats_total_scan_fragments_time{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}[15m]))/sum by (node_id,comm_node_id,namespace) (rate(db_tier_tc_time_track_stats_total_scan_fragments_count{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}[15m])))/ (scalar(sum(sum by (node_id,comm_node_id,namespace) (rate(db_tier_tc_time_track_stats_total_scan_fragments_time{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}[15m]))/sum by (node_id,comm_node_id,namespace) (rate(db_tier_tc_time_track_stats_total_scan_fragments_count{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"}[15m])))))) * 100) > ((100/scalar(count(sum by (node_id,comm_node_id,namespace)(db_tier_tc_time_track_stats_total_scan_fragments_time{path_type="INTERNAL",namespace="<${CNDBTIER_NAMESPACE}>"})))) * 1.6)
Expression Validity	NA
SNMP Trap ID	3005
Affects Service (Y/N)	Y
Recommended Action	Cause: When the scan fragment for any particular data node is slow. Diagnostic Information: The alert indicates that the node is slow. Therefore, check the underlying infrastructure. Recovery: This alert is cleared automatically when the scan fragment time of the data node is comparable to that of the other data nodes. For any assistance, contact My Oracle Support.
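
The scan fragment condition can be read as follows: compute the average scan fragment time for each group (time divided by count), express it as a percentage of the sum of all averages, and flag a group when that percentage exceeds 1.6 times an equal share. This reading assumes that the expression multiplies by 100 and by 1.6. The Python sketch below follows that reading. The function name, keys, and values are hypothetical, and each key stands for one (node_id, comm_node_id, namespace) group.

```python
def slow_scan_groups(scan_time: dict[str, float],
                     scan_count: dict[str, float],
                     factor: float = 1.6) -> list[str]:
    """Return the groups whose share of average scan time exceeds (100 / n) * factor."""
    averages = {
        key: scan_time[key] / scan_count[key]
        for key in scan_time
        if scan_count.get(key, 0) > 0  # skip groups with no scans
    }
    total = sum(averages.values())
    if not averages or total == 0:
        return []
    threshold_percent = 100 / len(averages) * factor
    return [
        key
        for key, avg in averages.items()
        if avg / total * 100 > threshold_percent
    ]


times = {"1": 200, "2": 220, "3": 180, "4": 1000}
counts = {"1": 100, "2": 100, "3": 100, "4": 100}
print(slow_scan_groups(times, counts))  # ['4']
```

With four groups the equal share is 25% and the threshold is 40%. Group 4 accounts for about 62.5% of the summed averages, so it is the only one flagged.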

6.15 cnDBTier Certificate Expiry Alerts

This section provides details about cnDBTier certificate expiry alerts.

Table 6-41 DBTIER_CERTIFICATE_EXPIRY_INFO

Field	Details
Description	This alert is triggered with `info` severity whenever the certificate for a cnDBTier is set to expire within the next 90 days.
Summary	dbtier Certificate {{ $labels.certType }} for {{ $labels.hostname }} is expiring within 90 days for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	info
Condition	(db_tier_cert_expiry / 1000 - time()) > 2592000 and (db_tier_cert_expiry / 1000 - time()) <= 7776000
Expression Validity	NA
OID	1.3.6.1.4.1.323.5.3.50.1.2.2045
Metric Used	`db_tier_cert_expiry`
Affects Service (Y/N)	N
Recommended Action	Cause: This alert is triggered when any cnDBTier certificate is going to expire in the next 90 days. Diagnostic Information: This alert is triggered with `info` severity whenever the certificate for a cnDBTier is set to expire within the next 90 days. Recommended actions: Update the cnDBTier certificates by following the steps provided in the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
Available in OCI	Yes

Table 6-42 DBTIER_CERTIFICATE_EXPIRY_MAJOR

Field	Details
Description	This alert is triggered with `major` severity whenever the certificate for a cnDBTier is set to expire within the next 30 days.
Summary	dbtier Certificate {{ $labels.certType }} for {{ $labels.hostname }} is expiring within 30 days for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	major
Condition	(db_tier_cert_expiry / 1000 - time()) > 604800 and (db_tier_cert_expiry / 1000 - time()) <= 2592000
OID	1.3.6.1.4.1.323.5.3.50.1.2.2040
Metric Used	`db_tier_cert_expiry`
Expression Validity	NA
Affects Service (Y/N)	N
Recommended Action	Cause: This alert is triggered when any cnDBTier certificate is going to expire in the next 30 days. Diagnostic Information: This alert is triggered with `major` severity whenever the certificate for a cnDBTier is set to expire within the next 30 days. Recommended actions: Update the cnDBTier certificates by following the steps provided in the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide. Depending on the certificate type in the alert, follow the appropriate procedure: "Modifying cnDBTier Certificates to Establish TLS for Communication with NFs" or "Modifying HTTPS Certificates". If the issue persists, capture all the outputs for the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
Available in OCI	Yes

Table 6-43 DBTIER_CERTIFICATE_EXPIRY_CRITICAL

Field	Details
Description	This alert is triggered with `critical` severity whenever the certificate for a cnDBTier is set to expire within the next 7 days.
Summary	dbtier Certificate {{ $labels.certType }} for {{ $labels.hostname }} is expiring within 7 days for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(db_tier_cert_expiry / 1000 - time()) > 0 and (db_tier_cert_expiry / 1000 - time()) <= 604800
OID	1.3.6.1.4.1.323.5.3.50.1.2.2041
Metric Used	`db_tier_cert_expiry`
Expression Validity	NA
Affects Service (Y/N)	Y
Recommended Action	Cause: This alert is triggered when any cnDBTier certificate is going to expire in the next 7 days. Diagnostic Information: This alert is triggered with `critical` severity whenever the certificate for a cnDBTier is set to expire within the next 7 days. Recommended actions: Update the cnDBTier certificates by following the steps provided in the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide. Depending on the certificate type in the alert, follow the appropriate procedure: "Modifying cnDBTier Certificates to Establish TLS for Communication with NFs" or "Modifying HTTPS Certificates". If the issue persists, capture all the outputs for the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
Available in OCI	Yes

Table 6-44 DBTIER_CERTIFICATE_EXPIRED

Field	Details
Description	This alert is triggered with `critical` severity when any cnDBTier certificate has expired.
Summary	dbtier Certificate {{ $labels.certType }} for {{ $labels.hostname }} is expired for cnDBTier site {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	(db_tier_cert_expiry / 1000 - time()) <= 0
OID	1.3.6.1.4.1.323.5.3.50.1.2.2041
Metric Used	`db_tier_cert_expiry`
Expression Validity	NA
Affects Service (Y/N)	Y
Recommended Action	Cause: This alert is triggered when any cnDBTier certificate has expired. Diagnostic Information: This alert is triggered with `critical` severity when any cnDBTier certificate has expired. Recommended actions: Update the cnDBTier certificates by following the steps provided in the "Update Certificate" section in Oracle Communications Cloud Native Core, cnDBTier User Guide. Depending on the certificate type in the alert, follow the appropriate procedure: "Modifying cnDBTier Certificates to Establish TLS for Communication with NFs" or "Modifying HTTPS Certificates". If the issue persists, capture all the outputs for the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
Available in OCI	Yes
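
The four certificate conditions partition the remaining lifetime of a certificate into non-overlapping windows: 604800, 2592000, and 7776000 seconds correspond to 7, 30, and 90 days. The division by 1000 in the conditions suggests that `db_tier_cert_expiry` holds an epoch timestamp in milliseconds. The Python sketch below reproduces the windows under that assumption. The function name and the sample timestamps are hypothetical.

```python
DAY = 86400  # seconds in a day


def cert_expiry_alert(expiry_ms: int, now_s: float):
    """Return the alert name whose condition matches, or None if none does."""
    remaining_s = expiry_ms / 1000 - now_s
    if remaining_s <= 0:
        return "DBTIER_CERTIFICATE_EXPIRED"
    if remaining_s <= 7 * DAY:      # 604800 seconds
        return "DBTIER_CERTIFICATE_EXPIRY_CRITICAL"
    if remaining_s <= 30 * DAY:     # 2592000 seconds
        return "DBTIER_CERTIFICATE_EXPIRY_MAJOR"
    if remaining_s <= 90 * DAY:     # 7776000 seconds
        return "DBTIER_CERTIFICATE_EXPIRY_INFO"
    return None


NOW = 1_700_000_000  # arbitrary reference time in epoch seconds
print(cert_expiry_alert((NOW + 10 * DAY) * 1000, NOW))  # DBTIER_CERTIFICATE_EXPIRY_MAJOR
print(cert_expiry_alert((NOW + 7 * DAY) * 1000, NOW))   # DBTIER_CERTIFICATE_EXPIRY_CRITICAL
```

Because each window excludes the one below it, a certificate matches at most one of the four conditions at any given time.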

6.16 cnDBTier PVC Health Alerts

This section provides details about cnDBTier PVC health alerts.

Table 6-45 PVC_NOT_ACCESSIBLE

Field	Details
Description	This alert is triggered with `critical` severity when the value of the `db_tier_pvc_is_accesible` metric is 0. The value 0 indicates that the PVC is not accessible, and the value 1 indicates that the PVC is accessible.
Summary	PVC is not accessible on cnDBTier site {{ $labels.site_name }}
Severity	critical
Condition	db_tier_pvc_is_accesible == 0
OID	1.3.6.1.4.1.323.5.3.50.1.2.2029
Metric Used	db_tier_pvc_is_accesible
Expression Validity	1m
Affects Service (Y/N)	Y
Recommended Action	Cause: When the PVC is not accessible for read or write operations. Diagnostic Information: The `db_tier_pvc_is_accesible` metric indicates whether the PVC is accessible. Recommended steps: Verify the cluster and pod status. Run the following command to check the cluster status: `kubectl -n <cnDBTier Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show` Run the following command to check the pod status: `kubectl get pod -n <cnDBTier Namespace>` Retrieve the name of the pod whose PVC is not accessible from the hostname attribute of the db_tier_pvc_is_accessible metric. After retrieving the name of the pod, get the PVC associated with it and describe both the pod and the PVC. Look for the PVC bound status, mount errors, and events indicating a mount failure or volume timeout. To describe the pod, run the following command: `kubectl -n <cnDBTier Namespace> describe pod <pod name>` To retrieve the PVC associated with the pod, run the following command: `kubectl -n <cnDBTier Namespace> get pvc` To describe the PVC, run the following command: `kubectl -n <cnDBTier Namespace> describe pvc <pvc name>` Check the PVC mounting inside the pod. Get the mount_path from the db_tier_pvc_is_accesible metric. If mount_path is missing or empty, the PVC is not mounted properly. Run the following command to log in to the pod: `kubectl -n <cnDBTier Namespace> exec -it <pod name> -- bash` Then run the following commands inside the pod to check the mount path: `ls -l /var/occnedb` and `df -h | grep occnedb` Restart the pod. Sometimes a simple restart remounts the PVC correctly. Run the following command to restart the pod: `kubectl -n <cnDBTier Namespace> delete pod <pod name>` Check the logs of the pod. Run the following command to check the logs of the main container: `kubectl -n <cnDBTier Namespace> logs <pod name> -c <main container name>` Run the following command to check the logs of the infra monitor svc container: `kubectl -n <cnDBTier Namespace> logs <pod name> -c <db-infra-monitor-svc container name>` This alert is cleared automatically when the PVC metric db_tier_pvc_failure_count becomes zero. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

Table 6-46 PVC_STORAGE_FULL

Field	Details
Description	The PVC_STORAGE_FULL alert is triggered with critical severity when a pod's PVC reaches full capacity.
Summary	PVC is not accessible on cnDBTier site {{ $labels.site_name }}
Severity	critical
Condition	db_tier_pvc_is_accesible == 0
OID	1.3.6.1.4.1.323.5.3.50.1.2.2029
Metric Used	db_tier_pvc_is_accesible
Expression Validity	1m
Affects Service (Y/N)	Y
Recommended Action	Cause: This alert is triggered when the PVC reaches full capacity, preventing further write operations. Diagnostic Information: The system detects that the PVC has no available space, leading to storage-related failures. Recommended steps: Verify the cluster and pod status. Run the following command to check the cluster status: `kubectl -n <cnDBTier Namespace> exec -it ndbmgmd-0 -- ndb_mgm -e show` Run the following command to check the pod status: `kubectl get pod -n <cnDBTier Namespace>` Get the name of the pod whose PVC storage is full from the hostname attribute of the db_tier_volume_stats_used_bytes metric. Verify that the cnDBTier pods are configured with the resources (CPUs, memory, and PVC size) as per cnDBTier dimensions. If the PVC is not configured as per cnDBTier dimensions, increase the database capacity by following the appropriate scaling procedure: Vertical Scaling or Horizontal Scaling. Ensure that the application managing the PVC properly handles storage utilization. This alert is cleared automatically when sufficient space becomes available. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

6.17 cnDBTier Backup Manager Svc Down Alerts

This section provides details about cnDBTier backup manager Svc down alerts.

Table 6-47 DB_BACKUP_MANAGER_SVC_DOWN

Field	Details
Description	This alert is triggered with `critical` severity when db_backup_manager_svc pod is down.
Summary	PVC is not accessible on cnDBTier site {{ $labels.site_name }}
Severity	critical
Condition	kube_deployment_status_replicas_available{deployment=~".*db-backup-manager-svc.*"} == 0
OID	1.3.6.1.4.1.323.5.3.50.1.2.2039
Metric Used	kube_deployment_status_replicas_available{deployment=~".*db-backup-manager-svc.*"}
Expression Validity	1m
Affects Service (Y/N)	Y
Recommended Action	Cause: When the `db_backup_manager_svc` pod is down. Diagnostic Information: The system detects that the `db_backup_manager_svc` pod is not up and is unable to connect to the database. Recommended steps: Check the db-backup-manager service pod status. Look for CrashLoopBackOff, ImagePullBackOff, OOMKilled, and init container failures. Run the following command to check the pod status: `kubectl -n <cnDBTier namespace> get pods | grep "db-backup-manager-svc"` Run the following command to describe the pod: `kubectl -n <cnDBTier namespace> describe pod <backup-manager-pod name>` Check the deployment status. Look for `Available replicas: 0` and check the `Events` section for scheduling or image pull errors. Run the following command to get the `db-backup-manager-svc` deployment: `kubectl -n <cnDBTier namespace> get deployment | grep "db-backup-manager-svc"` Run the following command to describe the deployment: `kubectl -n <cnDBTier namespace> describe deployment <backup-manager-svc deployment>` Check the DB Backup Manager service pod logs for repeated database connectivity issues. Run the following command to get the `db-backup-manager-svc` pod: `kubectl -n <cnDBTier namespace> get pods | grep "db-backup-manager-svc"` Run the following command to check the logs of the `db-backup-manager-svc` pod: `kubectl -n <cnDBTier namespace> logs <db-backup-manager-svc pod name>` If the issue persists, capture all the outputs for the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
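
When many pods are listed, the pod status check described above can be scripted. The Python sketch below scans text in the standard `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE) for the failure states named in the recommended steps. The function name, the marker list, and the pod names in the sample are hypothetical, and the sample text is not real cnDBTier output.

```python
# Status substrings treated as failures in this sketch. "Init:Error" stands in
# for init container failures; extend the tuple as needed.
FAILURE_MARKERS = ("CrashLoopBackOff", "ImagePullBackOff", "OOMKilled", "Init:Error")


def failing_pods(get_pods_output: str) -> dict[str, str]:
    """Map pod name to status for pods whose STATUS column shows a failure."""
    failing = {}
    for line in get_pods_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        name, status = fields[0], fields[2]
        if any(marker in status for marker in FAILURE_MARKERS):
            failing[name] = status
    return failing


# Hypothetical listing, shaped like `kubectl get pods` output.
SAMPLE = """\
NAME                                  READY   STATUS             RESTARTS   AGE
example-db-backup-manager-svc-abc12   0/1     CrashLoopBackOff   12         40m
example-db-monitor-svc-def34          1/1     Running            0          3d
"""

print(failing_pods(SAMPLE))
```

The output of `kubectl -n <cnDBTier namespace> get pods` can be passed to the function as a string. Pods that are flagged still need `kubectl describe` and `kubectl logs` for the root cause.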

6.18 cnDBTier Forced Switchover Disabled Alerts

This section provides details about cnDBTier forced switchover disabled alerts.

Table 6-48 DB_TIER_FORCED_SWITCHOVER_DISABLED

Field	Details
Description	This alert is triggered with `critical` severity when switchover is disabled forcefully.
Summary	dbtier switchover is disabled forcefully for cnDBTier {{ $labels.site_name }} and kubernetes namespace {{ $labels.namespace }}
Severity	critical
Condition	kube_deployment_status_replicas_available{deployment=~".*db-backup-manager-svc.*"} == 0
OID	1.3.6.1.4.1.323.5.3.50.1.2.2039
Metric Used	kube_deployment_status_replicas_available{deployment=~".*db-backup-manager-svc.*"}
Expression Validity	1m
Affects Service (Y/N)	Y
Recommended Action	Cause: When switchover is disabled forcefully. Diagnostic Information: The alert informs the operator that switchover is currently disabled and needs to be updated. Recommended steps: Check the DBTIER_REPL_SITE_INFO table for the `stop_repl_switchover` value. If the value of `stop_repl_switchover` for the current site is 1, switchover is disabled forcefully. Get the site_name, mate_site_name, and replchannel_group_id from the attributes of the `db_tier_stop_repl_switchover` metric. Run the following command to log in to MySQL and check the replication_info.DBTIER_REPL_SITE_INFO table: `kubectl -n <cnDBTier namespace> exec -it ndbmysqld-0 -- mysql -h127.0.0.1 -uroot -p<root user password>` Run the following query to get the value of the stop_repl_switchover column: `select stop_repl_switchover from replication_info.DBTIER_REPL_SITE_INFO where site_name='<site name>' and mate_site_name='<mate site name>' and replchannel_group_id=<replication channel grp id>;` The alert is cleared automatically once the switchover is enabled. If the `stop_repl_switchover` value retrieved from the above step is 1 and you want to re-enable the switchover, call the API mentioned in the cnDBTier Switchover APIs section. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool for capturing logs. For more information on how to collect logs using the Data Collector tool, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
