Remediation for Database Service Events

HEALTH.DB_GUEST.FILESYSTEM.FREE_SPACE

Event Name

HEALTH.DB_GUEST.FILESYSTEM.FREE_SPACE

Event Description

This event is reported when VM guest file system free space falls below 10%, as determined by the operating system df(1) command, for the following file systems:

  • /
  • /u01
  • /u02
  • /var
  • /tmp

Problem Statement

One or more VM guest file systems has less than 10% free space.

Risk

Insufficient VM guest file system free space can cause disk space allocation failure, which can result in wide-ranging errors and failures in Oracle software (Database, Clusterware, Cloud Tooling).

Action/Repair

Oracle Cloud and the DCS Agent automatically purge old log files and trace files created by cloud tooling in order to reclaim file system space.

If the automatic file system space reclamation utilities cannot sufficiently purge old files to clear this event, then perform the following actions:

  1. Remove unneeded files and/or directories created manually or by customer-installed applications or utilities. Files created by customer-installed software are outside the scope of Oracle's automatic file system space reclamation utilities. The following operating system command, run as the root user, is useful for identifying directories consuming excessive disk space:
    sudo du -hx <file system mount point> | sort -hr

    Only remove files or directories you are certain can be safely removed. Example commands for locating large files and directories appear after this list.

  2. Set the automatic purging policy using cloud tooling. For more information, see Autologcleanpolicy Commands.
  3. Open a service request to receive additional guidance about reducing file system space use.
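
For example, the following commands can help identify the directories consuming the most space and any unusually large files. This is a minimal sketch assuming the /u01 mount point; substitute the affected file system and adjust the size threshold as needed:

  sudo du -hx /u01 | sort -hr | head -20
  sudo find /u01 -xdev -type f -size +1G -exec ls -lh {} \;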

AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN

Event Name

AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN

Event Description

An event of type CRITICAL is created when Cluster Ready Services (CRS) is detected to be down.

Problem Statement

The Cluster Ready Services (CRS) stack is offline or has failed.

Risk

If the CRS is offline on a node, the node cannot provide database services for the application.

Action/Repair

  1. Check whether CRS was stopped by your administrator, as part of a planned maintenance event, or as part of a scale up or down of local storage.
    1. The following patching events will stop CRS:
      1. GRID Update
      2. Update of Guest
      3. Update of Host
  2. If CRS has stopped unexpectedly, the current status can be checked by issuing the crsctl check crs command.
    1. If the node is not responding, the VM node may be rebooting. Wait for the node reboot to finish; CRS is normally started automatically through the init process.
  3. If CRS is still down, investigate the cause of the failure by referring to the alert.log found in /u01/app/grid/diag/crs/<node_name>/crs/trace. Review the log entries corresponding to the date/time of the down event and act on any potential remediation.
  4. Restart CRS by issuing the crsctl start crs command (see the example following this list).
  5. A successful restart of CRS will generate the clearing event: AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN_CLEARED
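
A minimal sketch of steps 2 and 4, run as the root user on the affected node (crsctl may need to be invoked with its full path under the Grid Infrastructure home if it is not in root's PATH):

  [opc@vm ~] sudo su -
  [root@vm ~] crsctl check crs
  [root@vm ~] crsctl start crs
  [root@vm ~] crsctl check crs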

Clearing Event

AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN_CLEARED

Clearing Event Description

An INFORMATION event is created once the CRS is successfully started.

AVAILABILITY.DB_GUEST.CRS_INSTANCE.EVICTION

Event Name

AVAILABILITY.DB_GUEST.CRS_INSTANCE.EVICTION

Event Description

An event of type CRITICAL is created when Cluster Ready Services (CRS) evicts a node from the cluster. The CRS alert.log is parsed for the CRS-1632 error indicating that a node is being removed from the cluster.

Problem Statement

The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of this node eviction is to maintain the overall health of the cluster by removing unhealthy members.

Risk

During the time it takes to restart the evicted node, the node cannot provide database services for the application.

Action/Repair

A CRS node eviction can be caused by the OCSSD (also known as the CSS daemon), CSSDAGENT, or CSSDMONITOR processes. Remediation requires determining which process was responsible for the node eviction and reviewing the relevant log files. Common causes of an OCSSD eviction are network failures or latencies, I/O issues with the CSS voting disks, or a member kill escalation. CSSDAGENT or CSSDMONITOR evictions can be caused by an OS scheduler problem or a hung thread within the CSS daemon. Log files to review include the clusterware alert log, cssdagent log, cssdmonitor log, ocssd log, lastgasp log, /var/log/messages, CHM/OS Watcher data, and opatch lsinventory detail.

For more information on collecting diagnostic files, see Autonomous Health Framework (AHF) Trace File Analyzer (TFA) & ORAchk/EXAchk. For more information on troubleshooting CRS node evictions, see Troubleshooting Clusterware Node Evictions (Reboots).
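
As a starting point for gathering the relevant logs, the same tfactl collection pattern used elsewhere in this document can be applied to the eviction time window; a sketch, with the host name and eviction time as placeholders:

  [grid@vm ~] tfactl diagcollect -crs -os -node <hostName> 
      -from "<eviction time adjusted for local vm timezone> - 30 minutes" 
      -to "<eviction time adjusted for local vm timezone> + 10 minutes"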

AVAILABILITY.DB_CLUSTER.SCAN_LISTENER.DOWN

Event Name

AVAILABILITY.DB_CLUSTER.SCAN_LISTENER.DOWN

Event Description

A DOWN event is created when a SCAN listener goes down. The event is of type INFORMATION when a SCAN listener is shut down due to user action, such as with the Server Control Utility (srvctl) or Listener Control (lsnrctl) commands, or any Oracle Cloud maintenance action that uses those commands, such as performing a grid infrastructure software update. The event is of type CRITICAL when a SCAN listener goes down unexpectedly. A corresponding DOWN_CLEARED event is created when a SCAN listener is started.

There are three SCAN listeners per cluster called LISTENER_SCAN[1,2,3].

Problem Statement

A SCAN listener is down and unable to accept application connections.

Risk

If all SCAN listeners are down, application connections to the database through the SCAN listener will fail.

Action/Repair

Start the SCAN listener to receive the DOWN_CLEARED event.

DOWN event of type INFORMATION

  1. If the event was caused by an Oracle Cloud maintenance action, such as performing a grid infrastructure software update, then no action is required. The affected SCAN listener will automatically fail over to an available instance.
  2. If the event was caused by user action, then start the SCAN listener at the next opportunity.

DOWN event of type CRITICAL

  1. Check SCAN status and restart the SCAN listener
    • Log in to the VM as the opc user and sudo to the grid user:
      [opc@vm ~] sudo su - grid
    • Check the SCAN listeners status on any node:
      [grid@vm ~] srvctl status scan_listener
    • Start the SCAN listener:
      [grid@vm ~] srvctl start scan_listener
    • Recheck the SCAN listener status on any node. If the SCAN listener is still down, investigate the cause of the failure:
      1. Collect both the CRS and OS logs from 30 minutes before to 10 minutes after the event time for the <hostName> indicated in the event. Note that the time in the event payload is always provided in UTC; for the tfactl collection, adjust the time to the timezone of the VM cluster.
        [grid@vm ~] tfactl diagcollect -crs -os -node <hostName> 
            -from "<eventTime adjusted for local vm timezone> - 30 minutes" 
            -to "<eventTime adjusted for local vm timezone> + 10 minutes"
      2. SCAN listener issues are logged under:
        /u01/app/grid/diag/tnslsnr/<hostName>/<listenerName>/trace

AVAILABILITY.DB_GUEST.CLIENT_LISTENER.DOWN

Event Name

AVAILABILITY.DB_GUEST.CLIENT_LISTENER.DOWN

Event Description

A DOWN event is created when a client listener goes down. The event is of type INFORMATION when a client listener is shut down due to user action, such as with the Server Control Utility (srvctl) or Listener Control (lsnrctl) commands, or any Oracle Cloud maintenance action that uses those commands, such as performing a grid infrastructure software update. The event is of type CRITICAL when a client listener goes down unexpectedly. A corresponding DOWN_CLEARED event is created when a client listener is started.

There is one client listener per node, each called LISTENER.

Problem Statement

A client listener is down and unable to accept application connections.

Risk

If the node's client listener is down, the database instances on the node cannot provide services for the application.

If the client listener is down on all nodes, any application that connects to any database using the SCAN or VIP will fail.

Action/Repair

Start the client listener to receive the DOWN_CLEARED event.

DOWN event of type INFORMATION

  1. If the event was caused by an Oracle Cloud maintenance action, such as performing a grid infrastructure software update, then no action is required. The affected client listener will automatically restart when maintenance affecting the grid instance is complete.
  2. If the event was caused by user action, then start the client listener at the next opportunity.

DOWN event of type CRITICAL

  1. Check client listener status and restart the client listener:
    • Log in to the VM as the opc user and sudo to the grid user:
      [opc@vm ~] sudo su - grid
    • Check the client listener status on any node:
      [grid@vm ~] srvctl status listener
    • Start the client listener:
      [grid@vm ~] srvctl start listener
    • Recheck the client listener status on any node. If the client listener is still down, investigate the cause of the failure:
      1. Use tfactl to collect both the CRS and OS logs from 30 minutes before to 10 minutes after the event time for the <hostName> indicated in the event. Note that the time in the event payload is always provided in UTC; for the tfactl collection, adjust the time to the timezone of the VM cluster.
        [grid@vm ~] tfactl diagcollect -crs -os -node <hostName> 
            -from "<eventTime adjusted for local vm timezone> - 30 minutes" 
            -to "<eventTime adjusted for local vm timezone> + 10 minutes"
      2. Review the listener log located under:
        /u01/app/grid/diag/tnslsnr/<hostName>/<listenerName>/trace
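
As an additional local check on the affected node, lsnrctl can report the listener state directly; a minimal sketch assuming the default client listener name LISTENER:

  [grid@vm ~] lsnrctl status LISTENER
  [grid@vm ~] lsnrctl services LISTENER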

AVAILABILITY.DB_GUEST.CDB_INSTANCE.DOWN

Event Name

AVAILABILITY.DB_GUEST.CDB_INSTANCE.DOWN

Event Description

A DOWN event is created when a database instance goes down. The event is of type INFORMATION when a database instance is shut down due to user action, such as with the SQL*Plus (sqlplus) or Server Control Utility (srvctl) commands, or any Oracle Cloud maintenance action that uses those commands, such as performing a database home software update. The event is of type CRITICAL when a database instance goes down unexpectedly. A corresponding DOWN_CLEARED event is created when a database instance is started.

Problem Statement

A database instance has gone down.

Risk

A database instance has gone down, which may result in reduced performance if database instances remain available on other nodes in the cluster, or in complete downtime if database instances on all nodes are down.

Action/Repair

Start the database instance to receive the DOWN_CLEARED event.

DOWN event of type INFORMATION

  1. If the event was caused by an Oracle Cloud maintenance action, such as performing a database home software update, then no action is required. The affected database instance will automatically restart when maintenance affecting the instance is complete.
  2. If the event was caused by user action, then start the affected database instance at the next opportunity.

DOWN event of type CRITICAL

  1. Check database status and restart the down database instance.
    1. Log in to the VM as the oracle user.
    2. Set the environment:
      [oracle@vm ~] . <dbName>.env
    3. Check the database status:
      [oracle@vm ~] srvctl status database -db <dbName>
    4. Start the database instance:
      [oracle@vm ~] srvctl start instance -db <dbName> -instance <instanceName>
  2. Investigate the cause of the database instance failure.
    1. Review Trace File Analyzer (TFA) events for the database:
      [oracle@vm ~] tfactl events -database <dbName> -instance <instanceName>
    2. Review the database alert log located at:
      $ORACLE_BASE/diag/rdbms/<dbName>/<instanceName>/trace/alert_<instanceName>.log
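
If the instance name for the affected node is not known, it can be confirmed before the restart; a minimal sketch, run as the oracle user with the database environment already set:

  [oracle@vm ~] srvctl config database -db <dbName>
  [oracle@vm ~] srvctl status database -db <dbName> -verbose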

HEALTH.DB_CLUSTER.CDB.CORRUPTION

Event Name

HEALTH.DB_CLUSTER.CDB.CORRUPTION

Event Description

Database corruption has been detected on your primary or standby database. The database alert.log is parsed for any specific errors that are indicative of physical block corruptions, logical block corruptions, or logical block corruptions caused by lost writes.

Problem Statement

Corruptions can lead to application or database errors and, in the worst case, result in significant data loss if not addressed promptly.

A corrupt block is a block that was changed so that it differs from what Oracle Database expects to find. Block corruptions can be categorized as physical or logical:

  • In a physical block corruption, which is also called a media corruption, the database does not recognize the block at all; the checksum is invalid or the block contains all zeros. An example of a more sophisticated block corruption is when the block header and footer do not match.
  • In a logical block corruption, the contents of the block are physically sound and pass the physical block checks; however, the block can be logically inconsistent. Examples of logical block corruption include incorrect block type, incorrect data or redo block sequence number, corruption of a row piece or index entry, or data dictionary corruptions.

Block corruptions can also be divided into interblock corruption and intrablock corruption:

  • In an intra-block corruption, the corruption occurs in the block itself and can be either a physical or a logical block corruption.
  • In an inter-block corruption, the corruption occurs between blocks and can only be a logical block corruption.

Oracle checks for the following errors in the alert.log:

  • ORA-01578
  • ORA-00752
  • ORA-00753
  • ORA-00600 [3020]
  • ORA-00600 [kdsgrp1]
  • ORA-00600 [kclchkblk_3]
  • ORA-00600 [13013]
  • ORA-00600 [5463]

Risk

A data corruption outage occurs when a hardware, software, or network component causes corrupt data to be read or written. The service-level impact of a data corruption outage may vary, from a small portion of the application or database (down to a single database block) to a large portion of the application or database (making it essentially unusable). If remediation action is not taken promptly, potential downtime and data loss can increase.

Action/Repair

This event notification currently triggers on physical block corruptions (ORA-01578), lost writes (ORA-00752, ORA-00753, and ORA-00600 with first argument 3020), and logical corruptions (typically detected from ORA-00600 with a first argument of kdsgrp1, kclchkblk_3, 13013, or 5463).

We recommend the following steps:

  1. Confirm that these corruptions were reported in the alert.log and trace files. Log a Service Request (SR) with the latest EXAchk report, an excerpt of the alert.log and trace file containing the corruption errors, any history of recent application, database, or software changes, and any system, clusterware, and database logs for the same time period. For all these cases, a TFA collection should be available and attached to the SR.
  2. For more information on repair recommendations, see Primary Note for Handling Oracle Database Corruption Issues.

For physical corruptions or ORA-1578 errors, the following notes will be helpful:

Note:

RMAN can be used to recover one or many data blocks that are physically corrupted. Also, when Active Data Guard with real-time apply is in use, physical data corruptions are typically repaired automatically by automatic block repair.
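
As an illustration only (not a substitute for the referenced notes or Oracle Support guidance), RMAN can validate the database and then repair the corrupt blocks it records in V$DATABASE_BLOCK_CORRUPTION:

  [oracle@vm ~] . <dbName>.env
  [oracle@vm ~] rman target /
  RMAN> validate check logical database;
  RMAN> recover corruption list;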

Logical corruptions caused by lost writes (ORA-00752, ORA-00753, and ORA-00600 with first argument 3020) on the primary or standby databases are detected on the primary database or by the standby's redo apply process. The following notes will be helpful:

For logical corruptions (typically detected from ORA-00600 with arguments of kdsgrp1, kclchkblk_3, 13013, or 5463)

HEALTH.DB_CLUSTER.CDB.ARCHIVER_HANG

Event Name

HEALTH.DB_CLUSTER.CDB.ARCHIVER_HANG

Event Description

An event of type CRITICAL is created if a container database (CDB) is either unable to archive the active online redo log or unable to archive the active online redo log fast enough to the log archive destinations.

Problem Statement

A CDB RAC instance may temporarily or permanently stall due to the log writer's (LGWR) inability to write the log buffers to an online redo log. This occurs because all online redo logs need archiving. Once the archiver (ARC) can archive at least one online redo log, LGWR can resume writing the log buffers to the online redo logs and the application impact is alleviated.

Risk

If the archiver hang is temporary, it can result in a brief application brownout or stall for application processes attempting to commit their database changes. If the archiver is not unblocked, applications can experience extended delays in processing.

Action/Repair

  • To determine the hourly redo log switch frequency for each thread/instance, see Script To Find Redolog Switch History And Find Archivelog Size For Each Instances In RAC (Doc ID 2373477.1).
    • If any hourly bucket shows more than 12 switches, consider resizing the online redo logs. See the "Resize the online redo logs" steps below. A sample query to compute the hourly switch rate appears after this list.
  • If the database hangs are temporary, the archiver may be unable to keep up with the redo being generated.
    • Check the alert.log ($ORACLE_BASE/diag/rdbms/<dbName>/<instanceName>/trace/alert_<instanceName>.log) for "All online logs need archiving". Multiple occurrences in a short period point to two possible solutions:
      1. If the number of redo log groups per thread is less than 4, consider adding additional log groups to reach 4. See the "Add a redo log group" steps below.
      2. The other possible solution is to resize the redo logs. See the "Resize the online redo logs" steps below.
  • For sizing guidelines for Data Guard and non Data Guard, see Configure Online Redo Logs Appropriately.
  • Add a redo log group for each thread. The additional redo log should equal the current log size.
    1. Use the following query:
      select max(group#) Ending_group_number, thread#, 
          count(*) number_of_groups_per_thread, 
          bytes redo_size_in_bytes from v$log group by thread#,bytes;
    2. Add one new group per thread using the same size as the current redo logs.
      alter database add logfile thread <thread_number> 
          group <max group + 1> ('<DATA_DISKGROUP>') size <redo_size_in_bytes>;
  • Resize the online redo logs by adding larger redo logs and dropping the current smaller redo logs.
    1. Use the following query:
      select max(group#) Ending_group_number, thread#, 
          count(*) number_of_groups_per_thread, 
          bytes redo_size_in_bytes from v$log group by thread#,bytes;
    2. For each thread, add the same number of redo log groups (number_of_groups_per_thread) as currently exist. The new_redo_size_in_bytes should be based on Configure Online Redo Logs Appropriately.
      1. alter database add logfile thread <thread_number> 
            group <max group + 1> ('<DATA_DISKGROUP>') size <new_redo_size_in_bytes>;
      2. The original smaller redo logs should be deleted. A redo log can only be deleted if its status is INACTIVE. To determine the status of the redo logs, issue the following query:
        select group#, thread#, status, bytes from v$log order by bytes, group#, thread#;

        Delete the original smaller redo logs:

        alter database drop logfile <group#>;
  • If the database is hung, the primary and alternate log archive destinations may be full. The queries after this list can help confirm this.
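
The following queries, run in SQL*Plus as a privileged database user, are one way to check the log archive destinations (including the Fast Recovery Area, if it is used as an archive destination) and to review the hourly redo log switch rate per thread. This is a sketch; adjust the column lists as needed:

  select dest_name, status, error from v$archive_dest_status where status <> 'INACTIVE';

  select name, space_limit/1024/1024/1024 limit_gb, space_used/1024/1024/1024 used_gb 
      from v$recovery_file_dest;

  select thread#, to_char(first_time,'YYYY-MM-DD HH24') hour, count(*) switches 
      from v$log_history 
      group by thread#, to_char(first_time,'YYYY-MM-DD HH24') 
      order by 2, 1;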

HEALTH.DB_CLUSTER.CDB.DATABASE_HANG

Event Name

HEALTH.DB_CLUSTER.CDB.DATABASE_HANG

Event Description

An event of type CRITICAL is created when a process/session hang is detected in the container database (CDB).

Problem Statement

Hang management detected a process hang and generated an ORA-32701 error message. Additionally, this event may be raised if the Diagnostic Process (DIA0) detects a hang in a critical database process.

Risk

A hang can indicate resource, OS, or application coding issues.

Action/Repair

Investigate the cause of the session hang.

  • Review TFA events for the database for the following message patterns corresponding to the date/time of the event: ORA-32701, "DIA0 Critical Database Process Blocked", or "DIA0 Critical Database Process As Root".
    tfactl events -database <dbName> -instance <instanceName>
  • Review the associated alert.log at the following location:
    $ORACLE_BASE/diag/rdbms/<dbName>/<instanceName>/trace/alert_<instanceName>.log
  • For ORA-32701: An overloaded system can cause slow progress, which can be interpreted as a hang. The hang manager may attempt to resolve the hang by terminating the final blocker process. A query to list currently blocked sessions and their blockers appears after this list.
  • For DIA0 Critical Database Process messages: Review the related diagnostic lines indicating the process and the reason for the hang.
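
To list the sessions that are currently blocked and which session is blocking them, a query such as the following (a sketch, run as a privileged database user) can help:

  select sid, serial#, blocking_session, event, seconds_in_wait 
      from v$session 
      where blocking_session is not null;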

HEALTH.DB_CLUSTER.CDB.BACKUP_FAILURE

Event Name

HEALTH.DB_CLUSTER.CDB.BACKUP_FAILURE

Event Description

An event of type CRITICAL is created if there is a CDB backup with a FAILED status reported in the v$rman_status view.

Problem Statement

A daily incremental backup of the CDB failed.

Risk

A failure of the backup can compromise the ability to use the backups for restore/recoverability of the database. The Recovery Point Objective (RPO) and the Recovery Time Objective (RTO) can be impacted.

Action/Repair

Review the RMAN logs corresponding to the date/time of the event. Note that the event time stamp eventTime is in UTC; adjust as necessary for the VM's timezone.
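
To confirm which backup operation failed and when, the v$rman_status view referenced in the event description can be queried; a minimal sketch, run in SQL*Plus as the oracle user:

  select operation, status, start_time, end_time 
      from v$rman_status 
      where status = 'FAILED' 
      order by start_time desc;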

For Oracle managed backups:

  • RMAN output can be found at /opt/oracle/dcs/log/<hostname>/rman.
  • Review the log for any failures:
    • If the failure is due to an external event outside of RMAN (for example, the backup location being full or a networking issue), resolve the external issue.
    • For other RMAN script errors, collect the diagnostic logs and open a Service Request.
      dbcli collect-diagnostics -h
      Usage: collect-diagnostics [options]
        Options:
          --components, -c
             Supported components: [all, dcs, crs, acfs, asm, db]
             all -- Collects diagnosis for all supported components [all, dcs, crs, acfs, asm, db]
             dcs -- Collects diagnosis for dcs
             crs -- Collects diagnosis for crs
             acfs -- Collects diagnosis for acfs
             asm -- Collects diagnosis for asm.
             db -- Collects diagnosis for db.
             For multiple parameter values, follow the example as "-c c1 c2"
             Default: [dcs]
          --dbNames, -d
             Comma separated database names. Valid only if 'db' or 'all' specified in Components list.
          --endTime, -et
             End time of diagnostic logs. Please give time in yyyy-MM-dd HH:mm:ss format
          --help, -h
             get help
          --json, -j
             json output
          --objectstoreuri, -ou
             Pre Authenticated Request URI
          --redaction, -r
             Diagnostic logs redaction. Might take longer time with some components.
          --startTime, -st
             Start time of diagnostic logs. Please give time in yyyy-MM-dd HH:mm:ss format
  • If the issue is transient or is resolved, then take a new incremental backup. For more information, see Back Up a Database Using the Console.

For customer owned and managed backup taken through RMAN:

  • Review the RMAN logs for the backup.

HEALTH.DB_CLUSTER.DISK_GROUP.FREE_SPACE

Event Name

HEALTH.DB_CLUSTER.DISK_GROUP.FREE_SPACE

Event Description

An event of type CRITICAL is created when an ASM disk group reaches space usage of 90% or higher. An event of type INFORMATION is created when the ASM disk group space usage drops below 90%.

Problem Statement

ASM disk group space usage is at or exceeds 90%.

Risk

Insufficient ASM disk group space can cause database creation failure, tablespace and data file creation failure, automatic data file extension failure, or ASM rebalance failure.

Action/Repair

ASM disk group used space is determined by running the following query while connected to the ASM instance.

sudo su - grid
sqlplus / as sysasm
select 'ora.'||name||'.dg', total_mb, free_mb, 
    round ((1-(free_mb/total_mb))*100,2) pct_used from v$asm_diskgroup;
NAME             TOTAL_MB    FREE_MB   PCT_USED
---------------- ---------- ---------- ----------
ora.DATAC1.dg     75497472    7408292      90.19
ora.RECOC1.dg     18874368   17720208       6.11
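
Alternatively, roughly the same information is available from the asmcmd utility while logged in as the grid user:

  [grid@vm ~] asmcmd lsdg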

ASM disk group capacity can be increased in the following ways:

  1. Scale the VM Cluster storage to add more ASM disk group capacity. For more information, see Scale Up the Storage for a Virtual Machine DB System.

DATA disk group space use can be reduced in the following ways:

  1. Drop unused data files and temp files from databases. For more information, see Dropping Data Files.

RECO disk group space use can be reduced in the following ways:

  1. Drop unnecessary Guaranteed Restore Points. For more information, see Using Normal and Guaranteed Restore Points.
  2. Delete archived redo logs or database backups already backed up outside the Fast Recovery Area (FRA). For more information, see Maintaining the Fast Recovery Area. A sample sketch appears after this list.
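
For illustration, the following commands show one way to review restore points and to delete archived redo logs that RMAN has already backed up. Verify your backup and retention requirements before deleting anything, and adjust the device type and backup count to match your configuration:

  sqlplus / as sysdba
  SQL> select name, guarantee_flashback_database, time from v$restore_point;
  SQL> drop restore point <restore_point_name>;

  rman target /
  RMAN> delete archivelog all backed up 1 times to device type sbt;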