Remediation for Database Service Events
This article describes the fixes needed for problems encountered while using Database Service Events.
The following remediations are available:
- HEALTH.DB_GUEST.FILESYSTEM.FREE_SPACE
- AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN
- AVAILABILITY.DB_GUEST.CRS_INSTANCE.EVICTION
- AVAILABILITY.DB_CLUSTER.SCAN_LISTENER.DOWN
- AVAILABILITY.DB_GUEST.CLIENT_LISTENER.DOWN
- AVAILABILITY.DB_GUEST.CDB_INSTANCE.DOWN
- HEALTH.DB_CLUSTER.CDB.CORRUPTION
- HEALTH.DB_CLUSTER.CDB.ARCHIVER_HANG
- HEALTH.DB_CLUSTER.CDB.DATABASE_HANG
- HEALTH.DB_CLUSTER.CDB.BACKUP_FAILURE
- HEALTH.DB_CLUSTER.DISK_GROUP.FREE_SPACE
HEALTH.DB_GUEST.FILESYSTEM.FREE_SPACE
Event Name
HEALTH.DB_GUEST.FILESYSTEM.FREE_SPACE
Event Description
This event is reported when VM guest file system free space falls below 10% free, as determined by the operating system df(1)
command, for the following file systems:
/
/u01
/u02
/var
/tmp
Problem Statement
One or more VM guest file systems has free space below 10% free.
Risk
Insufficient VM guest file system free space can cause disk space allocation failure, which can result in wide-ranging errors and failures in Oracle software (Database, Clusterware, Cloud Tooling).
Action/Repair
Oracle Cloud and DCS Agent run automatically to purge old log files and trace files created by cloud tooling to reclaim file system space.
If the automatic file system space reclamation utilities cannot sufficiently purge old files to clear this event, then perform the following actions:
- Remove unneeded files and/or directories created manually or by customer-installed applications or utilities. Files created by customer-installed software are outside the scope of Oracle's automatic file system space reclamation utilities. The following operating system command, run as the
root
user, is useful for identifying directories consuming excessive disk space:sudo du -hx <file system mount point> | sort -hr
Only remove files or directories you are certain can be safely removed.
- Set the automatic purging policy using cloud tooling. For more information, see Autologcleanpolicy Commands.
- Open service request to receive additional guidance about reducing file system space use.
AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN
Event Name
AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN
Event Description
An event of type CRITICAL is created when the Cluster Ready Service (CRS) is detected to be down.
Problem Statement
The Cluster Ready Stack is in an offline state or has failed.
Risk
If the CRS is offline on a node, the node cannot provide database services for the application.
Action/Repair
- Check if CRS was stopped by your administrator, as part of a planned maintenance event, or a scale up or down of local storage
- The following patching events will stop CRS
- GRID Update
- Update of Guest
- Update of Host
- The following patching events will stop CRS
- If CRS has stopped unexpectedly, the current status can be checked by issuing the
crsctl check crs
command.- If the node is not responding, the VM node may be rebooting. Wait for the node reboot to finish, CRS will normally be started through the
init
process.
- If the node is not responding, the VM node may be rebooting. Wait for the node reboot to finish, CRS will normally be started through the
- If CRS is still down, investigate the cause of the failure by referring to the
alert.log
found in/u01/app/grid/diag/crs/<node_name>/crs/trace
. Review the log entries corresponding to the date/time of the down event and act on any potential remediation. - Restart the CRS, by issuing the
crsctl start crs
command. - A successful restart of CRS will generate the clearing event: AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN_CLEARED
Clearing Event
AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN_CLEARED
Clearing Event Description
An INFORMATION event is created once the CRS is successfully started.
AVAILABILITY.DB_GUEST.CRS_INSTANCE.EVICTION
Event Name
AVAILABILITY.DB_GUEST.CRS_INSTANCE.EVICTION
Event Description
An event of type CRITICAL is created when the Cluster Ready Service (CRS) evicts a node from the cluster. The CRS alert.log
is parsed for the CRS-1632
error indicating that a node is being removed from the cluster.
Problem Statement
The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin
process. The purpose of this node eviction is to maintain the overall health of the cluster by removing unhealthy members.
Risk
During the time it takes to restart the evicted node, the node cannot provide database services for the application.
Action/Repair
A CRS node eviction could be caused by OCSSD (aka CSS daemon), CSSDAGENT or CSSDMONITOR processes. This requires determining which process was responsible for the node eviction and reviewing the relevant log files. Common causes of OCSSD eviction are network failures/latencies, IO issues with CSS voting disks, a member kill escalation. CSSDAGENT or CSSDMONITOR evictions could be OS scheduler problem or a hung thread within CSS daemon. Log files to review include clusterware alert log, cssdagent log, cssdmonitor log, ocssd log, lastgasp log, /var/log/messages, CHM/OS Watcher data, and opatch lsinventory detail.
For more information on collecting files together, see Autonomous Health Framework (AHF) Trace File Analyzer (TFA) & ORAchk/EXAchk . For more information on troubleshooting CRS node eviction, see Troubleshooting Clusterware Node Evictions (Reboots).
AVAILABILITY.DB_CLUSTER.SCAN_LISTENER.DOWN
Event Name
AVAILABILITY.DB_CLUSTER.SCAN_LISTENER.DOWN
Event Description
A DOWN event is created when a SCAN listener goes down. The event is of type INFORMATION when a SCAN listener is shutdown due to user action, such as with the Server Control Utility (srvctl
) or Listener Control (lsnrctl
) commands, or any Oracle Cloud maintenance action that uses those commands, such as performing a grid infrastructure software update. The event is of type CRITICAL when a SCAN listener goes down unexpectedly. A corresponding DOWN_CLEARED event is created when a SCAN listener is started.
There are three SCAN listeners per cluster called LISTENER_SCAN[1,2,3].
Problem Statement
A SCAN listener is down and unable to accept application connections.
Risk
If all SCAN listeners are down, application connections to the database through the SCAN listener will fail.
Action/Repair
Start the SCAN listener to receive the DOWN_CLEARED event.
DOWN event of type INFORMATION
- If the event was caused by an Oracle Cloud maintenance action, such as performing a grid infrastructure software update, then no action is required. The affected SCAN listener will automatically failover to an available instance.
- If the event was caused by user action, then start the SCAN listener at the next opportunity.
DOWN event of type CRITICAL
- Check SCAN status and restart the SCAN listener
- Login to the VM as
opc
user andsudo
to thegrid
user:[opc@vm ~] sudo su - grid
- Check the SCAN listeners status on any node:
[grid@vm ~] srvctl status scan_listener
- Start the SCAN listener:
[grid@vm ~] srvctl start scan_listener
- Recheck the SCAN listeners status on any node: if the
scan_listener
is still down, investigate the cause of the scan listener failure:- Collect both the CRS and OS logs 30 minutes prior and 10 minutes for the
<hostName>
indicated in log. Note the time in the event payload is always provided in UTC: Fortfactl
collection, adjust the time to the timezone of the VM cluster.[grid@vm ~] tfactl diagcollect -crs -os -node <hostName> –from "<eventTime adjusted for local vm timezone> - 30 minute " -to "<eventTime adjusted for local vm timezone> + 10 minutes"
- SCAN listener issues are logged:
/u01/app/grid/diag/tnslsnr/<hostName>/<listenerName>/trace
- Collect both the CRS and OS logs 30 minutes prior and 10 minutes for the
- Login to the VM as
AVAILABILITY.DB_GUEST.CLIENT_LISTENER.DOWN
Event Name
AVAILABILITY.DB_GUEST.CLIENT_LISTENER.DOWN
Event Description
A DOWN event is created when a client listener goes down. The event is of type INFORMATION when a client listener is shutdown due to user action, such as with the Server Control Utility (srvctl
) or Listener Control (lsnrctl
) commands, or any Oracle Cloud maintenance action that uses those commands, such as performing a grid infrastructure software update. The event is of type CRITICAL when a client listener goes down unexpectedly. A corresponding DOWN_CLEARED event is created when a client listener is started.
There is one client listener per node, each called LISTENER.
Problem Statement
A client listener is down and unable to accept application connections.
Risk
If the node's client listener is down, the database instances on the node cannot provide services for the application.
If the client listener is down on all nodes, any application that connects to any database using the SCAN or VIP will fail.
Action/Repair
Start the client listener to receive the DOWN_CLEARED event.
DOWN event of type INFORMATION
- If the event was caused by an Oracle Cloud maintenance action, such as performing a grid infrastructure software update, then no action is required. The affected client listener will automatically restart when maintenance affecting the grid instance is complete.
- If the event was caused by user action, then start the client listener at the next opportunity.
DOWN event of type CRITICAL
- Check client listener status and restart the client listener:
- Login to the VM as
opc
user andsudo
to thegrid
user:[opc@vm ~] sudo su - grid
- Check the client listener status on any node:
[grid@vm ~] srvctl status listener
- Start the client listener:
[grid@vm ~] srvctl start listener
- Recheck the client listener status on any node: if client listener is still down. Investigate the cause of the client listener failure:
- Use
tfactl
to collect both the CRS and OS logs 30 minutes prior and 10 minutes for thehostName
indicated in log. Note the time in the event payload is always provided in UTC: Fortfactl
collection, adjust the time to the timezone of the VM cluster.[grid@vm ~] tfactl diagcollect -crs -os -node <hostName> –from "<eventTime adjusted for local vm timezone> - 30 minute " -to "<eventTime adjusted for local vm timezone> + 10 minutes"
- Review the listener log located under:
/u01/app/grid/diag/tnslsnr/<hostName>/<listenerName>/trace
- Use
- Login to the VM as
AVAILABILITY.DB_GUEST.CDB_INSTANCE.DOWN
Event Name
AVAILABILITY.DB_GUEST.CDB_INSTANCE.DOWN
Event Description
A DOWN event is created when a database instance goes down. The event is of type INFORMATION when a database instance is shutdown due to user action, such as with the SQL*Plus (sqlplus
) or Server Control Utility (srvctl
) commands, or any Oracle Cloud maintenance action that uses those commands, such as performing a database home software update. The event is of type CRITICAL when a database instance goes down unexpectedly. A corresponding DOWN_CLEARED event is created when a database instance is started.
Problem Statement
A database instance has gone down.
Risk
A database instance has gone down., which may result in reduced performance if database instances are available on other nodes in the cluster, or complete downtime if database instances on all nodes are down.
Action/Repair
Start the database instance to receive the DOWN_CLEARED event.
DOWN event of type INFORMATION
- If the event was caused by an Oracle Cloud maintenance action, such as performing a database home software update, then no action is required. The affected database instance will automatically restart when maintenance affecting the instance is complete.
- If the event was caused by user action, then start the affected database instance at the next opportunity.
DOWN event of type CRITICAL
- Check database status and restart the down database instance.
- Login to the VM as
oracle
user: - Set the environment:
[oracle@vm ~] . <dbName>.env
- Check the database status:
[oracle@vm ~] srvctl status database -db <dbName>
- Start the database instance:
[oracle@vm ~] srvctl start instance -db <dbName> -instance <instanceName>
- Login to the VM as
- Investigate the cause of the database instance failure.
- Review Trace File Analyzer (TFA) events for the database:
[oracle@vm ~] tfactl events -database <dbName> -instance <instanceName>
- Review the database alert log located at:
$ORACLE_BASE/diag/rdbms/<dbName>/<instanceName>/trace/alert_<instanceName>.log
- Review Trace File Analyzer (TFA) events for the database:
HEALTH.DB_CLUSTER.CDB.CORRUPTION
Event Name
HEALTH.DB_CLUSTER.CDB.CORRUPTION
Event Description
Database corruption has been detected on your primary or standby database. The database alert.log
is parsed for any specific errors that are indicative of physical block corruptions, logical block corruptions, or logical block corruptions caused by lost writes.
Problem Statement
Corruptions can lead to application or database errors and in worse case result in significant data loss if not addressed promptly.
A corrupt block is a block that was changed so that it differs from what Oracle Database expects to find. Block corruptions can be categorized as physical or logical:
- In a physical block corruption, which is also called a media corruption, the database does not recognize the block at all; the checksum is invalid or the block contains all zeros. An example of a more sophisticated block corruption is when the block header and footer do not match.
- In a logical block corruption, the contents of the block are physically sound and pass the physical block checks; however, the block can be logically inconsistent. Examples of logical block corruption include incorrect block type, incorrect data or redo block sequence number, corruption of a row piece or index entry, or data dictionary corruptions.
Block corruptions can also be divided into interblock corruption and intrablock corruption:
- In an intra-block corruption, the corruption occurs in the block itself and can be either a physical or a logical block corruption.
- In an inter-block corruption, the corruption occurs between blocks and can only be a logical block corruption.
Oracle checks for the following errors in the alert.log
:
ORA-01578
ORA-00752
ORA-00753
ORA-00600
[3020
]ORA-00600
[kdsgrp1
]ORA-00600
[kclchkblk_3
]ORA-00600
[13013
]ORA-00600
[5463
]
Risk
A data corruption outage occurs when a hardware, software, or network component causes corrupt data to be read or written. The service-level impact of a data corruption outage may vary, from a small portion of the application or database (down to a single database block) to a large portion of the application or database (making it essentially unusable). If remediation action is not taken promptly, potential downtime and data loss can increase.
Action/Repair
The current event notification currently triggers on physical block corruptions (ORA-01578
), lost writes (ORA-00752
, ORA-00753
, and ORA-00600
with first argument 3020
) and logical corruptions (typical detected from ORA-00600
with first argument of kdsgrp1
, kdsgrp1
, kclchkblk_3
, 13013
, or 5463
).
We recommend the following steps:
- Confirm that these corruptions were reported in the
alert.log
trace file. Log a Service Request (SR) with latest EXAchk report, excerpt of thealert.log
and trace file containing the corruption errors, any history of recent application, database or software changes and any system, clusterware and database logs for the same time period. For all these cases, a TFA collection should be available and should be attached to the SR. - For more information on repair recommendations, see Primary Note for Handling Oracle Database Corruption Issues.
For physical corruptions or ORA-1578
errors, the following notes will be helpful:
- OERR: ORA-1578 "ORACLE data block corrupted (file # %s, block # %s)" Primary Note (Doc ID 1578.1)
- How to identify all the Corrupted Objects in the Database with RMAN (Doc ID 472231.1)
- How to identify the corrupt Object reported by ORA-1578 / RMAN / DBVERIFY (Doc ID 819533.1)
- Primary Note for Handling Oracle Database Corruption Issues
Note:
RMAN can be used to recover one or many data block that are physically corrupted. Also using Active Data Guard with real time apply, auto block repair of physical data corruptions would have occurred automatically.For logical corruptions caused by lost writes (ORA-00752
, ORA-00753
, and ORA-00600
with first argument 3020
) on the primary or standby databases, they will be detected on the primary or with standby's redo apply process. The following notes will be helpful:
- Primary Note for Handling Oracle Database Corruption Issues
- If you have a standby and lost write corruption on the primary or standby, see Resolving ORA-00752 or ORA-600 [3020] During Standby Recovery (Doc ID 1265884.1).
For logical corruptions (typical detected from ORA-00600
with arguments of kdsgrp1
, kclchkblk_3
, 13013
, or 5463
)
- For more information on the error that was detected, see Primary Note for Handling Oracle Database Corruption Issues.
- If you have a standby and logical corruption on the primary, see Resolving Logical Block Corruption Errors in a Physical Standby Database (Doc ID 2821699.1).
HEALTH.DB_CLUSTER.CDB.ARCHIVER_HANG
Event Name
HEALTH.DB_CLUSTER.CDB.ARCHIVER_HANG
Event Description
An event of type CRITICAL is created if a container database (CDB) is either unable to archive the active online redo log or unable to archive the active online redo log fast enough to the log archive destinations.
Problem Statement
CDB RAC Instance may temporarily or permanently stall due to the log writer's (LGWR) inability to write the log buffers to an online redo log. This occurs because all online logs need archiving. Once the archiver (ARC) can archive at least one online redo log, LGWR will be able to resume writing the log buffers to online redo logs and the application impact will be alleviated.
Risk
If the archiver hang is temporary, this can result in a small application brown out or stall for application processes attempting to commit their database changes. If the archiver is not unblocked, applications can experience extended delay in processing.
Action/Repair
- To determine the hourly frequency for each thread/instance, see Script To Find Redolog Switch History And Find Archivelog Size For Each Instances In RAC (Doc ID 2373477.1).
- If any hourly bucket is greater than 12, consider resizing the online redo logs. See item 2 below for resizing steps.
- If the database hangs are temporary, the archiver may be unable to keep up with the redo log generated.
- Check the
alert.log
,$ORACLE_BASE/diag/rdbms/<dbName>/<instanceName>/trace/alert_<instanceName>.log
, for "All online logs need archiving", multiple events in a short period can indicate 2 possible solutions.- If the number of redo logs groups per thread is less than 4, consider adding additional logs groups to reach 4, see item1 below for add redo log steps.
- The other possible solution is to resize the redo logs, see item 2 below for resizing steps.
- Check the
- For sizing guidelines for Data Guard and non Data Guard, see Configure Online Redo Logs Appropriately.
- Add a redo log group for each thread. The additional redo log should equal the current log size.
- Use the following query:
select max(group#) Ending_group_number, thread#, count(*) number_of_groups_per_thread, bytes redo_size_in_bytes from v$log group by thread#,bytes
- Add one new group per thread using the same size as the current redo logs.
alter database add logfile thread <thread_number> group <max group + 1> ('<DATA_DISKGROUP>') size <redo_size_in_bytes>
- Use the following query:
- Resize the online redo logs by adding larger redo logs and dropping the current smaller redo logs.
- Use the following query:
select max(group#) Ending_group_number, thread#, count(*) number_of_groups_per_thread, bytes redo_size_in_bytes from v$log group by thread#,bytes
- Add the same number of redo logs for each thread
number_of_groups_per_thread
that currently exist. Thenew_redo_size_in_bytes
should be based on Configure Online Redo Logs Appropriately.alter database add logfile thread <thread_number> group <max group + 1> ('<DATA_DISKGROUP>') size <new_redo_size_in_bytes>
- The original smaller redo logs should be deleted. A redo log can only be deleted if its status is inactive. To determine the status of a redo logs issue the following select.
select group#, thread#, status, bytes from v$log order by bytes, group#, thread#;
Delete the original smaller redo logs:
alter database drop logfile <group#>;
- Use the following query:
- If the database is hung, the primary log archive destination and alternate may be full.
- For more information on freeing space in RECO and DATA disk groups, see HEALTH.DB_CLUSTER.DISK_GROUP.FREE_SPACE.
HEALTH.DB_CLUSTER.CDB.DATABASE_HANG
Event Name
HEALTH.DB_CLUSTER.CDB.DATABASE_HANG
Event Description
An event of type CRITICAL is created when a process/session hang is detected in the container database (CDB).
Problem Statement
Hang management detected a process hang and generated a ORA-32701
error message. Additional, this event may be raised if Diagnostic Process (DIA0
) process detects a hang in a critical database process.
Risk
A hang can indicate resource, OS or application coding related issues.
Action/Repair
Investigate the cause of the session hang.
- Review TFA events for the database for the following message patterns corresponding to the date/time of the event:
ORA-32701
, "DIA0 Critical Database Process Blocked", or "DIA0 Critical Database Process As Root".tfactl events -database <dbName> -instance <instanceName>
- Review the alert.log associated at the following location:
$ORACLE_BASE/diag/rdbms/<dbName>/<instanceName>/trace/alert_<instanceName>.log
- For
ORA-32701
: An overloaded system can cause slow progress, which can be interpreted as a hang. The hang manager may attempt to resolve the hang by terminating the final blocker process. - For
DIA0
Critical Database Process messages: Review the related diagnostic lines indicating the process and the reason for the hang.
HEALTH.DB_CLUSTER.CDB.BACKUP_FAILURE
Event Name
HEALTH.DB_CLUSTER.CDB.BACKUP_FAILURE
Event Description
An event of type CRITICAL is created if there is a CDB backup with a FAILED status reported in the v$rman_status
view.
Problem Statement
A daily incremental BACKUP of the CDB failed.
Risk
A failure of the backup can compromise the ability to use the backups for restore/recoverability of the database. Recoverability Point Object (RPO) and the Recoverability Time Object (RTO) can be impacted.
Action/Repair
Review the RMAN logs corresponding to the date/time of the event. Note the event time stamp eventTime
is in UTC, adjust as necessary for the VM's timezone.
For Oracle managed backups:
- RMAN output can be found at
/opt/oracle/dcs/log/<hostname>/rman
. - Review the log for any failures:
- If the failure is due to an external event outside of RMAN, for example the backup location was full or a networking issue, resolve the external issue.
- For other RMAN script errors, collect the diagnostic logs and open a Service Request.
dbcli collect-diagnostics -h
Usage: collect-diagnostics [options] Options: --components, -c Supported components: [all, dcs, crs, acfs, asm, db] all -- Collects diagnosis for all supported components [all, dcs, crs, acfs, asm, db] dcs -- Collects diagnosis for dcs crs -- Collects diagnosis for crs acfs -- Collects diagnosis for acfs asm -- Collects diagnosis for asm. db -- Collects diagnosis for db. For multiple parameter values, follow the example as "-c c1 c2" Default: [dcs] --dbNames, -d Comma separated database names. Valid only if 'db' or 'all' specified in Components list. --endTime, -et End time of diagnostic logs. Please give time in yyyy-MM-dd HH:mm:ss format --help, -h get help --json, -j json output --objectstoreuri, -ou Pre Authenticated Request URI --redaction, -r Diagnostic logs redaction. Might take longer time with some components. --startTime, -st Start time of diagnostic logs. Please give time in yyyy-MM-dd HH:mm:ss format
- If the issue is transient or is resolved, then take a new incremental backup. For more information, see Back Up a Database Using the Console.
For customer owned and managed backup taken through RMAN:
- Review the RMAN logs for the backup.
HEALTH.DB_CLUSTER.DISK_GROUP.FREE_SPACE
Event Name
HEALTH.DB_CLUSTER.DISK_GROUP.FREE_SPACE
Event Description
An event of type CRITICAL is created when an ASM disk group reaches space usage of 90% or higher. An event of type INFORMATION is created when the ASM disk group space usage drops below 90%.
Problem Statement
ASM disk group space usage is at or exceeds 90%.
Risk
Insufficient ASM disk group space can cause database creation failure, tablespace and data file creation failure, automatic data file extension failure, or ASM rebalance failure.
Action/Repair
ASM disk group used space is determined by the running the following query while connected to the ASM instance.
sudo su - grid
sqlplus / as sysasm
select 'ora.'||name||'.dg', total_mb, free_mb,
round ((1-(free_mb/total_mb))*100,2) pct_used from v$asm_diskgroup;
NAME TOTAL_MB FREE_MB PCT_USED
---------------- ---------- ---------- ----------
ora.DATAC1.dg 75497472 7408292 90.19
ora.RECOC1.dg 18874368 17720208 6.11
ASM disk group capacity can be increased in the following ways:
- Scale the VM Cluster storage to add more ASM disk group capacity. For more information, see the Scale Up the Storage for a Virtual Machine DB System.
DATA disk group space use can be reduced in the following ways:
- Drop unused data files and temp files from databases. For more information, see Dropping Data Files.
RECO disk group space use can be reduced in the following ways:
- Drop unnecessary Guaranteed Restore Points. For more information, see Using Normal and Guaranteed Restore Points.
- Delete archived redo logs or database backups already backed up outside the Flash Recovery Area (FRA). For more information, see Maintaining the Fast Recovery Area.
For information about Oracle's commitment to accessibility, visit the Oracle Accessibility Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.
Access to Oracle Support
Oracle customers that have purchased support have access to electronic support through My Oracle Support. For information, visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=info or visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=trs if you are hearing impaired.