Troubleshooting Exadata Cloud@Customer Systems

These topics cover some common issues you might run into and how to address them.

Patching Failures on Exadata Cloud@Customer Systems

Patching operations can fail for various reasons. Typically, an operation fails because a database node is down, there is insufficient space on the file system, or the virtual machine cannot access the object store.

Determining the Problem

In the Console, you can identify a failed patching operation by viewing the patch history of an Exadata Cloud@Customer system or an individual database.

A patch that was not successfully applied displays a status of Failed and includes a brief description of the error that caused the failure. If the error message does not contain enough information to point you to a solution, you can use the database CLI and log files to gather more data. Then, refer to the applicable section in this topic for a solution.

Troubleshooting and Diagnosis

Diagnose the most common issues that can occur during the patching process of any of the Exadata Cloud@Customer components.

Database Server VM Issues

One or more of the following conditions on the database server VM can cause patching operations to fail.

File System is Full

Patching operations require a minimum of 25 GB of free space for Oracle Grid Infrastructure patching or 15 GB for Oracle Database patching. If the required Oracle home locations do not meet the storage requirements, then an error message like the following can be observed during the patching pre-check operation:
[FATAL] [DBAAS-31009] - One or more Oracle patching pre-checks resulted in error conditions that needs to be
        addressed before proceeding: not enough space for s/w backups   ACTION: Verify the logs at /var/opt/oracle/log/exadbcpatch.

Use the df -h command on the host to check the available space. If the file system has insufficient space, you can remove old log or trace files to free up space.
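For example, a minimal cleanup sequence might look like the following; the file system mount points, the diagnostic path, and the 30-day threshold are assumptions for this sketch, so adjust them to your environment and confirm that the files are no longer needed before deleting them:
# Check available space on the local file systems
df -h /u01 /u02
# List trace files older than 30 days under a typical diagnostic location
# (the path and age threshold are assumptions for this example)
find /u02/app/oracle/diag -name "*.trc" -mtime +30 -ls
# Remove them only after confirming they are no longer needed
find /u02/app/oracle/diag -name "*.trc" -mtime +30 -delete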

Database Server VM Connectivity Problems

Cloud tooling relies on proper networking and connectivity configuration between the virtual machines of a given VM cluster. If the configuration is not set up properly, any operation that requires cross-node processing can fail; one example is being unable to download the files required to apply a given patch. In that case, an error like the following is observed when a patch pre-check or apply request is made:
[FATAL] [DBAAS-31009] - One or more Oracle patching pre-checks resulted in error conditions that needs to be
        addressed before proceeding: % Total % Received % Xferd Average Speed Time Time Time Current
        Dload Upload Total Spent Left Speed0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to [host address]

In this case, you can perform the following actions:

  • Verify that the virtual machine or the URL is reachable by using the following commands:
    ping hostname
    curl target url
  • Verify that your DNS configuration is correct so that the relevant virtual machine addresses are resolvable within the VM cluster (see the example checks after this list).
  • Refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.
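The following checks illustrate the first two actions; the host name and URL shown are placeholders, not values from your environment:
# Confirm that a peer virtual machine responds (host name is a placeholder)
ping -c 3 peer-vm-hostname
# Confirm that the host name resolves through your DNS configuration
nslookup peer-vm-hostname
# Confirm that the patch download endpoint is reachable (URL is a placeholder)
curl -v --max-time 30 https://target-url
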
Oracle Grid Infrastructure Issues

One or more of the following conditions on Oracle Grid Infrastructure can cause patching operations to fail.

Oracle Grid Infrastructure is Down

Oracle Clusterware enables servers to communicate with each other so that they can function as a collective unit. The cluster software program must be up and running on the VM Cluster for patching operations to complete. Occasionally you might need to restart the Oracle Clusterware to resolve a patching failure.

In such cases, verify the status of the Oracle Grid Infrastructure as follows:
./crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
If Oracle Grid Infrastructure is down, then restart by running the following commands:
crsctl start cluster -all
crsctl check cluster
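If the cluster reports that it is online but patching still fails, reviewing the state of the individual Clusterware resources can help isolate the problem; the following check is illustrative:
# Show the state of all Clusterware-managed resources in tabular form
crsctl stat res -t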

Oracle Grid Infrastructure Upgrade Pre-check Failures

During the pre-check operation, failures can be reported when the target of the patching request does not meet the minimum requirements for the operation. An example of the pre-check command follows:
dbaascli patch db prereq --patchid patch id --dbnames GRID
DBAAS CLI version 19.4.4.2.0
Executing command patch db prereq --patchid LATEST --dbnames grid
INFO: DBCS patching
...

Patch ID Not Recognized

If Cloud tooling fails to identify the specified patch ID, then an error like the following is observed:
[FATAL] [DBAAS-10002] - The provided value for the parameter patchnum is invalid: Incorrect patchnum.
ACTION: Verify the corresponding application usage and/or logs at /var/opt/oracle/log/exadbcpatchmulti and try again.

To verify that the specified patch ID is correct, confirm that the specified patch ID is listed as an available patch on the Console.

If the specified patch ID is listed and if the prerequisite operation still fails to recognize the patch ID, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Specific Pre-check Validation Failed

Once the pre-check validation starts, Cloud tooling performs a series of validations to determine whether the minimum requirements for the requested patching operation are met. If any of these requirements are not met, then the following failure is observed:
[FATAL] [DBAAS-31009] - One or more Oracle patching pre-checks resulted in error conditions that needs to be addressed before proceeding: <Specific Pre-check Validation Failure>

Depending on which prerequisite validation failed, perform the corresponding corrections on the environment or the Oracle home as required. Once those corrections have been made, reattempt the operation.

If the failure persists, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Grid Infrastructure Patch Apply Failures

During the actual installation of the Oracle Grid Infrastructure patch, the procedure may fail as shown in the following example:
dbaascli patch db apply --patchid patch id --dbnames GRID
...
ERROR: Grid upgrade failed. Please check corresponding log in /var/opt/oracle/log/exadbcpatch

If a failure is detected on a given node during the patch installation process, then do the following:

  • Address the issue that caused the failure, if it is evident, and then retry the same command so that the operation resumes from the point of failure.
  • If the issue persists after retrying the command, or the root cause of the failure cannot be identified, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Database Issues

An improper database state can lead to patching failures.

Oracle Database is Down

The database must be active and running on all the active nodes so the patching operations can be completed successfully across the cluster.

Use the following command to check the state of your database, and ensure that any problems that might have put the database in an improper state are resolved:
srvctl status database -d db_unique_name -verbose

The system returns a message including the database instance status. The instance status must be Open for the patching operation to succeed.
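The exact wording varies by release, but output similar to the following, shown here only as an illustration with a hypothetical database name, indicates healthy instances:
srvctl status database -d mydb_unique -verbose
Instance mydb1 is running on node node1. Instance status: Open.
Instance mydb2 is running on node node2. Instance status: Open.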

If the database is not running, use the following command to start it:
srvctl start database -d db_unique_name -o open

Oracle Database Patching Pre-check Failures

During the pre-check operation, various issues may be reported if the databases to be patched fail to meet the minimum requirements for the patching operation. An example of the pre-check command follows:
dbaascli patch db prereq --patchid patch id --dbnames <database 1,...,database n>
DBAAS CLI version 19.4.4.2.0
Executing command patch db prereq --patchid LATEST --dbnames grid
INFO: DBCS patching
...

Patch ID Not Recognized

If the Cloud tooling fails to verify the specified patch ID, then an error like the following will be observed:
[FATAL] [DBAAS-10002] - The provided value for the parameter patchnum is invalid: Incorrect patchnum.
ACTION: Verify the corresponding application usage and/or logs at /var/opt/oracle/log/exadbcpatchmulti and try again.

To verify that the specified patch ID is correct, confirm that the specified patch ID is listed as an available patch on the Console.

Alternatively, you can verify the installed patch level in a given home by using the following command:
dbaascli dbhome info

If the specified patch ID is listed and if the prerequisite operation still fails to recognize the patch ID, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Specific Prereq Validation Failed

Once the prerequisite validation starts, Cloud tooling performs a series of validations to determine whether the minimum requirements for the requested patching operation are met. If any of the requirements are not met, then a failure like the following is observed:
[FATAL] [DBAAS-31009] - One or more Oracle patching pre-checks resulted in error conditions that needs to be addressed before proceeding: <Specific Prereq Validation Failure>

Depending on which prerequisite validation failed, perform the corresponding corrections on the environment or the Oracle home as required. Once the corrections have been made, reattempt the operation.

If the failure persists, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Database Patch Apply Failures

During the actual installation of the requested patch for the corresponding Oracle home, the procedure may fail as shown in the following example:
dbaascli patch db apply --patchid patch id --dbnames <database 1,...,database n>
...
ERROR: Error during creation, empty dbhome patching failed. Check the corresponding logs

If it is not possible to identify the root cause of the failure and its corresponding solution, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Cloud Tooling Issues

No Applicable Cloud Tooling Patches Available

A patch operation may fail because there are no applicable RPMs to install. An example of this condition follows:
dbaascli patch tools apply --patchid LATEST
DBAAS CLI version 19.4.4.2.0
Executing command patch tools apply --patchid LATEST
...
[FATAL] [DBAAS-33032] - An error occurred while performing the installation of the Oracle DBAAS tools: No applicable dbaastools rpms found.
ACTION: Verify the logs at /var/opt/oracle/log/exadbcpatch.
To confirm that there are no applicable patches to be installed for Cloud tooling, run the following command:
dbaascli patch tools list

If the Cloud tooling patch level is eligible for a patch but the tooling does not report any applicable patch ID, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Failed Patch Modifies the Home Name in oraInventory with the Suffix "_PIP"

Description: The image-based patching process temporarily changes the name of the Home being patched in the oraInventory by adding the suffix '_pip' (patching in progress). For example, OraDB19Home1 becomes OraDB19Home1_pip.

When a patch fails on node 2, the name is not reverted to the original. As a result, the next Home installed on node 2 uses the Home name OraDB19Home1.

Action: On the failing node, run the following command to clear the corresponding _pip entry from the inventory:
/var/opt/oracle/exapatch/exadbcpatchmulti -rollback_async patch id  
-instance1=hostname:ORACLE_HOME path -dbname=dbname1 -run_datasql=1

After performing the local rollback, resume the corresponding patching operation.

Patching Primary and Standby Databases Configured with Oracle Data Guard Fails

Description: In OCI environments, patching primary or secondary nodes using the exadbcpatchmulti tool fails if there's no SSH connectivity between the primary and standby nodes.

Action: Depending on the node you're patching, add the -primary or -secondary flag. You can add flags to identify the nodes only if you're patching using the exadbcpatchmulti tool.

For example:

To patch standby nodes, use the -secondary flag:
/var/opt/oracle/exapatch/exadbcpatchmulti action [patchid] dbname|instance_num -secondary
To patch primary nodes, use the -primary flag:
/var/opt/oracle/exapatch/exadbcpatchmulti action [patchid] dbname|instance_num -primary
Note

Always patch standby nodes first and then proceed to primary nodes.

Obtaining Further Assistance

If you were unable to resolve the problem using the information in this topic, follow the procedures below to collect relevant database and diagnostic information. After you have collected this information, contact Oracle Support.

Collecting Cloud Tooling Logs

The following log files can assist Oracle Support in further investigating and resolving a given issue.

DBAASAPI Logs

These logs are applicable for actions that are performed from the Console.

/var/opt/oracle/log/dbaasapi/db/db
  • Job HASH.log corresponding to the Backend API request
Note

All the log files are timestamped so that issues can be traced back to a specific point in time during DB system operation.

DBAASCLI Logs

/var/opt/oracle/log/dbaascli
  • dbaascli.log

DBAAS ExaPatch Logs

/var/opt/oracle/log/exadbcpatchmulti:
  • exadbcpatchmulti.log
  • exadbcpatchmulti-cmd.log
/var/opt/oracle/log/exadbcpatchsm:
  • exadbcpatchsm.log
/var/opt/oracle/log/exadbcpatch:
  • exadbcpatch.log
  • exadbcpatch-cmd.log
  • exadbcpatch-dmp.log
  • exadbcpatch-sql.log
Note

All the log files are timestamped so that issues can be traced back to a specific point in time during DB system operation.

Collecting Configuration Tools Logs

$GRID_BASE/cfgtoollogs
$ORACLE_BASE/cfgtoollogs

Collecting Oracle Diagnostics

To collect the relevant Oracle diagnostic information and logs, run the dbaas_diag_tool.pl script.
/var/opt/oracle/misc/dbaas_diag_tool.pl

For more information about the usage of this utility, see My Oracle Support note 2219712.1.

Database is Down While Performing Downgrade to Release 11.2 or 12.1

Description: The following error is thrown while running the database upgrade command with the -revert flag.
[FATAL] [DBAAS-54007] - An error occurred when
open the a121db database with resetlog options: ORA-01034: ORACLE not
available.

Action: If required, apply the one-off patch for bug 31561819 before attempting the downgrade for Oracle Database releases 11.2 and 12.1.

If the issue persists and impacts a given database, then, per the similar bug 31762303 filed by the MAA team, Oracle recommends running the following commands after the failure to complete the database downgrade:
/u02/app/oracle/product/19.0.0.0/dbhome_3/bin/srvctl downgrade database -d
<DB_UNIQUE_NAME> -o /u02/app/oracle/product/11.2.0/dbhome_2 -t 11.2.0.4
/u02/app/oracle/product/11.2.0/dbhome_2/bin/srvctl setenv database -d <DB_UNIQUE_NAME> -T
"TNS_ADMIN=/u02/app/oracle/product/11.2.0/dbhome_2/network/admin/<DB_NAME>"

After Database Upgrade, the Standby Database Remains in Mounted State in Oracle Data Guard Configurations

Description: After performing the upgrade as recommended in Oracle MOS note 2628228.1, the standby database is left in MOUNT state.

Action: If it is required to bring the standby database back to read-only mode, then proceed with the following steps:

Run the following query on the primary database:
SELECT DEST_ID,THREAD#,sequence#,RESETLOGS_CHANGE#,STANDBY_DEST,ARCHIVED,APPLIED,status,to_char(completion_time,'DD-MM-YYYY:hh24:mi') from  v$archived_log;

Ensure that all the logs have been replicated successfully to the standby database after the upgrade operation.
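To make that check easier to read, a summarized query such as the following can be used; the DEST_ID value of 2 for the standby destination is an assumption and must match your log archive destination configuration:
-- Highest log sequence archived to and applied at the standby destination, per thread
-- (DEST_ID = 2 is an assumption for this example)
SELECT thread#,
       MAX(sequence#) AS max_archived,
       MAX(CASE WHEN applied = 'YES' THEN sequence# END) AS max_applied
  FROM v$archived_log
 WHERE dest_id = 2
 GROUP BY thread#;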

Then on the standby database, run the following commands:
dbaascli database stop --dbname standby dbname
dbaascli database start --dbname standby dbname

Then the database should be open in read-only mode again.
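To confirm the result, a simple check such as the following (illustrative) can be run on the standby instance:
-- OPEN_MODE should report READ ONLY (or READ ONLY WITH APPLY) on the standby
SELECT open_mode, database_role FROM v$database;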

Primary Database Fails to Downgrade to 18c in Oracle Data Guard Configurations

Description: The following failure is observed while downgrading a primary Data Guard database from 19c to 18c:
[FATAL] [DBAAS-54007]
- An error occurred when open the db163 database with resetlog options:
ORA-16649: possible failover to another database prevents this database from
being opened.

Action: Follow these steps to fix the issue:

  1. Open the initdbname.ora file:
    /var/opt/oracle/dbaas_acfs/upgrade_backup/dbname/initdbname.ora
  2. Set the *.dg_broker_start parameter to false and save the changes:
    *.dg_broker_start=FALSE
  3. Bring down the local instance and open it back in mount mode:
    startup mount pfile='/var/opt/oracle/dbaas_acfs/upgrade_backup/dbname/initdbname.ora';
  4. Then open it with the following command:
    alter database open resetlogs;
  5. Re-enable the Data Guard broker.
    alter system set dg_broker_start=true scope=BOTH;
  6. Restore the spfile.
    create spfile='DATA DISKGROUP/db_unique_name/spfiledbname.ora' from pfile='/var/opt/oracle/dbaas_acfs/upgrade_backup/dbname/initdbname.ora';
  7. Shut down the local instance.
    shutdown immediate;
  8. Manually downgrade the service.
    19c Oracle home/bin/srvctl downgrade database -d db_unique_name -oraclehome 18c Oracle home path -targetversion 18.0.0.0.0
  9. Restore the TNS_ADMIN variable.
    19c Oracle home/bin/srvctl setenv database -d db_unique_name -t "TNS_ADMIN=18c Oracle home/network/admin/dbname"
  10. Bounce the database across the cluster.
    18c Oracle home/bin/srvctl stop database -d db_unique_name
    18c Oracle home/bin/srvctl start database -d db_unique_name

Deleting a Database Does Not Delete the Associated Oracle Home

Description: During initial VM cluster creation, a database is created to verify/test the ability to create a database through tooling. This database is immediately deleted. However, the Oracle Home directory is left behind.

Action: To delete Oracle Home, run the following dbaascli commands as root.

dbaascli dbhome info: Lists all the database homes and any databases associated with each home.

Identify a database home in which no databases are running, and then run the following command to delete that Oracle home.

dbaascli dbhome purge --hpath path of Oracle Home to purge

You can get the path of the Oracle Home to purge from the output of the dbaascli dbhome info command.
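Putting the two commands together, a typical cleanup might look like the following; the home path shown is a hypothetical example, so use the path reported by dbaascli dbhome info:
# List all database homes and the databases registered against them
dbaascli dbhome info
# Purge a home that has no databases associated with it
# (the path below is a hypothetical example taken from the dbhome info output)
dbaascli dbhome purge --hpath /u02/app/oracle/product/19.0.0.0/dbhome_2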

Database Creation Fails with Error Message "There are no ASM disk groups detected"

Description: This is an intermittent issue. Oracle Trace File Analyzer changes the owner and mode of the kfod directory, which causes database creation to fail.

Action: Before creating a database, change the owner and mode of all subdirectories and files in the kfod directory.
cd /u02/app/grid19/diag/kfod/node_name/kfod
sudo chown -R grid *
sudo chmod -R 775 *
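To confirm that the change took effect before retrying the database creation, a quick ownership check such as the following (illustrative) can be run:
# Files and subdirectories should now be owned by grid with mode 775
ls -lR /u02/app/grid19/diag/kfod/node_name/kfod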

Database Startup Fails with Error Message "ORA-12649: Unknown encryption or data integrity algorithm"

Description: Database startup fails with ORA-12649 when you attempt to start up a database or when you provision Data Guard after RMAN duplicate is complete.

Action: Restart CRS and try starting the database. If the issue persists, then apply Patch 32117115 on the GI home and retry. If Data Guard provisioning failed due to error ORA-12649, then apply the patch before retrying the Data Guard provisioning. For more information, see How to Apply Database Quarterly Patch on Exadata Cloud Service and Exadata Cloud at Customer Gen 2 (Doc ID 2701789.1). Applying the patch also fixes the database startup issues.

VM Operating System Update Hangs During Database Connection Drain

Description: This is an intermittent issue. During virtual machine operating system update with 19c Grid Infrastructure and running databases, dbnodeupdate.sh waits for RHPhelper to drain the connections, which will not progress because of a known bug "DBNODEUPDATE.SH HANGS IN RHPHELPER TO DRAIN SESSIONS AND SHUTDOWN INSTANCE".

Symptoms: There are two possible outcomes due to this bug:
  1. VM operating system update hangs in rhphelper
    • Hangs the automation
    • Some or none of the database connections will have drained, and some or all of the database instances will remain running.
  2. VM operating system update does not drain database connections because rhphelper crashed
    • Does not hang automation
    • Some or none of the database connection draining completes

/var/log/cellos/dbnodeupdate.trc will show this as the last line:

(ACTION:) Executing RHPhelper to drain sessions and shutdown instances. 
(trace:/u01/app/grid/crsdata/scaqak04dv0201/rhp//executeRHPDrain.150721125206.trc)
Action:
  1. Upgrade Grid Infrastructure version to 19.11 or above.

    (OR)

    Disable rhphelper before updating and enable it back after updating.

    To disable rhphelper before the update starts:
    /u01/app/19.0.0.0/grid/srvm/admin/rhphelper /u01/app/19.0.0.0/grid 19.10.0.0.0 -setDrainAttributes ENABLE=false
    To re-enable rhphelper after the update completes:
    /u01/app/19.0.0.0/grid/srvm/admin/rhphelper /u01/app/19.0.0.0/grid oracle-home-current-version -setDrainAttributes ENABLE=true

    If you disable rhphelper, then there is no database connection draining before the database services and instances are shut down on a node prior to the operating system update.

  2. If you did not disable RHPhelper and the update hangs because the draining of services is taking too long, then do the following:
    1. Inspect the /var/log/cellos/dbnodeupdate.trc trace file, which contains a paragraph similar to the following:
      (ACTION:) Executing RHPhelper to drain sessions and shutdown instances. 
      (trace: /u01/app/grid/crsdata/<nodename>/rhp//executeRHPDrain.150721125206.trc)
    2. Open the /var/log/cellos/dbnodeupdate.trc trace file.
      If rhphelper fails, then the trace file contains the message as follows:
      "Failed execution of RHPhelper"
      If rhphelper hangs, then the trace file contains the message as follows:
      (ACTION:) Executing RHPhelper to drain sessions and shutdown instances.
    3. Identify the rhphelper processes running at the operating system level and kill them.

      There are two processes whose command line contains the string "rhphelper": a Bash shell, and the underlying Java program, which is the actual rhphelper execution.

      rhphelper runs as root, so it must be killed as root (sudo from opc).

      For example:
      [opc@<HOST> ~] pgrep -lf rhphelper
      191032 rhphelper
      191038 java
      
      [opc@<HOST> ~] sudo kill -KILL 191032 191038
    4. Verify that the dbnodeupdate.trc trace file progresses and that the Grid Infrastructure stack on the node is shut down.

    For more information about RHPhelper, see Using RHPhelper to Minimize Downtime During Planned Maintenance on Exadata (Doc ID 2385790.1).

Adding a VM to a VM Cluster Fails

Description: When adding a VM to a VM cluster, you might encounter the following issue:
[FATAL] [INS-32156] Installer has detected that there are non-readable files in oracle home.
CAUSE: Following files are non-readable, due to insufficient permission oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc
ACTION: Ensure the above files are readable by grid.

Cause: The installer has detected a non-readable trace file, oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc, created by Autonomous Health Framework (AHF) in the Oracle home, which causes adding a cluster VM to fail.

AHF, running as root, created a .trc file with root ownership that the grid user cannot read.

Action: Ensure that the AHF trace files are readable by the grid user before you add VMs to a VM cluster. To fix the permission issue, run the following commands as root on all the existing VM cluster VMs:
chown grid:oinstall /u01/app/19.0.0.0/grid/srvm/admin/logging.properties
chown -R grid:oinstall /u01/app/19.0.0.0/grid/oracle.ahf*
chown -R grid:oinstall /u01/app/grid/oracle.ahf*

Nodelist is not Updated for Data Guard-Enabled Databases

Description: Adding a VM to a VM cluster completes successfully, however, for Data Guard-enabled databases, the new VM is not added to the nodelist in the /var/opt/oracle/creg/<db>.ini file.

Cause: Data Guard-enabled databases are not extended to the newly added VM. Therefore, the <db>.ini file is not updated, because the database instance is not configured on the new VM.

Action: To add an instance to primary and standby databases and to the new VMs (Non-Data Guard), and to remove an instance from a Data Guard environment, see My Oracle Support note 2811352.1.

CPU Offline Scaling Fails

Description: CPU offline scaling fails with the following error:
** CPU Scale Update **An error occurred during module execution. Please refer to the log file for more information

Cause: After provisioning a VM cluster, the /var/opt/oracle/cprops/cprops.ini file, which is automatically generated by database as a service (DBaaS), is not updated with the common_dcs_agent_bindHost and common_dcs_agent_port parameters, which causes CPU offline scaling to fail.

Action: As the root user, manually add the following entries in the /var/opt/oracle/cprops/cprops.ini file.
common_dcs_agent_bindHost=<IP_Address>
common_dcs_agent_port=7070
Note

The common_dcs_agent_port value is always 7070.
Run the following command to get the IP address:
netstat -tunlp | grep 7070
For example:
netstat -tunlp | grep 7070
tcp 0 0 <IP address 1>:7070 0.0.0.0:* LISTEN 42092/java
tcp 0 0 <IP address 2>:7070 0.0.0.0:* LISTEN 42092/java

You can specify either of the two IP addresses, <IP address 1> or <IP address 2> for the common_dcs_agent_bindHost parameter.