Troubleshooting Exadata Cloud Infrastructure Systems

These topics cover some common issues you might run into and how to address them.

Known Issues for Exadata Cloud Infrastructure

General known issues.

CPU Offline Scaling Fails

Description: CPU offline scaling fails with the following error:
** CPU Scale Update **An error occurred during module execution. Please refer to the log file for more information

Cause: After provisioning a VM cluster, the /var/opt/oracle/cprops/cprops.ini file, which is automatically generated by database as a service (DBaaS), is not updated with the common_dcs_agent_bindHost and common_dcs_agent_port parameters, which causes CPU offline scaling to fail.

Action: As the root user, manually add the following entries in the /var/opt/oracle/cprops/cprops.ini file.
common_dcs_agent_bindHost=<IP_Address>
common_dcs_agent_port=7070

Note:

The common_dcs_agent_port value is always 7070.
Run the following command to get the IP address:
netstat -tunlp | grep 7070
For example:
netstat -tunlp | grep 7070
tcp 0 0 <IP address 1>:7070 0.0.0.0:* LISTEN 42092/java
tcp 0 0 <IP address 2>:7070 0.0.0.0:* LISTEN 42092/java

You can specify either of the two IP addresses, <IP address 1> or <IP address 2> for the common_dcs_agent_bindHost parameter.
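
For reference, the following sketch appends both entries from the shell; the address 10.0.0.12 is a placeholder, so substitute the address reported by the netstat command above:
# Run as root; 10.0.0.12 is a placeholder for the address returned by netstat
cat >> /var/opt/oracle/cprops/cprops.ini <<'EOF'
common_dcs_agent_bindHost=10.0.0.12
common_dcs_agent_port=7070
EOF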

Adding a VM to a VM Cluster Fails

Description: When adding a VM to a VM cluster, you might encounter the following issue:
[FATAL] [INS-32156] Installer has detected that there are non-readable files in oracle home.
CAUSE: Following files are non-readable, due to insufficient permission oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc
ACTION: Ensure the above files are readable by grid.

Cause: The installer detected a non-readable trace file, oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc, created by Autonomous Health Framework (AHF) in the Oracle home, which causes adding a cluster VM to fail.

AHF, running as root, created a .trc file with root ownership, which the grid user cannot read.

Action: Ensure that the AHF trace files are readable by the grid user before you add VMs to a VM cluster. To fix the permission issue, run the following commands as root on all the existing VM cluster VMs:
chown grid:oinstall /u01/app/19.0.0.0/grid/srvm/admin/logging.properties
chown -R grid:oinstall /u01/app/19.0.0.0/grid/oracle.ahf*
chown -R grid:oinstall /u01/app/grid/oracle.ahf*
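
As a quick sanity check before retrying the operation, you can confirm that no AHF files remain unreadable by the grid user. The following one-liner is only a sketch; it assumes GNU find and the Grid home paths shown above:
# No output means the grid user can read every file under the AHF directories
sudo -u grid find /u01/app/19.0.0.0/grid/oracle.ahf* /u01/app/grid/oracle.ahf* ! -readable -ls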

Troubleshoot Network Connectivity

To determine if a VM Cluster is properly configured to access the Oracle Cloud Infrastructure (OCI) Services Network, you need to perform the following steps on each virtual machine in the VM Cluster.

Validation check for Identity and Access management connectivity:

  • ssh to a virtual machine on your ExaDB-D VM Cluster as the opc user.
  • Execute the command: curl https://identity.<region>.oci.oraclecloud.com, where <region> corresponds to the OCI region where your VM Cluster is deployed. For example, if your VM Cluster is deployed in the Ashburn region, use “us-ashburn-1” for <region>, so the command becomes curl https://identity.us-ashburn-1.oci.oraclecloud.com.
  • If your Virtual Cloud Network (VCN) is properly configured for accessing the OCI Services Network, you will get an immediate response that looks like
    {
     "code" : "NotAuthorizedOrNotFound",
     "message" : "Authorization failed or requested resource not found."
    } 
  • The curl command will hang and eventually time out if your network is not configured for accessing the OCI Services Network.
  • Depending on your VCN setup, you will need to follow the steps outlined in the Action section below to configure access to the OCI Services Network.

Validation check for Object Storage Service (OSS) connectivity:

  • ssh to a virtual machine on your ExaDB-D VM Cluster as the opc user.
  • Execute the command: curl https://objectstorage.<region>.oraclecloud.com, where <region> corresponds to the OCI region where your VM Cluster is deployed. For example, if your VM Cluster is deployed in the Ashburn region, use “us-ashburn-1” for <region>, so the command becomes curl https://objectstorage.us-ashburn-1.oraclecloud.com.
  • If your Virtual Cloud Network (VCN) is properly configured for accessing the OCI Services Network, you will get an immediate response that looks like
    {
     "code" : "NotAuthorizedOrNotFound",
     "message" : "Authorization failed or requested resource not found."
    }
  • The curl command will hang and eventually time out if your network is not configured for accessing the OCI Services Network.
  • Depending on your VCN setup, you will need to follow the steps outlined in the Action section below to configure access to the OCI Services Network.

Action:

After you configure your VCN to reach the OCI Services Network (for example, by updating your service gateway as described under Additional Information), repeat the steps in both validation check sections to confirm that connectivity to the OCI Services Network has been established from your VM Cluster.
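
As a convenience, both validation checks can be run in one pass. The following sketch assumes the us-ashburn-1 region (substitute your own region) and uses a timeout so the command returns instead of hanging; the NotAuthorizedOrNotFound response shown above indicates connectivity, while a timeout indicates the VCN is not yet configured:
# Run as the opc user on each virtual machine in the VM cluster
for endpoint in identity.us-ashburn-1.oci.oraclecloud.com objectstorage.us-ashburn-1.oraclecloud.com; do
  echo "== ${endpoint}"
  curl -sS --max-time 10 "https://${endpoint}"; echo
done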

Additional Information:

You can find instructions to update a service gateway at https://docs.oracle.com/en-us/iaas/Content/Network/Tasks/servicegateway.htm#switch_label.

Backup Failures in Exadata Database Service on Dedicated Infrastructure

If your Exadata managed backup does not successfully complete, you can use the procedures in this topic to troubleshoot and fix the issue.

The most common causes of backup failure are the following:

  • The host cannot access Object Storage
  • The database configuration on the host is not correct

The information that follows is organized by the error condition. If you already know the cause, you can skip to the section with the suggested solution. Otherwise, use the procedure in Determining the Problem to get started.

Determining the Problem

In the Console, a failed database backup either displays a status of Failed or hangs in the Backup in Progress or Creating state. If the error message does not contain enough information to point you to a solution, you can gather more information by using dbaascli and by viewing the log files. Then, refer to the applicable section in this topic for a solution.

Database backups can fail during the RMAN configuration stage or during a running RMAN backup job. RMAN configuration tasks include validating object store connectivity, backup module installation, and RMAN configuration changes. The log files you examine depend on the stage in which the failure occurred.

  1. Log on to the host as the root user.
  2. Check the applicable log file:

    • If the failure occurred during RMAN configuration, navigate to the /var/opt/oracle/log/<database_name>/bkup/ directory and check the bkup.log file.
    • If the failure occurred during the backup job, navigate to the /var/opt/oracle/log/<database_name>/obkup/ directory and check the obkup.log file.

Note:

  • Each execution of the bkup and obkup commands generates a separate log file, but bkup.log and obkup.log are symbolic links that point to the most recently generated log file.
  • Ensure that you check the log files on all of the Exadata DB system compute nodes because all nodes send backup pieces to Object Storage.
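
For example, assuming a hypothetical database named orcl, you can review the tail of the most recent configuration and backup-job logs as follows:
tail -n 100 /var/opt/oracle/log/orcl/bkup/bkup.log
tail -n 100 /var/opt/oracle/log/orcl/obkup/obkup.log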

Database Service Agent Issues

Your Oracle Cloud Infrastructure database makes use of an agent framework to allow you to manage your database through the cloud platform. Use the following to check and restart the dbcsagent.

Occasionally, to resolve a backup failure, you might need to restart the dbcsagent program if it has the stop/waiting status. View the /opt/oracle/dcs/log/dcs-agent.log file to identify issues with the agent.

  1. From a command prompt, check the status of the agent:
    systemctl status dbcsagent.service
  2. If the agent is in the stop/waiting state, try to restart the agent:
    systemctl start dbcsagent.service
  3. Check the status of the agent again to confirm that it has the start/running status:
    systemctl status dbcsagent.service
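
If the agent returns to the stop/waiting state after a restart, reviewing the end of the agent log mentioned above can help identify the cause; for example:
tail -n 200 /opt/oracle/dcs/log/dcs-agent.log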

Object Store Connectivity Issues

Backing up your database to Oracle Cloud Infrastructure Object Storage requires that the host can connect to the applicable Swift endpoint.

Though Oracle controls the actual Swift user credentials for the storage bucket for managed backups, verifying general connectivity to Object Storage in your region is a good indicator that object store connectivity is not the issue. You can test this connectivity by using another Swift user.

  1. Create a Swift user in your tenancy. See Working with Auth Tokens.
  2. With the user you created in the previous step, use the following command to verify that the host can access the object store. (A filled-in example with hypothetical values follows this list.)
    curl -v -X HEAD -u <user_ID>:'<auth_token>' https://swiftobjectstorage.<region_name>.oraclecloud.com/v1/<object_storage_namespace>
    See Object Storage FAQ for the correct region to use. See Understanding Object Storage Namespaces for information about your Object Storage namespace.
  3. If you cannot connect to the object store, refer to Prerequisites for Backups on Exadata Cloud Service topic for information on configuring object store connectivity.
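
The following example fills in the command from step 2 with hypothetical values; the user, auth token, region, and namespace are placeholders. A successful HEAD request confirms that the host can reach Object Storage:
curl -v -X HEAD -u 'backup.user@example.com':'AUTH_TOKEN_VALUE' \
  https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/examplenamespace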

Host Issues

One or more of the following conditions on the database host can cause backups to fail:

If an interactive command such as oraenv, or any command that might return an error or warning message, was added to the .bash_profile file for the grid or oracle user, Database service operations like automatic backups can be interrupted and fail to complete. Check the .bash_profile file for these commands, and remove them.

Backup operations require space in the /u01 directory on the host file system. Use the df -h command on the host to check the space available for backups. If the file system has insufficient space, you can remove old log or trace files to free up space.

Your system might not have the required version of the backup module (opc_installer.jar). See Unable to use Managed Backups in your DB System for details about this known issue. To fix the problem, you can follow the procedure in that section or simply update your DB system and database with the latest bundle patch.

Customizing the site profile file ($ORACLE_HOME/sqlplus/admin/glogin.sql) can cause managed backups to fail in Oracle Cloud Infrastructure. In particular, interactive commands can lead to backup failures. Oracle recommends that you not modify this file for databases hosted in Oracle Cloud Infrastructure.
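
The following commands sketch quick checks for the host conditions described above; the profile paths and grep patterns are illustrative rather than exhaustive:
# Check free space on the file system that holds /u01
df -h /u01
# Look for interactive or environment-changing commands in the oracle and grid profiles
grep -nE 'oraenv|stty|read ' /home/oracle/.bash_profile /home/grid/.bash_profile
# Review the site profile for customizations (run as the oracle user so ORACLE_HOME is set)
cat $ORACLE_HOME/sqlplus/admin/glogin.sql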

Database Issues

An improper database state or configuration can lead to failed backups.

The database must be active and running (ideally on all nodes) while the backup is in progress.

Use the following command to check the state of your database, and ensure that any problems that might have put the database in an improper state are resolved:
srvctl status database -d <db_unique_name> -verbose
The system returns a message including the database's instance status. The instance status must be Open for the backup to succeed. If the database is not running, use the following command to start it:
srvctl start database -d <db_unique_name> -o open
If the database is mounted but does not have the Open status, use the following commands to access the SQL*Plus command prompt and set the status to Open:
sqlplus / as sysdba
alter database open;

When you provision a new database, the archiving mode is set to ARCHIVELOG by default. This is the required archiving mode for backup operations. Check the archiving mode setting for the database and change it to ARCHIVELOG, if applicable.

Open an SQL*Plus command prompt and enter the following command:
select log_mode from v$database;
If you need to set the archiving mode to ARCHIVELOG, start the database in MOUNT status (and not OPEN status), and use the following command at the SQL*Plus command prompt:
alter database archivelog;

Confirm that the db_recovery_file_dest parameter points to +RECO, and that the log_archive_dest_1 parameter is set to USE_DB_RECOVERY_FILE_DEST.
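
For example, you can verify both settings from a SQL*Plus session; the output varies by system:
SQL> show parameter db_recovery_file_dest
SQL> show parameter log_archive_dest_1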

For RAC databases, one instance must have the MOUNT status when enabling archivelog mode. To enable archivelog mode for a RAC database, perform the following steps:

  1. Shut down all database instances:
    srvctl stop database -d <db_unique_name>
  2. Start one of the database instances in mount state:
    srvctl start instance -d <db_unique_name> -i <instance_name> -o mount
  3. Access the SQL*Plus command prompt:
    sqlplus / as sysdba
  4. Enable archive log mode:
    alter database archivelog; 
    exit;
  5. Stop the database:
    srvctl stop instance -d <db_unique_name> -i <instance_name>
  6. Restart all database instances:
    srvctl start database -d <db_unique_name>
  7. At the SQL*Plus command prompt, confirm that the archiving mode is set to ARCHIVELOG:
    select log_mode from v$database;

Backups can fail when the database instance has a stuck archiver process. For example, this can happen when the flash recovery area (FRA) is full. You can check for this condition using the srvctl status database -db <db_unique_name> -v command. If the command returns the following output, you must resolve the stuck archiver process issue before backups can succeed:
Instance <instance_identifier> is running on node *<node_identifier>. Instance status: Stuck Archiver

Refer to ORA-00257:Archiver Error (Doc ID 2014425.1) for information on resolving a stuck archiver process.

After resolving the stuck process, the command should return the following output:
Instance <instance_identifier> is running on node *<node_identifier>. Instance status: Open

If the instance status does not change after you resolve the underlying issue with the device or resource being full or unavailable, try restarting the database using the srvctl command to update the status of the database in the clusterware.

Editing certain RMAN configuration parameters can lead to backup failures in Oracle Cloud Infrastructure. To check your RMAN configuration, use the show all command at the RMAN command line prompt.

See the following list of parameters for details about the RMAN configuration settings that should not be altered for databases in Oracle Cloud Infrastructure.


CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 30 DAYS;

CONFIGURE CONTROLFILE AUTOBACKUP ON;

CONFIGURE DEVICE TYPE 'SBT_TAPE' PARALLELISM 5 BACKUP TYPE TO COMPRESSED BACKUPSET;

CONFIGURE CHANNEL DEVICE TYPE DISK MAXPIECESIZE 2 G;

CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' PARMS  'SBT_LIBRARY=/var/opt/oracle/dbaas_acfs/<db_name>/opc/libopc.so, ENV=(OPC_PFILE=/var/opt/oracle/dbaas_acfs/<db_name>/opc/opc<db_name>.ora)';

CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 1 TIMES TO 'SBT_TAPE';

CONFIGURE ENCRYPTION FOR DATABASE ON;
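
To compare your current configuration with the settings listed above, you can display it at the RMAN prompt; for example:
rman target /
RMAN> SHOW ALL;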

RMAN backups fail when an object store wallet file is lost. The wallet file is necessary to enable connectivity to the object store.

  1. Get the name of the database with the backup failure using SQL*Plus:
    show parameter db_name
  2. Determine the file path of the backup config parameter file that contains the RMAN wallet information at the Linux command line:

    locate opc<database_name>.ora
    For example:
    
    find / -name "opctestdb30.ora" -print
    /var/opt/oracle/dbaas_acfs/testdb30/opc/opctestdb30.ora
  3. Find the file path to the wallet file in the backup config parameter file by inspecting the value stored in the OPC_WALLET parameter. To do this, navigate to the directory containing the backup config parameter file and use the following cat command:
    cat opc<database_name>.ora
    For example:
    
    cd /var/opt/oracle/dbaas_acfs/testdb30/opc/
    ls -altr *.ora
    opctestdb30.ora
    cat opctestdb30.ora
    OPC_HOST=https://swiftobjectstorage.us-phoenix-1.oraclecloud.com/v1/dbbackupphx
    OPC_WALLET='LOCATION=file:/var/opt/oracle/dbaas_acfs/testdb30/opc/opc_wallet CREDENTIAL_ALIAS=alias_opc'
    OPC_CONTAINER=bUG3TFsSi8QzjWfuTxqqExample
    _OPC_DEFERRED_DELETE=false
  4. Confirm that the cwallet.sso file exists in the directory specified in the OPC_WALLET parameter, and confirm that the file has the correct permissions. The file permissions should have the octal value of "600" (-rw-------). Use the following command:

    ls -ltr /var/opt/oracle/dbaas_acfs/<database_name>/opc/opc_wallet
    For example:
    
    ls -altr /var/opt/oracle/dbaas_acfs/testdb30/opc/opc_wallet
    -rw------- 1 oracle oinstall 0 Oct 29 01:59 cwallet.sso.lck
    -rw------- 1 oracle oinstall 111231 Oct 29 01:59 cwallet.sso
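
If the cwallet.sso file exists but its ownership or permissions differ from the example above, a typical correction (a sketch only; confirm the path for your database before running it as root) is:
chown oracle:oinstall /var/opt/oracle/dbaas_acfs/<database_name>/opc/opc_wallet/cwallet.sso
chmod 600 /var/opt/oracle/dbaas_acfs/<database_name>/opc/opc_wallet/cwallet.sso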
    

TDE Wallet and Backup Failures

Learn to identify the root cause of TDE wallet and backup failures.

For backup operations to work, the $ORACLE_HOME/network/admin/sqlnet.ora file must contain the ENCRYPTION_WALLET_LOCATION parameter formatted exactly as follows:
ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/var/opt/oracle/dbaas_acfs/<database_name>/tde_wallet)))
Use the cat command to check the TDE wallet location specification. For example:
$ cat $ORACLE_HOME/network/admin/sqlnet.ora 
ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/var/opt/oracle/dbaas_acfs/<database_name>/tde_wallet)))

Database backups fail if the TDE wallet is not in the proper state. The following scenarios can cause this problem:

If the database was started using SQL*Plus, and the ORACLE_UNQNAME environment variable was not set, the wallet is not opened correctly.

To fix the problem, start the database using the srvctl utility:
srvctl start database -d <db_unique_name>

In a multitenant environment for Oracle Database versions that support PDB-level keystore, each PDB has its own master encryption key. For Oracle 18c databases, this encryption key is stored in a single keystore used by all containers. (Oracle Database 19c does not support a keystore at the PDB level.) After you create or plug in a new PDB, you must create and activate a master encryption key for it. If you do not do so, the STATUS column in the v$encryption_wallet view shows the value OPEN_NO_MASTER_KEY.

To check the master encryption key status and create a master key, do the following:

  1. Review the STATUS column in the v$encryption_wallet view, as shown in the following example:
    
    SQL> alter session set container=pdb2;
    Session altered.
    
    SQL> select WRL_TYPE,WRL_PARAMETER,STATUS,WALLET_TYPE from v$encryption_wallet;
    
    WRL_TYPE   WRL_PARAMETER                                   STATUS             WALLET_TYPE
    ---------- ----------------------------------------------- ------------------ -----------
    FILE       /var/opt/oracle/dbaas_acfs/testdb30/tde_wallet/ OPEN_NO_MASTER_KEY AUTOLOGIN
  2. Confirm that the PDB is in READ WRITE open mode and is not restricted, as shown in the following example:
    
    SQL> show pdbs
    
    CON_ID CON_NAME     OPEN MODE              RESTRICTED
    ------ ------------ ---------------------- ---------------
    2      PDB$SEED     READ ONLY              NO
    3      PDB1         READ WRITE             NO
    4      PDB2         READ WRITE             NO

    The PDB cannot be open in restricted mode (the RESTRICTED column must show NO). If the PDB is currently in restricted mode, review the information in the PDB_PLUG_IN_VIOLATIONS view and resolve the issue before continuing. For more information on the PDB_PLUG_IN_VIOLATIONS view and the restricted status, review the pluggable database content in the Oracle Multitenant Administrator’s Guide for your Oracle Database version.

  3. Create and activate a master encryption key for the PDB:

    • Set the container to the PDB:
      ALTER SESSION SET CONTAINER = <pdb>;
    • Create and activate a master encryption key in the PDB by executing the following command:
      ADMINISTER KEY MANAGEMENT SET KEY USING TAG '<tag>' 
      FORCE KEYSTORE IDENTIFIED BY <keystore-password> WITH BACKUP USING '<backup_identifier>';

    Note the following:

    • The USING TAG clause is optional and can be used to associate a tag with the new master encryption key.
    • The WITH BACKUP clause is optional and can be used to create a backup of the keystore before the new master encryption key is created.

    You can also use the dbaascli tde status and dbaascli tde rotate masterkey commands to investigate and manage your keys.

  4. Confirm that the status of the wallet has changed from OPEN_NO_MASTER_KEY to OPEN by querying the v$encryption_wallet view as shown in step 1.

Configuration parameters related to the TDE wallet can cause backups to fail.

Confirm that the wallet status is open and the wallet type is auto login by checking the v$encryption_wallet view. For example:

SQL> select status, wrl_parameter,wallet_type from v$encryption_wallet;

STATUS  WRL_PARAMETER                                  WALLET_TYPE
------- ---------------------------------------------- --------------
OPEN   /var/opt/oracle/dbaas_acfs/testdb30/tde_wallet/ AUTOLOGIN
For pluggable databases (PDBs), ensure that you switch to the appropriate container before querying the v$encryption_wallet view. For example:

$ sqlplus / as sysdba

SQL> alter session set container=pdb1;

Session altered.

SQL> select WRL_TYPE,WRL_PARAMETER,STATUS,WALLET_TYPE from v$encryption_wallet;

WRL_TYPE  WRL_PARAMETER                                   STATUS   WALLET_TYPE
--------- ----------------------------------------------- -------- -----------
FILE      /var/opt/oracle/dbaas_acfs/testdb30/tde_wallet/ OPEN     AUTOLOGIN
The TDE wallet file (ewallet.p12) can cause backups to fail if it is missing, or if it has incompatible file system permissions or ownership. As the root user, check the file as shown in the following example:
# ls -altr /var/opt/oracle/dbaas_acfs/<database_name>/tde_wallet/ewallet.p12
total 76
-rw------- 1 oracle oinstall 5467 Oct 1 20:17 ewallet.p12

The TDE wallet file should have file permissions with the octal value "600" (-rw-------), and the owner of this file should be a part of the oinstall operating system group.

The auto login wallet file (cwallet.sso) can cause backups to fail if it is missing, or if it has incompatible file system permissions or ownership. As the root user, check the file as shown in the following example:

# ls -altr /var/opt/oracle/dbaas_acfs/<database_name>/tde_wallet/cwallet.sso
total 76
-rw------- 1 oracle oinstall 5512 Oct 1 20:18 cwallet.sso

The auto login wallet file should have file permissions with the octal value "600" (-rw-------), and the owner of this file should be a part of the oinstall operating system group.

Troubleshooting Oracle Data Guard

Learn to identify and resolve Oracle Data Guard issues.

When troubleshooting Oracle Data Guard, you must first determine whether the problem occurs during the Data Guard setup and initialization or during Data Guard operation, when lifecycle commands are entered. The steps to identify and resolve the issues are different, depending on the scenario in which they are used.

There are three lifecycle operations: switchover, failover, and reinstate. The Data Guard broker is used for all of these commands. The broker command line interface (dgmgrl) is the main tool used to identify and troubleshoot the issues. Although you can use logfiles to identify root causes, dgmgrl is faster and easier to use to check and identify an issue.

Setting up and enabling Data Guard involves multiple steps. Log files are created for each step. If any of the steps fail, review the relevant log file to identify and fix the problem.

  • Validation of the primary cloud VM Cluster and database
  • Validation of the standby cloud VM Cluster
  • Recreating and copying files to the standby database (password file and wallets)
  • Creating Data Guard through Network (RMAN Duplicate command)
  • Configuring Data Guard broker
  • Finalizing the setup

Troubleshooting Data Guard using logfiles

The tools used to identify the issue and the locations of relevant logfiles are different, depending on the scenario in which they are used.

Use the following procedures to collect relevant log files to investigate issues. If you are unable to resolve the problem after investigating the log files, contact My Oracle Support.

Note:

When preparing collected files for Oracle Support, bundle them into a compressed archive, such as a ZIP file.

On each compute node associated with the Data Guard configuration, gather log files pertaining to the problem you experienced.

  • Enablement stage log files (such as those documenting the Create Standby Database operation) and the logs for the corresponding primary or standby system.
  • Enablement job ID logfiles. For example: 23.
  • Locations of enablement log files by enablement stage and Exadata system (primary or standby).
  • Database name logfiles (db_name or db_unique_name, depending on the file path).

Note:

Check all nodes of the corresponding primary and standby Exadata systems. Commands executed on a system may have been run on any of its nodes.

Data Guard Deployer (DGdeployer) is the process that performs the configuration. When configuring the primary database, it creates the /var/opt/oracle/log/<dbname>/dgdeployer/dgdeployer.log file.

This log should contain the root cause of a failure to configure the primary database.

  • The primary log from the dbaasapi command-line utility is: /var/opt/oracle/log/dbaasapi/db/dg/<job_ID>.log. Look for entries that contain dg_api.
  • One standby log from the dbaasapi command-line utility is: /var/opt/oracle/log/dbaasapi/db/dg/<job_ID>.log. In this log, look for entries that contain dg_api.
  • The other standby log is: /var/opt/oracle/log/<dbname>/dgcc/dgcc.log. This log is the Data Guard configuration log.
  • The Oracle Cloud Deployment Engine (OCDE) creates the /var/opt/oracle/log/<dbname>/ocde/ocde.log file. This log should contain the cause of a failure to create the standby database.
  • The dbaasapi command-line utility creates the /var/opt/oracle/log/dbaasapi/db/dg/<job_ID>.log file. Look for entries that contain dg_api.
  • The Data Guard configuration log file is /var/opt/oracle/log/<dbname>/dgcc/dgcc.log.
  • DGdeployer is the process that performs the configuration. It creates the following /var/opt/oracle/log/<dbname>/dgdeployer/dgdeployer.log file. This log should contain the root cause of a failure to configure the standby database.
  • The dbaasapi command-line utility creates the /var/opt/oracle/log/dbaasapi/db/dg/<job_ID>.log file. Look for entries that contain dg_api.
  • The Data Guard configuration log is /var/opt/oracle/log/<dbname>/dgcc/dgcc.log.

DGdeployer is the process that performs the configuration. While configuring Data Guard, it creates the /var/opt/oracle/log/<dbname>/dgdeployer/dgdeployer.log file. This log should contain the root cause of a failure to configure the primary database.

On each node of the primary and standby sites, gather log files for the related database name (db_name).

Note:

Check all nodes on both primary and standby Exadata systems. A lifecycle management operation may impact both primary and standby systems.

  • Database alert log: /u02/app/oracle/diag/rdbms/<dbname>/<dbinstance>/trace/alert_<dbinstance>.log
  • Data Guard Broker log: /u02/app/oracle/diag/rdbms/<dbname>/<dbinstance>/trace/drc<dbinstance>.log
  • Cloud tooling log file for Data Guard: /var/opt/oracle/log/<dbname>/odg/odg.log
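
As noted earlier, bundle the collected files into a compressed archive before sending them to Oracle Support. The following is a minimal sketch, assuming a hypothetical database named orcl and the log locations listed above:
# Run on each relevant node; adjust paths for your db_name and instances
zip -r /tmp/dg_logs_$(hostname -s).zip \
    /var/opt/oracle/log/orcl/dgcc/dgcc.log \
    /var/opt/oracle/log/orcl/dgdeployer/dgdeployer.log \
    /var/opt/oracle/log/orcl/odg/odg.log \
    /u02/app/oracle/diag/rdbms/orcl/*/trace/alert_*.log \
    /u02/app/oracle/diag/rdbms/orcl/*/trace/drc*.log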

Troubleshooting the Data Guard Setup Process

The following errors might occur in the different steps of the Data Guard setup process. While some errors are displayed in the Console, most of the root causes can be found in the log files.

The password entered for enabling Data Guard did not match the admin password for the SYS user on the primary database. This error occurs during the Validate Primary stage of enablement.

The database may not be running. This error occurs during the Validate Primary stage of enablement. Use srvctl and SQL*Plus on the host to verify that the database is up and running on all nodes.

The primary database could not be configured. Invalid Data Guard commands or failed listener reconfiguration can cause this error.

The TDE wallet could not be created. The Oracle Transparent Data Encryption (TDE) keystore (wallet) files could not be prepared for transportation to the standby site. This error occurs during the create TDE Wallet stage of enablement. Either of the following items can cause failure at this stage:

  • The TDE wallet files could not be accessed
  • The enablement commands could not create an archive containing the wallet files

Troubleshooting procedure:

  1. Ensure that the cluster is accessible. To check the status of a cluster, run the following command:
    crsctl check cluster -all
  2. If the cluster is down, run the following command to restart it:
    crsctl start crs -wait
  3. If this error occurs when the cluster is accessible, check the logs for create TDE Wallet (enablement stage) to determine cause and resolution for the error.

The archive containing the TDE wallet was likely not transmitted to the standby site. Retrying usually solves the problem.

  • The primary and standby sites may not be able to communicate with each other to configure the standby database. These errors occur during the configure standby database stage of enablement. In this stage, configurations are performed on the standby database, including the RMAN duplicate of the primary database. To resolve this issue:
    1. Verify the connectivity status for the primary and standby sites.
    2. Ensure that the hosts can communicate over port 1521. Check the network setup, including Network Security Groups (NSGs), Network Security Lists, and the remote VCN peering setup (if applicable). The best way to test communication between the hosts is to connect to the databases using SQL*Plus from the primary to the standby and from the standby to the primary.
  • The SCAN VIPs or listeners may not be running. Use the test above to help identify the issue.

Possible causes:

  • SCAN VIPs or listeners may not be running. You can confirm this issue by using the following commands on any cluster node.
    • [grid@exa1-****** ~]$ srvctl status scan
    • [grid@exa1-****** ~]$ srvctl status scan_listener
  • Databases may not be reachable. You can confirm this issue by attempting to connect using an existing Oracle Net alias.

Troubleshooting procedure:

  1. As the oracle OS user, check for the existence of an Oracle Net alias for the container database (CDB). Look for an alias in $ORACLE_HOME/network/admin/<dbname>/tnsnames.ora.

    The following example shows an entry for a container database named db12c:

    cat $ORACLE_HOME/network/admin/db12c/tnsnames.ora 
    DB12C = (DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = exa1-*****-scan.********.******.******.com)(PORT = 1521)) 
    (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = db12c.********.******.******.com) 
    (FAILOVER_MODE = (TYPE = select) (METHOD = basic))))
  2. Verify that you can use the alias to connect to the database. For example, as sysdba, enter the following command:
    sqlplus sys@db12c

A possible cause for this error is that the Oracle Database sys or system user passwords for the database and the TDE wallet may not be the same. To compare the passwords:

  1. Connect to the database as the sys user and check the TDE status in V$ENCRYPTION_WALLET.
  2. Connect to the database as the system user and check the TDE status in V$ENCRYPTION_WALLET.
  3. Update the applicable passwords to match. Log on to the system host as opc and run the following commands:
    1. To change the SYS password:
      sudo dbaascli database changepassword --dbname <database_name>
    2. To change the TDE wallet password:
      sudo dbaascli tde changepassword --dbname <database_name>

For possible causes and resolutions to TDE wallet issues, see TDE Wallet and Backup Failures.

When the switchover, failover, and reinstate commands are run, multiple error messages may occur. Refer to the Oracle Database documentation for these error messages.

Note

Oracle recommends using the Data Guard broker command line interface (dgmgrl) to validate the configurations.

  1. As the Oracle User, connect to the primary or standby database with dgmgrl and verify the configuration and the database:
    dgmgrl sys/<pwd>@<database>
    DGMGRL> VALIDATE CONFIGURATION VERBOSE
    DGMGRL> VALIDATE DATABASE VERBOSE <PRIMARY>
    DGMGRL> VALIDATE DATABASE VERBOSE <STANDBY>
  2. Consult the Oracle Database documentation to check for the respective error message. For example:
    • ORA-16766: Redo apply is stopped.
    • ORA-16853: Apply lag has exceeded specified threshold.
    • ORA-16664: Unable to receive the result from a member (under the standby database).
    • ORA-12541: TNS: no listener (under the primary database)

For cause and resolution, review the errors in Database Error Messages.

Patching Failures on Exadata Cloud Infrastructure Systems

Patching operations can fail for various reasons. Typically, an operation fails because a database node is down, there is insufficient space on the file system, or the virtual machine cannot access the object store.

Determining the Problem

In the Console, you can identify a failed patching operation by viewing the patch history of an Exadata Cloud Infrastructure system or an individual database.

A patch that was not successfully applied displays a status of Failed and includes a brief description of the error that caused the failure. If the error message does not contain enough information to point you to a solution, you can use the database CLI and log files to gather more data. Then, refer to the applicable section in this topic for a solution.

Troubleshooting and Diagnosis

Diagnose the most common issues that can occur during the patching process of any of the Exadata Cloud Infrastructure components.

Database Server VM Issues

One or more of the following conditions on the database server VM can cause patching operations to fail.

Database Server VM Connectivity Problems

Cloud tooling relies on the proper networking and connectivity configuration between the virtual machines of a given VM cluster. If the configuration is not set up properly, failures can occur in any operation that requires cross-node processing, such as being unable to download the files required to apply a given patch.

In this case, you can perform the following actions:

  • Verify that your DNS configuration is correct so that the relevant virtual machine addresses are resolvable within the VM cluster.
  • Refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Grid Infrastructure Issues

One or more of the following conditions on Oracle Grid Infrastructure can cause patching operations to fail.

Oracle Grid Infrastructure is Down

Oracle Clusterware enables servers to communicate with each other so that they can function as a collective unit. The cluster software program must be up and running on the VM Cluster for patching operations to complete. Occasionally you might need to restart the Oracle Clusterware to resolve a patching failure.

In such cases, verify the status of the Oracle Grid Infrastructure as follows:
./crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
If Oracle Grid Infrastructure is down, then restart by running the following commands:
crsctl start cluster -all
crsctl check cluster

Oracle Database Issues

An improper database state can lead to patching failures.

Oracle Database is Down

The database must be active and running on all the active nodes so the patching operations can be completed successfully across the cluster.

Use the following command to check the state of your database, and ensure that any problems that might have put the database in an improper state are resolved:
srvctl status database -d db_unique_name -verbose

The system returns a message including the database instance status. The instance status must be Open for the patching operation to succeed.

If the database is not running, use the following command to start it:
srvctl start database -d db_unique_name -o open

Obtaining Further Assistance

If you were unable to resolve the problem using the information in this topic, follow the procedures below to collect relevant database and diagnostic information. After you have collected this information, contact Oracle Support.

Collecting Cloud Tooling Logs

Collect the relevant log files that can assist Oracle Support with further investigation and resolution of a given issue.

DBAASCLI Logs

/var/opt/oracle/log/dbaascli
  • dbaascli.log

Collecting Oracle Diagnostics

To collect the relevant Oracle diagnostic information and logs, run the dbaascli diag collect command.

For more information about the usage of this utility, see DBAAS Tooling: Using dbaascli to Collect Cloud Tooling Logs and Perform a Cloud Tooling Health Check.

Standby Database Fails to Restart After Switchover in Oracle Database 11g Oracle Data Guard Setup

Description: After performing the switchover, the new standby (old primary) database remains shut down and fails to restart.

Action: After performing switchover, do the following:

  1. Restart the standby database using the srvctl start database -db <standby dbname> command.
  2. Reload the listener as grid user on all primary and standby cluster nodes.
    • To reload the listener using high availability, download and apply patch 25075940 to the Grid home, and then run lsnrctl reload -with_ha.
    • To reload the listener, run lsnrctl reload.

After reloading the listener, verify that the <dbname>_DGMGRL services are loaded into the listener using the lsnrctl status command.
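
For example, the following filter (illustrative only; the service names include the db_unique_name) shows whether the static _DGMGRL services are registered:
lsnrctl status | grep -i dgmgrl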

To download patch 25075940

  1. Log in to My Oracle Support.
  2. Click Patches & Updates.
  3. Select Bug Number from the Number/Name or Bug Number (Simple) drop-down list.
  4. Enter the bug number 34741066, and then click Search.
  5. From the search results, click the name of the latest patch.

    You will be redirected to the Patch 34741066: LSNRCTL RELOAD -WITH_HA FAILED TO READ THE STATIC ENTRY IN LISTENER.ORA page.

  6. Click Download.