Site Failover and Recovery

A site failover is required when your infrastructure suffers an unplanned event that leaves the primary site unavailable and completely inaccessible for long enough to negatively impact the business. In this scenario, the standby assumes the primary role.

A primary site can become unavailable for a variety of reasons, including, but not limited to, the following:

  • Issues that might prevent the primary database instances from starting, such as failed or extensively corrupted media, or a flawed OS or firmware upgrade
  • Infrastructure failure such as a full power or cooling system outage within the OCI region infrastructure
  • Complete network failures
  • Natural disasters such as earthquakes, fires, and floods

While unplanned events are rare, they can and do occur.

Perform Site Failover

As a true failover is disruptive and may result in a small amount of data loss, test your site failover in a TEST environment.
The following example uses names from our test environment for the primary database in Ashburn (CDBHCM_iad1dx) and the standby database in Phoenix (CDBHCM_phx5s).
  1. Force a stop of all Oracle Real Application Clusters (Oracle RAC) database instances on the primary.

    Note:

    Do not conduct this test in a production environment.
    $ srvctl stop database -db CDBHCM_iad1dx -stopoption abort
    From this point on, our primary is assumed (simulated) to be completely unavailable. We will make our secondary region our new primary site.

    The following steps use our test implementation and all steps are performed at the secondary site (new primary).

  2. At the secondary site, log in to any one of the Oracle Exadata Database Service on Dedicated Infrastructure domUs, become the oracle OS user, and invoke the Data Guard Broker command-line interface.
    • Node: One Oracle Exadata Database Service on Dedicated Infrastructure domU
    • User: oracle
    $ dgmgrl sys/<sys password>
    DGMGRL> show configuration lag
    
    Configuration - fsc
    
      Protection Mode: MaxPerformance
      Members:
      CDBHCM_iad1dx  - Primary database
        Error: ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
      CDBHCM_phx5s - Physical standby database 
                        Transport Lag:      0 seconds (computed 18 seconds ago)
                        Apply Lag:          0 seconds (computed 18 seconds ago)
    
    Fast-Start Failover:  Disabled
    
    Configuration Status:
    ERROR   (status updated 0 seconds ago)
    Notice the error reported for the primary database; it is expected because the primary is unavailable.
  3. Perform a failover using the Oracle Data Guard Broker command-line interface.
    • Node: One Oracle Exadata Database Service on Dedicated Infrastructure domU at the secondary site
    • User: oracle
    DGMGRL> failover to CDBHCM_phx5s
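    When the failover completes, CDBHCM_phx5s assumes the primary role. You can confirm the new roles from the same session, for example:
    DGMGRL> show configuration
    The configuration should now show CDBHCM_phx5s as the primary database, with the old primary flagged for reinstatement.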
  4. If the primary middle tiers (including the shared file system) are still intact, then manually perform a “forced” rsync from the failed primary site to the DR site.

    The application and web tiers may still be functional, but the application and process scheduler processes will begin to fail as they try to connect to the failed database. This causes the rsync script to stop performing the rsync.

    • Node: One middle tier, failed primary site
    • User: psadm2
    1. Perform a forced rsync. A sample rsync_psft.sh script is available in the scripts directory.
      If the rsync script is disabled, then the -f parameter prompts you to confirm that you want to continue with a forced rsync. A forced rsync does not consult the database to determine the site's role; it simply performs the requested rsync (see the sketch following these steps).

      Note:

      This should only be done when forcing a final refresh during a site failover. Using the FORCE option is logged.
      $ rsync_psft.sh <path to file system parameter file> -f
      For example,
      $ rsync_psft.sh $SCRIPT_DIR/fs1 -f
    2. Monitor the rsync process log to verify that the process completes successfully.
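
    The role check that the -f option bypasses might look something like the following sketch. This is an assumption for illustration only, not the actual contents of rsync_psft.sh:
    # Hypothetical sketch of the role check performed when -f is not given
    role=$(echo "select database_role from v\$database;" | sqlplus -s / as sysdba)
    case "$role" in
      *PRIMARY*) : ;;   # this site is the primary; proceed with the rsync
      *) echo "Not the primary site; skipping rsync (use -f to force)"; exit 1 ;;
    esac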
  5. If the site failure is complete and a final rsync process cannot be run, then disable rsync at the new primary by running the disable_psft_rsync.sh script.
    • Node: One middle tier, new primary site
    • User: psadm2
    $ disable_psft_rsync.sh
  6. If you configured Active Data Guard support when you configured your primary and standby database servers, then ensure that the Active Data Guard service for PeopleSoft (PSQUERY) has started on the new primary database.
    • Node: One Oracle Exadata Database Service on Dedicated Infrastructure domU at the secondary site
    • User: oracle
    $ srvctl status service -db CDBHCM_phx5s -s PSQUERY
    Service PSQUERY is running on instance(s) CDBHCM1,CDBHCM2
    This service should be running on all Oracle Real Application Clusters (Oracle RAC) instances.

    Note:

    This service must be started before starting the process scheduler. Otherwise, the process scheduler will fail on startup.
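    If PSQUERY is not running, you can start it with srvctl before starting the process scheduler, for example:
    $ srvctl start service -db CDBHCM_phx5s -s PSQUERY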
  7. Verify that the role-based database services are up on the new primary.
    • Site: Site 2
    • Node: All Oracle Exadata Database Service on Dedicated Infrastructure domUs
    • User: oracle
    For example, issue the following command on each domU hosting a PeopleSoft Oracle RAC database instance:
    $ srvctl status service -db CDBHCM_phx5s -s HR92U033_BATCH
    Service HR92U033_BATCH is running on instance(s) CDBHCM1,CDBHCM2
    $ srvctl status service -db CDBHCM_phx5s -s HR92U033_ONLINE
    Service HR92U033_ONLINE is running on instance(s) CDBHCM1,CDBHCM2
    These services should be running on all Oracle RAC instances.
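    If a role-based service did not start automatically after the role change, you can start it manually, for example:
    $ srvctl start service -db CDBHCM_phx5s -s HR92U033_ONLINE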
  8. Start the application server and process scheduler domains.
    • Site: Site 2
    • Node: Application and process scheduler server compute instances
    • User: psadm2
    1. Log in to the compute instances hosting the PeopleSoft application servers and process scheduler and become psadm2.
      Use the startPSFTAPP.sh script from the Wrapper directory in GitHub.
      $ startPSFTAPP.sh
    2. Monitor the startup.
      You can use the query from PeopleSoft Application and Process Scheduler Domains.
      col service_name format a20
      select a.inst_id,a.instance_name,b.service_name, count(*)
      from gv$instance a, gv$session b
      where a.inst_id = b.inst_id
      and service_name not like 'SYS%'
      group by a.inst_id,a.instance_name,b.service_name
      order by 1
      
      SQL> /
      
         INST_ID INSTANCE_NAME    SERVICE_NAME          COUNT(*)
      ---------- ---------------- ------------------- ----------
               1 CDBHCM1          CDBHCM_phx5s                 2
      SQL> /
      
         INST_ID INSTANCE_NAME    SERVICE_NAME          COUNT(*)
      ---------- ---------------- ------------------- ----------
               1 CDBHCM1          CDBHCM_phx5s                 2
               1 CDBHCM1          HR92U033_BATCH               8
               1 CDBHCM1          HR92U033_ONLINE             52
               2 CDBHCM2          HR92U033_BATCH               7
               2 CDBHCM2          HR92U033_ONLINE             50
  9. If you did not have a complete site failure as described in Step 5, then go to the next step. If you did have a complete site failure, then you must run the disable_psft_rsync.sh script again, because the startPSFTAPP.sh script will enable rsync.

    Note:

    In an actual failure event where the primary site is lost or becomes inaccessible, you'll need to conduct an assessment of the scope and impact. Here are a few items to consider:

    • Possible database data loss
    • Missing file system artifacts (reports, logs, inbound and outbound files, and so on)

    Depending on the outage, you may or may not be able to recover every transaction committed at the original primary. If possible, ask the users to verify the very last transactions they were working on.

    It is likely that some file system artifacts will be missing, from output that was being written or transmitted when access to the original primary ceased. Use the report logging in the database to identify file system artifacts created near the time of the outage, then determine, case by case, what needs to be done for missing or incomplete files.
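
    For example, a hypothetical starting point, assuming the standard Process Scheduler request table under the usual SYSADM schema, is to list requests that ran near the outage (adjust the time window for your environment):
    select prcsinstance, prcsname, rundttm, runstatus
    from sysadm.psprcsrqst
    where rundttm > sysdate - 1
    order by rundttm;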

  10. Start web services.
    • Site: Site 2
    • Node: All PeopleSoft Internet Architecture (PIA) web server compute instances
    • User: psadm2
    If Coherence*Web is configured, then you will first start the cache cluster on all compute instances that host the PIA web servers, then start the PIA web servers. In this example, one script is used to start both in the proper order.
    1. Log in to the PIA web servers and become psadm2.
    2. Using the startPSFTWEB.sh script from the Wrapper directory in GitHub, start the web servers.
      $ startPSFTWEB.sh
  11. Check the load balancer.
    • Site: Site 2 Region
    • Node: OCI Console
    • User: Tenancy administrator
    1. Log in to the OCI Console and change the region to your new primary.
    2. Select Networking, then Load Balancer from the main menu.
    3. Select the appropriate compartment.
    4. Click Backend Set, then click Backends.
      Each backend should show OK; it may take a few minutes after each PIA web server starts for the status to update.
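
    As an alternative to the Console, you can check backend health with the OCI CLI; the OCID and names here are placeholders:
    $ oci lb backend-health get --load-balancer-id <load balancer OCID> \
          --backend-set-name <backend set name> --backend-name <backend name>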

Reinstate the Failed Primary as the New Standby

You'll want to protect your new production environment with a standby. Ideally, you'll be able to reinstate the failed primary as a new standby by reinstating both the database and the file systems.

Reinstate the Old Primary Database as the Standby

Oracle Data Guard will prevent the old primary database from opening when it is made available again after a primary site failure. Any attempt to start the database normally will fail, with messages written to its alert log indicating reinstatement is required. If Flashback Database was enabled on this database before the failure, then you can reinstate the old primary as the new standby.
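
To confirm whether Flashback Database was enabled, you can query the database once it is mounted, for example:

  SQL> select flashback_on from v$database;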

Perform the following to reinstate the old primary as a standby of current production:

  1. Log in to one of the domUs hosting the old primary database and start the database:
    $ srvctl start database -db CDBHCM_iad1dx
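    If the normal start fails as described above, starting the database in mount mode is sufficient for reinstatement, for example:
    $ srvctl start database -db CDBHCM_iad1dx -o mount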
  2. Log in to the Data Guard Broker command-line interface at the new primary region and show the configuration.
    Note the ORA-16661 error below.
    $ dgmgrl sys/<sys password>
    DGMGRL> show configuration
    Configuration - fsc
    
      Protection Mode: MaxPerformance
      Members:
      CDBHCM_phx5s - Primary database
        CDBHCM_iad1dx - Physical standby database (disabled)
          ORA-16661: the standby database needs to be reinstated
    
    Fast-Start Failover:  Disabled
    
    Configuration Status:
    SUCCESS   (status updated 12 seconds ago)
  3. Reinstate the standby.
    DGMGRL> reinstate database CDBHCM_iad1dx

    Note:

    The process of recovering the failed database and converting it into a valid standby starts, and may take some time to complete.
  4. Use the Data Guard Broker command-line interface to check the status of your environment.
    For example,
    DGMGRL> show configuration lag
    
    Configuration - fsc
    
      Protection Mode: MaxPerformance
      Members:
      CDBHCM_phx5s - Primary database
        CDBHCM_iad1dx - Physical standby database
                        Transport Lag:      0 seconds (computed 1 second ago)
                        Apply Lag:          0 seconds (computed 1 second ago)
    
    Fast-Start Failover:  Disabled
    
    Configuration Status:
    SUCCESS   (status updated 35 seconds ago)

    The reinstated database is now serving as standby, protecting the primary and available for switchover and failover.

    If you configured Active Data Guard support when you configured your primary and standby database servers, then the Active Data Guard service for PeopleSoft (PSQUERY) can be relocated from the primary back to the standby database.
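
    For example, assuming PSQUERY is currently running on the new primary, a minimal relocation is to stop it there and start it on the reinstated standby:
    $ srvctl stop service -db CDBHCM_phx5s -s PSQUERY
    $ srvctl start service -db CDBHCM_iad1dx -s PSQUERY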

Reinstate the Old Primary Middle Tiers as the Standby

Reinstate the old primary middle tiers based on your failover event.
  1. If your old primary middle tier servers remained available while the database failover event occurred, then you should already have performed a final forced rsync from the failed primary site to the standby site at the time of the failure (Step 4 of Perform Site Failover), and you now reverse the direction of the rsync processes in the same manner as for a switchover.

    The application and web tiers may still be functional, but the application and process scheduler processes will begin to fail as they try to connect to the failed database. This causes the rsync script to stop performing the rsync.

    1. If the primary middle tiers, including the shared file system, are still intact, then manually perform a “forced” rsync from the failed primary site to the DR site.
      If the rsync script is disabled, the -f parameter prompts you to confirm that you want to continue with a forced rsync. A forced rsync does not consult the database to determine the site's role; it simply performs the requested rsync.

      Note:

      This should only be done when forcing a final refresh during a site failover. Using the FORCE option is logged.
      You can use the sample script in the scripts directory rsync_psft.sh, as user psadm2:
      $ rsync_psft.sh <path to file system parameter file> -f
      For example,
      $ rsync_psft.sh $SCRIPT_DIR/fs1 -f
    2. Monitor the rsync process log to verify that the process completes successfully.
  2. If the old primary site is completely inaccessible even for rsync, then perform the following:
    1. Disable the rsync scripts at the new primary site with disable_psft_rsync.sh for all file systems being replicated.
    2. If you can resume activity on the original middle tiers at a later time, then re-enable the rsync scripts and let them catch up.
  3. If you need to rebuild your middle tiers, then follow the processes described earlier.