26 Managing Planned and Unplanned Outages for Oracle GoldenGate Hub
Several considerations apply when the hub undergoes a planned or unplanned outage of either the primary or standby file system cluster.
Managing Planned Outages
When planned maintenance must be performed on the GGHub, stop and disable certain CRS resources so that they cannot restart unexpectedly, incorrectly trigger a file system failover, or stop Oracle GoldenGate from running.
Use the following recommendations in the event of a planned outage of the primary or standby hub clusters.
For all planned maintenance events:
- Operating system software or hardware updates and patches
- Oracle Grid Infrastructure interim or diagnostic patches
- Oracle Grid Infrastructure quarterly updates under the Critical Patch Update (CPU) program, or Oracle Grid Infrastructure release upgrades
- GGHub software life cycle, including:
  - Oracle GoldenGate
  - Oracle Grid Infrastructure Agent
  - NGINX
High Availability Solution with Target Outage Time: seconds to minutes, during which GoldenGate replication is temporarily suspended
Step 1: Software update of idle GGHub node
Step 2: GGHub Node Relocate
Step 3: Software update of the remaining inactive GGHub node
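The three steps above can be outlined as a script run by the grid OS user. This is a hypothetical wrapper around the agctl commands shown in this chapter, not a supplied tool; the deployment name gghub matches the later examples and should be adjusted for your environment.

```shell
#!/bin/bash
# Rolling-maintenance outline for a GGHub cluster (sketch).
# Run as the grid OS user; DEPLOYMENT is assumed to be 'gghub'.
DEPLOYMENT=gghub

# Step 1: confirm which node is active, then patch the idle node.
agctl status goldengate        # e.g. "... is running on gghub_prim1"
# ... apply OS / Grid Infrastructure / GoldenGate patches on the idle node ...

# Step 2: relocate the GoldenGate instance to the freshly patched node.
agctl relocate goldengate "$DEPLOYMENT"

# Step 3: the previously active node is now idle; patch it as well.
agctl status goldengate        # confirm the instance moved
# ... apply patches on the now-idle node ...
```

Because GoldenGate runs on only one node at a time, each node is patched while idle and the only replication gap is the relocation itself (about 44 seconds in the example below).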
GGHub Node Relocate
As the grid OS user on the primary GGHub system, relocate the Oracle GoldenGate instance:
[grid@gghub_prim1 ~]$ agctl status goldengate
Goldengate instance 'gghub' is running on gghub_prim1
[grid@gghub_prim1 ~]$ time agctl relocate goldengate gghub
real 0m43.984s
user 0m0.156s
sys 0m0.049s
As the grid OS user on the primary GGHub system, check the status of the Oracle GoldenGate instance:
[grid@gghub_prim1 ~]$ agctl status goldengate
Goldengate instance 'gghub' is running on gghub_prim2
GGHub Role Reversal for DR Events or to Move the GGHub into the Same Region as the Target Database
GGHub role reversal performs an ACFS role reversal so that the standby becomes the new primary. With both the primary and standby file systems online, the acfsutil repl failover command ensures that all outstanding primary file system changes are transferred and applied to the standby before the role reversal completes.
When to use GGHub role reversal:
- To move the GGHub deployment close to the target database for replication performance
- To support a site outage
- To support site maintenance
As the grid OS user on the current standby GGHub node, run the following commands to perform the ACFS role reversal:
[grid@gghub_stby1]$ export ACFS_MOUNT_POINT=/mnt/acfs_gg1
[grid@gghub_stby1]$ export GG_DEPLOYMENT_NAME=gghub
[grid@gghub_stby1]$ ssh `/sbin/acfsutil repl info -c -v $ACFS_MOUNT_POINT | grep 'Primary hostname' | awk '{print $3}' | cut -d "@" -f2` "agctl stop goldengate $GG_DEPLOYMENT_NAME"
[grid@gghub_stby1]$ /sbin/acfsutil repl failover $ACFS_MOUNT_POINT
[grid@gghub_stby1]$ agctl start goldengate $GG_DEPLOYMENT_NAME
[grid@gghub_stby1]$ agctl status goldengate $GG_DEPLOYMENT_NAME
Goldengate instance 'gghub' is running on gghub_stby1
Alternatively, as the grid OS user on any GGHub node, run the script acfs_role_reversal.sh to perform the ACFS role reversal:
[grid@gghub_stby1]$ sh /u01/oracle/scripts/acfs_role_reversal.sh /mnt/acfs_gg1 gghub
################################################################################
ACFS Primary Site: gghub_prim_vip1.frankfurt.goldengate.com
ACFS Standby Site: gghub_stby_vip1.frankfurt.goldengate.com
################################################################################
Thu Nov 30 17:28:37 UTC 2023 - Begin Stop GoldenGate gghub
Thu Nov 30 17:28:38 UTC 2023 - End Stop GoldenGate gghub
################################################################################
Thu Nov 30 17:28:38 UTC 2023 - Begin ACFS replication sync /mnt/acfs_gg1
Thu Nov 30 17:28:59 UTC 2023 - End ACFS replication sync /mnt/acfs_gg1
################################################################################
Site: Primary
Primary status: Running
Status: Send Completed
Lag Time: 00:00:00
Retries made: 0
Last send started at: Thu Nov 30 17:28:45 2023
Last send completed at: Thu Nov 30 17:28:55 2023
################################################################################
Site: Standby
Last sync time with primary: Thu Nov 30 17:28:45 2023
Status: Receive Completed
Last receive started at: Thu Nov 30 17:28:46 2023
Last receive completed at: Thu Nov 30 17:28:52 2023
################################################################################
Thu Nov 30 17:29:00 UTC 2023 - Begin Role Reversal
Thu Nov 30 17:30:02 UTC 2023 - End Role Reversal
################################################################################
ACFS Primary Site: gghub_stby_vip1.frankfurt.goldengate.com
ACFS Standby Site: gghub_prim_vip1.frankfurt.goldengate.com
################################################################################
Site: Primary
Primary status: Running
Status: Send Completed
Lag Time: 00:00:00
Retries made: 0
Last send started at: Thu Nov 30 17:29:45 2023
Last send completed at: Thu Nov 30 17:29:56 2023
################################################################################
Site: Standby
Last sync time with primary: Thu Nov 30 17:29:45 2023
Status: Receive Completed
Last receive started at: Thu Nov 30 17:29:50 2023
Last receive completed at: Thu Nov 30 17:29:50 2023
################################################################################
Thu Nov 30 17:30:03 UTC 2023 - Begin Start GoldenGate gghub
Thu Nov 30 17:30:10 UTC 2023 - End Start GoldenGate gghub
################################################################################
Managing Unplanned Outages
Expected Impact with Unplanned Outages
When an unplanned outage occurs on either the primary or standby GGHub cluster, follow these instructions to ensure continuous operation of GoldenGate. Use the following GGHub failure use cases to guide you in the event of an unplanned outage of the primary or standby GGHub systems.
Use case #1 – Standby Hub Failure or Primary GGHub Cannot Communicate with the Standby GGHub
If the primary GGHub cannot communicate with the standby GGHub, the following messages are written to the primary CRS trace file (crsd_scriptagent_grid.trc) on the active cluster node:
2023-06-21 12:06:59.506 :CLSDYNAM:1427187456: [acfs_primary]{1:8532:12141} [check] Executing action script: /u01/oracle/scripts/acfs_primary.scr[check]
2023-06-21 12:07:05.666 :CLSDYNAM:1427187456: [acfs_primary]{1:8532:12141} [check] WARNING: STANDBY not accessible (attempt 1 of 3)
2023-06-21 12:07:18.683 :CLSDYNAM:1427187456: [acfs_primary]{1:8532:12141} [check] WARNING: STANDBY not accessible (attempt 2 of 3)
2023-06-21 12:07:31.751 :CLSDYNAM:1427187456: [acfs_primary]{1:8532:12141} [check] WARNING: STANDBY not accessible (attempt 3 of 3)
2023-06-21 12:07:31.751 :CLSDYNAM:1427187456: [acfs_primary]{1:8532:12141} [check] WARNING: Problem with STANDBY file system (error: 222)
At this time, the standby file system is no longer receiving the primary file system changes. The primary file system and Oracle GoldenGate will continue to function unimpeded.
Use the following action plan with this scenario.
- Check the standby file system using the command acfsutil repl util verifystandby /mnt/acfs_gg -v to determine why the standby hub is inaccessible.
- After fixing the cause of the communication errors, the standby automatically catches up by applying the outstanding primary file system changes. The warning messages are no longer reported in the CRS trace file and are replaced with the following message:
2023-06-21 12:15:01.720 :CLSDYNAM:1427187456: [acfs_primary]{1:8532:12141} [check] SUCCESS: STANDBY file system /mnt/acfs_gg is ONLINE
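The first step of this action plan can be scripted as a simple retry loop around verifystandby. This is a sketch, not a supported tool, and it assumes (as is typical for acfsutil subcommands) that verifystandby returns a nonzero exit status while the standby remains unreachable.

```shell
#!/bin/bash
# Poll standby reachability after repairing the communication fault.
# Assumption: verifystandby's exit status reflects standby health.
MOUNT=/mnt/acfs_gg

for attempt in 1 2 3; do
    if /sbin/acfsutil repl util verifystandby "$MOUNT" -v; then
        echo "Standby reachable; replication catches up automatically."
        exit 0
    fi
    echo "Standby still unreachable (attempt $attempt of 3); retrying in 60s."
    sleep 60
done
echo "Standby remains unreachable; investigate network and cluster state." >&2
exit 1
```

No manual resynchronization is needed once the check succeeds; as noted above, the SUCCESS message appears in the CRS trace file when the standby is online again.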
Use case #2 – Primary GGHub Failure or Standby GGHub Cannot Communicate with the Primary GGHub
If the standby GGHub cannot communicate with the primary GGHub, the following messages are written to the standby CRS trace file (crsd_scriptagent_grid.trc) on the active cluster node:
2023-06-21 12:24:03.823 :CLSDYNAM:4156544768: [acfs_standby]{1:10141:2} [check] Executing action script: /u01/oracle/scripts/acfs_standby.scr[check]
2023-06-21 12:24:06.928 :CLSDYNAM:4156544768: [acfs_standby]{1:10141:2} [check] WARNING: PRIMARY not accessible (attempt 1 of 3)
2023-06-21 12:24:19.945 :CLSDYNAM:4156544768: [acfs_standby]{1:10141:2} [check] WARNING: PRIMARY not accessible (attempt 2 of 3)
2023-06-21 12:24:32.962 :CLSDYNAM:4156544768: [acfs_standby]{1:10141:2} [check] WARNING: PRIMARY not accessible (attempt 3 of 3)
2023-06-21 12:24:32.962 :CLSDYNAM:4156544768: [acfs_standby]{1:10141:2} [check] WARNING: Problem with PRIMARY file system (error: 222)
At this time, it is unlikely that the standby file system is receiving file system changes from the primary file system.
Use the following action plan with this scenario.
- Check the primary file system using the command acfsutil repl util verifyprimary /mnt/acfs_gg -v to determine why the primary hub is inaccessible.
- If the primary file system cluster is down and cannot be restarted, issue an ACFS failover on the standby GGHub:
[grid@gghub_stby1]$ /sbin/acfsutil repl failover /mnt/acfs_gg    # Specify the correct mount point
[grid@gghub_stby1]$ acfsutil repl info -c -v /mnt/acfs_gg | egrep 'Site:|Primary status|Background Resources:'
Site: Primary
Primary status: Running
Background Resources: Active
- Run the following commands to prepare the acfs_primary resource to start on the new primary hub, and then restart GoldenGate:
[grid@gghub_stby1]$ echo "RESTART" > /mnt/acfs_gg/status/acfs_primary
[grid@gghub_stby1]$ agctl start goldengate <instance_name>    # Specify the GoldenGate instance name
[grid@gghub_stby1]$ agctl status goldengate
Goldengate instance '<instance_name>' is running on gghubstby-node1
- When the old primary file system comes back online, and connectivity is resumed between the new primary and the old primary, the old primary file system automatically converts to the standby.
- If the old primary file system comes back online, but connectivity cannot be established between the primary and standby file systems, the acfs_primary resource detects that the node previously crashed. Because connectivity to the standby cannot be confirmed, GoldenGate is not started. This avoids a 'split-brain' scenario in which two file systems both believe they are the primary because they cannot communicate with each other.
Use case #3 – Double Failure Case: Primary GGHub Failure and Standby GGHub Connectivity Failure
If the primary GGHub crashes and communication cannot be established with the standby file system when it comes back online, the following messages are written to the primary CRS trace file (crsd_scriptagent_grid.trc) on the active cluster node:
2023-06-21 17:08:52.621:[acfs_primary]{1:40360:36312} [start] WARNING: PRIMARY file system /mnt/acfs_gg previously crashed
2023-06-21 17:08:55.678:[acfs_primary]{1:40360:36312} [start] WARNING: STANDBY not accessible - disabling acfs_primary
If an attempt is made to manually restart the primary file system, an additional message is written to the CRS trace file:
2023-06-21 17:25:54.224:[acfs_primary]{1:40360:37687} [start] WARNING: PRIMARY /mnt/acfs_gg disabled to prevent split brain
Use the following action plan with this scenario.
- Check the standby file system using the command acfsutil repl util verifystandby /mnt/acfs_gg -v to determine why the standby hub is inaccessible.
- If communication with the standby file system can be re-established, restart GoldenGate on the primary hub:
[grid@gghub_prim1]$ agctl start goldengate <instance_name>    # Specify the GoldenGate instance name
[grid@gghub_prim1]$ agctl status goldengate
Goldengate instance '<instance_name>' is running on gghub_prim1
- If communication with the standby file system cannot be re-established, use the following commands to restart GoldenGate on the primary hub:
[grid@gghub_prim1]$ echo "RESTART" > /mnt/acfs_gg/status/acfs_primary
[grid@gghub_prim1]$ agctl start goldengate <instance_name>    # Specify the GoldenGate instance name
[grid@gghub_prim1]$ agctl status goldengate
Goldengate instance '<instance_name>' is running on gghub_prim1
- When communication with the standby file system is restored, ACFS Replication will continue to replicate primary file system changes.
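To confirm that replication has in fact resumed, the Site, Status, and Lag Time fields of acfsutil repl info, shown in the role reversal output earlier in this chapter, can be checked on the primary. A minimal sketch:

```shell
#!/bin/bash
# Report replication health after connectivity is restored (sketch).
# Field names match the 'acfsutil repl info -c -v' output shown above.
MOUNT=/mnt/acfs_gg

/sbin/acfsutil repl info -c -v "$MOUNT" | egrep 'Site:|Status|Lag Time'
# A healthy primary reports "Status: Send Completed" and a lag of 00:00:00.
```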