6 Raft Replication Configuration and Management

Raft replication in Oracle Globally Distributed Database creates smaller replication units and distributes them automatically, handling chunk assignment, chunk movement, workload distribution, and rebalancing when shards are added or removed, including planned and unplanned changes in shard availability.

Unlike Oracle Data Guard replication, Raft replication does not need to be reconfigured when shards are added or removed, and replicas do not need to be actively managed.

Oracle Globally Distributed Database provides commands and options in the GDSCTL CLI to enable and manage Raft replication in a system-managed sharded database.

Note:

Raft replication is only supported for system-managed (automatic) data distribution methods.

Using Raft Replication in Oracle Globally Distributed Database

Oracle Globally Distributed Database provides built-in fault tolerance with Raft replication, a capability that integrates data replication with transaction execution in a sharded database.

Raft replication enables fast automatic failover with zero data loss. If all shards are in the same data center, it is possible to achieve sub-second failover. Raft replication is active/active; each shard can process reads and writes for a subset of data. This capability provides a uniform configuration with no primary or standby shards.

Raft replication is integrated and transparent to applications. Raft replication provides built-in replication for Oracle Globally Distributed Database without requiring configuration of Oracle GoldenGate or Oracle Data Guard. Raft replication automatically reconfigures replication in case of shard host failures or when shards are added or removed from the sharded database.

When Raft replication is enabled, a sharded database contains multiple replication units. A replication unit (RU) is a set of chunks that have the same replication topology. Each RU has multiple replicas placed on different shards.

Replication Unit

When Raft replication is enabled, a sharded database contains multiple replication units. A replication unit (RU) is a set of chunks that have the same replication topology. Each RU has three replicas placed on different shards. The Raft consensus protocol is used to maintain consistency between the replicas in case of failures, network partitioning, message loss, or delay.

Each shard contains replicas from multiple RUs. Some of these replicas are leaders and some are followers. Raft replication tries to maintain a balanced distribution of leaders and followers across shards. By default each shard is a leader for two RUs and is a follower for four other RUs. This makes all shards active and provides optimal utilization of hardware resources.
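
For example, with three shards and this default distribution there are six RUs in total: each shard holds the leader replicas of two RUs and follower replicas of the four RUs whose leaders are on the other two shards.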

In Oracle Globally Distributed Database, an RU is a set of chunks, as shown in the image below.



The diagram above illustrates the relationship among shards, chunk sets, and chunks. A shard contains a set of chunks. A chunk is a set of table partitions in a given table family. A chunk is a unit of resharding (data movement across shards). A set of chunks that have the same replication topology is called a chunk set.

Raft Group

Each replication unit contains exactly one chunk set and has a leader and a set of followers, and these members form a raft group. The leader and its followers for a replication unit contain replicas of the same chunk set in different shards as shown below. A shard can be the leader for some replication units and a follower for other replication units.

All DMLs for a particular subset of data are executed in the leader first, and then are replicated to its followers.



In the image above, a leader in each shard, indicated by the set of chunks in one color with a star next to it, points to a follower (of the same color) on each of the two other shards.

Replication Factor

The replication factor (RF) determines the number of participants in a Raft group. This number includes the leader and its followers.

An RU needs a majority of its replicas to be available in order to accept writes.

  • RF = 3: tolerates one replica failure
  • RF = 5: tolerates two replica failures

In Oracle Globally Distributed Database, the replication factor is specified for the entire sharded database; that is, all replication units in the database have the same RF. The number of followers is limited to two, so the replication factor is three: each commit must be acknowledged by at least two of the three replicas (the leader plus one follower).

Raft Log

Each RU is associated with a set of Raft logs and OS processes that maintain the logs and replicate changes from the leader to followers. This allows multiple RUs to operate independently and in parallel within a single shard and across multiple shards. It also makes it possible to scale the replication up and down by changing the number of RUs.

Changes to data made by a DML are recorded in the Raft log. A commit record is also recorded at the end of each user transaction. Raft logs are maintained independently from redo logs and contain logical changes to rows. Logical replication reduces failover time because followers are open to incoming transactions and can quickly become the leader.

The Raft protocol guarantees that followers receive log records in the same order they are generated by the leader. A user transaction is committed on the leader as soon as half of the followers acknowledge receipt of the commit record and write it to the Raft log.

Transactions

On a busy system, multiple commits are acknowledged at the same time. The synchronous propagation of transaction commit records provides zero data loss. The application of DML change records to followers, however, is done asynchronously to minimize the impact on transaction latency.

Leader Election Process

Per Raft protocol, if followers do not receive data or heartbeat from the leader for a specified period of time, then a new leader election process begins.

The default heartbeat interval is 150 milliseconds, with randomized election timeouts (up to 150 milliseconds) to prevent multiple shards from triggering elections at the same time, which would lead to split votes.

Node Failure

Node failure and recovery are handled in an automated way with minimal impact on the application.

Failover time is under 3 seconds when network latency between Availability Zones is less than 10 milliseconds. This includes failure detection, shard failover, change of leadership, the application reconnecting to the new leader, and business transactions continuing as before.

The impact of the failure on the application can be further masked by configuring retries in the JDBC driver, so that the end-user experience is a request that takes longer rather than one that returns an error.

The following is an illustration of a sharded database with all three shards in a healthy state. Application requests are able to reach all three shards, and replication between leaders and followers is ongoing across the shards.



Leader Node Failure

When the leader for a replication unit becomes unavailable, followers will initiate a new leader election process using the Raft protocol.

As long as a majority of the nodes (quorum) are still healthy, the Raft protocol will ensure that a new leader is elected from the available nodes.

When one of the followers succeeds in becoming the new leader, proactive notifications of the leadership change are sent from the shard to the client driver. The client driver then starts routing requests to the new leader shard. Routing clients (such as UCP) are notified using ONS notifications to update their shard and chunk mapping, ensuring that they route traffic to the new leader.

During this failover and reconnection period, the application can be configured to wait and retry using the retry interval and retry count settings in the JDBC driver configuration. These settings are very similar to the existing Oracle RAC instance failover configuration.
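
For illustration, a minimal sketch of a JDBC thin URL that uses the standard Oracle Net CONNECT_TIMEOUT, RETRY_COUNT, and RETRY_DELAY parameters to retry connection attempts; the host, port, service name, and values shown are placeholders, not recommendations.

jdbc:oracle:thin:@(DESCRIPTION=
  (CONNECT_TIMEOUT=5)(RETRY_COUNT=20)(RETRY_DELAY=3)
  (ADDRESS=(PROTOCOL=tcp)(HOST=shard_director_host)(PORT=1522))
  (CONNECT_DATA=(SERVICE_NAME=oltp_rw_srvc.orasdb.oradbcloud)))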

Upon connecting to the new leader, the application continues to function as before.

The following diagram shows the first shard failed; the replication unit whose leader was on that shard now has a new leader on the second shard.



Failback

When the original leader comes back online after a failure, it first tries to identify the current leader and attempts to rejoin the cluster as a follower. Once the failed shard rejoins the cluster, it asks the leader for logs based on its current index in order to sync up with the leader.

Leadership rebalancing can be done by running SWITCHOVER RU -rebalance, which can also be scripted if needed.

If there are not enough Raft logs available on the current leader, the follower must be repopulated from one of the good followers using the data copy command (COPY RU).

Follower Node Failure

If a follower node becomes unavailable, the leader's attempts to replicate the Raft log to that follower will fail.

The leader will attempt to reach the failed follower indefinitely until the failed follower rejoins or a new follower replaces it.

If needed, a new follower will have to be created and added to the cluster, and its data needs to be synchronized from a good follower as explained above.

Example Raft Replication Deployment

The following diagram shows a simple sharded database Raft replication deployment with 6 shards, 12 replication units, and replication factor = 3.

Each shard has two leaders and four followers. Each member is labeled with an RU number, and the suffix indicates whether it is a leader (-L) or a follower (-Fn). The leaders are also indicated by a star. In this configuration, two shards can take over the load of a failed shard.



To configure this deployment, at the shard catalog creation step, you specify GDSCTL CREATE SHARDCATALOG -repl NATIVE. You can specify the replication factor (-repfactor) in the GDSCTL CREATE SHARDCATALOG or ADD SHARDGROUP commands. Similar to the specification of chunks, you specify the number of replication units in GDSCTL CREATE SHARDCATALOG or ADD SHARDSPACE commands.
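
For example, the shard catalog creation step for this deployment could look like the following sketch; the catalog connection string and administrator credentials are placeholders, and other required options are elided.

gdsctl> create shardcatalog -database catalog_connect_string -user gsm_admin/gsm_admin_password -repl native -repfactor 3 -repunit 12 ...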

Enabling Raft Replication

You enable Raft replication when you configure the shard catalog.

To enable Raft replication, specify the native replication option in the create shardcatalog command when you create the shard catalog.

For example,

gdsctl> create shardcatalog ... -repl native

After the shard catalog is created, you can add shards to the configuration and run the DEPLOY command.

Note:

You must have at least 3 shards in your sharded database to use Raft replication.
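
For example, a minimal sketch of the overall flow with three shards; the shard director and shard connection details are placeholders, and other required options are elided.

gdsctl> create shardcatalog ... -repl native
gdsctl> add gsm -gsm sharddirector1 -catalog catalog_connect_string ...
gdsctl> add shard -connect shard1_connect_string ...
gdsctl> add shard -connect shard2_connect_string ...
gdsctl> add shard -connect shard3_connect_string ...
gdsctl> deploy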

Specifying Replication Unit Attributes

By default, Oracle Globally Distributed Database determines the number of replication units (RUs) in a shardspace and the number of chunks in an RU.

You can specify the number of primary RUs using the -repunit option when you create the shard catalog. Specify a value greater than zero (0).

gdsctl> create shardcatalog ... -repunit number

The RU value cannot be modified after the first DEPLOY command is run on the sharded database configuration. Before you run DEPLOY, you can modify the number of RUs using the MODIFY SHARDSPACE command.

gdsctl> modify shardspace -shardspace shardspaceora -repunit number

Note that in system-managed sharding there is one shardspace named shardspaceora.

If the -repunit parameter is not specified, the default number of RUs is determined at the time of execution of the first DEPLOY command.

Ensuring Replicas Are Not Placed in the Same Rack

To ensure high availability, Raft replication group members should not be placed in the same rack.

If the -rack rack_id option is specified in the ADD SHARD command, the shard catalog enforces that shards containing replicated data are not placed in the same rack. If this is not possible, an error is raised.

gdsctl> add shard -connect connect_identifier … -rack rack_id

Getting Runtime Information for Replication Units

Use GDSCTL STATUS REPLICATION to get replication unit runtime statistics, such as the leader and its followers.

GDSCTL STATUS REPLICATION can also be entered as STATUS RU, or just RU.

When the -ru option is specified, you can get information for a particular replication unit.

For example, to get information about replication unit ru1:

GDSCTL> status ru -ru 1

Replication units
------------------------
Database     RU#  Role      Term  Log Index  Status
--------     ---  ----      ----  ---------  ------
cdbsh1_sh1   1    Leader    2     21977067   Ok
cdbsh2_sh2   1    Follower  2     21977067   Ok
cdbsh3_sh3   1    Follower  2     21977067   Ok
cdbsh1_sh1   2    Follower  1     32937130   Ok
cdbsh2_sh2   2    Leader    1     32937130   Ok
cdbsh3_sh3   2    Follower  1     32937130   Ok
cdbsh1_sh1   3    Follower  2     16506205   Ok
cdbsh2_sh2   3    Follower  2     16506205   Ok
cdbsh3_sh3   3    Leader    2     16506205   Ok

For details about the command syntax, options, and examples, see status ru (RU, status replication) in Global Data Services Concepts and Administration Guide.

Scaling with Raft Replication

You can add or remove shards from your Raft replication sharded database with the following instructions.

Adding Shards

To scale up your sharded database, run GDSCTL ADD SHARD.

When the shard is deployed, Raft replication automatically splits the replication units (RUs): for each shard added, two new RUs are created and their leaders are placed on the new shard. Chunks from other RUs are moved to the new RUs to rebalance data distribution.

After adding shards, you can run CONFIG TASK to view the ongoing rebalancing tasks.

If you don't want automatic rebalancing to occur, you can deploy the new shards with GDSCTL DEPLOY -no_rebalance and then manually move RUs and chunks to suit your needs.
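
For example, a sketch of adding a new shard and then checking the rebalancing tasks; the connect string is a placeholder.

gdsctl> add shard -connect shard7_connect_string ...
gdsctl> deploy
gdsctl> config task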

Removing Shards

You remove a shard using GDSCTL REMOVE SHARD; however, you must first move any RUs off of the shard.

You can then manually rebalance the RUs on the remaining shards using MOVE RU and RELOCATE CHUNK.
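
For example, the following sketch drains shard dba before removing it, assuming dba holds the leader of replication unit 1 and a follower of replication unit 2, and that shard dbd does not already hold replicas of those RUs (all names are illustrative):

gdsctl> switchover ru -ru 1 -shard dbb
gdsctl> move ru -ru 1 -source dba -target dbd
gdsctl> move ru -ru 2 -source dba -target dbd
gdsctl> remove shard -shard dba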

See Moving Replication Unit Replicas and Moving A Chunk to Another Replication Unit for information.

Moving Replication Unit Replicas

Use MOVE RU to move a follower replica of a replication unit from one shard database to another.

For example,

gdsctl> move ru -ru 1 -source dba -target dbb

Notes:

  • The source database should not contain the replica leader
  • The target database should not already contain another replica of the replication unit

See move ru (replication_unit) in Global Data Services Concepts and Administration Guide for syntax and option details.

Changing the Replication Unit Leader

Using SWITCHOVER RU, you can change which replica is the leader for the specified replication unit.

The -shard option makes the replication unit member on the specified shard database the new leader of the given RU.

gdsctl> switchover ru -ru 1 -shard dba

To then automatically rebalance the leaders, use SWITCHOVER RU -rebalance.
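
For example, to rebalance the leaders across all shards:

gdsctl> switchover ru -rebalance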

For full syntax and option details, see switchover ru (replication_unit) in Global Data Services Concepts and Administration Guide.

Copying Replication Units

You can copy a replication unit from one shard database to another using COPY RU. This allows you to instantiate or repair a replica of a replication unit on the target shard database.

For example, to copy replication unit 1 from dba to dbb:

gdsctl> copy ru -ru 1 -source dba -target dbb

Notes:

  • Neither the source database nor the target database should be the replica leader for the given replication unit
  • If the target database already contains this replication unit, it is replaced by a full replica of the replication unit from the source database
  • If -replace is specified, the replication unit is removed from the specified database
  • If the target database doesn't contain the specified replication unit, then the total number of members for the given replication unit must be less than the replication factor (3), unless -replace is specified.
    gdsctl> copy ru -ru 1 -source dba -target dbc -replace dbb
  • If -source is not specified, then an existing follower of the replication unit is chosen as the source database.

Note:

Because running this command requires a tablespace set for the destination chunk, create at least one tablespace set before running this command.

For syntax and option details, see copy ru (replication_unit) in Global Data Services Concepts and Administration Guide.

Moving A Chunk to Another Replication Unit

To move a chunk from one Raft replication unit to another replication unit, use the GDSCTL RELOCATE CHUNK command.

To use RELOCATE CHUNK, the source and target replication unit leaders must be located on the same shard, and their followers must also be on the same shards. If they are not on the same shard, use SWITCHOVER RU to move the leader and MOVE RU to move the followers to co-located shards.

When moving chunks, specify the chunk ID numbers, the source RU ID from which to move them, and the target RU ID to move them to, as shown here.

GDSCTL> relocate chunk -chunk 3,4 -sourceru 1 -targetru 2

The specified chunks must be in the same source replication unit. If -targetru is not specified, a new empty target replication unit is created.

GDSCTL MOVE CHUNK is not supported for moving chunks in a sharded database with Raft replication enabled.

Note:

Because running this command requires a tablespace set for the destination chunk, create at least one tablespace set before running this command.

See also relocate chunk in Global Data Services Concepts and Administration Guide.

Splitting Chunks in Raft Replication

You can manually split chunks with GDSCTL SPLIT CHUNK in a Raft replication-enabled sharded database.

If you want to move some data within an RU to a new chunk, you can use GDSCTL SPLIT CHUNK to manually split the chunk.

You can then use RELOCATE CHUNK to move the new chunk to another RU if you wish. See Moving A Chunk to Another Replication Unit.
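
For example, to split chunk 8 (an illustrative chunk number):

gdsctl> split chunk -chunk 8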

Note:

Because running this command requires a tablespace set for the destination chunk, create at least one tablespace set before running this command.

Getting the Replication Type

To find out if your sharded database is using Raft replication, run CONFIG SDB to see the replication type in the command output.

In the command output, the Replication Type is listed as Native when Raft replication is enabled.

For example,

GDSCTL> config sdb

GDS Pool administrators
------------------------

Replication Type
------------------------
Native

Shard type
------------------------
System-managed

Shard spaces
------------------------
shardspaceora

Services
------------------------
oltp_ro_srvc
oltp_rw_srvc

Starting and Stopping Replication Units

The GDSCTL commands START RU and STOP RU can be used to facilitate maintenance operations.

You might want to stop the RU on a specific replica to disable replication temporarily so that you can do maintenance tasks on the database, operating system, or machine.

You can run START RU and STOP RU commands for specific replicas within a given replication unit (RU) or for all replicas.

The START RU command is used to resume the operation of previously stopped RUs. Additionally, it can be used in cases where an RU is offline due to errors. For example, if the log producer process for any replica within an RU stops functioning, it results in the RU being halted. The START RU command lets you restart the RU without a complete database restart.

To use the commands, follow this syntax:

start ru -ru ru_id [-database db]
stop ru -ru ru_id [-database db]

You supply the RU ID that you want to start or stop, and you can optionally specify the database name on which a member of the RU runs. If the database is not specified, the commands affect all available replicas of the specified replication unit.
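
For example, to stop replication unit 1 on one shard for maintenance and start it again afterward (the database name is taken from the earlier STATUS RU output and is illustrative):

gdsctl> stop ru -ru 1 -database cdbsh3_sh3
gdsctl> start ru -ru 1 -database cdbsh3_sh3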

Synchronizing Replication Unit Members

Use the GDSCTL command SYNC RU to synchronize data of the specified replication unit on all shards. This operation also erases Raft logs and resets log index and term.

To use SYNC RU, specify the replication unit (-ru ru_id).

gdsctl> sync ru -ru ru_id [-database db]

You can optionally specify a shard database name. If a database is not specified for the SYNC RU command, a replica to synchronize with will be chosen based on the following criteria:

  1. Pick the replica that was the last leader.
  2. If not available, pick the replica with the greatest apply index.

If you see "Warning: GSM timeout expired" after running SYNC RU, set the GDSCTL global service manager (shard director) request timeout to a higher value.

gdsctl configure -gsm gsm_ID -timeout seconds -save_config

Enable or Disable Reads from Follower Replication Units

Use the database initialization parameter SHARD_ENABLE_RAFT_FOLLOWER_READ to enable or disable reads from follower replication units in a shard.

Set this parameter to TRUE to enable reads, or set to FALSE to disable reads.

This parameter can have different values on different shards.
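
For illustration, a sketch of enabling follower reads on one shard, assuming the parameter can be changed dynamically with ALTER SYSTEM; check the parameter's scope and modifiability in Oracle Database Reference before using this approach.

ALTER SYSTEM SET SHARD_ENABLE_RAFT_FOLLOWER_READ = TRUE;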

See also: SHARD_ENABLE_RAFT_FOLLOWER_READ in Oracle Database Reference.

Viewing Parameter Settings

Use the SHARD_RAFT_PARAMETERS static data dictionary view to see parameters set at an RU level on each shard.

The values for these parameters, if set, can be seen in this view. The columns in the view are:

ORA_SHARD_ID: shard ID

RU_ID: replication unit ID

NAME: parameter name

VALUE: parameter value
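
For example, a simple query run on a shard lists any RU-level parameter values that have been set; the column names are those described above.

SELECT ora_shard_id, ru_id, name, value
  FROM shard_raft_parameters
 ORDER BY ru_id, name;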

For details about this view, see SHARD_RAFT_PARAMETERS in Oracle Database Reference.

Setting Parameters with GDSCTL

You can set some Raft-specific parameters at the replication unit level on each shard using the GDSCTL set ru parameter command.

Syntax

set ru parameter parameter_name=value [-shard shard_name] [-ru ru_id]

Arguments

parameter_name=value

Specify the parameter name and the value you wish to set it to. For details about each parameter setting, see Tuning Flow Control to Mitigate Follower Lag and Setting Transaction Consensus Timeout.

-ru ru_id

Specify a replication unit ID number. If not specified, the command applies to all RUs.

-shard shard_name

Specify a shard name. If not specified, the command applies to all shards.

Tuning Flow Control to Mitigate Follower Lag

Flow control in Raft replication coordinates Raft group followers to optimize performance, efficiently utilize memory, and smooth out replication pipeline hiccups, such as variable network latency.

Followers may not consistently maintain the same speed. Occasionally, one might be slightly faster, while at other times, slightly slower.

To tune flow control, set the SHARD_FLOW_CONTROL parameter on the shard where a follower is lagging.

For example,

gdsctl set ru parameter SHARD_FLOW_CONTROL=value

You can optionally specify a shard (-shard) or a replication unit ID number (-ru).

The value argument can be set to one of the following:

  • (Default) TILL_TIMEOUT: As long as the slow follower has received an LCR within a threshold time of the fast follower (see "Configuring Threshold Timeout" below), the fast follower is throttled.

    However, if the slow follower falls behind by more than the threshold time, then it is disconnected, at which point it may or may not be able to catch up, depending on why there is a lag between the two followers. For example, if the slow follower is lagging because the network connection to it is bad for a very long time, it will be disconnected. This is also the case if the slow follower is actually a down follower.

    The default SHARD_FLOW_CONTROL setting is TILL_TIMEOUT with a threshold of 10 times the heartbeat interval.

  • AT_DISTANCE: As long as the slow follower is within a threshold distance of the fast follower, in terms of LCRs (see "Configuring Threshold Distance" below), the fast follower is throttled.

    However, if the slow follower falls behind by more than the threshold distance, then it is disconnected, at which point it may or may not be able to catch up, depending on why there is a lag between the two followers. For example, if the slow follower is lagging because the network connection to it is bad for a very long time, it will be disconnected. This is also the case if the slow follower is actually a down follower.

  • AT_LOGLIMIT: Flow control does not kick in during normal operation at all, but only starts if the log file is about to be overwritten by the leader but the slow follower still needs LCRs from the file being overwritten. When this situation occurs, the leader waits for the slow follower to consume the LCRs from the file to be overwritten.

    If the slow follower is actually a down follower, then with this option the leader waits for the slow follower to come online again when the RU's raft log limit is reached.
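
For example, to select the AT_DISTANCE policy for replication unit 2 on shard dbb (the shard name and RU number are illustrative):

gdsctl set ru parameter SHARD_FLOW_CONTROL=AT_DISTANCE -shard dbb -ru 2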

Configuring Threshold Distance

Threshold distance, expressed as a percentage of the in-memory queue size for LCRs, is used by the AT_DISTANCE option for flow control.

The default value is 10.

To set the threshold distance to another value, run:

gdsctl set ru parameter SHARD_FLOW_CONTROL_NS_DISTANCE_PCT=number

You can optionally specify a shard (-shard) or a replication unit ID number (-ru).

Configuring Threshold Timeout

Threshold timeout is used by the TILL_TIMEOUT option for flow control.

The timeout is expressed as a multiple of the heartbeat interval (which is in milliseconds), and the default value is 10; that is, 10 times the heartbeat interval.

To set the threshold timeout, run:

gdsctl set ru parameter SHARD_FLOW_CONTROL_TIMEOUT_MULTIPLIER=multiplier

You can optionally specify a shard (-shard) or a replication unit ID number (-ru).

See Setting Parameters with GDSCTL for details about using the set ru parameter command.

Setting Transaction Consensus Timeout

You can change the timeout value for a transaction to get consensus in Raft replication.

To configure the transaction consensus timeout, set the SHARD_TXN_ACK_TIMEOUT_SEC parameter, which specifies the maximum time a user transaction waits for the consensus of its commit before raising ORA-05086.

gdsctl set ru parameter SHARD_TXN_ACK_TIMEOUT_SEC=seconds

You can optionally specify a shard (-shard) or a replication unit ID number (-ru).

By default, Raft replication waits 90 seconds for a transaction to get consensus. However, if the leader is disconnected from the other replicas, it may not get consensus for its commits, and if memory in the replication pipeline is low, replication of LCRs slows down, resulting in delayed acknowledgments. In cases such as these, 90 seconds may be too long to wait, and you may want to fail the transaction much earlier, depending on your application requirements.

The minimum valid value is 1 second.
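
For example, to fail a commit that cannot reach consensus within 30 seconds (an illustrative value):

gdsctl set ru parameter SHARD_TXN_ACK_TIMEOUT_SEC=30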

See Setting Parameters with GDSCTL for details about using the set ru parameter command.

Dynamic Performance Views for Raft Replication

There are several dynamic performance (V$) views available for Raft replication.

  • V$SHARD_ACK_SENDER
  • V$SHARD_ACK_RECEIVER
  • V$SHARD_APPLY_COORDINATOR
  • V$SHARD_APPLY_LCR_READER
  • V$SHARD_APPLY_READER
  • V$SHARD_APPLY_SERVER
  • V$SHARD_LCR_LOGS
  • V$SHARD_LCR_PERSISTER
  • V$SHARD_LCR_PRODUCER
  • V$SHARD_NETWORK_SENDER
  • V$SHARD_MESSAGE_TRACKING
  • V$SHARD_REPLICATION_UNIT
  • V$SHARD_TRANSACTION

For descriptions and column details for these views, see Dynamic Performance Views in Oracle Database Reference.

Raft Replication Restrictions

The following restrictions apply to Raft replication in Oracle Globally Distributed Database.

GDSCTL MOVE CHUNK is not supported for Raft replication. To move chunks from one replication unit to another, use RELOCATE CHUNK. See Moving A Chunk to Another Replication Unit.