Chapter 3. Transaction Management

Table of Contents

Managing Durability
Durability Controls
Commit File Synchronization
Managing Acknowledgements
Managing Consistency
Setting Consistency Policies
Time Consistency Policies
Commit Point Consistency Policies
Availability
Write Availability
Read Availability
Consistency and Durability Use Cases
Out on the Town
Bio Labs, Inc
Managing Transaction Rollbacks
Example Run Transaction Class
RunTransaction Class
Using RunTransaction

A JE HA application is essentially a transactional application that distributes its data across multiple environments for you. The assumption is that these environments are on separate physical hosts, so the distribution of your data is performed over TCP/IP connections.

Because of this distribution activity, several new dimensions are added to your transactional management. In particular, there is more to consider in the areas of durability, consistency and performance than you have to think about for single-environment applications.

Before continuing, some definitions are in order:

  1. Durability is defined by how likely your data will continue to exist in the presence of hardware breakage or a software crash. The first goal of any durability scheme is to get your data stored onto physical media. After that, to make your data even more durable, you will usually start to consider your backup schemes.

    By its very nature, a JE HA application is offering you more data durability than does a traditional transactional application. This is because your HA application is distributing your data across multiple environments (which we assume are on multiple physical machines), which means that data backups are built into the application. The more backups, the more durable your application is.

  2. Consistency is defined by how current your data is. In a traditional transactional application, consistency is guaranteed by allowing you to group multiple read and write operations in a single atomic unit, which is defined by the transactional handle. This level of consistency continues to exist for your HA application, but in addition you must concern yourself with how consistent (or correct) the data is across the various nodes in the replication group.

    Because the replication group is a collection of differing machines connected by a network, some amount of a delay in data updates is to be naturally expected across the Replicas. The amount of delay that you will see is determined by the number and size of the data updates, the performance of your network, the performance of the hardware on which your nodes are running, and whether your nodes are persistently available on the network (as opposed to being down or offline or otherwise not on the network for some period of time). Because they are not included in acknowledgments, Secondary nodes may tend to show greater delay than Electable nodes.

    A highly consistent HA application, then, is an application where the data across all nodes in the replication group is identical or very nearly identical all the time. A not very consistent HA application is one where data across the replication group is frequently stale or out of date relative to the data contained on the Master node.

  3. Performance is simply how fast your HA application is at performing read and write requests. By its very nature, an HA application tends to perform much better than a traditional transactional application at read-only requests. This is because you have multiple machines that are available to service read-only requests. The only tricky thing here is to make sure you load balance your read requests appropriately across all your nodes so that you do not have some nodes that are swamped with requests while others are mostly idle.

    Write performance for an HA application is a mixed bag. Depending on your goals, you can make the HA application perform better than a traditional transactional application that is committing writes to the disk synchronously. However, in doing so you will compromise your data's durability and consistency guarantees. This is no different than configuring a traditional transactional application to commit transactions asynchronously to disk, and so lose the guarantee that the write is stored on physical media before the transaction completes. However, the good news is that because of the distributed nature of the HA application, you have a better durability guarantee than the asynchronously committing single-environment transactional application. That is, by "committing to the network" you have a fairly good chance of a write making it to disk somewhere on some node.

    Mostly, though, HA applications commit a transaction and then wait for an acknowledgement from some number of nodes before the transaction is complete. An HA application running with quorum acknowledgements and write no sync durability can exhibit equal or better write performance than a single node standalone application, but your write performance will ultimately depend on your application's configuration.

As you design your HA application, remember that each of these characteristics are interdependent. You cannot, for example, configure your application to have extremely high durability without sacrificing some amount of performance. A highly consistent application may have to make sacrifices in durability. A high performance HA application may require you to make trade-offs in both durability and consistency.

Managing Durability

A highly durable application is one where you attempt to make sure you do not lose data, ever. This is frequently (but not always) one of the most pressing design considerations for any application that manages data. After all, data often equals money because the data you are managing could involve billing or inventory information. But even if your application is not managing information that directly relates to money, a loss of data may very well cost your enterprise money in terms of the time and resources necessary to reacquire the information.

HA applications attempt to increase their data durability guarantees by distributing data writes across multiple physical machines on the network. By spreading the data in this way, you are placing it on stable storage on multiple physical hard drives, CPUs and power supplies. Obviously, the more physical resources available to contain your data, the more durable it is.

However, as you increase your data durability, you will probably lower your consistency guarantees and probably your write performance. Read performance may also take a hit, depending on how many physical machines you include in the mix and how high a durability guarantee you want. In order to understand why, you have to understand how JE HA applications handle transactional commits.

Durability Controls

By default, JE HA makes transactional commit operations on the Master wait to return from the operation until they receive acknowledgements from some number of Replicas. Each Replica, in turn, will only return an acknowledgement once the write operation has met whatever durability requirement exists for the Replica. (For example, you can require the Replicas to successfully flush the write operation to disk before returning an acknowledgement to the Master.)

Note

Be aware that write operations received on the Replica from the Master have lock priority. This means that if the Replica is currently servicing a read request, it might have to retry the read operation should a write from the Master preempt the read lock. For this reason, you can see read performance degradation if you have Replicas that are heavily loaded with read requests at a time when the Master is performing a lot of write activity. The solution to this is to add additional nodes to your replication group and/or better load-balance your read requests across the Replicas.

There are three things to control when you design your durability guarantee:

  • Whether the Master synchronously writes the transaction to disk. This is no different from the durability consideration that you have for a stand-alone transactional application.

  • Whether the Replica synchronously writes the transaction to disk before returning an acknowledgement to the Master, if any.

  • How many, if any, Replicas must acknowledge the transaction commit before the commit operation on the Master can complete.

You can configure your durability policy on a transaction-by-transaction basis using TransactionConfig.setDurability(), or on an environment-wide basis using EnvironmentMutableConfig.setDurability().

Commit File Synchronization

Synchronization policies are described in the Berkeley DB, Java Edition Getting Started with Transaction Processing guide. However, for the sake of completeness, we briefly cover this topic here again.

You define your commit synchronization policy by using a Durability class object. For HA applications, the Durability class constructor must define the synchronization policy for both the Master and the Master's replicas. The synchronization policy does not have to be the same for both Master and Replica.

You can use the following constants to define a synchronization policy:

  • Durability.SyncPolicy.SYNC

    Write and synchronously flush the log to disk upon transaction commit. This offers the most durable transaction configuration because the commit operation will not return until all of the disk I/O is complete. But, conversely, this offers the worse possible write performance because disk I/O is an expensive and time-consuming operation.

  • Durability.SyncPolicy.NO_SYNC

    Do not synchronously flush the log on transaction commit. All of the transaction's write activity is held entirely in memory when the transaction completes. The log will eventually make it to disk (barring an application hardware crash of some kind). However, the application's thread of control is free to continue operations without waiting for expensive disk I/O to complete.

    This represents the least durable configuration that you can provide for your transactions. But it also offers much better write performance than the other options.

  • Durability.SyncPolicy.WRITE_NO_SYNC

    Log data is synchronously written to the OS's file system buffers upon transaction commit, but the data is not actually forced to disk. This protects your write activities from an application crash, but not from a hardware failure.

    This policy represents an intermediate durability guarantee. It is not has strong as SYNC, but is also not as weak as NO_SYNC. Conversely, it performs better than NO_SYNC (because your application does not have to wait for actual disk I/O), but it does not perform quite as well as SYNC (because data still must be written to the file system buffers).

Managing Acknowledgements

Whenever a Master commits a transaction, by default it waits for acknowledgements from a majority of its Electable Replicas before the commit operation on the Master completes. By default, Electable Replicas respond with an acknowledgement once they have successfully written the transaction to their local disk. Note that Secondary Replicas do not ever provide acknowledgements.

Acknowledgements are expensive operations. They involve both network traffic, as well as disk I/O at multiple physical machines. So on the one hand, acknowledgements help to increase your durability guarantees. On the other, they hurt your application's performance, and may have a negative impact on your application's consistency guarantee.

For this reason, JE allows you to manage acknowledgements for your HA application. As is the case with synchronization policies, you do this using the Durability class. As a part of this class' constructor, you can provide it with one of the following constants:

  • Durability.ReplicaAckPolicy.ALL

    All of the Electable Replicas must acknowledge the transactional commit. This represents the highest possible durability guarantee for your HA application, but it also represents the poorest performance. For best results, do not use this policy unless your replication group contains a very small number of electable replicas, and those replicas are all on extremely reliable networks and servers.

  • Durability.ReplicaAckPolicy.NONE

    The Master will not wait for any acknowledgements from its Replicas. In this case, your durability guarantee is determined entirely by the synchronization policy your Master is using for its transactional commits. This policy also represents the best possible choice for write performance.

  • Durability.ReplicaAckPolicy.SIMPLE_MAJORITY

    A simple majority of the Electable Replicas must return acknowledgements before the commit operation returns on the Master. This is the default policy. It should work well for most applications unless you need an extremely high durability guarantee, have a very large number of Electable Replicas, or you otherwise have performance concerns that cause you to want to avoid acknowledgements altogether.

You can configure your synchronization policy on a transaction-by-transaction basis using TransactionConfig.setDurability(), or on an environment-wide basis using EnvironmentMutableConfig.setDurability(). For example:

   EnvironmentConfig envConfig = new EnvironmentConfig();
   envConfig.setAllowCreate(true);
   envConfig.setTransactional(true);

   // Require no synchronization for transactional commit on the 
   // Master, but full synchronization on the Replicas. Also,
   // wait for acknowledgements from a simple majority of Replicas.
   Durability durability =
          new Durability(Durability.SyncPolicy.WRITE_NO_SYNC,
                         Durability.SyncPolicy.NO_SYNC,
                         Durability.ReplicaAckPolicy.SIMPLE_MAJORITY);

   envConfig.setDurability(durability);

   // Identify the node
   ReplicationConfig repConfig = 
        new ReplicationConfig("PlanetaryRepGroup",
                              "Jupiter",
                              "jupiter.example.com:5002");

   // Use the node at mercury.example.com:5001 as a helper to find
   // the rest of the group.
   repConfig.setHelperHosts("mercury.example.com:5001");

   ReplicatedEnvironment repEnv =
      new ReplicatedEnvironment(home, repConfig, envConfig); 

Note that at the time of a transaction commit, if the Master is not in contact with enough Electable Replicas to meet the transaction's durability policy, the transaction commit operation will throw an InsufficientReplicasException. The proper action to take upon encountering this exception is to abort the transaction, wait a small period of time in the hopes that more Electable Replicas will become available, then retry the exception. See Example Run Transaction Class for example code that implements this retry loop.

You can also see an InsufficientReplicasException when you begin a transaction if the Master fails to be in contact with enough Electable Replicas to meet the acknowledgement policy. To manage this, you can configure how long the transaction begin operation will wait for enough Electable Replicas before throwing this exception. You use the INSUFFICIENT_REPLICAS_TIMEOUT configuration option, which you can set using the ReplicationConfig.setConfigParam() method.

Managing Acknowledgement Timeouts

In addition to the acknowledgement policies, you have to also consider your replication acknowledgement timeout value. This value specifies the maximum amount of time that the Master will wait for acknowledgements from its Electable Replicas.

If the Master commits a transaction and the timeout value is exceeded while waiting for enough acknowledgements, the Transaction.commit() method will throw an InsufficientAcksException exception. In this event, the transaction has been committed on the Master, so at least locally the transaction's durability policy has been met. However, the transaction might not have been committed on enough Electable Replicas to guarantee your HA application's overall durability policy.

There can be a lot of reasons why the Master did not get enough acknowledgements before the timeout value, such as a slow network, a network failure before or after a transaction was transmitted to a replica, or a failure of a replica. These failures have different consequences for whether a transaction will become durable or will be subject to rollback. As a result, an application may respond in various ways, and for example choose to:

  • Do nothing, assuming that the transaction will eventually propagate to enough replicas to become durable.

  • Retry the operation in a new transaction, which may succeed or fail depending on whether the underlying problems have been resolved.

  • Retry using a larger timeout interval and return to the original timeout interval at a later time.

  • Fall back temporarily to a read-only mode.

  • Increase the durability of the transaction on the Master by ensuring that the changes are flushed to the operating system's buffers or to the disk.

  • Give up and report an error at a higher level, perhaps to allow an administrator to check the underlying cause of the failure.

The default value for this timeout is 5 seconds, which should work for most cases where an acknowledgement policy is in use. However, if you have a very large number of Electable Replicas, or if you have a very unreliable network, then you might see a lot of InsufficientAcksException exceptions. In this case, you should either increase this timeout value, relax your acknowledgement policy, or find out why your hardware and/or network is performing so poorly.

Note

You can also see InsufficientAcksException or InsufficientReplicasException exceptions if one or more replicas have exceeded their disk usage thresholds. See Suspending Writes Due to Disk Thresholds for more information.

You can configure your acknowledgement policy using the ReplicationConfig.setReplicaAckTimeout() method.

   EnvironmentConfig envConfig = new EnvironmentConfig();
   envConfig.setAllowCreate(true);
   envConfig.setTransactional(true);

   // Require no synchronization for transactional commit on the 
   // Master, but full synchronization on the Replicas. Also,
   // wait for acknowledgements from a simple majority of Replicas.
   Durability durability =
          new Durability(Durability.SyncPolicy.WRITE_NO_SYNC,
                         Durability.SyncPolicy.NO_SYNC,
                         Durability.ReplicaAckPolicy.SIMPLE_MAJORITY);

   envConfig.setDurability(durability);

   // Identify the node
   ReplicationConfig repConfig = 
        new ReplicationConfig("PlanetaryRepGroup",
                              "Jupiter",
                              "jupiter.example.com:5002");
 
   // Use the node at mercury.example.com:5001 as a helper to find the rest
   // of the group.
   repConfig.setHelperHosts("mercury.example.com:5001");

   // Set a acknowledgement timeout that is slightly longer
   // than the default 5 seconds.
   repConfig.setReplicaAckTimeout(7, TimeUnit.SECONDS);

   ReplicatedEnvironment repEnv =
      new ReplicatedEnvironment(home, repConfig, envConfig);