Building replicated applications

The simplest way to build a replicated Berkeley DB application is to first build (and debug!) the transactional version of the same application. Then, add a thin replication layer: application initialization must be changed and the application's communication infrastructure must be added.

The application initialization changes are relatively simple. Replication Manager provides a communication infrastructure, but in order to use the replication Base APIs you must provide your own.

For implementation reasons, all replicated databases must reside in the data directories set from DB_ENV->add_data_dir() (or in the default environment home directory, if not using DB_ENV->add_data_dir()), rather than in a subdirectory below the specified directory. Care must be taken in applications using relative pathnames and changing working directories after opening the environment. In such applications the replication initialization code may not be able to locate the databases, and applications that change their working directories may need to use absolute pathnames.

During application initialization, the application performs three additional tasks. First, it must specify the DB_INIT_REP flag when opening its database environment; a Replication Manager application must also specify the DB_THREAD flag. Second, it must provide Berkeley DB with information about its communications infrastructure. Third, it must start the Berkeley DB replication system. Otherwise, a replicated application generally performs normal Berkeley DB recovery and configuration, exactly like any other transactional application.

Replication Manager applications configure the built-in communications infrastructure by obtaining a DB_SITE handle and then using it to configure the local site. They can optionally obtain one or more additional DB_SITE handles to configure remote sites. Once the environment has been opened, the application starts the replication system by calling the DB_ENV->repmgr_start() method.

A Base API application calls the DB_ENV->rep_set_transport() method to configure the entry point to its own communications infrastructure, and then calls the DB_ENV->rep_start() method to join or create the replication group.

When starting the replication system, an application has two choices: it may choose the group master site explicitly, or alternatively it may configure all group members as clients and then call for an election, letting the clients select the master from among themselves. Either is correct, and the choice is entirely up to the application.

Replication Manager applications make this choice simply by setting the flags parameter to the DB_ENV->repmgr_start() method.

For a Base API application, the result of calling DB_ENV->rep_start() is usually the discovery of a master, or the declaration of the local environment as the master. If a master has not been discovered after a reasonable amount of time, the application should call DB_ENV->rep_elect() to call for an election.

Consider a Base API application with multiple processes or multiple environment handles that modify databases in the replicated environment. All modifications must be done on the master environment. The first process to join or create the master environment must call both the DB_ENV->rep_set_transport() and the DB_ENV->rep_start() methods. Subsequent replication processes must at least call the DB_ENV->rep_set_transport() method. Those processes may also call the DB_ENV->rep_start() method, as long as they use the same master or client argument. If multiple processes are modifying the master environment, there must be a unified communication infrastructure such that messages arriving at clients have a single master ID. Additionally, the application must be structured so that all incoming messages can be processed by a single DB_ENV handle.

Note that not all processes running in replicated environments need to call DB_ENV->repmgr_start(), DB_ENV->rep_set_transport() or DB_ENV->rep_start(). Read-only processes running in a master environment do not need to be configured for replication in any way. Processes running in a client environment are read-only by definition, and so do not need to be configured for replication either (although, in the case of clients that may become masters, it is usually simplest to configure for replication on process startup rather than trying to reconfigure when the client becomes a master). Obviously, at least one thread of control on each client must be configured for replication as messages must be passed between the master and the client.

Any site in a replication group may have its own private transactional databases in the environment as well. A site may create a local database by specifying the DB_TXN_NOT_DURABLE flag to the DB->set_flags() method. The application must never create a private database with the same name as a database replicated across the entire environment, because data corruption can result.

For implementation reasons, Base API applications must process all incoming replication messages using the same DB_ENV handle. It is not required that a single thread of control process all messages, only that all threads of control processing messages use the same handle.

No additional calls are required to shut down a database environment participating in a replication group. The application should shut down the environment in the usual manner, by calling the DB_ENV->close() method. For Replication Manager applications, this also terminates all network connections and background processing threads.