A Disaster Recovery

Disaster recovery is the process for recovering or preventing the loss of business critical information after a natural or human-induced disaster.

See also:

Recovering a KMA

OKM uses a cluster design of at least two KMAs (see Footnote 1) to help reduce the risk of disruptions and assist in recovery. Clustering KMAs allows you to replicate database entries and balance workloads. If a component fails, it can be easily replaced and restored.

When designing an encryption and archive strategy, you should ensure that critical data is replicated and vaulted off-site (see "Example Scenarios for Recovering Data").

If at least one KMA remains operational, you can recover a single KMA without impacting the rest of the cluster. The following sections address scenarios that require recovery of a single KMA.

KMA Recovery Following a Software Upgrade

Software upgrades do not require a repair or a recovery; however, the KMA is sometimes out of service while the upgrade takes place. The cluster allows the upgrade to occur without interrupting the active encryption agents.

You can download the new software concurrently on all KMAs in the cluster; however, activating the new software requires the KMA to reboot. To prevent an interruption, stagger the KMA reboots so that at least one KMA in the cluster is always active. As each KMA returns to an online status, any database updates made while it was offline are replicated and all KMAs in the cluster re-synchronize.
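The staggered reboot described above can be sketched as a simple batching plan. This is a minimal illustration only: the function name, the KMA names, and the `batch_size` parameter are hypothetical, and the sketch does not call any OKM interface. It only shows how to split a cluster into reboot batches so that no single batch contains every KMA.

```python
def staggered_reboot_plan(kmas, batch_size=1):
    """Split KMAs into reboot batches so at least one KMA stays online.

    `kmas` is a list of KMA names (hypothetical labels, not OKM objects).
    """
    if len(kmas) < 2:
        raise ValueError("a staggered reboot needs at least two KMAs")
    # Cap the batch size so one batch can never hold the whole cluster.
    batch_size = max(1, min(batch_size, len(kmas) - 1))
    return [kmas[i:i + batch_size] for i in range(0, len(kmas), batch_size)]


plan = staggered_reboot_plan(["kma-1", "kma-2", "kma-3", "kma-4"], batch_size=2)
for step, batch in enumerate(plan, start=1):
    print(f"step {step}: reboot {batch}, then wait until online and re-synchronized")
```

In practice, each batch would be activated only after the previous batch has returned to an online status.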

KMA Recovery Following a Network Disconnection

When a KMA disconnects from the management network, such as when activating new software, the remaining KMAs in the cluster attempt to contact it and report communication errors in the audit event log. Agents continue to communicate with other KMAs across the network. Usually these are other KMAs attached to the same service network. However, because Agents may be attached to the management network, they first attempt to work with the KMAs in their own configured site; but if need be, they will contact any reachable KMAs within the cluster.

When the KMA reconnects to the network, any database updates done while the KMA was disconnected will be replicated and all KMAs in the cluster re-synchronize.
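The agent behavior described in this section (prefer KMAs in the agent's own configured site, then fall back to any reachable KMA in the cluster) can be sketched as an ordering function. This is an assumption-laden illustration: the tuple layout, site names, and function name are hypothetical and do not represent the agent's actual implementation.

```python
def kma_contact_order(agent_site, kmas):
    """Return reachable KMAs, with those in the agent's own site first.

    `kmas` is a list of (name, site, reachable) tuples; this layout is
    hypothetical and is not an OKM data structure.
    """
    reachable = [(name, site) for name, site, ok in kmas if ok]
    # sorted() is stable, so the original order is kept within each group.
    ordered = sorted(reachable, key=lambda kma: kma[1] != agent_site)
    return [name for name, _ in ordered]


cluster = [
    ("kma-1", "site-1", False),  # disconnected, for example while rebooting
    ("kma-2", "site-1", True),
    ("kma-3", "site-2", True),
]
print(kma_contact_order("site-1", cluster))  # ['kma-2', 'kma-3']
```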

KMA Recovery Following a Hardware Failure

If a hardware failure occurs, you should first delete the KMA from the cluster so that the remaining KMAs stop attempting to communicate with it. If the KMA console is still accessible, you can reset the KMA. The reset operation returns the unit to its factory defaults. This operation offers the option to scrub the server's hard disk as an extra security precaution. Disposition of the failed server is handled by the customer.

An Oracle service representative can repair a KMA server and add it to the cluster as described in the Oracle Key Manager 3 Installation and Service Manual, PN E48395-xx. Once the KMA is added to the cluster, the database replicates, the KMAs in the cluster re-synchronize, and the new KMA becomes an active member of the cluster.

Considerations When Performing Backups and Key Sharing

OKM backups and key sharing (import/export) are database-intensive and slow the KMA's response to other requests while it is performing the backup or key transfer operation. If possible, reduce tape drive workloads during the OKM backup and transfer window. If that is not possible, consider the following options:

  • Use the same KMA for backups and key sharing each time (most likely this is how cron jobs invoking the OKM backup utility will get set up).

  • If the cluster is large enough, dedicate a KMA to be an administrative KMA.

    • This KMA should not have a service network connection so it would not be burdened with tape drive key requests at any time, especially during the backup or key transfer windows.

    • This KMA could also be used for OKM GUI sessions, relieving the other KMAs of management-related requests.

  • Ensure fast management network connectivity for the backup and key transfer KMA. The faster the connection, the better the KMA can keep up with the additional load during backup and key transfer windows. This is true for all KMAs, but especially for the KMA performing backups, because it falls behind on servicing replication requests during the backup window. A fast network connection helps to minimize the resulting replication backlog (lag).

  • Put the backup and key transfer KMA in a site that is not used by tape drives. The tape drives then prefer other KMAs within the site to which they have been assigned and avoid using the backup and key transfer KMA.

  • Add more KMAs to sites containing tape drives so that load balancing of key requests will occur across more KMAs. This reduces the number of key requests that the backup and key transfer KMA has to handle.

Determining Key Pool Size

OKM administrators should know the worst-case number of keys they expect to be created during the OKM backup/key transfer window. The default key pool size of 1000 keys should be sufficient for most customers unless the estimated worst-case number of keys created during the backup window exceeds this.


KMAs pre-generate keys, so a key creation request from an agent is served from the key pool; a new key is not actually created on the KMA until the key pool maintainer runs within the server. When the server is busy, the key pool maintainer can be delayed in its operations.

The total cluster key pool size must be large enough so that KMAs can hand out pre-generated keys from their key pool during the backup windows. When the key pool size is too small, KMAs can become drained of pre-generated keys and start returning "no ready key" errors. Tape drives fail over to other KMAs when this happens, adding further disruption to the backup/key transfer window.
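The sizing check described above is simple arithmetic, sketched here under stated assumptions. Only the default of 1000 keys comes from this section; the function names, the per-hour rate input, and the `safety_factor` parameter are hypothetical. The sketch treats the pool size as one number; whether you compare against a per-KMA or a cluster-wide figure depends on your deployment.

```python
import math

DEFAULT_KEY_POOL_SIZE = 1000  # default stated in this section


def keys_needed(keys_per_hour, window_minutes, safety_factor=1.0):
    """Worst-case number of keys requested during one backup window."""
    return math.ceil(keys_per_hour * window_minutes / 60 * safety_factor)


def pool_is_sufficient(keys_per_hour, window_minutes,
                       pool_size=DEFAULT_KEY_POOL_SIZE):
    """True if the pool can cover the worst case without being drained."""
    return keys_needed(keys_per_hour, window_minutes) <= pool_size


print(keys_needed(600, 30))          # 300 keys in a 30-minute window
print(pool_is_sufficient(600, 30))   # True: 300 <= 1000
print(pool_is_sufficient(3000, 45))  # False: 2250 > 1000
```

Re-running the estimate as the backup window grows gives an early signal that the key pool size needs adjusting.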

Administrators should observe the OKM backup window periodically, because it gradually grows as the database gets larger. Adjust the key pool size when the backup window exceeds a threshold or if the key consumption rate grows due to changes in the overall tape workload.

Example Scenarios for Recovering Data

OKM can span multiple geographically separated sites to reduce the risk of a disaster destroying the entire cluster. Although it is unlikely that an entire cluster must be re-created, you can recover most of the key data by re-creating the OKM environment from a recent database backup.

When designing an encryption/archive strategy, you should replicate and vault critical data at a recovery site. If a site is lost, this backup data may be transferred to another operational site. Data units and keys associated with tape volumes will be known to the KMAs at the sister site, and encrypted data required to continue business operations will be available. The damaged portion of the cluster can be restored easily at the same or a different location once site operations resume.

Many companies employ the services of a third-party disaster recovery (DR) site to allow them to restart their business operations as quickly as possible. Periodic unannounced DR tests demonstrate the company's degree of preparedness to recover from a disaster, natural or human-induced.

Replicating from Another Site

The figures below show examples of two geographically separate sites (two KMAs at each site). Recovery of a single KMA can occur with no impact to the rest of the cluster as long as one KMA always remains operational.

Figure A-1 shows a disaster recovery example where the time to recover business continuity for an entire site could be months. If Site 1 were destroyed, the customer would have to replace all the destroyed equipment to continue tape operations at Site 1. Completely restoring Site 1 would require you to install and create the new KMAs (requires a Security Officer and Quorum), join the existing cluster, and enroll the tape drives. Site 1 then self-replicates from the surviving KMAs at Site 2.

Figure A-1 Replication from Another Site—No WAN Service Network

Figure A-2 shows a disaster recovery example where the time to recover business continuity is a matter of minutes. If the KMAs at Site 1 were destroyed and the infrastructure at Site 2 is still intact, a WAN used as the Service Network that connects the tape drives between the two sites allows the intact KMAs at Site 2 to continue tape operations for both sites. Once the KMAs are replaced at Site 1, they self-replicate from the surviving KMAs at the intact Site 2.

Figure A-2 Replication from Another Site—WAN Service Network

Using a Dedicated Disaster Recovery Site

The customer can place KMAs at the disaster recovery site and configure these into their production cluster using a WAN connection. These KMAs are dedicated to the specific customer and allow keys to always be at the site and ready for use.

With this approach, a recovery can begin once the customer enrolls the tape drives in the KMAs and joins the OKM cluster. This can be done by connecting the OKM GUI to the KMAs at the DR site. In a true disaster recovery scenario, these may be the only remaining KMAs from the customer's cluster. Drive enrollment can occur within minutes and tape production can begin after configuring the drives.

In the example below, the customer has a large environment with multiple sites. Each site uses a pair of KMAs and the infrastructure to support automated tape encryption, and all KMAs belong to a single cluster in which they share keys. Along with the multiple sites, this customer also maintains and uses equipment at a Disaster Recovery (DR) site that is part of the customer's OKM Cluster.

This customer uses a simple backup scheme that consists of daily incremental backups, weekly differential backups, and monthly full backups. The monthly backups are duplicated at the DR site and sent to an off-site storage facility for 90 days. After the 90-day retention period, the tapes are recycled. Because the customer owns the equipment at the DR site, this site is just an extension of the customer that strictly handles the back-up and archive processes.

Figure A-3 Pre-positioned Equipment at a Dedicated Disaster Recovery Site

Using Shared Resources for Disaster Recovery

Companies that specialize in records management, data destruction, and data recovery purchase equipment that several customers can use for backup and archive. Using shared resources can provide cost-efficient elements for disaster recovery. The customer can restore their OKM backups into KMAs provided by the shared resource site. This avoids the need for a wide area network (WAN) link and on-site dedicated KMAs; however, it requires additional time to restore the database. Restore operations can take about 20 minutes per 100,000 keys.
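The restore figure quoted above can be turned into a rough planning estimate. This sketch assumes that restore time scales linearly with the number of keys, which the section does not state; the function name is hypothetical and the result is an approximation, not a guaranteed duration.

```python
MINUTES_PER_100K_KEYS = 20  # approximate figure quoted in this section


def estimated_restore_minutes(key_count):
    """Rough restore time, assuming linear scaling with key count."""
    return key_count / 100_000 * MINUTES_PER_100K_KEYS


for keys in (100_000, 250_000, 1_000_000):
    print(f"{keys} keys: about {estimated_restore_minutes(keys):.0f} minutes")
```

Under this assumption, a database of 1,000,000 keys would take a little over three hours to restore.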

At the DR site:

  • The customer selects the appropriate equipment from the DR site inventory. The DR site configures the equipment and infrastructure accordingly.

  • IMPORTANT: The customer must provide the DR site with the three OKM back-up files: the Core Security backup file (requires a quorum), .xml backup file, and .dat backup file.

  • The customer configures an initial KMA using QuickStart, restores the KMA from the OKM backup files, activates, enables, or switches the drives to encryption-capable, and enrolls the tape drives into the DR site KMA cluster.

  • Once the restore completes, the DR site needs to switch off encryption on the agents, remove the tape drives from the cluster or reset the drive passphrases, reset the KMAs to factory default, and disconnect the infrastructure/network.

Using Key Transfer Partners for Disaster Recovery

Key Transfer is also called Key Sharing. Transfers allow keys and associated data units to be securely exchanged between partners or independent clusters, and they are required if you want to exchange encrypted media.


A DR site may also be configured as a Key Transfer Partner.

This process requires each party in the transfer to establish a public/private key pair. Once the initial configuration is complete, the sending party uses Export Keys to generate a transfer file, and the receiving party then uses Import Keys to receive the keys and associated data.

As a practice, using Key Transfer Partners for Disaster Recovery is not recommended. However, when DR sites create keys during the backup process, a key transfer can incrementally add the DR site's keys to the existing database.

The Key Transfer process requires each user to configure a Transfer Partner for each OKM Cluster: one partner exports keys from their cluster and the other partner imports keys into their cluster. When configuring Key Transfer Partners, administrators must perform tasks in a specific order that requires the security officer, compliance officer, and operator roles.

To configure Key Transfer Partners, see "Transfer Keys Between Clusters".

Figure A-5 Transfer Key Partners

Footnote Legend

Footnote 1: Multiple Servers: Exceptions to this standard configuration must be made with the approval of OKM Engineering and Global Support Services.