Recovering from Data Corruption

Oracle NoSQL Database can automatically detect data corruption in the database store. When corruption is detected, Oracle NoSQL Database automatically shuts down the associated Admin or Replication Nodes. Manual administrative action is then required before the nodes can be brought back online.

Detecting Data Corruption

Oracle NoSQL Database Admin or Replication Node processes exit when they detect data corruption. Detection is performed by a background task that looks for corruption caused by a disk failure or a similar physical media or I/O subsystem problem. Typically, the corruption is detected as a checksum error in a log entry in one of the data (*.jdb) files contained in an Admin or Replication Node database environment. A data corruption error generates output in the debug log similar to this:

2016-10-25 16:59:52.265 UTC SEVERE [rg1-rn1] Process exiting
com.sleepycat.je.EnvironmentFailureException: (JE 7.3.2)
rg1-rn1(-1):kvroot/mystore/sn1/rg1-rn1/env 
com.sleepycat.je.log.ChecksumException:
Invalid log entry type: 102 lsn=0x0/0x0 bufPosition=5 
bufRemaining=4091 LOG_CHECKSUM:
Checksum invalid on read, log is likely invalid. Environment is 
invalid and must be closed
...
2016-10-25 16:59:52.270 UTC SEVERE [rg1-rn1] Exception creating 
service rg1-rn1:
(JE 7.3.2) rg1-rn1(-1):kvroot/mystore/sn1/rg1-rn1/env 
com.sleepycat.je.log.ChecksumException:
Invalid log entry type: 102 lsn=0x0/0x0 bufPosition=5 
bufRemaining=4091 LOG_CHECKSUM:
Checksum invalid on read, log is likely invalid. Environment is 
invalid and must be closed. (12.1.4.3.0): oracle.kv.FaultException: 
(JE 7.3.2) rg1-rn1(-1):kvroot/mystore/sn1/rg1-rn1/env 
com.sleepycat.je.log.ChecksumException: Invalid log entry type: 102 
lsn=0x0/0x0 bufPosition=5 bufRemaining=4091 LOG_CHECKSUM: Checksum 
invalid on read, log is likely invalid. Environment is invalid and 
must be closed. (12.1.4.3.0)
Fault class name: com.sleepycat.je.EnvironmentFailureException
...  
2016-10-25 16:59:52.272 UTC INFO [rg1-rn1] Service status changed 
from STARTING to ERROR_NO_RESTART 

The EnvironmentFailureException will cause the process to exit. Because the exception was caused by log corruption, the service status is set to ERROR_NO_RESTART, which means that the service will not restart automatically.
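
To confirm which service hit the error, you can search the store's log files for the checksum failure. The following commands are only an illustration; they assume the default log directory under <KVROOT>/<STORE_NAME>/log and a Replication Node named rg1-rn1:

# grep -l "LOG_CHECKSUM" <KVROOT>/mystore/log/*.log
# grep "ERROR_NO_RESTART" <KVROOT>/mystore/log/rg1-rn1*.log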

Data Corruption Recovery Procedure

If an Admin or Replication Node has been stopped due to data corruption, manual administrative intervention is required to restart the node:

  1. Optional: Archive the corrupted environment data files.

    If you want to send the corrupted environment to Oracle support for help in identifying the root cause of the failure, archive the corrupted environment data files. These are usually located at:

    <KVROOT>/<STORE_NAME>/<SNx>/<Adminx>

    or

    <KVROOT>/<STORE_NAME>/<SNx>/<rgx-rnx>

    However, if you used the plan change-storagedir CLI command to change the storage directory for your Replication Node, then you will find the environment in the location that you specified to that command.

    You can use the show topology CLI command to display your store's topology. As part of this information, the storage directory for each of your Replication Nodes is identified.
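
    For example, assuming the default directory layout used elsewhere in this section, you could archive the corrupted environment of Replication Node rg1-rn1 on Storage Node sn1 with a command along the following lines (the archive path and name are only illustrative):

    # tar czf /tmp/rg1-rn1-corrupt-env.tar.gz <KVROOT>/mystore/sn1/rg1-rn1/env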

  2. Confirm that a non-corrupted version of the data is available.

    Before removing the files associated with the corrupted environment, confirm that another copy of the data is available, either on another node or via a previously saved snapshot. For a Replication Node, you must be using a Replication Factor greater than 1 and also have a properly operating Replication Node in the store in order for the data to reside elsewhere in the store. If you are using RF=1, then you must have a previously saved snapshot in order to continue.

    If the problem is with an Admin Node, there must be another Admin available in the store that is operating properly.

    Use the ping or verify configuration commands to confirm that the available nodes are running properly and are healthy.
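
    For example, you can connect to the Admin CLI and run both commands; the KVHOME path, host name, and registry port shown here are placeholders for your own deployment:

    # java -Xmx64m -Xms64m -jar KVHOME/lib/kvstore.jar runadmin \
          -host <host> -port <registry port>
    kv-> ping
    kv-> verify configuration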

  3. Remove all the data files that reside in the corrupted environment.

    Once the data files associated with a corrupted environment have been saved elsewhere, and you have confirmed that another copy of the data is available, delete all the data files in the environment directory. Make sure you only delete the files associated with the Admin or Replication Node that has failed due to a corrupted environment error.

    # ls <KVROOT>/mystore/sn1/rg1-rn1/env
    00000000.jdb  00000001.jdb  00000002.jdb  je.config.csv  
    je.info.0 je.lck  je.stat.csv
    
    # rm <KVROOT>/mystore/sn1/rg1-rn1/env/*.jdb 
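
    If the corrupted environment belongs to an Admin rather than a Replication Node, remove the data files from that Admin's environment directory instead. For example, for an Admin named admin1 hosted on sn1 (the names are only illustrative):

    # rm <KVROOT>/mystore/sn1/admin1/env/*.jdb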
  4. Perform recovery using either Network Restore or from a backup. Be aware that recovery from a backup will not work to recover an Admin Node.

    • Recovery using Network Restore

      Network restore can be used to recover from data corruption if the corrupted node belongs to a replication group that has other Replication Nodes available. Network restore is an automatic recovery task. After removing all of the database files in the corrupted environment, you only need to connect to the CLI and restart the corrupted node.

      For a Replication Node:

      kv-> plan start-service -service rg1-rn1

      For an Admin (for example, admin1):

      kv-> plan start-service -service admin1
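
      After the restart, you can watch the plan and confirm that the node returns to a healthy state from the same CLI session, for example:

      kv-> show plans
      kv-> ping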
    • Recovery from a backup (RNs only)

      If the store does not have another member in the Replication Node's shard or if all of the nodes in the shard have failed due to data corruption, you will need to restore the node's environment from a previously created snapshot. See Recovering the Store for details.

      Note that to recover an Admin that has failed due to data corruption, you must have a working Admin somewhere in the store. Snapshots do not capture Admin data.
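
      Recovery from a backup presumes that a snapshot was created while the store was healthy. For reference, a snapshot is taken from the Admin CLI; the snapshot name below is only illustrative:

      kv-> snapshot create -name thursday-backup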