Replacing a Failed Disk

You can replace a disk that is in the process of failing or has already failed. Replacing a failed disk promptly keeps the store running and preserves data availability. This section describes the steps required to replace a failed disk.

The following example deploys a KVStore to a set of three machines, each with three disks. Use the -storagedir flag of the makebootconfig command to specify the storage location of each disk.

> java -Xmx64m -Xms64m \
-jar KVHOME/lib/kvstore.jar makebootconfig \
    -root /opt/ondb/var/kvroot \
    -port 5000  \
    -host node09 \
    -harange 5010,5020 \
    -num_cpus 0  \
    -memory_mb 0 \
    -capacity 3  \
    -admindir /disk1/ondb/admin -admindirsize 1_gb \
    -storagedir /disk1/ondb/data \
    -storagedir /disk2/ondb/data \
    -storagedir /disk3/ondb/data \
    -rnlogdir /disk1/ondb/rnlog01    
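
After the boot configuration has been created on each machine, the Storage Node Agent (SNA) is started from it. The following is a minimal sketch; it assumes the root directory shown above and that KVHOME points to the installation directory:

> nohup java -Xmx64m -Xms64m \
-jar KVHOME/lib/kvstore.jar start \
    -root /opt/ondb/var/kvroot &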

With a boot configuration such as the one shown in the previous example, the directory structure created and populated on each machine is as follows:

 - Machine 1 (SN1) -        - Machine 2 (SN2) -        - Machine 3 (SN3) -
/opt/ondb/var/kvroot       /opt/ondb/var/kvroot       /opt/ondb/var/kvroot
  /security                  /security                  /security
  /store-name                /store-name                /store-name
    /sn1                       /sn2                       /sn3
      config.xml                 config.xml                 config.xml

/disk1/ondb/admin          /disk1/ondb/admin          /disk1/ondb/admin
  /admin1                    /admin2                    /admin3
    /env                       /env                       /env

/disk1/ondb/data           /disk1/ondb/data           /disk1/ondb/data
  /rg1-rn1                   /rg1-rn2                   /rg1-rn3
    /env                       /env                       /env

/disk2/ondb/data           /disk2/ondb/data           /disk2/ondb/data
  /rg2-rn1                   /rg2-rn2                   /rg2-rn3
    /env                       /env                       /env

/disk3/ondb/data           /disk3/ondb/data           /disk3/ondb/data
  /rg3-rn1                   /rg3-rn2                   /rg3-rn3
    /env                       /env                       /env

/disk1/ondb/rnlog01        /disk1/ondb/rnlog01        /disk1/ondb/rnlog01
  /log                       /log                       /log

In this case, configuration information and administrative data are stored in a location that is separate from all of the replication data. The replication data itself is stored by each distinct Replication Node service on separate physical media as well. Storing data in this way provides failure isolation and typically makes disk replacement less complicated and time-consuming. For information on how to deploy a store, see Configuring a single region data store.
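
Deploying the store itself is beyond the scope of this section, but for orientation, an abbreviated Admin CLI sequence for a store like the one above might resemble the following sketch. The store, zone, pool, and topology names, and the partition count, are placeholders, and only the first Storage Node is shown:

kv-> configure -name mystore
kv-> plan deploy-zone -name zone1 -rf 3 -wait
kv-> plan deploy-sn -zn zn1 -host node09 -port 5000 -wait
kv-> plan deploy-admin -sn sn1 -wait
kv-> pool create -name AllSNs
kv-> pool join -name AllSNs -sn sn1
     ... deploy and join the remaining Storage Nodes ...
kv-> topology create -name topo -pool AllSNs -partitions 300
kv-> plan deploy-topology -name topo -wait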

To replace a failed disk:

  1. Determine which disk has failed. To do this, you can use standard system monitoring and management mechanisms. In the previous example, suppose disk2 on Storage Node 3 fails and needs to be replaced.
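
    How a failed disk is detected is platform specific. As an illustration only, on a Linux host you might check the kernel log for I/O errors and query the disk's SMART health status; the device name /dev/sdc here is just an assumption:

    dmesg | grep -i 'i/o error'
    sudo smartctl -H /dev/sdc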

  2. Given the directory structure, determine which Replication Node service to stop. With the structure described above, the store writes replicated data to disk2 on Storage Node 3 through the rg2-rn3 service, so rg2-rn3 must be stopped before replacing the failed disk.
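
    If you want to confirm which Replication Node resides on the affected Storage Node and storage directory, the Admin CLI show topology command can help; the -verbose flag, where your release supports it, adds detail such as storage directories:

    kv-> show topology -verbose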

  3. Use the plan stop-service command to stop the affected service (rg2-rn3). This prevents the system from attempting to communicate with the service, which reduces the error output related to a failure you are already aware of.

    kv-> plan stop-service -service rg2-rn3
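
    You can optionally confirm that the stop-service plan completed successfully before proceeding, for example by listing recent plans and their status:

    kv-> show plans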
  4. Remove the failed disk (disk2) using whatever procedure is dictated by the operating system, disk manufacturer, and/or hardware platform.

  5. Install a new disk using any appropriate procedures.

  6. Format the new disk so that it provides the same storage directory as before; in this case, /disk2/ondb/data.
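
    The exact commands are platform specific. As an illustration only, on a Linux host using ext4, the new device (assumed here to be /dev/sdc) might be prepared as follows; the mount point and ownership depend on your environment:

    sudo mkfs.ext4 /dev/sdc                   # create a filesystem on the new disk
    sudo mount /dev/sdc /disk2                # mount it at the original location
    sudo mkdir -p /disk2/ondb/data            # recreate the storage directory
    sudo chown -R oracle:oracle /disk2/ondb   # placeholder owner: the user that runs the SNA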

  7. With the new disk in place, use the plan start-service command to start the rg2-rn3 service.

    kv-> plan start-service -service rg2-rn3

    Note:

    Depending on the amount of data stored on the disk before it failed, recovering that data can take a considerable amount of time. The system may also experience additional network traffic and load while the new disk is repopulated, which can further extend the time to completion.
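
    Once the start-service plan completes, you can confirm that the Replication Node has rejoined the store and that the store is healthy; for example:

    kv-> ping
    kv-> verify configuration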