Moving the Dgraph databases to HDFS

If your Dgraph databases are currently stored on NFS, you can move them to HDFS.

Note: This procedure is supported for MapR, which uses MapR-FS instead of HDFS. Although this document only refers to HDFS for simplicity, all information also applies to MapR-FS unless specified otherwise.
Because HDFS is a distributed file system, storing your databases there improves the Dgraph's high availability. It also increases the amount of data your databases can contain.

When its databases are stored on HDFS, the Dgraph has to run on HDFS DataNodes. If it isn't currently installed on DataNodes, you must move its binaries over when you move its databases.

Important: The DataNode service should be the only Hadoop service running on the Dgraph nodes. In particular, you shouldn't co-locate the Dgraph with Spark, as both require a lot of resources. If you have to host the Dgraph on nodes running Spark or other Hadoop services, use cgroups to ensure it has access to sufficient resources. For more information, see Setting up cgroups for the Dgraph.

To move your Dgraph databases to HDFS:

  1. On the Admin Server, go to $BDD_HOME/BDD_manager/bin and stop BDD:
    ./bdd-admin.sh stop [-t <minutes>]
  2. Copy your Dgraph databases from their current location to the new one on HDFS.
    The bdd user must have read and write access to the new location.
    If you have MapR, the new location must be mounted as a volume, and the bdd user must have permission to create and delete snapshots from it.
    If you have HDFS data at rest encryption enabled, the new location must be an encryption zone.
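The copy itself can be sketched with standard HDFS shell commands. The source and destination paths below are placeholders for your environment, and the script echoes each command as a dry run so you can review it first:

```shell
# Hypothetical paths -- substitute your actual NFS source and HDFS destination.
SRC=/localdisk/dgraph_databases
DEST=/user/bdd/dgraph_databases

# Dry run: each command is echoed; remove the 'echo' prefixes to execute.
echo hdfs dfs -mkdir -p "$DEST"
echo hdfs dfs -put "$SRC"/* "$DEST"
# Give the bdd user read and write access to the new location:
echo hdfs dfs -chown -R bdd:bdd "$DEST"
echo hdfs dfs -chmod -R 770 "$DEST"
```

Review the echoed commands, then run them without the `echo` prefixes as a user with HDFS superuser privileges (or via `sudo -u hdfs`).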
  3. If the Dgraph isn't currently installed on HDFS DataNodes, select one or more DataNodes in your Hadoop cluster to move it to.
    If other BDD components are currently installed on the selected nodes, verify that the following directories are present on each, and copy over any that are missing.
    • $BDD_HOME/common/edp
    • $BDD_HOME/dataprocessing
    • $BDD_HOME/dgraph
    • $BDD_HOME/logs/edp
    If no BDD components are installed on the selected nodes:
    1. Create a new $BDD_HOME directory on each node. Its permissions must be 755 and its owner must be the bdd user.
    2. Copy the following directories from an existing Dgraph node to the new ones:
      • $BDD_HOME/BDD_manager
      • $BDD_HOME/common
      • $BDD_HOME/dataprocessing
      • $BDD_HOME/dgraph
      • $BDD_HOME/logs
      • $BDD_HOME/uninstall
      • $BDD_HOME/version.txt
    3. Create a symlink $ORACLE_HOME/BDD pointing to $BDD_HOME.
    4. Optionally, remove the $BDD_HOME/dgraph directory from the old Dgraph nodes, as it's no longer needed.
      Leave the other BDD directories, as they may still be useful.
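The directory copy in the sub-steps above can be sketched as follows. This assumes passwordless SSH as the bdd user; the node name and BDD_HOME path are placeholders, and the commands are echoed for review rather than executed:

```shell
# Hypothetical values -- substitute your new node's FQDN and your install root.
NEW_NODE=dgraph2.example.com
BDD_HOME=/opt/bdd

# Dry run: echo each copy command; remove 'echo' to execute.
for item in BDD_manager common dataprocessing dgraph logs uninstall version.txt; do
  echo scp -r "$BDD_HOME/$item" "bdd@$NEW_NODE:$BDD_HOME/"
done
# Create the symlink on the new node ($ORACLE_HOME must be set there):
echo ssh "bdd@$NEW_NODE" 'ln -s $BDD_HOME $ORACLE_HOME/BDD'
```

rsync with the `-a` flag is an alternative to scp if you want to preserve permissions and ownership exactly.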
  4. To enable the Dgraph to access its databases in HDFS, install the HDFS NFS Gateway service (called MapR NFS in MapR) on all Dgraph nodes.
    For instructions, refer to the documentation for your Hadoop distribution.
  5. If you have MapR, mount MapR-FS to the local mount point, $BDD_HOME/dgraph/hdfs_root.
    You can do this by adding an NFS mount point to /etc/fstab on each new Dgraph node. This ensures MapR-FS will be mounted automatically when your system starts. Note that you'll have to remove this manually if you uninstall BDD.
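For example, the /etc/fstab entry might look like the following sketch; the NFS server host, cluster name, and mount point are placeholders for your environment:

```
# <mapr-nfs-host>:/mapr/<cluster-name>   <local mount point>        type  options      dump pass
maprnfs1.example.com:/mapr/my.cluster.com  /opt/bdd/dgraph/hdfs_root  nfs  hard,nolock  0    0
```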
  6. If you have to host the Dgraph on the same node as Spark or any other Hadoop processes (in addition to the HDFS DataNode process), create cgroups to isolate the resources used by Hadoop and the Dgraph.
    For instructions, see Setting up cgroups for the Dgraph.
  7. For best performance, configure short-circuit reads in HDFS.
    This enables the Dgraph to access local files directly, rather than having to use the HDFS DataNode's network sockets to transfer the data. For instructions, refer to the documentation for your Hadoop distribution.
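As a sketch, short-circuit reads are typically enabled through two properties in hdfs-site.xml on the DataNodes and clients. The socket path below is a common default; check your distribution's documentation for its recommended value and any additional settings:

```xml
<!-- hdfs-site.xml: enable short-circuit local reads -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

The domain socket's parent directory must exist and be writable only by the DataNode user or root.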
  8. Clean up the ZooKeeper index.
  9. On the Admin Server, copy $BDD_HOME/BDD_manager/conf/bdd.conf to a new location. Open the copy in a text editor and update the following properties:
    • DGRAPH_INDEX_DIR: The absolute path to the new location of the Dgraph databases directory on HDFS. If you have MapR, this location must be mounted as a volume, and the bdd user must have permission to create and delete snapshots from it. If you have HDFS data at rest encryption enabled, this location must be an encryption zone.
    • DGRAPH_SERVERS: A comma-separated list of the FQDNs of the new Dgraph nodes. All must be HDFS DataNodes.
    • DGRAPH_THREADS: The number of threads the Dgraph starts with. This should be the number of CPU cores on the Dgraph nodes minus the number required to run HDFS and any other Hadoop services on those nodes.
    • DGRAPH_CACHE: The size of the Dgraph cache. This should be either 50% of the machine's RAM or the total amount of free memory, whichever is smaller.
    • DGRAPH_USE_MOUNT_HDFS: Determines whether the Dgraph mounts HDFS when it starts. Set this to TRUE.
    • DGRAPH_HDFS_MOUNT_DIR: The absolute path to the local directory where the Dgraph mounts the HDFS root directory. This location must exist, be empty, and have read, write, and execute permissions for the bdd user. It's recommended that you use the default location, $BDD_HOME/dgraph/hdfs_root, which was created by the installer and should meet these requirements.
    • KERBEROS_TICKET_REFRESH_INTERVAL: Only required if you have Kerberos enabled. The interval (in minutes) at which the Dgraph's Kerberos ticket is refreshed. For example, if set to 60, the ticket is refreshed every hour.
    • KERBEROS_TICKET_LIFETIME: Only required if you have Kerberos enabled. The amount of time the Dgraph's Kerberos ticket remains valid, given as a number followed by a supported unit of time: s, m, h, or d. For example, 10h (10 hours) or 10m (10 minutes).
    • DGRAPH_ENABLE_CGROUP: Only required if you set up cgroups for the Dgraph. Set this to TRUE if you created a Dgraph cgroup.
    • DGRAPH_CGROUP_NAME: Only required if you set up cgroups for the Dgraph. The name of the cgroup that controls the Dgraph.
    • NFS_GATEWAY_SERVERS: Only required if you're using the NFS Gateway. A comma-separated list of the FQDNs of the nodes running the NFS Gateway service. This should include all Dgraph nodes.
    • DGRAPH_USE_NFS_MOUNT: Only required if you're using the NFS Gateway. Set this to TRUE.
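Pulled together, the updated portion of your bdd.conf copy might look like the following sketch. Every value shown is a placeholder for your environment:

```
DGRAPH_INDEX_DIR=/user/bdd/dgraph_databases
DGRAPH_SERVERS=dgraph1.example.com,dgraph2.example.com
DGRAPH_USE_MOUNT_HDFS=TRUE
DGRAPH_HDFS_MOUNT_DIR=/opt/bdd/dgraph/hdfs_root
NFS_GATEWAY_SERVERS=dgraph1.example.com,dgraph2.example.com
DGRAPH_USE_NFS_MOUNT=TRUE
```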
  10. To propagate your configuration changes to the rest of the cluster, go to $BDD_HOME/BDD_manager/bin and run:
    ./bdd-admin.sh publish-config <path>
    Where <path> is the absolute path to the updated copy of bdd.conf.
  11. Start your cluster:
    ./bdd-admin.sh start