Hadoop requirements

BDD supports the following Hadoop distributions:

  • Cloudera Distribution for Hadoop (CDH)
  • Hortonworks Data Platform (HDP)
  • MapR

You must have one of these installed before installing BDD. Note that you can't connect BDD to more than one Hadoop cluster.

Note: You can switch to a different version of your Hadoop distribution after installing BDD, if necessary. See the Administrator's Guide for more information.

BDD doesn't require all of the components each distribution provides, and the components it does require don't need to be installed on all BDD nodes. The following sections describe the required Hadoop components and the node(s) each must be installed on. If you're installing on a single machine, that machine must run all required components.

Cluster manager

Your cluster manager depends on your Hadoop distribution:
  • CDH: Cloudera Manager
  • HDP: Ambari
  • MapR: MapR Control System (MCS)

The installer uses a RESTful API to query your Hadoop cluster manager for information about your Hadoop nodes, such as their hostnames and port numbers. After installation, the bdd-admin script queries it for similar information when performing administrative tasks. (A sketch of such a query appears below.)

Your cluster manager must be installed on at least one node in your Hadoop cluster, although it doesn't have to be on any that will host BDD.
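
For illustration, the following sketch queries a Cloudera Manager instance for the hosts it manages, which is the same kind of information the installer requests. The hostname, port, and credentials are placeholders, and the exact endpoints depend on your Cloudera Manager API version; consult its API documentation for specifics.

    import requests

    # Placeholder connection details -- substitute your cluster manager's
    # hostname, port, and administrator credentials.
    CM_HOST = "cm-host.example.com"
    CM_PORT = 7180
    AUTH = ("admin", "admin-password")

    # Cloudera Manager's REST API is versioned; /api/version returns the
    # highest version the server supports (e.g. "v10").
    base = f"http://{CM_HOST}:{CM_PORT}/api"
    version = requests.get(f"{base}/version", auth=AUTH).text.strip()

    # List the managed hosts -- hostnames and related details, similar to
    # what the BDD installer gathers.
    hosts = requests.get(f"{base}/{version}/hosts", auth=AUTH).json()
    for host in hosts["items"]:
        print(host["hostname"])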

ZooKeeper

BDD uses ZooKeeper to manage the Dgraph instances and ensure high availability of Dgraph query processing. ZooKeeper must be installed on at least one node in your Hadoop cluster, although to ensure high availability it should run on three or more. These don't have to be BDD nodes, but each Managed Server must be able to connect to at least one of them. (A simple connectivity check is sketched below.)
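
For example, you can confirm connectivity from a Managed Server by sending ZooKeeper's standard ruok four-letter-word command over a plain socket; a healthy server replies imok. The hostname below is a placeholder.

    import socket

    def zookeeper_is_ok(host, port=2181, timeout=5.0):
        """Send ZooKeeper's 'ruok' command; a healthy server answers 'imok'."""
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(4) == b"imok"

    # Placeholder hostname -- run this check from each Managed Server,
    # against each ZooKeeper node it needs to reach.
    print(zookeeper_is_ok("zk-host.example.com"))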
HDFS/MapR-FS

The Hive tables that contain your source data are stored in HDFS, so HDFS must be installed on all nodes that will run Data Processing. Additionally, if you choose to store your Dgraph databases on HDFS, the HDFS DataNode service must be installed on all Dgraph nodes.

Note: MapR uses the MapR File System (MapR-FS) instead of standard HDFS. For simplicity, this document typically refers only to HDFS. Any requirements specific to MapR-FS are called out explicitly.
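
As a quick sanity check on a Data Processing node, you can verify that the HDFS client is present and the filesystem is reachable. A minimal sketch using the standard hdfs command-line tool:

    import subprocess

    # 'hdfs dfs -test -d /' exits 0 if the HDFS root directory is
    # reachable; a nonzero exit code points to a client or cluster problem.
    result = subprocess.run(["hdfs", "dfs", "-test", "-d", "/"])
    if result.returncode == 0:
        print("HDFS is reachable from this node")
    else:
        print("HDFS check failed; verify the Hadoop client configuration")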
YARN

The YARN NodeManager service runs all Data Processing jobs. YARN must be installed on all nodes that will run Data Processing.
Spark on YARN

BDD uses Spark on YARN to run all Data Processing jobs. Spark on YARN must be installed on all nodes that will run Data Processing.

Note that BDD requires Spark 1.6 or higher. Verify the version you have installed and upgrade it, if necessary; a simple version check is sketched below.
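
For illustration, this sketch runs spark-submit --version and compares the reported version against the 1.6 minimum. Note that spark-submit prints its version banner to stderr rather than stdout.

    import re
    import subprocess

    # spark-submit prints its version banner to stderr.
    proc = subprocess.run(["spark-submit", "--version"],
                          capture_output=True, text=True)
    match = re.search(r"version (\d+)\.(\d+)", proc.stderr)
    if match:
        major, minor = int(match.group(1)), int(match.group(2))
        if (major, minor) >= (1, 6):
            print(f"Spark {major}.{minor} meets the 1.6+ requirement")
        else:
            print(f"Spark {major}.{minor} is too old; upgrade to 1.6 or higher")
    else:
        print("Could not determine the Spark version from spark-submit output")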

Hive

All of your data is stored in Hive tables within HDFS. When BDD discovers a new or modified Hive table, it launches a Data Processing workflow for that table.
HCatalog

The Hive Table Detector monitors HCatalog for new and deleted tables that require processing. HCatalog must be installed on at least one node in your Hadoop cluster, although it doesn't have to be one that will host BDD.
Hue

You can use Hue to load your source data into Hive and to view data exported from Studio. Hue must be installed on at least one node in your Hadoop cluster, although it doesn't have to be one that will host BDD.

Note: HDP doesn't include Hue. If you have HDP, you must install Hue separately and set the HUE_URI property in BDD's configuration file. You can also use the bdd-admin script to update this property after installation, if necessary. For more information, see the Administrator's Guide.
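
Before setting HUE_URI, it can be useful to confirm that the Hue web UI is reachable from your BDD nodes. The hostname below is a placeholder; 8888 is Hue's default web UI port, but yours may differ.

    import requests

    # Placeholder values -- use the host and port you intend to put in HUE_URI.
    HUE_HOST = "hue-host.example.com"
    HUE_PORT = 8888  # Hue's default web UI port

    resp = requests.get(f"http://{HUE_HOST}:{HUE_PORT}/", timeout=10)
    print(f"Hue responded with HTTP {resp.status_code}")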
To reiterate, Data Processing will automatically be installed on nodes running all of the following:

  • HDFS (or MapR-FS)
  • YARN
  • Spark on YARN

You must also make a few changes within your Hadoop cluster to ensure that BDD can communicate with your Hadoop nodes. These changes are described below.