You must install the following Hadoop components on your cluster before you install BDD:
| Component | Description |
|---|---|
| Cluster manager | Your cluster manager depends on your Hadoop distribution. The installer uses a RESTful API to query your cluster manager for information about your Hadoop nodes, such as their hostnames and port numbers. Your cluster manager must be installed on at least one node in your cluster, although it doesn't have to be one that will host BDD. |
| ZooKeeper | BDD uses ZooKeeper services to manage the Dgraph instances and ensure high availability of Dgraph query processing. ZooKeeper must be installed on at least one node in your cluster, although it doesn't have to be one that will host BDD. All Managed Servers must be able to connect to a node running ZooKeeper. For more information on ZooKeeper and how it affects the cluster deployment's high availability, see the Administrator's Guide. |
| HDFS | The Hive tables that contain your source data are stored in HDFS. HDFS must be installed on at least one node in your cluster. You can also store your Dgraph databases on HDFS. If you choose to do this, the DataNode service must be installed on all nodes that will run the Dgraph. |
| HCatalog | The Data Processing Hive Table Detector monitors HCatalog for new and deleted tables that require processing. HCatalog must be installed on at least one node in your cluster, although it doesn't have to be one that will host BDD. |
| Hive | All of your data is stored as Hive tables on HDFS. When BDD discovers a new or modified Hive table, it launches a Data Processing workflow for that table. |
| Spark on YARN | BDD uses Spark on YARN to run all Data Processing jobs. Spark on YARN must be installed on all nodes that will run Data Processing. |
| Hue | You can use Hue to load your source data into Hive and to view data exported from Studio. Note: HDP doesn't include Hue. If you have an HDP cluster, you must install it separately and set the HUE_URI property in BDD's configuration file. You can also use the bdd-admin script to update this property after installation, if necessary. For more information, see the Administrator's Guide. |
| YARN | YARN worker nodes run all Data Processing jobs. YARN must be installed on all nodes that will run Data Processing. |
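As the Cluster manager row notes, the installer discovers your Hadoop nodes through the cluster manager's REST API. The sketch below shows what such a query could look like; it assumes Cloudera Manager's hosts endpoint (`/api/v19/hosts`) as an example, and `cm.example.com` is a hypothetical hostname, not something from your environment:

```shell
# Hedged sketch: an installer-style REST query for Hadoop node information.
# CM_HOST is a hypothetical cluster manager hostname; 7180 is Cloudera
# Manager's default HTTP port.
CM_HOST="cm.example.com"
CM_PORT="7180"

# Build the hosts-listing URL. Cloudera Manager exposes host metadata
# (hostnames, ports, assigned roles) under /api/<version>/hosts.
cm_hosts_url() {
    echo "http://${1}:${2}/api/v19/hosts"
}

# A real query would need credentials, so it is left commented out here:
# curl -s -u "$CM_USER:$CM_PASS" "$(cm_hosts_url "$CM_HOST" "$CM_PORT")"

cm_hosts_url "$CM_HOST" "$CM_PORT"
```

If your distribution uses Ambari instead, the same idea applies with Ambari's REST endpoints, though the URL shape differs.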
If you want to store your Dgraph databases on HDFS, the Dgraph and Dgraph HDFS Agent must be installed on Hadoop DataNodes. For more information, see Dgraph database requirements.
You must also make a few changes within your Hadoop cluster to ensure that BDD can communicate with your Hadoop nodes. These changes are described below.
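One quick way to verify part of the connectivity described above (for example, that each Managed Server can reach a node running ZooKeeper) is ZooKeeper's four-letter `ruok` health probe, to which a healthy server replies `imok`. This is a minimal sketch, not part of BDD itself; the hostname in the usage comment is a placeholder:

```shell
# Hedged sketch: check that this machine can reach a ZooKeeper node, as
# each BDD Managed Server must. 2181 is ZooKeeper's default client port.
zk_reachable() {
    # "ruok" is ZooKeeper's four-letter health command; a healthy server
    # answers "imok". nc exits non-zero if the connection fails.
    reply="$(printf 'ruok' | nc -w 2 "$1" "${2:-2181}")"
    [ "$reply" = "imok" ]
}

# Example usage (uncomment and substitute a real ZooKeeper hostname):
# zk_reachable zk01.example.com && echo "ZooKeeper reachable"
```

Run the check from every Managed Server; a failure usually points to a firewall rule or a wrong client port rather than a ZooKeeper problem.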