Hadoop requirements

You must install one of the following Hadoop distributions on your cluster before you install BDD:

- Cloudera Distribution for Hadoop (CDH)
- Hortonworks Data Platform (HDP)

Note: You can switch to a different version of your Hadoop distribution after you install, if necessary. See the Administrator's Guide for more information.
BDD doesn't require all of the components each distribution provides, and the components it does require don't need to be installed on all nodes. The following list describes the required Hadoop components and the nodes they must be installed on.
Note: If you are installing on a single machine, that machine must have all required Hadoop components installed.
Cloudera Manager (CDH)/Ambari (HDP): The BDD installer uses a RESTful API to query Cloudera Manager (if you're using CDH) or Ambari (if you're using HDP) for information about specific Hadoop nodes, such as their hostnames and port numbers.

Cloudera Manager/Ambari must be installed on at least one node in your cluster, although it doesn't have to be on any that will host BDD.
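
As a quick sanity check that the installer will be able to reach the manager's REST API, you can query it manually from the install machine. This is a minimal sketch: the hostnames and admin credentials are placeholders, and 7180 and 8080 are only the default Cloudera Manager and Ambari ports.

    # Cloudera Manager: report the highest API version the server supports
    curl -u admin:admin "http://cm-host.example.com:7180/api/version"

    # Cloudera Manager: list the hosts the installer will discover
    # (substitute the API version returned by the call above)
    curl -u admin:admin "http://cm-host.example.com:7180/api/v18/hosts"

    # Ambari: list the hosts registered with this Ambari server
    curl -u admin:admin "http://ambari-host.example.com:8080/api/v1/hosts"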

ZooKeeper: BDD uses ZooKeeper services to manage the Dgraph instances and ensure high availability of Dgraph query processing. ZooKeeper must be installed on at least one node in your cluster, although it doesn't have to be on any that will host BDD. For more information on ZooKeeper and how it affects the cluster deployment's high availability, see the Administrator's Guide.

All Managed Servers must be able to connect to a node running ZooKeeper.
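
You can verify this connectivity from each Managed Server with ZooKeeper's built-in ruok command. A minimal sketch, assuming a ZooKeeper host named zk-host.example.com on the default client port 2181 (on ZooKeeper 3.5 and later, ruok must be allowed via the 4lw.commands.whitelist setting):

    # Expect the reply "imok" if ZooKeeper is up and reachable
    echo ruok | nc zk-host.example.com 2181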

HDFS: BDD stores the Hive tables that contain your source data in HDFS. HDFS must be installed on all nodes that will run Data Processing.
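
To confirm that HDFS is reachable from a Data Processing node, you can list the directory that backs the Hive warehouse. A sketch assuming the default warehouse location; your hive.metastore.warehouse.dir setting may point elsewhere:

    # List the HDFS directory where Hive stores its tables
    hdfs dfs -ls /user/hive/warehouse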
HCatalog: The Data Processing Hive Table Detector monitors HCatalog for new and deleted tables that require processing. HCatalog must be installed on at least one node in your cluster, although it doesn't have to be one that will host BDD.
Hive: All of your data is stored as Hive tables in HDFS. When BDD discovers a new or modified Hive table, it launches a Data Processing workflow for that table.
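
To verify that Hive is up and that your source tables are visible, you can connect with Beeline. A sketch assuming HiveServer2 on its default port 10000; the hostname is a placeholder:

    # List the Hive tables that Data Processing workflows would see
    beeline -u jdbc:hive2://hive-host.example.com:10000 -e "SHOW TABLES;"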
Spark on YARN: BDD uses Spark on YARN to run all Data Processing jobs. Spark on YARN must be installed on all nodes that will run Data Processing.
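
A common way to confirm that Spark on YARN works end to end is to submit the SparkPi example that ships with Spark. A sketch only; the path to the examples JAR varies by distribution and release:

    # Run the bundled SparkPi example on YARN in cluster mode
    spark-submit --master yarn --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      /path/to/spark-examples.jar 10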
Hue: You can use Hue to load your source data into Hive and to view data exported from Studio.

Note: HDP doesn't include Hue. If you have an HDP cluster, you must install Hue separately and set the HUE_URI property in BDD's configuration file. You can also use the bdd-admin script to update this property after installation, if necessary. For more information, see the Administrator's Guide.
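
For illustration, the property is a single line in BDD's configuration file. A sketch only: the hostname is a placeholder, 8888 is just Hue's default port, and the exact expected format is documented in your release's configuration reference:

    # Points BDD at the Hue instance (example value)
    HUE_URI=hue-host.example.com:8888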
YARN: YARN worker nodes run all Data Processing jobs. YARN must be installed on all nodes that will run Data Processing.
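
To check which nodes YARN can schedule Data Processing jobs on, you can list the cluster's NodeManagers:

    # List all YARN NodeManagers and their states
    yarn node -list -all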
Note: Data Processing will automatically be installed on nodes running all of the following Hadoop components: HDFS, Spark on YARN, and YARN.

Additionally, if you will be co-locating the Dgraph and Hadoop, you must enable cgroups on each co-located node and limit the Dgraph's memory consumption.
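
For illustration, the following sketch caps a process with a cgroups v1 memory controller using the libcgroup tools. The group name, the 32 GB limit, and the launch command are placeholders; consult the Administrator's Guide for the supported way to run the Dgraph under a cgroup:

    # Create a memory cgroup for the Dgraph (requires libcgroup-tools)
    sudo cgcreate -g memory:/dgraph
    # Cap the group at 32 GB (value is in bytes)
    sudo cgset -r memory.limit_in_bytes=34359738368 memory:/dgraph
    # Start the Dgraph inside the group (placeholder command)
    sudo cgexec -g memory:/dgraph <dgraph-start-command>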

You must also make a few changes within your Hadoop cluster to ensure that BDD can communicate with your Hadoop nodes. These changes are described below.