You must install the following Hadoop components on your cluster before you install BDD:
| Component | Description |
|---|---|
| Cloudera Manager (CDH)/Ambari (HDP) | The BDD installer uses a RESTful API to query Cloudera Manager (if you're using CDH) or Ambari (if you're using HDP) for information about specific Hadoop nodes, such as their hostnames and port numbers (a sample query is sketched after this table). Cloudera Manager/Ambari must be installed on at least one node in your cluster, although it doesn't have to be on any that will host BDD. |
| ZooKeeper | BDD uses ZooKeeper to manage the Dgraph instances and ensure high availability of Dgraph query processing. ZooKeeper must be installed on at least one node in your cluster, although it doesn't have to be on any that will host BDD. For more information on ZooKeeper and how it affects the cluster deployment's high availability, see the Administrator's Guide. All Managed Servers must be able to connect to a node running ZooKeeper (a quick reachability check is sketched after this table). |
| HDFS | BDD stores the Hive tables that contain your source data in HDFS. HDFS must be installed on all nodes that will run Data Processing. |
| HCatalog | The Data Processing Hive Table Detector monitors HCatalog for new and deleted tables that require processing. HCatalog must be installed on at least one node in your cluster, although it doesn't have to be one that will host BDD. |
| Hive | All of your data is stored as Hive tables on HDFS. When BDD discovers a new or modified Hive table, it launches a Data Processing workflow for that table (a sample table creation is sketched after this table). |
| Spark on YARN | BDD uses Spark on YARN to run all Data Processing jobs. Spark on YARN must be installed on all nodes that will run Data Processing. |
| Hue | You can use Hue to load your source data into Hive and to view data exported from Studio. Note: HDP doesn't include Hue. If you have an HDP cluster, you must install it separately and set the HUE_URI property in BDD's configuration file (a sketch of this property follows the table). You can also use the bdd-admin script to update this property after installation, if necessary. For more information, see the Administrator's Guide. |
| YARN | YARN worker nodes run all Data Processing jobs. YARN must be installed on all nodes that will run Data Processing. |
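
For reference, you can reproduce the kind of query the installer makes with curl. This is only a sketch: the hostnames, credentials, cluster name, and (for Cloudera Manager) API version segment are placeholders you'd replace with values from your own cluster.

```sh
# Ambari (HDP): list the hosts Ambari manages. "admin:admin",
# "ambari.example.com", and "MyCluster" are placeholders.
curl -u admin:admin -H 'X-Requested-By: ambari' \
  "http://ambari.example.com:8080/api/v1/clusters/MyCluster/hosts"

# Cloudera Manager (CDH): the equivalent host listing. The API version
# segment ("v19" here) depends on your Cloudera Manager release.
curl -u admin:admin "http://cm.example.com:7180/api/v19/hosts"
```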
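
To verify that a Managed Server can reach ZooKeeper, you can use ZooKeeper's standard four-letter-word health check from that server. This is a generic ZooKeeper check, not a BDD tool; the hostname is a placeholder, and note that recent ZooKeeper releases require four-letter-word commands to be whitelisted on the server side.

```sh
# Run from each Managed Server; "zk.example.com" is a placeholder.
# A healthy ZooKeeper answers "imok" on its client port (2181 by default).
echo ruok | nc zk.example.com 2181
```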
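
To illustrate what the Hive Table Detector reacts to: creating a table through Hive registers it in HCatalog, where the detector can find it. The database, table name, and schema below are purely illustrative, as is the warehouse path (Hive's default).

```sh
# Create a throwaway table the Hive Table Detector could pick up.
hive -e "CREATE TABLE default.sample_source (id INT, name STRING)"

# Its backing files land in HDFS, under Hive's default warehouse directory.
hdfs dfs -ls /user/hive/warehouse/sample_source
```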
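
On HDP, once you've installed Hue yourself, you point BDD at it through the HUE_URI property. The snippet below is an assumption about the property's format, using Hue's default port; consult your BDD version's configuration reference for the exact file and syntax.

```sh
# In BDD's configuration file (bdd.conf) -- hostname:port format assumed;
# 8888 is Hue's default port.
HUE_URI=hue.example.com:8888
```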
Additionally, if you will be co-locating the Dgraph with Hadoop, you must enable cgroups on each co-located node and limit the Dgraph's memory consumption.
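
As a minimal sketch of what that involves, the commands below create a cgroups v1 memory group and cap it. The group name and the 32 GB limit are arbitrary examples, the paths assume a cgroups v1 hierarchy, and your BDD configuration may handle this step for you.

```sh
# Create a memory cgroup for the Dgraph; the name "dgraph" is arbitrary.
sudo mkdir /sys/fs/cgroup/memory/dgraph

# Cap the group at 32 GB (example value -- size it for your node).
echo $((32 * 1024**3)) | sudo tee /sys/fs/cgroup/memory/dgraph/memory.limit_in_bytes

# Move the running Dgraph process into the group.
echo "$DGRAPH_PID" | sudo tee /sys/fs/cgroup/memory/dgraph/cgroup.procs
```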
You must also make a few changes within your Hadoop cluster to ensure that BDD can communicate with your Hadoop nodes. These changes are described below.