BDD supports the following Hadoop distributions:
You must have one of these installed before installing BDD. Note that you can't connect BDD to more than one Hadoop cluster.
BDD doesn't require all of the components each distribution provides, and the components it does require don't need to be installed on all BDD nodes. The following table lists the required Hadoop components and the node(s) they must be installed on. If you're installing on a single machine, it must be running all required components.
Component | Description |
---|---|
Cluster manager | Your cluster manager depends on your Hadoop distribution. The installer uses a RESTful API to query your Hadoop cluster manager for information about your Hadoop nodes, such as their hostnames and port numbers. Post-install, the bdd-admin script queries it for similar information when performing administrative tasks. Your cluster manager must be installed on at least one node in your Hadoop cluster, although it doesn't have to be on any node that will host BDD. |
ZooKeeper | BDD uses ZooKeeper to manage the Dgraph instances and ensure high availability of Dgraph query processing. ZooKeeper must be installed on at least one node in your Hadoop cluster, although to ensure high availability, it should be on three or more. These don't have to be BDD nodes, although each Managed Server must be able to connect to at least one of them. |
HDFS/MapR-FS | The tables that contain your source data are stored in HDFS, which must be installed on all nodes that will run Data Processing. Additionally, if you choose to store your Dgraph databases on HDFS, the HDFS DataNode service must be installed on all Dgraph nodes. Note: MapR uses the MapR File System (MapR-FS) instead of standard HDFS. For simplicity, this document typically refers only to HDFS; any requirements specific to MapR-FS are called out explicitly. |
YARN | The YARN NodeManager service runs all Data Processing jobs. YARN must be installed on all nodes that will run Data Processing. |
Spark on YARN | BDD uses Spark on YARN to run all Data Processing jobs. Spark on YARN must be installed on all nodes that will run Data Processing. Note that BDD requires Spark 1.6+; verify the version you have and upgrade it, if necessary. |
Hive | All of your data is stored in Hive tables within HDFS. When BDD discovers a new or modified Hive table, it launches a Data Processing workflow for that table. |
HCatalog | The Hive Table Detector monitors HCatalog for new and deleted tables that require processing. HCatalog must be installed on at least one node in your Hadoop cluster, although it doesn't have to be one that will host BDD. |
Hue | You can use Hue to load your source data into Hive and to view data exported from Studio. Hue must be installed on at least one node in your Hadoop cluster, although it doesn't have to be one that will host BDD. Note: HDP doesn't include Hue. If you have HDP, you must install Hue separately and set the HUE_URI property in BDD's configuration file. You can also use the bdd-admin script to update this property after installation, if necessary. For more information, see the Administrator's Guide. |
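The per-node requirements above can be spot-checked from a shell before you run the installer. The following is a minimal sketch, not part of BDD itself; the `zk-host.example.com` hostname is a placeholder, and the commands assume the Hadoop client binaries are on the node's PATH.

```shell
#!/bin/sh
# Spot-check which Hadoop client commands are present on this node.
for cmd in hdfs yarn spark-submit hive; do
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: found"
    else
        echo "$cmd: NOT found"
    fi
done

# Further manual checks, run where the services are expected:
#   spark-submit --version                    # BDD requires Spark 1.6+
#   echo ruok | nc zk-host.example.com 2181   # a healthy ZooKeeper replies "imok"
```

On nodes that will run Data Processing, hdfs, yarn, and spark-submit should all be found; Hue, HCatalog, and the cluster manager only need to be reachable from the cluster, not installed locally.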
You must also make a few changes within your Hadoop cluster to ensure that BDD can communicate with your Hadoop nodes. These changes are described below.