CDH requirements

CDH 5.3.0 must be installed on your system before you install BDD. The specific CDH components BDD requires are described below.

If you are installing on a cluster, the required CDH components do not need to be installed on every node that will host BDD components, because some BDD components require only specific CDH components and others do not require CDH at all. The CDH components required on each type of BDD node are specified in CDH requirements.

If you are installing on a single machine, that machine must have all required CDH components installed.

Cloudera Manager
A web-based user interface that provides administrative capabilities for the CDH cluster. You can use it to perform operations like monitoring the health of the entire cluster and starting and stopping individual components.

When the BDD installer runs, it uses a RESTful API to query Cloudera Manager for information about specific CDH nodes, such as their hostnames and port numbers.

Note: Cloudera Manager must be running during installation but is not required once the installation is finished. You can continue using it afterwards, as it provides a number of useful administrative features; however, if you are working in a resource-constrained environment, you may shut it down without affecting BDD's performance.
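
For illustration, the following minimal Python sketch performs the same kind of query the installer makes against Cloudera Manager's REST API. The hostname, port, credentials, and API version shown here are assumptions; replace them with values for your own cluster.

    import requests

    # Assumed connection details -- replace with your Cloudera Manager
    # host, credentials, and an API version your CM release supports.
    CM_HOST = "cm.example.com"
    CM_PORT = 7180
    BASE = f"http://{CM_HOST}:{CM_PORT}/api/v5"

    # GET /hosts returns every host Cloudera Manager knows about,
    # including hostnames and IP addresses -- the same kind of node
    # information the BDD installer retrieves.
    resp = requests.get(f"{BASE}/hosts", auth=("admin", "admin"))
    resp.raise_for_status()

    for host in resp.json()["items"]:
        print(host["hostname"], host["ipAddress"])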
ZooKeeper
An open source distributed coordination service. BDD uses ZooKeeper to manage the Dgraph instances and to ensure high availability of Dgraph query processing.
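
As a sketch of what ZooKeeper coordination looks like from a client's perspective, the following uses the third-party kazoo library; the ensemble address is an assumption, and the znodes you see will depend on what is deployed.

    from kazoo.client import KazooClient

    # Assumed ZooKeeper ensemble address; replace with your own quorum.
    zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
    zk.start()

    # List the top-level znodes. Services that coordinate through
    # ZooKeeper, such as the Dgraph instances BDD manages, keep their
    # state under paths like these.
    for znode in zk.get_children("/"):
        print(znode)

    zk.stop()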
HDFS
Hadoop's highly fault-tolerant distributed file system. The Hive tables that contain your source data are stored in HDFS.
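
For example, you can browse Hive's warehouse directory in HDFS with the third-party hdfs WebHDFS client; the NameNode address and warehouse path below are assumptions based on common CDH defaults.

    from hdfs import InsecureClient

    # Assumed WebHDFS endpoint (50070 is the CDH 5.x NameNode default).
    client = InsecureClient("http://namenode.example.com:50070", user="hdfs")

    # Hive's default warehouse directory; each subdirectory typically
    # backs one Hive table.
    for entry in client.list("/user/hive/warehouse"):
        print(entry)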
HCatalog
A metadata abstraction layer that lets you reference data without using filenames or formats. It insulates users and client programs from the details of how and where data is stored. When you create a table in Hive, a corresponding table is automatically created in HCatalog.

Data Processing's Hive Table Detector monitors HCatalog for new and deleted tables that require processing.
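
The Hive Table Detector itself is internal to BDD, but the general pattern of monitoring for new and deleted tables can be sketched with the third-party PyHive client; the connection details and polling interval below are assumptions.

    import time
    from pyhive import hive

    # Assumed HiveServer2 host and port; replace with your own.
    conn = hive.connect(host="hive.example.com", port=10000, database="default")
    cursor = conn.cursor()

    known = set()
    while True:
        cursor.execute("SHOW TABLES")
        current = {row[0] for row in cursor.fetchall()}
        for table in sorted(current - known):
            print(f"new table detected: {table}")   # would trigger processing
        for table in sorted(known - current):
            print(f"table deleted: {table}")        # would trigger cleanup
        known = current
        time.sleep(60)  # assumed polling interval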

Hive
An open source data warehouse that lets you query and analyze large amounts of data stored in HDFS. It obtains metadata from HCatalog, so you can query your data without specifying its schema or location.

All of your source data is stored as Hive tables within HDFS. When BDD discovers a new or modified Hive table, it launches a Data Processing workflow for that table.
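
As an illustration of how a new source table might enter the system, the following PyHive sketch creates a Hive table over data already in HDFS; the table name, columns, and location are assumptions.

    from pyhive import hive

    # Assumed HiveServer2 connection details.
    conn = hive.connect(host="hive.example.com", port=10000, database="default")
    cursor = conn.cursor()

    # Creating the table in Hive also registers it in HCatalog, which is
    # how it becomes discoverable as a new source table.
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (
            order_id INT,
            amount   DOUBLE,
            region   STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/sales'
    """)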

Oozie
An open source system for scheduling and managing jobs in Hadoop. BDD relies on Oozie to manage its Data Processing workflows.
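
For example, you can check on workflow jobs through Oozie's web services API; the server URL below is an assumption (11000 is Oozie's default port), and this is a sketch of the generic jobs listing call rather than anything BDD-specific.

    import requests

    # Assumed Oozie server URL.
    OOZIE = "http://oozie.example.com:11000/oozie"

    # List the ten most recent workflow jobs and their statuses.
    resp = requests.get(f"{OOZIE}/v1/jobs", params={"jobtype": "wf", "len": 10})
    resp.raise_for_status()

    for job in resp.json().get("workflows", []):
        print(job["id"], job["appName"], job["status"])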
Spark (Standalone)
An open source parallel data processing framework that complements Hadoop, making it easy to develop fast, unified big data applications that combine batch, streaming, and interactive analytics. Spark workers run all Data Processing jobs.
Note: Big Data Discovery requires the Spark (Standalone) service. It does not support Spark on YARN.
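
The distinction matters when a job is submitted: a standalone cluster is addressed with a spark:// master URL rather than a "yarn" master URL. A minimal PySpark sketch, with an assumed master host:

    from pyspark import SparkConf, SparkContext

    # A standalone master URL uses the spark:// scheme (host assumed
    # here); Spark on YARN would use a "yarn" master URL instead, which
    # BDD does not support.
    conf = SparkConf().setAppName("example-job") \
                      .setMaster("spark://spark-master.example.com:7077")
    sc = SparkContext(conf=conf)

    # Trivial parallel computation to confirm the workers are reachable.
    print(sc.parallelize(range(1000)).sum())
    sc.stop()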
Hue
Hadoop User Experience, an open source user interface for a number of Hadoop components.
YARN
An open source framework that provides resource management for distributed applications running in Hadoop.