CDH 5.3.0 must be installed on your system before you install BDD. The table below describes the specific CDH components BDD requires.
If you are installing on a cluster, the required CDH components do not need to be installed on every node that will host BDD components; some BDD components require only specific CDH components, and others do not require CDH at all. The CDH components required on each type of BDD node are listed in CDH requirements.
If you are installing on a single machine, that machine must have all required CDH components installed.
Component | Description
---|---
Cloudera Manager | A web-based user interface that provides administrative capabilities for the CDH cluster. You can use it to perform operations such as monitoring the health of the entire cluster and starting and stopping individual components. When the BDD installer runs, it uses a RESTful API to query Cloudera Manager for information about specific CDH nodes, such as their hostnames and port numbers. **Note:** Cloudera Manager must be running during installation but is not required once the installation is finished. You can continue using it afterwards, as it provides a number of useful administrative features; however, if you are working in a resource-constrained environment, you can shut it down without affecting BDD's performance.
ZooKeeper | An open source distributed resource coordination package. BDD uses ZooKeeper services to manage the Dgraph instances and ensure high availability of Dgraph query processing.
HDFS | Hadoop's highly fault-tolerant distributed file system. The Hive tables that contain your source data are stored in HDFS.
HCatalog | A metadata abstraction layer that allows you to reference data without using filenames or formats. It insulates users and client programs that need to query data from the underlying data storage. When you create a table in Hive, a corresponding table is automatically created in HCatalog. Data Processing's Hive Table Detector monitors HCatalog for new and deleted tables that require processing.
Hive | An open source data warehouse that allows you to query and analyze large amounts of data stored in HDFS. It obtains metadata from HCatalog, enabling you to query your data without knowing its schema or location. All of your source data is stored as Hive tables within HDFS. When BDD discovers a new or modified Hive table, it launches a Data Processing workflow for that table.
Oozie | An open source system for scheduling and managing jobs in Hadoop. BDD relies on Oozie to manage Data Processing workflows.
Spark (Standalone) | An open source parallel data processing framework that complements Hadoop, making it easy to develop fast, unified big data applications combining batch, streaming, and interactive analytics on all of your data. Spark workers run all Data Processing jobs. **Note:** Big Data Discovery requires the Spark (Standalone) service; it does not support Spark on YARN.
Hue | Hadoop User Experience. An open source user interface for a number of Hadoop components.
YARN | An open source data processing framework that provides resource management for distributed applications.
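The installer's use of the Cloudera Manager REST API can be illustrated with a short sketch. The hostname, port (7180 is Cloudera Manager's default), and API version below are placeholder assumptions, not values from this document; the sample response is a trimmed-down illustration of the shape of the data a `/hosts`-style endpoint returns, parsed offline so the sketch runs without a live cluster.

```python
import json

# Hypothetical connection details -- substitute your own Cloudera Manager
# host, port (7180 is the default), and API version.
CM_HOST = "cm.example.com"
CM_PORT = 7180
API_VERSION = "v10"

def hosts_url(host: str, port: int, version: str) -> str:
    """Build the Cloudera Manager REST endpoint that lists cluster hosts."""
    return f"http://{host}:{port}/api/{version}/hosts"

def extract_host_info(payload: str) -> list:
    """Pull (hostname, ipAddress) pairs out of a hosts JSON response."""
    return [(item["hostname"], item["ipAddress"])
            for item in json.loads(payload)["items"]]

# A trimmed-down sample of the kind of JSON a hosts endpoint returns.
sample_response = """
{"items": [
  {"hostId": "h1", "hostname": "node1.example.com", "ipAddress": "10.0.0.1"},
  {"hostId": "h2", "hostname": "node2.example.com", "ipAddress": "10.0.0.2"}
]}
"""

print(hosts_url(CM_HOST, CM_PORT, API_VERSION))
# Against a live cluster you would issue an authenticated HTTP GET to this
# URL (e.g. with urllib.request and HTTP basic auth) and feed the response
# body to extract_host_info().
print(extract_host_info(sample_response))
```

This mirrors what the installer does conceptually: it asks Cloudera Manager which hosts make up the CDH cluster and reads back their addresses, which is why Cloudera Manager must be running during installation.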