The Data Processing component of BDD runs a set of processes and jobs, collectively
called data processing workflows. Many of these processes run
natively in Hadoop.
The Data Processing component
controls several workflows in BDD. For example, you can create workflows for
data loading, data updates, and others. The Data Processing component discovers
new Hive tables and loads data into BDD. It runs data refresh operations and
incremental updates. It also keeps BDD data sets in sync with the Hive tables that
BDD creates.
For example, during a data loading workflow, the Data Processing
component performs these tasks:
- Discovery of data in Hive
tables
- Creation of data sets in BDD
- Running a select set of
enrichments on discovered data sets
- Profiling of the data sets
- Indexing of the data sets by streaming data to the Dgraph.
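The tasks above form a linear pipeline: each stage consumes the output of the previous one. The sketch below models that ordering only; the stage names and data shapes are illustrative placeholders, not BDD APIs.

```python
# Conceptual sketch of a BDD data loading workflow as a linear pipeline.
# Stage names and signatures are illustrative only; they are not BDD APIs.

def discover(hive_table):
    """Discover the data in a Hive table (placeholder)."""
    return {"table": hive_table}

def create_data_set(discovered):
    """Create a BDD data set from the discovered data (placeholder)."""
    return {"data_set": discovered["table"] + "_ds", **discovered}

def enrich(data_set):
    """Run a select set of enrichments on the data set (placeholder)."""
    data_set["enriched"] = True
    return data_set

def profile(data_set):
    """Profile the data set (placeholder)."""
    data_set["profiled"] = True
    return data_set

def index(data_set):
    """Stream the data set to the Dgraph for indexing (placeholder)."""
    data_set["indexed"] = True
    return data_set

def data_loading_workflow(hive_table):
    """Chain the stages in the order the Data Processing component runs them."""
    result = hive_table
    for stage in (discover, create_data_set, enrich, profile, index):
        result = stage(result)
    return result
```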
The Workflow Manager Service
The
Workflow Manager is a service in BDD that acts as an
intermediary between Spark and the BDD clients: Studio and Data Processing CLI.
The service receives data set workflow requests from Studio or the Data Processing
CLI and delegates the sequence of Spark jobs needed for each workflow, such as
sampling, discovery, or transformation, to run in YARN. The Spark jobs run
asynchronously from one another, and the service notifies Studio of each job's status.
The Workflow Manager also delegates jobs to other components in BDD, such as
the Dgraph and the Dgraph HDFS Agent. Here are two examples of these
interactions:
- Studio submits its
workflow request to the Workflow Manager Service. The service submits sampling
or transformation Spark jobs to YARN. Depending on the workflow, it also
signals the Dgraph HDFS Agent and the Dgraph to start data loading and
indexing. Finally, the Workflow Manager notifies Studio when the workflow
finishes.
- The Data Processing CLI
submits its workflow request to the Workflow Manager Service. This could be a
request to load data from Hive, to refresh data, or to run incremental data
updates (adding more data to existing data sets). The Workflow Manager delegates a
sequence of Spark jobs to YARN, and also notifies Studio of the status when the
workflow finishes.
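The pattern in both examples is the same: the Workflow Manager delegates jobs to run asynchronously and reports status back to the client. The sketch below models that pattern with a thread pool and a status callback; it is a hypothetical illustration, not the Workflow Manager's actual implementation.

```python
# Minimal model of asynchronous job delegation with status notification.
# Hypothetical sketch only; this is not Workflow Manager code.
from concurrent.futures import ThreadPoolExecutor

class WorkflowManagerSketch:
    def __init__(self, notify):
        # notify: callback used to report job status back to the client
        # (standing in for the notification sent to Studio)
        self.notify = notify
        self.executor = ThreadPoolExecutor(max_workers=4)

    def submit_workflow(self, workflow_name, jobs):
        """Delegate each job to run asynchronously; report status as each finishes."""
        futures = [self.executor.submit(job) for job in jobs]
        for future, job in zip(futures, jobs):
            future.add_done_callback(
                lambda f, job=job: self.notify(workflow_name, job.__name__, "FINISHED"))
        return futures

# Illustrative jobs standing in for Spark sampling/transformation work.
def sampling():
    return "sampled"

def transformation():
    return "transformed"

statuses = []
mgr = WorkflowManagerSketch(lambda wf, job, st: statuses.append((wf, job, st)))
futures = mgr.submit_workflow("load_from_hive", [sampling, transformation])
mgr.executor.shutdown(wait=True)  # wait for all jobs (and their callbacks)
```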
Data Processing CLI
The
DP CLI is a Linux shell utility that submits data
processing workflows to the Workflow Manager, which starts them in YARN. You
can run the DP CLI manually or from a
cron job. The DP CLI submits the workflow requests and
relies on the Workflow Manager to delegate the workflows to YARN. The
workflows can run on an individual Hive table, all tables within a Hive
database, or all tables within Hive. This depends on DP CLI settings, such as a
blacklist and a whitelist.
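The whitelist and blacklist act as filters over table names: a table is processed only if it is whitelisted and not blacklisted. The sketch below illustrates that selection logic with regular-expression patterns; it is a conceptual model, not the DP CLI's implementation, and the pattern syntax the real CLI accepts is described in the Data Processing Guide.

```python
# Conceptual sketch of whitelist/blacklist table selection.
# Illustrative only; not the DP CLI's actual filtering code.
import re

def passes_filters(table_name, whitelist, blacklist):
    """A table qualifies if it matches some whitelist pattern
    and matches no blacklist pattern."""
    whitelisted = any(re.fullmatch(p, table_name) for p in whitelist)
    blacklisted = any(re.fullmatch(p, table_name) for p in blacklist)
    return whitelisted and not blacklisted

tables = ["sales_2016", "sales_tmp", "audit_log"]
selected = [t for t in tables
            if passes_filters(t, whitelist=[r"sales_.*"], blacklist=[r".*_tmp"])]
# selected -> ["sales_2016"]
```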
Here are some of the workflow requests you can submit with DP CLI:
- Load data from Hive tables
after installing Big Data Discovery (BDD). When you first install BDD, your
existing Hive tables are not processed. You must use the DP CLI to launch a
data processing operation on your tables.
- Run data updates. They
include:
- An operation to
refresh data. It reloads an existing data set in a Studio project, entirely replacing
the contents of the data set with the latest data from Hive.
- An incremental update.
It adds newer data to existing data sets in a Studio project.
- Launch the BDD
Hive Table Detector, which is another sub-component
related to data processing in BDD. The Hive Table Detector (HTD) detects when a
new table is added to Hive and then checks it against the whitelist and blacklist. If the
table passes both, the Detector creates a data set in BDD. It also deletes any BDD data
set that does not have a corresponding source Hive table. This keeps BDD data
sets in sync with data sets in Hive. For detailed information on how data sets
are managed in BDD, see
Data set lifecycle in Studio.
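The Detector's reconciliation behavior amounts to a set difference in both directions: create data sets for new, eligible Hive tables, and delete data sets whose source table is gone. A minimal sketch, with illustrative names rather than BDD internals:

```python
# Sketch of the Hive Table Detector's reconciliation logic.
# Names and structure are illustrative, not BDD internals.

def reconcile(hive_tables, bdd_data_sets, passes_filters):
    """Compute which data sets to create and which to delete so that
    BDD data sets stay in sync with Hive tables."""
    eligible = {t for t in hive_tables if passes_filters(t)}
    to_create = eligible - set(bdd_data_sets)          # new Hive tables to load
    to_delete = set(bdd_data_sets) - set(hive_tables)  # orphaned data sets
    return to_create, to_delete

to_create, to_delete = reconcile(
    hive_tables={"sales", "orders", "tmp_scratch"},
    bdd_data_sets={"sales", "legacy"},
    passes_filters=lambda t: not t.startswith("tmp_"),  # stand-in for the lists
)
# to_create -> {"orders"}; to_delete -> {"legacy"}
```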
For information on Data Processing workflows, the Workflow Manager,
the DP CLI, and the Hive Table Detector utility, see the
Data Processing Guide.