The Data Processing component of BDD runs a set of processes and jobs,
collectively called data processing workflows. Many of these processes run
natively in Hadoop.
The Data Processing component controls several workflows in BDD. For example,
you can create workflows for data loading, data updates, and others. The Data
Processing component discovers new Hive tables and loads data into BDD. It runs
data refresh operations and incremental updates. It also keeps BDD data sets in
sync with Hive tables that BDD creates.
For example, during a data loading workflow, the Data Processing
component performs these tasks:
- Discovery of data in Hive tables
- Creation of data sets in BDD
- Running a select set of enrichments on discovered data sets
- Profiling of the data sets
- Indexing of the data sets by streaming data to the Dgraph
To launch a data processing workflow when Big Data Discovery starts, you use
the Data Processing Command Line Interface (DP CLI).
Data Processing CLI
The DP CLI is a Linux shell utility that launches data processing workflows in
Hadoop and lets you control their steps and behavior. You can run the DP CLI
manually or from a cron job. You can run the data processing workflows on an
individual Hive table, all tables within a Hive database, or all tables within
Hive. Which tables are processed depends on DP CLI settings, such as a
blacklist and a whitelist.
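The scoping described above can be pictured with a short sketch. This is not the DP CLI's implementation: the function name `select_tables`, its parameters, and the filter semantics are hypothetical. It assumes that whitelist and blacklist entries are regular expressions matched against `database.table` names, that an empty whitelist allows every table, and that a blacklist match always excludes a table.

```python
import re


def select_tables(tables, whitelist=(), blacklist=(), database=None, table=None):
    """Return the (database, table) pairs a workflow run would process.

    Hypothetical illustration only; the filter rules are assumptions.
    """
    selected = []
    for db, name in tables:
        # Narrow the scope to one database and/or one table, if requested.
        if database is not None and db != database:
            continue
        if table is not None and name != table:
            continue
        qualified = f"{db}.{name}"
        # Assumption: an empty whitelist means "allow everything".
        if whitelist and not any(re.fullmatch(p, qualified) for p in whitelist):
            continue
        # Assumption: a blacklist match always excludes the table.
        if any(re.fullmatch(p, qualified) for p in blacklist):
            continue
        selected.append((db, name))
    return selected


# Example Hive catalog (made-up names).
hive_tables = [("default", "sales"), ("default", "tmp_x"), ("hr", "staff")]

# All tables within Hive, minus temporary tables.
print(select_tables(hive_tables, blacklist=[r".*\.tmp_.*"]))

# A single table in a single database.
print(select_tables(hive_tables, database="default", table="sales"))
```

Consult the Data Processing Guide for the actual flags and list file formats before relying on any of these assumptions.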
Here are some of the jobs you can run with the DP CLI:
- Load data from Hive tables after installing Big Data Discovery (BDD). When
you first install BDD, your existing Hive tables are not processed. You must
use the DP CLI to launch a data processing operation on your tables.
- Run data updates. They include:
  - An operation to refresh data. It reloads an existing data set in a Studio
  project, replacing the entire contents of the data set with the latest data
  from Hive.
  - An incremental update. It adds newer data to an existing data set in a
  Studio project.
- Launch the BDD Hive Table Detector. It detects whether a new table has been
added to Hive. It then checks the table against the whitelist and blacklist. If
the table passes them, the detector creates a data set in BDD. It also deletes
any BDD data set that does not have a corresponding source Hive table. This
keeps BDD data sets in sync with their source tables in Hive.
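The reconciliation that the BDD Hive Table Detector performs can be sketched as a comparison of two sets. The code below is a minimal illustration, not the detector's actual logic: the function `reconcile`, the `dataset_sources` mapping, and the `is_allowed` callback (which stands in for the whitelist and blacklist check) are all hypothetical names.

```python
def reconcile(hive_tables, dataset_sources, is_allowed):
    """Decide which data sets to create and which to delete.

    hive_tables     -- names of the tables currently in Hive
    dataset_sources -- mapping of BDD data set name -> source Hive table name
    is_allowed      -- callback standing in for the whitelist/blacklist check
    All three are hypothetical structures used for illustration.
    """
    hive = set(hive_tables)
    already_loaded = set(dataset_sources.values())

    # New Hive tables that pass the filters get a data set.
    to_create = sorted(t for t in hive - already_loaded if is_allowed(t))

    # Data sets whose source table no longer exists are removed.
    to_delete = sorted(
        ds for ds, source in dataset_sources.items() if source not in hive
    )
    return to_create, to_delete


# Example state (made-up names).
hive_tables = ["default.sales", "default.tmp_x", "hr.staff"]
dataset_sources = {"sales_ds": "default.sales", "old_ds": "default.dropped"}

to_create, to_delete = reconcile(
    hive_tables, dataset_sources, is_allowed=lambda t: "tmp_" not in t
)
print("create data sets for:", to_create)
print("delete data sets:", to_delete)
```

In this example, `hr.staff` is new and passes the filter, so it gets a data set. `default.tmp_x` is filtered out, and `old_ds` is deleted because its source table is gone.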
For more information on Data Processing and the DP CLI, see the Data
Processing Guide.