The Data Processing component of BDD runs a set of processes and jobs, collectively
called data processing workflows. Many of these processes run
natively in Hadoop.
The Data Processing component
controls several workflows in BDD. For example, you can create workflows for
data loading, data updates, and others. The Data Processing component discovers
new Hive tables and loads data into BDD. It runs data refresh operations and
incremental updates. It also keeps BDD data sets in sync with the Hive tables that
BDD creates.
For example, during a data loading workflow, the Data Processing
component performs these tasks:
- Discovery of data in Hive
tables
- Creation of data sets in BDD
- Running a select set of
enrichments on discovered data sets
- Profiling of the data sets
- Indexing of the data sets by streaming data to the Dgraph.
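The tasks above form a linear pipeline: each stage consumes the output of the previous one. The sketch below models that ordering only; the stage names and data shapes are illustrative placeholders, not BDD APIs.

```python
# Conceptual sketch of a BDD data loading workflow as a linear pipeline.
# Stage names and signatures are illustrative only; they are not BDD APIs.

def discover(hive_table):
    """Discover the data in a Hive table (placeholder)."""
    return {"table": hive_table}

def create_data_set(discovered):
    """Create a BDD data set from the discovered data (placeholder)."""
    return {"data_set": discovered["table"] + "_ds", **discovered}

def enrich(data_set):
    """Run a select set of enrichments on the data set (placeholder)."""
    data_set["enriched"] = True
    return data_set

def profile(data_set):
    """Profile the data set (placeholder)."""
    data_set["profiled"] = True
    return data_set

def index(data_set):
    """Stream the data set to the Dgraph for indexing (placeholder)."""
    data_set["indexed"] = True
    return data_set

def data_loading_workflow(hive_table):
    """Chain the stages in the order the Data Processing component runs them."""
    result = hive_table
    for stage in (discover, create_data_set, enrich, profile, index):
        result = stage(result)
    return result
```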
The Workflow Manager Service
The
Workflow Manager is a service in BDD that acts as an
intermediary between Spark and the BDD clients: Studio and Data Processing CLI.
The service receives data set workflow requests from Studio or the Data Processing
CLI and delegates the sequence of Spark jobs needed for each workflow, such as
sampling, discovery, or transformation, to run in YARN. The Spark jobs run
asynchronously from one another, and the service notifies Studio of each job's status.
The Workflow Manager also delegates jobs to other components in BDD, such as
the Dgraph and the Dgraph HDFS Agent. Here are two examples of these
interactions:
- Studio submits its
workflow request to the Workflow Manager Service. The service submits sampling
or transformation Spark jobs to YARN. Depending on the workflow, it also
signals the Dgraph HDFS Agent and the Dgraph to start data loading and
indexing. Finally, the Workflow Manager notifies Studio when the workflow
finishes.
- The Data Processing CLI
submits its workflow request to the Workflow Manager Service. This could be a
request to load data from Hive, to refresh data, or to run incremental data
updates (adding more data to existing data sets). The Workflow Manager delegates a
sequence of Spark jobs to YARN, and also notifies Studio of the status when the
workflow finishes.
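The pattern in both examples is the same: the Workflow Manager delegates jobs to run asynchronously and reports status back to the client. The sketch below models that pattern with a thread pool and a status callback; it is a hypothetical illustration, not the Workflow Manager's actual implementation.

```python
# Minimal model of asynchronous job delegation with status notification.
# Hypothetical sketch only; this is not Workflow Manager code.
from concurrent.futures import ThreadPoolExecutor

class WorkflowManagerSketch:
    def __init__(self, notify):
        # notify: callback used to report job status back to the client
        # (standing in for the notification sent to Studio)
        self.notify = notify
        self.executor = ThreadPoolExecutor(max_workers=4)

    def submit_workflow(self, workflow_name, jobs):
        """Delegate each job to run asynchronously; report status as each finishes."""
        futures = [self.executor.submit(job) for job in jobs]
        for future, job in zip(futures, jobs):
            future.add_done_callback(
                lambda f, job=job: self.notify(workflow_name, job.__name__, "FINISHED"))
        return futures

# Illustrative jobs standing in for Spark sampling/transformation work.
def sampling():
    return "sampled"

def transformation():
    return "transformed"

statuses = []
mgr = WorkflowManagerSketch(lambda wf, job, st: statuses.append((wf, job, st)))
futures = mgr.submit_workflow("load_from_hive", [sampling, transformation])
mgr.executor.shutdown(wait=True)  # wait for all jobs (and their callbacks)
```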
Data Processing CLI
The
DP CLI is a Linux shell utility that submits data
processing workflows to the Workflow Manager, which starts them in YARN. You
can run the DP CLI manually or from a
cron job. The DP CLI submits the workflow requests and
relies on the Workflow Manager to delegate the workflows to YARN. The
workflows can run on an individual Hive table, all tables within a Hive
database, or all tables within Hive. This depends on DP CLI settings, such as a
blacklist and a whitelist.
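The whitelist and blacklist act as filters over table names: a table is processed only if it is whitelisted and not blacklisted. The sketch below illustrates that selection logic with regular-expression patterns; it is a conceptual model, not the DP CLI's implementation, and the pattern syntax the real CLI accepts is described in the Data Processing Guide.

```python
# Conceptual sketch of whitelist/blacklist table selection.
# Illustrative only; not the DP CLI's actual filtering code.
import re

def passes_filters(table_name, whitelist, blacklist):
    """A table qualifies if it matches some whitelist pattern
    and matches no blacklist pattern."""
    whitelisted = any(re.fullmatch(p, table_name) for p in whitelist)
    blacklisted = any(re.fullmatch(p, table_name) for p in blacklist)
    return whitelisted and not blacklisted

tables = ["sales_2016", "sales_tmp", "audit_log"]
selected = [t for t in tables
            if passes_filters(t, whitelist=[r"sales_.*"], blacklist=[r".*_tmp"])]
# selected -> ["sales_2016"]
```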
Here are some of the workflow requests you can submit with DP CLI:
- Load data from Hive tables
after installing Big Data Discovery (BDD). When you first install BDD, your
existing Hive tables are not processed. You must use the DP CLI to launch a
data processing operation on your tables.
- Run data updates. They
include:
- An operation to
refresh data. It reloads an existing data set in a Studio project, entirely replacing
the contents of the data set with the latest data from Hive.
- An incremental update.
It adds newer data to existing data sets in a Studio project.
- Launch the BDD
Hive Table Detector, which is another sub-component
related to data processing in BDD. The Hive Table Detector (HTD) detects when a
new table is added to Hive and then checks it against the whitelist and blacklist. If the
table passes both, the Detector creates a data set in BDD. It also deletes any BDD data
set that does not have a corresponding source Hive table. This keeps BDD data
sets in sync with data sets in Hive. For detailed information on how data sets
are managed in BDD, see
Data set lifecycle in Studio.
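The Detector's reconciliation behavior amounts to a set difference in both directions: create data sets for new, eligible Hive tables, and delete data sets whose source table is gone. A minimal sketch, with illustrative names rather than BDD internals:

```python
# Sketch of the Hive Table Detector's reconciliation logic.
# Names and structure are illustrative, not BDD internals.

def reconcile(hive_tables, bdd_data_sets, passes_filters):
    """Compute which data sets to create and which to delete so that
    BDD data sets stay in sync with Hive tables."""
    eligible = {t for t in hive_tables if passes_filters(t)}
    to_create = eligible - set(bdd_data_sets)          # new Hive tables to load
    to_delete = set(bdd_data_sets) - set(hive_tables)  # orphaned data sets
    return to_create, to_delete

to_create, to_delete = reconcile(
    hive_tables={"sales", "orders", "tmp_scratch"},
    bdd_data_sets={"sales", "legacy"},
    passes_filters=lambda t: not t.startswith("tmp_"),  # stand-in for the lists
)
# to_create -> {"orders"}; to_delete -> {"legacy"}
```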
For information on Data Processing workflows, the Workflow Manager,
the DP CLI, and the Hive Table Detector utility, see the
Data Processing Guide.