The DP CLI (Command Line Interface) shell utility is used to
launch Data Processing workflows.
The Data
Processing workflow can be run on an individual Hive table, all tables within a
Hive database, or all tables within Hive. The tables must be of the
auto-provisioned type (as explained further in this topic).
The DP CLI starts workflows in Oozie. The results of the DP CLI workflow
are the same as if the tables were processed by a Studio-generated Data
Processing workflow.
Two important use cases for the DP CLI are:
- Ingesting data from your Hive tables immediately after installing the Big
  Data Discovery (BDD) product. When you first install BDD, your existing Hive
  tables are not processed. Therefore, you must use the DP CLI to launch a
  first-time Data Processing operation on your tables.
- Invoking the BDD Hive Table Detector, which in turn can start Data
  Processing workflows for new or deleted Hive tables.
You can run the DP CLI either manually or from a cron job. By default,
the BDD installer does not create a cron job as part of the installation
procedure.
Skipped and auto-provisioned Hive tables
From the point of view of Data Processing, there are two types of Hive
tables: skipped tables and auto-provisioned tables. The type depends on the
presence of a special table property,
skipAutoProvisioning. The
skipAutoProvisioning property tells the BDD Hive Table
Detector not to process the table.
Skipped tables are Hive tables that have the
skipAutoProvisioning table property present and set to
true. Thus, a Data Processing workflow will never be
launched for a skipped table. This property is set in two instances:
- The table was created from
Studio, in which case the
skipAutoProvisioning property is always set at table
creation time.
- The table was created by a
Hive administrator and a corresponding BDD data set was provisioned from that
table. Later, that data set was deleted from Studio. When a data set (from an
admin-created table) is deleted, Studio modifies the underlying Hive table by
adding the
skipAutoProvisioning table property.
Auto-provisioned tables are Hive tables that were created by
the Hive administrator and do not have a
skipAutoProvisioning property. These tables can be
provisioned by a Data Processing workflow that is launched by the BDD Hive
Table Detector.
Note: Keep in mind that when a BDD data set is deleted, its source Hive
table is not deleted from the Hive database. This applies to data sets that
were generated from either Studio-created tables or admin-created tables. The
skipAutoProvisioning property ensures that the table
is not re-provisioned after its corresponding data set is deleted
(otherwise, the deleted data set would reappear the next time the table was
processed).
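The distinction between the two table types can be sketched in a few lines of Python. This is only an illustration of the rule described above, not BDD code: the function names are hypothetical, the table properties are modeled as a plain dictionary, and treating a property that is present but not set to true as auto-provisioned is an assumption, since this topic does not describe that case.

```python
def is_skipped(table_properties):
    """Return True if skipAutoProvisioning is present and set to true.

    table_properties is a plain dict standing in for a Hive table's
    properties (a modeling assumption for this sketch).
    """
    value = table_properties.get("skipAutoProvisioning")
    return value is not None and str(value).strip().lower() == "true"


def classify(table_properties):
    """Label a table as 'skipped' or 'auto-provisioned'."""
    # Assumption: any table that is not skipped counts as auto-provisioned.
    return "skipped" if is_skipped(table_properties) else "auto-provisioned"


if __name__ == "__main__":
    print(classify({"skipAutoProvisioning": "true"}))
    print(classify({"comment": "created by the Hive administrator"}))
```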
BDD Hive Table Detector
The BDD Hive Table Detector is a process that automatically keeps a
Hive database in sync with the BDD data sets. The BDD Hive Table Detector has
two major functions:
- Automatically checks all Hive tables within a Hive database:
  - For each auto-provisioned table that does not have a corresponding BDD
    data set, the BDD Hive Table Detector launches a new data provisioning
    workflow.
  - Skipped tables, such as Studio-created tables, are never provisioned by
    the BDD Hive Table Detector, even if they do not have a corresponding BDD
    data set.
- Automatically launches the
data set clean-up process if it detects that a BDD data set does not have an
associated Hive table. (That is, an orphaned BDD data set is automatically
deleted if its source Hive table no longer exists.) Typically, this scenario
occurs when a Hive table (either admin-created or Studio-created) has been
deleted by a Hive administrator.
The BDD Hive Table Detector detects empty tables, and does not launch
workflows for those tables.
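The decisions described above can be summarized in a short Python sketch. It is a simplified model under stated assumptions, not the detector's implementation: the function name, the dictionary keys (name, skipped, rows), and the idea of representing data sets by the name of their source table are all hypothetical.

```python
def plan_detector_actions(hive_tables, data_set_sources):
    """Decide which tables to provision and which data sets are orphaned.

    hive_tables: list of dicts with hypothetical keys
        'name' (str), 'skipped' (bool), and 'rows' (int).
    data_set_sources: set of source table names for existing BDD data sets.
    """
    existing = {t["name"] for t in hive_tables}

    # Provision auto-provisioned, non-empty tables that have no data set yet.
    to_provision = [
        t["name"]
        for t in hive_tables
        if not t["skipped"]
        and t["rows"] > 0
        and t["name"] not in data_set_sources
    ]

    # A data set whose source table no longer exists is orphaned.
    orphaned = sorted(set(data_set_sources) - existing)
    return to_provision, orphaned


if __name__ == "__main__":
    tables = [
        {"name": "sales", "skipped": False, "rows": 10},
        {"name": "studio_export", "skipped": True, "rows": 5},
        {"name": "empty_log", "skipped": False, "rows": 0},
        {"name": "customers", "skipped": False, "rows": 3},
    ]
    provision, orphaned = plan_detector_actions(
        tables, {"customers", "old_orders"}
    )
    print(provision)  # ['sales']
    print(orphaned)   # ['old_orders']
```

In this model, only sales qualifies for a new workflow: studio_export is skipped, empty_log is empty, and customers already has a data set. The old_orders data set has no source table, so it is the one that would be cleaned up.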
The BDD Hive Table Detector is invoked with the DP CLI, which has
command flags that control the behavior of the script. For example, you can
select which Hive tables are processed: the
--whitelist flag of the CLI specifies a file
listing the Hive tables that should be processed, while the
--blacklist flag specifies a file listing the Hive
tables that should be filtered out during processing.
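The effect of the two lists can be pictured with the following Python sketch. It is an illustration only: the function name is hypothetical, and both the exact-name matching and the way the two lists combine are assumptions, because this topic does not describe the file format or the matching rules that the DP CLI actually uses.

```python
def select_tables(all_tables, whitelist=None, blacklist=None):
    """Return the tables that remain after whitelist/blacklist filtering.

    Assumptions for this sketch: entries are exact table names, a missing
    list (None) applies no restriction, and the blacklist takes priority
    over the whitelist when a name appears in both.
    """
    selected = []
    for name in all_tables:
        if whitelist is not None and name not in whitelist:
            continue  # not listed for processing
        if blacklist is not None and name in blacklist:
            continue  # explicitly filtered out
        selected.append(name)
    return selected


if __name__ == "__main__":
    tables = ["sales", "customers", "tmp_scratch"]
    print(select_tables(tables, blacklist={"tmp_scratch"}))
    print(select_tables(tables, whitelist={"sales"}))
```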
Logging
The DP CLI logs detailed information about its workflow into the log
file defined in the
$CLI_HOME/config/logging.properties file. This
file is documented in
Logging configuration.
The implementation of the BDD Hive Table Detector is based on the DP
CLI, so it uses the same logging properties as the DP CLI script. It also
writes verbose output (for some classes) to stdout/stderr.