DP CLI overview

The DP CLI (Command Line Interface) shell utility is used to launch Data Processing workflows, either manually or via a cron job.

The Data Processing workflow can be run on an individual Hive table, all tables within a Hive database, or all tables within Hive. The tables must be of the auto-provisioned type (as explained further in this topic).

The DP CLI starts workflows that are run by Spark workers. The results of the DP CLI workflow are the same as if the tables were processed by a Studio-generated Data Processing workflow.

Two important use cases for the DP CLI are:

Ingesting data from your Hive tables immediately after installing the Big Data Discovery (BDD) product. When you first install BDD, your existing Hive tables are not processed. Therefore, you must use the DP CLI to launch a first-time Data Processing operation on your tables.
Invoking the BDD Hive Table Detector, which in turn can start Data Processing workflows for new or deleted Hive tables.

The DP CLI can be run either manually or from a cron job. The BDD installer creates a cron job as part of the installation procedure if the ENABLE_HIVE_TABLE_DETECTOR property is set to TRUE in the bdd.conf file.

DP CLI script

The DP CLI is run via the data_processing_CLI script. The script is located in the $BDD_HOME/dataprocessing/edp_cli directory. The directory also contains the bdd-sp.sh script, which is a symbolic link to the data_processing_CLI script. You can therefore run the DP CLI with either script.

The following abbreviated example shows an invocation of the DP CLI, which is being run against a Hive table name "warrantyclaims" in the "default" database:

./data_processing_CLI --database default --table warrantyclaims
...
[2016-09-29T13:10:17.702-04:00] [DataProcessing] [INFO] [] 
  [org.apache.hadoop.hive.metastore.HiveMetaStoreClient] [tid:main] 
  [userID:fcalvin Connected to metastore.
New collection name = MdexCollectionIdentifier{
   databaseName=edp_cli_edp_37396cf4-fab1-40c8-bb08-d5bb09478d58, 
   collectionName=edp_cli_edp_37396cf4-fab1-40c8-bb08-d5bb09478d58}
   jobId: 1fb5df0e-66cd-4ce5-8ff8-5b93e2ee2df0
data_processing_CLI finished with state SUCCESS

When creating a new data set (as in this example), the command output will list the names of the data set (in the collectionName field) and the Dgraph database (in the databaseName field), as well as the ID of the workflow job.

Note that the "finished with state SUCCESS" message means that the workflow was successfully launched; it does not mean that the workflow was successful. This means that is possible for a workflow to be launched successfully, but fail to finish. A notification of the workflow's final status is sent to Studio, as well as written to the logs.

Skipped and auto-provisioned Hive tables

From the point of view of Data Processing, there are two types of Hive tables: skipped tables and auto-provisioned tables. The table type depends on the presence of a special table property, skipAutoProvisioning. The skipAutoProvisioning property (when set to true) tells the BDD Hive Table Detector to skip the table for processing.

Skipped tables are Hive tables that have the skipAutoProvisioning table property present and set to true. Thus, a Data Processing workflow will never be launched for a skipped table (unless the DP CLI is run manually with the --table flag set to the table). This property is set in two instances:

The table was created from Studio, in which case the skipAutoProvisioning property is always set at table creation time.
The table was created by a Hive administrator and a corresponding BDD data set was provisioned from that table. Later, that data set was deleted from Studio. When a data set (from an admin-created table) is deleted, Studio modifies the underlying Hive table by adding the skipAutoProvisioning table property.

For information on changing the value of the skipAutoProvisioning property, see Changing Hive table properties.

Auto-provisioned tables are Hive tables that were created by the Hive administrator and do not have a skipAutoProvisioning property. These tables can be provisioned by a Data Processing workflow that is launched by the BDD Hive Table Detector.

Note: Keep in mind that when a BDD data set is deleted, its source Hive table is not deleted from the Hive database. This applies to data sets that were generated from either Studio-created tables or admin-created tables. The skipAutoProvisioning property ensures that the table will not be re-provisioned when its corresponding data set is deleted (otherwise, the deleted data set would re-appear when the table was re-processed).

BDD Hive Table Detector

The BDD Hive Table Detector is a process that automatically keeps a Hive database in sync with the BDD data sets. The BDD Hive Table Detector has two major functions:

Automatically checks all Hive tables within a Hive database:
- For each auto-provisioned table that does not have a corresponding BDD data set, the BDD Hive Table Detector launches a new data provisioning workflow (unless the table is skipped via the blacklist).
- For all skipped tables, such as, Studio-created tables, the BDD Hive Table Detector never provisions them, even if they do not have a corresponding BDD data set.
Automatically launches the data set clean-up process if it detects that a BDD data set does not have an associated Hive table. (That is, an orphaned BDD data set is automatically deleted if its source Hive table no longer exists.) Typically, this scenario occurs when a Hive table (either admin-created or Studio-created) has been deleted by a Hive administrator.

The BDD Hive Table Detector detects empty tables, and does not launch workflows for those tables.

The BDD Hive Table Detector is invoked with the DP CLI, which has command flags to control the behavior of the script. For example, you can select the Hive tables you want to be processed. The --whitelist flag of the CLI specifies a file listing the Hive tables that should be processed, while the --blacklist flag controls a file with Hive tables that should be filtered out during processing.