The DP CLI (Command Line Interface) shell utility is used to launch Data Processing workflows, either manually or via a cron job.
The Data Processing workflow can be run on an individual Hive table, all tables within a Hive database, or all tables within Hive. The tables must be of the auto-provisioned type (as explained further in this topic).
The DP CLI starts workflows that are run by Spark workers. The results of the DP CLI workflow are the same as if the tables were processed by a Studio-generated Data Processing workflow.
The cron job is created by the BDD installer as part of the installation procedure if the ENABLE_HIVE_TABLE_DETECTOR property is set to TRUE in the bdd.conf file.
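The installer's check can be pictured with a minimal sketch. It assumes that bdd.conf consists of simple KEY=VALUE lines and that a missing property counts as disabled; neither assumption is confirmed by this topic, and the function name detector_cron_enabled is hypothetical.

```python
def detector_cron_enabled(conf_text: str) -> bool:
    """Return True if ENABLE_HIVE_TABLE_DETECTOR is set to TRUE.

    Assumes KEY=VALUE lines (an assumption about the bdd.conf format).
    A missing property is treated as disabled, also an assumption.
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comment lines, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "ENABLE_HIVE_TABLE_DETECTOR":
            # Tolerate optional quotes and any letter case around the value.
            return value.strip().strip("\"'").upper() == "TRUE"
    return False
```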
Skipped and auto-provisioned Hive tables
From the point of view of Data Processing, there are two types of Hive tables: skipped tables and auto-provisioned tables. The table type depends on the presence of a special table property, skipAutoProvisioning. The skipAutoProvisioning property (when set to true) tells the BDD Hive Table Detector to skip the table for processing.
Skipped tables are Hive tables that have the skipAutoProvisioning table property present and set to true. Thus, a Data Processing workflow will never be launched for a skipped table (unless the DP CLI is run manually with the --table flag set to the table). This property is set in two instances:
For tables created from Studio, the skipAutoProvisioning property is always set at table creation time.
For tables created by the Hive administrator, when the corresponding BDD data set is deleted, the underlying Hive table is given the skipAutoProvisioning table property.
For information on changing the value of the skipAutoProvisioning property, see Changing Hive table properties.
Auto-provisioned tables are Hive tables that were created by the Hive administrator and do not have a skipAutoProvisioning property. These tables can be provisioned by a Data Processing workflow that is launched by the BDD Hive Table Detector.
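The distinction between the two table types can be sketched as a small classification routine. This is an illustration only, not the Data Processing implementation: the function names are hypothetical, table properties are modeled as a plain dictionary, and comparing the value to true without regard to letter case is an assumption.

```python
from typing import Mapping

SKIP_PROPERTY = "skipAutoProvisioning"  # property name as given in this topic


def is_skipped(table_properties: Mapping[str, str]) -> bool:
    """A table is skipped when the property is present and set to true.

    Case-insensitive matching of "true" is an assumption of this sketch.
    A table whose property is set to anything else is not treated as skipped.
    """
    value = table_properties.get(SKIP_PROPERTY)
    return value is not None and value.strip().lower() == "true"


def workflow_can_launch(table_properties: Mapping[str, str],
                        named_with_table_flag: bool = False) -> bool:
    """Decide whether a workflow may be launched for a table.

    Skipped tables are only processed when the DP CLI is run manually
    with the --table flag set to that table.
    """
    return named_with_table_flag or not is_skipped(table_properties)


# Example tables (hypothetical property sets).
studio_table = {SKIP_PROPERTY: "true"}
admin_table = {"comment": "created by the Hive administrator"}
```

With these definitions, admin_table is eligible for detection, while studio_table is processed only when workflow_can_launch is called with named_with_table_flag=True.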
Note:
Keep in mind that when a BDD data set is deleted, its source Hive table is not deleted from the Hive database. This applies to data sets that were generated from either Studio-created tables or admin-created tables. The skipAutoProvisioning property ensures that the table will not be re-provisioned when its corresponding data set is deleted (otherwise, the deleted data set would re-appear when the table was re-processed).
BDD Hive Table Detector
The BDD Hive Table Detector detects empty tables, and does not launch workflows for those tables.
The BDD Hive Table Detector is invoked with the DP CLI, which has command flags to control the behavior of the script. For example, you can select the Hive tables you want to be processed. The --whitelist flag of the CLI specifies a file listing the Hive tables that should be processed, while the --blacklist flag specifies a file listing the Hive tables that should be filtered out during processing.
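The combined effect of the empty-table check and the two list files can be sketched as a filter. The sketch assumes that each list file holds one exact table name per line and that a table appearing in both lists is filtered out; the actual file format and precedence may differ. The function names and the row-count mapping are hypothetical.

```python
from typing import List, Mapping, Optional, Set


def parse_table_list(text: str) -> Set[str]:
    """Parse list-file contents, assuming one table name per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def select_tables(row_counts: Mapping[str, int],
                  whitelist: Optional[Set[str]] = None,
                  blacklist: Optional[Set[str]] = None) -> List[str]:
    """Return the tables for which a workflow would be launched.

    row_counts maps each candidate table name to its number of rows.
    """
    selected = []
    for table in sorted(row_counts):
        if row_counts[table] == 0:
            continue  # empty tables never get a workflow
        if whitelist is not None and table not in whitelist:
            continue  # a whitelist restricts processing to listed tables
        if blacklist is not None and table in blacklist:
            continue  # blacklisted tables are filtered out (assumed to win)
        selected.append(table)
    return selected


# Hypothetical candidate tables and list-file contents.
row_counts = {"sales": 120, "staging_tmp": 0, "web_logs": 45, "scratch": 10}
whitelist = parse_table_list("sales\nweb_logs\nstaging_tmp\n")
blacklist = parse_table_list("web_logs\n")
```

Passing None for either list models running the DP CLI without the corresponding flag.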