Hive tables contain the data for the Data Processing workflows.
When processed, each Hive table results in the creation of a BDD data set, and that data set contains records from the Hive table. Note that a Hive table must contain at least one record in order for it to be processed. That is, Data Processing does not create a data set for an empty table.
Starting workflows
The DP CLI, when run either manually or from a cron job, invokes the BDD Hive Table Detector, which can find a Hive table that does not already exist in the DataSet Inventory. A Data Processing workflow is then run on the table. For details on running the DP CLI, see DP Command Line Interface Utility.
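The detection step can be pictured as a set difference between the tables known to Hive and the source tables already recorded in the DataSet Inventory. The following Python sketch is illustrative only: the function name and the plain lists of table names are assumptions, not part of any BDD API.

```python
def find_new_tables(hive_tables, inventory_source_tables):
    """Return Hive tables that have no entry in the DataSet Inventory.

    Both arguments are hypothetical iterables of "database.table" names;
    the real detector queries Hive and the DataSet Inventory itself.
    """
    known = set(inventory_source_tables)
    # Preserve the order of the Hive listing so repeated runs are comparable.
    return [table for table in hive_tables if table not in known]


new_tables = find_new_tables(
    ["default.sales", "default.claims", "default.weather"],
    ["default.sales"],
)
print(new_tables)
```

Each table returned by such a check would then be a candidate for a Data Processing workflow.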
New Hive table workflow and diagram
Both Studio and the DP CLI can be configured to launch a Data Processing workflow that does not use the Data Enrichment modules. The following high-level diagram shows a workflow in which the Data Enrichment modules are run:
At the end of the workflow, the ingestStatus attribute of the DataSet Inventory is updated with the final status of the provisioning (ingest) operation.
Handling of updated Hive tables
Existing BDD data sets are not automatically updated when their Hive source tables are updated. For example, assume that a data set has been created from a specific Hive table. If that Hive table is later updated with new data, the associated BDD data set does not change, which means that the data set is now out of sync with its Hive source table.
To update the data set from the updated Hive table, you must run the DP CLI with either the --refreshData flag or the --incrementalUpdate flag. For details, see Updating Data Sets.
Handling of deleted Hive tables
BDD will never delete a Hive table, even if the associated BDD data set has been deleted from Studio. However, it is possible for a Hive administrator to delete a Hive table, even if a BDD data set has been created from that table. In this case, the BDD data set is not automatically deleted and will still be viewable in Studio. (A data set whose Hive source table was deleted is called an orphaned data set.)
The next time that the DP CLI runs, it detects the orphaned data set and runs a Data Processing job that deletes the data set.
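Orphan detection is the reverse of new-table detection: it looks for data sets whose source table is no longer present in Hive. A minimal Python sketch follows, assuming a hypothetical dictionary that maps each data set name to its source table; the names used here are illustrative and are not BDD identifiers.

```python
def find_orphaned_data_sets(inventory, hive_tables):
    """Return the names of data sets whose Hive source table is gone.

    inventory   -- hypothetical dict: data set name -> source Hive table
    hive_tables -- hypothetical iterable of tables that currently exist
    """
    existing = set(hive_tables)
    return sorted(
        name for name, table in inventory.items() if table not in existing
    )


inventory = {
    "sales_ds": "default.sales",
    "claims_ds": "default.claims",
}
# default.claims has been dropped by a Hive administrator.
print(find_orphaned_data_sets(inventory, ["default.sales"]))
```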
Handling of empty Hive tables
Data Processing does not process empty Hive tables. Instead, the Spark driver throws an EmptyHiveTableException when running against an empty Hive table, and the Data Processing job does not create a data set for the table. Note that the command may appear to have finished successfully, but the absence of the data set means that the job ultimately failed.
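Because a run against an empty table can look successful, it can help to interpret the outcome from two observable facts: the row count of the source table and whether the data set exists afterward. The Python sketch below assumes that both values have already been obtained by other means; the function and its return strings are hypothetical.

```python
def classify_run(table_row_count, data_set_exists):
    """Interpret the outcome of a workflow run on a single Hive table.

    table_row_count -- number of records in the source table (assumed known)
    data_set_exists -- True if a data set for the table exists after the run
    """
    if data_set_exists:
        return "created"
    if table_row_count == 0:
        # Matches the documented behavior: empty tables yield no data set.
        return "not created: source table is empty"
    return "not created: check the Data Processing logs"


print(classify_run(0, False))
```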
Handling of Hive tables created with header/footer information
Data Processing does not support processing Hive tables that are based on files (such as CSV files) containing header/footer rows. In this case, the DP workflow ignores the header and footer settings defined on the Hive table with the skip.header.line.count and skip.footer.line.count properties. If a workflow on such a table does happen to succeed, the header/footer rows are added to the resulting BDD data set as records instead of being omitted.
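One way to avoid this situation is to flag affected tables before they are processed. The Python sketch below inspects a dictionary of table properties (assumed here to have been collected separately, for example from the output of Hive's SHOW TBLPROPERTIES statement) and reports any header or footer skip counts greater than zero. The function name is hypothetical.

```python
HEADER_FOOTER_PROPS = ("skip.header.line.count", "skip.footer.line.count")


def header_footer_settings(tblproperties):
    """Return the header/footer skip properties set to a non-zero count.

    tblproperties -- hypothetical dict of Hive table property name -> value
    """
    found = {}
    for prop in HEADER_FOOTER_PROPS:
        value = str(tblproperties.get(prop, "0")).strip()
        # Ignore missing, zero, or non-numeric values.
        if value.isdigit() and int(value) > 0:
            found[prop] = int(value)
    return found


props = {"skip.header.line.count": "1", "serialization.format": ","}
flagged = header_footer_settings(props)
if flagged:
    print("Table relies on header/footer skipping:", flagged)
```

A table that is flagged this way would contribute its header or footer rows as ordinary records if a workflow on it succeeded.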
Deletion of Studio projects
When a Studio user deletes a project, Data Processing is called and deletes the transformed data sets in the project. However, it does not delete data sets that have not been transformed.
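The rule amounts to a simple partition of the project's data sets. In the Python sketch below, the list of (name, transformed) pairs is a hypothetical representation chosen for illustration; it does not reflect how Studio stores project metadata.

```python
def split_on_project_delete(data_sets):
    """Partition a project's data sets into deleted and retained names.

    data_sets -- hypothetical list of (name, is_transformed) tuples
    """
    deleted = [name for name, transformed in data_sets if transformed]
    retained = [name for name, transformed in data_sets if not transformed]
    return deleted, retained


deleted, retained = split_on_project_delete(
    [("sales_clean", True), ("sales_raw", False)]
)
print("deleted:", deleted)
print("retained:", retained)
```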