Working with Hive tables

Hive tables contain the data for the Data Processing workflows.

When a Hive table is processed, the result is a BDD data set containing records from that table. Note that a Hive table must contain at least one record to be processed; Data Processing does not create a data set for an empty table.
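
For example, the following statements create and populate a minimal Hive table that qualifies for processing. The table name, columns, and values are placeholders, and the INSERT ... VALUES syntax assumes a Hive version that supports it:

    # The table must hold at least one record before Data Processing will create a data set for it
    hive -e "CREATE TABLE IF NOT EXISTS sales_sample (id INT, region STRING, amount DOUBLE);
             INSERT INTO TABLE sales_sample VALUES (1, 'WEST', 100.0);"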

Starting workflows

A Data Processing workflow can be started in one of two ways:
  • A user in Studio invokes an operation that creates a new Hive table. After the Hive table is created, Studio launches a Data Processing workflow on that table.
  • The DP CLI (Command Line Interface) utility is run.

The DP CLI, when run either manually or from a cron job, invokes the BDD Hive Table Detector, which finds any Hive table that does not already have a corresponding data set in the DataSet Inventory. A Data Processing workflow is then run on each such table. For details on running the DP CLI, see DP Command Line Interface Utility.
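
For example, a crontab entry similar to the following runs the DP CLI nightly so that the Hive Table Detector can pick up new tables. The script path and name shown are assumptions; substitute the actual location of the DP CLI in your deployment:

    # Run the DP CLI every night at 1:00 AM (illustrative path and schedule)
    0 1 * * * /localdisk/Oracle/Middleware/BDD/dataprocessing/edp_cli/data_processing_CLI >> /var/log/dp_cli.log 2>&1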

New Hive table workflow and diagram

Both Studio and the DP CLI can be configured to launch a Data Processing workflow that does not use the Data Enrichment modules. The following high-level diagram shows a workflow in which the Data Enrichment modules are run:

New Hive table workflow

The steps in the workflow are:
  1. The workflow is started for a single Hive table by Studio or by the DP CLI.
  2. The job is started and the workflow is assigned to a Spark worker. Data is loaded from the Hive table's data files. The total number of rows in the table is counted, the data is sampled, and a primary key is added. The number of processed (sampled) records is specified in the Studio or DP CLI configuration.
  3. The data from step 2 is written to an Avro file in HDFS. This file will remain in HDFS as long as the associated data set exists.
  4. The data set schema and metadata are discovered. This includes discovering the data type of each column, such as long, geocode, and so on. (The DataSet Inventory is also updated with the discovered metadata. If the DataSet Inventory did not exist, it is created at this point.)
  5. The Data Enrichment modules are run. A list of recommended enrichments is generated based on the results of the discovery process. The data is enriched using the recommended enrichments. If running enrichments is disabled in the configuration, then this step is skipped.
  6. The data set is created in the Dgraph, using settings from steps 4 and 5. The DataSet Inventory is also updated to include metadata for the new data set.
  7. The data set is provisioned (that is, HDFS files are written for ingest) and the Dgraph HDFS Agent is notified to pick up the HDFS files, which are sent to the Bulk Load Interface for ingesting into the Dgraph.
  8. After provisioning has finished, Studio updates the ingestStatus attribute of the DataSet Inventory with the final status of the provisioning (ingest) operation.

Handling of updated Hive tables

Existing BDD data sets are not automatically updated when their Hive source tables are updated. For example, assume that a data set has been created from a specific Hive table. If that Hive table is updated with new data, the associated BDD data set is not automatically changed, so the data set is no longer in sync with its Hive source table.

To update the data set from the updated Hive table, you must run the DP CLI with either the --refreshData flag or the --incrementalUpdate flag. For details, see Updating Data Sets.
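
For example, assuming the DP CLI script is named data_processing_CLI and is run from its installation directory (both assumptions), the commands look similar to the following. The argument forms are placeholders; Updating Data Sets documents the exact syntax that each flag requires:

    # Full refresh: rebuild the data set from its updated Hive source table
    ./data_processing_CLI --refreshData <dataset_logical_name>

    # Incremental update: apply only new or changed records to the data set
    ./data_processing_CLI --incrementalUpdate <dataset_logical_name>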

Handling of deleted Hive tables

BDD will never delete a Hive table, even if the associated BDD data set has been deleted from Studio. However, it is possible for a Hive administrator to delete a Hive table, even if a BDD data set has been created from that table. In this case, the BDD data set is not automatically deleted and will still be viewable in Studio. (A data set whose Hive source table was deleted is called an orphaned data set.)

The next time that the DP CLI runs, it detects the orphaned data set and runs a Data Processing job that deletes the data set.
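
For example, the following sequence (with a placeholder table name and an assumed DP CLI script name) illustrates the lifecycle: dropping the table orphans the data set, and the next DP CLI run removes it:

    # A Hive administrator drops the source table; the BDD data set becomes orphaned
    hive -e "DROP TABLE my_table;"

    # On its next run (manual or scheduled), the DP CLI detects the orphaned
    # data set and launches a Data Processing job that deletes it
    ./data_processing_CLI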

Handling of empty Hive tables

Data Processing does not process empty Hive tables. Instead, the Spark driver throws an EmptyHiveTableException when run against an empty Hive table, and the Data Processing job does not create a data set for the table. Note that the command may appear to finish successfully, but the absence of a data set means the job ultimately failed.
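
If a workflow appears to finish but no data set appears in Studio, one quick check is whether the source table is empty. For example (the table name is a placeholder):

    # Confirm that the Hive table contains at least one record
    hive -e "SELECT COUNT(*) FROM my_table;"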

Handling of Hive tables created with header/footer information

Data Processing does not support Hive tables that are based on files (such as CSV files) containing header and footer rows. The DP workflow ignores the skip.header.line.count and skip.footer.line.count properties set on such a table. If a workflow on such a table does happen to succeed, the header and footer rows are added to the resulting BDD data set as records, instead of being omitted.
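
For reference, such an unsupported table would be defined with the properties named above; the table definition and HDFS location below are illustrative only. Because Data Processing ignores these properties, strip header and footer rows from the source files before creating tables for use with BDD:

    # Data Processing ignores the skip.header/skip.footer properties in this definition
    hive -e "CREATE EXTERNAL TABLE csv_with_header (id STRING, name STRING, amount STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             STORED AS TEXTFILE
             LOCATION '/user/data/csv_with_header'
             TBLPROPERTIES ('skip.header.line.count'='1', 'skip.footer.line.count'='1');"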

Deletion of Studio projects

When a Studio user deletes a project, Data Processing is invoked to delete the transformed data sets in the project. However, data sets that have not been transformed are not deleted.