This topic provides an overview of Data Processing workflows.
A Data Processing (DP) workflow is the process of extracting data and metadata from a Hive table and ingesting it as a data set in the Dgraph. The extracted data is turned into Dgraph records while the metadata provides the schema for the records, including the Dgraph attributes that define the BDD data set. Data Processing workflows are launched from Studio or by running the DP CLI (command line interface) utility.
Once data sets are ingested into the Dgraph, Studio users can view the data sets and query the records in them. Studio users can also modify (transform) a data set or delete it.
A Data Processing job is run by a Spark worker. Data Processing runs asynchronously: it puts a Spark job on the queue for each Hive table. When the Spark job for the first Hive table finishes, the Spark job for the second Hive table starts, and so on.
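The queueing behavior described above can be pictured with a small sketch. This is an illustration only, not BDD code: the function name run_workflow and the run_spark_job callback are hypothetical, and the sketch assumes that jobs are taken from the queue strictly in the order in which their Hive tables were submitted.

```python
from collections import deque


def run_workflow(hive_tables, run_spark_job):
    """Queue one job per Hive table and run the jobs one at a time, in order.

    `run_spark_job` is a hypothetical callback standing in for the Spark job
    that processes a single Hive table; it is not a BDD or Spark API.
    """
    queue = deque(hive_tables)      # one queued job per Hive table
    completed = []
    while queue:
        table = queue.popleft()     # take the oldest queued job
        run_spark_job(table)        # the next job starts only after this returns
        completed.append(table)
    return completed


if __name__ == "__main__":
    run_workflow(["sales", "customers"], lambda t: print(f"processing {t}"))
```

In this sketch the caller only enqueues work; each table is processed after the previous one has finished, which mirrors the one-job-per-table ordering described in the text.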
Note that although a BDD data set can be deleted by a Studio user, the Data Processing component of BDD software can never delete a Hive table. Therefore, it is up to the Hive administrator to delete obsolete Hive tables.
The DataSet Inventory (DSI) is an internal structure that lets Data Processing keep track of the available data sets. Each data set in the DSI includes metadata that describes the characteristics of that data set. For example, when a data set is first created, the names of the source Hive table and the source Hive database are stored in the metadata for that data set. The metadata also includes the schemas of the data sets.
The DataSet Inventory contains an ingestStatus attribute for each data set, which indicates whether the data set has been completely provisioned (and is therefore ready to be added to a Studio project). The Dgraph HDFS Agent sets this attribute to denote the completion of an ingest.
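As a rough mental model, a DSI entry can be thought of as a record that carries the source Hive names, the schema, and the ingest status. The sketch below is an assumption-laden illustration, not the actual DSI layout: the class and field names are hypothetical, and the status values "IN_PROGRESS" and "COMPLETE" are placeholders, because the real values of ingestStatus are not given here.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetEntry:
    """Hypothetical stand-in for one data set's metadata in the DSI."""
    name: str
    source_hive_database: str
    source_hive_table: str
    schema: dict = field(default_factory=dict)   # attribute name -> type
    ingest_status: str = "IN_PROGRESS"           # placeholder value, not the real one


def mark_ingest_complete(entry):
    """Models the Dgraph HDFS Agent flagging the end of an ingest."""
    entry.ingest_status = "COMPLETE"             # placeholder value


def ready_for_project(inventory):
    """Return the names of data sets that are fully provisioned."""
    return [e.name for e in inventory if e.ingest_status == "COMPLETE"]


if __name__ == "__main__":
    sales = DataSetEntry("sales", "default", "sales_raw", {"amount": "double"})
    customers = DataSetEntry("customers", "default", "customers_raw")
    mark_ingest_complete(sales)
    print(ready_for_project([sales, customers]))
```

Only data sets whose status has been set to the completed value are returned, which corresponds to the point at which a data set can be added to a Studio project.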
During a normal Data Processing workflow, the default language setting for all attributes is unknown (which means that a DP workflow does not use a language code for any specific language). Both Studio and the DP CLI utility can be configured with a specific language code to be used for a workflow.