Overview of workflows

This topic provides an overview of Data Processing workflows.

When the Data Processing component runs, it performs a series of steps; these steps are called a Data Processing workflow. Various workflows exist: for loading initial data, for updating data, and for cleaning up unused data sets.

All Data Processing workflows are launched either from Studio (in which case they run automatically) or from the DP CLI (Command Line Interface) utility.

In either case, a running workflow manifests itself in various parts of the Studio user interface, such as Explore and Transform. For example, new source data sets become available for discovery in Explore, and you can make changes to the project data sets in Transform. Behind these actions lie the processes in Big Data Discovery known as Data Processing workflows. This guide describes these processes in detail.

For example, a Data Processing (DP) workflow for loading data is the process of extracting data and metadata from a Hive table and ingesting the result as a data set in the Dgraph. The extracted data is turned into Dgraph records, while the metadata provides the schema for those records, including the Dgraph attributes that define the BDD data set.
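
For instance, a source Hive table might be defined as follows (the table and column names here are hypothetical; any table registered in the Hive metastore can serve as a source):

    CREATE TABLE warranty_claims (
        claim_id    INT,
        claim_date  STRING,
        dealer_name STRING,
        claim_total DOUBLE
    );

When such a table is loaded, each row becomes a Dgraph record, and the column names and types supply the schema for the corresponding Dgraph attributes.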

Once data sets are ingested into the Dgraph, Studio users can view the data sets and query the records in them. Studio users can also modify (transform) the data set and even delete it.

All Data Processing jobs are run by Spark workers. Data Processing runs asynchronously: it puts one Spark job on the queue for each Hive table. When the Spark job for the first Hive table finishes, the Spark job for the second table starts, and so on.
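
As an illustration only (this is not BDD code), the queuing behavior resembles a simple first-in, first-out loop in which each table's job must finish before the next one starts; the table names below are hypothetical:

    from collections import deque

    def run_spark_job(table):
        # Stand-in for the Spark job that Data Processing runs for one Hive table.
        print("Processing Hive table: " + table)

    # One queued job per Hive table; jobs run one at a time, in order.
    job_queue = deque(["warranty_claims", "sales_data", "web_logs"])

    while job_queue:
        run_spark_job(job_queue.popleft())  # the next job starts only after this one returns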

Note that although a Studio user can delete a BDD data set, the Data Processing component of BDD can never delete a Hive table. Therefore, it is up to the Hive administrator to delete obsolete Hive tables.

DataSet Inventory

The DataSet Inventory (DSI) is an internal structure that lets Data Processing keep track of the available data sets. Each data set in the DSI includes metadata that describes the characteristics of that data set. For example, when a data set is first created, the names of the source Hive table and the source Hive database are stored in the metadata for that data set. The metadata also includes the schemas of the data sets.

The DataSet Inventory contains an ingestStatus attribute for each data set, which indicates whether the data set has been completely provisioned (and is therefore ready to be added to a Studio project). Studio sets this attribute after the Dgraph HDFS Agent notifies it that the ingest has completed.
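
Conceptually, an entry in the DataSet Inventory can be pictured as a record like the following sketch. The ingestStatus attribute and the Hive source fields correspond to the metadata described above; the key names other than ingestStatus, and all of the values, are illustrative rather than the actual internal format:

    dataset_entry = {
        "sourceHiveDatabase": "default",                      # hypothetical value
        "sourceHiveTable": "warranty_claims",                 # hypothetical value
        "schema": ["claim_id", "claim_date", "claim_total"],  # attribute schema of the data set
        "ingestStatus": "provisioned",                        # assumed value; set by Studio when ingest completes
    }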

Language setting for attributes

During a normal Data Processing workflow, the language setting for all attributes is either a specific language (such as English or French) or unknown (which means a DP workflow does not use a language code for any specific language). The default language is set at install time for Studio and the DP CLI by the LANGUAGE property of the bdd.conf file. However, both Studio and the DP CLI can override the default language setting and specify a different language code for a workflow. For a list of supported languages, see Supported languages.
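
For example, the default-language line in bdd.conf might look like the following sketch (the value shown is an assumption; it could be a code for a specific language or unknown, as described above):

    LANGUAGE=en

A workflow launched from Studio or the DP CLI can then specify a different language code, which takes precedence over this installation-wide default.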