Workflow for loading new data

This topic discusses the workflow that runs inside the Data Processing component of BDD when new data is loaded.

The Data Processing workflow shown in this topic is for loading data; it is one of many possible workflows. This workflow does not show updating data that has already been loaded. For information on running Refresh and Incremental update operations, see Updating Data Sets.

Loading new data includes these stages:
  • Discovery of source data in Hive tables
  • Loading and creating a sample of a data set
  • Running a select set of enrichments on this data set (if so configured)
  • Profiling the data
  • Transforming the data set
  • Exporting data from Big Data Discovery into Hadoop

You launch the Data Processing workflow for loading new data either from Studio (by creating a Hive table) or by running the Data Processing CLI (Command Line Interface) utility. As a Hadoop system administrator, you can control some steps in this workflow, while other steps run automatically in Hadoop.
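
As a rough illustration of the starting point of this workflow, the following PySpark sketch registers source data as a Hive table that a subsequent data load could discover. The HDFS path, table name, and column handling are assumptions made for illustration; they are not part of BDD itself.

  from pyspark.sql import SparkSession

  # Illustrative session with Hive support, so that saveAsTable writes to the
  # Hive metastore from which BDD discovers tables.
  spark = (SparkSession.builder
           .appName("register-source-table")
           .enableHiveSupport()
           .getOrCreate())

  # Read raw source data from an illustrative HDFS location.
  orders = spark.read.option("header", "true").csv("hdfs:///data/raw/orders.csv")

  # Persist it as a Hive table; Studio or the DP CLI can then pick it up
  # as the starting point of the data loading workflow.
  orders.write.mode("overwrite").saveAsTable("default.orders")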

The following diagram illustrates how the data processing workflow for loading new data fits within Big Data Discovery:

The steps in this diagram are:
  1. The workflow for data loading starts either from Studio or the Data Processing CLI.
  2. The Spark job is launched on Hadoop nodes that have the Data Processing portion of Big Data Discovery installed on them.
  3. The counting, sampling, discovery, and transformations take place and are processed on Hadoop nodes. The information is written to HDFS and sent back.
  4. The data processing workflow launches the process of loading the records and their schema into the Dgraph, for each data set.

To summarize, during an initial data load, the Data Processing component of Big Data Discovery counts data in Hive tables, and optionally performs data set sampling. It then runs an initial data profiling, and applies some enrichments. These stages are discussed in this topic.

Sampling of a data set

If you work with a sampled subset of the records from large tables discovered in HDFS, you are using sample data as a proxy for the full tables. This lets you:
  • Avoid latency and increase the interactivity of data analysis in Big Data Discovery
  • Analyze the data as if using the full set.

Data Processing does not always perform sampling; sampling occurs only if a source data set contains more records than the default sample size set during BDD deployment, which is 1 million records. When you subsequently run the data processing workflow yourself, using the Command Line Interface (DP CLI), you can override the default sample size and specify your own.

Note:

If the number of records in the source data set is less than the value specified for the sample size, then no sampling takes place and Data Processing loads the source data in full.

Samples in BDD are taken as follows:
  • Data Processing takes a random sample of the data, using either the default sample size or the size you specify. BDD uses Spark's built-in random sampling functionality.
  • Based on the number of rows in the source data and the number of rows requested for the sample, BDD passes through the source data and includes each record in the sample with the same probability. As a result, Data Processing creates a simple random sample of records, in which:
    • Each element has the same probability of being chosen
    • Each subset of the same size has an equal probability of being chosen.

These properties, combined with the large absolute size of the data sample, mean that samples taken by Big Data Discovery allow you to make reliable generalizations about the entire corpus of data.
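
The following PySpark sketch illustrates the sampling logic described above: it derives an inclusion probability from the requested sample size and the total row count, then delegates to Spark's built-in random sampling. This is an analogy to what Data Processing does, not BDD's internal implementation; the table name and seed are assumptions.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  source = spark.table("default.orders")   # illustrative source table
  sample_size = 1_000_000                  # default sample size mentioned above
  total_rows = source.count()              # the counting step of the workflow

  if total_rows <= sample_size:
      # Fewer records than the sample size: load the source data in full.
      sample = source
  else:
      # Each record is included with the same probability, producing a simple
      # random sample of roughly sample_size records.
      fraction = sample_size / total_rows
      sample = source.sample(withReplacement=False, fraction=fraction, seed=42)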

Profiling of a data set

Profiling is a process that determines the characteristics (columns) of each source Hive table discovered by the Data Processing component of Big Data Discovery during data load.

Profiling is carried out by the data processing workflow for loading data and results in the creation of metadata information about a data set, including:
  • Attribute value distributions
  • Attribute type
  • Topics
  • Classification

For example, a specific data set can be recognized as a collection of structured data, social data, or geographic data.

Using Explore in Studio, you can then look deeper into the distribution of attribute values or types. Later, using Transform, you can change some of this metadata. For example, you can replace null attribute values with actual values, or fix other inconsistencies.
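
As an informal analogy to the profiling step, the sketch below walks a sampled data set and records each attribute's type and its most frequent values. This is not BDD's internal profiler; the table name is an assumption.

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()
  sample = spark.table("default.orders")   # illustrative sampled data set

  profile = {}
  for field in sample.schema.fields:
      # Most frequent values for this attribute (a crude value distribution).
      top_values = (sample.groupBy(field.name)
                          .count()
                          .orderBy(F.desc("count"))
                          .limit(10)
                          .collect())
      profile[field.name] = {
          "type": field.dataType.simpleString(),                        # attribute type
          "top_values": [(r[field.name], r["count"]) for r in top_values],
      }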

Enrichments

Enrichments are additional pieces of information derived from a data set, such as terms, locations, the language used, sentiment, and views. Big Data Discovery determines which enrichments are useful for each discovered data set, and automatically runs them on samples of the data. As a result of automatically applied enrichments, additional derived metadata (columns) is added to the data set, such as geographic data, a suggestion of the detected language, or positive or negative sentiment.
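
To make the idea of a derived column concrete, the sketch below adds a detected-language attribute to a sampled data set. BDD ships its own enrichment modules; the open-source langdetect package and the table and column names here are illustrative stand-ins only.

  from langdetect import detect, LangDetectException
  from pyspark.sql import SparkSession, functions as F, types as T

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()
  sample = spark.table("default.orders")          # illustrative sampled data set

  @F.udf(returnType=T.StringType())
  def detect_language(text):
      # Return an ISO language code such as "en", or None if detection fails.
      if not text:
          return None
      try:
          return detect(text)
      except LangDetectException:
          return None

  # The derived column mirrors how an enrichment adds metadata to a data set.
  enriched = sample.withColumn("comment_lang", detect_language(F.col("comment")))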

The data sets with this additional information appear in Catalog in Studio. This provides initial insight into each discovered data set, and lets you decide if the data set is a useful candidate for further exploration and analysis.

In addition to automatically-applied enrichments, you can also apply enrichments to a project data set using Transform in Studio. From Transform, you can configure parameters for each type of enrichment. In this case, an enrichment is simply another type of available transformation.

Some enrichments allow you to add additional derived meaning to your data sets, while others allow you to address invalid or inconsistent values.

Transformations

Transformations are changes to a data set. Transformations allow you to perform actions such as:
  • Changing data types
  • Changing capitalization of values
  • Removing attributes or records
  • Splitting columns
  • Grouping or binning values
  • Extracting information from values

You can think of transformations as a substitute for an ETL process of cleaning your data before or during data loading. You could use transformations to overwrite an existing attribute or to create new attributes. Some transformations are enrichments, and as such, are applied automatically when data is loaded.
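
The sketch below shows plain PySpark equivalents of several of the transformations listed above (changing a data type, changing capitalization, splitting a column, binning values, and removing an attribute). It is not the Transform syntax used in Studio; the table and column names are assumptions.

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()
  df = spark.table("default.orders")               # illustrative project data set

  transformed = (df
      # Change a data type.
      .withColumn("order_total", F.col("order_total").cast("double"))
      # Change capitalization of values.
      .withColumn("country", F.upper(F.col("country")))
      # Split a column and keep the first part.
      .withColumn("city", F.split(F.col("address"), ",").getItem(0))
      # Group/bin values into a new derived attribute.
      .withColumn("price_band",
                  F.when(F.col("order_total") < 100, "low").otherwise("high"))
      # Remove an attribute.
      .drop("internal_notes"))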

Most transformations are available directly as specific options in Transform in Studio. Once the data is loaded, you can use the list of predefined Transform functions to create a transformation script.

For a full list of transformations available in BDD, including aggregations and joining of data sets, see the Studio User's Guide.

Exporting data from Big Data Discovery into HDFS

You can export the results of your analysis from Big Data Discovery into HDFS/Hive; this is known as exporting to HDFS.

From the perspective of Big Data Discovery, the process is about exporting the files from Big Data Discovery into HDFS/Hive. From the perspective of HDFS, you are importing the results of your work from Big Data Discovery into HDFS. In Big Data Discovery, the Dgraph HDFS Agent is responsible for exporting to HDFS and importing from it.
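
For context, the sketch below shows what an exported result might look like from the Hadoop side: a Hive table or a set of files in HDFS that other tools can read. In BDD the export itself is performed by the Dgraph HDFS Agent; this code is only an illustrative equivalent, and the names and paths are assumptions.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()
  results = spark.table("default.orders_enriched")          # hypothetical result set

  # Land the results as a Hive table that other Hadoop tools can query...
  results.write.mode("overwrite").saveAsTable("default.orders_export")

  # ...or write the files directly to an HDFS location.
  results.write.mode("overwrite").parquet("hdfs:///user/bdd/exports/orders")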