This topic discusses the workflow that runs inside the Data Processing component of BDD when new data is loaded.
The Data Processing workflow shown in this topic is for loading data; it is one of many possible workflows. This workflow does not show updating data that has already been loaded. For information on running Refresh and Incremental update operations, see Updating Data Sets.
You launch the Data Processing workflow for loading new data either from Studio (by creating a Hive table), or by running the Data Processing CLI (Command Line Interface) utility. As a Hadoop system administrator, you can control some steps in this workflow, while other steps run automatically in Hadoop.
The following diagram illustrates how the data processing workflow for loading new data fits within Big Data Discovery:
To summarize, during an initial data load, the Data Processing component of Big Data Discovery counts data in Hive tables, and optionally performs data set sampling. It then runs initial data profiling and applies some enrichments. These stages are discussed in this topic.
Sampling of a data set
Data Processing does not always perform sampling; sampling occurs only if a source data set contains more records than the sample size configured during BDD deployment. The default sample size used during deployment is 1 million records. When you subsequently run the data processing workflow yourself, using the Command Line Interface (DP CLI), you can override the default sample size and specify your own.
Note:
If the number of records in the source data set is less than the value specified for the sample size, then no sampling takes place and Data Processing loads the source data in full. These requirements, combined with the large absolute size of the data sample, mean that samples taken by Big Data Discovery allow for making reliable generalizations about the entire corpus of data.
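The sampling decision described above can be sketched in a few lines. This is a minimal illustration only, not BDD's actual implementation; the function name and the use of simple random sampling are assumptions for the example.

```python
import random

DEFAULT_SAMPLE_SIZE = 1_000_000  # BDD's default sample size at deployment

def select_records(records, sample_size=DEFAULT_SAMPLE_SIZE, seed=42):
    """Return the full data set when it fits within the sample size,
    otherwise a simple random sample of `sample_size` records.
    (Illustrative sketch; not BDD's actual sampling code.)"""
    if len(records) <= sample_size:
        return list(records)  # no sampling: load the source data in full
    rng = random.Random(seed)
    return rng.sample(records, sample_size)

# Small data set: loaded in full.
print(len(select_records(list(range(100)), sample_size=1000)))   # 100

# Large data set: sampled down to the requested size.
print(len(select_records(list(range(5000)), sample_size=1000)))  # 1000
```

Overriding `sample_size` here mirrors what the DP CLI lets you do when you override the deployment default.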
Profiling of a data set
Profiling is a process that determines the characteristics (columns) of each source Hive table discovered by Data Processing in Big Data Discovery during data load.
Using Explore in Studio, you can then look deeper into the distribution of attribute values or types. Later, using Transform, you can change some of this metadata. For example, you can replace null attribute values with actual values, or fix other inconsistencies.
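To make the idea of profiling concrete, the toy sketch below infers a type and basic statistics for each column of a sampled table. The function name and the metadata it records are assumptions for illustration; BDD's real profiling captures considerably more.

```python
def profile_column(values):
    """Infer a simple type and basic stats for one column -- a toy
    stand-in for the column metadata a profiling step records."""
    non_null = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in non_null):
        inferred = "numeric"
    else:
        inferred = "string"
    return {
        "type": inferred,
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }

columns = {"price": [9.99, 12.50, None], "city": ["Lyon", "Lyon", "Oslo"]}
profile = {name: profile_column(vals) for name, vals in columns.items()}
print(profile["price"])  # {'type': 'numeric', 'null_count': 1, 'distinct_count': 2}
```

The `null_count` field is the kind of inconsistency you would later address with Transform, as noted above.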
Enrichments
Enrichments are derived from a data set's additional information such as terms, locations, the language used, sentiment, and views. Big Data Discovery determines which enrichments are useful for each discovered data set, and automatically runs them on samples of the data. As a result of automatically applied enrichments, additional derived metadata (columns) are added to the data set, such as geographic data, a suggestion of the detected language, or positive or negative sentiment.
The data sets with this additional information appear in Catalog in Studio. This provides initial insight into each discovered data set, and lets you decide if the data set is a useful candidate for further exploration and analysis.
In addition to automatically-applied enrichments, you can also apply enrichments using Transform in Studio, for a project data set. From Transform, you can configure parameters for each type of enrichment. In this case, an enrichment is simply another type of available transformation.
Some enrichments allow you to add additional derived meaning to your data sets, while others allow you to address invalid or inconsistent values.
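As a rough illustration of how an enrichment adds derived columns, the sketch below derives a term list and a sentiment label from a text attribute. The word lists and scoring rule are invented for this example; BDD's actual enrichment models are far richer.

```python
POSITIVE = {"great", "excellent", "love"}
NEGATIVE = {"poor", "terrible", "hate"}

def enrich(text):
    """Derive toy enrichment columns (terms, sentiment) from a text
    attribute -- illustrative only, not BDD's enrichment logic."""
    terms = text.lower().split()
    score = sum(t in POSITIVE for t in terms) - sum(t in NEGATIVE for t in terms)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"terms": terms, "sentiment": sentiment}

record = {"review": "Great battery, terrible screen, great price"}
record.update(enrich(record["review"]))  # adds derived columns to the record
print(record["sentiment"])  # positive
```

The derived `terms` and `sentiment` keys play the role of the additional derived columns that appear alongside the original attributes in Catalog.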
Transformations
You can think of transformations as a substitute for an ETL process of cleaning your data before or during the data loading process. You could use transformations to overwrite an existing attribute, or to create new attributes. Some transformations are enrichments and, as such, are applied automatically when data is loaded.
Most transformations are available directly as specific options in Transform in Studio. Once the data is loaded, you can use a list of predefined Transform functions to create a transformation script.
For a full list of transformations available in BDD, including aggregations and joining of data sets, see the Studio User's Guide.
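The shape of a transformation step can be sketched as follows: a function applied to one attribute across every record, either overwriting it or deriving a new one. The helper name is hypothetical; Studio's Transform functions work on data sets, not Python lists.

```python
def transform(rows, column, func):
    """Overwrite `column` in every row with func(old_value) -- a minimal
    stand-in for one step of a transformation script."""
    for row in rows:
        row[column] = func(row.get(column))
    return rows

data = [{"qty": 3}, {"qty": None}, {"qty": 7}]
# Replace null values with 0, as a Transform step might.
transform(data, "qty", lambda v: 0 if v is None else v)
print([r["qty"] for r in data])  # [3, 0, 7]
```

Chaining several such steps is the essence of a transformation script: each step reads the attributes left by the previous one.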
Exporting data from Big Data Discovery into HDFS
You can export the results of your analysis from Big Data Discovery into HDFS/Hive; this is known as exporting to HDFS.
From the perspective of Big Data Discovery, the process is about exporting the files from Big Data Discovery into HDFS/Hive. From the perspective of HDFS, you are importing the results of your work from Big Data Discovery into HDFS. In Big Data Discovery, the Dgraph HDFS Agent is responsible for exporting to HDFS and importing from it.
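Conceptually, the export step serializes your project's records into flat files that HDFS and Hive can consume. The sketch below writes tab-separated text to an in-memory buffer; the function name, delimiter, and column ordering are assumptions for illustration, and the actual export is performed by the Dgraph HDFS Agent, not user code.

```python
import csv
import io

def export_rows(rows, fileobj):
    """Serialize rows as tab-separated text with a header -- the kind of
    flat file an export-to-HDFS step could hand off (illustrative only)."""
    writer = csv.DictWriter(fileobj, fieldnames=sorted(rows[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
export_rows([{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lyon"}], buf)
print(buf.getvalue().splitlines()[0])  # header row: city, id
```

In a real deployment the same bytes would land in an HDFS path, from which Hive could read them as an external table.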