Studio-loaded files: data update diagram

The diagram in this topic shows data sets loaded in Studio by uploading a personal file or importing data from a JDBC source. It illustrates how you can reload this data set in Studio. Also, you can update the data set with DP CLI, and increase its size from sample to full.

This diagram provides a summary of data loading and update options for data sets that are loaded into BDD via Studio's file upload.

In this diagram, the following actions take place, from left to right:
  • You load data with the Studio's Add data set action. This lets you load a personal file, or data from a JDBC data source. Once loaded, the data sets appear in Catalog. BDD creates a Hive table for each data set. This Hive table is associated with the data set in Catalog.
  • If you later have a newer version of these files, you can reload them with Reload data set action, found in data set's details, in Catalog.
  • Next, if you'd like to transform, or to use Discover area to create visualizations, you should create a project. This is shown in the diagram in the middle circle. Notice that in this circle, the data set is now in your project and is no longer in Catalog. This data set can still be shown in Catalog, for others to use, if they have permissions to see it. However, now you have your own "version" of this data set, inside your project.
  • Next, you can choose to load full data into this data set with Load full data, found in Data Set Manager. This replaces the data set sample with a fully loaded data set within your project. Or, if the data set was already fully loaded, you can reload the full data set in your project, by using this option again. This is shown as a circular arrow above the middle circle.
  • Alternatively, you can use the options from Data Processing CLI to run scripted updates. There are two types of scripted updates: Refresh Data and Incremental update.
  • To run Refresh Data with DP CLI, identify the data set's logical name in the properties for this data set in Studio. You must use that data set's logical name as a parameter for the data refresh command in DP CLI.
  • To run Incremental update with DP CLI, you must specify a primary key (also known as record identifier in Studio). To add a primary key, use Control Panel > Data Set Manager > Configure for Updates. You also need the data set's logical name for the incremental update.
  • Be careful to use the correct data set logical name. Notice that a data set in Catalog differs from the data set in a project.
  • Notice that you can run Reload data set in Catalog only for data sets in Catalog. If you want to update a data set that is already added to a project, you need to use one of the scripted updates from DP CLI.

    If you reload a data set in Catalog that has a private copy in a project, Studio alerts you to this when you open the project, and gives you a chance to accept these updates. If any transformations were made to the data set in the project, you can apply them to the data set in Catalog.

Note the following about this diagram:
  • You can Refresh Data with DP CLI for data sets in Catalog and in projects. Typically, you use DP CLI Refresh Data for data sets in a project.
  • You can run an Incremental Update with DP CLI after you specify a record identifier. For this, you must move the data set into a project in Studio.
  • Once you load full data, this does not change the data set that appears in Catalog. Moving a data set to a project is similar to creating your personal version of the data set. Next, you can load full data and write scripted updates to this data set using DP CLI commands Refresh data and Incremental update. You can run these updates periodically, as cron jobs. The updates will run on your personal version of this data set in this project. This way, your version of this data set is independent of the data set's version that appears in Catalog.