Configuring a project data set for incremental updates

To prepare your data set for an incremental update, you must first load a full data set and provide a record identifier so that Studio can determine the incremental changes to apply during the update.

The incremental update itself is performed with the Data Processing CLI (DP CLI). In Studio, you only prepare the project data set for this type of update; you then run the update against that data set with DP CLI.

An incremental update lets you add newer data to an existing data set in a project without removing data that is already loaded. It is most useful when you want to keep the loaded data and continue adding new data to it.

Note:

You cannot run an incremental update from Studio. For details and a diagram of the incremental update workflow, see the Getting Started Guide. For full information on how to run an incremental update with DP CLI, see the Data Processing Guide.

The following diagram shows the Configure for Updates action. This action prepares your data set for running incremental updates with DP CLI:


Figure: the data lifecycle, including the load full step and the Configure for Updates step.

In this diagram, from left to right, the following actions take place: you load a data set into Studio from a file or a JDBC source. Next, you add the data set to a project and load it in full. You can then run Configure for Updates. Note that you can run Configure for Updates only for a data set that is already in a project and has already been loaded in full.

The Configure for Updates action performs two tasks: it loads the full data set (by re-running the operation for loading a full data set) and lets you configure a record identifier. DP CLI then uses the record identifier when it runs an incremental update for the data set.

The record identifier must uniquely identify each record, so that the delta between records in the Hive table and records in the project data set can be determined. In practice, some project data sets need a record identifier that combines several attributes, while in others a single attribute already works as a unique record identifier.
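
To illustrate why the identifier has to be unique, the following Python sketch shows one way a delta could be derived from a record identifier. It is not how Data Processing is implemented; the function names (`record_key`, `find_new_records`) and the sample attributes (`order_id`, `line_no`, `item`) are hypothetical, and the sketch covers only the case of records that are new in the source.

```python
def record_key(record, key_attributes):
    """Build a (possibly composite) identifier from the chosen attributes."""
    return tuple(record[name] for name in key_attributes)


def find_new_records(source_records, loaded_records, key_attributes):
    """Return source records whose identifier is absent from the loaded set."""
    loaded_keys = {record_key(r, key_attributes) for r in loaded_records}
    return [r for r in source_records
            if record_key(r, key_attributes) not in loaded_keys]


# Hypothetical sample data: two records already loaded, one new in the source.
loaded = [
    {"order_id": 1, "line_no": 1, "item": "pen"},
    {"order_id": 1, "line_no": 2, "item": "ink"},
]
source = loaded + [{"order_id": 2, "line_no": 1, "item": "pad"}]

# The combination of order_id and line_no is unique for every record.
delta = find_new_records(source, loaded, ["order_id", "line_no"])
print(delta)

# line_no alone is not unique, so the new record is mistaken for a loaded one.
missed = find_new_records(source, loaded, ["line_no"])
print(missed)
```

With the composite identifier, only the new record is reported. With `line_no` alone, the new record collides with an existing key and the delta comes back empty, which is the kind of error a fully unique identifier prevents.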

Studio helps you identify which attributes in your data set are good candidates for a record identifier. When you select an attribute from the list as a record identifier, Studio calculates the percentage of records in your data set that have unique values for the selected attribute or combination of attributes. This uniqueness score must be 100%; otherwise, the Data Processing workflow fails and returns an exception to Studio.
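
As a rough illustration of such a score, the sketch below computes the percentage of records whose key value occurs exactly once. This is an assumption about how the percentage could be defined, not Studio's actual calculation; the function name `key_uniqueness` and the sample attributes are hypothetical.

```python
from collections import Counter


def key_uniqueness(records, key_attributes):
    """Percentage of records whose key combination occurs exactly once."""
    if not records:
        return 100.0  # an empty data set contains no duplicate keys
    keys = [tuple(r[name] for name in key_attributes) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return 100.0 * unique / len(keys)


# Hypothetical sample data: order_id repeats, but order_id + line_no does not.
records = [
    {"order_id": 1, "line_no": 1},
    {"order_id": 1, "line_no": 2},
    {"order_id": 2, "line_no": 1},
    {"order_id": 3, "line_no": 1},
]

print(key_uniqueness(records, ["order_id"]))             # 50.0
print(key_uniqueness(records, ["order_id", "line_no"]))  # 100.0
```

Under this definition, a single attribute that repeats across records scores below 100%, and adding a second attribute can raise the score to 100%.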

To check if a data set already has a record identifier defined in Studio, go to the Data Set Manager page and see if there is a Record Identifiers property specified.

To configure a project data set for incremental updates:

  1. From the Configuration Options menu, select Project Settings.
  2. Select Data Set Manager and expand options next to the data set name.
  3. Select Configure for Updates.
  4. From the list, select an attribute for the data set. If the Key Uniqueness for the attribute is not 100%, click + Attribute to add another attribute. Repeat until the Key Uniqueness reaches 100%.
    Studio combines these attributes into a new unique record identifier for the data set.
  5. Click Configure for Updates.
    After you click this option, a Load Full Data Set operation starts automatically as a background process.
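
The attribute selection in step 4 can be pictured as the following loop, written as a minimal Python sketch. It only mirrors the manual procedure of adding attributes until every record has a distinct key; the names `is_unique_key` and `build_identifier` and the sample attributes are hypothetical, and the order in which attributes are tried is simply the order you would pick them in Studio.

```python
def is_unique_key(records, key_attributes):
    """True if no two records share the same combination of values."""
    keys = [tuple(r[name] for name in key_attributes) for r in records]
    return len(set(keys)) == len(keys)


def build_identifier(records, candidate_attributes):
    """Add attributes one at a time until every record has a distinct key.

    Returns the chosen attributes, or None if even the full combination
    of candidates does not make every record distinct.
    """
    chosen = []
    for attribute in candidate_attributes:
        chosen.append(attribute)
        if is_unique_key(records, chosen):
            return chosen
    return None


# Hypothetical sample data: no single attribute is unique on its own.
records = [
    {"store": "A", "day": "Mon", "shift": 1},
    {"store": "A", "day": "Mon", "shift": 2},
    {"store": "A", "day": "Tue", "shift": 1},
    {"store": "B", "day": "Mon", "shift": 1},
]

identifier = build_identifier(records, ["store", "day", "shift"])
print(identifier)
```

If the function returns None, no combination of the candidates tried is unique, which corresponds to a Key Uniqueness below 100% in Studio.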

At this point you can schedule and run incremental updates using the IncrementalUpdate command of the Data Processing CLI. For details, see the Data Processing Guide.