About data set updates

You can update data sets by running Refresh updates and Incremental updates with the DP CLI.

When first created, a BDD data set may be sampled, which means that the BDD data set has fewer records than its source Hive table. In addition, records that are added to the source Hive table after the data set is created are not added to the data set by default.

Two DP CLI operations are available that enable the BDD administrator to synchronize a data set with its source Hive table:
  • The --refreshData flag (abbreviated as -refresh) performs a full data refresh on a BDD data set from the original Hive table. This means that the data set will have all records from the source Hive table. If the data set had previously been sampled, it will now be a full data set. As records are added to the Hive table, you can run the Refresh update operation again to keep the data set synchronized with its source Hive table.
  • The --incrementalUpdate flag (abbreviated as -incremental) performs an incremental update on a BDD data set from the original Hive table, using a filter predicate to select the new records. Note that this operation can be run only after the data set has been configured for Incremental updates.
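
The difference between the two operations can be illustrated with a toy model. The sketch below is not DP CLI code: plain Python lists stand in for the Hive table and the data set, and the record layout, the function names, and the id > 5 predicate are all illustrative assumptions. It also assumes that an Incremental update adds the records selected by the predicate to the existing data set.

```python
# Toy model only: lists of dicts stand in for the Hive table and the BDD data set.
hive_table = [{"id": i, "claim": f"claim-{i}"} for i in range(1, 6)]

# A sampled data set holds fewer records than its source table.
data_set = hive_table[:3]


def refresh_update(source):
    """Full refresh: the data set ends up with every record in the source."""
    return list(source)


def incremental_update(current, source, predicate):
    """Incremental update: add only the source records selected by the predicate."""
    known_ids = {rec["id"] for rec in current}
    new_records = [
        rec for rec in source
        if predicate(rec) and rec["id"] not in known_ids
    ]
    return current + new_records


# A Refresh update turns the sampled data set into a full one.
data_set = refresh_update(hive_table)
full_count = len(data_set)

# Later, a new record arrives in the Hive table.
hive_table.append({"id": 6, "claim": "claim-6"})

# An Incremental update picks up only the records matching the predicate.
data_set = incremental_update(data_set, hive_table, lambda rec: rec["id"] > 5)
final_ids = [rec["id"] for rec in data_set]

print(full_count)
print(final_ids)
```

In this model, the Refresh update re-reads the whole table, while the Incremental update touches only the records that the predicate selects, which is why the latter needs a predicate that reliably identifies new records.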

Note that the equivalent of a DP CLI Refresh update can be done in Studio via the Load Full Data Set feature. However, Incremental updates can be performed only via the DP CLI, as Studio does not support this feature.
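
For scripted updates, the two flags might be wrapped as in the following sketch. The flag names come from this topic; everything else is an assumption to verify against your installation. The launcher name ./data_processing_CLI, the data set key 10128:WarrantyClaims, the argument order, and the filter predicate claim_year > 2015 are placeholders. The sketch only builds and prints the command lines; it does not run them.

```python
import shlex

DP_CLI = "./data_processing_CLI"  # assumed launcher name; adjust for your install


def refresh_command(data_set_key):
    """Build the argument list for a Refresh update (full reload)."""
    return [DP_CLI, "--refreshData", data_set_key]


def incremental_command(data_set_key, filter_predicate):
    """Build the argument list for an Incremental update.

    Assumes the filter predicate follows the data set key as a single argument.
    """
    return [DP_CLI, "--incrementalUpdate", data_set_key, filter_predicate]


# Hypothetical data set key and predicate, for illustration only.
refresh_args = refresh_command("10128:WarrantyClaims")
incremental_args = incremental_command("10128:WarrantyClaims", "claim_year > 2015")

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(refresh_args))
print(shlex.join(incremental_args))
```

Building the arguments as a list keeps the filter predicate intact as one argument, so spaces and comparison operators in it are not interpreted by the shell.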

Re-pointing a data set

If you created a data set by uploading source data into Studio and want to run Refresh and Incremental updates, you should change the source data set to point to a new Hive table. (Note that this change is not required if the data set is based on a table created directly in Hive.) For information on this re-pointing operation, see the topic on converting a project to a BDD application in the Studio User's Guide.