Refresh updates

A Refresh update replaces the schema and all the records in a project data set with the schema and records in the source Hive table.

The DP CLI --refreshData flag (abbreviated as -refresh) performs a full data refresh on a BDD data set from the original Hive table. The data set must be a project data set (that is, it must have been added to a Studio project). Loading the full data set affects only the data set in that specific project; it does not affect the data set as it displays in the Studio Catalog.

Running a Refresh update produces the following results:

All records stored in the Hive table are loaded for that data set. This includes any table updates performed by a Hive administrator.
If the data set was sampled, it is increased to the full size of the data set. That is, it is now a full data set.
If the data set contains a transformation script, that script will be run against the full data set, so that all transformations apply to the full data set in the project.
If the --disableSearch flag is also used, record search and value search will be disabled for the data set.
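
As a rough illustration of how the flags described above combine, the following Python sketch assembles an argument list for a Refresh update without running it. Only the --refreshData and --disableSearch flags come from this topic. The script path ./data_processing_CLI, the placement of the data set's logical name directly after --refreshData, and the helper name build_refresh_command are assumptions made for illustration; the Refresh flag syntax topic is the authoritative reference.

```python
from typing import List


def build_refresh_command(logical_name: str,
                          disable_search: bool = False,
                          cli_path: str = "./data_processing_CLI") -> List[str]:
    """Assemble (but do not run) an argument list for a Refresh update.

    Assumptions: the CLI script path and the placement of the data set's
    logical name directly after --refreshData are illustrative only.
    """
    if not logical_name:
        raise ValueError("a data set logical name is required")
    command = [cli_path, "--refreshData", logical_name]
    if disable_search:
        # Record search and value search are disabled for the data set.
        command.append("--disableSearch")
    return command


# "my_data_set" is a placeholder logical name, not a real data set.
refresh_args = build_refresh_command("my_data_set", disable_search=True)
```

Returning a list rather than a single string keeps the arguments separate, so the result could be passed to subprocess.run without shell quoting concerns.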

The equivalent of a DP CLI Refresh update can be done in Studio via the Load Full Data Set feature (although, unlike with the DP CLI, you cannot specify a different source table in Studio).

Note that you should not start a DP CLI Refresh update if a transformation on that data set is in progress. In this scenario, the Refresh update will fail and a notification will be sent to Studio:

Reload of <logical name> from CLI has failed. Please contact an administrator.

Schema changes

There are no restrictions on how the schema of the data set can change as a result of changes in the schema or data of the source Hive table. This is because the Refresh update operation uses a kill-and-fill strategy, in which the entire contents of the data set are removed and replaced with those in the Hive table.
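
The kill-and-fill idea can be pictured with a minimal Python sketch. The Table class and the kill_and_fill function are hypothetical stand-ins for illustration; they are not part of the DP CLI, and the sketch says nothing about how the product implements the operation internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Table:
    """Hypothetical stand-in for a Hive table or a project data set."""
    schema: Dict[str, str] = field(default_factory=dict)  # attribute name -> data type
    records: List[dict] = field(default_factory=list)


def kill_and_fill(data_set: Table, hive_table: Table) -> None:
    """Replace the data set's schema and records with the source table's."""
    # Kill: discard everything currently in the data set.
    data_set.schema.clear()
    data_set.records.clear()
    # Fill: copy the source schema and every source record.
    data_set.schema.update(hive_table.schema)
    data_set.records.extend(dict(record) for record in hive_table.records)
```

Because nothing from the old data set is merged or preserved, the new schema never has to be reconciled with the old one, which is why any schema change in the source table is acceptable.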

Transformation scripts in Refresh updates

If the data set has an associated Transformation script, then the script will run against the newly-ingested attributes and data. However, some of the schema changes may prevent some of the steps of the script from running. For example:

Existing columns in the Hive table may be deleted. As a result, any Transformation script step that references the deleted attributes will be skipped.
New columns can be added to the Hive table and they will result in new attributes in the data set. The Transformation script will not run on these new attributes as the script would not reference them.
Data added to a Hive column may result in the attribute having a different data type (such as String instead of a previous Long). The Transformation script may or may not run on the changed attribute.

The following diagram illustrates the effects of a schema change on the Transformation script:

[Figure: Effects of a schema change on the Transformation script]

If the data set does not have an associated Transformation script and the Hive table schema has changed, then the data set is updated with the new schema and data.
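
The schema-change cases described in this section can be modeled with a short Python sketch. The Step class, the classify_steps function, and the sample attribute names are hypothetical and exist only for illustration; the sketch mirrors the three cases listed above and does not describe how the product evaluates a script.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Step:
    """Hypothetical transformation step and the attributes it references."""
    name: str
    attributes: Set[str]


def classify_steps(steps: List[Step],
                   old_schema: Dict[str, str],
                   new_schema: Dict[str, str]) -> Dict[str, List[str]]:
    """Sort steps into 'run', 'skipped', and 'uncertain' after a schema change."""
    result: Dict[str, List[str]] = {"run": [], "skipped": [], "uncertain": []}
    for step in steps:
        if any(attr not in new_schema for attr in step.attributes):
            # The step references a deleted attribute, so it is skipped.
            result["skipped"].append(step.name)
        elif any(attr in old_schema and old_schema[attr] != new_schema[attr]
                 for attr in step.attributes):
            # The data type changed, so the step may or may not run.
            result["uncertain"].append(step.name)
        else:
            result["run"].append(step.name)
    return result


# Sample schemas: "zip" was deleted, "price" changed type, "region" is new.
old = {"price": "Long", "city": "String", "zip": "Long"}
new = {"price": "String", "city": "String", "region": "String"}
steps = [Step("trim_city", {"city"}),
         Step("round_price", {"price"}),
         Step("format_zip", {"zip"})]
outcome = classify_steps(steps, old, new)
```

In this sample no step references the new region attribute, which matches the case where the script does not run on newly added attributes.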

Refresh flag syntax
This topic describes the syntax of the --refreshData flag.
Running a Refresh update
This topic describes how to run a Refresh update operation.