A Refresh update replaces the schema and all the records in a
project data set with the schema and records in the source Hive table.
The DP CLI --refreshData flag (abbreviated as -refresh) performs a full data refresh on a BDD
data set from the original Hive table. The data set must be a project data
set (that is, it must have been added to a Studio project). Loading the full data set
affects only the data set in a specific project; it does not affect the data
set as it displays in the Studio Catalog.
Running a Refresh update produces the following results:
- All records stored in the Hive table are loaded for that data set. This
includes any table updates performed by a Hive administrator.
- If the data set was sampled, it is increased to the full size of the data
set. That is, it is now a full data set.
- If the data set contains a transformation script, that script will be run
against the full data set, so that all transformations apply to the full data
set in the project.
- If the --disableSearch flag is also used, record search and value search
will be disabled for the data set.
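As a rough illustration of how the flags described above fit together, the following Python sketch assembles an argument list for a Refresh update. Only the --refreshData and --disableSearch flags come from this topic. The script path ./data_processing_CLI, the placement of the logical name directly after --refreshData, and the sample logical name are assumptions made for the example, so check them against your installation before use.

```python
from typing import List


def build_refresh_command(
    logical_name: str,
    disable_search: bool = False,
    cli_path: str = "./data_processing_CLI",  # assumed script location
) -> List[str]:
    """Assemble the argument list for a DP CLI Refresh update."""
    if not logical_name:
        raise ValueError("a data set logical name is required")
    # Assumption: the logical name is passed as the value of --refreshData.
    command = [cli_path, "--refreshData", logical_name]
    if disable_search:
        # Disables record search and value search for the refreshed data set.
        command.append("--disableSearch")
    return command


if __name__ == "__main__":
    # "10128:WarrantyClaims" is a made-up logical name used for illustration.
    print(" ".join(build_refresh_command("10128:WarrantyClaims", disable_search=True)))
```

The returned list can be handed to subprocess.run. Make sure that no transformation on the data set is in progress before doing so.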
The equivalent of a DP CLI Refresh update can be done in Studio via the
Load Full Data Set feature (although, unlike with the DP CLI, you cannot
specify a different source table).
Note that you should not start a DP CLI Refresh update if a
transformation on that data set is in progress. In this scenario, the Refresh
update will fail and a notification will be sent to Studio:
Reload of <logical name> from CLI has failed. Please contact an administrator.
Schema changes
There are no restrictions on how the schema of the data set can change
as a result of changes to the schema and/or data of the source Hive table.
This is because the Refresh update operation uses a kill-and-fill
strategy, in which the entire contents of the data set are removed and replaced
with those in the Hive table.
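The kill-and-fill strategy can be pictured with a small Python model. This is a conceptual sketch only, not the product's implementation: the dictionary layout with "schema" and "records" keys and the function name kill_and_fill are hypothetical.

```python
from typing import Any, Dict

DataSet = Dict[str, Any]  # hypothetical layout: {"schema": {...}, "records": [...]}


def kill_and_fill(data_set: DataSet, hive_table: DataSet) -> DataSet:
    """Return a refreshed copy whose schema and records come only from the table."""
    refreshed = dict(data_set)  # keep unrelated keys, such as the data set name
    # Kill: the old schema and records are discarded, not merged.
    # Fill: both are copied from the source Hive table.
    refreshed["schema"] = dict(hive_table["schema"])
    refreshed["records"] = [dict(row) for row in hive_table["records"]]
    return refreshed
```

Because nothing from the old schema is merged into the result, any combination of added, deleted, or retyped columns in the table is acceptable.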
Transformation scripts in Refresh updates
If the data set has an associated Transformation script, then the
script will run against the newly-ingested attributes and data. However, some
of the schema changes may prevent some of the steps of the script from running.
For example:
- Existing columns in the Hive table may be deleted. As a result, any
Transformation script step that references the deleted attributes will be
skipped.
- New columns can be added to the Hive table, and they will result in new
attributes in the data set. The Transformation script will not run on these
new attributes because the script does not reference them.
- Data added to a Hive column may result in the attribute having a different
data type (such as String instead of a previous Long). The Transformation
script may or may not run on the changed attribute.
The following diagram illustrates the effects of a schema change on
the Transformation script:
If the data set does not have an associated Transformation script and
the Hive table schema has changed, then the data set is updated with the new
schema and data.
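The three schema-change cases listed above can be modeled with a short Python sketch. It is an illustration under stated assumptions, not the product's logic: schemas are represented as name-to-type dictionaries, each script step is a hypothetical (name, referenced attributes) pair, and the labels "skipped", "uncertain", and "runs" are invented for this example.

```python
from typing import Dict, List, Set, Tuple

Schema = Dict[str, str]  # attribute name -> data type


def classify_steps(
    old_schema: Schema,
    new_schema: Schema,
    steps: List[Tuple[str, Set[str]]],
) -> Dict[str, str]:
    """Label each step 'skipped', 'uncertain', or 'runs' after a schema change."""
    deleted = set(old_schema) - set(new_schema)
    retyped = {
        name
        for name in set(old_schema) & set(new_schema)
        if old_schema[name] != new_schema[name]
    }
    outcome: Dict[str, str] = {}
    for step_name, referenced in steps:
        if referenced & deleted:
            # The step references an attribute whose column was deleted.
            outcome[step_name] = "skipped"
        elif referenced & retyped:
            # The data type changed, so the step may or may not run.
            outcome[step_name] = "uncertain"
        else:
            outcome[step_name] = "runs"
    # Newly added attributes are referenced by no existing step,
    # so they never appear in the outcome.
    return outcome
```

For example, with a deleted region column, a price column that changed from long to string, and an unchanged sku column, a step on region is labeled "skipped", a step on price "uncertain", and a step on sku "runs".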