The diagram in this topic shows a data set loaded from Hive by the Data Processing component of BDD. It illustrates how you can update this data set using the DP CLI and increase its size from a sample to the full data set.
The scripted updates that you can run with the DP CLI are Refresh Data and Incremental Update.
To run scripted updates, you need the data set logical name, which is found in the data set's properties in Studio. It is important to provide the correct logical name to the DP CLI: a data set in Catalog is not the same data set as the one you have in your project, so note the logical name of the data set you intend to update.
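As an illustration of how the logical name feeds into a scripted update, the following is a minimal sketch that only builds the command lines and does not run them. The launcher path `./data_processing_CLI`, the `--refreshData` and `--incrementalUpdate` flags, and the filter argument are assumptions based on the DP CLI documentation and should be checked against your BDD release. `EXAMPLE_LOGICAL_NAME` and the filter text are hypothetical placeholders.

```python
import shlex

# Assumed launcher name; adjust to the DP CLI location in your installation.
DP_CLI = "./data_processing_CLI"


def refresh_command(logical_name):
    """Build the argument list for a Refresh Data run (flag name assumed)."""
    if not logical_name:
        raise ValueError("data set logical name is required")
    return [DP_CLI, "--refreshData", logical_name]


def incremental_command(logical_name, filter_predicate):
    """Build the argument list for an Incremental Update run.

    The flag name and the trailing filter predicate are assumptions;
    verify both in the documentation for your release.
    """
    if not logical_name:
        raise ValueError("data set logical name is required")
    return [DP_CLI, "--incrementalUpdate", logical_name, filter_predicate]


def as_shell(args):
    """Join arguments into one shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Hypothetical logical name copied from the data set's properties in Studio.
    name = "EXAMPLE_LOGICAL_NAME"
    print(as_shell(refresh_command(name)))
    print(as_shell(incremental_command(name, "example_column > 100")))
```

Building the command as a list and quoting it only at the end keeps a filter that contains spaces as a single argument.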
You cannot run Incremental Update with DP CLI for a data set in Catalog; you can only run it after you add the data set to a project.
You can run Incremental Update with DP CLI after you specify a record identifier. For this, you must move the data set into a project in Studio.
Once the data set is in a project, you can run both Refresh Data and Incremental Update. You can run these updates periodically, as cron jobs. The updates run on your personal version of the data set in this project, so your version is independent of the version that appears in Catalog.
With this workflow, you create a project of your own, based on this data set, where you can run scripted updates with DP CLI. This approach works well for BDD projects that you want to keep around and populate with newer data.
This way, you can continue using the configuration and visualizations you built in Studio before, and analyze newer data as it arrives.
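To make the periodic scheduling concrete, here is a small sketch that formats a daily crontab entry for an update command. It relies only on the standard five-field crontab layout (minute, hour, day of month, month, day of week). The command string, the installation directory, and the log path in the usage example are hypothetical, and the DP CLI flag shown there is an assumption to verify for your release.

```python
def cron_line(minute, hour, command, log_path):
    """Return a crontab entry that runs `command` every day at hour:minute."""
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    # cron treats an unescaped '%' in the command field as a newline.
    safe_command = command.replace("%", "\\%")
    # Append stdout and stderr to a log file so failed runs can be reviewed.
    return f"{minute} {hour} * * * {safe_command} >> {log_path} 2>&1"


if __name__ == "__main__":
    # Hypothetical directory, logical name, and log file.
    update = "cd /path/to/dp/cli && ./data_processing_CLI --refreshData EXAMPLE_LOGICAL_NAME"
    print(cron_line(30, 2, update, "/tmp/dp_refresh.log"))
```

The printed line can be added with `crontab -e` under the account that is allowed to run the DP CLI.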