Data update options

This topic summarizes the ways you can update data that is already loaded into BDD, and when each type of update is useful.

Options for data updates

To update already loaded data, you have these options:

Reload data set in Studio
Refresh data with DP CLI
Run an incremental update with DP CLI

Updates that you run with DP CLI are also called scripted updates.

Reload data set in Studio

The Reload data set option in Studio is useful when you want to replace previously loaded data with a newer version. It applies to personally uploaded files and to data imported from a JDBC source. Note that this option works only for data sets in Studio's Catalog.

For a diagram of updating data sets that were loaded in Studio, see Studio-loaded files: data update diagram in this guide.

For detailed procedures for loading and reloading data in Studio, see the sections in the Studio User's Guide.

Refresh data with DP CLI

The Refresh data operation from DP CLI reloads an existing data set in a Studio project, replacing the entire contents of the data set with the latest data from Hive. If the schema of the source Hive table changes, the schema of the refreshed data set changes with it. In this type of update, old data is removed and replaced with new data. Attributes may be added or deleted, and the data type of an attribute may change.

For a diagram of updating data sets that were loaded with DP CLI, see DP CLI-loaded files: data update diagram in this guide.

For detailed information on how to run scripted updates with DP CLI, see the Data Processing Guide.

Run an incremental update with DP CLI

The Incremental update operation from DP CLI lets you add newer data to an existing BDD application without removing already loaded data. In this type of update, the records' schema cannot change. An incremental update is most useful when you want to keep already loaded data and continue adding new data to it. For example, you can add more recent Twitter feeds to the ones that are already loaded.

For a diagram of updating data sets that were loaded with DP CLI, see DP CLI-loaded files: data update diagram in this guide.

For detailed information on how to run scripted updates with DP CLI, see the Data Processing Guide.
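The difference between the two scripted updates can be illustrated with a small Python model. This is only a conceptual sketch, not DP CLI code: the functions `refresh` and `incremental_update`, the list-of-dicts representation, and the sample records are all hypothetical and exist only to show that a refresh replaces everything (and may change the schema), while an incremental update keeps existing records and requires the same schema.

```python
def refresh(current, source):
    """Model of Refresh data: discard `current` and take the source wholesale.

    Hypothetical helper. Records that were deleted from the source disappear,
    and attributes may be added, removed, or change type.
    """
    return [dict(rec) for rec in source]


def incremental_update(current, new_records):
    """Model of Incremental update: keep loaded records and add new ones.

    Hypothetical helper. The attribute set of the new records must match
    the attribute set of the records that are already loaded.
    """
    schema = set(current[0]) if current else None
    for rec in new_records:
        if schema is not None and set(rec) != schema:
            raise ValueError("incremental update cannot change the schema")
    return current + [dict(rec) for rec in new_records]


# Sample data (made up for illustration).
loaded = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

# The source now lacks record 1 and has a new attribute, "lang".
source_now = [
    {"id": 2, "text": "second", "lang": "en"},
    {"id": 3, "text": "third", "lang": "en"},
]

refreshed = refresh(loaded, source_now)
appended = incremental_update(loaded, [{"id": 3, "text": "third"}])
```

In this model, `refreshed` no longer contains record 1 and carries the new `lang` attribute, while `appended` still contains records 1 and 2 plus the new record 3. Passing a record with a different attribute set to `incremental_update` raises an error, mirroring the rule that the schema cannot change.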

When to use each type of update

This table summarizes when each type of update is useful.

Type of data update	Useful when...
Reload data set in Catalog (in Studio)	This update is useful when you want to replace the loaded file with an updated version. Similarly, if data in a JDBC source was updated, you can reload it this way.
Scripted updates with DP CLI (`Refresh data` and `Incremental update`)	You can run scripted updates on files that originated from a Studio upload, and on files that BDD discovers in Hive when you run its data processing workflow for loading data with DP CLI. You can run either type of scripted update periodically by writing update scripts and `cron` jobs on the Hadoop machines that use the options for these updates from the Data Processing CLI. Depending on the characteristics of your source data, you may need to run both types of scripted updates periodically, or only one of them. For example, you may want to create a `cron` job that runs an incremental update nightly. This adds data from that day to the existing data set in a project in Studio. In addition to the periodic incremental update, you can run a `Refresh data` update weekly to replace the data in the project wholesale with the data that was collected in Hive during the week. A weekly `Refresh data` update is also useful because it lets you handle deletes from the source data set.

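The nightly-incremental, weekly-refresh schedule described above could be driven by a small script that a single `cron` job calls every night. The following Python sketch only decides which command to build for a given day. Everything specific in it is an assumption to verify against the Data Processing Guide: the script path `./data_processing_CLI`, the flag spellings `--refreshData` and `--incrementalUpdate`, the data set key `my_data_set`, the `load_date` attribute, and the filter syntax.

```python
import datetime

# All values below are assumptions for illustration; check the
# Data Processing Guide for the real script name and options.
DP_CLI = "./data_processing_CLI"   # assumed path to the DP CLI script
DATA_SET_KEY = "my_data_set"       # hypothetical data set identifier


def plan_update(day, refresh_weekday=6):
    """Return the command (as an argv list) to run for `day`.

    On `refresh_weekday` (0 = Monday ... 6 = Sunday) a full refresh is
    planned; on every other day an incremental update is planned.
    """
    if day.weekday() == refresh_weekday:
        # Weekly wholesale replacement, which also picks up deletes.
        return [DP_CLI, "--refreshData", DATA_SET_KEY]
    # Nightly addition of that day's records; the filter predicate and
    # the load_date attribute are hypothetical.
    day_filter = "load_date = '{}'".format(day.isoformat())
    return [DP_CLI, "--incrementalUpdate", DATA_SET_KEY, day_filter]


if __name__ == "__main__":
    # Print the planned command instead of running it, so the sketch
    # is safe to try on any machine.
    print(" ".join(plan_update(datetime.date.today())))
```

The sketch plans only the refresh on the weekly refresh day, because a refresh replaces the data set wholesale and would make an incremental update on the same night redundant. Whether that fits depends on how your source data is collected in Hive.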