Incremental updates

An Incremental update adds new records to a project data set from a source Hive table.

The DP CLI --incrementalUpdate flag (abbreviated as -incremental) performs a partial update of a project data set by selectively adding new and modified records. The data set should be a full project data set (that is, not a sample data set) that has been configured for Incremental updates.

The Incremental update operation fetches a subset of the records in the source Hive table. The subset is determined by a filtering predicate that names a Hive table column and the column values that select the records to fetch. The records in that subset are ingested as follows:
  • If a record is brand new (does not exist in the data set), it is added to the data set.
  • If a record already exists in the data set but its content has changed, the new version replaces the existing record in the data set.

The record identifier determines if a record already exists or is new.
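
The add-or-replace rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Data Processing is implemented: the data set is modeled as an in-memory dictionary keyed by the record identifier, the filtering predicate is an ordinary Python function, and the names incremental_update, data_set, hive_rows, and the id/year/status columns are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

Record = Dict[str, object]


def incremental_update(
    data_set: Dict[Tuple, Record],
    source_rows: Iterable[Record],
    predicate: Callable[[Record], bool],
    key_attrs: Tuple[str, ...],
) -> Dict[str, int]:
    """Apply the add-or-replace rule to an in-memory stand-in for a data set."""
    counts = {"added": 0, "replaced": 0, "unchanged": 0}
    for row in source_rows:
        # Only rows selected by the filtering predicate are fetched.
        if not predicate(row):
            continue
        # The record identifier decides whether the row is new or existing.
        key = tuple(row[attr] for attr in key_attrs)
        existing = data_set.get(key)
        if existing is None:
            data_set[key] = dict(row)   # brand-new record: add it
            counts["added"] += 1
        elif existing != row:
            data_set[key] = dict(row)   # changed content: replace it
            counts["replaced"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Hypothetical data: two records already in the data set.
data_set = {
    (1,): {"id": 1, "year": 2013, "status": "open"},
    (2,): {"id": 2, "year": 2014, "status": "open"},
}
# Hypothetical source rows standing in for the Hive table.
hive_rows = [
    {"id": 1, "year": 2013, "status": "open"},    # not selected by the predicate
    {"id": 2, "year": 2014, "status": "closed"},  # existing record, modified
    {"id": 3, "year": 2015, "status": "open"},    # new record
]
counts = incremental_update(data_set, hive_rows, lambda r: r["year"] > 2013, ("id",))
print(counts)
```

In this sketch, the record with id 1 is never examined because the predicate excludes it, which mirrors the point that only the fetched subset takes part in the update.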

Schema changes and disabling search

Unlike a Refresh update, an Incremental update has these limitations:
  • An Incremental update cannot make schema changes to the data set. This means that no attributes in the data set will be deleted or added.
  • An Incremental update cannot use the --disableSearch flag. This means that the searchability of the data set cannot be changed.

Transformation scripts in Incremental updates

If the data set has an associated Transformation script, then the script will run against the new records and can transform them (if a transform step applies). Existing records in the data set are not affected.

Record identifier configuration

A data set must be configured for Incremental updates before you can run an Incremental update against it. You perform this configuration from the Project Settings > Data Set Manager page in Studio.

The data set must be configured with a record identifier for determining the delta between records in the Hive table and records in the project data set. If columns have been added to or removed from the Hive table, you should run a Refresh update to incorporate those column changes in the data set.

When selecting the attributes that uniquely identify a record, the uniqueness score must be 100%. If the record identifier is not 100% unique, the Data Processing workflow will fail and return an exception. In this example, the Key Uniqueness field shows a 100% figure:

Configure for Updates dialog
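
To see why a candidate record identifier might fall short of 100%, the following Python sketch computes a uniqueness percentage for a set of attributes. The definition used here (distinct key combinations divided by total rows) is an assumption made for illustration; the way Studio actually calculates the Key Uniqueness figure may differ. The function name key_uniqueness and the claim_id and dealer attributes are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def key_uniqueness(rows: Iterable[Dict[str, object]], key_attrs: Sequence[str]) -> float:
    """Return distinct key combinations as a percentage of all rows."""
    rows = list(rows)
    if not rows:
        return 100.0  # assumption: an empty set has no duplicate keys
    distinct = {tuple(row[attr] for attr in key_attrs) for row in rows}
    return 100.0 * len(distinct) / len(rows)


# Hypothetical records: claim_id alone repeats, claim_id plus dealer does not.
rows = [
    {"claim_id": 1, "dealer": "A"},
    {"claim_id": 2, "dealer": "A"},
    {"claim_id": 2, "dealer": "B"},
]
single = key_uniqueness(rows, ["claim_id"])              # below 100
combined = key_uniqueness(rows, ["claim_id", "dealer"])  # exactly 100.0
print(single, combined)
```

In this hypothetical data, claim_id alone would not qualify as a record identifier, while the combination of claim_id and dealer would.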

After the data set is configured, its entry in the Data Set Manager page looks like this example:

An example of a data set entry in the Data Set Manager page.

Note that the Record Identifiers field now lists the attributes that were selected in the Configure for Updates dialog.

The Configure for Updates procedure is documented in the Studio User's Guide.

Error for non-configured data sets

If the data set has not been configured for Incremental updates, the Incremental update fails with an error similar to this:
...
data_processing_CLI finished with state ERROR
Exception in thread "main" com.oracle.endeca.pdi.client.EdpExecutionException: Only curated datasets can be updated.
        at com.oracle.endeca.pdi.client.EdpGeneralClient.invokeIncrementalUpdate(EdpGeneralClient.java:232)
        at com.oracle.endeca.pdi.EdpCli.runEdp(EdpCli.java:814)
        at com.oracle.endeca.pdi.EdpCli.processIncrementalUpdate(EdpCli.java:572)
        at com.oracle.endeca.pdi.EdpCli.commandLineArgumentLogic(EdpCli.java:316)
        at com.oracle.endeca.pdi.EdpCli.main(EdpCli.java:927)

In the error message, the term "curated datasets" refers to data sets that have been configured for Incremental updates. If this error occurs, configure the data set for Incremental updates and re-run the Incremental update operation.
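
If you drive the DP CLI from a script, a check along the following lines could flag this particular failure. It is a minimal sketch: the helper name needs_update_configuration and the sample output string are hypothetical, and the check assumes the message text matches the example above, which may vary between releases.

```python
# Message text taken from the example error above; treat it as an assumption.
NOT_CONFIGURED_MARKER = "Only curated datasets can be updated."


def needs_update_configuration(cli_output: str) -> bool:
    """Return True if captured CLI output contains the non-configured error."""
    return NOT_CONFIGURED_MARKER in cli_output


# Hypothetical captured output, abbreviated from the example error.
sample_output = (
    "data_processing_CLI finished with state ERROR\n"
    'Exception in thread "main" '
    "com.oracle.endeca.pdi.client.EdpExecutionException: "
    "Only curated datasets can be updated.\n"
)

if needs_update_configuration(sample_output):
    print("Configure the data set for Incremental updates in Studio, then re-run.")
```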