Incremental updates

An Incremental update adds new and modified records to a project data set from a source Hive table.

The DP CLI --incrementalUpdate flag (abbreviated as -incremental) performs a partial update of a project data set by adding new and modified records. The data set must be a project data set that is a full data set (that is, not a sample data set) and that has been configured for Incremental updates.

The Incremental update operation fetches a subset of the records in the source Hive table. The subset is selected by a filtering predicate that specifies a Hive table column and the column values of the records to fetch. The records in the subset batch are ingested as follows:

The record identifier determines whether each record in the batch already exists in the data set (in which case the existing record is updated) or is new (in which case it is added).
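The select-and-upsert behavior described above can be sketched in Python. The column names, predicate value, and records below are illustrative only; the actual work is performed by the DP CLI, not by code like this:

```python
# Illustrative sketch of the Incremental update ingest logic.
# All names and values here are hypothetical.

def fetch_subset(hive_rows, column, min_value):
    """Apply the filtering predicate: keep source rows whose
    `column` value is greater than `min_value`."""
    return [row for row in hive_rows if row[column] > min_value]

def ingest(data_set, batch, record_id):
    """Upsert each batch record: the record identifier decides
    whether a record replaces an existing one or is added as new."""
    for row in batch:
        data_set[row[record_id]] = row
    return data_set

# Existing project data set, keyed by the record identifier "id".
data_set = {1: {"id": 1, "ts": 10, "val": "a"},
            2: {"id": 2, "ts": 20, "val": "b"}}

# Source Hive table rows; the predicate fetches rows with ts > 15.
hive_rows = [{"id": 2, "ts": 30, "val": "b2"},   # modified record
             {"id": 3, "ts": 40, "val": "c"}]    # new record

batch = fetch_subset(hive_rows, "ts", 15)
ingest(data_set, batch, "id")
print(sorted(data_set))   # record 2 was updated, record 3 was added
```

Note that record 1, which did not appear in the fetched subset, is left untouched.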

Schema changes and disabling search

Unlike a Refresh update, an Incremental update has these limitations:
  • An Incremental update cannot make schema changes to the data set. This means that no attributes will be added to or deleted from the data set.
  • An Incremental update cannot use the --disableSearch flag. This means that the searchability of the data set cannot be changed.

Transformation scripts in Incremental updates

If the data set has an associated Transformation script, then the script will run against the new records and can transform them (if a transform step applies). Existing records in the data set are not affected.
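This behavior can be illustrated with a short sketch. The transform step shown (upper-casing a field) is hypothetical; real Transformation scripts run inside the product:

```python
# Sketch: a Transformation script runs only against incoming records;
# records already in the data set are not re-transformed.

def transform(record):
    """Hypothetical transform step: upper-case the "name" field."""
    record = dict(record)
    record["name"] = record["name"].upper()
    return record

existing = {1: {"id": 1, "name": "alpha"}}   # untouched by the update
incoming = [{"id": 2, "name": "beta"}]       # new records only

for rec in incoming:
    existing[rec["id"]] = transform(rec)

print(existing[1]["name"], existing[2]["name"])   # alpha BETA
```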

Record identifier configuration

A data set must be configured for Incremental updates before you can run an Incremental update against it. This procedure must be done from the Project Settings > Data Set Manager page in Studio.

The data set must be configured with a record identifier for determining the delta between records in the Hive table and records in the project data set. If columns have been added or removed from the Hive table, you should run a Refresh update to incorporate those column changes in the data set.

When selecting the attributes that uniquely identify a record, the uniqueness score should be as close as possible to 100%. If the record identifier is not 100% unique, the total record count decreases by the number of records that have duplicate or missing identifiers. In this example, the Key Uniqueness field shows a 100% figure:

After the data set is configured, its entry in the Data Set Manager page looks like this example:

Note that the Record Identifiers field now lists the attributes that were selected in the Configure for Updates dialog.
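The effect of a less-than-100% uniqueness score on the record count, described above, can be sketched as follows. The calculation here is an illustration of the idea, not the exact formula Studio uses:

```python
# Sketch: records whose identifier is duplicated or missing are dropped,
# so the ingested record count shrinks when key uniqueness is below 100%.

def key_uniqueness(rows, key):
    """Illustrative uniqueness score: distinct non-missing identifier
    values as a percentage of all rows."""
    ids = [r.get(key) for r in rows if r.get(key) is not None]
    return 100.0 * len(set(ids)) / len(rows)

def dedupe(rows, key):
    """Keep only the first row for each identifier value."""
    seen, kept = set(), []
    for r in rows:
        rid = r.get(key)
        if rid is None or rid in seen:
            continue            # duplicate or missing identifier: dropped
        seen.add(rid)
        kept.append(r)
    return kept

rows = [{"id": 1}, {"id": 2}, {"id": 2}, {}]   # one duplicate, one missing
print(key_uniqueness(rows, "id"))              # 50.0
print(len(dedupe(rows, "id")))                 # 2 of 4 records survive
```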

The configure-for-updates procedure is fully documented in the Data Exploration and Analysis Guide.

Error for non-configured data sets

If the data set is not a full data set or has not been configured for Incremental updates, the Incremental update fails with an error similar to this:
...
[2015-07-21T10:23:05.653-04:00] [DataProcessing] [ERROR] [] [com.oracle.endeca.pdi.logging.ProvisioningLogger] 
   [tid:Driver] [userID:yarn] Error running EDP
java.lang.RuntimeException: Cannot run incremental update on either non-full (sampled) dataset 
or dataset for which record identifiers were not provided.
	at com.oracle.endeca.pdi.workflow.IncrementalUpdateWorkflow.runWorkflow(IncrementalUpdateWorkflow.java:149)
	at com.oracle.endeca.pdi.workflow.IncrementalUpdateWorkflow.runWorkflow(IncrementalUpdateWorkflow.java:109)
	at com.oracle.endeca.pdi.EdpMain.runIncrementalUpdate(EdpMain.java:190)
	at com.oracle.endeca.pdi.EdpMain.runEdp(EdpMain.java:111)
	at com.oracle.endeca.pdi.EdpMain.main(EdpMain.java:61)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)
...

If this error occurs, configure the data set for Incremental updates and re-run the update operation.