An Incremental update adds new records to a project data set from a source Hive table.
The DP CLI --incrementalUpdate flag (abbreviated as -incremental) performs a partial update of a project data set by selectively adding new and modified records. The data set must be a full project data set (that is, not a sample data set) that has been configured for Incremental updates.
The record identifier determines if a record already exists or is new.
If the data set has an associated Transformation script, then the script will run against the new records and can transform them (if a transform step applies). Existing records in the data set are not affected.
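The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Data Processing implementation: the names record_key, incremental_merge, claim_id, and status are hypothetical, and the sketch assumes that a source row whose record identifier already exists in the data set is simply skipped.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, object]
Key = Tuple[object, ...]


def record_key(record: Record, id_attrs: Tuple[str, ...]) -> Key:
    """Build the record identifier from the chosen attributes (hypothetical helper)."""
    return tuple(record.get(attr) for attr in id_attrs)


def incremental_merge(
    existing: List[Record],
    source_rows: Iterable[Record],
    id_attrs: Tuple[str, ...],
    transform: Optional[Callable[[Record], Record]] = None,
) -> List[Record]:
    """Add only rows whose identifier is not yet in the data set.

    Existing records are copied through unchanged; the optional transform
    (standing in for a Transformation script) runs on new records only.
    """
    known = {record_key(r, id_attrs) for r in existing}
    merged = list(existing)
    for row in source_rows:
        key = record_key(row, id_attrs)
        if key in known:
            continue  # identifier already present: treated as an existing record
        new_record = dict(row)
        if transform is not None:
            new_record = transform(new_record)
        merged.append(new_record)
        known.add(key)
    return merged


# Illustrative data only.
data_set = [{"claim_id": 1, "status": "open"}]
hive_rows = [
    {"claim_id": 1, "status": "closed"},  # identifier exists: skipped
    {"claim_id": 2, "status": "open"},    # new identifier: transformed and added
]
result = incremental_merge(
    data_set,
    hive_rows,
    ("claim_id",),
    transform=lambda r: {**r, "status": str(r["status"]).upper()},
)
```

In this sketch the record with claim_id 1 keeps its original values, while the record with claim_id 2 is added after the transform has been applied to it.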
A data set must be configured for Incremental updates before you can run an Incremental update against it. This procedure must be done from the Data Set Manager page in Studio.
The data set must be configured with a record identifier for determining the delta between records in the Hive table and records in the project data set. If columns have been added to or removed from the Hive table, you should run a Refresh update to incorporate those column changes in the data set.
When selecting the attributes that uniquely identify a record, the uniqueness score should be as close as possible to 100%. If the record identifier is not 100% unique, the total record count decreases by the number of records that have duplicate or missing identifiers. In this example, the Key Uniqueness field shows a 100% figure:
After the data set is configured, its entry in the Data Set Manager page looks like this example:
Note that the Record Identifiers field now lists the attributes that were selected in the Configure for Updates dialog.
The configure-for-updates procedure is fully documented in the Data Exploration and Analysis Guide.
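To get a feel for the uniqueness guidance before choosing record identifier attributes, a rough check can be run on a sample of the source rows. The sketch below is an assumption about how such a score could be computed, not the formula Studio uses for its Key Uniqueness field: it counts a record as uniquely identified only if none of its identifier attributes is missing and no other record shares the same combination of values. The function name key_uniqueness and the sample attribute names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Dict[str, object]


def key_uniqueness(records: List[Record], id_attrs: Tuple[str, ...]) -> float:
    """Return the percentage of records with a complete, non-duplicated identifier.

    Assumption: a missing attribute is represented by None or by an absent key.
    """
    if not records:
        return 0.0
    keys = [tuple(r.get(attr) for attr in id_attrs) for r in records]
    # Count only complete identifiers; incomplete ones never qualify as unique.
    counts = Counter(k for k in keys if None not in k)
    unique = sum(1 for k in keys if None not in k and counts[k] == 1)
    return 100.0 * unique / len(keys)


# Illustrative data only.
sample = [
    {"claim_id": 1, "region": "east"},
    {"claim_id": 2, "region": "east"},
    {"claim_id": 2, "region": "west"},
    {"claim_id": None, "region": "west"},
]
single_attr_score = key_uniqueness(sample, ("claim_id",))
```

With these sample rows, claim_id alone is a weak identifier, because one value is duplicated and one is missing. Trying different attribute combinations in the same way shows which candidates come closest to 100%.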
If you run an Incremental update against a data set that is sampled or that has no record identifiers configured, the operation fails with an error similar to this example:
...
[2015-07-21T10:23:05.653-04:00] [DataProcessing] [ERROR] [] [com.oracle.endeca.pdi.logging.ProvisioningLogger] [tid:Driver] [userID:yarn] Error running EDP
java.lang.RuntimeException: Cannot run incremental update on either non-full (sampled) dataset or dataset for which record identifiers were not provided.
    at com.oracle.endeca.pdi.workflow.IncrementalUpdateWorkflow.runWorkflow(IncrementalUpdateWorkflow.java:149)
    at com.oracle.endeca.pdi.workflow.IncrementalUpdateWorkflow.runWorkflow(IncrementalUpdateWorkflow.java:109)
    at com.oracle.endeca.pdi.EdpMain.runIncrementalUpdate(EdpMain.java:190)
    at com.oracle.endeca.pdi.EdpMain.runEdp(EdpMain.java:111)
    at com.oracle.endeca.pdi.EdpMain.main(EdpMain.java:61)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)
...
If this error occurs, configure the data set for Incremental updates and re-run the update operation.
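If Incremental updates are run from a scheduled job, it can help to check the log output for this specific failure so that it is reported separately from other errors. The following is a minimal sketch, assuming the log text has already been read into a string and that the message appears on a single line exactly as in the example above. The function name needs_incremental_config is hypothetical.

```python
# Message text copied from the example error above.
INCREMENTAL_CONFIG_ERROR = (
    "Cannot run incremental update on either non-full (sampled) dataset "
    "or dataset for which record identifiers were not provided."
)


def needs_incremental_config(log_text: str) -> bool:
    """Return True if the log contains the 'not configured for updates' error."""
    return INCREMENTAL_CONFIG_ERROR in log_text


# Illustrative log fragment only.
sample_log = (
    "[DataProcessing] [ERROR] Error running EDP\n"
    "java.lang.RuntimeException: " + INCREMENTAL_CONFIG_ERROR
)
if needs_incremental_config(sample_log):
    print("Configure the data set for Incremental updates, then re-run the update.")
```

A plain substring match is used on purpose: it avoids depending on the timestamp, thread, or user fields, which vary between runs.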