An Incremental update adds new records to a project data set from a source Hive table.
The DP CLI --incrementalUpdate flag (abbreviated as -incremental) performs a partial update of a project data set by selectively adding new and modified records. The data set must be a project data set that is a full data set (that is, not a sample data set) and that has been configured for Incremental updates.
The record identifier determines if a record already exists or is new.
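The role of the record identifier can be illustrated with a short Python sketch. This is not the Data Processing implementation; the function name and the sample attributes (claim_id, vin) are hypothetical and serve only to show how a composite key can separate new records from ones that already exist.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, object]


def split_by_record_identifier(
    existing: Iterable[Record],
    incoming: Iterable[Record],
    key_attrs: Sequence[str],
) -> Tuple[List[Record], List[Record]]:
    """Return (new_records, already_present) based on the record identifier."""
    # Build the set of composite keys already present in the data set.
    existing_keys = {tuple(r[a] for a in key_attrs) for r in existing}
    new_records: List[Record] = []
    already_present: List[Record] = []
    for rec in incoming:
        key = tuple(rec[a] for a in key_attrs)
        if key in existing_keys:
            already_present.append(rec)
        else:
            new_records.append(rec)
    return new_records, already_present


# Hypothetical sample data: two records in the data set, two rows in Hive.
data_set = [{"claim_id": 1, "vin": "A"}, {"claim_id": 2, "vin": "B"}]
hive_rows = [{"claim_id": 2, "vin": "B"}, {"claim_id": 3, "vin": "C"}]

new, present = split_by_record_identifier(data_set, hive_rows, ["claim_id", "vin"])
print(new)  # [{'claim_id': 3, 'vin': 'C'}]
```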
Schema changes and disabling search
Transformation scripts in Incremental updates
If the data set has an associated Transformation script, then the script will run against the new records and can transform them (if a transform step applies). Existing records in the data set are not affected.
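The behavior described above can be sketched in Python as follows. The function names and the upper-casing transform step are hypothetical stand-ins, not part of the product; the point is only that the transform is applied to the incoming records while existing records pass through unchanged.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def apply_incremental(
    data_set: List[Record],
    new_records: List[Record],
    transform: Callable[[Record], Record],
) -> List[Record]:
    """Append transformed copies of the new records; keep existing ones as they are."""
    return data_set + [transform(dict(r)) for r in new_records]


def upper_case_state(rec: Record) -> Record:
    # Hypothetical transform step: normalize the 'state' attribute.
    rec["state"] = str(rec["state"]).upper()
    return rec


existing = [{"id": 1, "state": "ma"}]
incoming = [{"id": 2, "state": "ny"}]

updated = apply_incremental(existing, incoming, upper_case_state)
print(updated)  # [{'id': 1, 'state': 'ma'}, {'id': 2, 'state': 'NY'}]
```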
Record identifier configuration
A data set must be configured for Incremental updates before you can run an Incremental update against it. This procedure must be done from the Data Set Manager page in Studio.
The data set must be configured with a record identifier for determining the delta between records in the Hive table and records in the project data set. If columns have been added to or removed from the Hive table, you should run a Refresh update to incorporate those column changes in the data set.
When selecting the attributes that uniquely identify a record, the uniqueness score must be 100%. If the record identifier is not 100% unique, the Data Processing workflow will fail and return an exception. In this example, the Key Uniqueness field shows a 100% figure:
After the data set is configured, its entry in the Data Set Manager page looks like this example:
Note that the Record Identifiers field now lists the attributes that were selected in the Configure for Updates dialogue.
The Configure for Updates procedure is documented in the Studio User's Guide.
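To get a feel for the 100% uniqueness requirement before selecting attributes, a candidate record identifier can be checked against a sample of the data. The Python sketch below assumes that uniqueness is the share of distinct key values among all records; how Studio actually computes the Key Uniqueness figure is not described here, and the function and attribute names are hypothetical.

```python
from typing import Dict, Iterable, Sequence

Record = Dict[str, object]


def key_uniqueness(records: Iterable[Record], key_attrs: Sequence[str]) -> float:
    """Percentage of distinct composite key values among all records (assumed definition)."""
    rows = list(records)
    if not rows:
        return 0.0
    distinct = {tuple(r[a] for a in key_attrs) for r in rows}
    return 100.0 * len(distinct) / len(rows)


# Hypothetical sample rows.
rows = [
    {"claim_id": 1, "year": 2014},
    {"claim_id": 2, "year": 2014},
    {"claim_id": 3, "year": 2015},
]

print(key_uniqueness(rows, ["year"]))      # below 100: 'year' repeats
print(key_uniqueness(rows, ["claim_id"]))  # 100.0: every value is distinct
```

Under this assumed definition, only attribute combinations that score exactly 100 are suitable as a record identifier.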
Error for non-configured data sets
...
data_processing_CLI finished with state ERROR
Exception in thread "main" com.oracle.endeca.pdi.client.EdpExecutionException: Only curated datasets can be updated.
    at com.oracle.endeca.pdi.client.EdpGeneralClient.invokeIncrementalUpdate(EdpGeneralClient.java:232)
    at com.oracle.endeca.pdi.EdpCli.runEdp(EdpCli.java:814)
    at com.oracle.endeca.pdi.EdpCli.processIncrementalUpdate(EdpCli.java:572)
    at com.oracle.endeca.pdi.EdpCli.commandLineArgumentLogic(EdpCli.java:316)
    at com.oracle.endeca.pdi.EdpCli.main(EdpCli.java:927)
In the error message, the term "curated datasets" refers to data sets that have been configured for Incremental updates. If this error occurs, configure the data set for Incremental updates and re-run the Incremental update operation.
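If the DP CLI is run from a script, the captured output can be scanned for this condition. The Python sketch below is one possible approach: the helper name is hypothetical, and the only string taken from the product is the exception message shown above. How the output is captured is left to the calling script.

```python
# Message text taken from the exception shown in the error output.
NOT_CONFIGURED_MARKER = "Only curated datasets can be updated."


def needs_update_configuration(cli_output: str) -> bool:
    """Return True if the output indicates a data set not configured for Incremental updates."""
    return NOT_CONFIGURED_MARKER in cli_output


# Sample output, abbreviated from the error shown above.
sample = (
    "data_processing_CLI finished with state ERROR\n"
    'Exception in thread "main" com.oracle.endeca.pdi.client.EdpExecutionException: '
    "Only curated datasets can be updated.\n"
)

if needs_update_configuration(sample):
    print("Configure the data set for Incremental updates in Studio, then re-run.")
```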