Running an Incremental update

This topic describes how to run an Incremental update operation.

This procedure assumes that the data set has been configured for Incremental updates (that is, a record identifier has been configured).

Note that the example in the procedure does not use the --table and --database flags, which means that the command will run against the original Hive table from which the data set was created.

To run an Incremental update on a data set:

  1. Obtain the Data Set Logical Name of the data set you want to incrementally update:
    1. In Studio, go to Project Settings > Data Set Manager.
    2. In the Data Set Manager, select the data set and expand the options next to its name.
    3. Get the value from the Data Set Logical Name field.
  2. From a Linux command prompt, change to the $BDD_HOME/dataprocessing/edp_cli directory.
  3. Run the DP CLI with the --incrementalUpdate flag, the Data Set Logical Name, and the filter predicate. For example:
    ./data_processing_CLI --incrementalUpdate 10128:WarrantyClaims "yearest > 1850"
    
If the operation was successful, the DP CLI prints these messages at the end of the stdout output:
...
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: web2014.example.com
         ApplicationMaster RPC port: 0
         queue: root.fcalvill
         start time: 1437415956086
         final status: SUCCEEDED
         tracking URL: http://web2014.example.com:8088/proxy/application_1436970078353_0041/A
         user: fcalvill
data_processing_CLI finished with state SUCCESS
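If you wrap the DP CLI in a script, you can test for this final status line. The following is a minimal sketch; checking stdout for the success message (rather than relying on the CLI's exit code, whose behavior is an assumption here) keeps the check explicit:

```shell
#!/bin/sh
# Sketch: wrap the DP CLI and test for the success line shown above.
# Running the real command (commented out) requires a configured BDD install.
#
# OUTPUT=$(./data_processing_CLI --incrementalUpdate 10128:WarrantyClaims "yearest > 1850" 2>&1)

check_dp_status() {
    # Succeed only if the captured output contains the DP CLI success line.
    printf '%s\n' "$1" | grep -q 'data_processing_CLI finished with state SUCCESS'
}

# Demonstration against sample output standing in for a real run:
SAMPLE='final status: SUCCEEDED
data_processing_CLI finished with state SUCCESS'
if check_dp_status "$SAMPLE"; then
    echo "Incremental update succeeded"
fi
```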
Note that the tracking URL field shows an HTTP link to the Application page (in Cloudera Manager or Ambari) for this workflow. The YARN Application Overview page should show a State of "FINISHED" and a FinalStatus of "SUCCEEDED". The Name field will have an entry similar to this example:
EDP: IncrementalUpdateConfig{collectionId=MdexCollectionIdentifier{
databaseName=default_edp_2c08eb40-8eff-4c7e-b05e-2e451434936d, 
collectionName=default_edp_2c08eb40-8eff-4c7e-b05e-2e451434936d}, 
whereClause=claim_date >= unix_timestamp('2006-01-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}
Note the following about the Name information:
  • IncrementalUpdateConfig identifies the type of workflow (an Incremental update).
  • whereClause lists the filter predicate used in the command.

You can also check the Dgraph HDFS Agent log for the status of the Dgraph ingest operation.
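One quick way to scan that log from the command line is sketched below. The log path is a placeholder for your installation, and matching on the word "ingest" is an assumption about the log's wording:

```shell
#!/bin/sh
# Sketch: show recent ingest-related lines from the Dgraph HDFS Agent log.
# AGENT_LOG is a placeholder path; adjust it for your installation.
AGENT_LOG="${AGENT_LOG:-/path/to/dgraph-hdfs-agent.log}"
if [ -f "$AGENT_LOG" ]; then
    # Case-insensitive match, last 20 hits only.
    grep -i 'ingest' "$AGENT_LOG" | tail -n 20
else
    echo "Log not found: $AGENT_LOG" >&2
fi
```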

If no records match the filter predicate, the DP CLI exits gracefully with a message stating that there are no records to update.

Note that future Incremental updates on this data set will continue to use the same Data Set Logical Name. You will also use this name if you set up an Incremental update cron job for this data set.
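A scheduled update could look like the following crontab sketch. The installation path, schedule, and log destination are placeholders; note that cron does not expand $BDD_HOME, so the full path must be spelled out:

```shell
# Hypothetical crontab entry: run the Incremental update nightly at 2:00 AM.
# Replace /opt/bdd with your actual BDD installation path; the logical name
# and filter predicate are the ones from the example in this topic.
0 2 * * * cd /opt/bdd/dataprocessing/edp_cli && ./data_processing_CLI --incrementalUpdate 10128:WarrantyClaims "yearest > 1850" >> /tmp/incremental_update.log 2>&1
```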