Running a Refresh update

This topic describes how to run a Refresh update operation.

This procedure assumes that:
  • A data set has been created, either from Studio or with the DP CLI.
  • The data set has been added to a Studio project.

To run a Refresh update on a data set:

  1. Obtain the data set key of the data set you want to refresh:
    1. In Studio, go to Project Settings > Data Set Manager.
    2. In the Data Set Manager, select the data set and expand the options next to its name.
    3. Get the value from the Data Set Key field.
  2. From a Linux command prompt, change to the $BDD_HOME/dataprocessing/edp_cli directory.
  3. Run the DP CLI with the --refreshData flag and the data set key. For example:
    ./data_processing_CLI --refreshData default_edp_171506f0-e2d6-4ed1-8f5e-052a1fad721a_10135
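Steps 2 and 3 can be combined in a small wrapper. The following is a minimal sketch, assuming that BDD_HOME is set in the environment. The function name run_refresh and the log file argument are hypothetical and are not part of the DP CLI.

```shell
#!/bin/sh
# Hypothetical helper: run a Refresh update for one data set key and
# save the DP CLI output to a log file for later inspection.
run_refresh() {
  key="$1"   # data set key copied from the Data Set Key field in Studio
  log="$2"   # file that receives stdout and stderr (assumed location)
  if [ -z "$key" ] || [ -z "$log" ]; then
    echo "usage: run_refresh <data-set-key> <log-file>" >&2
    return 2
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  # The function returns 1 if the directory is missing; otherwise it
  # returns whatever exit status the DP CLI itself returned.
  (
    cd "$BDD_HOME/dataprocessing/edp_cli" || exit 1
    ./data_processing_CLI --refreshData "$key"
  ) > "$log" 2>&1
}
```

For example, run_refresh default_edp_171506f0-e2d6-4ed1-8f5e-052a1fad721a_10135 /tmp/refresh.log would write the output of the command shown above to /tmp/refresh.log.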
If the operation is successful, the DP CLI prints messages similar to the following at the end of its stdout output:
...
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: web2014.example.com
         ApplicationMaster RPC port: 0
         queue: root.fcalvill
         start time: 1437157181086
         final status: SUCCEEDED
         tracking URL: http://web2014.example.com:8088/proxy/application_1436970078353_0020/A
         user: fcalvill
Refreshing existing collection: default_edp_171506f0-e2d6-4ed1-8f5e-052a1fad721a_10135
Collection key for new record:  refreshed_edp_34cdbff2-2e5f-4c09-9388-2b9f5ae3148e
data_processing_CLI finished with state SUCCESS
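If the DP CLI output has been saved to a file, the messages above can be checked from a script. The following is a sketch only: it assumes the output contains the same message text as the sample above, and the function names are hypothetical.

```shell
# Hypothetical helpers for inspecting DP CLI output saved to a file.
# In both functions, $1 is the path to that file.

# Succeeds only if the output contains the final success message.
refresh_succeeded() {
  grep -q 'data_processing_CLI finished with state SUCCESS' "$1"
}

# Prints the internal key reported for the refreshed data set, if present.
new_collection_key() {
  sed -n 's/^Collection key for new record: *//p' "$1"
}
```

A calling script could then branch on the result, for example: refresh_succeeded /tmp/refresh.log && new_collection_key /tmp/refresh.log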
If the operation succeeded, the YARN Application Overview page shows a State of "FINISHED" and a FinalStatus of "SUCCEEDED". The Name field has an entry similar to this example:
EDP: DatasetRefreshConfig{hiveDatabase=, hiveTable=, 
collectionToRefresh=edp_cli_edp_479776cd-2d93-4de0-bfc0-196b7f16b2b5_10121, 
newCollectionName=refreshed_edp_0f49f22d-7344-4448-b82f-3c70bfad6314, op=REFRESH_DATASET}
Note the following about the Name information:
  • hiveDatabase and hiveTable are blank because the --database and --table flags were not used. In this case, the Refresh update operation uses the same Hive table and database that were used when the data set was first created.
  • collectionToRefresh is the data set key that was passed to the command. It corresponds to the value on the Refreshing existing collection line of the stdout output.
  • newCollectionName is an internal name for the refreshed data set. It corresponds to the value on the Collection key for new record line of the stdout output. This name does not appear in the Studio UI, because the data set key is persistent and continues to identify the data set.
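To compare these fields with the values from the command, the Name entry can be parsed with standard tools. The following sketch assumes that the Name string has been copied as a single line in the format of the example above. The function name name_field is hypothetical.

```shell
# Hypothetical helper: print the value of one field from a
# DatasetRefreshConfig Name string that is passed as a single line.
# $1 is the field name, $2 is the Name string.
name_field() {
  # Capture everything after "<field>=" up to the next comma or closing brace.
  printf '%s\n' "$2" | sed -n "s/.*$1=\([^,}]*\).*/\1/p"
}
```

For example, name_field collectionToRefresh "$name" prints the data set key that was refreshed, and name_field hiveTable "$name" prints an empty line when the --table flag was not used.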

You can also check the Dgraph HDFS Agent log for the status of the Dgraph ingest operation.

Future Refresh updates on this data set continue to use the same data set key. Use the same key if you set up a Refresh update cron job for this data set.
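Because the key does not change, a cron entry can be generated once and reused. The following sketch builds a crontab line from the directory and command used in this procedure. The function name refresh_cron_line, the schedule, and the log file path are assumptions for illustration. BDD_HOME is expanded when the line is generated, because cron jobs typically do not inherit that variable.

```shell
# Hypothetical helper: print a crontab line that runs a Refresh update.
# $1 is a five-field cron schedule, $2 is the data set key,
# $3 is a log file that receives the DP CLI output (assumed location).
refresh_cron_line() {
  printf '%s cd %s/dataprocessing/edp_cli && ./data_processing_CLI --refreshData %s >> %s 2>&1\n' \
    "$1" "$BDD_HOME" "$2" "$3"
}
```

For example, the following command appends a daily 02:00 job to the current user's crontab while keeping any existing entries:

(crontab -l 2>/dev/null; refresh_cron_line '0 2 * * *' default_edp_171506f0-e2d6-4ed1-8f5e-052a1fad721a_10135 /tmp/refresh.log) | crontab -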