Running a Refresh update

This topic describes how to run a Refresh update operation.

This procedure assumes that:
  • The data set has been created, either from Studio or with the DP CLI.
  • The data set has been added to a Studio project.

To run a Refresh update on a data set:

  1. Obtain the Data Set Logical Name of the data set you want to refresh:
    1. In Studio, go to Project Settings > Data Set Manager.
    2. In the Data Set Manager, select the data set and expand the options next to its name.
    3. Get the value from the Data Set Logical Name field.
  2. From a Linux command prompt, change to the $BDD_HOME/dataprocessing/edp_cli directory.
  3. Run the DP CLI with the --refreshData flag and the Data Set Logical Name. For example:
    ./data_processing_CLI --refreshData 10128:WarrantyClaims
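
Steps 2 and 3 can be wrapped in a small shell function so that the refresh can be started from any directory. The following is a minimal sketch, not part of the product: the function name refresh_data_set is hypothetical, and it assumes that BDD_HOME is set in the calling shell. It passes through whatever exit status the DP CLI returns.

```shell
# Sketch: run a Refresh update for one data set.
# Assumes BDD_HOME points at the BDD installation; the function name is illustrative.
refresh_data_set() {
    logical_name="$1"
    if [ -z "$logical_name" ]; then
        echo "usage: refresh_data_set <Data Set Logical Name>" >&2
        return 2
    fi
    if [ -z "$BDD_HOME" ]; then
        echo "BDD_HOME is not set" >&2
        return 2
    fi
    # Use a subshell so the caller's working directory is left unchanged.
    (
        cd "$BDD_HOME/dataprocessing/edp_cli" || exit 1
        ./data_processing_CLI --refreshData "$logical_name"
    )
}
```

With the function loaded, the example above becomes:
    refresh_data_set 10128:WarrantyClaims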
    
If the operation succeeds, the DP CLI prints messages similar to the following at the end of its stdout output:
[2016-06-24T09:56:22.963-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:fcalvill] 
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 10.152.105.219
         ApplicationMaster RPC port: 0
         queue: root.fcalvill
         start time: 1466776490743
         final status: SUCCEEDED
         tracking URL: http://bus2014.example.com:8088/proxy/application_1466716670116_0002/A
         user: fcalvill
Refreshing existing collection: MdexCollectionIdentifier{
   databaseName=edp_cli_edp_ad9a93eb-fbec-49ca-bdc9-8ac897dd5c8f, 
   collectionName=edp_cli_edp_ad9a93eb-fbec-49ca-bdc9-8ac897dd5c8f}
Collection key for new record:  MdexCollectionIdentifier{
   databaseName=refreshed_edp_a284bd0c-23fe-4d26-9e92-cbfc22b1555e, 
   collectionName=refreshed_edp_a284bd0c-23fe-4d26-9e92-cbfc22b1555e}
data_processing_CLI finished with state SUCCESS
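
If the DP CLI output has been saved to a file, the final status line shown above can be checked from a script. This is a minimal sketch: the function name refresh_succeeded and the capture file are hypothetical, and it matches only the SUCCESS line that appears in the sample output.

```shell
# Sketch: return 0 if the captured DP CLI output contains the success line.
# The argument is the path to a file holding the saved output.
refresh_succeeded() {
    log_file="$1"
    # -F matches the text literally; -q suppresses output and sets only the exit status.
    grep -qF 'data_processing_CLI finished with state SUCCESS' "$log_file"
}
```

For example, assuming the output is captured to a file named refresh.out:
    ./data_processing_CLI --refreshData 10128:WarrantyClaims > refresh.out 2>&1
    refresh_succeeded refresh.out && echo "Refresh update succeeded"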

The YARN Application Overview page should show a State of "FINISHED" and a FinalStatus of "SUCCEEDED". The Name field will have an entry similar to this example:
EDP: DatasetRefreshConfig{hiveDatabase=, hiveTable=, 
collectionToRefresh=MdexCollectionIdentifier{databaseName=edp_cli_edp_ad9a93eb-fbec-49ca-bdc9-8ac897dd5c8f, 
collectionName=edp_cli_edp_ad9a93eb-fbec-49ca-bdc9-8ac897dd5c8f}, 
newCollectionId=MdexCollectionIdentifier{databaseName=refreshed_edp_a284bd0c-23fe-4d26-9e92-cbfc22b1555e, 
collectionName=refreshed_edp_a284bd0c-23fe-4d26-9e92-cbfc22b1555e}, 
op=REFRESH_DATASET}

Note the following about the Name information:
  • hiveDatabase and hiveTable are blank because the --database and --table flags were not used. In this case, the Refresh update operation uses the same Hive database and table that were used when the data set was first created.
  • collectionToRefresh is the name of the data set that was refreshed. This name is the same as the Refreshing existing collection field in the stdout listed above.
  • newCollectionId is an internal name for the refreshed data set. This name does not appear in the Studio UI; the original Data Set Logical Name continues to be used because it is a persistent name. The newCollectionId value is the same as the Collection key for new record field in the stdout listed above.
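
To compare the internal name in the stdout with the newCollectionId value on the YARN page, the name can be pulled out of saved output with awk. This is a sketch only: the function name new_collection_name is hypothetical, and it assumes the output has the same layout as the sample shown earlier (a Collection key for new record line followed by a collectionName= line).

```shell
# Sketch: print the collectionName that follows "Collection key for new record".
# The argument is the path to a file holding the saved DP CLI output.
new_collection_name() {
    awk '
        /Collection key for new record/ { found = 1 }
        found && /collectionName=/ {
            sub(/.*collectionName=/, "")   # drop everything up to the value
            sub(/[},].*/, "")              # drop the closing brace or comma
            print
            exit
        }
    ' "$1"
}
```

Run against the sample output above, this prints refreshed_edp_a284bd0c-23fe-4d26-9e92-cbfc22b1555e.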

You can also check the Dgraph HDFS Agent log for the status of the Dgraph ingest operation.

Note that future Refresh updates on this data set will continue to use the same Data Set Logical Name. You will also use this name if you set up a Refresh update cron job for this data set.
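
As an illustration of such a cron job, the helper below prints a crontab entry that changes to the edp_cli directory and runs the Refresh update. It is a sketch under stated assumptions: the function name refresh_cron_line is hypothetical, the daily 02:00 schedule is arbitrary, and the installation path is passed in explicitly because cron jobs typically do not inherit BDD_HOME from a login shell.

```shell
# Sketch: print a crontab line that runs a Refresh update daily at 02:00.
# $1 = absolute path of the BDD installation, $2 = Data Set Logical Name.
refresh_cron_line() {
    bdd_home="$1"
    logical_name="$2"
    printf '0 2 * * * cd %s/dataprocessing/edp_cli && ./data_processing_CLI --refreshData %s\n' \
        "$bdd_home" "$logical_name"
}
```

For example, with /opt/bdd standing in for the real installation path:
    refresh_cron_line /opt/bdd 10128:WarrantyClaims
prints a line that can be added with crontab -e:
    0 2 * * * cd /opt/bdd/dataprocessing/edp_cli && ./data_processing_CLI --refreshData 10128:WarrantyClaims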