DP CLI workflow examples

This topic shows some workflow examples using the DP CLI.

Excluding specific Data Enrichment modules

The --excludePlugins flag (abbreviated as -ep) specifies a list of Data Enrichment modules to exclude when enrichments are run. This flag should be used only enrichments are being run as part of the workflows (for example, with the --excludePlugins flag).

The syntax is:
./data_processing_CLI --excludePlugins <excludeList>
where excludeList is a space-separated string of one or more of these Data Enrichment canonical module names:
  • address_geo_tagger (for the Address GeoTagger)
  • ip_geo_extractor (for the IP Address GeoTagger)
  • reverse_geo_tagger (for the Reverse GeoTagger)
  • tfidf_term_extractor (for the TF.IDF Term extractor)
  • doc_level_sentiment_analysis (for the document-level Sentiment Analysis module)
  • language_detection (for the Language Detection module)
For example:
./data_processing_CLI --table masstowns --runEnrichment --excludePlugins reverse_geo_tagger

For details on the Data Enrichment modules, see Data Enrichment Modules.

Cleaning up aborted jobs

The --cleanAbortedJobs flag (abbreviated as -clean) cleans up artifacts left over from incomplete Data Processing workflows:
./data_processing_CLI --cleanAbortedJobs
A successful result should be similar to this example:
...
[2015-07-13T10:18:13.683-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:fcalvill] 
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: web12.example.com
         ApplicationMaster RPC port: 0
         queue: root.fcalvill
         start time: 1436797065603
         final status: SUCCEEDED
         tracking URL: http://web12.example.com:8088/proxy/application_1434142292832_0016/A
         user: fcalvill
Clean aborted job completed.
data_processing_CLI finished with state SUCCESS
Note that the name of the workflow on the YARN All Applications page is:
EDP: CleanAbortedJobsConfig{}

Ping checking the DP components

The --pingCheck flag (abbreviated as -ping) ping checks the status of components that Data Processing needs:
./data_processing_CLI --pingCheck
A successful result should be similar to this example:
...
[2015-07-14T14:52:32.270-04:00] [DataProcessing] [INFO] [] [com.oracle.endeca.pdi.logging.ProvisioningLogger]
[tid:main] [userID:fcalvill] Ping check time elapsed: 7 ms
data_processing_CLI finished with state SUCCESS