DP CLI workflow examples

This topic shows some workflow examples using the DP CLI.

Excluding specific Data Enrichment modules

The --excludePlugins flag (abbreviated as -ep) specifies a list of Data Enrichment modules to exclude when enrichments are run. Use this flag only when enrichments are being run as part of the workflow (for example, with the --runEnrichment flag).

The syntax is:
./data_processing_CLI --excludePlugins <excludeList>
where excludeList is a space-separated string of one or more of these Data Enrichment canonical module names:
  • address_geo_tagger (for the Address GeoTagger)
  • ip_geo_extractor (for the IP Address GeoTagger)
  • reverse_geo_tagger (for the Reverse GeoTagger)
  • tfidf_term_extractor (for the TF.IDF Term extractor)
  • doc_level_sentiment_analysis (for the document-level Sentiment Analysis module)
  • language_detection (for the Language Detection module)
For example:
./data_processing_CLI --table masstowns --runEnrichment --excludePlugins reverse_geo_tagger

For details on the Data Enrichment modules, see Data Enrichment Modules.
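
Because a misspelled module name is easy to miss, it can help to check the exclude list before running the workflow. The following is a minimal sketch, not part of the DP CLI: build_exclude_list is a hypothetical helper that validates its arguments against the canonical names listed above and prints them as a space-separated string.

```shell
#!/bin/sh
# Canonical module names, copied from the list in this topic.
KNOWN_MODULES="address_geo_tagger ip_geo_extractor reverse_geo_tagger tfidf_term_extractor doc_level_sentiment_analysis language_detection"

# build_exclude_list (hypothetical helper, not a DP CLI feature):
# prints its arguments as one space-separated string, or returns 1
# if any argument is not one of the canonical module names.
build_exclude_list() {
    list=""
    for name in "$@"; do
        case " $KNOWN_MODULES " in
            *" $name "*) list="${list:+$list }$name" ;;
            *) echo "unknown module: $name" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "$list"
}
```

The printed string can then be supplied as the <excludeList> value. This topic does not state whether a multi-name list must be quoted on the command line, so verify that against your installation.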

Ping checking the DP components

The --pingCheck flag (abbreviated as -ping) ping checks the connection status of the components that Data Processing needs:
./data_processing_CLI --pingCheck
A successful result should be similar to this example:
Ping ok
data_processing_CLI finished with state SUCCESS
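
In a script, the final state line shown above can serve as the success test. The sketch below assumes only that a successful run prints that line; cli_succeeded is a hypothetical helper name, and the sketch makes no assumption about the exit code of the DP CLI.

```shell
#!/bin/sh
# cli_succeeded (hypothetical helper): reads DP CLI output on stdin and
# succeeds only if it contains the SUCCESS state line shown in this topic.
cli_succeeded() {
    grep -qF 'data_processing_CLI finished with state SUCCESS'
}

# Example use (commented out so that this block only defines the helper):
#   ./data_processing_CLI --pingCheck 2>&1 | cli_succeeded && echo "ping check passed"
```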

Running a DP health check

The --healthCheck flag (abbreviated as -health) returns the status of the components that are required by Data Processing:
./data_processing_CLI --healthCheck
A successful result should be similar to this example:
...
[2015-07-14T14:52:32.270-04:00] [DataProcessing] [INFO] [] [com.oracle.endeca.pdi.logging.ProvisioningLogger] [tid:main] [userID:fcalvill] Ping check time elapsed: 7 ms
data_processing_CLI finished with state SUCCESS
The component statuses in a successful result should be similar to this example:
health=ok, v1="db:ok; yarn:ok; hdfs:ok"
data_processing_CLI finished with state SUCCESS
These status fields will report either ok (if a successful connection was made to the component) or notok (if the connection attempt to the component was unsuccessful):
  • db is the connection to the Workflow Manager database.
  • yarn is the connection to the YARN service.
  • hdfs is the connection to HDFS.
  • auth is the connection to a secure cluster. This field is displayed only if the cluster is configured as a secure cluster.
  • mode:shutdown is reported only while Workflow Manager is in shutdown mode.
  • health is the overall status of Data Processing. This status is ok if all the above connection statuses are ok.
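
When the overall status is not ok, it is useful to list the components that failed. The following sketch assumes that a failing status line has the same layout as the successful example, with notok in place of ok; failed_components is a hypothetical helper name.

```shell
#!/bin/sh
# failed_components (hypothetical helper): reads --healthCheck output on
# stdin and prints the name of each component reported as notok, one per line.
# Assumes the status line starts with "health=" and lists name:status pairs
# separated by semicolons inside v1="...".
failed_components() {
    grep '^health=' |
        sed -e 's/^.*v1="//' -e 's/".*$//' |
        tr ';' '\n' |
        sed -e 's/^ *//' |
        awk -F: '$2 == "notok" { print $1 }'
}

# Example use (commented out so that this block only defines the helper):
#   ./data_processing_CLI --healthCheck 2>&1 | failed_components
```

An empty result means that no component on the status line was reported as notok.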

Getting job status

The get-job-status command returns the status of a completed or active Data Processing job:
./data_processing_CLI get-job-status <jobId>
When a Data Processing workflow is started from the DP CLI, its job ID is displayed, as in this example:
./data_processing_CLI -d default -t warrantyclaims
...
New collection name = MdexCollectionIdentifier{databaseName=edp_cli_edp_997d2151-b694-4bd2-88be-732461731b6c, collectionName=edp_cli_edp_997d2151-b694-4bd2-88be-732461731b6c}
jobId: 950b2d4a-20cf-4e9d-9f5f-5cf713ade145
Note that jobs started from Studio do not display a job ID, which means you cannot get their status with this command.
For a job started from the DP CLI, you can use the displayed job ID to get the job's status:
./data_processing_CLI get-job-status 950b2d4a-20cf-4e9d-9f5f-5cf713ade145
The status for the job will be one of the following:
  • NOT_STARTED
  • RUNNING
  • SUCCEEDED
  • NOTFOUND
  • ABORTING
  • ABORTED
  • CANCELLING
  • CANCELLED
  • FAILED
The following example shows the status of a job that has failed:
./data_processing_CLI get-job-status 950b2d4a-20cf-4e9d-9f5f-5cf713ade145
Job status for job id: 950b2d4a-20cf-4e9d-9f5f-5cf713ade145 is FAILED
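
The job ID and status can also be picked out of the output by a script, for example to wait for a job to finish. The sketch below relies only on the "jobId:" and "Job status for job id" line formats shown above. The helper names are hypothetical, and the set of statuses treated as final in is_finished is an assumption, not something this topic states.

```shell
#!/bin/sh
# extract_job_id (hypothetical helper): reads workflow output on stdin and
# prints the value of the first "jobId:" line.
extract_job_id() {
    sed -n 's/^jobId: *//p' | head -n 1
}

# extract_status (hypothetical helper): reads get-job-status output on stdin
# and prints the status word from the "Job status for job id" line.
extract_status() {
    sed -n 's/^Job status for job id: .* is \([A-Z_]*\)$/\1/p' | head -n 1
}

# is_finished (hypothetical helper): succeeds for statuses that this sketch
# ASSUMES are final; all other statuses are treated as still in progress.
is_finished() {
    case "$1" in
        SUCCEEDED|FAILED|ABORTED|CANCELLED|NOTFOUND) return 0 ;;
        *) return 1 ;;
    esac
}

# Example polling loop (commented out so that this block only defines helpers):
#   job_id=$(./data_processing_CLI -d default -t warrantyclaims 2>&1 | extract_job_id)
#   until is_finished "$(./data_processing_CLI get-job-status "$job_id" 2>&1 | extract_status)"; do
#       sleep 30
#   done
```

The polling loop has no timeout and keeps running if no status can be parsed, so a production script would need a retry limit.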

Cancelling jobs

The cancel-job command cancels an active Data Processing job:
./data_processing_CLI cancel-job <jobId>
The result output should be one of the following:
  • Case 1: output when the job was not found:
    Could not cancel job with id: 123 because it could not be found.
  • Case 2: output when the job succeeded already:
    Could not cancel job with id: 123 because it has already succeeded.
  • Case 3: output when the job failed already:
    Could not cancel job with id: 123 because it has already failed.
  • Case 4: output when the operation results in an error:
    Could not cancel job with id: 123 because of error: <error_message>
  • Case 5: output when the job was successfully cancelled:
    Job status for job id: 123 is CANCELLED

Note that the output in Case 5 has the same format as the output of the get-job-status command.
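
A script that cancels jobs can map the five outputs above to short result codes. This is a minimal sketch that matches on the message fragments quoted in the cases; classify_cancel_output and the codes it prints are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# classify_cancel_output (hypothetical helper): reads cancel-job output on
# stdin and prints a short code for the case it matches, or "unknown".
classify_cancel_output() {
    output=$(cat)
    case "$output" in
        *"because it could not be found."*)    echo not_found ;;
        *"because it has already succeeded."*) echo already_succeeded ;;
        *"because it has already failed."*)    echo already_failed ;;
        *"because of error:"*)                 echo error ;;
        *"is CANCELLED"*)                      echo cancelled ;;
        *)                                     echo unknown ;;
    esac
}

# Example use (commented out so that this block only defines the helper):
#   ./data_processing_CLI cancel-job "$job_id" 2>&1 | classify_cancel_output
```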