Data Processing logging

This topic provides an overview of the Data Processing logging files.

Location of the log files

Each run of Data Processing produces a new log file on every machine involved in the Data Processing job, so the log files reside on each node that participated. These nodes include:
  • The client that started the job (which could be nodes running the DP CLI or nodes running Studio)
  • An Oozie (YARN) worker node
  • Spark worker nodes

The logging location on each node is defined by the edpJarDir property in the data_processing-CLI file. By default, this is the /opt/bdd/edp/data directory.
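As a quick check on any node, you can read the configured directory straight out of the configuration file. A minimal sketch, assuming edpJarDir appears as a shell-style assignment in data_processing-CLI (the CONF path and the assignment syntax are assumptions; adjust them for your install):

```shell
# Hedged sketch: print the log directory configured for this node.
# CONF and the edpJarDir= syntax are assumptions; adjust for your install.
CONF="${CONF:-data_processing-CLI}"
if [ -f "$CONF" ]; then
  # Print the value assigned to edpJarDir
  sed -n 's/^[[:space:]]*edpJarDir=//p' "$CONF"
else
  # Fall back to the documented default when the file is not at hand
  echo "/opt/bdd/edp/data"
fi
```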

Log files

The Data Processing log files are named edpLog*.log. The naming pattern is set in the logging.properties configuration; the default pattern is edpLog%u%g.log, where %u is a unique number that resolves conflicts between simultaneous Java processes and %g is the generation number that distinguishes rotated logs. Generation numbers rotate, so the most recent run of Data Processing is always generation 0. The default configuration allows up to 10,000 log files with a maximum size of 1MB each; when a log grows past 1MB, logging rolls over to the next file.
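This rotation behavior corresponds to standard java.util.logging FileHandler settings. The property names below are the stock java.util.logging ones and the values are illustrative; verify them against the logging.properties file shipped with your install:

```properties
# Illustrative java.util.logging entries (verify against the shipped file):
# roll to a new file after ~1 MB, keeping up to 10,000 generations per process.
java.util.logging.FileHandler.pattern = /opt/bdd/edp/data/edpLog%u%g.log
java.util.logging.FileHandler.limit = 1000000
java.util.logging.FileHandler.count = 10000
```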

A sample error log message is:
[2015/01/15 14:14:15] INFO: Starting Data Processing on Hive Table: default.claims
[2015/01/15 14:14:15] SEVERE: Error running EDP
java.lang.Exception: Example Error Log Message
      at com.oracle.endeca.pdi.EdpMain.main(EdpMain.java:38)
      ...

Finding the Data Processing logs

When a client launches a Data Processing workflow, an Oozie job is created to run the actual Data Processing job. That job runs on an arbitrary node in the CDH cluster, chosen by YARN. To find the Data Processing logs, you must first identify that node using the Oozie Job ID, which is printed to the console when the DP CLI runs and can also be found in the Studio logs.

To find the Data Processing logs:
  1. Go to the Oozie Web UI and find the corresponding job using the Oozie Job ID.
  2. Click on the job to bring up detailed Oozie information.
  3. Under the Actions pane, click the DataProcessingJavaTask action.
  4. In the Action Info tab of the Action pane, find the External ID. The external ID matches a YARN Job ID.
  5. Go to the YARN HistoryServer Web UI and find the corresponding job using the External ID from the previous step. To do so:
    1. Browse the Cloudera Manager and click the YARN service in the left pane.
    2. In the Quick Links section in the top left, click HistoryServer Web UI.
  6. Click the job to bring up detailed MapReduce information. The Node property indicates which machine ran the Data Processing job.
  7. Log into the machine and go to the Data Processing directory on the cluster. By default, this is the /opt/bdd/edp/data directory. All the logs for Data Processing should reside in this directory.
  8. To find a specific log, you may need to use grep (or a similar tool) to search for the corresponding workflow information.
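Because many generations of edpLog*.log accumulate in that directory, one way to narrow the search is to grep for an identifier tied to your workflow, such as the YARN application ID found in step 4. A minimal sketch; LOG_DIR and APP_ID are placeholders to substitute with your own values:

```shell
# Hedged sketch: list the log files that mention a given workflow.
# LOG_DIR and APP_ID are placeholders; substitute your own values.
LOG_DIR="${LOG_DIR:-/opt/bdd/edp/data}"
APP_ID="${APP_ID:-application_1421330000000_0001}"
# -l prints only the names of files containing a match
grep -l "$APP_ID" "$LOG_DIR"/edpLog*.log 2>/dev/null || echo "no matching log found"
```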