Useful CDH logs

Several CDH log files can contain valuable information for debugging issues with the Data Processing component of Big Data Discovery.

YARN logs

To find the Data Processing logs in YARN:
  1. Go to the Oozie Web UI and find the corresponding job using the Oozie Job ID.
  2. Click the job to bring up detailed Oozie information.
  3. Under the Actions pane, click the DataProcessingJavaTask action.
  4. In the Action Info tab of the Action pane, find the External ID. The External ID matches a YARN job ID.
  5. Go to the YARN HistoryServer Web UI and find the corresponding job using the Oozie External ID. To do so:
    1. Browse the Cloudera Manager and click the YARN service in the left pane.
    2. In the Quick Links section in the top left, click HistoryServer Web UI.
  6. Click the job to bring up detailed MapReduce information.
  7. Click the Map task type to go to the Map Tasks page for the job.
  8. Click the Map task. There should be only one Map task on this page.
  9. Click the logs link. This displays a page with some logging information and links to the stdout and stderr full logs for the Map task.
  10. In either the stderr or stdout log type section, click the Click here for the full log link. This displays the full log for the selected log type.
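
If you would rather script these steps, Oozie's REST API and the yarn logs command expose the same information. The following is a minimal sketch in Python, assuming a reachable Oozie server on its default port (11000), a placeholder Oozie Job ID, and YARN log aggregation enabled so that yarn logs can dump the finished job's task logs:

import json
import subprocess
import urllib.request

OOZIE_URL = "http://oozie-host.example.com:11000/oozie"  # assumed Oozie server
OOZIE_JOB_ID = "0000123-150424000000000-oozie-oozi-W"    # hypothetical Job ID

# Ask Oozie for the job's details; each action carries an externalId.
with urllib.request.urlopen(
        "%s/v1/job/%s?show=info" % (OOZIE_URL, OOZIE_JOB_ID)) as resp:
    info = json.load(resp)

for action in info.get("actions", []):
    external_id = action.get("externalId")
    if not external_id or not external_id.startswith("job_"):
        continue
    # A MapReduce job ID maps to its YARN application ID by a prefix swap.
    app_id = external_id.replace("job_", "application_", 1)
    print("Action %s -> %s" % (action.get("name"), app_id))
    # Dumps stdout/stderr of every container for the application.
    subprocess.run(["yarn", "logs", "-applicationId", app_id], check=False)
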
The stdout log lists the Data Processing operation type that was invoked for the workflow, as shown in this abbreviated entry:
>>> Invoking Main class now >>>

Main class        : com.oracle.endeca.pdi.EdpOozieJobReceiver
Arguments         :
                    PROVISION_DATASET_FROM_HIVE
                    {
  "@class" : "com.oracle.endeca.pdi.client.config.EdpEnvConfig",
  "endecaServer" : {
    "@class" : "com.oracle.endeca.pdi.concepts.EndecaServer",
    "host" : "web04.us.example.com",
    "wsPort" : 7001,
    "contextRoot" : "/endeca-server",
    "ssl" : false
  },
...
The Arguments field lists the operation type in the Data Processing workflow:
  • APPLY_TRANSFORM_TO_DATASET — updates a project data set by applying a transformation to it.
  • APPLY_TRANSFORM_TO_DATASOURCE — creates a new BDD data set (and a corresponding Hive table) by applying a transformation to an existing project data set and saving the transformed data to the new Hive table. This operation is also called forking the data set.
  • CLEANUP_DATASETS — deletes any BDD data set that does not have a corresponding source Hive table.
  • CLEANUP_ORPHANED_DATASETS — deletes any BDD data set that was generated from a Studio project that no longer exists.
  • PROVISION_DATASET_FROM_HIVE — creates a new BDD data set from a Hive table.
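
To quickly determine which of these operations a given workflow ran, you can scan a saved copy of the stdout log for the operation names. A minimal sketch in Python, using a placeholder file name for the saved log:

import re

# Operation types a Data Processing workflow reports in its stdout log.
OPERATIONS = (
    "APPLY_TRANSFORM_TO_DATASET",
    "APPLY_TRANSFORM_TO_DATASOURCE",
    "CLEANUP_DATASETS",
    "CLEANUP_ORPHANED_DATASETS",
    "PROVISION_DATASET_FROM_HIVE",
)
pattern = re.compile("|".join(OPERATIONS))

with open("map-task-stdout.log") as log:  # placeholder path to the saved log
    for line in log:
        match = pattern.search(line)
        if match:
            print("Workflow operation:", match.group(0))
            break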

Spark worker logs

The main Data Processing log contains several references to a specific Spark job's Application ID, which takes the form app-TIMESTAMP-INCREMENTALCOUNTER. You need this Application ID to locate the corresponding Spark workers.
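
For example, you can collect every Application ID mentioned in the log with a short script. A sketch, assuming both fields of the app-TIMESTAMP-INCREMENTALCOUNTER form are numeric and using a placeholder log path:

import re

# Matches Application IDs such as app-20150424114126-0001 (hypothetical).
APP_ID = re.compile(r"app-\d+-\d+")

with open("data-processing.log") as log:  # placeholder path to the main log
    ids = sorted(set(APP_ID.findall(log.read())))

for app_id in ids:
    print(app_id)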

You can display a specific Spark worker log by using the Spark Web UI. To do so, select the Spark job in the Spark Web UI and find each of the Spark workers that ran the Data Processing job. From there, you can access each worker's stdout and stderr logs. The logs from the workers are similar but differ slightly because the workers run on separate machines.
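
As an alternative to clicking through the UI, Spark standalone workers also serve their log files over HTTP. The sketch below assumes the standalone worker web UI on its default port (8081) and its /log endpoint; the worker host, Application ID, and executor ID are placeholders that you would read off the Spark Web UI:

import urllib.request

WORKER = "http://worker01.example.com:8081"  # assumed worker web UI address
APP_ID = "app-20150424114126-0001"           # hypothetical Application ID
EXECUTOR_ID = "0"                            # taken from the Spark Web UI

for log_type in ("stdout", "stderr"):
    url = ("%s/log?appId=%s&executorId=%s&logType=%s"
           % (WORKER, APP_ID, EXECUTOR_ID, log_type))
    with urllib.request.urlopen(url) as resp:
        print("==== %s ====" % log_type)
        print(resp.read().decode("utf-8", errors="replace"))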