Several CDH log files can contain valuable information for debugging issues with the Data Processing component of Big Data Discovery.
YARN logs
To find the Data Processing logs in YARN:
- Go to the Oozie Web UI and find the corresponding job using the Oozie Job ID.
- Click the job to bring up detailed Oozie information.
- Under the Actions pane, click the DataProcessingJavaTask action.
- In the Action Info tab of the Action pane, find the External ID. The External ID matches a YARN Job ID.
- Go to the YARN HistoryServer Web UI and find the corresponding job using the External ID from the Oozie action. To do so:
  - Browse to Cloudera Manager and click the YARN service in the left pane.
  - In the Quick Links section in the top left, click HistoryServer Web UI.
- Click the job to bring up detailed MapReduce information.
- Click the Map task type to go to the Map Tasks page for the job.
- Click the Map task. There should be only one Map task on this page.
- Click the logs link. This displays a page with some logging information and links to the full stdout and stderr logs for the Map task.
- In either the stderr or stdout log type section, click the Click here for the full log link. This displays the full log for the selected log type. A scripted alternative to these steps is sketched below.
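If you prefer to script this lookup, the following Python sketch chains the same steps from the command line: it asks the Oozie Web Services API for the DataProcessingJavaTask action's External ID, converts that YARN Job ID into an application ID, and dumps the aggregated container logs with the yarn logs command. The Oozie host and job ID here are hypothetical placeholders, and the sketch assumes the default Oozie port (11000) and that YARN log aggregation is enabled.

# Sketch: fetch the Data Processing logs for an Oozie job from the
# command line instead of clicking through the web UIs.
import json
import subprocess
import urllib.request

OOZIE_URL = "http://oozie01.us.example.com:11000/oozie"  # hypothetical host
OOZIE_JOB_ID = "0000123-150601120000000-oozie-oozi-W"    # hypothetical ID

# Ask Oozie for the workflow's action list (Web Services API, show=info).
with urllib.request.urlopen(
        f"{OOZIE_URL}/v1/job/{OOZIE_JOB_ID}?show=info") as resp:
    info = json.load(resp)

# Find the DataProcessingJavaTask action and read its External ID,
# which matches a YARN Job ID such as job_1433190000000_0042.
action = next(a for a in info["actions"]
              if a["name"] == "DataProcessingJavaTask")
yarn_job_id = action["externalId"]

# A YARN Job ID maps to its application ID by swapping the prefix.
app_id = yarn_job_id.replace("job_", "application_", 1)

# Dump the aggregated container logs, which include stdout and stderr
# (requires YARN log aggregation to be enabled on the cluster).
subprocess.run(["yarn", "logs", "-applicationId", app_id], check=True)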
The stdout log lists the Data Processing operation type that was invoked for the workflow, as shown in this abbreviated entry:
>>> Invoking Main class now >>>
Main class : com.oracle.endeca.pdi.EdpOozieJobReceiver
Arguments :
PROVISION_DATASET_FROM_HIVE
{
  "@class" : "com.oracle.endeca.pdi.client.config.EdpEnvConfig",
  "endecaServer" : {
    "@class" : "com.oracle.endeca.pdi.concepts.EndecaServer",
    "host" : "web04.us.example.com",
    "wsPort" : 7001,
    "contextRoot" : "/endeca-server",
    "ssl" : false
  },
...
The Arguments field lists the operation type of the Data Processing workflow:
- APPLY_TRANSFORM_TO_DATASET — updates a project data set by applying a transformation to it.
- APPLY_TRANSFORM_TO_DATASOURCE — creates a new BDD data set (and a corresponding Hive table) by applying a transformation to an existing project data set and saving the transformed data to the new Hive table. This operation is also called forking the data set.
- CLEANUP_DATASETS — deletes any BDD data set that does not have a corresponding source Hive table.
- CLEANUP_ORPHANED_DATASETS — deletes any BDD data set that was generated from a Studio project that no longer exists.
- PROVISION_DATASET_FROM_HIVE — creates a new BDD data set from a Hive table.
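If you have saved a stdout log to disk, you can pull the operation type out of it directly. The sketch below assumes the log follows the layout shown in the excerpt above, with the operation type on the line after "Arguments :"; the log file name is a placeholder.

# Sketch: report which Data Processing operation type a saved stdout
# log records, assuming the "Arguments :" layout shown above.
OPERATION_TYPES = {
    "APPLY_TRANSFORM_TO_DATASET",
    "APPLY_TRANSFORM_TO_DATASOURCE",
    "CLEANUP_DATASETS",
    "CLEANUP_ORPHANED_DATASETS",
    "PROVISION_DATASET_FROM_HIVE",
}

def find_operation_type(log_path):
    """Return the operation type named in the log, or None."""
    with open(log_path) as log:
        lines = iter(log)
        for line in lines:
            # Tolerate extra padding around "Arguments :".
            if line.split() == ["Arguments", ":"]:
                candidate = next(lines, "").strip()
                if candidate in OPERATION_TYPES:
                    return candidate
    return None

print(find_operation_type("stdout.log"))  # e.g. PROVISION_DATASET_FROM_HIVE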
Spark worker logs
The main Data Processing log contains several references to a specific Spark job's Application ID, which takes the form app-TIMESTAMP-INCREMENTALCOUNTER. You need this Application ID to find the corresponding Spark workers.
You can display a specific Spark worker log by using the Spark Web UI. To do so, select the Spark job in the Spark Web UI and find each of the Spark workers used to run the Data Processing job. From there, you can access the stdout and stderr logs of each worker. The logs for the Spark workers are similar but differ slightly because the workers run on separate machines.
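To gather the Application IDs without scanning the main log by eye, you can search it for the app-TIMESTAMP-INCREMENTALCOUNTER pattern. The sketch below assumes the 14-digit timestamp and 4-digit counter used by Spark standalone Application IDs; the log file name is a placeholder.

# Sketch: list the Spark Application IDs referenced in the main Data
# Processing log so you know which applications to open in the Spark
# Web UI.
import re

# app-TIMESTAMP-INCREMENTALCOUNTER, e.g. app-20150601120000-0001
APP_ID_PATTERN = re.compile(r"\bapp-\d{14}-\d{4}\b")

with open("dp-workflow.log") as log:  # placeholder file name
    app_ids = sorted(set(APP_ID_PATTERN.findall(log.read())))

for app_id in app_ids:
    print(app_id)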