Example of DP logs during a workflow

This example gives an overview of the various DP logs that are generated when you run a workflow with the DP CLI.

The example assumes that the Hive administrator has created a table named masstowns (which contains information about towns and cities in Massachusetts). The workflow will be run with the DP CLI, which is described in DP Command Line Interface Utility.

The DP CLI command line is:
./data_processing_CLI --database default --table masstowns --maxRecords 1000

The --table flag specifies the name of the Hive table, the --database flag states that the table is in the Hive database named "default", and the --maxRecords flag caps the sample size at a maximum of 1,000 records.
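When the CLI is run from a script, the final status line on stdout can serve as the success signal. A minimal sketch, assuming only the command line above and the "finished with state SUCCESS" message shown later in this section (the wrapper function name is illustrative, not part of the product):

```shell
# Illustrative wrapper: run a command (here, the DP CLI invocation),
# echo its stdout, and succeed only if the output contains the final
# "finished with state SUCCESS" status line.
run_dp_cli() {
  out=$("$@") || return 1          # run the command, capture stdout
  printf '%s\n' "$out"             # pass the output through for logging
  printf '%s\n' "$out" | grep -q 'finished with state SUCCESS'
}

# Example invocation (same command line as above):
# run_dp_cli ./data_processing_CLI --database default --table masstowns --maxRecords 1000
```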

Command stdout

The DP CLI first prints out the configuration with which it is running, which includes the following:
...
EdpEnvConfig{endecaServer=http://web07.example.oracle.com:7003/endeca-server/, edpDataDir=/user/bdd/edp/data, 
...
ProvisionDataSetFromHiveConfig{hiveDatabaseName=default, hiveTableName=masstowns, 
newCollectionId=MdexCollectionIdentifier{databaseName=
edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, 
collectionName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e}, 
runEnrichment=false, maxRecordsForNewDataSet=1000, disableTextSearch=false, 
languageOverride=en, operation=PROVISION_DATASET_FROM_HIVE, transformScript=, 
accessType=public_default, autoEnrichPluginExcludes=[Ljava.lang.String;@71034e3b}
ProvisionDataSetFromHiveConfig{notificationName=CLIDATALOAD, 
ecid=0000LM3rDDu7ADkpSw4Eyc1NROXb000001, startTime=1466796128122, 
properties={dataSetDisplayName=Taxi_Data, isCli=true}}
New collection name = MdexCollectionIdentifier{
databaseName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, 
collectionName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e}
data_processing_CLI finished with state SUCCESS
...

The operation field lists the operation type of the Data Processing workflow. In this example, the operation is PROVISION_DATASET_FROM_HIVE, which means that the workflow creates a new BDD data set from a Hive table.

$BDD_HOME/logs/edp logs

In this example, the $BDD_HOME/logs/edp directory has three logs. The owner of one of them is the user ID of the person who ran the DP CLI, while the owner of the other two logs is the user yarn:
  • The non-YARN log contains information similar to the stdout information. Note that it does not contain entries from the Spark executors.
  • The YARN logs contain information similar to the YARN logs described in the next section.
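The ownership split is easy to see with a directory listing; a small sketch (the helper name and output format are illustrative, not part of the product) that prints the owner and file name for each log:

```shell
# Illustrative helper: print "owner filename" for every entry in a
# log directory, e.g.  list_edp_logs "$BDD_HOME/logs/edp"
# $3 is the owner column and $NF the file name in ls -l output.
list_edp_logs() {
  ls -l "$1" | awk 'NR > 1 { print $3, $NF }'
}
```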

YARN logs

If you follow the YARN ResourceManager Web UI link, the All Applications page shows the Spark applications that have run. In this example, the job name is:
EDP: ProvisionDataSetFromHiveConfig{hiveDatabaseName=default, hiveTableName=masstowns, 
newCollectionId=MdexCollectionIdentifier{
databaseName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, 
collectionName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e}}
The Name field shows these characteristics about the job:
  • ProvisionDataSetFromHiveConfig is the type of DP workflow that was run.
  • hiveDatabaseName lists the name of the Hive database (default in this example).
  • hiveTableName lists the name of the Hive table that was provisioned (masstowns in this example).
  • newCollectionId lists the name of the new data set and its Dgraph database (both names are the same).
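Because the job name is a flat key=value string, these fields can also be recovered with standard text tools; a minimal sketch using the job name shown above (the edp_field helper is illustrative, not part of the product):

```shell
# Illustrative helper: extract one key=value field from an EDP job name.
edp_field() {
  printf '%s\n' "$2" | sed -n "s/.*$1=\([^,}]*\).*/\1/p"
}

# The job name from the All Applications page shown above:
name='EDP: ProvisionDataSetFromHiveConfig{hiveDatabaseName=default, hiveTableName=masstowns, newCollectionId=MdexCollectionIdentifier{databaseName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, collectionName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e}}'

edp_field hiveDatabaseName "$name"   # default
edp_field hiveTableName "$name"      # masstowns
```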

Clicking on History in the Tracking UI field displays the job history. The information in the Application Overview panel includes the name of the user who ran the job, the final status of the job, and the elapsed time of the job. FAILED jobs will have error information in the Diagnostics field.

Clicking on logs in the Logs field displays the stdout and stderr output. The stderr output will be especially useful for FAILED jobs. In addition, the stdout section has a link (named Click here for the full log) that displays more detailed output information.

Dgraph HDFS Agent log

When the DP workflow finishes, the Dgraph HDFS Agent fetches the DP-created files and sends them to the Dgraph for ingest. The Dgraph HDFS Agent log messages for the ingest operation will be similar to the following entries (the message details are not shown):
Received request for database edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e
Starting ingest for: MdexCollectionIdentifier{
  databaseName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, 
  collectionName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e},
  ...
createBulkIngester edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e
Finished reading 1004 records for MdexCollectionIdentifier{
  databaseName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, 
  collectionName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e},
  ...
sendRecordsToIngester 1004
closeBulkIngester
Ingest finished with 1004 records committed and 0 records rejected. 
  Status: INGEST_FINISHED. Request info: MdexCollectionIdentifier{
  databaseName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, 
  collectionName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e}, 
  ...
Notification server url: http://busgg2014.us.oracle.com:7003/bdd/v1/api/workflows
About to send notification
Terminating
Notification{workflowName=CLIDataLoad, sourceDatabaseName=null, sourceDatasetKey=null, 
  targetDatabaseName=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, 
  targetDatasetKey=edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e, 
  ecid=0000LM3rDDu7ADkpSw4Eyc1NROXb000001, status=SUCCEEDED, 
  startTime=1466796128122, timestamp=1466796195365, progressPercentage=100.0, 
  errorMessage=null, properties={dataSetDisplayName=masstowns, isCli=true}}
Notification sent successfully
Terminating

The ingest operation is complete when the final Status: INGEST_FINISHED message is written to the log.
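For monitoring, the committed and rejected counts can be scraped from that final message. A sketch that parses the "Ingest finished ..." line quoted above (the actual log file path is deployment-specific, so the sample text is piped in here):

```shell
# Parse "Ingest finished with N records committed and M records rejected."
# lines on stdin and print "committed=N rejected=M" for each match.
# $4 and $8 are the committed and rejected counts in that message.
ingest_counts() {
  awk '/Ingest finished with/ { printf "committed=%s rejected=%s\n", $4, $8 }'
}

# Example with the entry from the log excerpt above:
printf '%s\n' 'Ingest finished with 1004 records committed and 0 records rejected.' | ingest_counts
# committed=1004 rejected=0
```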

Dgraph out log

As a result of the ingest operation for the data set, the Dgraph out log (dgraph.out) will have these bulk_ingest messages:
Start ingest for collection: edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e for database edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e        
Starting a bulk ingest operation for database edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e  
batch 0 finish BatchUpdating status Success for database edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e       
Ending bulk ingest at client's request for database edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e - finalizing changes       
Bulk ingest completed: Added 1004 records and rejected 0 records, for database edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e 
Ingest end - 0.584MB in 2.010sec = 0.291MB/sec for database edp_cli_edp_ac680edd-c25f-4b9d-8cab-11441c5a3d2e
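The rate reported in the final "Ingest end" entry is simply the ingested size divided by the elapsed time; a quick awk check of the figures above (awk's string-to-number conversion drops the MB and sec suffixes):

```shell
# Recompute MB/sec from the size and time fields of the "Ingest end"
# entry; $4 is "0.584MB" and $6 is "2.010sec", and adding 0 converts
# each to its leading numeric value.
printf '%s\n' 'Ingest end - 0.584MB in 2.010sec = 0.291MB/sec' |
  awk '{ mb = $4 + 0; sec = $6 + 0; printf "%.3fMB/sec\n", mb / sec }'
# 0.291MB/sec
```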

At this point, the data set records are in the Dgraph and the data set can be viewed in Studio.

Studio log

Similar to workflows run from the DP CLI, Studio-generated workflows also produce logs in the $BDD_HOME/logs/edp directory, as well as YARN logs, Dgraph HDFS Agent logs, and Dgraph out logs.

In addition, Studio workflows are also logged in the $BDD_DOMAIN/servers/<serverName>/logs/bdd-studio.log file.