Executing H2H on Spark
- 1. Register a cluster from DMT Configurations > Register Cluster with the following details:
- Name - Enter the name of the target information domain of the H2H mapping.
- Description - Enter a description for the cluster.
- Livy Service URL - Enter the Livy Service URL used to connect to Spark from OFSAA. A quick reachability check for this URL is sketched after this list.
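Before moving on, it can help to confirm that the registered Livy Service URL is actually reachable from the OFSAA host. The following is a minimal sketch, not part of OFSAA, that queries the standard Apache Livy REST API; the URL shown is a hypothetical placeholder:

```python
import requests

# Hypothetical Livy Service URL, as registered under
# DMT Configurations > Register Cluster.
LIVY_URL = "http://livy-host.example.com:8998"

# The Apache Livy REST API lists active sessions at GET /sessions.
resp = requests.get(f"{LIVY_URL}/sessions", timeout=10)
resp.raise_for_status()
print("Livy reachable; active sessions:", resp.json().get("total", 0))
```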
- 2. To execute H2H on Spark, set the EXECUTION_ENGINE_MODE parameter to SPARK from ICC or RRF.
- Execution through Operations module - Pass [EXECUTION_ENGINE_MODE]=SPARK while defining the H2H tasks from the Task Definition window. For more information, see the Component: LOAD DATA section.
- Execution through RRF module - Pass the following as a parameter while defining H2H as jobs from the Component Selector window:
"EXECUTION_ENGINE_MODE","SPARK"
- Spark Session Management - In a batch execution, a new Spark session is created when the first H2H-Spark task is encountered, and the same session is reused for the rest of the H2H-Spark tasks in the same Run. For the Spark session to close at the end of the Run, set the CLOSE_SPARK_SESSION parameter to YES in the last H2H-Spark task in the batch. A conceptual sketch of this session lifecycle follows the note below.
- Execution through Operations module - Pass [CLOSE_SPARK_SESSION]=YES while defining the last H2H-Spark task from the Task Definition window. For more information, see the Component: LOAD DATA section.
- Execution through RRF module - Pass the following as a parameter while defining the last H2H-Spark job from the Component Selector window:
"CLOSE_SPARK_SESSION","YES"
Note:
- Ensure that the precedence of the task with "CLOSE_SPARK_SESSION","YES" is set so that it executes after all the other H2H-Spark tasks in the batch; otherwise, the session may be closed while other tasks still need it.
- By default, the created Spark session is closed if any of the H2H-Spark tasks fails.
- Execution of H2H with a large number of mappings may fail, because Spark restricts the length of the SQL statement submitted through spark.sql to a maximum of 65535 characters (2^16 - 1).
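The session reuse described above is easiest to picture in terms of the Apache Livy REST lifecycle behind the registered Livy Service URL. The following is a conceptual sketch only, not OFSAA code; the URL, session kind, and SQL statements are assumptions for illustration:

```python
import time
import requests

LIVY_URL = "http://livy-host.example.com:8998"  # hypothetical Livy Service URL
HEADERS = {"Content-Type": "application/json",
           "X-Requested-By": "ofsaa"}           # needed if Livy CSRF protection is on

# First H2H-Spark task in the batch: a session is created once.
sid = requests.post(f"{LIVY_URL}/sessions",
                    json={"kind": "sql"}, headers=HEADERS).json()["id"]

# Wait for the session to become idle before submitting work.
while requests.get(f"{LIVY_URL}/sessions/{sid}").json()["state"] != "idle":
    time.sleep(5)

# Subsequent H2H-Spark tasks in the same Run reuse the session: each task's
# generated SQL is just another statement submitted to session `sid`.
for sql in ("SELECT 1", "SELECT 2"):            # stand-ins for generated H2H SQL
    requests.post(f"{LIVY_URL}/sessions/{sid}/statements",
                  json={"code": sql}, headers=HEADERS)

# CLOSE_SPARK_SESSION=YES on the last task corresponds to closing the session.
requests.delete(f"{LIVY_URL}/sessions/{sid}", headers=HEADERS)
```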
- When you run an H2H Load with Hive and Apache Spark, it fails with the following error:
Error executing statement : java.lang.RuntimeException: Cannot create staging directory 'hdfs://<HOST_NAME>/user/hive/warehouse/hivedatadom.db/dim_account/.hive-staging_hive_2020-07-06_22-44-57_448_3115454008595470139-1': Permission denied: user=<USER_NAME>, access=WRITE, inode="/user/hive/warehouse/hivedatadom.db/dim_account":hive:hive:drwxrwxr-x
This occurs because the executing user does not have write access to the Hive warehouse directory in HDFS. Grant the logged-in user the required permissions on the Hive database storage so that the user can access it and perform the load.
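For example, on clusters where HDFS ACLs are enabled, write access on the affected warehouse path can typically be granted with a command along the lines of hdfs dfs -setfacl -R -m user:<USER_NAME>:rwx /user/hive/warehouse/hivedatadom.db (or, more coarsely, with hdfs dfs -chmod -R 775 on that path after adding the user to the hive group). These commands are illustrative; on a secured cluster the equivalent grant is usually managed through Ranger or Sentry policies instead.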