Executing H2H on Spark

Following are the configurations required for executing H2H on Spark:
  1. Register a cluster from DMT Configurations > Register Cluster with the following details:
    • Name - Enter the name of the target information domain of the H2H mapping.
    • Description - Enter a description for the cluster.
    • Livy Service URL - Enter the Livy Service URL used to connect to Spark from OFSAA (a sample connectivity check is shown at the end of this section).
  2. To execute H2H on Spark, set the EXECUTION_ENGINE_MODE parameter to SPARK from ICC or RRF.
    • Execution through Operations module - Pass [EXECUTION_ENGINE_MODE]=SPARK while defining the H2H tasks from the Task Definition window.

      For more information, see the Component: LOAD DATA section.

    • Execution through RRF module - Pass the following as a parameter while defining the H2H jobs from the Component Selector window:

      "EXECUTION_ENGINE_MODE","SPARK"

  3. Spark Session Management - In a batch execution, a new Spark session is created when the first H2H-Spark task is encountered, and the same Spark session is reused for the rest of the H2H-Spark tasks in the same Run. For the Spark session to be closed at the end of the run, set the CLOSE_SPARK_SESSION parameter to YES in the last H2H-Spark task in the batch (a session-lifecycle sketch is shown at the end of this section).
    • Execution through Operations module - Pass [CLOSE_SPARK_SESSION]=YES while defining the last H2H-Spark task from the Task Definition window.

      For more information, see the Component: LOAD DATA section.

    • Execution through RRF module - Pass the following as a parameter while defining the last H2H-Spark job from the Component Selector window:

      "CLOSE_SPARK_SESSION","YES"

    Note:

    1. Ensure that the task with "CLOSE_SPARK_SESSION","YES" has a lower precedence than all the other H2H-Spark tasks, so that it executes last in the batch.
    2. By default, the created Spark session is closed when any of the H2H-Spark tasks fails.
    3. Execution of an H2H with a large number of mappings may fail because Spark restricts the length of the SQL statement (the code passed to spark.sql) to a maximum of 65535 characters (2^16 - 1).
    4. When you run an H2H Load with Hive and Apache Spark, it fails with the following error:

      Error executing statement : java.lang.RuntimeException: Cannot create staging directory 'hdfs://<HOST_NAME>/user/hive/warehouse/hivedatadom.db/dim_account/.hive-staging_hive_2020-07-06_22-44-57_448_3115454008595470139-1': Permission denied: user=<USER_NAME>, access=WRITE, inode="/user/hive/warehouse/hivedatadom.db/dim_account":hive:hive:drwxrwxr-x

      To resolve this, provide the required permissions to the logged-in user on the Hive database storage (the HDFS warehouse directory), enabling the user to access and write to it. A sample command is shown at the end of this section.
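The Livy Service URL from Step 1 typically takes the form http://<livy_host>:8998, where the host and port are deployment-specific. As a quick sanity check before registering the cluster, you can confirm that the URL is reachable through Livy's REST API; the following Python sketch is illustrative only, and the URL shown is an assumption.

  import requests

  # Hypothetical Livy Service URL; replace with the value registered in OFSAA.
  LIVY_URL = "http://livy-host:8998"

  # Livy lists active sessions at GET /sessions; an HTTP 200 response confirms
  # the Livy server is reachable from this machine.
  response = requests.get(f"{LIVY_URL}/sessions", timeout=10)
  response.raise_for_status()
  print("Livy reachable; active sessions:", response.json().get("total", 0))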
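Step 3 describes one Spark session being created for the first H2H-Spark task, reused by the subsequent tasks, and closed when CLOSE_SPARK_SESSION is YES. The sketch below illustrates that same lifecycle directly against Livy's REST API; it is not the OFSAA implementation, and the URL, session kind, and statements are assumptions for illustration.

  import time
  import requests

  LIVY_URL = "http://livy-host:8998"  # hypothetical Livy Service URL

  # Create one Spark session, analogous to the first H2H-Spark task in a batch.
  session_id = requests.post(f"{LIVY_URL}/sessions", json={"kind": "sql"}).json()["id"]

  # Wait until the session is idle and ready to accept statements.
  while requests.get(f"{LIVY_URL}/sessions/{session_id}").json()["state"] != "idle":
      time.sleep(5)

  # Reuse the same session for several statements, analogous to the remaining
  # H2H-Spark tasks in the same Run reusing one Spark session.
  for sql in ("SHOW DATABASES", "SHOW TABLES"):
      requests.post(f"{LIVY_URL}/sessions/{session_id}/statements", json={"code": sql})

  # Close the session at the end of the run, analogous to setting
  # CLOSE_SPARK_SESSION to YES on the last H2H-Spark task.
  requests.delete(f"{LIVY_URL}/sessions/{session_id}")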
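For the permission error in point 4 of the Note, one way to grant the required access is an HDFS ACL on the affected warehouse directory. The sketch below shells out to the standard hdfs CLI; the user name is a placeholder, the path is taken from the sample error above, and sites that manage access through Ranger or Sentry should use those tools instead.

  import subprocess

  user = "ofsaa_user"  # placeholder for the logged-in OFSAA user
  path = "/user/hive/warehouse/hivedatadom.db/dim_account"

  # Grant the user read/write/execute on the directory via an HDFS ACL
  # (requires dfs.namenode.acls.enabled=true on the cluster).
  subprocess.run(
      ["hdfs", "dfs", "-setfacl", "-m", f"user:{user}:rwx", path],
      check=True,
  )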