Running Oozie Workflows with Spark Jobs Through Hue
You can run Oozie workflows with Spark jobs through Hue on Big Data Service clusters.
To run an Oozie workflow, use the Oozie editor to create and update it. Then, run the workflow in the Hue UI.
Hive and Spark are crucial for analytics, so you typically schedule an Oozie workflow for Spark jobs. Oozie runs a wide variety of job types through its action interface.
You can run Spark jobs on Hue in the following ways:
- Spark Action: Invoke the Spark job directly by providing the PySpark file or JAR to trigger, as sketched below.
- Shell Action: Use a shell script as a wrapper to invoke whatever is necessary, and run spark-submit as part of the Shell action.
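For orientation, the following is a minimal sketch of the Spark action element such a workflow generates in workflow.xml; the master, mode, application name, and file name are illustrative assumptions, not required values. A matching Shell action sketch appears in the Shell action section later.

```xml
<!-- Illustrative sketch only: a Spark action as it might appear in a
     generated workflow.xml. MySparkJob and my_job.py are placeholders. -->
<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.2">
        <master>yarn</master>
        <mode>cluster</mode>
        <name>MySparkJob</name>
        <jar>my_job.py</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```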
Oozie Workflows Using Hue
Use the Oozie editor to set up workflows for Big Data Service clusters.
Using the Oozie editor, set up an Oozie workflow. Select the appropriate action widget (Spark, Shell, or other) and add it to the workflow by dragging and dropping it into the workflow through the UI.
After the widget is added and all relevant details are provided, run the job using the play button in the Oozie editor. This triggers the Oozie workflow and starts the execution.
After job submission, view the details in the /user/hue/oozie/workspaces/<id> directory, which contains workflow.xml, job.properties, and all relevant details for the execution.
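As a hedged sketch, the overall shape of a generated workflow.xml is a workflow-app that wires a start node, the action (such as the Spark fragment shown earlier), and end/kill nodes; the names below are illustrative. The job.properties file alongside it supplies runtime settings, typically including oozie.wf.application.path, which points at this workspace.

```xml
<!-- Sketch of a workflow.xml skeleton; node and workflow names are illustrative. -->
<workflow-app name="my-workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <!-- Spark or Shell action body goes here -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```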
Using Oozie in Hue
Running a Spark Action
- Sign in to Hue.
- Create a script file and upload it to Hue.
- In the leftmost navigation menu, click Scheduler.
- Click Workflow, and then click My Workflow to create a workflow.
- Click the Spark program icon to drag the Spark action to the Drop your action here area.
- Select the Jar file or Python file from the Jar/py name dropdown.
- In Main class, specify the class entry point of the Spark application, along with the JAR file, for Spark Java/Scala applications.
- For HA clusters:
- In the Files field, add the keytab by clicking the plus icon.
- In the Options list, add the following parameters (see the sketch after these steps):
--principal <principal> --keytab <keytab>
- To access Hive tables in an HA environment, click the gear icon, and then click Credentials.
- Select hcat.
- Click the save icon.
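Taken together, the HA-related steps above might yield an action resembling this sketch; the principal, keytab, class, and JAR names are placeholders, and cred="hcat" corresponds to selecting the hcat credential:

```xml
<!-- Hypothetical Spark action for a Kerberos-enabled HA cluster;
     all names and the keytab are placeholders. -->
<action name="spark-ha-node" cred="hcat">
    <spark xmlns="uri:oozie:spark-action:0.2">
        <master>yarn</master>
        <mode>cluster</mode>
        <name>SparkHAJob</name>
        <class>com.example.SparkApp</class>
        <jar>spark-app.jar</jar>
        <spark-opts>--principal myuser@EXAMPLE.COM --keytab myuser.keytab</spark-opts>
        <file>myuser.keytab</file>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```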
Running a Shell Action
Using the Oozie editor, you can add more details to the Shell action, such as additional files and properties to be used during the Shell action.
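As a rough sketch, the Shell action that the steps below produce might look like the following in workflow.xml; run_spark.sh is a hypothetical wrapper script whose body is essentially a spark-submit command (for example, spark-submit --master yarn my_job.py), and the paths are placeholders:

```xml
<!-- Hypothetical Shell action (abbreviated); script and file paths are placeholders. -->
<action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <exec>run_spark.sh</exec>
        <file>/user/myuser/run_spark.sh</file>
        <file>/user/myuser/my_job.py</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```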
- Sign in to Hue.
- Create a script file and upload it to Hue.
- In the leftmost navigation menu, click Scheduler.
- Click Workflow, and then click My Workflow to create a workflow.
- Click the Shell program icon to drag the Shell action to the Drop your action here area.
- Provide the HDFS location of the shell script in the Shell Command section.
- In the Files section, add:
  - The HDFS location of the shell script
  - The HDFS location of the JAR or Python file that contains the Spark job
- Click the save icon.
- For HA clusters, grant permission through Ranger to access the following:
  - The /user/hue directory, so that the user can access the workspace directories from Hue
  - The /user/{user} directory, so that the user can place JAR or Python files in their own directory
Complete the following:
- Sign in to the Ranger UI and navigate to the Hue plugin.
- Select the policy to add the user.
- Under Allow Conditions, add the user or group to allow access.
- Click Save.