Getting Started with Spark-Submit and CLI

A tutorial to help you get started running a Spark application in Data Flow using spark-submit and the execute string from the CLI.

Follow the existing tutorial for Getting Started with Oracle Cloud Infrastructure Data Flow, but use the CLI to run spark-submit commands.

Before You Begin

Complete some prerequisites and set up authentication before you can use spark-submit commands in Data Flow with the CLI.

  1. Complete the prerequisites to use spark-submit with the CLI.
  2. Set up authentication to use spark-submit with the CLI.
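
If the CLI itself isn't installed yet, one common route is Oracle's installer script, followed by a version check to confirm that the oci binary is on your PATH. This is a minimal sketch; the installer is interactive and the exact output varies by version.

 # Install the OCI CLI using Oracle's installer script
 $ bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"
 # Confirm the installation
 $ oci --version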

Authentication to Use Spark-submit with CLI

Set up authentication to use spark-submit with the CLI.

When the prerequisites to use spark-submit with the CLI are complete, and the CLI is installed, set up an authentication profile with the following command:
 $ oci session authenticate
 
    - Select the intended region from the provided list of regions.
    - Please switch to newly opened browser window to log in!
    - Completed browser authentication process!
    - Enter the name of the profile you would like to create: <profile_name> ex. oci-cli
    - Config written to: ~/.oci/config
 
    - Try out your newly created session credentials with the following example command:
             $ oci iam region list --config-file ~/.oci/config --profile <profile_name> --auth security_token
A profile is created in your ~/.oci/config file. Use the profile name to run the tutorial.
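
Session tokens created this way are short-lived. If a later command fails to authenticate, you can check the token and renew it. This is a sketch assuming the session commands available in a current CLI version, with <profile_name> standing in for the profile you created above:

 # Check whether the session token is still valid
 $ oci session validate --config-file ~/.oci/config --profile <profile_name> --auth security_token
 # Renew the session token without repeating the browser login
 $ oci session refresh --profile <profile_name>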

1. Create the Java Application Using Spark-Submit and CLI

Use Spark-submit and the CLI to complete the tutorial exercises.

Use spark-submit and CLI to complete the first exercise, ETL with Java, from the Getting Started with Oracle Cloud Infrastructure Data Flow tutorial.
  1. Set up your tenancy.
  2. If you don't have a bucket in Object Storage where you can save your input and results, you must create a bucket with a suitable folder structure. In this example, the folder structure is /output/tutorial1. (Example CLI commands for this setup follow these steps.)
  3. Run this code:
    oci --profile <profile-name> --auth security_token data-flow run submit \
    --compartment-id <compartment-id> \
    --display-name Tutorial_1_ETL_Java \
    --execute '
        --class convert.Convert 
        --files oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/kaggle_berlin_airbnb_listings_summary.csv 
        oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar \
        kaggle_berlin_airbnb_listings_summary.csv oci://<bucket-name>@<namespace-name>/output/tutorial1'
    If you have run this tutorial before, delete the contents of the output directory, oci://<bucket-name>@<namespace-name>/output/tutorial1, to prevent the tutorial from failing. (See the example commands after these steps.)
    Note

    To find the compartment-id, from the navigation menu, click Identity and click Compartments. The compartments available to you are listed, including the OCID of each. You can also list them from the CLI, as shown below.
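
The Console steps in this exercise have CLI equivalents. The commands below are a sketch rather than part of the official tutorial: <tenancy-ocid>, <compartment-id>, <bucket-name>, and <run-ocid> are placeholders you supply, and the output prefix matches the /output/tutorial1 folder structure used in this example.

    # Find your Object Storage namespace and create the bucket for results
    oci --profile <profile-name> --auth security_token os ns get
    oci --profile <profile-name> --auth security_token os bucket create \
        --compartment-id <compartment-id> --name <bucket-name>

    # Clear the output from a previous run (prompts for confirmation)
    oci --profile <profile-name> --auth security_token os object bulk-delete \
        --bucket-name <bucket-name> --prefix output/tutorial1/

    # List compartments, including the OCID of each, without opening the Console
    oci --profile <profile-name> --auth security_token iam compartment list \
        --compartment-id <tenancy-ocid> --all

    # The run submit command returns a run OCID; check its lifecycle state until it is SUCCEEDED
    oci --profile <profile-name> --auth security_token data-flow run get --run-id <run-ocid>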

2. Machine Learning with PySpark

Use Spark-submit and the CLI to carry out machine learning with PySpark.

Complete exercise 3. Machine Learning with PySpark, from the Getting Started with Oracle Cloud Infrastructure Data Flow tutorial.
  1. Before attempting this exercise, complete 1. Create the Java Application Using Spark-Submit and CLI. Its results are used in this exercise; you can confirm they exist with the example commands after these steps.
  2. Run the following code:
    oci --profile <profile-name> --auth security_token data-flow run submit \
    --compartment-id <compartment-id> \
    --display-name Tutorial_3_PySpark_ML \
    --execute '
        oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow_lab_2019_pyspark_ml.py 
        oci://<bucket-name>@<namespace-name>/output/tutorial1'
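
Before you submit, you can confirm that the output from the first exercise is in place, and afterwards you can check on the run from the same session. These commands are a sketch, not part of the official tutorial; <compartment-id>, <bucket-name>, and <run-ocid> are placeholders you supply.

    # Confirm the ETL output that this exercise reads from
    oci --profile <profile-name> --auth security_token os object list \
        --bucket-name <bucket-name> --prefix output/tutorial1/

    # List recent runs in the compartment, then inspect the PySpark run by its OCID
    oci --profile <profile-name> --auth security_token data-flow run list --compartment-id <compartment-id>
    oci --profile <profile-name> --auth security_token data-flow run get --run-id <run-ocid>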

What's Next

Use Spark-submit and the CLI in other situations.

You can use spark-submit from the CLI to create and run Java, Python, or SQL applications with Data Flow, and explore the results. Data Flow handles all details of deployment, teardown, log management, security, and UI access. With Data Flow, you focus on developing Spark applications without worrying about the infrastructure.