Spark-Submit Functionality in Data Flow
Find out how to use Data Flow with Spark-submit.
Spark-Submit Compatibility
You can use spark-submit compatible options to run your applications using Data Flow.
Spark-submit is an industry standard command for running applications on Spark clusters. The following spark-submit compatible options are supported by Data Flow:
- --conf
- --files
- --py-files
- --jars
- --class
- --driver-java-options
- --packages
- main-application.jar or main-application.py
- arguments to main-application. Arguments passed to the main method of your main class (if any).
The --files option flattens your file hierarchy, so all files are placed at the same level in the current working directory. To keep the file hierarchy, use either archive.zip, or --py-files with a JAR, ZIP, or EGG dependency module.
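For example, here's a minimal sketch of keeping Python dependencies in a ZIP and passing it with --py-files in a run submit command. The bucket, namespace, deps.zip, and main_app.py names are placeholders, not files referenced elsewhere in this guide:
oci data-flow run submit \
--compartment-id <compartment-id> \
--execute "--py-files oci://<bucket-name>@<namespace>/deps.zip
 oci://<bucket-name>@<namespace>/main_app.py"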
The --packages option includes any other dependencies by supplying a comma-delimited list of Maven coordinates. For example:
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.2
With the --packages option, each Run's driver pod must download the dependencies dynamically, which relies on network stability and access to Maven Central or other remote repositories. Use the Data Flow Dependency Packager to generate a dependency archive for production.
For all spark-submit options on Data Flow, the URI must begin with oci://.... URIs starting with local://... or hdfs://... aren't supported. Use fully qualified domain names (FQDN) in the URI. Load all files, including main-application, to Oracle Cloud Infrastructure Object Storage.
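As a sketch only (the compartment, bucket, namespace, and streaming_app.py names are placeholders), a run submit command that pulls the Kafka connector with --packages could look like this:
oci data-flow run submit \
--compartment-id <compartment-id> \
--execute "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.2
 --conf spark.sql.crossJoin.enabled=true
 oci://<bucket-name>@<namespace>/streaming_app.py"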
Creating a Spark-Submit Data Flow Application explains how to create an application in the Console using spark-submit. You can also use spark-submit with the Java SDK or from the CLI. If you use the CLI, you don't have to create a Data Flow Application to run your Spark application with spark-submit compatible options on Data Flow. This is useful if you already have a working spark-submit command in a different environment. When you follow the syntax of the run submit command, an Application is created, if one doesn't already exist in the main-application URI.
Installing Public CLI with the run submit Command
These steps install a public CLI with the run submit command for use with Data Flow.
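As a rough sketch, assuming a machine with Python 3 and pip available (the virtual environment name is arbitrary), installation and configuration typically look like this:
# Create and activate an isolated environment for the CLI (optional)
python3 -m venv oci-cli-env
source oci-cli-env/bin/activate
# Install the public OCI CLI, which includes the data-flow run submit command
pip install oci-cli
# Generate ~/.oci/config interactively (user OCID, tenancy OCID, region, API key)
oci setup config
# Verify that the run submit command is available
oci data-flow run submit --help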
Using Spark-submit in Data Flow
You can take your spark-submit CLI command and convert it into a compatible Data Flow run submit command. If you already have a working Spark application in any cluster, you're familiar with the spark-submit syntax. For example:
spark-submit --master spark://<IP-address>:port \
--deploy-mode cluster \
--conf spark.sql.crossJoin.enabled=true \
--files oci://file1.json \
--class org.apache.spark.examples.SparkPi \
--jars oci://file2.jar <path_to>/main_application_with-dependencies.jar 1000
The equivalent Data Flow run submit command is:
oci data-flow run submit \
--compartment-id <compartment-id> \
--execute "--conf spark.sql.crossJoin.enabled=true
 --files oci://<bucket-name>@<namespace>/path/to/file1.json
 --jars oci://<bucket-name>@<namespace>/path/to/file2.jar
 oci://<bucket-name>@<namespace>/path_to_main_application_with-dependencies.jar 1000"
To convert an existing spark-submit command to run on Data Flow:
- Upload all the files, including the main application, to Object Storage.
- Replace the existing URIs with the corresponding oci://... URIs.
- Remove any unsupported or reserved spark-submit parameters. For example, --master and --deploy-mode are reserved for Data Flow, so you don't need to set them.
- Add the --execute parameter and pass in a spark-submit compatible command string. To build the --execute string, keep the supported spark-submit parameters, the main-application, and its arguments in sequence. Put them inside a quoted string (single quotes or double quotes).
- Replace spark-submit with the Oracle Cloud Infrastructure standard command prefix, oci data-flow run submit.
- Add the Oracle Cloud Infrastructure mandatory argument and parameter pairs for --profile, --auth security_token, and --compartment-id.
Run Submit Examples
Some examples of run submit in Data Flow.
Oci-cli Examples
Examples of run submit using oci-cli in Data Flow.
oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--execute "--conf spark.sql.crossJoin.enabled=true
 --class org.apache.spark.examples.SparkPi
 oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"
With --jars, --files, and --py-files:
oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--execute "--jars oci://<bucket-name>@<tenancy-name>/a.jar
 --files \"oci://<bucket-name>@<tenancy-name>/b.json\"
 --py-files oci://<bucket-name>@<tenancy-name>/c.py
 --conf spark.sql.crossJoin.enabled=true
 --class org.apache.spark.examples.SparkPi
 oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"
With archiveUri, --jars, --files, and --py-files:
oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--archive-uri  "oci://<bucket-name>@<tenancy-name>/mmlspark_original.zip" \
--execute "--jars local:///opt/dataflow/java/mmlspark_2.11-0.18.1.jar
 --files \"local:///opt/dataflow/java/mmlspark_2.11-0.18.1.jar\"
 --py-files local:///opt/dataflow/java/mmlspark_2.11-0.18.1.jar
 --conf spark.sql.crossJoin.enabled=true
 --class org.apache.spark.examples.SparkPi
 oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"
With jars, files, and py-files, where the referenced fake.jar doesn't exist in Object Storage, so the request fails with an invalid URI error:
oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--archive-uri  "oci://<bucket-name>@<tenancy-name>/mmlspark_original.zip" \
--execute "--jars oci://<bucket-name>@<tenancy-name>/fake.jar
 --conf spark.sql.crossJoin.enabled=true
 --class org.apache.spark.examples.SparkPi
 oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"
#result
{'opc-request-id': '<opc-request-id>', 'code': 'InvalidParameter',
 'message': 'Invalid OCI Object Storage uri. The object was not found or you are not authorized to access it.
 {ResultCode: OBJECTSTORAGE_URI_INVALID,
 Parameters: [oci://<bucket-name>@<tenancy-name>/fake.jar]}', 'status': 400}
To enable Resource Principal authentication, add the dataflow.auth Spark property as a spark-submit conf option by including the following configuration in the --execute string:
--execute "--conf dataflow.auth=resource_principal --conf other-spark-property=other-value"
Oci-curl Example
An example of run submit using oci-curl in Data Flow.
oci-curl <IP-Address>:443 POST /Users/<user-name>/workspace/sss/dependency_test/spark-submit-test.json /latest/runs --insecure --noproxy <IP-Address>
The JSON body file, spark-submit-test.json, contains:
{
"execute": "--jars local:///opt/dataflow/java/mmlspark_2.11-0.18.1.jar
 --files \"local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar\"
 --py-files local:///opt/spark/conf/spark.properties
 --conf spark.sql.crossJoin.enabled=true
 --class org.apache.spark.examples.SparkPi
     oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10",
"displayName": "spark-submit-test",
"sparkVersion": "2.4",
"driverShape": "VM.Standard2.1",
"executorShape": "VM.Standard2.1",
"numExecutors": 1,
"logsBucketUri": "",
"freeformTags": {},
"definedTags": {}
}