Run Applications

Learn how to run the applications you have created in Data Flow, provide argument and parameter values, review the results, and diagnose and tune the runs, including providing JVM options.

Understand Runs

Every time a Data Flow Application is executed, a Data Flow Run is created. The Data Flow Run captures and securely stores the application's output, logs, and statistics. The output is saved so it can be viewed by anyone with the correct permissions using the UI or REST API. Runs also give you secure access to the Spark UI for debugging and diagnostics.

Run Arguments and Parameters

For Python, Java, and Scala Applications, arguments are the equivalent of the application command line. SQL Applications do not use arguments. Parameters are variables that you can set at run time to influence the behavior of the Spark application. When you run an Application, parameter values are substituted into the arguments string, and the resulting string is used as the command line when launching the Spark application.

If a parameter value contains spaces, you must enclose it in double quotes. For example: "Jean Pierre". If the value itself contains a double quote, escape it with a backslash. For example: "Jean Pierre \"JP\" Berger".
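For illustration, suppose an Application's arguments are defined as -N ${name}, as in the Python example later in this section. If you set the name parameter to "Jean Pierre \"JP\" Berger" at run time, the substituted command line passed to the application is:

    -N "Jean Pierre \"JP\" Berger"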
Arguments and Parameters
Technology    | Style                    | Accessing Parameters
Java or Scala | Arguments and Parameters | Parameters are substituted into command line arguments. Your Spark app will need a command line parser such as commons-cli or any similar library.
PySpark       | Arguments and Parameters | Parameters are substituted into command line arguments. Your Python app will need a command line parser such as argparse, click, or any similar library.
SparkSQL      | Parameters only          | Parameters are substituted into your SQL code before it executes.
Here are some examples of using arguments and parameters.
Example: Arguments and Parameters in Java or Scala Applications
This Application passes a single command-line argument, parameterized as ${source_dir}, which you need to set at run time.

Example: Arguments and Parameters in Python Applications
This Application passes the command string -N ${name}. You need to provide a value for ${name} when you run the Application.
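The Application itself has to parse that command line. Below is a minimal, illustrative PySpark script (not the Application shown in this example) that reads the -N argument with argparse.

    # Minimal sketch of a PySpark application that parses the -N argument
    # produced by substituting the ${name} parameter.
    import argparse

    from pyspark.sql import SparkSession

    parser = argparse.ArgumentParser()
    parser.add_argument("-N", "--name", required=True,
                        help="value supplied for the ${name} parameter")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("ParametersExample").getOrCreate()
    # The value is available to the application; here it is simply written
    # to stdout, which appears in the Run's application logs.
    print("Hello, {}!".format(args.name))
    spark.stop()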

Example: Parameters in SQL Applications
SQL applications only support Parameters. The Parameters are substituted into the text of the SQL script before it executes. This is equivalent to using the -d option of the spark-sql command line interface. You can use more than one parameter.

Run Resource Configuration

Every time you run a Data Flow Application, you can customize the Driver Shape, the Executor Shape, and the Number of Executors. Use of these resources is subject to the Data Flow service limits configured by the Data Flow administrator. Data Flow automatically configures the Spark driver and executors to consume all available resources based on the VM shapes you choose. When a Run completes, the resources are shut down automatically; if you kill a Run, the resources are released automatically as well.
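If you start Runs programmatically rather than from the console, the same resource settings can be supplied when the Run is created. The following is a sketch using the OCI Python SDK; the OCIDs and shape names are placeholders, and the field set should be verified against the SDK reference for your version.

    # Sketch only: starting a Run with explicit resource settings through the
    # OCI Python SDK. OCIDs and shape names are placeholders.
    import oci

    client = oci.data_flow.DataFlowClient(oci.config.from_file())

    details = oci.data_flow.models.CreateRunDetails(
        compartment_id="ocid1.compartment.oc1..example",
        application_id="ocid1.dataflowapplication.oc1..example",
        display_name="example_run",
        driver_shape="VM.Standard2.1",    # Driver Shape
        executor_shape="VM.Standard2.1",  # Executor Shape
        num_executors=2,                  # Number of Executors
    )

    run = client.create_run(details).data
    print(run.id, run.lifecycle_state)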

Supported Spark Properties and JVM Options in Data Flow

For every run of a Data Flow application, you can add Spark properties and JVM options in the Spark Configuration Properties field. Not all Spark properties and JVM options are allowed; the supported options are listed below, and a programmatic example follows the list.

spark.driver.maxResultSize
spark.extraListeners
spark.logConf
spark.driver.extraJavaOptions
spark.executor.extraJavaOptions
spark.redaction.regex
spark.python.profile
spark.python.worker.memory
spark.reducer.maxSizeInFlight
spark.reducer.maxReqsInFlight
spark.reducer.maxBlocksInFlightPerAddress
spark.maxRemoteBlockSizeFetchToMem
spark.shuffle.compress
spark.shuffle.file.buffer
spark.shuffle.io.maxRetries
spark.shuffle.io.numConnectionsPerPeer
spark.shuffle.io.preferDirectBufs
spark.shuffle.io.retryWait
spark.shuffle.io.backLog
spark.shuffle.service.enabled
spark.shuffle.service.port
spark.shuffle.service.index.cache.size
spark.shuffle.maxChunksBeingTransferred
spark.shuffle.sort.bypassMergeThreshold
spark.shuffle.spill.compress
spark.shuffle.accurateBlockThreshold
spark.shuffle.registration.timeout
spark.shuffle.registration.maxAttempts
spark.eventLog.logBlockUpdates.enabled
spark.eventLog.longForm.enabled
spark.ui.dagGraph.retainedRootRDDs
spark.ui.killEnabled
spark.ui.liveUpdate.minFlushPeriod
spark.ui.retainedJobs
spark.ui.retainedStages
spark.ui.retainedTasks
spark.ui.showConsoleProgress
spark.worker.ui.retainedExecutors
spark.worker.ui.retainedDrivers
spark.sql.ui.retainedExecutions
spark.ui.retainedDeadExecutors
spark.ui.filters
spark.ui.requestHeaderSize
spark.broadcast.compress
spark.checkpoint.compress
spark.io.compression.codec
spark.io.compression.lz4.blockSize
spark.io.compression.snappy.blockSize
spark.io.compression.zstd.level
spark.io.compression.zstd.bufferSize
spark.kryo.classesToRegister
spark.kryo.referenceTracking
spark.kryo.registrationRequired
spark.kryo.registrator
spark.kryo.unsafe
spark.kryoserializer.buffer.max
spark.kryoserializer.buffer
spark.rdd.compress
spark.serializer
spark.serializer.objectStreamReset
spark.memory.fraction
spark.memory.storageFraction
spark.memory.offHeap.enabled
spark.memory.offHeap.size
spark.memory.useLegacyMode
spark.storage.replication.proactive
spark.cleaner.periodicGC.interval
spark.cleaner.referenceTracking
spark.cleaner.referenceTracking.blocking
spark.cleaner.referenceTracking.blocking.shuffle
spark.cleaner.referenceTracking.cleanCheckpoints
spark.broadcast.blockSize
spark.broadcast.checksum
spark.default.parallelism
spark.executor.heartbeatInterval
spark.files.fetchTimeout
spark.files.useFetchCache
spark.files.overwrite
spark.hadoop.cloneConf
spark.storage.memoryMapThreshold
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
spark.rpc.message.maxSize
spark.blockManager.port
spark.driver.blockManager.port
spark.rpc.io.backLog
spark.network.timeout
spark.port.maxRetries
spark.rpc.numRetries
spark.rpc.retry.wait
spark.rpc.askTimeout
spark.rpc.lookupTimeout
spark.core.connection.ack.wait.timeout
spark.scheduler.maxRegisteredResourcesWaitingTime
spark.scheduler.minRegisteredResourcesRatio
spark.scheduler.mode
spark.scheduler.revive.interval
spark.scheduler.listenerbus.eventqueue.capacity
spark.scheduler.blacklist.unschedulableTaskSetTimeout
spark.blacklist.enabled
spark.blacklist.timeout
spark.blacklist.task.maxTaskAttemptsPerExecutor
spark.blacklist.task.maxTaskAttemptsPerNode
spark.blacklist.stage.maxFailedTasksPerExecutor
spark.blacklist.stage.maxFailedExecutorsPerNode
spark.blacklist.application.maxFailedTasksPerExecutor
spark.blacklist.application.maxFailedExecutorsPerNode
spark.blacklist.killBlacklistedExecutors
spark.blacklist.application.fetchFailure.enabled
spark.speculation
spark.speculation.interval
spark.speculation.multiplier
spark.speculation.quantile
spark.task.maxFailures
spark.task.reaper.enabled
spark.task.reaper.pollingInterval
spark.task.reaper.threadDump
spark.task.reaper.killTimeout
spark.stage.maxConsecutiveAttempts
For more information about these properties, see the Spark Configuration Guide.
Important

When you’re running in Data Flow, do not change the value of spark.master. If you do, your job will not use all the resources you provisioned.
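If you create Runs programmatically with the OCI Python SDK, these properties can be supplied as a key/value map. The sketch below is illustrative only: the configuration field name and the values shown are assumptions to check against the SDK reference, and only the properties listed above are accepted.

    # Sketch only: passing allowed Spark properties and JVM options when
    # creating a Run. The `configuration` field name is an assumption.
    import oci

    client = oci.data_flow.DataFlowClient(oci.config.from_file())

    details = oci.data_flow.models.CreateRunDetails(
        compartment_id="ocid1.compartment.oc1..example",
        application_id="ocid1.dataflowapplication.oc1..example",
        display_name="run_with_spark_properties",
        configuration={
            "spark.driver.maxResultSize": "2g",
            "spark.shuffle.io.maxRetries": "10",
            # JVM options are passed through the extraJavaOptions properties.
            "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
        },
    )

    print(client.create_run(details).data.lifecycle_state)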

View Data Flow Runs

See Runs In Data Flow

Click Runs in the Data Flow menu, or from the Dashboard. The resulting page displays a nine-column table that lists each Run; Runs from the previous six months are available to view. Depending on the status of a Run, you see a button to kill it or re-run it. At the end of each Run's row is a menu from which you can view the Run details, launch the Spark UI, or re-run the application.

You can also view the Run details by clicking the Run's name.

Search, Sort, Or Filter Runs In Data Flow

You can filter, search, and sort the list of Runs in a number of ways.

In the menu on the left-hand side is a filters section. You can filter on the State of your Runs and the Language used; both are drop-down lists. You can also set a date range during which Runs were created using the Created Start Date and Created End Date fields. Finally, you can filter by Run name by entering all or part of a name in the Name Prefix field. You can filter on one, some, or all of these options, and you can clear all the fields to remove the filters.

The State filter gives the options of Any state, Accepted, In Progress, Canceling, Canceled, Failed, or Succeeded.

The Language filter lets you filter by All, Java, Python, SQL, or Scala.

The Created Start Date and Created End Date fields let you pick a date from a calendar, along with a time (UTC). The calendar displays the current month but lets you navigate to previous months, and quick links let you choose today's date, yesterday's date, or the past three days.

If you have applied tags to your Runs, you can filter on these tags in the Tag Filters section. You can clear these tag filters.

These filtering options also allow you to search for a Run if you can't remember the specifics of it; for example, you know it was created last week, but can't remember exactly when.

You can sort the list of Runs by Created date, either ascending or descending.
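The same filters are available when listing Runs through the OCI Python SDK. The sketch below is illustrative only: the filter parameter names mirror the ListRuns REST operation but should be treated as assumptions and verified against the SDK reference.

    # Sketch only: listing Runs with server-side filters. Parameter names are
    # assumptions based on the ListRuns operation; verify before relying on them.
    import oci

    client = oci.data_flow.DataFlowClient(oci.config.from_file())

    runs = client.list_runs(
        compartment_id="ocid1.compartment.oc1..example",
        lifecycle_state="SUCCEEDED",         # State filter
        display_name_starts_with="daily_",   # Name Prefix filter
        sort_by="timeCreated",               # sort by Created date
        sort_order="DESC",
    ).data

    for run in runs:
        print(run.display_name, run.lifecycle_state, run.time_created)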

View Run Details

From the list of Runs, either click the Run name, or select View Details from the menu at the end of the table row. A new page displays general information about the Run, including a link to the Spark UI, plus links to any related Runs. Instead of viewing related Runs, you can list the log files generated for the Run and either view them in the browser or download them. Run data is written to the tenancy location, and also to an external location if you have specified one.

Each run of an application generates a set of output logs: application logs, such as stdout and stderr, and diagnostic logs, such as Spark driver and executor logs. Diagnostic logs are uploaded every 10 to 12 minutes during the Run, so depending on how long the Run takes they might not all be available before it completes, and a Run can have more than one set of diagnostic logs. Application logs are available immediately after the Run completes.
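You can also retrieve a Run's log files through the API. The following OCI Python SDK sketch uses the ListRunLogs and GetRunLog operations; treat the exact method names and response handling as assumptions and check the SDK reference.

    # Sketch only: downloading a Run's log files. Method names and response
    # handling are assumptions; verify against the OCI Python SDK reference.
    import oci

    client = oci.data_flow.DataFlowClient(oci.config.from_file())
    run_id = "ocid1.dataflowrun.oc1..example"  # placeholder Run OCID

    for log in client.list_run_logs(run_id).data:
        # log.name is expected to identify files such as stdout, stderr, or
        # driver/executor diagnostic logs.
        response = client.get_run_log(run_id, log.name)
        with open(log.name, "wb") as f:
            f.write(response.data.content)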

Diagnose and Tune Runs

Diagnosing and tuning Data Flow Runs relies on the Spark UI, which is easily accessed within Data Flow. While your application is running, select the Spark UI action to view the Spark UI. For completed Runs, you can access the Run's Spark UI from the hamburger menu on the right-hand side. Similarly, you can access the logs for each completed Run. The logs are written to the tenancy location, where they are retained for seven days from run time, and also to an external location if you have specified one.
Note

The Spark UI and Run logs are only accessible by the Run’s owner or your Data Flow administrators.

Run Security

Spark applications that run with Data Flow use the same IAM permissions as the user who initiates the run. The Data Flow service creates a security token in the Spark cluster that allows it to assume the identity of the running user. This means the Spark application can access data transparently based on the end user's IAM permissions, and there is no need to hard-code credentials in your Spark application when you access IAM-compatible systems.

Illustration of the security used in an Apache Spark run
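For example, a PySpark application running in Data Flow can typically read Object Storage data with a plain oci:// path and no embedded keys, because the Run carries the identity of the user who started it. The bucket, namespace, and path below are placeholders.

    # Sketch: reading Object Storage from inside a Data Flow Run with no
    # hard-coded credentials; access is governed by the initiating user's
    # IAM permissions. Bucket, namespace, and path are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("IamAccessExample").getOrCreate()

    df = spark.read.csv("oci://my-bucket@my-namespace/input/data.csv",
                        header=True)
    df.show(10)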

If the service you are contacting is not IAM-compatible, you need to use a credential management or key management solution, such as Oracle Cloud Infrastructure Key Management.

Learn more about Oracle Cloud Infrastructure IAM in the IAM documentation.

Move a Run to a new Compartment

You can change the compartment for a Run in two ways.
  • From the Run page:
    1. In the Runs page, click the Action icon for the Run you want to move.
    2. From the menu, click Move Resource.
  • From the Run Details page:
    1. In the Runs page, click on the name of the Run you want to move, to take you to the Run Details page.
    2. In the Run Details page, click Move Resource.
In both cases, the Move Resource to a Different Compartment dialog box is displayed. A drop-down list shows the current compartment; click it and select a new compartment, then click Move Resource.
Note

A Run can be moved to a different compartment only when its state is Canceled, Finished, or Succeeded. For any other state, Move Resource is not available in the Action icon menu or on the Run Details page.