Data Flow Overview

Learn about Data Flow and how you can use it to easily create, share, run, and view the output of Apache Spark applications.

Figure: The Data Flow architecture. The User Layer contains Applications, the Library, and Runs; the Administrator Layer provides administrator controls for access policies and usage limits; the Infrastructure Layer provides elastic compute and elastic storage; and the Security Layer provides identity and access management.

Data Flow Concepts

An understanding of these concepts is essential for using Data Flow.

Data Flow Applications
An Application is a reusable Spark application template consisting of a Spark application, its dependencies, default parameters, and a default run-time resource specification. After a developer creates a Data Flow Application, anyone can use it without worrying about the complexities of deploying, configuring, or running it.
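
As a sketch of what defining such a template looks like programmatically, the following uses the Data Flow client from the OCI Python SDK; the compartment OCID, bucket path, Spark version, and shape names are hypothetical placeholders, and the field names follow the SDK's CreateApplicationDetails model.

```python
import oci

# Authenticate with the default OCI config file (~/.oci/config).
config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

# Define the reusable template: the Spark file, its language, and the
# default run-time resource specification. All OCIDs, paths, and shape
# names here are hypothetical placeholders.
details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="ocid1.compartment.oc1..example",
    display_name="daily-report",
    language="PYTHON",
    file_uri="oci://my-bucket@my-namespace/apps/daily_report.py",
    spark_version="3.2.1",
    driver_shape="VM.Standard2.1",
    executor_shape="VM.Standard2.1",
    num_executors=2,
)

app = client.create_application(details).data
print(app.id)  # the Application OCID, used later to create Runs
```
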
Data Flow Library
The Library is the central repository of Data Flow Applications. Anyone with the correct permissions in the Data Flow system can browse, search, and run Applications published to the Library.
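
For example, browsing the Library programmatically might look like this minimal sketch with the OCI Python SDK; the compartment OCID is a hypothetical placeholder.

```python
import oci

config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

# List the Applications visible to you in a compartment (hypothetical
# OCID); which Applications you can see is governed by IAM policies.
apps = client.list_applications(
    compartment_id="ocid1.compartment.oc1..example"
).data

for app in apps:
    print(app.id, app.display_name, app.language)
```
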
Data Flow Runs
Every time a Data Flow Application is run, a Run is created. A Run captures the Application's output, logs, and run-time statistics, which are automatically and securely stored. Output is saved so that anyone with the correct permissions can view it through the UI or the REST API. Runs also give you secure access to the Spark UI for debugging and diagnostics.
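
As an illustration, a Run could be launched from an existing Application and polled to completion with the OCI Python SDK, as in this sketch; the OCIDs are hypothetical, and the per-run executor override and lifecycle-state names follow the SDK's CreateRunDetails and Run models.

```python
import time
import oci

config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

# Launch a Run from an existing Application (hypothetical OCIDs),
# overriding the default executor count for this execution only.
run = client.create_run(oci.data_flow.models.CreateRunDetails(
    application_id="ocid1.dataflowapplication.oc1..example",
    compartment_id="ocid1.compartment.oc1..example",
    display_name="daily-report-manual-run",
    num_executors=4,
)).data

# Poll until the Run reaches a terminal state; output, logs, and
# statistics are captured and stored automatically.
while run.lifecycle_state in ("ACCEPTED", "IN_PROGRESS"):
    time.sleep(30)
    run = client.get_run(run.id).data

print(run.lifecycle_state)  # e.g. SUCCEEDED or FAILED
```
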
Elastic Compute
Every time you run a Data Flow Application, you choose how large the cluster should be. Data Flow allocates your VMs, runs your job, securely captures all output, and shuts the cluster down. You have nothing to maintain in Data Flow: clusters run only when there is real work to do.
Elastic Storage
Data Flow works with the Oracle Cloud Infrastructure Object Storage service. For more information, see the Overview of Object Storage.
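
For example, a Spark job running in Data Flow can read from and write to Object Storage directly through the oci:// filesystem scheme, as in this PySpark sketch; the bucket, namespace, and the "status" column are hypothetical.

```python
from pyspark.sql import SparkSession

# Inside Data Flow, Spark can address Object Storage directly through
# the oci:// filesystem scheme. Bucket, namespace, and the "status"
# column below are hypothetical.
spark = SparkSession.builder.appName("object-storage-example").getOrCreate()

# Read CSV input from a bucket and write filtered results back as Parquet.
df = spark.read.option("header", "true").csv("oci://my-bucket@my-namespace/input/")
df.filter(df["status"] == "ACTIVE") \
  .write.parquet("oci://my-bucket@my-namespace/output/")
```
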
Security
Data Flow is integrated with Oracle Cloud Infrastructure Identity and Access Management (IAM) for authentication and authorization. Your Spark applications run on behalf of the person who launches them, so a Spark application has the same privileges as the end user, and you do not need credentials to access any IAM-capable system. In addition, Data Flow benefits from all the other security attributes of Oracle Cloud Infrastructure, including transparent encryption of data at rest and in transit.
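
As a sketch, access is typically granted through IAM policy statements over Data Flow's resource types; the group and compartment names below are hypothetical, and the dataflow-application and dataflow-run resource types are the ones IAM uses for this service.

```
allow group data-flow-users to manage dataflow-application in compartment data-flow-compartment
allow group data-flow-users to manage dataflow-run in compartment data-flow-compartment
allow group data-flow-users to read objects in compartment data-flow-compartment
```
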
Administrator Controls
Data Flow lets you set service limits and create administrators who have full control over all Applications and Runs. You remain in control regardless of how many users you have.
Apache Spark
Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
Spark Application
A Spark Application uses the Spark API to perform distributed data processing tasks. Spark Applications can be written in several languages, including Java, Scala, and Python. A Spark Application takes the form of a file, such as a JAR or Python file, that is executed within the Spark framework.
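
As an illustration, here is a minimal, self-contained PySpark application of the kind that could be uploaded to Object Storage and registered as a Data Flow Application; the file name and the input/output argument convention are hypothetical.

```python
# word_count.py: a minimal PySpark application (hypothetical example).
import sys
from pyspark.sql import SparkSession

def main():
    # Data Flow supplies the SparkSession configuration at run time.
    spark = SparkSession.builder.appName("word-count").getOrCreate()
    input_path, output_path = sys.argv[1], sys.argv[2]

    # Count word occurrences across all lines of the input text.
    lines = spark.read.text(input_path)
    words = lines.selectExpr("explode(split(value, ' ')) as word")
    counts = words.groupBy("word").count()

    counts.write.csv(output_path)
    spark.stop()

if __name__ == "__main__":
    main()
```
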
Spark UI
The Spark UI is included with Apache Spark and is an important tool for debugging and diagnosing Spark applications. You can access the Spark UI for any Data Flow Run, subject to the Run’s authorization policies.
Spark Logs
Spark generates log files, which are useful for debugging and diagnostics. Each Data Flow Run automatically stores its log files, which you can access through the UI or the API, subject to the Run's authorization policies.
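
As a sketch using the Data Flow client's list_run_logs and get_run_log operations in the OCI Python SDK, the logs for a Run can be enumerated and downloaded as follows; the Run OCID is a hypothetical placeholder, and the response body is read here as bytes.

```python
import oci

config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

run_id = "ocid1.dataflowrun.oc1..example"  # hypothetical Run OCID

# Enumerate the log files captured for the Run, then download each one.
for log in client.list_run_logs(run_id).data:
    response = client.get_run_log(run_id, log.name)
    print(f"--- {log.name} ---")
    print(response.data.content.decode("utf-8", errors="replace"))
```
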
Enhanced Logs
Enhanced logs are the driver and executor logs, both stdout and stderr, provided through the Oracle Cloud Infrastructure Logging service. Their use is optional.