Data Flow Overview
Learn about Data Flow and how you can use it to easily create, share, run, and view the output of Apache Spark applications.
Data Flow Concepts
An understanding of these concepts is essential for using Data Flow.
- Data Flow Applications
- An Application is an infinitely reusable Spark application template. It consists of a Spark application, its dependencies, default parameters, and a default run-time resource specification. Once a developer creates a Data Flow Application, anyone can use it without worrying about the complexities of deploying, configuring, or running it (a sketch of starting a Run from an existing Application appears after this list).
- Data Flow Library
- The Library is the central repository of Data Flow Applications. Anyone can browse, search, and execute applications published to the Library, subject to having the correct permissions in the Data Flow system.
- Data Flow Runs
- Every time a Data Flow Application is run, a Run is created. A Run captures the Application's output, logs, and run-time statistics, which are automatically and securely stored. Output is saved so that it can be viewed by anyone with the correct permissions using the UI or REST API. Runs also give you secure access to the Spark UI for debugging and diagnostics.
- Elastic Compute
- Every time you run a Data Flow Application, you decide how big you want it to be. Data Flow allocates your VMs, runs your job, securely captures all output, and shuts the cluster down. You don't have anything to maintain in Data Flow. Clusters only run when there is real work to do.
- Elastic Storage
- Data Flow works with the Oracle Cloud Infrastructure Object Storage service. For more information, see the Overview of Object Storage.
- Security
- Data Flow is integrated with Oracle Cloud Infrastructure Identity and Access Management (IAM) for authentication and authorization. Your Spark applications run on behalf of the person who launches them, which means that each Spark application has the same privileges as that end user. You do not need to supply separate credentials to access any IAM-supported system. In addition, Data Flow benefits from all the other security attributes of Oracle Cloud Infrastructure, including transparent encryption of data at rest and in motion.
- Administrator Controls
- Data Flow lets you set service limits and create administrators who have full control over all applications and runs. You remain in control regardless of how many users you have.
- Apache Spark
- Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
- Spark Application
- A Spark Application uses the Spark API to perform distributed data processing tasks. Spark Applications can be written in several languages, including Java, Scala, and Python. A Spark Application takes the form of files, such as JAR or Python files, that are executed within the Spark framework (a minimal PySpark sketch appears after this list).
- Spark UI
- The Spark UI is included with Apache Spark and is an important tool for debugging and diagnosing Spark applications. You can access the Spark UI for any Data Flow Run, subject to the Run’s authorization policies.
- Spark Logs
- Spark generates log files that are useful for debugging and diagnostics. Each Data Flow Run automatically stores its log files, which you can access through the UI or API, subject to the Run's authorization policies.
- Enhanced Logs
- Enhanced logs are the driver and executor logs, both StdOut and StdErr, made available through the Oracle Cloud Infrastructure Logging service. Using them is optional.
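To make the Spark Application and Elastic Storage concepts concrete, the following is a minimal PySpark sketch of the kind of file you might package and register as a Data Flow Application. The bucket, namespace, paths, and application name are placeholder assumptions, and the `oci://bucket@namespace/path` form is the Object Storage URI convention used by Data Flow applications; treat this as an illustration rather than a ready-to-run job.

```python
# example_app.py - a minimal PySpark application (placeholders throughout).
import sys

from pyspark.sql import SparkSession


def main():
    # Data Flow passes an Application's parameters to the Spark application as
    # ordinary command-line arguments; defaults are assumed here for illustration.
    input_path = sys.argv[1] if len(sys.argv) > 1 else "oci://my-bucket@my-namespace/input/events.csv"
    output_path = sys.argv[2] if len(sys.argv) > 2 else "oci://my-bucket@my-namespace/output/summary"

    spark = SparkSession.builder.appName("events-summary").getOrCreate()

    # Read from Object Storage, compute a small aggregation, and write the result back.
    events = spark.read.option("header", "true").csv(input_path)
    summary = events.groupBy(events.columns[0]).count()
    summary.write.mode("overwrite").parquet(output_path)

    spark.stop()


if __name__ == "__main__":
    main()
```

Packaged this way, the same file can back a Data Flow Application whose default parameters point at the usual input and output locations, while individual Runs override them as needed.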
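Starting a Run from an existing Application does not require any Spark infrastructure on your side. The sketch below uses the Data Flow client from the OCI Python SDK; the OCIDs, display name, and arguments are placeholders, and the model fields shown should be checked against the SDK version you use.

```python
# run_app.py - a hedged sketch of launching a Data Flow Run with the OCI Python SDK.
import oci

# Loads credentials from ~/.oci/config by default.
config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..exampleuniqueid",          # placeholder OCID
    application_id="ocid1.dataflowapplication.oc1..exampleuniqueid",  # placeholder OCID
    display_name="nightly-events-summary",
    # Arguments override the Application's default parameters for this Run.
    arguments=[
        "oci://my-bucket@my-namespace/input/events.csv",
        "oci://my-bucket@my-namespace/output/summary",
    ],
)

run = client.create_run(details).data
print(run.id, run.lifecycle_state)
```

Once the Run completes, its output, logs, and the Spark UI are available from the UI or the REST API, subject to the Run's authorization policies.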