Getting Started with Data Flow
Learn about Oracle Cloud Infrastructure Data Flow: what it is, what you need to do before you begin using it (including setting up policies and storage), how to load data, and how to import and bundle Spark applications.
What is Oracle Cloud Infrastructure Data Flow
Data Flow is a cloud-based serverless platform with a rich user interface. It allows Spark developers and data scientists to create, edit, and run Spark jobs at any scale without the need for clusters, an operations team, or highly specialized Spark knowledge. Being serverless means there is no infrastructure for you to deploy or manage. It is entirely driven by REST APIs, giving you easy integration with applications or workflows. You can:
- Connect to Apache Spark data sources.
- Create reusable Apache Spark applications.
- Launch Apache Spark jobs in seconds.
- Create Apache Spark applications using SQL, Python, Java, or Scala (a minimal PySpark sketch follows this list).
- Manage all Apache Spark applications from a single platform.
- Process data in the Cloud or on-premises in your data center.
- Create Big Data building blocks that you can easily assemble into advanced Big Data applications.
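For example, a Data Flow application can be as small as a single script. The following is a minimal, hypothetical PySpark sketch of the kind of application you could upload and run; it is illustrative only and assumes nothing beyond standard PySpark.

```python
# hello_dataflow.py: a hypothetical, minimal PySpark application that could be
# uploaded to Object Storage and run on Data Flow without managing any cluster.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("hello-dataflow").getOrCreate()

    # Build a small DataFrame in memory and run a trivial aggregation on it.
    df = spark.createDataFrame(
        [("alpha", 1), ("beta", 2), ("alpha", 3)], ["label", "value"]
    )
    df.groupBy("label").sum("value").show()

    spark.stop()
```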

Before you Begin with Data Flow
Avoid entering confidential information when assigning descriptions, tags, or friendly names to your cloud resources through the Oracle Cloud Infrastructure Console, API, or CLI. This applies when creating or editing an application in Data Flow.
Before you begin using Data Flow, you must have:
- An Oracle Cloud Infrastructure account. Trial accounts can be used to demo Data Flow.
- A Service Administrator role for your Oracle Cloud services. When the service is activated, Oracle sends the credentials and URL to the designated Account Administrator. The Account Administrator creates an account for each user who needs access to the service.
- A supported browser, such as:
  - Microsoft Internet Explorer 11.x+
  - Mozilla Firefox ESR 38+
  - Google Chrome 42+
- A Spark application uploaded to Oracle Cloud Infrastructure Object Storage. Do not provide it packaged in a zipped format such as .zip or .gzip.
- Data for processing loaded into Oracle Cloud Infrastructure Object Storage. Data can be read from external data sources or clouds. Data Flow optimizes performance and security for data stored in Oracle Cloud Infrastructure Object Storage.
Note that Spark streaming is not supported.
The following table shows the technologies supported by Data Flow. It is for reference only, and is not meant to be comprehensive.
Supported Technologies
Technology | Value |
---|---|
Supported Spark Versions | Spark 2.4.4 |
Supported Application Types | Java, Scala, SparkSQL, PySpark (Python 3 only) |
Set Up Administration
Before you can create, manage, and run applications in Data Flow, the tenant administrator (or any user with elevated privileges to create buckets and modify IAM) must create specific storage buckets and associated policies in IAM. These setup steps are required in Object Storage and IAM for Data Flow to function.
Object Store: Setting Up Storage
Before running applications in the Data Flow service, two storage buckets are required in Object Storage; see the Object Storage documentation. A sketch for creating these buckets with the OCI SDK follows the list below.
- Data Flow Logs: Data Flow requires a bucket to store the logs (both standard out and standard err) for every application run. Create a standard storage tier bucket called dataflow-logs in the Object Storage service. The location of the bucket must follow the pattern: oci://dataflow-logs@<Object_Store_Namespace>/
- Data Flow Warehouse: Data Flow requires a data warehouse for Spark SQL applications. Create a standard storage tier bucket called dataflow-warehouse in the Object Storage service. The location of the warehouse must follow the pattern: oci://dataflow-warehouse@<Object_Store_Namespace>/
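If you prefer to script this setup instead of using the Console, the following is a minimal sketch using the OCI Python SDK. It assumes a configured ~/.oci/config file, and the compartment OCID is a placeholder you must replace.

```python
# Sketch: create the dataflow-logs and dataflow-warehouse buckets with the
# OCI Python SDK. <compartment-ocid> is a placeholder for the compartment
# that should own the buckets.
import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data  # your Object Storage namespace

for bucket_name in ("dataflow-logs", "dataflow-warehouse"):
    details = oci.object_storage.models.CreateBucketDetails(
        name=bucket_name,
        compartment_id="<compartment-ocid>",
        storage_tier="Standard",  # standard storage tier, as required above
    )
    object_storage.create_bucket(namespace, details)
    print(f"Created oci://{bucket_name}@{namespace}/")
```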
Identity: Policy Set Up
Data Flow requires IAM policies that grant access to resources so that it can manage and run applications. For more information on how IAM policies work, refer to the Identity and Access Management documentation. For more information about tags and tag namespaces to add to your policies, see Managing Tags and Tag Namespaces.
- For administration-like users (or super-users) of the service, who can take any action on the service, including managing applications owned by other users and runs initiated by any user within their tenancy, subject to the policies assigned to the group:
  - Create a group in your identity service called dataflow-admin and add users to this group.
  - Create a policy called dataflow-admin and add the following statements:
    ALLOW GROUP dataflow-admin TO READ buckets IN <TENANCY>
    ALLOW GROUP dataflow-admin TO MANAGE dataflow-family IN <TENANCY>
    ALLOW GROUP dataflow-admin TO MANAGE objects IN <TENANCY> WHERE ALL {target.bucket.name='dataflow-logs', any {request.permission='OBJECT_CREATE', request.permission='OBJECT_INSPECT'}}
    The third statement allows the group to create and inspect objects only in the dataflow-logs bucket.
- The second category is for all other users, who are authorized only to create and delete their own applications. They can run any application within their tenancy, but have no other administrative rights, such as deleting applications owned by other users or canceling runs initiated by other users:
  - Create a group in your identity service called dataflow-users and add users to this group.
  - Create a policy called dataflow-users and add the following statements (a scripted sketch of this step follows the list):
    ALLOW GROUP dataflow-users TO READ buckets IN <TENANCY>
    ALLOW GROUP dataflow-users TO USE dataflow-family IN <TENANCY>
    ALLOW GROUP dataflow-users TO MANAGE dataflow-family IN <TENANCY> WHERE ANY {request.user.id = target.user.id, request.permission = 'DATAFLOW_APPLICATION_CREATE', request.permission = 'DATAFLOW_RUN_CREATE'}
    ALLOW GROUP dataflow-users TO MANAGE objects IN <TENANCY> WHERE ALL {target.bucket.name='dataflow-logs', any {request.permission='OBJECT_CREATE', request.permission='OBJECT_INSPECT'}}
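As an alternative to the Console, the same group and policy can be created with the OCI Python SDK. This is a sketch, not a definitive procedure: it assumes a configured ~/.oci/config, covers only the dataflow-users group, and its statements simply repeat the ones above with <TENANCY> written as the literal keyword "tenancy".

```python
# Sketch: create the dataflow-users group and policy with the OCI Python SDK.
import oci

config = oci.config.from_file()
identity = oci.identity.IdentityClient(config)
tenancy_id = config["tenancy"]  # groups and these policies live in the root compartment

group = identity.create_group(
    oci.identity.models.CreateGroupDetails(
        compartment_id=tenancy_id,
        name="dataflow-users",
        description="Data Flow non-administrative users",
    )
).data
print("Created group", group.id)

statements = [
    "ALLOW GROUP dataflow-users TO READ buckets IN tenancy",
    "ALLOW GROUP dataflow-users TO USE dataflow-family IN tenancy",
    "ALLOW GROUP dataflow-users TO MANAGE dataflow-family IN tenancy "
    "WHERE ANY {request.user.id = target.user.id, "
    "request.permission = 'DATAFLOW_APPLICATION_CREATE', "
    "request.permission = 'DATAFLOW_RUN_CREATE'}",
    "ALLOW GROUP dataflow-users TO MANAGE objects IN tenancy "
    "WHERE ALL {target.bucket.name='dataflow-logs', "
    "any {request.permission='OBJECT_CREATE', request.permission='OBJECT_INSPECT'}}",
]

identity.create_policy(
    oci.identity.models.CreatePolicyDetails(
        compartment_id=tenancy_id,
        name="dataflow-users",
        description="Policy for Data Flow non-administrative users",
        statements=statements,
    )
)
```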
Once you have configured the federation trust, use the Oracle Cloud Infrastructure Console to map the appropriate Identity Provider User Group to the required Data Flow User Group in the identity service.
The Data Flow service needs permission to perform actions on behalf of the user or group on objects within the tenancy. Create a policy called dataflow-service and add the following statement:
ALLOW SERVICE dataflow TO READ objects IN tenancy WHERE target.bucket.name='dataflow-logs'
The following policies relate to using private endpoints with Data Flow:
- To allow use of the virtual-network-family:
  allow group dataflow-admin to use virtual-network-family in compartment <compartment-name>
- To allow access to more specific resources, you need the following policies:
  allow group dataflow-admin to manage vnics in compartment <compartment-name>
  allow group dataflow-admin to use subnets in compartment <compartment-name>
  allow group dataflow-admin to use network-security-groups in compartment <compartment-name>
- To allow access to specific operations, you need the following policy:
  allow group dataflow-admin to manage virtual-network-family in compartment <compartment-name> where any {request.operation='CreatePrivateEndpoint', request.operation='UpdatePrivateEndpoint', request.operation='DeletePrivateEndpoint'}
- To allow changing of the network configuration, you need the following policy:
  allow group dataflow-admin to manage dataflow-private-endpoint in <tenancy>
Although these examples grant the policies to dataflow-admin, you could choose to grant them only to a subset of users, limiting who can perform operations on private endpoints. If you are only using private endpoints to access data in a Run, and the private endpoint in question already exists in your tenancy, you don't need any of these policies.
A user in the dataflow-admin group can create Runs that either activate a private endpoint configuration or switch the network configuration back to Internet. After a Run activates a private endpoint, that private endpoint remains active until changed by a user from the dataflow-admin group with the appropriate privileges; see Set Up Administration for the right set of privileges. A user in the dataflow-users group can launch Runs only if the Application is configured to use the active private endpoint. When correctly configured, private endpoints can access a mix of private resources on the VCN plus Internet resources. Provide a list of these resources in the DNS Zones section when configuring a private endpoint.
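For reference, a private endpoint can also be created programmatically. The following is a sketch, assuming your version of the OCI Python SDK exposes the Data Flow CreatePrivateEndpoint operation referenced in the policies above; the compartment OCID, subnet OCID, and DNS zone values are placeholders.

```python
# Sketch: create a Data Flow private endpoint with the OCI Python SDK.
# All OCIDs and DNS zones below are placeholders; the dns_zones list
# corresponds to the DNS Zones section mentioned above.
import oci

config = oci.config.from_file()
data_flow = oci.data_flow.DataFlowClient(config)

details = oci.data_flow.models.CreatePrivateEndpointDetails(
    compartment_id="<compartment-ocid>",
    display_name="example-private-endpoint",
    subnet_id="<subnet-ocid>",
    dns_zones=["example.internal.mycorp.com"],  # private resources to resolve
)
response = data_flow.create_private_endpoint(details)
print("Private endpoint OCID:", response.data.id)
```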
Importing an Apache Spark Application to the Oracle Cloud
Your Spark applications need to be hosted in Oracle Cloud Infrastructure Object Storage before you can run them. You can upload your application to any bucket. The user running the application must have read access to all assets (including all related compartments, buckets and files) for the application to launch successfully.
Best Practices for Bundling Applications
Technology | Notes |
---|---|
Java or Scala Applications | For the best reliability, upload applications as Uber JARs or Assembly JARs, with all dependencies included, to Object Storage. Use tools like the Maven Assembly Plugin (Java) or sbt-assembly (Scala) to build appropriate JARs. |
SQL Applications | Upload all your SQL files (.sql) to Object Storage. |
Python Applications | Build applications with the default libraries and upload the Python file to Object Storage. To include any third-party libraries or packages, see Adding Third-Party Libraries to Data Flow Applications. |
Do not provide your application package in a zipped format such as .zip or .gzip.
When you reference the application, use a path that follows the pattern:
oci://<bucket>@<tenancy>/<applicationfile>
For example, with a Java or Scala application, suppose a developer at examplecorp developed a Spark application called logcrunch.jar and uploaded it to a bucket called production_code. The resulting path is:
oci://production_code@examplecorp/logcrunch.jar
You can always determine the correct tenancy by clicking the user profile icon in the top right of the Console UI.
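If you script the upload rather than using the Console, a sketch with the OCI Python SDK might look like the following; it reuses the hypothetical logcrunch.jar and production_code names from the example above and assumes a configured ~/.oci/config.

```python
# Sketch: upload a Spark application file to an Object Storage bucket so that
# Data Flow can reference it as oci://production_code@<namespace>/logcrunch.jar.
import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

with open("logcrunch.jar", "rb") as spark_app:
    object_storage.put_object(
        namespace_name=namespace,
        bucket_name="production_code",
        object_name="logcrunch.jar",
        put_object_body=spark_app,
    )

print(f"Uploaded to oci://production_code@{namespace}/logcrunch.jar")
```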
Load Data into the Oracle Cloud
Data Flow is optimized to manage data in Oracle Cloud Infrastructure Object Storage. Managing data in Object Storage maximizes performance and allows your application to access data on behalf of the user running the application.
Approach | Tools |
---|---|
Native web UI | The Oracle Cloud Infrastructure Console lets you manage storage buckets and upload files, including directory trees. |
Third-party tools | Consider using the REST APIs and the Command Line Interface (CLI). |
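Once the data is in Object Storage, a Data Flow application reads it by its oci:// path. A minimal PySpark sketch, with hypothetical bucket, namespace, and object names:

```python
# Sketch: read a CSV file from Object Storage inside a Data Flow PySpark
# application. The bucket, namespace, and object names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-from-object-storage").getOrCreate()

df = (
    spark.read
    .option("header", "true")
    .csv("oci://my-data-bucket@my-namespace/sales/2020/*.csv")
)
df.printSchema()
print("Rows read:", df.count())
```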
Cross Tenancy Access
Consider the following scenario:
- The Data Flow user belongs to the group tenancy-a-group in a tenancy called Tenancy_A.
- Data Flow runs in Tenancy_A.
- The objects to be read are in a tenancy called Tenancy_B.
You need to allow tenancy-a-group to read buckets and objects in Tenancy_B.
In Tenancy_A, create a policy with the following statements:
define tenancy Tenancy_B as tenancy-b-ocid
endorse group tenancy-a-group to read buckets in tenancy Tenancy_B
endorse group tenancy-a-group to read objects in tenancy Tenancy_B
The first statement is a "define" statement that assigns a friendly label to the OCID of Tenancy_B. The second and third statements let the user's group, tenancy-a-group, read buckets and objects in Tenancy_B.
In Tenancy_B, create a policy with the following statements:
define tenancy Tenancy_A as tenancy-a-ocid
define group tenancy-a-group as tenancy-a-group-ocid
admit group tenancy-a-group of tenancy Tenancy_A to read buckets in tenancy
admit group tenancy-a-group of tenancy Tenancy_A to read objects in tenancy
The first and second statements are define statements that assign friendly labels to the OCIDs of Tenancy_A and tenancy-a-group. The third and fourth statements let tenancy-a-group read the buckets and objects in Tenancy_B. The word admit indicates that the access applies to a group outside the tenancy in which the buckets and objects reside.
To restrict access to a specific compartment, your_compartment, use:
admit group tenancy-a-group of tenancy Tenancy_A to read buckets in compartment your_compartment
To restrict access to a specific bucket, your_bucket, in your_compartment, use:
admit group tenancy-a-group of tenancy Tenancy_A to read objects in compartment your_compartment where target.bucket.name = 'your_bucket'
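With these policies in place, a Data Flow run in Tenancy_A addresses the data through Tenancy_B's Object Storage namespace in the oci:// path. A minimal sketch, using a hypothetical namespace tenancy-b-namespace and Parquet data in your_bucket:

```python
# Sketch: read objects owned by Tenancy_B from a run in Tenancy_A. The path
# uses Tenancy_B's Object Storage namespace; names here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-tenancy-read").getOrCreate()
df = spark.read.parquet("oci://your_bucket@tenancy-b-namespace/shared/")
print("Cross-tenancy rows:", df.count())
```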