Data Flow Pools

Data Flow Pools can be used for Data Flow batch, streaming, and session workloads by multiple users at the same time within the same tenant.

Pools provide a powerful and flexible mechanism for efficiently managing Spark-based batch, streaming, and session workloads across multiple users within the same tenant. Designed to support both time-sensitive production environments and dynamic development scenarios, Pools reduce application startup times by maintaining preinitialized compute infrastructure.

They enable enterprise-grade workload isolation through dedicated resource segmentation, ensuring that critical production jobs aren't impacted by development activities. Cost control is streamlined with fine-grained IAM policies that restrict pool usage to authorized users or specific environments. At the same time, intelligent queuing mechanisms enable high-volume job submission while optimizing resource usage.

Pools can be scheduled to start automatically within defined time windows and to stop automatically after an idle timeout, aligning compute availability with business processes and minimizing idle cost. Built-in automation also handles security patching seamlessly without disrupting running applications, making Pools an ideal choice for secure, scalable, and cost-efficient Spark workload execution.
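For example, a pool that should only be warm during business hours can be created with a weekday schedule and an idle timeout. The following is a minimal sketch using the OCI Python SDK; the model and field names (CreatePoolDetails, PoolConfig, PoolSchedule, idle_timeout_in_minutes) follow the public Data Flow API reference but should be verified against your SDK version, and all OCIDs and shape values are placeholders.

  # Sketch: create a pool that starts on a weekday schedule and stops when idle.
  # Model and field names assume the current oci Python SDK; verify for your version.
  import oci

  config = oci.config.from_file()  # reads credentials from ~/.oci/config
  client = oci.data_flow.DataFlowClient(config)

  pool_details = oci.data_flow.models.CreatePoolDetails(
      compartment_id="ocid1.compartment.oc1..example",  # placeholder OCID
      display_name="business-hours-pool",
      configurations=[
          oci.data_flow.models.PoolConfig(
              shape="VM.Standard.E4.Flex",  # shape offered to drivers and executors
              shape_config=oci.data_flow.models.ShapeConfig(ocpus=4, memory_in_gbs=64),
              min=0,
              max=20,  # up to 20 preinitialized nodes
          )
      ],
      schedules=[
          # Start automatically at 08:00 and stop at 18:00 on Mondays.
          oci.data_flow.models.PoolSchedule(day_of_week="MONDAY", start_time=8, stop_time=18)
      ],
      idle_timeout_in_minutes=30,  # stop automatically after 30 idle minutes
  )

  pool = client.create_pool(pool_details).data
  print(pool.id, pool.lifecycle_state)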

Pools offer a wide range of functionalities for various use cases, such as:

  • Time-sensitive, large production workloads with many executors that need startup times in seconds.
  • Critical production workloads that aren't affected by dynamic development workloads because their resources can be separated into different pools.
  • Controlling cost and usage for development with an IAM policy that lets you submit Data Flow runs only to specific pools.
  • Large numbers of Data Flow runs that need to be processed with minimal startup time.
  • Queuing Data Flow runs in a pool for efficient use of resources and cost control.
  • Workloads that run only in a specific time window of the day and need a pool that starts automatically on a schedule and stops when idle.
  • Automatic security patching without affecting runs or resources in a pool.

Another use case for Data Flow Pools is the ability to preallocate (or reserve) nodes with a special configuration. These are resources of specific sizes (or specific ratios of CPU to memory) that are rare in the data centers supporting the region. Jobs that require such a special configuration typically involve large data volumes and record sizes that can't easily be distributed by allocating more nodes to the job execution.

For these scenarios, it's practical to check the data centers in the region for the availability of these resources. The Oracle samples offer a straightforward way to examine such capacity across commercial realms.

Configuring Runs and Applications to Use Pools

Use pools with Data Flow Applications and Runs.

Developing an Application with a Pool

While developing applications, you can select a pool in any state except DELETED to add to an Application. Select only driver and executor shapes that are configured in the Data Flow pool added to the Application.

Running an Application with a Pool

While submitting a Data Flow Run, you can select a pool in any state except DELETED to use with the Run. Select only driver and executor shapes that are configured in the Data Flow pool added to the Run.

Queuing Data Flow Runs with a Pool

You can submit more Runs to the pool queue while the pool's compute resources are in use by other Runs. By default, Runs are queued for 20 minutes to wait for resources in the pool to become available. You can configure the wait time in the queue by setting the Spark configuration spark.dataflow.acquireQuotaTimeout in the Data Flow Run or Application advanced options. The value for this configuration can be formatted as 1h, 30m, 45min, and so on.
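As an illustration, the queue wait time can also be set when submitting a Run programmatically. The following sketch uses the OCI Python SDK; the CreateRunDetails fields shown (configuration, pool_id) follow the public Data Flow API reference, and all OCIDs are placeholders.

  # Sketch: submit a Run to a pool with a custom queue wait time.
  # Field names assume the current oci Python SDK; verify for your version.
  import oci

  config = oci.config.from_file()
  client = oci.data_flow.DataFlowClient(config)

  run_details = oci.data_flow.models.CreateRunDetails(
      compartment_id="ocid1.compartment.oc1..example",  # placeholder OCID
      application_id="ocid1.dataflowapplication.oc1..example",
      display_name="queued-run",
      pool_id="ocid1.dataflowpool.oc1..example",  # pool whose queue the Run joins
      configuration={
          # Wait up to 45 minutes (instead of the 20-minute default) for
          # pool resources to become available before timing out.
          "spark.dataflow.acquireQuotaTimeout": "45min",
      },
  )

  run = client.create_run(run_details).data
  print(run.id, run.lifecycle_state)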

While a Data Flow Run waits in the queue for resources held by active Runs in the pool to become available, it incurs a cold startup.

Starting a Data Flow Pool from a Run

Data Flow pools in the STOPPED or ACCEPTED state can also be started by submitting a Run that uses the pool.

Runs wait for the pool to become active before starting. We recommend using the pool's queuing feature to avoid Run timeouts. Canceling or stopping the Run doesn't stop the pool.

Overriding the Pool ID in a Run or Application

  • When a pool is added to both an Application and a Run, the pool added to the Run is used.

  • When a pool is added to an Application, but not to a Run, the pool added to the Application is used when the Run is submitted.

  • When a pool is added to a Run, but not to an Application, the pool added to the Run is used when the Run is submitted.

  • This lets you use many pools in different Runs of the same Application, as shown in the sketch after this list.
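To make the precedence concrete, the following sketch submits a Run that overrides its Application's pool. The pool_id field on CreateRunDetails follows the public Data Flow API reference; the Application OCID is assumed to reference an Application already configured with pool A, and all OCIDs are placeholders.

  # Sketch: override an Application's pool for a single Run.
  # The Application below is assumed to be configured with pool A.
  import oci

  config = oci.config.from_file()
  client = oci.data_flow.DataFlowClient(config)

  run_details = oci.data_flow.models.CreateRunDetails(
      compartment_id="ocid1.compartment.oc1..example",
      application_id="ocid1.dataflowapplication.oc1..appwithpoola",
      display_name="run-on-pool-b",
      pool_id="ocid1.dataflowpool.oc1..poolb",  # pool B overrides pool A for this Run
  )
  run = client.create_run(run_details).data

  # Omit pool_id above and the Run falls back to the Application's pool (pool A).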

Limits

  • Tenant-level Data Flow limits and compartment quotas still apply when creating or starting pools.
  • A pool can have a maximum of 1,000 nodes in total across all its configurations.
  • There's no limit on the number of pools that can be created and used. An administrator can write a compartment quota policy to limit a user, user group, or compartment, and to control the shapes and number of nodes configured in a pool.