Data Flow Pools

Data Flow Pools can be used by many Data Flow Batch, Streaming, and Session workloads, by various users at the same time within the same tenancy.

Pools offer a wide range of functionalities for various use cases, such as:

  • Time-sensitive, large production workloads with many executors that need fast startup times, in seconds.
  • Critical production workloads that mustn't be affected by dynamic development workloads, because their resources can be separated into different pools.
  • Controlling cost and usage for development with an IAM policy that lets users submit Data Flow Runs only to specific pools (see the sketch after this list).
  • Large numbers of Data Flow Runs that need to be processed with reduced startup time.
  • Queuing Data Flow Runs in a pool for efficient resource use and cost control.
  • Workloads that run only in a specific time window of the day and need a pool that starts automatically on a schedule and stops automatically when idle.
  • Automatic security patching without affecting the runs or resources in a pool.
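
For the IAM use case above, a policy shaped like the following restricts a group to submitting Runs against one pool. This is a hedged sketch: the group and compartment names and the pool OCID are placeholders, and the dataflow-run resource type and dataflow-pool.id condition variable are assumptions to verify against the Data Flow policy reference.

    ALLOW GROUP dev-data-flow-users TO MANAGE dataflow-run IN COMPARTMENT dev-compartment
        WHERE dataflow-pool.id = 'ocid1.dataflowpool.oc1..<unique_id>'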

Configuring Runs and Applications to Use Pools

Use pools with Data Flow Applications and Runs.

Developing an Application with a Pool

While developing an application, you can choose a pool in any state except DELETED to add to the Application. Choose only driver and executor shapes that are configured in the Data Flow pool added to the Application.
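
A minimal sketch with the OCI Python SDK follows, assuming a pool_id field on CreateApplicationDetails (check the SDK reference for your version); all OCIDs, the bucket URI, and the shapes are placeholders, and the shapes must match a configuration in the pool.

    import oci

    # Load credentials from the default config file (~/.oci/config).
    config = oci.config.from_file()
    client = oci.data_flow.DataFlowClient(config)

    details = oci.data_flow.models.CreateApplicationDetails(
        compartment_id="ocid1.compartment.oc1..<unique_id>",
        display_name="pool-backed-app",
        language="PYTHON",
        spark_version="3.2.1",
        file_uri="oci://my-bucket@my-namespace/app.py",
        # Use only shapes that are configured in the pool.
        driver_shape="VM.Standard2.1",
        executor_shape="VM.Standard2.1",
        num_executors=2,
        pool_id="ocid1.dataflowpool.oc1..<unique_id>",  # assumed field name
    )

    app = client.create_application(details).data
    print(app.id, app.lifecycle_state)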

Running an Application with a Pool

While submitting a Data Flow Run, you can choose a pool in any state except DELETED to add to the Run. Choose only driver and executor shapes that are configured in the Data Flow pool added to the Run.
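
Similarly, here's a hedged sketch of submitting a Run against a pool with the OCI Python SDK, assuming the same pool_id field on CreateRunDetails; the OCIDs are placeholders.

    import oci

    config = oci.config.from_file()
    client = oci.data_flow.DataFlowClient(config)

    run_details = oci.data_flow.models.CreateRunDetails(
        compartment_id="ocid1.compartment.oc1..<unique_id>",
        application_id="ocid1.dataflowapplication.oc1..<unique_id>",
        display_name="pool-backed-run",
        pool_id="ocid1.dataflowpool.oc1..<unique_id>",  # assumed field name
    )

    run = client.create_run(run_details).data
    print(run.id, run.lifecycle_state)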

Queuing Data Flow Runs with a Pool

You can submit more Runs to the pool queue while the pool's compute resources are used by other Runs. By default, Runs are queued for 20 minutes waiting for resources in the pool to become available. You can configure the wait time in the queue by setting the Spark configuration spark.dataflow.acquireQuotaTimeout in the Data Flow Run or Application advanced options. The value for this configuration can be formatted as 1h, 30m, 45min, and so on.
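
For example, to wait up to 45 minutes for pool resources instead of the default 20, the key above can be passed as a Spark property when creating the Run. A sketch with the OCI Python SDK follows; the OCIDs are placeholders.

    import oci

    config = oci.config.from_file()
    client = oci.data_flow.DataFlowClient(config)

    run_details = oci.data_flow.models.CreateRunDetails(
        compartment_id="ocid1.compartment.oc1..<unique_id>",
        application_id="ocid1.dataflowapplication.oc1..<unique_id>",
        display_name="queued-run",
        # Wait up to 45 minutes in the pool queue instead of the 20-minute default.
        configuration={"spark.dataflow.acquireQuotaTimeout": "45min"},
    )

    run = client.create_run(run_details).data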

While a Data Flow Run waits in the queue for resources held by active Runs in the pool to become available, a cold startup occurs.

Starting a Data Flow Pool from a Run

A Data Flow pool in the STOPPED or ACCEPTED state can also be started by submitting a Run that uses the pool.

Runs wait for the pool to become active before starting. We recommend using the pool's queuing feature to avoid Run timeouts. Canceling or stopping the Run doesn't stop the pool.

Overriding the Pool ID in a Run or Application

  • When a pool is added to both the Application and the Run, the pool added to the Run is used.

  • When a pool is added to the Application but not to the Run, the pool added to the Application is used when the Run is submitted.

  • When a pool is added to the Run but not to the Application, the pool added to the Run is used when the Run is submitted.

  • This lets you use many pools in different Runs of the same Application, as shown in the sketch after this list.
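
To illustrate the precedence: if the Application was created with pool A, a Run that passes pool B uses pool B. A sketch reusing the same assumed pool_id field as earlier; the OCIDs are placeholders.

    import oci

    config = oci.config.from_file()
    client = oci.data_flow.DataFlowClient(config)

    # The Application was created with pool A; this Run overrides it with pool B.
    run_details = oci.data_flow.models.CreateRunDetails(
        compartment_id="ocid1.compartment.oc1..<unique_id>",
        application_id="ocid1.dataflowapplication.oc1..<unique_id>",
        display_name="override-pool-run",
        pool_id="ocid1.dataflowpool.oc1..<pool_b_id>",  # pool B wins for this Run
    )

    run = client.create_run(run_details).data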

Limits

  • Tenancy-level Data Flow limits and compartment quotas still apply when creating or starting pools.
  • A pool can have a maximum of 1,000 nodes in total across all its configurations.
  • There's no limit on the number of pools that can be created and used. An administrator can write a compartment quota policy to control the shapes and the number of nodes that a user, user group, or compartment can configure in a pool; a hedged sketch follows.
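
As an illustration, such a quota statement might look like the following. This is hypothetical: the dataflow quota family and the standard2-core-count quota name are assumptions, so verify the exact names Data Flow exposes in the tenancy's quota reference.

    Set dataflow quota standard2-core-count to 40 in compartment dev-compartment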