Creating a Pool

Create a pool in Data Flow to configure a group of compute resources that can be used to run various Spark data and machine learning workloads, including batch, streaming, and interactive.

Pools require special statements to let administrators create or users to attach Data Flow runs. For more details, see polices to use and manage pools.

1. On the Pools page, select Create pool. If you need help finding the Pools page, see Listing Pools.
2. In the Create pool panel, enter a name for the pool and an optional description to help you find it.
3. For Configuration 1, enter the following values:
  
  Select a compute shape.
  
  Optional: Customize the number of OCPUs and the amount of memory.
  
  Enter the number of shapes. to include in the pool. The minimum is 2.
4. (Optional) To add another configuration, select Add configuration and repeat the preceding step.
  
  Note
  
  You can have a maximum of three configurations per pool.
5. (Optional) Create a schedule to start the pool.
  
  Select the day you want the pool to start.
  
  Select the hour (UTC) on which to automatically start the pool.
  
  To add another schedule, select Add schedule and repeat steps a and b.
6. (Optional) Specify the idle timeout in minutes to automatically stop the pool when it's idle.
7. (Optional) In the Tags section, add one or more tags to the <resourceType>. If you have permissions to create a resource, then you also have permissions to apply free-form tags to that resource. To apply a defined tag, you must have permissions to use the tag namespace. For more information about tagging, see Resource Tags. If you're not sure whether to apply tags, skip this option or ask an administrator. You can apply tags later..
8. Select one of the following options:
  
  Select Create pool to create the pool now and start it at the scheduled time.
  
  Select Create and start pool to create the pool now and start it immediately after it's created.
  
  Select Save as stack to create the pool later.
Use the create command and required parameters to create a Pool:
oci data-flow pool create [OPTIONS]
For a complete list of flags and variable options for CLI commands, see the CLI Command Reference.
Run the OperationName operation to create a Pool.

Resource Allocation for Critical Sizes

Within the request to create a Pool, Data Flow requires some contiguity to establish a cluster. That goes beyond the data center in the OCI region, but at the fault domains. For example, a request to provision a Pool with 10 nodes of 32 OCPUs (64 cores) and 256 GB RAM each can only be provisioned by the PHX-AD2 data center, on Fault-Domain-2, which at the moment is showing 72 as the available capacity. The block size of nodes for Pools' contiguous allocation is 12, meaning a request greater than 12 can be split into multiples of Fault Domains allocation. A smaller number of node requests are allocated in proportion to the quantity requested.

Another hidden feature in these special circumstances for the resource allocation is the ability to start the pool with fewer nodes and eventually change the Pool, after it becomes active, to get extra resources. This can be done without interfering with the Jobs that are running using the Pool resources. It lets a job start and use the Pool without waiting for all the necessary resources, but only for those resources that are available at that moment. An administrator can change the number of nodes, and Data Flow reinitiates the process to interrogate the Data Centers in the region to assess the possibility of allocating the demand.

Considerations About Private Network

One common question about the resource reservation is the nature of its use regarding the network security when public and private job runs are executing within the Pool. The Pool segregates access to its resources, ensuring that private traffic isn't mixed with public traffic within the reserved capacity, as it would be if the job runs were independently allocated. Both types of runs, public and private, can coexist simultaneously in the Pool. Data Flow isolates the job runs according to design.

Considerations About Job Concurrency

Pools have a queueing mechanism that lets you submit several Data Flow runs to the same Pool, even when the Pool is fully used. Using the built-in queuing mechanism, new runs can wait for resource availability, ensuring seamless job submission without manual intervention.

Pools can be optimized using the wait time default configuration by adjusting the default 20-minute queue timeout using the spark.dataflow.acquireQuotaTimeout Spark configuration parameter. Set a value that aligns with your workload priorities and expected resource turnaround times. For example:

Use shorter wait times (say, 10 minutes) for latency-sensitive or interactive jobs.
Use longer timeouts (say, one hour) for large, batch-oriented jobs with flexible start times.

Another consideration is regarding the Job Runs with autoscaling enabled. Data Flow considers the total number of resources of the required configuration type to be available in the Pool. Let's say the autoscaling has a minimum of 10 shapes and a maximum of 20. For the job to start with the Pool configuration, 20 shapes need to be available in the Pool. Data Flow queues the request according to the acquire quota timeout property as described before.

One way of mitigating the amount of resources is by configuring the spark.scheduler.minRegisteredResourcesRatio property. For example, configuring the minimum ratio to 0.8 in the previous example lets the Pool queue the job run until 16 shapes are available to start the execution.

Considerations About Billing

A Data Flow Pool is billed when it's active, regardless of whether its resources are used. When resources in the Pool are used, the billing for compute resource usage is identical to that of a normal Data Flow run execution. Therefore, a Data Flow Pool, where resources are used 100% of the time, incurs the exact charges as Data Flow runs when resources are used on demand.

Summary and Recommendations

The following summarizes the key points and the general recommendations about using Data Flow Pools:

Use Data Flow Pools for Faster Start up and SLA Compliance:

Pools reduce Spark job start up time.
Ideal for time-sensitive production jobs or workloads requiring service-level agreement (SLA) guarantees.
Use scheduling to automatically start and stop pools, aligning with business hours and reducing idle costs.

Enable Workload Isolation for Production and Development:

Pools support enterprise-grade isolation, ensuring production workloads aren't disrupted by concurrent development activity.
Use fine-grained IAM policies to restrict access to pools for specific users or environments.

Optimize Resource Allocation and Cost:

Use scheduling and autoshutdown timeouts to avoid unnecessary billing during idle periods.
Understand that billing occurs while the Pool is active, regardless of usage.
When fully used, Pool-based billing is equal to on-demand Data Flow runs.

Use Pool Configuration for Specialized Workloads:

Use Pools to preallocate rare or large-shaped resources necessary for jobs with large data volumes or special compute or memory requirements.

Before allocating Pools with large or rare shapes, use OCI Capacity Reporter tools to check available resources across Fault Domains and ADs.

Dynamically Scale Pools Without Interruption:

Pools can be started with fewer nodes and scaled dynamically while jobs are running.
Administrators can increase the pool size after creation to meet job demand without interrupting active workloads.

Maintain Network Security and Isolation:

Data Flow Pools ensure network traffic isolation between public and private job runs using data plane-level segmentation.
Both job types can coexist safely in the same Pool without interference.

Follow Best Practices for Pool Creation and Attachment:

Special permissions or setup steps are needed for users to create or attach jobs to Data Flow Pools.
See Set Up Identity and Access Management Policies for specific commands and permission models.

These recommendations help maximize efficiency, scalability, and cost control when managing Spark workloads in OCI using Data Flow Pools.