Sizing Your Data Flow Application
Every time you run a Data Flow Application, you specify an executor size (shape) and the number of executors, which together determine the number of OCPUs used to run your Spark application.
An OCPU is equivalent to a CPU core, which itself is equivalent to two vCPUs. Refer to Compute Shapes for more information on how many OCPUs each shape contains.
A rough guide is to assume 10 GB of data processed per OCPU per hour. Optimized data formats such as Parquet effectively process much faster, because only a small subset of the data needs to be read.
For example, to process 1 TB of data with an SLA of 30 minutes, expect to use about 200 OCPUs: at 10 GB per OCPU per hour, 1 TB represents roughly 100 OCPU-hours of work, and finishing in half an hour requires about 200 OCPUs.
You can allocate 200 OCPUs in various ways. For example, you can select an executor shape of VM.Standard2.8 and 25 total executors for 8 * 25 = 200 total OCPUs.
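To make the arithmetic concrete, here is a minimal sketch of the estimate above. The 10 GB-per-OCPU-per-hour figure is the rough guide from this section; the helper names are illustrative.

```python
import math

GB_PER_OCPU_HOUR = 10  # rough guide: 10 GB processed per OCPU per hour

def required_ocpus(data_gb: float, sla_hours: float) -> int:
    """OCPUs needed to process data_gb within sla_hours."""
    ocpu_hours = data_gb / GB_PER_OCPU_HOUR
    return math.ceil(ocpu_hours / sla_hours)

def executors_for(total_ocpus: int, ocpus_per_executor: int) -> int:
    """Executor count for a shape with the given OCPUs per executor."""
    return math.ceil(total_ocpus / ocpus_per_executor)

# 1 TB (~1,000 GB) with a 30-minute SLA needs about 200 OCPUs,
# which maps to 25 executors on VM.Standard2.8 (8 OCPUs each).
ocpus = required_ocpus(1_000, 0.5)       # 200
print(ocpus, executors_for(ocpus, 8))    # 200 25
```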
The number of OCPUs you can use is limited by the VM shape you choose and the value set in your tenancy for VM.Total. You cannot use more VMs across all VM shapes than the value of VM.Total; for example, if each per-shape limit is set to 20 and VM.Total is also set to 20, you still cannot use more than 20 VMs across all your VM shapes. With flexible shapes, where the limit is measured in cores or OCPUs, 80 cores in a flexible shape is equivalent to 10 VM.Standard2.8 shapes. See Service Limits for more information.
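As a quick check of the limit arithmetic, the following sketch is illustrative only; the limit values are placeholders, not real tenancy settings.

```python
# Illustrative check against a tenancy's VM.Total limit
# (limit values here are placeholders).
def fits_vm_total(planned_vms: int, vm_total: int) -> bool:
    """True if a run's total VM count stays within VM.Total."""
    return planned_vms <= vm_total

# One driver plus 25 executors is 26 VMs: over a VM.Total of 20.
print(fits_vm_total(1 + 25, 20))  # False

# For flexible shapes the limit is counted in cores/OCPUs:
# 80 flex cores corresponds to 10 VM.Standard2.8 shapes (8 OCPUs each).
print(80 // 8)  # 10
```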
Flexible Compute Shapes
Data Flow supports flexible compute shapes for Spark jobs.
- VM.Standard3.Flex (Intel)
- VM.StandardE3.Flex (AMD)
- VM.StandardE4.Flex (AMD)
- VM.Standard.A1.Flex (Arm processor from Ampere)
The driver and executor must have the same shape.
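As a minimal sketch, this is how an application on a flexible shape might be created with the OCI Python SDK; it assumes the SDK's CreateApplicationDetails and ShapeConfig models, and every OCID, URI, version, and sizing value below is a placeholder to adapt to your tenancy.

```python
import oci

# Load credentials from the default OCI config file (~/.oci/config).
config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="ocid1.compartment.oc1..example",   # placeholder
    display_name="sized-etl-job",
    language="PYTHON",
    spark_version="3.2.1",                             # placeholder
    file_uri="oci://bucket@namespace/app/etl.py",      # placeholder
    # The driver and executor must use the same shape.
    driver_shape="VM.Standard3.Flex",
    executor_shape="VM.Standard3.Flex",
    # With flexible shapes, OCPUs and memory are set explicitly.
    driver_shape_config=oci.data_flow.models.ShapeConfig(
        ocpus=4, memory_in_gbs=64),
    executor_shape_config=oci.data_flow.models.ShapeConfig(
        ocpus=8, memory_in_gbs=128),
    num_executors=25,  # 25 * 8 OCPUs = 200 OCPUs, as sized above
)
app = client.create_application(details).data
print(app.id)
```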
Migrating Applications from VM.Standard2 Compute Shapes
Follow these steps when migrating your existing Data Flow applications from VM.Standard2 to flexible compute shapes.
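Before the steps, as a minimal sketch of what the change amounts to, the following assumes the OCI Python SDK's update_application operation and UpdateApplicationDetails model; the application OCID and sizing values are placeholders.

```python
import oci

config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

# Switch an existing application from VM.Standard2.8 to an
# equivalent flexible shape (8 OCPUs per executor, as before).
details = oci.data_flow.models.UpdateApplicationDetails(
    driver_shape="VM.Standard3.Flex",
    executor_shape="VM.Standard3.Flex",
    driver_shape_config=oci.data_flow.models.ShapeConfig(
        ocpus=8, memory_in_gbs=128),
    executor_shape_config=oci.data_flow.models.ShapeConfig(
        ocpus=8, memory_in_gbs=128),
)
client.update_application(
    update_application_details=details,
    application_id="ocid1.dataflowapplication.oc1..example",  # placeholder
)
```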