Sizing Your Data Flow Application

Every time you run a Data Flow Application, you specify the size and number of executors. Together, these determine the number of OCPUs used to run your Spark application.

An OCPU is equivalent to a CPU core, which itself is equivalent to two vCPUs. Refer to Compute Shapes for more information on how many OCPUs each shape contains.

A rough guide is to assume that each OCPU processes 10 GB of data per hour. Optimized data formats like Parquet appear to run much faster because only a small subset of the data is processed.

The number of OCPUs equals the processed data in gigabytes, divided by the product of 10 and the desired run time in hours:

Number of OCPUs = Processed data (GB) / (10 × Desired run time (hours))

As an example, if you want to process 1 TB of data with an SLA of 30 minutes, expect to use about 200 OCPUs: 1,000 GB / (10 × 0.5 hours) = 200 OCPUs.

You can allocate 200 OCPUs in various ways. For example, you can select an executor shape of VM.Standard2.8 and 25 total executors for 8 * 25 = 200 total OCPUs.
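The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not part of Data Flow itself: the function names `estimate_ocpus` and `executors_needed` are hypothetical, and rounding up to a whole OCPU or executor is an assumption made here.

```python
import math


def estimate_ocpus(data_gb: float, run_time_hours: float,
                   gb_per_ocpu_hour: float = 10.0) -> int:
    """Rough OCPU estimate: data / (rate * run time), rounded up (assumption)."""
    if data_gb < 0 or run_time_hours <= 0 or gb_per_ocpu_hour <= 0:
        raise ValueError("data must be >= 0; run time and rate must be > 0")
    return math.ceil(data_gb / (gb_per_ocpu_hour * run_time_hours))


def executors_needed(total_ocpus: int, ocpus_per_executor: int) -> int:
    """Number of executors of a given size needed to reach total_ocpus."""
    if ocpus_per_executor <= 0:
        raise ValueError("ocpus_per_executor must be > 0")
    return math.ceil(total_ocpus / ocpus_per_executor)


# The example from the text: 1 TB (1,000 GB) with a 30-minute SLA.
ocpus = estimate_ocpus(data_gb=1000, run_time_hours=0.5)
# VM.Standard2.8 provides 8 OCPUs per executor.
executors = executors_needed(ocpus, ocpus_per_executor=8)
print(ocpus, executors)
```

With the values from the example, this gives 200 OCPUs and 25 executors, matching 8 * 25 = 200.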

This formula is a rough estimate and your run times might differ. You can better estimate your actual workload’s processing rate by loading your Application and viewing the history of Application Runs. This history lets you see the number of OCPUs used, the total data processed, and the run time, so you can estimate the resources you need to meet your SLAs. From there, estimate the amount of data a Run processes and size the Run appropriately.
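As a sketch of how run history could refine the estimate, the following Python derives an observed processing rate (GB per OCPU per hour) and reuses it in the sizing formula. The `RunRecord` type, the function names, and the history values are made up for illustration; they are not an export format or API of Data Flow, and you would fill in the numbers by hand from your own Application Runs.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One past run, copied by hand from the run history (hypothetical type)."""
    ocpus: int
    data_gb: float
    run_time_hours: float


def observed_rate(runs: list[RunRecord]) -> float:
    """Average GB processed per OCPU per hour across the given runs."""
    total_data = sum(r.data_gb for r in runs)
    total_ocpu_hours = sum(r.ocpus * r.run_time_hours for r in runs)
    if total_ocpu_hours <= 0:
        raise ValueError("run history must contain positive OCPU-hours")
    return total_data / total_ocpu_hours


def ocpus_for(data_gb: float, run_time_hours: float, rate: float) -> int:
    """Same sizing formula as the rule of thumb, with a measured rate."""
    return math.ceil(data_gb / (rate * run_time_hours))


# Illustrative values only, not measurements.
history = [
    RunRecord(ocpus=64, data_gb=480.0, run_time_hours=0.5),
    RunRecord(ocpus=64, data_gb=960.0, run_time_hours=1.0),
]
rate = observed_rate(history)
needed = ocpus_for(data_gb=1000, run_time_hours=0.5, rate=rate)
print(rate, needed)
```

Weighting by OCPU-hours, rather than averaging per-run rates, keeps long runs from being under-represented; that choice is a judgment call in this sketch.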
Note

The number of OCPUs is limited by the VM shape you chose and the value set in your tenancy for VM.Total. You cannot use more VMs across all VM shapes than the value in VM.Total. For example, if each VM shape is set to 20, and VM.Total is set to 20, you cannot use more than 20 VMs across all your VM shapes. With flexible shapes, where the limit is measured as cores or OCPUs, 80 cores in a flexible shape is equivalent to 10 VM.Standard2.8 shapes. See Service Limits for more information.
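The limit arithmetic in this note can be checked with a short Python sketch. The helper names are hypothetical and the per-shape VM counts in the usage lines are invented for illustration; actual limit values come from your tenancy's Service Limits.

```python
def within_vm_total(vms_by_shape: dict[str, int], vm_total: int) -> bool:
    """True if the VMs requested across all shapes fit under VM.Total."""
    return sum(vms_by_shape.values()) <= vm_total


def flex_cores_as_vms(cores: int, ocpus_per_vm: int = 8) -> int:
    """Whole fixed-size VMs that a flexible-shape core count corresponds to."""
    if ocpus_per_vm <= 0:
        raise ValueError("ocpus_per_vm must be > 0")
    return cores // ocpus_per_vm


# With VM.Total set to 20, a total of 20 VMs fits but 21 does not.
fits = within_vm_total({"VM.Standard2.8": 12, "VM.Standard2.4": 8}, vm_total=20)
too_many = within_vm_total({"VM.Standard2.8": 12, "VM.Standard2.4": 9}, vm_total=20)

# 80 flexible-shape cores correspond to 10 VM.Standard2.8 shapes (8 OCPUs each).
equivalent_vms = flex_cores_as_vms(80)
print(fits, too_many, equivalent_vms)
```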

Flexible Compute Shapes

Data Flow supports flexible compute shapes for Spark jobs.

The following flexible compute shapes are supported:
  • VM.Standard3.Flex (Intel)
  • VM.StandardE3.Flex (AMD)
  • VM.StandardE4.Flex (AMD)
  • VM.Standard.A1.Flex (Arm processor from Ampere)
Learn more about flexible compute shapes from the Compute documentation.
When you create or edit an application, select the flexible shape for the driver and executor. For each OCPU selection, you can choose the flexible memory option.
Note

The driver and executor must have the same shape.

Migrating Applications from VM.Standard2 Compute Shapes

Follow these steps when migrating your existing Data Flow applications from VM.Standard2 to flexible compute shapes.

  1. Request the limits for your choice of flexible shape.
    OCPU count defines the flexible shape limits. With VM.Standard2 compute shapes, node count defined the limits. For example, if you have an application which uses 16 OCPUs for driver and 16 OCPUs for one executor, you request 32 OCPUs in your limit increase request.
  2. (Optional) If you expect to run more concurrent jobs across different shapes, request a higher VM.Total limit.
  3. When you create an application or edit an application, select the flexible shape for the driver and executor.
    Note

    The driver and executor must have the same shape.
  4. (Optional) For each OCPU selection, choose the flexible memory option.
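The OCPU count to request in step 1 can be worked out as the driver's OCPUs plus the OCPUs of all executors. The Python below is a minimal sketch of that sum; the function name `flex_ocpu_limit` is hypothetical, and it covers a single application run only.

```python
def flex_ocpu_limit(driver_ocpus: int, executor_ocpus: int,
                    num_executors: int) -> int:
    """OCPUs to request for one run: driver plus all executors."""
    if min(driver_ocpus, executor_ocpus) <= 0 or num_executors < 0:
        raise ValueError("OCPU counts must be > 0 and executors >= 0")
    return driver_ocpus + executor_ocpus * num_executors


# The example from step 1: a 16-OCPU driver and one 16-OCPU executor.
limit = flex_ocpu_limit(driver_ocpus=16, executor_ocpus=16, num_executors=1)
print(limit)
```

For the example in step 1, this yields 32 OCPUs.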