Data Flow

Data Flow is a fully managed, serverless Apache Spark service used to run large-scale distributed data processing for analytics. It enables teams to execute batch ETL/ELT pipelines, data preparation, feature engineering, and aggregations without provisioning or managing Spark clusters. Data Flow provisions compute on demand for each run, auto-scales, and then releases resources when jobs complete—supporting cost-efficient execution for intermittent and scheduled workloads. This subject area enables tracking Data Flow pools and their details.

Business Questions

The subject area can answer the following business questions:

  • What's the Data Flow pool count?
  • What's the number of nodes currently in active use for a pool?
  • What's the number of runs currently using a pool?
  • How many Data Flow pools are active today?
  • How has the pool count changed over time (monthly)?
  • What's the pool count across compartments?
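Several of these questions are simple counts and group-bys over pool records. As a minimal sketch, assuming hypothetical pool records (the field names below are illustrative, not the actual Data Flow schema), the "active today" and "across compartments" questions look like:

```python
from collections import Counter

# Hypothetical pool records; field names are illustrative only.
pools = [
    {"name": "etl-pool", "compartment": "analytics", "state": "ACTIVE"},
    {"name": "prep-pool", "compartment": "analytics", "state": "ACTIVE"},
    {"name": "adhoc-pool", "compartment": "sandbox", "state": "STOPPED"},
]

# How many Data Flow pools are active today?
active = [p for p in pools if p["state"] == "ACTIVE"]
print(len(active))  # 2

# What's the pool count across compartments?
by_compartment = Counter(p["compartment"] for p in pools)
print(dict(by_compartment))
```

The same pattern extends to the monthly-trend question by grouping on a date field instead of the compartment.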

Logical Model

The Data Flow subject area is based on a relationship-driven logical model.

This diagram shows how the Data Flow pool fact table relates to its dimension tables:

Relationship diagram with the Data Flow Pool table connected to pool, compartment, region, tenancy, and date dimensions.
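The diagram describes a star join: each pool fact row carries keys into the dimension tables. A minimal sketch under that assumption (surrogate keys and column names are illustrative, not the actual OCI schema):

```python
# Dimension tables keyed by surrogate key (illustrative values).
compartment_dim = {1: "analytics", 2: "sandbox"}
region_dim = {10: "us-ashburn-1"}

# Fact rows reference each dimension by key, one row per pool.
fact_rows = [
    {"ocira_fact_key": 100, "compartment_key": 1, "region_key": 10},
    {"ocira_fact_key": 101, "compartment_key": 2, "region_key": 10},
]

# Resolving a fact row against its dimensions -- the join the diagram depicts.
joined = [
    {
        "fact": r["ocira_fact_key"],
        "compartment": compartment_dim[r["compartment_key"]],
        "region": region_dim[r["region_key"]],
    }
    for r in fact_rows
]
print(joined[0])
```

In the actual subject area this join is resolved by the analytics layer; the sketch only shows the shape of the relationship.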

Metric Details

The fact folders in this subject area include the following metrics:

Metric Details for Data Flow

  • Data Flow Pool Count: COUNT(ocira_fact_key), where the fact view column ocira_fact_key maps to ocira$fact_key
  • Idle Timeout in Minutes: The idle timeout, in minutes, configured for the pool
  • Pool Metrics Active Runs Count: The number of runs currently using the pool
  • Pool Metrics Actively Used Node Count: The number of nodes currently in active use for the pool
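The Data Flow Pool Count metric is defined as COUNT(ocira_fact_key) over the fact view. A minimal sketch of that aggregation using an in-memory SQLite table (the table name and the columns other than ocira_fact_key are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative stand-in for the pool fact view.
conn.execute(
    "CREATE TABLE dataflow_pool_fact "
    "(ocira_fact_key INTEGER, active_runs INTEGER, actively_used_nodes INTEGER)"
)
conn.executemany(
    "INSERT INTO dataflow_pool_fact VALUES (?, ?, ?)",
    [(100, 3, 8), (101, 0, 0)],
)

# Data Flow Pool Count metric: COUNT(ocira_fact_key) counts non-null keys.
(pool_count,) = conn.execute(
    "SELECT COUNT(ocira_fact_key) FROM dataflow_pool_fact"
).fetchone()
print(pool_count)  # 2
```

Because COUNT(column) skips NULLs, the metric counts only fact rows that have a fact key, which is what makes it a reliable pool count.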