Before You Begin With Data Flow SQL Endpoints

Review prerequisites for Data Flow SQL Endpoints.

To use Data Flow SQL Endpoints, you must have:

  • An Oracle Cloud Infrastructure account. Trial accounts can be used to demonstrate Data Flow.
  • A Service Administrator role for the Oracle Cloud services. When the service is activated, the credentials and URL are sent to the chosen Account Administrator. The Account Administrator creates an account for each user who needs access to the service.
  • A supported browser, such as:
    • Microsoft Internet Explorer 11.x or later

    • Mozilla Firefox ESR 38 or later

    • Google Chrome 42 or later

    Note

    For the Spark UI, only use Google Chrome.
  • Data for processing loaded into Object Storage. The data can be read from external data sources or cloud services. Data Flow SQL Endpoints optimize performance and security for data stored in Object Storage.

Note

Avoid entering confidential information when assigning descriptions, tags, or friendly names to cloud resources through the Oracle Cloud Infrastructure Console, API, or CLI. This also applies when creating or editing applications in Data Flow.

Understanding SQL Endpoints

A Data Flow SQL Endpoint is a service entity that uses long-running compute clusters in your tenancy. You choose a compute shape and the number of instances to use, and each cluster runs until an administrator stops it. The cluster runs Spark, whose SQL engine is fast, integrates with Data Flow, and supports unstructured data. You connect using ODBC or JDBC and authenticate with IAM credentials.
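
For example, the following minimal sketch connects over ODBC with pyodbc. The DSN name, credentials, and table are hypothetical placeholders, not the service's documented connection settings; configure the ODBC data source with the driver and endpoint details from the Console.

  # Minimal sketch: connect to a SQL Endpoint over ODBC with pyodbc.
  # The DSN, credentials, and table below are hypothetical placeholders.
  import pyodbc

  conn = pyodbc.connect("DSN=dataflow_sql_endpoint;UID=<iam_user>;PWD=<auth_token>")
  cursor = conn.cursor()
  cursor.execute("SELECT COUNT(*) FROM sales.orders")  # hypothetical table
  print(cursor.fetchone()[0])
  conn.close()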

What are Data Flow SQL Endpoints

Data Flow SQL Endpoints are designed for developers, data scientists, and advanced analysts to interactively query data directly where it lives in the data lake. This data can be relational, semi-structured, or unstructured, such as logs, sensor streams, and video streams, and is typically stored in the object store. As the volume and complexity of data grow, tools for exploring and analyzing data in the data lake in its native formats, rather than transforming or moving it, become important. Using Data Flow SQL Endpoints, you can economically process large amounts of raw data, with cloud native security used to control access. Users can access the insights they need in a self-service way, with no need to coordinate complex IT projects or worry about stale data. Queries in Data Flow SQL Endpoints seamlessly interoperate with Data Flow Batch for scheduled production pipelines. They enable fast data analytics and use long-running compute clusters that are fixed in size and run until stopped by the administrator.

Data Flow SQL Endpoints:

  • Provide interactive analytics directly against the data lake.
  • Are built on Spark for scale-out, easy reading and writing of unstructured data, and interoperability with existing Data Flow applications.
  • Use SQL to make analytics easier.
  • Support major Business Intelligence (BI) tools using ODBC or JDBC connections with IAM credentials.
  • Use data loaded into Object Storage for processing. The data can be read from external data sources or cloud services.

Data Flow SQL Endpoints support all the same file types that Spark supports, for example JSON, Parquet, CSV, and Avro.
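
As a rough illustration of that format flexibility, the following PySpark sketch reads Parquet, JSON, and CSV data with the same read API. The bucket, namespace, and paths are hypothetical.

  # Sketch: reading common file formats from Object Storage with Spark.
  # The bucket, namespace, and paths are hypothetical placeholders.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("format-examples").getOrCreate()

  orders = spark.read.parquet("oci://bucket@namespace/warehouse/orders/")
  events = spark.read.json("oci://bucket@namespace/raw/events/")
  readings = spark.read.csv("oci://bucket@namespace/raw/readings/",
                            header=True, inferSchema=True)

  orders.createOrReplaceTempView("orders")
  spark.sql("SELECT COUNT(*) FROM orders").show()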

Considerations about the Data Catalog Metastore Integration

A tight integration between the Data Flow SQL Endpoints and the Data Catalog Metastore ("metastore") is fundamental to providing consistent, reliable, and governed access to both external and managed tables. Through this integration, a SQL Endpoint uses the metastore as the authoritative repository for schemas, table definitions, partition metadata, and storage locations, allowing queries to be planned and optimized without repeatedly scanning the underlying files.

For external tables, the metastore ensures that schema and partition information remain consistent with Object Storage layouts, while for managed and Delta tables, it tracks transactional metadata, lineage, and lifecycle operations. This unified metadata layer enables Spark SQL to deliver predictable performance, enforce governance and access controls, support schema evolution, and maintain compatibility across workloads and clusters.
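
As an illustrative sketch rather than the service's documented setup, the following Spark SQL statements register an external, partitioned table with the metastore so that later queries resolve schema, partition, and location metadata without rescanning files. The database, table, and Object Storage paths are hypothetical.

  # Sketch: register an external, partitioned table in the metastore so that
  # schema, partition, and location metadata are served from the metastore
  # rather than rediscovered by scanning files. Names and paths are hypothetical.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
  spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_logs (
          request_id STRING,
          status     INT,
          latency_ms DOUBLE
      )
      PARTITIONED BY (log_date DATE)
      STORED AS PARQUET
      LOCATION 'oci://bucket@namespace/logs/web/'
  """)

  # Sync partitions that already exist in Object Storage into the metastore.
  spark.sql("MSCK REPAIR TABLE analytics.web_logs")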

The metastore employs a lightweight locking mechanism to ensure that concurrent data definition language (DDL) operations issued through a SQL Endpoint don't corrupt metadata or create inconsistent table states. When a DDL statement such as CREATE/ALTER TABLE/PARTITION or DROP TABLE/PARTITION is executed, the metastore acquires an exclusive lock, preventing other sessions from modifying that table's schema or metadata until the operation completes.

This lock coordination protects against race conditions—for example, two users altering the same table simultaneously—and ensures that the SQL Endpoint operates on a coherent, serialized view of metadata. By coupling DDL execution with metastore-level lock enforcement, the SQL Endpoint maintains transactional integrity for metadata operations even in highly concurrent, multi-user environments.

However, the locks temporarily block access to table metadata, and long-running DDL operations can introduce noticeable delays for other users' queries that require metadata reads, especially in shared or highly concurrent environments.

To minimize these impacts, coordinate DDL activity during low-traffic maintenance windows or through orchestrated workflows that ensure schema modifications occur outside peak query execution periods.
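
For example, one simple pattern is to batch pending schema changes into a single job that runs during a maintenance window, so metastore locks are acquired and released sequentially outside peak hours. The statements, names, and scheduling mechanism in the following sketch are illustrative only.

  # Sketch: serialize pending DDL into one off-peak batch so that metastore
  # locks are held outside peak query hours. Statements, names, and the
  # scheduling mechanism (cron, an orchestration tool, and so on) are
  # illustrative only.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  ddl_statements = [
      "ALTER TABLE analytics.web_logs ADD COLUMNS (user_agent STRING)",
      "ALTER TABLE analytics.web_logs ADD IF NOT EXISTS PARTITION (log_date = '2024-01-01')",
      "DROP TABLE IF EXISTS analytics.web_logs_staging",
  ]

  for statement in ddl_statements:
      # Each statement acquires and releases its exclusive metastore lock in
      # turn instead of contending with interactive queries during the day.
      spark.sql(statement)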