IVF Indexing on External Iceberg Tables

This feature enables vector search capabilities on Iceberg external tables stored in cloud object storage systems such as Oracle Cloud Infrastructure Object Storage or Amazon S3.

Iceberg tables are often extremely large and consist of many data files. A single global index can take too long to refresh even when only a small number of data files change because the refresh may still require scanning and rebuilding large portions of the overall index.

To address this, you can create an IVF vector index within the database for an Iceberg table stored outside the Oracle AI Database. This design uses a local, per-file indexing approach, enabling targeted, incremental refresh aligned with Iceberg's snapshot-based change patterns. As a result, it delivers fast index refresh, more predictable performance, and more efficient index maintenance at scale.

External table support for Iceberg has the following requirements:
  • The Iceberg table must be non-partitioned
  • The Iceberg table must use the Copy-on-Write table property

This section introduces how to set up, create, and validate an IVF vector index on external Iceberg tables. It covers system prerequisites, required configurations, index creation, post-creation validation, and index maintenance operations.

File-Based Local IVF Index Architecture

Unlike the original partition-local IVF index, where the partition key is derived from the base table fragment object ID, the Iceberg external table implementation uses the underlying data file.

In this design, the partition key is the data file ID associated with each Iceberg/Parquet file. Oracle assigns and tracks this identifier in the XT_FILE auxiliary table, which stores file metadata such as file path, name, and last-modified timestamp. Although the external base table itself is not physically partitioned, it can be treated as virtually partitioned by data file ID for indexing purposes. This enables the system to build and maintain file-local IVF index structures, improving refresh and maintenance efficiency because changes typically affect only a subset of files.

FILE_ID FILE_PATH FILE_NAME FILE_LASTMOD PART_ID
1 path/to/file file1.parquet 24-MAR-25 10.23.52.000 PM 1
2 path/to/file file2.parquet 24-MAR-25 10.23.54.000 PM 2
3 path/to/file file3.parquet 24-MAR-25 10.23.54.000 PM 3

Prerequisites

Before creating an IVF vector index on an Iceberg external table, configure the following prerequisites:

Enable hidden columns

Vector indexing relies on hidden columns to relate each row to its underlying storage file. These must be enabled before index creation:
-- Session level
ALTER SESSION SET "_xt_table_hidden_column" = TRUE;	  -- Enable FILE$PATH, FILE$NAME

-- System level
ALTER SYSTEM SET "_xt_table_hidden_column" = TRUE;	  -- Enable FILE$PATH, FILE$NAME

-- Flush plan cache
ALTER SYSTEM FLUSH SHARED_POOL;                           -- Recommended after changes

Create External Iceberg Table for IVF Indexing

There are two main approaches to create external tables for use with IVF indexing on Iceberg tables. Each approach determines how your table and vector index interact with data changes on the source Iceberg table.

Metadata-Based External Table Creation (Snapshot-Reference)

To enable IVF indexing, you must first create an external table that references a specific Iceberg snapshot (typically through a metadata.json URI). This table is bound to the snapshot described by the chosen metadata.json file.

Because external tables lack reliable ROWID semantics, you must define a single column primary key, which is used for precise row identification during vector search. The primary key should be declared with the PRIMARY KEY RELY DISABLE clause and the corresponding column in the underlying Iceberg table must be UNIQUE and NOT NULL.

DEFINE metadata_uri = 'https://objectstorage.<region>.oraclecloud.com/n/<tenancy>/b/<bucket>/o/<pathtometadata>/metadata.json';

CREATE TABLE IF NOT EXISTS vdb_ext_ivf01
(
  id   VARCHAR(10)   PRIMARY KEY RELY DISABLE,
  c1   NUMBER,
  n1   VARCHAR2(10),
  emb  VECTOR(1024, float32)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_BIGDATA
  DEFAULT DIRECTORY DATA_PUMP_DIR
  ACCESS PARAMETERS
  (
    com.oracle.bigdata.credential.name = OCI_CRED
    com.oracle.bigdata.fileformat      = parquet
    com.oracle.bigdata.access_protocol = iceberg
  )
  LOCATION ('iceberg:&metadata_uri')
)
REJECT LIMIT UNLIMITED;

If you change or update the source Iceberg table (for example, adding new files), a new metadata.json will be generated, but the base table and its index will not see these changes. To see new data, you must recreate or manually repoint the external table to the latest metadata.json and rebuild the index.

Catalog-Based Table Creation (Dynamic Reference)

This method creates the external table using Iceberg's catalog capabilities, referencing the catalog instead of referencing to a specific snapshot. Using this method, the table always reflects the most up-to-date state.

CREATE TABLE IF NOT EXISTS vdb_ext_ivf01
(
  id   VARCHAR(10)   PRIMARY KEY RELY DISABLE,
  c1   NUMBER,
  n1   VARCHAR2(10),
  emb  VECTOR(1024, float32)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_BIGDATA
  DEFAULT DIRECTORY DATA_PUMP_DIR
  ACCESS PARAMETERS
  (
    com.oracle.bigdata.fileformat               = parquet
    com.oracle.bigdata.credential.name          = OCI_CRED
    com.oracle.bigdata.access_protocol          = iceberg
    com.oracle.bigdata.access_protocol.config   =
    {
      "iceberg_catalog_type": "aws_glue",
      "iceberg_glue_region": "us-west-2",
      "iceberg_table_path": "<database>.<table>"
    }
  )
  LOCATION ('iceberg:')
)
REJECT LIMIT UNLIMITED;

Access is configured by specifying catalog details in ACCESS PARAMETERS, rather than specifying a specific metadata file. When the table is queried, Oracle automatically fetches the latest snapshot and underlying data files from the catalog. Any changes to the source Iceberg table (such as newly added or modified partitions) are automatically reflected after a short interval.

Create IVF Index for an External Iceberg Table

The IVF index support for external tables over Iceberg enables scalable and efficient vector search on large datasets stored in external object stores.

The approach leverages a file-local IVF partition mechanism, in which each external data file (such as a Parquet file) is associated with its own dedicated IVF index partition. The indexing process is tightly integrated with Iceberg snapshot management, ensuring data consistency and operational efficiency.

IVF index creation on Iceberg tables uses the same syntax as on regular tables. The index build mechanism automatically detects the underlying files of the declared Iceberg snapshot and constructs a file-local IVF index for each. Internal partition-local IVF logic is leveraged to optimize both index build and query performance.

You can specify any distance metric for similarity calculations. In this example, cosine distance is used.

CREATE VECTOR INDEX vdb_ext_ivf01_idx
ON vdb_ext_ivf01 (emb)
ORGANIZATION NEIGHBOR PARTITIONS DISTANCE COSINE;
Upon a successful IVF index creation, the following auxiliary tables are generated:
  • VECTOR$VDB_EXT_IVF01_IDX$76013_76015_0$IVF_FLAT_CENTROIDS
  • VECTOR$VDB_EXT_IVF01_IDX$76013_76015_0$IVF_FLAT_CENTROID_PARTITIONS
  • VECTOR$VDB_EXT_IVF01_IDX$76013_76015_0$XT_FILE
  • VECTOR$VDB_EXT_IVF01_IDX$76013_76015_0$XT_FILE_MAINT_TASK
It is important to validate the associated auxiliary tables to ensure the index is correctly built and aligned with the underlying data. Consider the following validation points:
  • XT_FILE Auxiliary Table

    The XT_FILE auxiliary table should list all external data files that are part of the Iceberg snapshot used during index creation. This ensures the index covers the complete set of files referenced by that snapshot.

  • ID and PART_ID Consistency

    Within the XT_FILE table, the ID and PART_ID columns must have identical values. This reflects the one-to-one mapping between each external file and its corresponding IVF index partition.

  • IVF_FLAT_CENTROID_PARTITIONS Table Size

    The IVF_FLAT_CENTROID_PARTITIONS table must contain the same number of records as the base table. This guarantees that every row in your data has a corresponding entry within the index partition, ensuring completeness for vector search operations.

Vector Search with IVF Index on Iceberg External Tables

After you create an IVF vector index on an Iceberg external table, you can run approximate k-Approximate Nearest Neighbor (k-ANN) queries that use the index to accelerate semantic or similarity search.

The following example shows a k-ANN query that computes cosine distance and returns the approximate top 5 closest vectors:

SELECT  id, vector_distance(emb, '[0,1,7]', COSINE) AS dist
FROM    vdb_ext_ivf01 
ORDER   BY dist
FETCH APPROX FIRST 5 ROWS ONLY;

Snapshot-referenced external table: k-ANN results are based on the specific Iceberg snapshot pointed to by the table's metadata.json. Newer Iceberg commits are not visible until the external table is updated to reference a newer snapshot.

Catalog-referenced external table: k-ANN results reflect the latest committed Iceberg snapshot available at query time. As new snapshots are committed, k-ANN query results are always evaluated as of the latest snapshot regardless of whether the vector index has been maintained/refreshed. The k-ANN query is internally transformed to use the correct data files as of the latest snapshot. Since the vector index is file-local, the transformed query can perform k-ANN using a partial vector index along with data-file scans, always returning results consistent with the latest snapshot.

Supported k-ANN Query Patterns
IVF indexing on an Iceberg external table supports k-ANN queries with no filter and with pre-filtering.

Note:

IVF indexes with included columns are not currently supported on Iceberg external tables.

IVF Index Maintenance

Index maintenance ensures that the IVF index stays synchronized with changes to the underlying Iceberg table so that queries can utilize the most current data.

When the underlying Iceberg table changes, the index may require maintenance to reflect the latest snapshot. During maintenance, the index maintenance process builds indexes for newly added data files.

To achieve this, the system uses coordinated background tasks to advance the IVF index from an older Iceberg snapshot to the latest one while preserving query consistency during ongoing maintenance.

Key Components

Index maintenance uses the following internal auxiliary tables to track external data files and maintenance tasks:
  • XT_FILE: Tracks each external data file and whether its file-local IVF index is usable.
    CREATE TABLE VECTOR$<index_name>$<index_objn>$XT_FILE (
      id     NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- File ID
      path   VARCHAR2(4000)          NOT NULL,                 -- File path
      name   VARCHAR2(4000)          NOT NULL,                 -- File name
      status NUMBER        DEFAULT 1 NOT NULL                  -- 1: Index usable
                                                               -- 0: Index not usable
    );
    • status: 1 (index is usable) or 0 (index is not usable)
  • XT_FILE_MAINT_TASK: Records maintenance operations required to align the index with a newer Iceberg snapshot.
    CREATE TABLE VECTOR$<index_name>$<index_objn>$XT_FILE_MAINT_TASK (
      task_id      NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,   -- Task ID
      file_id      NUMBER,                                            -- File ID
      task_type    CHAR(1)        NOT NULL,                           -- Task type, A: Add index partition, D: Drop index partition
      task_status  CHAR(1)        NOT NULL,                           -- Task status, R: Ready, I: In progress, C: Completed, F: Failed 
      snapshot_id  VARCHAR2(4000) NOT NULL                            -- Target iceberg snapshot ID
    );
    • task_type: A (add index partition ) or D (drop index partition)
    • tast_status: R (Ready), I (In Progress), C (Completed), F (Failed)
    • snapshot_id: identifies the Iceberg snapshot that introduced the change

Updates to the vector index catalog

The vector index catalog (VECSYS.VECTOR$INDEX) stores additional information required for Iceberg-based indexing, including:
  • Iceberg access information (credential schema/name) and the current/next metadata_uri (snapshot metadata).
  • References to the auxiliary tables created for the index (object numbers and generated table names).
// SPARE2 JSON column
{
  "iceberg": {
    "credential_schema": "<CREDENTIAL_SCHEMA>",
    "credential_name"  : "<CREDENTIAL_NAME>",
    "metadata_uri"     : "https://oci_bucket/iceberg/vector/metadata/v1.metadata.json",
    "new_metadata_uri" : "https://oci_bucket/iceberg/vector/metadata/v2.metadata.json"
  }
}

// IDX_AUXILIARY_TABLES JSON column
{
  "xt_file_objn"           : 74315,
  "xt_file_name"           : "VECTOR$IDX_NAME$74315_74318_0$XT_FILE",
  "xt_file_maint_task_objn": 74320,
  "xt_file_maint_task_name": "VECTOR$IDX_NAME$74320_74321_0$XT_FILE_MAINT_TASK"
}

This catalog allows background tasks to locate the correct Iceberg table version and the corresponding maintenance state tables for a given IVF index.

Background (Autotask++) tasks

Automated maintenance is implemented using the following background tasks:

Table Change Detection Autotask

Periodically checks the Iceberg catalog for a newer snapshot than the one currently indexed. When changes are detected, it will do the following:
  1. Retrieves the list of files referenced by the new snapshot.
  2. Compares it with files recorded in XT_FILE.
  3. Computes a file-level diff. The file-list diffs consist of an add-file list, containing files newly introduced in the newer snapshot, and a drop-file list, containing files no longer present in the new snapshot.
  4. Registers the corresponding maintenance tasks in the XT_FILE_MAINT_TASK table.

Index Creation Autotask

Builds file-local IVF index structures for files in the add-file list. For each new file, it performs the following:
  1. Allocates a new file identifier (FILE_ID) using XT_FILE.
  2. Adds the corresponding partitions to the IVF index tables. For example, CENTROIDS and CENTROID_PARTITIONS using the file's ID as the partition key.
  3. Runs k-means clustering to populate CENTROIDS.
  4. Populates CENTROID_PARTITIONS based on the clustering output.
  5. Updates the XT_FILE_MAINT_TASK and XT_FILE tables to mark the index creation as completed.