IVF Indexing on External Iceberg Tables
This feature enables vector search capabilities on Iceberg external tables stored in cloud object storage systems such as Oracle Cloud Infrastructure Object Storage or Amazon S3.
Iceberg tables are often extremely large and consist of many data files. A single global index can take too long to refresh even when only a small number of data files change because the refresh may still require scanning and rebuilding large portions of the overall index.
To address this, you can create an IVF vector index within the database for an Iceberg table stored outside the Oracle AI Database. This design uses a local, per-file indexing approach, enabling targeted, incremental refresh aligned with Iceberg's snapshot-based change patterns. As a result, it delivers fast index refresh, more predictable performance, and more efficient index maintenance at scale.
- The Iceberg table must be non-partitioned
- The Iceberg table must use the Copy-on-Write table property
This section introduces how to set up, create, and validate an IVF vector index on external Iceberg tables. It covers system prerequisites, required configurations, index creation, post-creation validation, and index maintenance operations.
- File-Based Local IVF Index Architecture
Unlike the original partition-local IVF index, where the partition key is derived from the base table fragment object ID, the Iceberg external table implementation uses the underlying data file. - Prerequisites
Before creating an IVF vector index on an Iceberg external table, configure the following prerequisites: - Create External Iceberg Table for IVF Indexing
There are two main approaches to create external tables for use with IVF indexing on Iceberg tables. Each approach determines how your table and vector index interact with data changes on the source Iceberg table. - Create IVF Index for an External Iceberg Table
The IVF index support for external tables over Iceberg enables scalable and efficient vector search on large datasets stored in external object stores. - Vector Search with IVF Index on Iceberg External Tables
After you create an IVF vector index on an Iceberg external table, you can run approximate k-Approximate Nearest Neighbor (k-ANN) queries that use the index to accelerate semantic or similarity search. - IVF Index Maintenance
Index maintenance ensures that the IVF index stays synchronized with changes to the underlying Iceberg table so that queries can utilize the most current data.
Parent topic: Create Vector Indexes and Hybrid Vector Indexes
File-Based Local IVF Index Architecture
Unlike the original partition-local IVF index, where the partition key is derived from the base table fragment object ID, the Iceberg external table implementation uses the underlying data file.
In this design, the partition key is the data file ID associated with each
Iceberg/Parquet file. Oracle assigns and tracks this identifier in the
XT_FILE auxiliary table, which stores file metadata such as file path,
name, and last-modified timestamp. Although the external base table itself is not physically
partitioned, it can be treated as virtually partitioned by data file ID for indexing purposes.
This enables the system to build and maintain file-local IVF index structures, improving
refresh and maintenance efficiency because changes typically affect only a subset of files.
| FILE_ID | FILE_PATH | FILE_NAME | FILE_LASTMOD | PART_ID |
|---|---|---|---|---|
| 1 | path/to/file | file1.parquet | 24-MAR-25 10.23.52.000 PM | 1 |
| 2 | path/to/file | file2.parquet | 24-MAR-25 10.23.54.000 PM | 2 |
| 3 | path/to/file | file3.parquet | 24-MAR-25 10.23.54.000 PM | 3 |
Parent topic: IVF Indexing on External Iceberg Tables
Prerequisites
Before creating an IVF vector index on an Iceberg external table, configure the following prerequisites:
Enable hidden columns
-- Session level
ALTER SESSION SET "_xt_table_hidden_column" = TRUE; -- Enable FILE$PATH, FILE$NAME
-- System level
ALTER SYSTEM SET "_xt_table_hidden_column" = TRUE; -- Enable FILE$PATH, FILE$NAME
-- Flush plan cache
ALTER SYSTEM FLUSH SHARED_POOL; -- Recommended after changesParent topic: IVF Indexing on External Iceberg Tables
Create External Iceberg Table for IVF Indexing
There are two main approaches to create external tables for use with IVF indexing on Iceberg tables. Each approach determines how your table and vector index interact with data changes on the source Iceberg table.
Metadata-Based External Table Creation (Snapshot-Reference)
To enable IVF indexing, you must first create an external table that
references a specific Iceberg snapshot (typically through a metadata.json
URI). This table is bound to the snapshot described by the chosen
metadata.json file.
Because external tables lack reliable ROWID semantics, you
must define a single column primary key, which is used for precise row identification during
vector search. The primary key should be declared with the PRIMARY KEY RELY
DISABLE clause and the corresponding column in the underlying Iceberg table must
be UNIQUE and NOT NULL.
DEFINE metadata_uri = 'https://objectstorage.<region>.oraclecloud.com/n/<tenancy>/b/<bucket>/o/<pathtometadata>/metadata.json';
CREATE TABLE IF NOT EXISTS vdb_ext_ivf01
(
id VARCHAR(10) PRIMARY KEY RELY DISABLE,
c1 NUMBER,
n1 VARCHAR2(10),
emb VECTOR(1024, float32)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_BIGDATA
DEFAULT DIRECTORY DATA_PUMP_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.credential.name = OCI_CRED
com.oracle.bigdata.fileformat = parquet
com.oracle.bigdata.access_protocol = iceberg
)
LOCATION ('iceberg:&metadata_uri')
)
REJECT LIMIT UNLIMITED;
If you change or update the source Iceberg table (for example, adding new
files), a new metadata.json will be generated, but the base table and its
index will not see these changes. To see new data, you must recreate or manually repoint the
external table to the latest metadata.json and rebuild the index.
Catalog-Based Table Creation (Dynamic Reference)
This method creates the external table using Iceberg's catalog capabilities, referencing the catalog instead of referencing to a specific snapshot. Using this method, the table always reflects the most up-to-date state.
CREATE TABLE IF NOT EXISTS vdb_ext_ivf01
(
id VARCHAR(10) PRIMARY KEY RELY DISABLE,
c1 NUMBER,
n1 VARCHAR2(10),
emb VECTOR(1024, float32)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_BIGDATA
DEFAULT DIRECTORY DATA_PUMP_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.fileformat = parquet
com.oracle.bigdata.credential.name = OCI_CRED
com.oracle.bigdata.access_protocol = iceberg
com.oracle.bigdata.access_protocol.config =
{
"iceberg_catalog_type": "aws_glue",
"iceberg_glue_region": "us-west-2",
"iceberg_table_path": "<database>.<table>"
}
)
LOCATION ('iceberg:')
)
REJECT LIMIT UNLIMITED;
Access is configured by specifying catalog details in ACCESS PARAMETERS,
rather than specifying a specific metadata file. When the table is queried, Oracle
automatically fetches the latest snapshot and underlying data files from the catalog. Any
changes to the source Iceberg table (such as newly added or modified partitions) are
automatically reflected after a short interval.
Parent topic: IVF Indexing on External Iceberg Tables
Create IVF Index for an External Iceberg Table
The IVF index support for external tables over Iceberg enables scalable and efficient vector search on large datasets stored in external object stores.
The approach leverages a file-local IVF partition mechanism, in which each external data file (such as a Parquet file) is associated with its own dedicated IVF index partition. The indexing process is tightly integrated with Iceberg snapshot management, ensuring data consistency and operational efficiency.
IVF index creation on Iceberg tables uses the same syntax as on regular tables. The index build mechanism automatically detects the underlying files of the declared Iceberg snapshot and constructs a file-local IVF index for each. Internal partition-local IVF logic is leveraged to optimize both index build and query performance.
You can specify any distance metric for similarity calculations. In this example, cosine distance is used.
CREATE VECTOR INDEX vdb_ext_ivf01_idx
ON vdb_ext_ivf01 (emb)
ORGANIZATION NEIGHBOR PARTITIONS DISTANCE COSINE;
VECTOR$VDB_EXT_IVF01_IDX$76013_76015_0$IVF_FLAT_CENTROIDS-
VECTOR$VDB_EXT_IVF01_IDX$76013_76015_0$IVF_FLAT_CENTROID_PARTITIONS VECTOR$VDB_EXT_IVF01_IDX$76013_76015_0$XT_FILEVECTOR$VDB_EXT_IVF01_IDX$76013_76015_0$XT_FILE_MAINT_TASK
XT_FILE AuxiliaryTableThe
XT_FILE auxiliarytable should list all external data files that are part of the Iceberg snapshot used during index creation. This ensures the index covers the complete set of files referenced by that snapshot.IDandPART_IDConsistencyWithin the
XT_FILEtable, theIDandPART_IDcolumns must have identical values. This reflects the one-to-one mapping between each external file and its corresponding IVF index partition.IVF_FLAT_CENTROID_PARTITIONSTable SizeThe
IVF_FLAT_CENTROID_PARTITIONStable must contain the same number of records as the base table. This guarantees that every row in your data has a corresponding entry within the index partition, ensuring completeness for vector search operations.
Parent topic: IVF Indexing on External Iceberg Tables
Vector Search with IVF Index on Iceberg External Tables
After you create an IVF vector index on an Iceberg external table, you can run approximate k-Approximate Nearest Neighbor (k-ANN) queries that use the index to accelerate semantic or similarity search.
The following example shows a k-ANN query that computes cosine distance and returns the approximate top 5 closest vectors:
SELECT id, vector_distance(emb, '[0,1,7]', COSINE) AS dist
FROM vdb_ext_ivf01
ORDER BY dist
FETCH APPROX FIRST 5 ROWS ONLY;Snapshot-referenced external table: k-ANN results
are based on the specific Iceberg snapshot pointed to by the table's
metadata.json. Newer Iceberg commits are not visible until the external
table is updated to reference a newer snapshot.
Catalog-referenced external table: k-ANN results reflect the latest committed Iceberg snapshot available at query time. As new snapshots are committed, k-ANN query results are always evaluated as of the latest snapshot regardless of whether the vector index has been maintained/refreshed. The k-ANN query is internally transformed to use the correct data files as of the latest snapshot. Since the vector index is file-local, the transformed query can perform k-ANN using a partial vector index along with data-file scans, always returning results consistent with the latest snapshot.
Supported k-ANN Query PatternsNote:
IVF indexes with included columns are not currently supported on Iceberg external tables.See Also:
Parent topic: IVF Indexing on External Iceberg Tables
IVF Index Maintenance
Index maintenance ensures that the IVF index stays synchronized with changes to the underlying Iceberg table so that queries can utilize the most current data.
When the underlying Iceberg table changes, the index may require maintenance to reflect the latest snapshot. During maintenance, the index maintenance process builds indexes for newly added data files.
To achieve this, the system uses coordinated background tasks to advance the IVF index from an older Iceberg snapshot to the latest one while preserving query consistency during ongoing maintenance.
Key Components
XT_FILE: Tracks each external data file and whether its file-local IVF index is usable.CREATE TABLE VECTOR$<index_name>$<index_objn>$XT_FILE ( id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- File ID path VARCHAR2(4000) NOT NULL, -- File path name VARCHAR2(4000) NOT NULL, -- File name status NUMBER DEFAULT 1 NOT NULL -- 1: Index usable -- 0: Index not usable );status:1(index is usable) or0(index is not usable)
XT_FILE_MAINT_TASK: Records maintenance operations required to align the index with a newer Iceberg snapshot.CREATE TABLE VECTOR$<index_name>$<index_objn>$XT_FILE_MAINT_TASK ( task_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- Task ID file_id NUMBER, -- File ID task_type CHAR(1) NOT NULL, -- Task type, A: Add index partition, D: Drop index partition task_status CHAR(1) NOT NULL, -- Task status, R: Ready, I: In progress, C: Completed, F: Failed snapshot_id VARCHAR2(4000) NOT NULL -- Target iceberg snapshot ID );task_type:A(add index partition ) orD(drop index partition)tast_status:R(Ready),I(In Progress),C(Completed),F(Failed)snapshot_id: identifies the Iceberg snapshot that introduced the change
Updates to the vector index catalog
VECSYS.VECTOR$INDEX) stores additional
information required for Iceberg-based indexing, including:
- Iceberg access information (credential schema/name) and the current/next metadata_uri (snapshot metadata).
- References to the auxiliary tables created for the index (object numbers and generated table names).
// SPARE2 JSON column
{
"iceberg": {
"credential_schema": "<CREDENTIAL_SCHEMA>",
"credential_name" : "<CREDENTIAL_NAME>",
"metadata_uri" : "https://oci_bucket/iceberg/vector/metadata/v1.metadata.json",
"new_metadata_uri" : "https://oci_bucket/iceberg/vector/metadata/v2.metadata.json"
}
}
// IDX_AUXILIARY_TABLES JSON column
{
"xt_file_objn" : 74315,
"xt_file_name" : "VECTOR$IDX_NAME$74315_74318_0$XT_FILE",
"xt_file_maint_task_objn": 74320,
"xt_file_maint_task_name": "VECTOR$IDX_NAME$74320_74321_0$XT_FILE_MAINT_TASK"
}
This catalog allows background tasks to locate the correct Iceberg table version and the corresponding maintenance state tables for a given IVF index.
Background (Autotask++) tasks
Automated maintenance is implemented using the following background tasks:
Table Change Detection Autotask
- Retrieves the list of files referenced by the new snapshot.
- Compares it with files recorded in
XT_FILE. - Computes a file-level diff. The file-list diffs consist of an
add-filelist, containing files newly introduced in the newer snapshot, and adrop-filelist, containing files no longer present in the new snapshot. - Registers the corresponding maintenance tasks in the
XT_FILE_MAINT_TASKtable.
Index Creation Autotask
- Allocates a new file identifier (
FILE_ID) usingXT_FILE. - Adds the corresponding partitions to the IVF index tables. For
example,
CENTROIDSandCENTROID_PARTITIONSusing the file's ID as the partition key. - Runs k-means clustering to populate
CENTROIDS. - Populates
CENTROID_PARTITIONSbased on the clustering output. - Updates the
XT_FILE_MAINT_TASKandXT_FILEtables to mark the index creation as completed.
Parent topic: IVF Indexing on External Iceberg Tables