Summary of DBMS_DCAT Subprograms

This table lists the DBMS_DCAT package procedures used to create, query and drop Data Catalog connections.

Subprogram Description
SET_DATA_CATALOG_CREDENTIAL Procedure Set the data catalog access credential used by a specific connection to the data catalog
SET_OBJECT_STORE_CREDENTIAL Procedure Set the credential used by the given unique connection identifier for accessing the Object Store
SET_DATA_CATALOG_CONN Procedure Create a connection to the given data catalog
UNSET_DATA_CATALOG_CONN Procedure Remove an existing Data Catalog connection

SET_DATA_CATALOG_CREDENTIAL Procedure

This procedure sets the Data Catalog access credential used by a specific connection to the Data Catalog.

Syntax:
PROCEDURE DBMS_DCAT.SET_DATA_CATALOG_CREDENTIAL(
    credential_name VARCHAR2(128) DEFAULT NULL,
    dcat_con_id     VARCHAR2 DEFAULT NULL
  );
Parameters:
Parameter Description

credential_name

The credential used for accessing the Data Catalog.

dcat_con_id The unique Data Catalog connection identifier. This credential is used for the connection identified by dcat_con_id. The default is NULL.

This credential must have Manage Data Catalog permissions; see Data Catalog Policies.
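For example, assuming a credential has already been created with DBMS_CLOUD.CREATE_CREDENTIAL, the following call sets it as the Data Catalog access credential (the credential and connection names here are illustrative):
BEGIN
  DBMS_DCAT.SET_DATA_CATALOG_CREDENTIAL(
    credential_name => 'DCAT_CRED',
    dcat_con_id     => 'OCI_CONN1');
END;
/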

SET_OBJECT_STORE_CREDENTIAL Procedure

This procedure sets the credential that is used by the given unique connection identifier for accessing the Object Store. Changing the Object Store access credential alters all existing synced tables to use the new credential.

Syntax:
PROCEDURE DBMS_DCAT.SET_OBJECT_STORE_CREDENTIAL(
    credential_name  VARCHAR2(128),
    dcat_con_id      IN VARCHAR2 DEFAULT NULL
  );
Parameters:
Parameter Description

credential_name

The credential used by the external tables for accessing the Object Store.

dcat_con_id The unique Data Catalog connection identifier. The default is NULL.
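For example, the following call switches the external tables synced over the default connection to a new Object Store credential (the credential name is illustrative and must already exist):
BEGIN
  DBMS_DCAT.SET_OBJECT_STORE_CREDENTIAL(
    credential_name => 'OBJ_STORE_CRED');
END;
/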

SET_DATA_CATALOG_CONN Procedure

This procedure creates a connection to the given Data Catalog. The connection is required to synchronize metadata with Data Catalog. An Autonomous AI Database instance can connect to multiple Data Catalog instances and supports connecting to OCI Data Catalogs and AWS Glue Data Catalogs.

Syntax:
PROCEDURE DBMS_DCAT.SET_DATA_CATALOG_CONN (
   region         VARCHAR2 DEFAULT NULL,
   endpoint       VARCHAR2 DEFAULT NULL,
   catalog_id     VARCHAR2 DEFAULT NULL,
   dcat_con_id    VARCHAR2 DEFAULT NULL,
   catalog_type   VARCHAR2 DEFAULT NULL,
   token_endpoint VARCHAR2 DEFAULT NULL
 );
Parameters:
Parameter Description

region

The Data Catalog region. If the endpoint is specified, region is optional. If both endpoint and region are given, then the endpoint takes precedence. Default is NULL.

endpoint

The Data Catalog endpoint. If the region is specified, endpoint is optional. If both endpoint and region are given, then the endpoint takes precedence. Default is NULL.

catalog_id

The unique Oracle Cloud Identifier (OCID) for the Data Catalog instance. When connecting to AWS Glue Data Catalogs, catalog_id is optional.

dcat_con_id A unique Data Catalog connection identifier. This identifier is required when connecting to multiple Data Catalogs and is optional when connecting to only one. It is used to refer to the Data Catalog connection in subsequent calls or when querying views. If no identifier is specified, this procedure generates a NULL connection identifier. The following restrictions apply for dcat_con_id:
  • It must be unique within the Autonomous AI Database instance.
  • It must start with a letter.
  • It may contain alphanumeric characters, underscores (_), dollar signs ($), and pound signs (#).
  • Its maximum length is 128 characters.
catalog_type The type of data catalog to connect to. Allowed values:
  • OCI_DCAT - OCI Data Catalog
  • AWS_GLUE - AWS Glue Data Catalog
  • NULL - The catalog type is automatically detected from the provided region or endpoint.
token_endpoint The access token endpoint for OAuth authentication. Default is NULL.

Usage:

You only need to call this procedure once to set the connection. As part of the connection process, Autonomous AI Database adds custom properties to Data Catalog. These custom properties are accessible to Data Catalog users and allow you to override default names (for schemas, tables and columns) and column data types.

Before creating a connection, credentials must be created and set. For a description of the connection process, see Typical Workflow with Data Catalog for OCI Data Catalogs and User Workflow for Querying with AWS Glue Data Catalog for AWS Glue Data Catalogs.

Example: Connecting with a known OCID:

In this example, Autonomous AI Database is connecting to a Data Catalog in the uk-london-1 region. The catalog_id parameter uses the Oracle Cloud Identifier (OCID) for the Data Catalog instance. Because catalog_type is not specified, the type of Data Catalog (OCI Data Catalog or AWS Glue Data Catalog) is determined automatically.
BEGIN
  DBMS_DCAT.SET_DATA_CATALOG_CONN(
    region=>'uk-london-1',
    catalog_id=>'ocid1.datacatalog.oc1.uk-london-1...');
END;
/

Example: Connecting to an AWS Glue Data Catalog

A connection is the association between an Autonomous AI Database instance and an AWS Glue Data Catalog. After a successful connection, the Autonomous AI Database instance can synchronize with AWS Glue. Each AWS account has one AWS Glue Data Catalog per region, and each catalog is accessed through the service endpoint for its region. An Autonomous AI Database instance is associated with an AWS Glue Data Catalog by invoking DBMS_DCAT.SET_DATA_CATALOG_CONN and specifying the region, or the endpoint, for the region where the catalog resides.

See AWS Glue endpoints and quotas.

In this example, Autonomous AI Database is connecting to an AWS Glue Data Catalog in the us-west-2 region. Because this is an AWS Glue Data Catalog connection, the catalog_id parameter is not needed.
BEGIN
  DBMS_DCAT.SET_DATA_CATALOG_CONN(
    region=>'us-west-2',
    catalog_type=>'AWS_GLUE');
END;
/

UNSET_DATA_CATALOG_CONN Procedure

This procedure removes an existing Data Catalog connection.

Note:

Invoking this procedure drops all of the protected schemas and external tables that were created as part of previous synchronizations. It does not impact the metadata in Data Catalog.

Syntax:
PROCEDURE DBMS_DCAT.UNSET_DATA_CATALOG_CONN (
    dcat_con_id IN VARCHAR2 DEFAULT NULL
);
Parameters:
Parameter Description
dcat_con_id The unique Data Catalog connection identifier. The default is NULL.
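For example, when only a single connection exists, it can be removed without specifying an identifier; with multiple connections, pass the identifier (the name here is illustrative):
BEGIN
  DBMS_DCAT.UNSET_DATA_CATALOG_CONN(dcat_con_id => 'OCI_CONN1');
END;
/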

Synchronization Subprograms

Running a synchronization, creating and dropping a synchronization job, and dropping synchronized schemas can be performed with the procedures listed in this table.

Note:

On April 4, 2022, the sync_option and grant_read parameters were added to the DBMS_DCAT.RUN_SYNC procedure. To ensure correct performance of scheduled sync jobs created prior to that date, you need to drop and recreate the scheduled sync jobs. See DROP_SYNC_JOB Procedure and CREATE_SYNC_JOB Procedure.
Subprogram Description
CREATE_SYNC_JOB Procedure Create a scheduler job to invoke RUN_SYNC periodically
DROP_SYNC_JOB Procedure Drop an existing sync job for the given unique connection identifier
DROP_SYNCED_SCHEMAS Procedure Drop all previously synchronized schemas for the given unique connection identifier
RUN_SYNC Procedure Run a synchronization operation

RUN_SYNC Procedure

This procedure runs a synchronization operation and is the entry point to the synchronization. As an input, it takes lists of selected data catalog assets, folders and entities and materializes them by creating, dropping, and altering external tables.

The sync_option parameter specifies which operation the RUN_SYNC procedure performs: SYNC, DELETE or REPLACE. The operation is performed over entities within the scope of the synced_objects parameter.

Every call to the RUN_SYNC procedure returns a unique operation_id that can be used to query the USER_LOAD_OPERATIONS view to obtain information about the status of the sync and the corresponding log_table. The DBMS_DCAT$SYNC_LOG view can be queried for easy access to the log_table for the last sync operation executed by the current user. For further details, see DBMS_DCAT$SYNC_LOG View.

Note:

On April 4, 2022, the sync_option and grant_read parameters were added to the DBMS_DCAT.RUN_SYNC procedure. To ensure correct performance of scheduled sync jobs created prior to that date, you need to drop and recreate the scheduled sync jobs. See DROP_SYNC_JOB Procedure and CREATE_SYNC_JOB Procedure.

Synchronizing Partitioned Logical Entities or Glue Tables

The RUN_SYNC procedure creates a partitioned external table for each logical entity or Glue table when all three of the following apply:
  1. The OCI data catalog logical entity or Glue table has one or more partitioned attributes.
  2. For OCI data catalogs, the logical entity is derived from a prefix-based filename pattern. Partitioned logical entities derived from regex-based patterns are not supported.
  3. For OCI data catalogs, the logical entity is based on partitioned data that follows the hive-style or non-hive folder format. Logical entities based on partitioned data that follow the non-hive style format using object names are not supported.
    • Example 1. Logical entities based on harvested objects that follow the Hive style partitioning format with prefix-based filename patterns.

      Consider the following objects:
      Bucket: MYBUCKET
      cluster1/db1.db/sales/country=USA/year=2020/month=01/sales1.csv
      cluster1/db1.db/sales/country=USA/year=2020/month=01/sales2.csv
      cluster1/db1.db/sales/country=USA/year=2020/month=02/sales1.csv

      Harvesting the bucket using a filename pattern with a starting folder prefix of cluster1/db1.db generates a logical entity named SALES with three partition attributes: country, year, and month. The type for partitioned attributes is Partition while the type for non-partitioned attributes is Primitive.

    • Example 2. Logical entities based on harvested objects that follow the non-Hive style partitioning format with prefix-based filename patterns.
      Consider the following objects:
      Bucket: MYBUCKET
      cluster2/db2.db/sales/USA/2020/01/sales1.csv
      cluster2/db2.db/sales/USA/2020/01/sales2.csv
      cluster2/db2.db/sales/USA/2020/02/sales1.csv

      Harvesting the bucket using a filename pattern with a starting folder prefix of cluster2/db2.db generates a logical entity named SALES with three partition attributes: name0, name1, and name2. The only difference from Example 1 is that the names of the partitioned attributes are auto-generated, while in Example 1 they are extracted from the URL (country, year, and month, respectively).

For a complete end to end example of synchronizing partitioned logical entities, see Example: A Partitioned Data Scenario.
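Once synced, the partitioned external table from Example 1 can be queried like any partitioned table, and predicates on the partition columns enable partition pruning. A minimal sketch, assuming the table was created as SALES in a derived schema (the schema name below is illustrative; see Synchronization Mapping for the actual naming):
SELECT country, year, month, COUNT(*) AS num_rows
  FROM dcat$my_asset_mybucket.sales
 WHERE country = 'USA'
   AND year    = '2020'
 GROUP BY country, year, month;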

Incremental Synchronization of Partitioned Logical Entities/Glue Tables

Every call to the RUN_SYNC procedure specifies a set of OCI data catalog logical entities or AWS Glue tables to be synced with the database. When a logical entity or Glue table is present in two RUN_SYNC calls, the second call preserves and possibly alters existing external tables. The following table shows which logical entity or Glue table changes are supported when the logical entity or Glue table is partitioned:
Logical Entity or Glue Table Change Action

Addition, removal, or update of a partition

All partitions of the external partitioned table are updated, regardless of whether a change has been detected by the data catalog.

Addition of a partitioned attribute

Adding a partitioned column to an external partitioned table is not supported. An exception is raised.

Deletion of a partition attribute

Dropping a partitioned column from an external partitioned table is not supported. An exception is raised.

Renaming of a partitioned attribute

Renaming a partitioned column in an external partitioned table is not supported. An exception is raised.

Syntax:
PROCEDURE DBMS_DCAT.RUN_SYNC (
    synced_objects   IN  CLOB,
    sync_option      IN VARCHAR2 DEFAULT 'SYNC',
    error_semantics  IN VARCHAR2 DEFAULT 'SKIP_ERRORS',
    log_level        IN VARCHAR2 DEFAULT 'INFO',
    grant_read       IN VARCHAR2 DEFAULT NULL,
    dcat_con_id      IN VARCHAR2 DEFAULT NULL
 );
PROCEDURE DBMS_DCAT.RUN_SYNC (
    synced_objects   IN  CLOB,
    sync_option      IN VARCHAR2 DEFAULT 'SYNC',
    error_semantics  IN VARCHAR2 DEFAULT 'SKIP_ERRORS',
    log_level        IN VARCHAR2 DEFAULT 'INFO',
    grant_read       IN VARCHAR2 DEFAULT NULL,
    operation_id     OUT NOCOPY NUMBER,
    dcat_con_id      IN VARCHAR2 DEFAULT NULL
 );
Parameters:
Parameter Description

synced_objects

This parameter is a JSON document that specifies the data catalog objects to synchronize.

For OCI Data Catalogs, the JSON document specifies a set of entities at multiple levels of granularity: data assets, folders (Object Store buckets), or logical entities. It contains an asset_list that is either an array of asset objects or an array containing a single "*" string, which means 'sync all (Object Store) data assets in the catalog'.

For AWS Glue Data Catalogs, the JSON document specifies the tables to sync at two levels of granularity: databases and individual tables. The document specifies a list of databases; users can restrict the set of tables to be synced by specifying individual tables within a database.

sync_option (Optional) There are three options:
  • SYNC (Default) - This option ensures that what is in the data catalog, over the synced_objects scope, is represented in the Autonomous AI Database. If a logical entity or Glue table was deleted from the data catalog since the last sync operation, then it is deleted in the Autonomous AI Database. The following operations are performed over the synced_objects scope:
    • Adds tables for new data catalog entities
    • Removes tables for deleted data catalog entities
    • Updates properties (such as name, columns and data types) for existing tables
  • DELETE - Deletes tables within the synced_objects scope.
  • REPLACE - Replaces all currently synced objects with the objects within the synced_objects scope.

error_semantics

(Optional) This parameter specifies the error behavior. If set to SKIP_ERRORS, the sync attempts to continue despite errors encountered for individual entities. If set to STOP_ON_ERROR, the procedure fails on the first encountered error. The default is SKIP_ERRORS.

log_level

(Optional) This parameter specifies the logging level. The allowed values, in increasing level of detail, are OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, and ALL. The default is INFO.
grant_read (Optional) This parameter is a list of users/roles that are automatically granted READ privileges on all external tables processed by this invocation of RUN_SYNC. All users/roles in the grant_read list are given READ privileges on all new or already existing external tables that correspond to entities specified by the synced_objects parameter. The RUN_SYNC procedure preserves already granted privileges on synced external tables.

operation_id

(Optional) This parameter is used to find the corresponding entry in USER_LOAD_OPERATIONS for the sync and determine the name of the log table.

Note: A version of RUN_SYNC that does not return an operation_id is available so users can query USER_LOAD_OPERATIONS for the latest sync.

dcat_con_id This parameter is the unique data catalog connection identifier that was specified when the connection to the data catalog was created. See SET_DATA_CATALOG_CONN Procedure. This parameter identifies which connection is used for synchronization and becomes a part of the derived schema name. See Synchronization Mapping for a description of how the schema name is derived. The parameter default is NULL.

Example: Synchronize all OCI Data Catalog entities

In the following example, all Data Catalog entities are synchronized.
EXEC DBMS_DCAT.RUN_SYNC(synced_objects=>'{"asset_list":["*"]}');

Example: synced_objects parameter for synchronizing all OCI Data Catalog data assets

The following is an example synced_objects parameter for synchronizing all (Object Storage) data assets in the Data Catalog.
{"asset_list" : ["*"]}

Example: synced_objects parameter for synchronizing specific OCI Data Catalog data assets

The following is an example synced_objects parameter for synchronizing two data assets.
{"asset_list": [
        {
            "asset_id":"...-9e2e7f78fd5f"
        },
        {
            "asset_id":"...-9e2e7f74523"
        }
    ]}

Example: synced_objects parameter for synchronizing specific OCI Data Catalog entities within a data asset

The following shows an example synced_objects parameter for synchronizing two entities within the data asset.
{"asset_list": [
        {
            "asset_id":"...-9e2e7f78fd5f",
            "folder_list":[
                "f1",         
                "f2" 
           ]
        }   
    ]}

Example: synced_objects parameter for synchronizing specific OCI Data Catalog folders and entities within a data asset

The following shows an example synced_objects parameter for synchronizing two folders and two entities within the data asset.
{"asset_list":[
        {
            "asset_id":"...-9e2e7f78fd5f",
            "entity_list": [
                "entity1",     
                "entity2"
            ],
             "folder_list": [
                "f1",         
                "f2"
            ]
        }   
    ]}

Example: synced_objects Parameter for Synchronizing All AWS Glue Data Catalog Databases

The following shows an example synced_objects parameter for synchronizing all databases in the AWS Glue Data Catalog.
{"database_list":["*"]}

Example: synced_objects Parameter for Synchronizing Two AWS Glue Data Catalog Databases

The following shows an example synced_objects parameter for synchronizing two AWS Glue Data Catalog databases.
{"database_list":[
    {"database":"tpcdscsv"},
    {"database":"tpcdsparquet"} ]}

Example: synced_objects Parameter for Synchronizing Three Tables from an AWS Glue Data Catalog Database

The following shows an example synced_objects parameter for synchronizing three tables from an AWS Glue Data Catalog database.
{"database_list":[
 {"database":"tpcdsparquet",
     "table_list": [ "tpcdsparquet_customer",
                     "tpcdsparquet_item",
                     "tpcdsparquet_web_sales" ] } ]}

CREATE_SYNC_JOB Procedure

This procedure creates a scheduler job that invokes RUN_SYNC periodically.

It takes as input the set of objects to be synced, the error semantics, the log level, and a repeat interval. See RUN_SYNC Procedure for further details on how synchronization works.

There can only be a single sync job. The CREATE_SYNC_JOB procedure fails if another job is already specified, unless the force parameter is set to TRUE. If force is set to TRUE the previous job is dropped.

If a scheduler job attempts to run while another sync is in progress, the scheduler job fails.

Note:

On April 4, 2022, the sync_option and grant_read parameters were added to the DBMS_DCAT.RUN_SYNC procedure. To ensure correct performance of scheduled sync jobs created prior to that date, you need to drop and recreate the scheduled sync jobs. See DROP_SYNC_JOB Procedure and CREATE_SYNC_JOB Procedure.
Syntax:
PROCEDURE DBMS_DCAT.CREATE_SYNC_JOB (
    synced_objects   IN CLOB,
    error_semantics  IN VARCHAR2 DEFAULT 'SKIP_ERRORS',
    log_level        IN VARCHAR2 DEFAULT 'INFO',
    repeat_interval  IN VARCHAR2,
    force            IN VARCHAR2 DEFAULT 'FALSE',
    grant_read       IN VARCHAR2 DEFAULT NULL,
    sync_option      IN VARCHAR2 DEFAULT 'SYNC',
    dcat_con_id      IN VARCHAR2 DEFAULT NULL
);
Parameters:
Parameter Description

synced_objects

A JSON object specifying the objects to be synced, as described in the RUN_SYNC procedure. See RUN_SYNC Procedure.

error_semantics

(Optional) Error behavior, as specified for RUN_SYNC. Default is SKIP_ERRORS.

log_level

(Optional) Logging level, as specified for RUN_SYNC. Default is INFO.

repeat_interval

Repeat interval for the job, with the same semantics as the repeat interval parameter of the DBMS_SCHEDULER.CREATE_JOB procedure. For details on the repeat_interval, see Overview of Creating Jobs.

force

(Optional) If TRUE, existing sync jobs are deleted first. If FALSE, the CREATE_SYNC_JOB procedure fails if a sync job already exists. Default is FALSE.

grant_read (Optional) List of users/roles to be granted READ on the synced external tables, as described for procedure RUN_SYNC. See RUN_SYNC Procedure.
sync_option (Optional) Behavior with respect to entities that have already been synced through a previous RUN_SYNC operation, as described for procedure RUN_SYNC. See RUN_SYNC Procedure.
dcat_con_id This parameter is the unique Data Catalog connection identifier that was specified when the connection to Data Catalog was created. See SET_DATA_CATALOG_CONN Procedure. This parameter identifies which connection is used for synchronization and becomes a part of the derived schema name. See Synchronization Mapping for a description of how the schema name is derived. The parameter default is NULL.
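For example, the following creates a job that syncs all OCI Data Catalog data assets every 12 hours, replacing any existing sync job (the repeat interval is illustrative):
BEGIN
  DBMS_DCAT.CREATE_SYNC_JOB(
    synced_objects  => '{"asset_list":["*"]}',
    repeat_interval => 'FREQ=HOURLY;INTERVAL=12',
    force           => 'TRUE');
END;
/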

DROP_SYNC_JOB Procedure

This procedure drops an existing sync job for the given unique connection identifier.

Syntax:
PROCEDURE DBMS_DCAT.DROP_SYNC_JOB (
    dcat_con_id IN VARCHAR2 DEFAULT NULL
  );
Parameters:
Parameter Description
dcat_con_id The unique Data Catalog connection identifier. The default is NULL.
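For example, to drop the sync job for the default connection:
EXEC DBMS_DCAT.DROP_SYNC_JOB;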

DROP_SYNCED_SCHEMAS Procedure

This procedure drops all previously synchronized schemas for the given unique connection identifier.

Syntax:
PROCEDURE DBMS_DCAT.DROP_SYNCED_SCHEMAS (
    dcat_con_id IN VARCHAR2 DEFAULT NULL
  );
Parameters:
Parameter Description
dcat_con_id The unique Data Catalog connection identifier. The default is NULL.
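For example, to drop the synced schemas for a specific connection (the identifier is illustrative):
EXEC DBMS_DCAT.DROP_SYNCED_SCHEMAS(dcat_con_id => 'OCI_CONN1');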