Summary of DBMS_DCAT Subprograms

This table lists the DBMS_DCAT package procedures used to create, query and drop Data Catalog connections.

Subprogram Description
SET_DATA_CATALOG_CREDENTIAL Procedure Set the data catalog access credential used by a specific connection to the data catalog
SET_OBJECT_STORE_CREDENTIAL Procedure Set the credential used by the given unique connection identifier for accessing the Object Store
SET_DATA_CATALOG_CONN Procedure Create a connection to the given data catalog
UNSET_DATA_CATALOG_CONN Procedure Remove an existing Data Catalog connection

SET_DATA_CATALOG_CREDENTIAL Procedure

This procedure sets the Data Catalog access credential used by a specific connection to the Data Catalog.

Syntax:
PROCEDURE DBMS_DCAT.SET_DATA_CATALOG_CREDENTIAL(
    credential_name VARCHAR2(128) DEFAULT NULL,
    dcat_con_id     VARCHAR2 DEFAULT NULL
  );
Parameters:
Parameter Description

credential_name

The credential used for accessing the Data Catalog.

dcat_con_id The unique Data Catalog connection identifier. This credential is used for the connection identified by dcat_con_id. The default is NULL.

This credential must have Manage Data Catalog permissions; see Data Catalog Policies.
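For example, assuming a credential has already been created with DBMS_CLOUD.CREATE_CREDENTIAL, the following call sets it as the Data Catalog access credential (the credential and connection names here are illustrative):
BEGIN
  DBMS_DCAT.SET_DATA_CATALOG_CREDENTIAL(
    credential_name => 'DCAT_CRED',
    dcat_con_id     => 'OCI_CONN1');
END;
/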

SET_OBJECT_STORE_CREDENTIAL Procedure

This procedure sets the credential that is used by the given unique connection identifier for accessing the Object Store. Changing the Object Store access credential alters all existing synced tables to use the new credential.

Syntax:
PROCEDURE DBMS_DCAT.SET_OBJECT_STORE_CREDENTIAL(
    credential_name  VARCHAR2(128),
    dcat_con_id      IN VARCHAR2 DEFAULT NULL
  );
Parameters:
Parameter Description

credential_name

The credential used by the external tables for accessing the Object Store.

dcat_con_id The unique Data Catalog connection identifier. The default is NULL.
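For example, the following call switches the external tables synced over the default connection to a new Object Store credential (the credential name is illustrative and must already exist):
BEGIN
  DBMS_DCAT.SET_OBJECT_STORE_CREDENTIAL(
    credential_name => 'OBJ_STORE_CRED');
END;
/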

SET_DATA_CATALOG_CONN Procedure

This procedure creates a connection to the given Data Catalog. The connection is required to synchronize metadata with Data Catalog. An Autonomous AI Database instance can connect to multiple Data Catalog instances and supports connecting to OCI Data Catalogs and AWS Glue Data Catalogs.

Syntax:
PROCEDURE DBMS_DCAT.SET_DATA_CATALOG_CONN (
   region         VARCHAR2 DEFAULT NULL,
   endpoint       VARCHAR2 DEFAULT NULL,
   catalog_id     VARCHAR2 DEFAULT NULL,
   dcat_con_id    VARCHAR2 DEFAULT NULL,
   catalog_type   VARCHAR2 DEFAULT NULL,
   token_endpoint VARCHAR2 DEFAULT NULL
 );
Parameters:
Parameter Description

region

The Data Catalog region. If the endpoint is specified, region is optional. If both endpoint and region are given, then the endpoint takes precedence. Default is NULL.

endpoint

The Data Catalog endpoint. If the region is specified, endpoint is optional. If both endpoint and region are given, then the endpoint takes precedence. Default is NULL.

catalog_id

The unique Oracle Cloud Identifier (OCID) for the Data Catalog instance. When connecting to AWS Glue Data Catalogs, catalog_id is optional.

dcat_con_id A unique Data Catalog connection identifier. This identifier is required when connecting to multiple Data Catalogs and is optional when connecting to only one. It is used to refer to the Data Catalog connection in subsequent calls or when querying views. If no identifier is specified, this procedure generates a NULL connection identifier. The following restrictions apply for dcat_con_id:
  • It must be unique within the Autonomous AI Database instance.
  • It must start with a letter.
  • It may contain alphanumeric characters, underscores (_), dollar signs ($), and pound signs (#).
  • Its maximum length is 128 characters.
catalog_type The type of data catalog to connect to. Allowed values:
  • OCI_DCAT - OCI Data Catalog
  • AWS_GLUE - AWS Glue Data Catalog
  • NULL - The catalog type is automatically detected from the provided region or endpoint.
token_endpoint The access token endpoint for OAuth authentication. Default is NULL.

Usage:

You only need to call this procedure once to set the connection. As part of the connection process, Autonomous AI Database adds custom properties to Data Catalog. These custom properties are accessible to Data Catalog users and allow you to override default names (for schemas, tables and columns) and column data types.

Before creating a connection, credentials must be created and set. For a description of the connection process, see Typical Workflow with Data Catalog for OCI Data Catalogs and User Workflow for Querying with AWS Glue Data Catalog for AWS Glue Data Catalogs.

Example: Connecting with a known OCID:

In this example, Autonomous AI Database is connecting to a Data Catalog in the uk-london-1 region. The catalog_id parameter uses the Oracle Cloud Identifier (OCID) for the Data Catalog instance. Because catalog_type is not specified, the type of Data Catalog (OCI Data Catalog or AWS Glue Data Catalog) is determined automatically.
BEGIN
  DBMS_DCAT.SET_DATA_CATALOG_CONN(
    region=>'uk-london-1',
    catalog_id=>'ocid1.datacatalog.oc1.uk-london-1...');
END;
/

Example: Connecting to an AWS Glue Data Catalog

A connection is the association between an Autonomous AI Database instance and an AWS Glue Data Catalog. After a successful connection, the Autonomous AI Database instance can synchronize with AWS Glue. Each AWS account has one AWS Glue Data Catalog per region, and each catalog is accessed through the service endpoint for its region. An Autonomous AI Database instance is associated with an AWS Glue Data Catalog by invoking DBMS_DCAT.SET_DATA_CATALOG_CONN and specifying the region, or the endpoint, for the region where the catalog resides.

See AWS Glue endpoints and quotas.

In this example, Autonomous AI Database is connecting to an AWS Glue Data Catalog in the us-west-2 region. Because this is an AWS Glue Data Catalog connection, the catalog_id parameter is not needed.
BEGIN
  DBMS_DCAT.SET_DATA_CATALOG_CONN(
    region=>'us-west-2',
    catalog_type=>'AWS_GLUE');
END;
/

UNSET_DATA_CATALOG_CONN Procedure

This procedure removes an existing Data Catalog connection.

Note:

Invoking this procedure drops all of the protected schemas and external tables that were created as part of previous synchronizations. It does not impact the metadata in Data Catalog.

Syntax:
PROCEDURE DBMS_DCAT.UNSET_DATA_CATALOG_CONN (
    dcat_con_id IN VARCHAR2 DEFAULT NULL
);
Parameters:
Parameter Description
dcat_con_id The unique Data Catalog connection identifier. The default is NULL.
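For example, when only a single connection exists, it can be removed without specifying an identifier; with multiple connections, pass the identifier (the name here is illustrative):
BEGIN
  DBMS_DCAT.UNSET_DATA_CATALOG_CONN(dcat_con_id => 'OCI_CONN1');
END;
/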

Synchronization Subprograms

Running a synchronization, creating and dropping a synchronization job, and dropping synchronized schemas can be performed with the procedures listed in this table.

Note:

On April 4, 2022, the sync_option and grant_read parameters were added to the DBMS_DCAT.RUN_SYNC procedure. To ensure correct performance of scheduled sync jobs created prior to that date, you need to drop and recreate the scheduled sync jobs. See DROP_SYNC_JOB Procedure and CREATE_SYNC_JOB Procedure.
Subprogram Description
CREATE_SYNC_JOB Procedure Create a scheduler job to invoke RUN_SYNC periodically
DROP_SYNC_JOB Procedure Drop an existing sync job for the given unique connection identifier
DROP_SYNCED_SCHEMAS Procedure Drop all previously synchronized schemas for the given unique connection identifier
RUN_SYNC Procedure Run a synchronization operation

RUN_SYNC Procedure

This procedure runs a synchronization operation and is the entry point to the synchronization. As an input, it takes lists of selected data catalog assets, folders and entities and materializes them by creating, dropping, and altering external tables.

The sync_option parameter specifies which operation the RUN_SYNC procedure performs: SYNC, DELETE or REPLACE. The operation is performed over entities within the scope of the synced_objects parameter.

Every call to the RUN_SYNC procedure returns a unique operation_id that can be used to query the USER_LOAD_OPERATIONS view to obtain information about the status of the sync and the corresponding log_table. The DBMS_DCAT$SYNC_LOG view can be queried for easy access to the log_table for the last sync operation executed by the current user. For further details, see DBMS_DCAT$SYNC_LOG View.

Note:

On April 4, 2022, the sync_option and grant_read parameters were added to the DBMS_DCAT.RUN_SYNC procedure. To ensure correct performance of scheduled sync jobs created prior to that date, you need to drop and recreate the scheduled sync jobs. See DROP_SYNC_JOB Procedure and CREATE_SYNC_JOB Procedure.

Synchronizing Partitioned Logical Entities or Glue Tables

The RUN_SYNC procedure creates a partitioned external table for each logical entity or Glue table when all three of the following apply:
  1. The OCI data catalog logical entity or Glue table has one or more partitioned attributes.
  2. For OCI data catalogs, the logical entity is derived from a prefix-based filename pattern. Partitioned logical entities derived from regex-based patterns are not supported.
  3. For OCI data catalogs, the logical entity is based on partitioned data that follows the hive-style or non-hive folder format. Logical entities based on partitioned data that follow the non-hive style format using object names are not supported.
    • Example 1. Logical entities based on harvested objects that follow the Hive style partitioning format with prefix-based filename patterns.

      Consider the following objects:
      Bucket: MYBUCKET
      cluster1/db1.db/sales/country=USA/year=2020/month=01/sales1.csv
      cluster1/db1.db/sales/country=USA/year=2020/month=01/sales2.csv
      cluster1/db1.db/sales/country=USA/year=2020/month=02/sales1.csv

      Harvesting the bucket using a filename pattern with a starting folder prefix of cluster1/db1.db generates a logical entity named SALES with three partition attributes: country, year, and month. The type for partitioned attributes is Partition while the type for non-partitioned attributes is Primitive.

    • Example 2. Logical entities based on harvested objects that follow the non-Hive style partitioning format with prefix-based filename patterns.
      Consider the following objects:
      Bucket: MYBUCKET
      cluster2/db2.db/sales/USA/2020/01/sales1.csv
      cluster2/db2.db/sales/USA/2020/01/sales2.csv
      cluster2/db2.db/sales/USA/2020/02/sales1.csv

      Harvesting the bucket using a filename pattern with a starting folder prefix of cluster2/db2.db generates a logical entity named SALES with three partition attributes: name0, name1, and name2. The only difference from Example 1 is that the names of the partitioned attributes are auto-generated, while in Example 1 they are extracted from the URL (country, year, and month, respectively).

For a complete end to end example of synchronizing partitioned logical entities, see Example: A Partitioned Data Scenario.
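Once synced, the partitioned external table from Example 1 can be queried like any partitioned table, and predicates on the partition columns enable partition pruning. A minimal sketch, assuming the table was created as SALES in a derived schema (the schema name below is illustrative; see Synchronization Mapping for the actual naming):
SELECT country, year, month, COUNT(*) AS num_rows
  FROM dcat$my_asset_mybucket.sales
 WHERE country = 'USA'
   AND year    = '2020'
 GROUP BY country, year, month;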

Incremental Synchronization of Partitioned Logical Entities/Glue Tables

Every call to the RUN_SYNC procedure specifies a set of OCI data catalog logical entities or AWS Glue tables to be synced with the database. When a logical entity or Glue table is present in two RUN_SYNC calls, the second call preserves and possibly alters existing external tables. The following table shows which logical entity or Glue table changes are supported when the logical entity or Glue table is partitioned:
Logical Entity or Glue Table Change Action

Addition, removal, or update of a partition

All partitions of the external partitioned table are updated, regardless of whether a change has been detected by the data catalog.

Addition of a partitioned attribute

Adding a partitioned column to an external partitioned table is not supported. An exception is raised.

Deletion of a partition attribute

Dropping a partitioned column from an external partitioned table is not supported. An exception is raised.

Renaming of a partitioned attribute

Renaming a partitioned column in an external partitioned table is not supported. An exception is raised.

Syntax:
PROCEDURE DBMS_DCAT.RUN_SYNC (
    synced_objects   IN  CLOB,
    sync_option      IN VARCHAR2 DEFAULT 'SYNC',
    error_semantics  IN VARCHAR2 DEFAULT 'SKIP_ERRORS',
    log_level        IN VARCHAR2 DEFAULT 'INFO',
    grant_read       IN VARCHAR2 DEFAULT NULL,
    dcat_con_id      IN VARCHAR2 DEFAULT NULL
 );
PROCEDURE DBMS_DCAT.RUN_SYNC (
    synced_objects   IN  CLOB,
    sync_option      IN VARCHAR2 DEFAULT 'SYNC',
    error_semantics  IN VARCHAR2 DEFAULT 'SKIP_ERRORS',
    log_level        IN VARCHAR2 DEFAULT 'INFO',
    grant_read       IN VARCHAR2 DEFAULT NULL,
    operation_id     OUT NOCOPY NUMBER,
    dcat_con_id      IN VARCHAR2 DEFAULT NULL
 );
Parameters:
Parameter Description

synced_objects

This parameter is a JSON document that specifies the data catalog objects to synchronize.

For OCI Data Catalogs, the JSON document specifies a set of entities at multiple levels of granularity: data assets, folders (Object Store buckets), or logical entities. It contains an asset_list that is either an array of asset objects or an array containing a single "*" string, which means 'sync all (Object Store) data assets in the catalog'.

For AWS Glue Data Catalogs, the JSON document specifies the tables to sync at two levels of granularity: databases and individual tables. The document specifies a list of databases; users can restrict the set of tables to be synced by specifying individual tables within a database.

sync_option (Optional) There are three options:
  • SYNC (Default) - This option ensures that what is in the data catalog, over the synced_objects scope, is represented in the Autonomous AI Database. If a logical entity or Glue table was deleted from the data catalog since the last sync operation, then it is deleted in the Autonomous AI Database. The following operations are performed over the synced_objects scope:
    • Adds tables for new data catalog entities
    • Removes tables for deleted data catalog entities
    • Updates properties (such as name, columns and data types) for existing tables
  • DELETE - Deletes tables within the synced_objects scope.
  • REPLACE - Replaces all currently synced objects with the objects within the synced_objects scope.

error_semantics

(Optional) This parameter specifies the error behavior. If set to SKIP_ERRORS, the sync attempts to continue despite errors encountered for individual entities. If set to STOP_ON_ERROR, the procedure fails on the first encountered error. The default is SKIP_ERRORS.

log_level

(Optional) This parameter specifies the logging level. The allowed values, in increasing level of detail, are OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, and ALL. The default is INFO.
grant_read (Optional) This parameter is a list of users/roles that are automatically granted READ privileges on all external tables processed by this invocation of RUN_SYNC. All users/roles in the grant_read list are given READ privileges on all new or already existing external tables that correspond to entities specified by the synced_objects parameter. The RUN_SYNC procedure preserves already granted privileges on synced external tables.

operation_id

(Optional) This parameter is used to find the corresponding entry in USER_LOAD_OPERATIONS for the sync and determine the name of the log table.

Note: A version of RUN_SYNC that does not return an operation_id is available so users can query USER_LOAD_OPERATIONS for the latest sync.

dcat_con_id This parameter is the unique data catalog connection identifier that was specified when the connection to the data catalog was created. See SET_DATA_CATALOG_CONN Procedure. This parameter identifies which connection is used for synchronization and becomes a part of the derived schema name. See Synchronization Mapping for a description of how the schema name is derived. The parameter default is NULL.

Example: Synchronize all OCI Data Catalog entities

In the following example, all Data Catalog entities are synchronized.
EXEC DBMS_DCAT.RUN_SYNC(synced_objects=>'{"asset_list":["*"]}');

Example: synced_objects parameter for synchronizing all OCI Data Catalog data assets

The following is an example synced_objects parameter for synchronizing all (Object Storage) data assets in the Data Catalog.
{"asset_list" : ["*"]}

Example: synced_objects parameter for synchronizing specific OCI Data Catalog data assets

The following is an example synced_objects parameter for synchronizing two data assets.
{"asset_list": [
        {
            "asset_id":"...-9e2e7f78fd5f"
        },
        {
            "asset_id":"...-9e2e7f74523"
        }
    ]}

Example: synced_objects parameter for synchronizing specific OCI Data Catalog entities within a data asset

The following shows an example synced_objects parameter for synchronizing two entities within the data asset.
{"asset_list": [
        {
            "asset_id":"...-9e2e7f78fd5f",
            "folder_list":[
                "f1",         
                "f2" 
           ]
        }   
    ]}

Example: synced_objects parameter for synchronizing specific OCI Data Catalog folders and entities within a data asset

The following shows an example synced_objects parameter for synchronizing two folders and two entities within the data asset.
{"asset_list":[
        {
            "asset_id":"...-9e2e7f78fd5f",
            "entity_list": [
                "entity1",     
                "entity2"
            ],
             "folder_list": [
                "f1",         
                "f2"
            ]
        }   
    ]}

Example: synced_objects Parameter for Synchronizing All AWS Glue Data Catalog Databases

The following shows an example synced_objects parameter for synchronizing all databases in the AWS Glue Data Catalog.
{"database_list":["*"]}

Example: synced_objects Parameter for Synchronizing Two AWS Glue Data Catalog Databases

The following shows an example synced_objects parameter for synchronizing two AWS Glue Data Catalog databases.
{"database_list":[
    {"database":"tpcdscsv"},
    {"database":"tpcdsparquet"} ]}

Example: synced_objects Parameter for Synchronizing Three Tables from an AWS Glue Data Catalog Database

The following shows an example synced_objects parameter for synchronizing three tables from an AWS Glue Data Catalog database.
{"database_list":[
 {"database":"tpcdsparquet",
     "table_list": [ "tpcdsparquet_customer",
                     "tpcdsparquet_item",
                     "tpcdsparquet_web_sales" ] } ]}

CREATE_SYNC_JOB Procedure

This procedure creates a scheduler job that invokes RUN_SYNC periodically.

It takes as input the set of objects to be synced, the error semantics, the log level, and a repeat interval. See RUN_SYNC Procedure for further details on how synchronization works.

There can only be a single sync job. The CREATE_SYNC_JOB procedure fails if another job is already specified, unless the force parameter is set to TRUE. If force is set to TRUE the previous job is dropped.

If a scheduler job attempts to run while another sync is in progress, the scheduler job fails.

Note:

On April 4, 2022, the sync_option and grant_read parameters were added to the DBMS_DCAT.RUN_SYNC procedure. To ensure correct performance of scheduled sync jobs created prior to that date, you need to drop and recreate the scheduled sync jobs. See DROP_SYNC_JOB Procedure and CREATE_SYNC_JOB Procedure.
Syntax:
PROCEDURE DBMS_DCAT.CREATE_SYNC_JOB (
    synced_objects   IN CLOB,
    error_semantics  IN VARCHAR2 DEFAULT 'SKIP_ERRORS',
    log_level        IN VARCHAR2 DEFAULT 'INFO',
    repeat_interval  IN VARCHAR2,
    force            IN VARCHAR2 DEFAULT 'FALSE',
    grant_read       IN VARCHAR2 DEFAULT NULL,
    sync_option      IN VARCHAR2 DEFAULT 'SYNC',
    dcat_con_id      IN VARCHAR2 DEFAULT NULL
);
Parameters:
Parameter Description

synced_objects

A JSON object specifying the objects to be synced, as described in the RUN_SYNC procedure. See RUN_SYNC Procedure.

error_semantics

(Optional) Error behavior, as specified for RUN_SYNC. Default is SKIP_ERRORS.

log_level

(Optional) Logging level, as specified for RUN_SYNC. Default is INFO.

repeat_interval

Repeat interval for the job, with the same semantics as the repeat interval parameter of the DBMS_SCHEDULER.CREATE_JOB procedure. For details on the repeat_interval, see Overview of Creating Jobs.

force

(Optional) If TRUE, existing sync jobs are deleted first. If FALSE, the CREATE_SYNC_JOB procedure fails if a sync job already exists. Default is FALSE.

grant_read (Optional) List of users/roles to be granted READ on the synced external tables, as described for procedure RUN_SYNC. See RUN_SYNC Procedure.
sync_option (Optional) Behavior with respect to entities that have already been synced through a previous RUN_SYNC operation, as described for procedure RUN_SYNC. See RUN_SYNC Procedure.
dcat_con_id This parameter is the unique Data Catalog connection identifier that was specified when the connection to Data Catalog was created. See SET_DATA_CATALOG_CONN Procedure. This parameter identifies which connection is used for synchronization and becomes a part of the derived schema name. See Synchronization Mapping for a description of how the schema name is derived. The parameter default is NULL.
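For example, the following creates a job that syncs all OCI Data Catalog data assets every 12 hours, replacing any existing sync job (the repeat interval is illustrative):
BEGIN
  DBMS_DCAT.CREATE_SYNC_JOB(
    synced_objects  => '{"asset_list":["*"]}',
    repeat_interval => 'FREQ=HOURLY;INTERVAL=12',
    force           => 'TRUE');
END;
/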

DROP_SYNC_JOB Procedure

This procedure drops an existing sync job for the given unique connection identifier.

Syntax:
PROCEDURE DBMS_DCAT.DROP_SYNC_JOB (
    dcat_con_id IN VARCHAR2 DEFAULT NULL
  );
Parameters:
Parameter Description
dcat_con_id The unique Data Catalog connection identifier. The default is NULL.
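For example, to drop the sync job for the default connection:
EXEC DBMS_DCAT.DROP_SYNC_JOB;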

DROP_SYNCED_SCHEMAS Procedure

This procedure drops all previously synchronized schemas for the given unique connection identifier.

Syntax:
PROCEDURE DBMS_DCAT.DROP_SYNCED_SCHEMAS (
    dcat_con_id IN VARCHAR2 DEFAULT NULL
  );
Parameters:
Parameter Description
dcat_con_id The unique Data Catalog connection identifier. The default is NULL.
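For example, to drop the synced schemas for a specific connection (the identifier is illustrative):
EXEC DBMS_DCAT.DROP_SYNCED_SCHEMAS(dcat_con_id => 'OCI_CONN1');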