5.6 Create an Apache Iceberg Connection

Apache Iceberg is an open standard table format that is optimized to manage large analytic datasets. Data Transforms supports the use of Apache Iceberg as a target to load data from any SQL-based data sources.

Data Transforms supports Oracle Object Storage (S3 compatibility) and AWS S3 storage services to store the Parquet files for the Apache Iceberg tables. Note that you use an Apache Iceberg connection only as a target for data loads.

The newly created connections are displayed in the Connections page. See View and Manage Connections.

Creating an Apache Iceberg Connection using Generic REST Catalog

The Data Transforms Apache Iceberg Connector requires that a REST Catalog already exists. This REST Catalog is setup based on Apache Gravitino (external link) with Iceberg Open API specification.

Note:

Data Transforms supports the use of Apache Gravitino version 0.7.0-incubating or lower to bring up the REST service.

You can configure an Apache Iceberg connection with the Generics REST Catalog by providing the REST URL and authentication details such as the username and password. You can also use the more secure OAuth authentication to create the connection.

To create an Apache Iceberg connection:

  1. From the left pane of the Home page, click the Connections tab.

    Connections page appears.

  2. Click Create Connection.

    Create Connection page slides in.

  3. Do one of the following:
    • In the Select Type field, enter the name or part of the name of the connection type.
    • Select the Databases tab.
  4. Select Apache Iceberg as the connection type.
  5. Click Next.
  6. The Connection Name field is pre-populated with a default name. You can edit this value.
  7. Select Generic REST Catalog from the Catalog Provider drop-down list.
  8. In the Catalog Name text box, enter a name.
  9. In the Rest URL text box, enter the URL of the REST server. Enter the value in the <host>:<port>/<ServiceName> format.
  10. From the Authentication drop-down section, do one of the following:
    • Select None.
    • Select Simple and enter the Rest User and Rest Password.
    • Select OAuth and enter the following details:
      • Warehouse Location: The location where you want to store the data. For example, s3://my-bucket/my/table/location
      • Token URI: The URL to obtain the OAuth Token in the format http://<host>:<port>
      • Token Path: The path to the OAuth token. For example, /auth/token.
      • Client ID: The OAuth Client ID.
      • Client Secret: The OAuth Client secret.
      • [Optional] Auth Scope: The permissions granted to a client when accessing the Polaris server. For example, a test Auth Scope value might indicate that the client is authorized to access resources related to the "test" scope within Gravitino.
      • [Optional] Grant Type: The method that the authorization server should use to issue the access token. For example, client_credentials and authorization_code.
  11. In Storage Settings, specify the region and the S3-compatible authentication keys to connect to the storage service. Data Transforms supports both Oracle Object Storage (S3 compatibility) and AWS S3 storage services.
  12. Expand Advanced Options to specify the Batch Update Size, which is the number of rows (records written) in a single Oracle Data Transforms INSERT command. You may want to set this value to improve performance during data loads. The default value is 5000.
  13. Click Test Connection, to test the established connection.
  14. After providing all the required connection details, click Create.

Creating an Apache Iceberg Connection using Snowflake Open Catalog

To create an Apache Iceberg connection you can also use the Snowflake Open Catalog, which is a managed service for Apache Polaris™ (incubating). OAuth is the default authentication mode that is used to create the connection. For Apache Iceberg connections created using the Snowflake Open Catalog, Data Transforms uses the AWS S3 storage services to store the parquet files for the Apache Iceberg tables.

To create an Apache Iceberg connection:

  1. From the left pane of the Home page, click the Connections tab.

    Connections page appears.

  2. Click Create Connection.

    Create Connection page slides in.

  3. Do one of the following:
    • In the Select Type field, enter the name or part of the name of the connection type.
    • Select the Databases tab.
  4. Select Apache Iceberg as the connection type.
  5. Click Next.
  6. The Connection Name field is pre-populated with a default name. You can edit this value.
  7. Select Snowflake Open Catalog from the Catalog Provider drop-down list.
  8. In the Catalog Name text box, enter a name.
  9. In the Rest URL text box, enter the URL of the REST server. Enter the value in the <host>:<port>/<ServiceName>/iceberg format.
  10. OAuth is the default Authentication more. Enter the following:
    • Token URI: The URL to obtain the OAuth Token in the format http://<host>:<port>
    • Token Path: The path to the OAuth token. For example, /polaris/api/catalog/v1/oauth/tokens.
    • Client ID: The OAuth Client ID.
    • Client Secret: The OAuth Client secret.
    • [Optional] Auth Scope: The roles granted to a client when accessing the Polaris server. For example, a PRINCIPAL_ROLE:ALL Auth Scope value allows the OAuth token to assume all catalog roles granted to the authenticated principal/user in Polaris/Open Catalog.
    • [Optional] Grant Type: The method that the authorization server should use to issue the access token. For example, client_credentials and authorization_code.
  11. In Storage Settings, specify the region and the S3-compatible authentication keys to connect to the storage service. Data Transforms supports AWS S3 storage service for Snowflake Open Catalog connections.
  12. Expand Advanced Options to specify the Batch Update Size, which is the number of rows (records written) in a single Oracle Data Transforms INSERT command. You may want to set this value to improve performance during data loads. The default value is 5000.
  13. Click Test Connection, to test the established connection.
  14. After providing all the required connection details, click Create.

Creating an Apache Iceberg Connection using Oracle AI Data Catalog

You can use Data Transforms to write data to Oracle AI Data Catalog (AICAT). To create an Apache Iceberg connection using AICAT, you need to specify the service URL and credentials required to connect to it.

To create an Apache Iceberg connection:

  1. From the left pane of the Home page, click the Connections tab.

    Connections page appears.

  2. Click Create Connection.

    Create Connection page slides in.

  3. Do one of the following:
    • In the Select Type field, enter the name or part of the name of the connection type.
    • Select the Databases tab.
  4. Select Apache Iceberg as the connection type.
  5. Click Next.
  6. The Connection Name field is pre-populated with a default name. You can edit this value.
  7. Select Oracle AI Data Catalog from the Catalog Provider drop-down list.
  8. In the Catalog Name text box, enter a name.
  9. In the Oracle AI Data Catalog URL text box, enter the URL of the AICAT service.
  10. Enter the Database User and Password to connect to the AICAT instance.
  11. In Storage Settings, specify the region and the S3-compatible authentication keys to connect to the storage service.
  12. Expand Advanced Options to specify the Batch Update Size, which is the number of rows (records written) in a single Oracle Data Transforms INSERT command. You may want to set this value to improve performance during data loads. The default value is 5000.
  13. Click Test Connection, to test the established connection.
  14. After providing all the required connection details, click Create.

Creating and Running an Apache Iceberg Data Load

You can create a data load for any SQL-based source data source, such as Oracle, to load data into Apache Iceberg target tables. To use Apache Iceberg as a target data source, you need to provide the name of the connection and the namespace. A namespace in Apache Iceberg is similar to schema in relational databases.

After you create the data load, all the tables in the source schema are listed on the Data Load Detail page along with options to incrementally load, append, and merge the data for each of the selected source tables. When the data load run completes, you can read the data from the Iceberg tables. You can add the data load as a step in a workflow and create a schedule to run the workflows at a predefined time interval. See Create a New Workflow.

To create and run an Apache Iceberg Data Load:

  1. Do one of the following:
    • On the Home page, click Load Data. The Create Data Load wizard appears.

      In the Create Data Load tab, enter a name if you want to replace the default value, add a description, and select a project from the drop-down.

    • On the Home page, click Projects, and then the required project tile. In the left pane, click Data Loads, and then click Create Data Load. The Create Data Load wizard appears.
  2. Enter a name if you want to replace the default value and add a description.
  3. For Load Processing do one of the following:
    • Select the Internal radio button and from the Deployment Type drop-down select Data Transforms (Batch).
    • Select the Delegate radio button and from the Deployment Type drop-down select OCI GoldenGate. From the GoldenGate Deployment Connection select a connection.
  4. Click Next.
  5. In the Source Connection tab,
    1. From the Connection Type drop-down, select a SQL-based data source.
    2. From the Connection drop-down, select the required connection from which you wish to add the data entities.
    3. Click Next.
  6. In the Target Connection tab,
    1. From the Connection Type drop-down, select Apache Iceberg as the connection type.
    2. From the Connection drop-down, select the connection you want to load the data into.
    3. Specify the Namespace. You can either select from existing namespaces or create a new namespace.
    4. Click Save.

    The Data Load Detail page appears listing all the source tables.

  7. Select the required tables to load and the corresponding data load operation. The data load options you can use are Incremental Merge, Incremental Append, Append, and Do Not Load.
  8. Click save icon to save the changes. A green checkmark (green checkmark icon) in the row indicates that the changes are saved.
  9. Click execute iconto run the data load.

    A confirmation prompt appears when the data load starts successfully.

To check the status of the data load, see the Data Load Status panel on the right below the Target Schema details. For details about the panel, see Monitor Status of Data Loads, Data Flows, and Workflows. This panel shows links to the jobs that execute to run this data load. Click the link to monitor the progress on the Job Details page. For more information about jobs, see Create and Manage Jobs.

All the loaded tables along with their details are listed in the Data Entities page. To view the statistics of the data entities, click the Actions icon (Actions icon) next to the data entity, click Preview, and then select the Statistics tab. See View Statistics of Data Entities for information.

See View and Manage Data Loads for information about how to edit, export, and delete data loads.

Parquet File Clustering and Compaction

Apache Iceberg stores data in the tables in the form of Parquet files. Any change or increment to the table is stored as a new Parquet file. Each file has its own metadata with a file size of around 500 bytes in size. With multiple files having only minimum data, the metadata accounts for more storage than the actual data. This redundancy can cause increased storage costs as well as degraded query performance.

Data Transforms includes support for the Iceberg clustering and compaction features that you can use to optimize storage and performance. You can use the Iceberg Clustering and Compaction option as a step in the Workflow editor for compaction and clustering:
  • The compaction features helps combine the Parquet files into a single file of about 1 GB file size. This helps improve storage efficiency.
  • Clustering sorts and groups similar data into a single file. This helps improve query performance. For example, if you have data that is captured over a year, you can cluster data for each month and store it in separate files in the target table. You can then query Iceberg to retrieve data for a particular month.
To use Iceberg clustering and compaction in Data Transforms:
  1. On the Home page, click the required Project title and then click Workflows in the left pane.
  2. On the Workflow page, click Create Workflow.

    The Create Workflow page appears.

  3. In the Name field provide a name for the new workflow and click Create. The Workflow editor opens.
  4. From the left panel select the Apache Iceberg data load that has the target table you want to run compaction and clustering on and drag it onto the design canvas.
  5. Drag and drop Iceberg Clustering and Compaction on the canvas.
  6. Select the data load step and click the Connector icon (Connector icon) next to it to connect it with the Iceberg Clustering and Compaction step.
  7. Double click the Iceberg Clustering and Compaction step to open the step properties page.
  8. From the Connection drop-down select the Iceberg Apache connection you have selected as the target to load the data to.
  9. Specify the Namespace. You can either select from existing namespaces or create a new namespace.
  10. Click Select Tables. The list of tables that are part of the selected namespace are listed in the resulting dialog.
  11. Select the table and the required target action from the drop-down or select multiple tables and select the icon to apply the action to the selection. Choose any of the following actions:
    • CLUSTERING: Sorts and clusters the entire table based on the clustering column and maintains file size according to the largest file present in the table.

      To do this, select the table row, select CLUSTERING as the target action and then select the column you want the sorting to be based on from Clustering Column.

    • COMPACTION: Compacts the entire table into one file of 1GB file size.
    • CLUSTERING and COMPACTION: Clusters the entire table based on clustering column and compacts the files to one file of 1GB file size.
    • INCREMENTAL CLUSTERING: Performs clustering on new data.
    • INCREMENTAL COMPACTION: Compacts the table only for new data. Data that was previously compacted is not considered.
    • INCREMENTAL COMPACTION AND CLUSTERING: Performs compaction and clustering for new data.
  12. Click OK to save the changes and close the dialog.
  13. Click save icon to save the created/designed workflow.
  14. Click execute icon to execute the created workflow.
  15. [Optional] To check the status of the workflow, see the Status panel on the right below the Properties Panel. For details about the Status panel, see Monitor Status of Data Loads, Data Flows, and Workflows. This panel shows the link to the Job ID that you can click to monitor the execution status on the Jobs page. For more information about jobs see Create and Manage Jobs.
  16. [Optional] You can set a schedule to run the workflow. See Schedule Data Flows or Workflows for instructions about creating a schedule.

Collect and Publish Iceberg Table Statistics to External Tables

You can load data into an Iceberg table using data load, collect table and column statistics for the Iceberg table, and publish those statistics to the corresponding Oracle external table. You can use the Iceberg Stats option as a step or as an independent workflow node in the Workflow editor for collecting and publishing Iceberg stats.

You need to create an external table that you want to use to collect stats. To create external tables pointing to Iceberg, you first need to create a credential file, and then use this credential file to create the external table. For instructions on creating the credential file and the external table see CREDENTIAL and CREATE_EXTERNAL_TABLE Procedure. Currently, you can run stats for a single database.

To collect and publish Iceberg statistics using in Data Transforms:
  1. On the Home page, click the required Project title and then click Workflows in the left pane.
  2. On the Workflow page, click Create Workflow.

    The Create Workflow page appears.

  3. In the Name field provide a name for the new workflow and click Create. The Workflow editor opens.
  4. Drag and drop Iceberg Stats on the canvas.
  5. Double click the Iceberg Stats step to open the step properties page in the right panel.
  6. From the Iceberg Connection drop-down select the Iceberg Apache connection you have selected as the target to load the data.
  7. From the Oracle Connection and the Schema for the connection you have selected as the source to load the data to.
  8. Under External Tables select All Tables to select all external tables in the schema or Select Tables and choose the external tables you want to collect the statistics for.
  9. You can either run the Iceberg Stats step independently or drag an Apache Iceberg data load on the design canvas and connect it to the Iceberg Stats step.
  10. Click OK to save the changes and close the dialog.
  11. Click save icon to save the created/designed workflow.
  12. Click execute icon to execute the created workflow.
  13. [Optional] To check the status of the workflow, see the Status panel on the right below the Properties Panel. For details about the Status panel, see Monitor Status of Data Loads, Data Flows, and Workflows. This panel shows the link to the Job ID that you can click to monitor the execution status on the Jobs page. For more information about jobs see Create and Manage Jobs.
  14. [Optional] You can set a schedule to run the workflow. See Schedule Data Flows or Workflows for instructions about creating a schedule.