5.5 Create an Apache Iceberg Connection

Apache Iceberg is an open-standard table format optimized for managing large analytic datasets. Data Transforms supports using Apache Iceberg as a target to load data from any SQL-based data source.

Data Transforms supports the Oracle Object Storage (S3-compatible) and AWS S3 storage services to store the Parquet files for Apache Iceberg tables.

The Data Transforms Apache Iceberg connector requires that a REST Catalog already exists. This REST Catalog is set up based on Apache Gravitino with the Iceberg OpenAPI specification.

Note:

Data Transforms supports Apache Gravitino version 0.7.0-incubating or earlier to bring up the REST service.
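Before creating the connection, you may want to verify that the Gravitino-backed REST Catalog is reachable. The Iceberg REST API exposes a `GET /v1/config` endpoint for this. The following is a minimal sketch, assuming a hypothetical host, port, and service name; it is not part of the Data Transforms product:

```python
import json
import urllib.request

def config_url(rest_url: str) -> str:
    """Build the Iceberg REST API configuration endpoint from the
    <host>:<port>/<ServiceName>/iceberg value entered in the Rest URL field."""
    return "http://" + rest_url + "/v1/config"

def fetch_config(rest_url: str) -> dict:
    """GET /v1/config; a reachable catalog returns its default and
    override catalog properties as JSON."""
    with urllib.request.urlopen(config_url(rest_url)) as resp:
        return json.load(resp)

# Example call (hypothetical endpoint, not executed here):
#   fetch_config("gravitino-host:9001/iceberg-rest/iceberg")
```

If the call fails, check the REST service before troubleshooting the connection in Data Transforms.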

Creating an Apache Iceberg Connection

You can configure an Apache Iceberg connection with the Iceberg REST Catalog by providing the REST URL and authentication details such as the username and password. You can also use the more secure OAuth 2.0 authentication to create the connection.

To create an Apache Iceberg connection:

  1. From the left pane of the Home page, click the Connections tab.

    The Connections page appears.

  2. Click Create Connection.

    The Create Connection page slides in.

  3. Do one of the following:
    • In the Select Type field, enter the name or part of the name of the connection type.
    • Select the Databases tab.
  4. Select Apache Iceberg as the connection type.
  5. Click Next.
  6. The Connection Name field is pre-populated with a default name. You can edit this value.
  7. In the Catalog Name textbox, enter a name.
  8. In the Rest URL textbox, enter the URL of the REST server. Enter the value in the <host>:<port>/<ServiceName>/iceberg format.
  9. From the Authentication drop-down, do one of the following:
    • Select None.
    • Select Simple and enter the Rest User and Rest Password.
    • Select OAuth and enter the following details:
      • Warehouse Location: The location where you want to store the data. For example, s3://my-bucket/my/table/location
      • Token URI: The URL to obtain the OAuth Token in the format http://<host>:<port>
      • Token Path: The path to the OAuth token. For example, /oauth2/token.
      • Client ID: The OAuth Client ID.
      • Client Secret: The OAuth Client secret.
      • Auth Scope: The permissions granted to a client when accessing the Gravitino server. For example, a "test" Auth Scope value might indicate that the client is authorized to access resources related to the "test" scope within Gravitino. [Optional]
      • Grant Type: The method that the authorization server should use to issue the access token. For example, client_credentials and authorization_code. [Optional]
  10. Click Test Connection to test the connection.
  11. After providing all the required connection details, click Create.

    The Apache Iceberg connection is configured with the REST Catalog, which stores the Iceberg data in Oracle Object Storage.
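As a rough sketch of what the OAuth fields in step 9 map to: the Token URI and Token Path together form the token endpoint, and the client credentials are posted as a form-encoded body. All values below are hypothetical, and this is an illustration of standard OAuth 2.0 mechanics, not the product's internal implementation:

```python
from urllib.parse import urlencode

def token_endpoint(token_uri: str, token_path: str) -> str:
    """Join the Token URI (http://<host>:<port>) and the Token Path
    (for example, /oauth2/token) into the full token endpoint."""
    return token_uri.rstrip("/") + "/" + token_path.lstrip("/")

def token_request_body(client_id: str, client_secret: str,
                       scope: str = None,
                       grant_type: str = "client_credentials") -> str:
    """Form-encoded body for the access-token request. Scope and grant
    type are optional, matching the optional Auth Scope and Grant Type fields."""
    params = {
        "grant_type": grant_type,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    if scope:
        params["scope"] = scope
    return urlencode(params)

# Hypothetical values:
url = token_endpoint("http://auth-host:8080", "/oauth2/token")
body = token_request_body("my-client", "my-secret", scope="test")
```

POSTing `body` to `url` would return the access token used to authorize catalog requests.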

The newly created connections are displayed in the Connections page.

Click the Actions icon next to the selected connection to perform the following operations:

  • Select Edit to edit the provided connection details.
  • Select Test Connection to test the created connection.
  • Select Export to export the connection. See Export Objects.
  • Select Delete Schema to delete schemas.
  • Select Delete Connection to delete the created connection.

You can also search for the required connection based on the following filters:

  • Name of the connection.
  • Technology associated with the created connection.

Creating and Running an Apache Iceberg Data Load

You can create a data load from any SQL-based data source, such as Oracle, to load data into Apache Iceberg target tables. To use Apache Iceberg as a target, you need to provide the name of the connection and the namespace. A namespace in Apache Iceberg is similar to a schema in a relational database.

After you create the data load, all the tables in the source schema are listed on the Data Load Detail page along with options to incrementally load, append, and merge the data for each of the selected source tables. When the data load run completes, you can read the data from the Iceberg tables. You can add the data load as a step in a workflow and create a schedule to run the workflows at a predefined time interval. See Create a New Workflow.
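The load options differ in how source rows reach the target table. The following is only a rough illustration of the semantics in pure Python, not the actual load implementation, using a hypothetical `id` key column:

```python
def incremental_merge(target, source, key="id"):
    """Upsert semantics: rows whose key already exists in the target
    are updated; new keys are inserted."""
    by_key = {row[key]: row for row in target}
    for row in source:
        by_key[row[key]] = row
    return list(by_key.values())

def append(target, source):
    """Append semantics: all source rows are added, keeping existing
    target rows as-is (duplicates are possible)."""
    return target + source

# Hypothetical rows:
target = [{"id": 1, "val": "old"}]
source = [{"id": 1, "val": "new"}, {"id": 2, "val": "x"}]
merged = incremental_merge(target, source)   # id 1 updated, id 2 inserted
appended = append(target, source)            # 3 rows, duplicates allowed
```

Incremental Append behaves like Append but loads only rows that changed since the previous run, and Do Not Load skips the table entirely.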

To create and run an Apache Iceberg Data Load:

  1. Do one of the following:
    • On the Home page, click Load Data. The Create Data Load wizard appears.

      In the Create Data Load tab, enter a name if you want to replace the default value, add a description, and select a project from the drop-down.

    • On the Home page, click Projects, and then the required project tile. In the left pane, click Data Loads, and then click Create Data Load. The Create Data Load wizard appears.
  2. Enter a name if you want to replace the default value and add a description.
  3. For Load Processing do one of the following:
    • Select the Internal radio button and from the Deployment Type drop-down select Data Transforms (Batch).
    • Select the Delegate radio button and from the Deployment Type drop-down select OCI GoldenGate. From the GoldenGate Deployment Connection select a connection.
  4. Click Next.
  5. In the Source Connection tab,
    1. From the Connection Type drop-down, select a SQL-based data source.
    2. From the Connection drop-down, select the required connection from which you want to add the data entities.
    3. Click Next.
  6. In the Target Connection tab,
    1. From the Connection Type drop-down, select Apache Iceberg as the connection type.
    2. From the Connection drop-down, select the connection to load the data into.
    3. Specify the Namespace. You can either select from existing namespaces or create a new namespace.
    4. Click Save.

    The Data Load Detail page appears listing all the source tables.

  7. Select the required tables to load and the corresponding data load operation. The data load options you can use are Incremental Merge, Incremental Append, Append, and Do Not Load.
  8. Click the Save icon to save the changes. A green checkmark in the row indicates that the changes are saved.
  9. Click the Execute icon to run the data load.

    A confirmation prompt appears when the data load starts successfully.

To check the status of the data load, see the Data Load Status panel on the right, below the Target Schema details. For details about the panel, see Monitor Status of Data Loads, Data Flows, and Workflows. This panel shows links to the jobs that run this data load. Click a link to monitor the progress on the Job Details page. For more information about jobs, see Create and Manage Jobs.

All the loaded tables, along with their details, are listed on the Data Entities page. To view the statistics of a data entity, click the Actions icon next to the data entity, click Preview, and then select the Statistics tab. See View Statistics of Data Entities for more information.