8.2.18 Google BigQuery

Topics:

8.2.18.1 Using Streaming API

Learn how to use the Google BigQuery Handler, which streams change data capture data from source trail files into Google BigQuery.

BigQuery is a RESTful web service that enables interactive analysis of massive datasets, working in conjunction with Google Storage. See https://cloud.google.com/bigquery/.

8.2.18.1.1 Detailing the Functionality

8.2.18.1.1.1 Data Types

BigQuery supports the standard SQL data types, and most of these data types are supported by the BigQuery Handler. The handler converts the column value in the trail file to the corresponding Java type representing the BigQuery column type.

The following data types are supported:

BIGNUMERIC
BOOLEAN
BYTES
DATE
DATETIME
FLOAT
INTEGER
JSON
NUMERIC
STRING
TIME
TIMESTAMP

The BigQuery Handler does not support complex data types, such as ARRAY and STRUCT.

8.2.18.1.1.2 Metadata Support

The BigQuery Handler creates tables in BigQuery if the tables do not exist.

The BigQuery Handler alters tables to add columns that exist in the source metadata, or configured metacolumns, but do not exist in the target metadata. The BigQuery Handler also adds columns dynamically at runtime if it detects a metadata change.

The BigQuery Handler does not drop columns in the BigQuery table that do not exist in the source table definition. BigQuery neither supports dropping existing columns nor changing the data type of existing columns. Once a column is created in BigQuery, it is immutable.
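Whether the handler may add columns to an existing target table is controlled by the enableAlter property described in the configuration reference later in this section. A minimal sketch, assuming a handler named bigquery:

# Allow the handler to add source columns or metacolumns missing from the target table.
gg.handler.bigquery.enableAlter=true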

Truncate operations are not supported.

8.2.18.1.1.3 Operation Modes

You can configure the BigQuery Handler in one of these two modes:

Audit Log Mode = true
gg.handler.name.auditLogMode=true

When the handler is configured to run with audit log mode true, the data is pushed into Google BigQuery without a unique row identification key. As a result, Google BigQuery is not able to merge different operations on the same row. For example, a source row with an insert operation, two update operations, and then a delete operation would show up in BigQuery as four rows, one for each operation.

Also, the order in which the audit log is displayed in the BigQuery data set is not deterministic.

To overcome these limitations, specify optype and position in the metacolumns template for the handler. This adds two columns of the same names to the schema for the table in Google BigQuery. For example: gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}

The optype is important to determine the operation type for the row in the audit log.

To view the audit log in the order in which operations were processed in the trail file, specify position, which can be used in the ORDER BY clause when querying the table in Google BigQuery. For example:

SELECT * FROM [projectId:datasetId.tableId] ORDER BY position

Audit Log Mode = false
gg.handler.name.auditLogMode=false

When the handler is configured to run with audit log mode false, the data is pushed into Google BigQuery using a unique row identification key. Google BigQuery can then merge different operations for the same row. However, the behavior is complex. Google BigQuery maintains a finite deduplication period during which it merges operations for a given row. Therefore, the results can be somewhat non-deterministic.

The trail source needs to have a full image of the records in order to merge correctly.

Example 1

An insert operation is sent to BigQuery and, before the deduplication period expires, an update operation for the same row is sent to BigQuery. The result is a single row in BigQuery for the update operation.

Example 2

An insert operation is sent to BigQuery and, after the deduplication period expires, an update operation for the same row is sent to BigQuery. The result is that both the insert and the update operations show up in BigQuery.

This behavior has confounded many users; however, it is the documented behavior when using the BigQuery SDK, and is a feature as opposed to a defect. The documented length of the deduplication period is at least one minute. However, Oracle testing has shown that the period is significantly longer. Therefore, unless you can guarantee that all operations for a given row occur within a very short period, it is likely that there will be multiple entries for a given row in BigQuery. It is therefore just as important to configure metacolumns with the optype and position so that you can determine the latest state for a given row. To read more about audit log mode, see the Google BigQuery documentation: Streaming data into BigQuery.
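For example, the following is a sketch of how the optype and position metacolumns can be used to recover the latest state of each row. It uses BigQuery standard SQL; the key column id and the table path are hypothetical placeholders:

SELECT * EXCEPT(rn)
FROM (
  -- Rank the audit-log entries for each key, newest trail position first.
  SELECT t.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY position DESC) AS rn
  FROM `projectId.datasetId.tableId` t
)
WHERE rn = 1

Rows where the surviving entry has an optype of delete can then be filtered out to reflect deleted source rows.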

8.2.18.1.1.4 Operation Processing Support

The BigQuery Handler pushes operations to Google BigQuery using the synchronous API. Insert, update, and delete operations are processed differently in BigQuery than in a traditional RDBMS.

The following explains how insert, update, and delete operations are interpreted by the handler depending on the mode of operation:

auditLogMode = true
  • insert – Inserts the record with optype as an insert operation in the BigQuery table.

  • update – Inserts the record with optype as an update operation in the BigQuery table.

  • delete – Inserts the record with optype as a delete operation in the BigQuery table.

  • pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sends out a delete operation followed by an insert operation. Both these rows have the same position in the BigQuery table, which helps to identify this as a primary key operation and not as separate delete and insert operations.

auditLogMode = false
  • insert – If the row does not already exist in Google BigQuery, then an insert operation is processed as an insert. If the row already exists in Google BigQuery, then an insert operation is processed as an update. The handler sets the deleted column to false.

  • update – If the row does not exist in Google BigQuery, then an update operation is processed as an insert. If the row already exists in Google BigQuery, then an update operation is processed as an update. The handler sets the deleted column to false.

  • delete – If the row does not exist in Google BigQuery, then a delete operation is added. If the row exists in Google BigQuery, then the delete operation is processed as a delete. The handler sets the deleted column to true.

  • pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sets the deleted column to true for the row whose primary key is updated. This is followed by a separate insert operation with the new primary key and the deleted column set to false for this row.

Do not toggle the audit log mode, because doing so forces the BigQuery Handler to abend: Google BigQuery cannot alter the schema of an existing table. The existing table must be deleted before switching audit log modes.

Note:

The BigQuery Handler does not support the truncate operation. It abends when it encounters a truncate operation.

8.2.18.1.1.5 Proxy Settings

To connect to BigQuery using a proxy server, you must configure the proxy host and the proxy port in the properties file as follows:

jvm.bootoptions= -Dhttps.proxyHost=proxy_host_name -Dhttps.proxyPort=proxy_port_number

8.2.18.1.1.6 Mapping to Google Datasets

A dataset is contained within a specific Google Cloud project. Datasets are top-level containers that are used to organize and control access to your tables and views.

A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.

The BigQuery Handler can use existing datasets or create datasets if they are not found.

The BigQuery Handler maps the table's schema name to the dataset name. For three-part table names, the dataset is constructed by concatenating the catalog and schema.

8.2.18.1.2 Setting Up and Running the BigQuery Handler

The Google BigQuery Handler uses the Java BigQuery client libraries to connect to BigQuery.

These client libraries are located using the following Maven coordinates:
  • Group ID: com.google.cloud
  • Artifact ID: google-cloud-bigquery
  • Version: 2.7.1

The BigQuery Client libraries do not ship with Oracle GoldenGate for Big Data. Additionally, Google appears to have removed the link to download the BigQuery Client libraries. You can download the BigQuery Client libraries using Maven and the Maven coordinates listed above. However, this requires proficiency with Maven. The Google BigQuery client libraries can be downloaded using the Dependency downloading scripts. For more information, see Google BigQuery Dependencies.

For more information about Dependency Downloader, see Dependency Downloader.
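As a sketch, the following shows one way to run the downloader from the Oracle GoldenGate for Big Data installation directory. The script name comes from the Google BigQuery Dependencies section of this chapter; passing the library version as an argument is an assumption, so check the script usage in your installation:

# Hypothetical invocation of the dependency downloader script.
# The version argument (2.7.1) matches the documented client library version.
cd DependencyDownloader
./bigquery.sh 2.7.1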

8.2.18.1.2.1 Schema Mapping for BigQuery

The table schema name specified in the replicat map statement is mapped to the BigQuery dataset name. For example: map QASOURCE.*, target "dataset_US".*;

This map statement replicates tables to the BigQuery dataset "dataset_US". Oracle GoldenGate for Big Data normalizes schema and table names to uppercase. Lowercase and mixed case dataset and table names are supported, but need to be quoted in the Replicat mapping statement.

8.2.18.1.2.2 Understanding the BigQuery Handler Configuration

The following are the configurable values for the BigQuery Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).

To enable the selection of the BigQuery Handler, you must first configure the handler type by specifying gg.handler.name.type=bigquery and the other BigQuery properties as follows:

Properties Required/ Optional Legal Values Default Explanation

gg.handlerlist

Required

Any string

None

Provides a name for the BigQuery Handler. The BigQuery Handler name then becomes part of the property names listed in this table.

gg.handler.name.type=bigquery

Required

bigquery

None

Selects the BigQuery Handler for streaming change data capture into Google BigQuery.

gg.handler.name.credentialsFile

Optional

Relative or absolute path to the credentials file

None

The credentials file downloaded from Google BigQuery for authentication. If you do not specify the path to the credentials file, you need to set it as an environment variable, see Configuring Handler Authentication.

gg.handler.name.projectId

Required

Any string

None

The name of the project in Google BigQuery. The handler needs project ID to connect to Google BigQuery store.

gg.handler.name.batchSize

Optional

Any number

500

The maximum number of operations to be batched together. This is applicable for all target table batches.

gg.handler.name.batchFlushFrequency

Optional

Any number

1000

The maximum amount of time in milliseconds to wait before executing the next batch of operations. This is applicable for all target table batches.

gg.handler.name.skipInvalidRows

Optional

true | false

false

Sets whether to insert all valid rows of a request, even if invalid rows exist. If not set, the entire insert request fails if it contains an invalid row.

gg.handler.name.ignoreUnknownValues

Optional

true | false

false

Sets whether to accept rows that contain values that do not match the schema. If not set, rows with unknown values are considered to be invalid.

gg.handler.name.connectionTimeout

Optional

Positive integer

20000

The maximum amount of time, in milliseconds, to wait for the handler to establish a connection with Google BigQuery.

gg.handler.name.readTimeout

Optional

Positive integer

30000

The maximum amount of time in milliseconds to wait for the handler to read data from an established connection.

gg.handler.name.metaColumnsTemplate

Optional

A legal string

None

A legal string specifying the metaColumns to be included. If you set auditLogMode to true, it is important that you set the metaColumnsTemplate property to view the operation type for the row inserted in the audit log, see Metacolumn Keywords.

gg.handler.name.auditLogMode

Optional

true | false

false

Set to true, the handler writes each record to the target without any primary key. Everything is processed as an insert.

Set to false, the handler tries to merge incoming records into the target table if they have the same primary key. Primary keys are required for this property. The trail source records need to contain full image updates to merge correctly.

gg.handler.name.pkUpdateHandling

Optional

abend | delete-insert

abend

Sets how the handler handles update operations that change a primary key. Primary key operations can be problematic for the BigQuery Handler and require special consideration:

  • abend - indicates the process abends.

  • delete-insert - indicates the process treats the operation as a delete and an insert. The full before image is required for this property to work correctly. Without full before and after row images, the insert data is incomplete. Oracle recommends this option.

gg.handler.name.adjustScale

Optional

true | false

false

The BigQuery numeric data type supports a maximum scale of 9 digits. If a field is mapped into a BigQuery numeric data type, then the mapping fails if the scale is larger than 9 digits. Set this property to true to round fields mapped to BigQuery numeric data types to a scale of 9 digits. Enabling this property results in a loss of precision for source data values with a scale larger than 9.

gg.handler.name.includeDeletedColumn

Optional

true | false

false

Set to true to include a boolean column in the output called deleted. The value of this column is set to false for insert and update operations, and is set to true for delete operations.

gg.handler.name.enableAlter

Optional

true | false

false

Set to true to enable altering the target BigQuery table. This allows the BigQuery Handler to add columns or metacolumns that are configured on the source but are not currently in the target BigQuery table.

gg.handler.name.clientId

Optional

String

None

Use to set the client ID if the configuration property gg.handler.name.credentialsFile, which resolves the Google BigQuery credentials, is not set. You may wish to use this property instead of the credentials file in order to use Oracle Wallet to secure credentials.

gg.handler.name.clientEmail

Optional

String

None

Use to set the client email if the configuration property gg.handler.name.credentialsFile, which resolves the Google BigQuery credentials, is not set. You may wish to use this property instead of the credentials file in order to use Oracle Wallet to secure credentials.

gg.handler.name.privateKey

Optional

String

None

Use to set the private key if the configuration property gg.handler.name.credentialsFile, which resolves the Google BigQuery credentials, is not set. You may wish to use this property instead of the credentials file in order to use Oracle Wallet to secure credentials.

gg.handler.name.privateKeyId

Optional

String

None

Use to set the private key ID if the configuration property gg.handler.name.credentialsFile, which resolves the Google BigQuery credentials, is not set. You may wish to use this property instead of the credentials file in order to use Oracle Wallet to secure credentials.

gg.handler.name.url

Optional

A legal URL to connect to BigQuery including scheme, server name, and port (if not the default port).

https://www.googleapis.com

Allows the user to set a URL for a private endpoint to connect to BigQuery.

To connect to the Google Cloud service account, ensure that one of the following is configured: the credentials file property with the relative or absolute path to the credentials JSON file, or the properties for the individual credential keys. The configuration properties used to individually add the Google service account credential keys enable them to be encrypted using the Oracle wallet.
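For example, a minimal sketch of the credential-key alternative. The handler name bigquery is assumed, and the placeholder values come from the fields of the same names in the downloaded service account JSON key:

# Individual credential keys, used instead of gg.handler.bigquery.credentialsFile.
gg.handler.bigquery.clientId=<client_id value from the JSON key>
gg.handler.bigquery.clientEmail=<client_email value from the JSON key>
gg.handler.bigquery.privateKeyId=<private_key_id value from the JSON key>
gg.handler.bigquery.privateKey=<private_key value from the JSON key>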

8.2.18.1.2.3 Review a Sample Configuration

The following is a sample configuration for the BigQuery Handler:

gg.handlerlist = bigquery

#The handler properties
gg.handler.bigquery.type = bigquery
gg.handler.bigquery.projectId = festive-athlete-201315
gg.handler.bigquery.credentialsFile = credentials.json
gg.handler.bigquery.auditLogMode = true
gg.handler.bigquery.pkUpdateHandling = delete-insert

gg.handler.bigquery.metaColumnsTemplate =${optype}, ${position}

8.2.18.1.2.4 Configuring Handler Authentication

You must configure the BigQuery Handler authentication using the credentials in the JSON file downloaded from Google BigQuery.

Download the credentials file:

  1. Log in to your Google account at cloud.google.com.

  2. Click Console, and then go to the Dashboard where you can select your project.

  3. From the navigation menu, click APIs & Services then select Credentials.

  4. From the Create Credentials menu, choose Service account key.

  5. Choose the JSON key type to download the JSON credentials file for your system.

After you have the credentials file, you can authenticate the handler using one of the following methods:

  • Specify the path to the credentials file in the properties file with the gg.handler.name.credentialsFile configuration property.

    The path to the credentials file must not include a wildcard. If you include the * wildcard in the path to the credentials file, the file is not recognized.

    Or

  • Set the credentials file keys (clientId, clientEmail, privateKeyId, and privateKey) in the corresponding handler properties.

    Or

  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable on your system. For example:

    export GOOGLE_APPLICATION_CREDENTIALS=credentials.json

    Then restart the Oracle GoldenGate manager process.

8.2.18.1.3 Google BigQuery Dependencies

The Google BigQuery client libraries are required for integration with BigQuery.

The Maven coordinates are as follows:

Maven groupId: com.google.cloud

Maven artifactId: google-cloud-bigquery

Version: 2.7.1
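In the Maven dependency format used later in this chapter, these coordinates correspond to:

<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-bigquery</artifactId>
    <version>2.7.1</version>
</dependency>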

8.2.18.1.3.1 BigQuery 2.7.1

The required BigQuery Client libraries for the 2.7.1 version are as follows:

api-common-2.1.3.jar
checker-compat-qual-2.5.5.jar
checker-qual-3.21.1.jar
commons-codec-1.15.jar
commons-logging-1.2.jar
error_prone_annotations-2.11.0.jar
failureaccess-1.0.1.jar
gax-2.11.0.jar
gax-httpjson-0.96.0.jar
google-api-client-1.33.1.jar
google-api-services-bigquery-v2-rev20211129-1.32.1.jar
google-auth-library-credentials-1.4.0.jar
google-auth-library-oauth2-http-1.4.0.jar
google-cloud-bigquery-2.7.1.jar
google-cloud-core-2.4.0.jar
google-cloud-core-http-2.4.0.jar
google-http-client-1.41.2.jar
google-http-client-apache-v2-1.41.2.jar
google-http-client-appengine-1.41.2.jar
google-http-client-gson-1.41.2.jar
google-http-client-jackson2-1.41.2.jar
google-oauth-client-1.33.0.jar
grpc-context-1.44.0.jar
gson-2.8.9.jar
guava-31.0.1-jre.jar
httpclient-4.5.13.jar
httpcore-4.4.15.jar
j2objc-annotations-1.3.jar
jackson-core-2.13.1.jar
javax.annotation-api-1.3.2.jar
jsr305-3.0.2.jar
listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
opencensus-api-0.31.0.jar
opencensus-contrib-http-util-0.31.0.jar
protobuf-java-3.19.3.jar
protobuf-java-util-3.19.3.jar
proto-google-common-protos-2.7.2.jar
proto-google-iam-v1-1.2.1.jar

8.2.18.2 Google BigQuery Stage and Merge

Topics:

8.2.18.2.1 Overview

BigQuery is Google Cloud’s fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time.

8.2.18.2.2 Detailed Functionality

The BigQuery Event handler uses the stage and merge data flow.

The change data is staged in a temporary location in microbatches and eventually merged into the target table. Google Cloud Storage (GCS) is used as the staging area for the change data.

This Event handler is used as a downstream Event handler connected to the output of the GCS Event handler.

The GCS Event handler loads files generated by the File Writer Handler into Google Cloud Storage.

The Event handler runs BigQuery Query jobs to execute MERGE SQL. The SQL operations are performed in batches, providing better throughput.
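As an illustration only, the stage and merge flow boils down to a statement of the following shape. The handler generates its own SQL; the project, dataset, table, key, and column names below are hypothetical, as is the use of an optype column to mark deletes:

MERGE `my-project.my_dataset.target_table` t
USING `my-project.my_dataset.stage_table` s
ON t.id = s.id
-- A staged delete removes the matching target row.
WHEN MATCHED AND s.optype = 'D' THEN
  DELETE
-- A staged insert or update overwrites the matching target row.
WHEN MATCHED THEN
  UPDATE SET col1 = s.col1
-- Rows not yet in the target are inserted, unless they are deletes.
WHEN NOT MATCHED AND s.optype != 'D' THEN
  INSERT (id, col1) VALUES (s.id, s.col1)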

Note:

The BigQuery Event handler doesn't use the Google BigQuery streaming API.

8.2.18.2.3 Prerequisites

  • Target table existence: Ensure that the target tables exist in the BigQuery dataset.
  • Google Cloud Storage (GCS) bucket and dataset location: Ensure that the GCS bucket and the BigQuery dataset exist in the same location/region.

8.2.18.2.4 Differences between BigQuery Handler and Stage and Merge BigQuery Event Handler

Table 8-23 BigQuery Handler vs. Stage and Merge BigQuery Event Handler

Feature/Limitation | BigQuery Handler | Stage and Merge BigQuery Event Handler
Compressed update support | Partially supported, with limitations. | Yes
Audit log mode | Processes all operations as INSERT. | No need to enable audit log mode.
GCP quotas/limits | Maximum rows per second per table: 100,000. See Google BigQuery Documentation. | Daily destination table update limit: 1,500 updates per table per day. See Google BigQuery Documentation.
Approximate pricing with 1 TB storage (for exact pricing, refer to the GCP pricing calculator) | Streaming inserts for 1 TB cost ~72.71 USD per month. | A query job for 1 TB costs ~20.28 USD per month.
Duplicate rows replicated to BigQuery | Yes | No
Replication of TRUNCATE operations | Not supported | Supported
API used | BigQuery Streaming API | BigQuery Query job

8.2.18.2.5 Authentication or Authorization

For more information about using the Google service account key, see Authentication and Authorization in the Google Cloud Storage (GCS) Event Handler topic. In addition to the permissions needed to access GCS, the service account also needs permissions to access BigQuery. You may choose to use a pre-defined IAM role, such as roles/bigquery.dataEditor or roles/bigquery.dataOwner. When creating a custom role, the following are the IAM permissions needed to run the BigQuery Event handler. For more information, see Configuring Handler Authentication.
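For example, a minimal sketch of granting one of the pre-defined roles to the service account using the gcloud CLI. The project ID and service account address are placeholders:

# Grant the BigQuery Data Editor role to the replication service account.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"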

8.2.18.2.5.1 BigQuery Permissions

Table 8-24 BigQuery Permissions

Permission Description
bigquery.connections.create Create new connections in a project.
bigquery.connections.delete Delete a connection.
bigquery.connections.get Gets connection metadata. Credentials are excluded.
bigquery.connections.list List connections in a project.
bigquery.connections.update Update a connection and its credentials.
bigquery.connections.use Use a connection configuration to connect to a remote data source.
bigquery.datasets.create Create new datasets.
bigquery.datasets.get Get metadata about a dataset.
bigquery.datasets.getIamPolicy Reserved for future use.
bigquery.datasets.update Update metadata for a dataset.
bigquery.datasets.updateTag Update tags for a dataset.
bigquery.jobs.create Run jobs (including queries) within the project.
bigquery.jobs.get Get data and metadata on any job.
bigquery.jobs.list List all jobs and retrieve metadata on any job submitted by any user. For jobs submitted by other users, details and metadata are redacted.
bigquery.jobs.listAll List all jobs and retrieve metadata on any job submitted by any user.
bigquery.jobs.update Cancel any job.
bigquery.readsessions.create Create a new read session via the BigQuery Storage API.
bigquery.readsessions.getData Read data from a read session via the BigQuery Storage API.
bigquery.readsessions.update Update a read session via the BigQuery Storage API.
bigquery.reservations.create Create a reservation in a project.
bigquery.reservations.delete Delete a reservation.
bigquery.reservations.get Retrieve details about a reservation.
bigquery.reservations.list List all reservations in a project.
bigquery.reservations.update Update a reservation’s properties.
bigquery.reservationAssignments.create Create a reservation assignment. This permission is required on the owner project and assignee resource. To move a reservation assignment, you need bigquery.reservationAssignments.create on the new owner project and assignee resource.
bigquery.reservationAssignments.delete Delete a reservation assignment. This permission is required on the owner project and assignee resource. To move a reservation assignment, you need bigquery.reservationAssignments.delete on the old owner project and assignee resource.
bigquery.reservationAssignments.list List all reservation assignments in a project.
bigquery.reservationAssignments.search Search for a reservation assignment for a given project, folder, or organization.
bigquery.routines.create Create new routines (functions and stored procedures).
bigquery.routines.delete Delete routines.
bigquery.routines.list List routines and metadata on routines.
bigquery.routines.update Update routine definitions and metadata.
bigquery.savedqueries.create Create saved queries.
bigquery.savedqueries.delete Delete saved queries.
bigquery.savedqueries.get Get metadata on saved queries.
bigquery.savedqueries.list List saved queries.
bigquery.savedqueries.update Update saved queries.
bigquery.tables.create Create new tables.
bigquery.tables.delete Delete tables.
bigquery.tables.export Export table data out of BigQuery.
bigquery.tables.get Get table metadata. To get table data, you need bigquery.tables.getData.
bigquery.tables.getData Get table data. This permission is required for querying table data. To get table metadata, you need bigquery.tables.get.
bigquery.tables.getIamPolicy Read a table’s IAM policy.
bigquery.tables.list List tables and metadata on tables.
bigquery.tables.setCategory Set policy tags in table schema.
bigquery.tables.setIamPolicy Changes a table’s IAM policy.
bigquery.tables.update Update table metadata. To update table data, you need bigquery.tables.updateData.
bigquery.tables.updateData Update table data. To update table metadata, you need bigquery.tables.update.
bigquery.tables.updateTag Update tags for a table.

In addition to these permissions, ensure that resourcemanager.projects.get and resourcemanager.projects.list are always granted as a pair.

8.2.18.2.6 Configuration

8.2.18.2.6.1 Automatic Configuration

Replication to BigQuery involves configuring multiple components, such as the File Writer handler, the Google Cloud Storage (GCS) Event handler, and the BigQuery Event handler.

The Automatic Configuration functionality helps to auto configure these components so that the user configuration is minimal.

The properties modified by auto configuration are also logged in the handler log file. To enable auto configuration to replicate to a BigQuery target, set the parameter gg.target=bq.

When replicating to a BigQuery target, you cannot customize the GCS Event handler name and the BigQuery Event handler name.

8.2.18.2.6.1.1 File Writer Handler Configuration

The File Writer handler name is preset to the value bq. The following is an example of editing a File Writer handler property: gg.handler.bq.pathMappingTemplate=./dirout.

8.2.18.2.6.1.2 GCS Event Handler Configuration

The GCS Event handler name is preset to the value gcs. The following is an example of editing a GCS Event handler property: gg.eventhandler.gcs.concurrency=5.

8.2.18.2.6.1.3 BigQuery Event Handler Configuration

The BigQuery Event handler name is preset to the value bq. There are no mandatory parameters required for the BigQuery Event handler; auto configuration derives most of the required parameters.

The following are the BigQuery Event handler configurations:

Properties Required/ Optional Legal Values Default Explanation
gg.eventhandler.bq.credentialsFile Optional Relative or absolute path to the service account key file. Value from property gg.eventhandler.gcs.credentialsFile Sets the path to the service account key file. Autoconfigure will automatically configure this property based on the configuration gg.eventhandler.gcs.credentialsFile, unless the user wants to use a different service account key file for BigQuery access. Alternatively, if the environment variable GOOGLE_APPLICATION_CREDENTIALS is set to the path to the service account key file, this parameter need not be set.
gg.eventhandler.bq.projectId Optional The Google project-id project-id associated with the service account. Sets the project-id of the Google Cloud project that houses BigQuery. Autoconfigure will automatically configure this property by accessing the service account key file unless user wants to override this explicitly.
gg.eventhandler.bq.kmsKey Optional Key names in the format: projects/<PROJECT>/locations/<LOCATION>/keyRings/<RING_NAME>/cryptoKeys/<KEY_NAME>
  • <PROJECT>: Google project-id
  • <LOCATION>: Location of the BigQuery dataset.
  • <RING_NAME>: Google Cloud KMS key ring name.
  • <KEY_NAME>: Google Cloud KMS key name.
Value from property gg.eventhandler.gcs.kmsKey Set a customer managed Cloud KMS key to encrypt data in BigQuery. Autoconfigure will automatically configure this property based on the configuration gg.eventhandler.gcs.kmsKey.
gg.eventhandler.bq.connectionTimeout Optional Positive integer. 20000 The maximum amount of time, in milliseconds, to wait for the handler to establish a connection with Google BigQuery.
gg.eventhandler.bq.readTimeout Optional Positive integer. 30000 The maximum amount of time in milliseconds to wait for the handler to read data from an established connection.
gg.eventhandler.bq.totalTimeout Optional Positive integer. 120000 The total timeout parameter in seconds. The TotalTimeout parameter has the ultimate control over how long the logic should keep trying the remote call until it gives up completely.
gg.eventhandler.bq.retries Optional Positive integer. 3 The maximum number of retry attempts to perform.
gg.eventhandler.bq.createDataset Optional true | false true Set to true to automatically create the BigQuery dataset if it does not exist.
gg.eventhandler.bq.createTable Optional true | false true Set to true to automatically create the BigQuery target table if it does not exist.
gg.aggregate.operations.flush.interval Optional Integer 30000 The flush interval parameter determines how often the data is merged into BigQuery. The value is set in milliseconds.

Caution:

The higher this value, the more data is stored in the memory of the Replicat process.

Note:

Use the flush interval parameter with caution. Increasing its default value will increase the amount of data stored in the internal memory of the Replicat. This can cause out of memory errors and stop the Replicat if it runs out of memory.
gg.compressed.update Optional true or false true If set to true, then the source trail files are expected to contain compressed update operations. If set to false, then the source trail files are expected to contain uncompressed update operations.
gg.eventhandler.bq.connectionRetryIntervalSeconds Optional Integer Value 30 Specifies the delay in seconds between connection retry attempts.
gg.eventhandler.bq.connectionRetries Optional Integer Value 3 Specifies the number of times connections to the target data warehouse will be retried.
gg.eventhandler.bq.url Optional An absolute URL to connect to Google BigQuery. https://googleapis.com A legal URL to connect to Google BigQuery including scheme, server name, and port (if not the default port). The default is https://googleapis.com.

8.2.18.2.6.2 Classpath Configuration

The GCS Event handler and the BigQuery Event handler use the Java SDK provided by Google. Google does not provide a direct link to download the SDK.

You can download the SDKs using the following Maven coordinates:

Google Cloud Storage
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-storage</artifactId>
    <version>1.113.9</version>
</dependency>

To download the GCS dependencies, execute the following script <OGGDIR>/DependencyDownloader/gcs.sh.

BigQuery
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-bigquery</artifactId>
    <version>1.111.1</version>
</dependency>

To download the BigQuery dependencies, execute the following script: <OGGDIR>/DependencyDownloader/bigquery.sh. For more information, see bigquery.sh in Dependency Downloader Scripts.

Set the path to the GCS and BigQuery SDK in the gg.classpath configuration parameter. For example: gg.classpath=./gcs-deps/*:./bq-deps/*.

For more information, see Dependency Downloader Scripts.

8.2.18.2.6.3 Proxy Configuration

When the replicat process is run behind a proxy server, you can use the jvm.bootoptions property to set the proxy server configuration. For example: jvm.bootoptions=-Dhttps.proxyHost=some-proxy-address.com -Dhttps.proxyPort=80.

8.2.18.2.6.4 INSERTALLRECORDS Support

Stage and merge targets support the INSERTALLRECORDS parameter.

See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm).

Setting this parameter directs the Replicat process to use bulk insert operations to load operation data into the target table.

To process initial load trail files, set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm). You can tune the batch size of bulk inserts using the gg.handler.bq.maxFileSize File Writer property. The default value is set to 1GB.

The frequency of bulk inserts can be tuned using the File Writer gg.handler.bq.fileRollInterval property; the default value is set to 3m (three minutes).
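Putting the two tuning knobs together, the following is a minimal sketch of an initial load setup. The size and interval value syntax is an assumption based on common File Writer handler settings, and the values shown are the documented defaults, spelled out for illustration:

-- In the Replicat parameter file (.prm):
INSERTALLRECORDS

# In the Java Adapter properties file:
# 1g batch size and 3m roll interval are the documented defaults.
gg.handler.bq.maxFileSize=1g
gg.handler.bq.fileRollInterval=3m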

8.2.18.2.6.5 BigQuery Dataset and GCP ProjectId Mapping

The BigQuery Event handler maps the table schema name to the BigQuery dataset.

The table catalog name is mapped to the GCP projectId.

8.2.18.2.6.5.1 Three-Part Table Names

If the tables use distinct catalog names, then the BigQuery datasets would reside in multiple GCP projects. The GCP service account key should have the required privileges in the respective GCP projects. See BigQuery Permissions.

8.2.18.2.6.5.2 Mapping Table

Table 8-25 Mapping Table

MAP statement in the Replicat parameter file | BigQuery Dataset | GCP ProjectId
MAP SCHEMA1.*, TARGET "bq-project-1".*.*; | SCHEMA1 | bq-project-1
MAP "bq-project-2".SCHEMA2.*, TARGET *.*.*; | SCHEMA2 | bq-project-2
MAP SCHEMA3.*, TARGET *.*; | SCHEMA3 | The default projectId from the GCP service account key file or the configuration gg.eventhandler.bq.projectId.

8.2.18.2.6.6 End-to-End Configuration

The following is an end-to-end configuration example which uses auto configuration for the File Writer (FW) handler, GCS, and BigQuery Event handlers.

This sample properties file is located at: AdapterExamples/big-data/bigquery-via-gcs/bq.props.
 # Configuration to load GoldenGate trail operation records
 # into Google Big Query by chaining
 # File writer handler -> GCS Event handler -> BQ Event handler.
 # Note: Recommended to only edit the configuration marked as TODO
 # The property gg.eventhandler.gcs.credentialsFile need not be set if
 # the GOOGLE_APPLICATION_CREDENTIALS environment variable is set.

 gg.target=bq

 ## The GCS Event handler
 #TODO: Edit the GCS bucket name
 gg.eventhandler.gcs.bucketMappingTemplate=<gcs-bucket-name>
 #TODO: Edit the GCS credentialsFile
 gg.eventhandler.gcs.credentialsFile=/path/to/gcp/credentialsFile

## The BQ Event handler
## No mandatory configuration required.

#TODO: Edit to include the GCS Java SDK and BQ Java SDK.
gg.classpath=/path/to/gcs-deps/*:/path/to/bq-deps/*
#TODO: Edit to provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms8g
#TODO: If running OGGBD behind a proxy server.
#jvm.bootoptions=-Xmx8g -Xms512m -Dhttps.proxyHost=<ip-address> -Dhttps.proxyPort=<port>

8.2.18.2.6.7 Compressed Update Handling

A compressed update record contains values for the key columns and the modified columns.

An uncompressed update record contains values for all the columns.

Oracle GoldenGate trails may contain compressed or uncompressed update records. The default extract configuration writes compressed updates to the trails.

The parameter gg.compressed.update can be set to true or false to indicate whether the source trail contains compressed or uncompressed update records.

8.2.18.2.6.7.1 MERGE Statement with Uncompressed Updates

In some use cases, if the trail contains uncompressed update records, then the MERGE SQL statement can be optimized for better performance by setting gg.compressed.update=false.
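For example, if the Extract is configured to write full images (uncompressed updates) to the trail, the optimization is enabled with a single property:

# Tell the handler the trail contains uncompressed (full image) updates.
gg.compressed.update=false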

8.2.18.2.7 Troubleshooting and Diagnostics

  • DDL not applied on the target table: Oracle GoldenGate for Big Data does not support DDL replication.
  • SQL Errors: If there are any errors while executing any SQL, then the entire SQL statement, along with the bind parameter values, is logged into the Oracle GoldenGate for Big Data handler log file.
  • Co-existence of the components: The location/region of the machine where the Replicat process is running and of the BigQuery dataset/GCS bucket impacts the overall throughput of the apply process.

    Data flow is as follows: GoldenGate -> GCS bucket -> BigQuery. For best throughput, ensure that the components are located as close as possible.

  • com.google.cloud.bigquery.BigQueryException: Access Denied: Project <any-gcp-project>: User does not have bigquery.datasets.create permission in project <any-gcp-project>: The service account key used by Oracle GoldenGate for Big Data does not have permission to create datasets in this project. Grant the permission bigquery.datasets.create and restart the Replicat process. The privileges are listed in BigQuery Permissions.