2 Using the BigQuery Handler

Learn how to use the Google BigQuery Handler, which streams change data capture data from source trail files into Google BigQuery.

BigQuery is a RESTful web service that enables interactive analysis of massive datasets and works in conjunction with Google Cloud Storage. See https://cloud.google.com/bigquery/.

2.1 Detailing the Functionality

2.1.1 Data Types

The BigQuery Handler supports most of the standard SQL data types. The handler converts each column value in the trail file to the corresponding Java type that represents the BigQuery column type (see the sketch after the list of supported types).

The following data types are supported:

STRING
BYTES
INTEGER
FLOAT
NUMERIC
BOOLEAN
TIMESTAMP
DATE
TIME
DATETIME

The BigQuery Handler does not support complex data types, such as ARRAY and STRUCT.
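
The supported scalar types correspond to field types in the Java BigQuery client library. The following is a minimal sketch, not part of the handler, that builds a BigQuery schema using only types the handler supports (the class, table, and column names are hypothetical):

import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;

public class SupportedTypesSchema {
    public static void main(String[] args) {
        // Hypothetical schema composed only of data types the BigQuery Handler supports.
        Schema schema = Schema.of(
                Field.of("id", LegacySQLTypeName.INTEGER),
                Field.of("name", LegacySQLTypeName.STRING),
                Field.of("payload", LegacySQLTypeName.BYTES),
                Field.of("price", LegacySQLTypeName.NUMERIC),
                Field.of("ratio", LegacySQLTypeName.FLOAT),
                Field.of("active", LegacySQLTypeName.BOOLEAN),
                Field.of("created_ts", LegacySQLTypeName.TIMESTAMP),
                Field.of("created_date", LegacySQLTypeName.DATE),
                Field.of("created_time", LegacySQLTypeName.TIME),
                Field.of("created_dt", LegacySQLTypeName.DATETIME));
        System.out.println(schema);
    }
}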

2.1.2 Operation Modes

You can configure the BigQuery Handler in one of these two modes:

auditLogMode = true
gg.handler.name.auditLogMode=true

When the handler is configured to run with audit log mode true, the data is pushed into Google BigQuery without a unique row identification key. As a result, Google BigQuery is not able to merge different operations on the same row. For example, a source row with an insert operation, two update operations, and then a delete operation would show up in BigQuery as four rows, one for each operation.

Also, the order in which the audit log rows appear in the BigQuery dataset is not deterministic.

To overcome these limitations, specify optype and position in the meta columns template for the handler. This adds two columns with those names to the schema of the table in Google BigQuery. For example: gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}

The optype column identifies the operation type for the row in the audit log.

To view the audit log in the order in which the operations were processed from the trail file, specify position, which can then be used in the ORDER BY clause when querying the table in Google BigQuery. For example:

SELECT * FROM [projectId:datasetId.tableId] ORDER BY position

auditLogMode = false

gg.handler.name.auditLogMode=false

When the handler is configured to run with audit log mode false, the data is pushed into Google BigQuery using a unique row identification key. Google BigQuery is then able to merge different operations for the same row. However, the behavior is complex: Google BigQuery maintains a finite deduplication period in which it merges operations for a given row, so the results can be somewhat non-deterministic.

The trail source needs to have a full image of the records in order to merge correctly.
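
Conceptually, the unique row identification key corresponds to the insert ID used by the BigQuery streaming API, which BigQuery uses for best-effort deduplication. The following minimal sketch, which is not the handler's code, shows a streaming insert through the Java client library with such a key (the dataset, table, column, and key values are hypothetical):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.HashMap;
import java.util.Map;

public class StreamingInsertSketch {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table"); // hypothetical names

        Map<String, Object> row = new HashMap<>();
        row.put("id", 1001);
        row.put("name", "example");

        // The insert ID ("1001" here) acts as the row identification key. Rows
        // streamed with the same insert ID within BigQuery's deduplication
        // period are merged; after the period expires, they surface as
        // separate rows, as the examples below describe.
        InsertAllRequest request = InsertAllRequest.newBuilder(tableId)
                .addRow("1001", row)
                .build();

        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
            System.out.println("Insert errors: " + response.getInsertErrors());
        }
    }
}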

Example 1

An insert operation is sent to BigQuery, and before the deduplication period expires, an update operation for the same row is sent to BigQuery. The result is a single row in BigQuery reflecting the update operation.

Example 2

An insert operation is sent to BigQuery, and after the deduplication period expires, an update operation for the same row is sent to BigQuery. The result is that both the insert and the update operations show up in BigQuery.

This behavior has confounded many users, but it is the documented behavior of the BigQuery SDK and is a feature rather than a defect. The documented length of the deduplication period is at least one minute; however, Oracle testing has shown that the period can be significantly longer. Therefore, unless you can guarantee that all operations for a given row occur within a very short period, it is likely that there will be multiple entries for a given row in BigQuery. It is therefore just as important in this mode to configure meta columns with the optype and position so that you can determine the latest state for a given row. For more information, see the Google BigQuery documentation: Streaming data into BigQuery.
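
For example, with ${optype} and ${position} configured as meta columns, a query similar to the following sketch can reconstruct the latest state of each row. This is illustrative only; the project, dataset, table, and key column (id) names are hypothetical:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class LatestRowState {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Keep only the most recent operation per source key, ordered by the
        // trail position captured in the position meta column. Filter on the
        // optype column afterward if delete operations should be excluded.
        String sql = "SELECT * EXCEPT(rn) FROM ("
                + " SELECT t.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY position DESC) AS rn"
                + " FROM `my_project.my_dataset.my_table` t"
                + ") WHERE rn = 1";

        TableResult result = bigquery.query(
                QueryJobConfiguration.newBuilder(sql).setUseLegacySql(false).build());
        result.iterateAll().forEach(r -> System.out.println(r));
    }
}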

2.1.3 Operation Processing Support

The BigQuery Handler pushes operations to Google BigQuery using the synchronous API. Insert, update, and delete operations are processed differently in BigQuery than in a traditional RDBMS.

The following explains how insert, update, and delete operations are interpreted by the handler depending on the mode of operation:

auditLogMode = true
  • insert – Inserts the record with optype as an insert operation in the BigQuery table.

  • update – Inserts the record with optype as an update operation in the BigQuery table.

  • delete – Inserts the record with optype as a delete operation in the BigQuery table.

  • pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sends out a delete operation followed by an insert operation. Both rows have the same position in the BigQuery table, which helps identify them as a primary key update rather than separate delete and insert operations.

auditLogMode = false
  • insert – If the row does not already exist in Google BigQuery, then an insert operation is processed as an insert. If the row already exists in Google BigQuery, then an insert operation is processed as an update. The handler sets the deleted column to false.

  • update – If the row does not exist in Google BigQuery, then an update operation is processed as an insert. If the row already exists in Google BigQuery, then an update operation is processed as an update. The handler sets the deleted column to false.

  • delete – If the row does not exist in Google BigQuery, then a delete operation has no effect. If the row exists in Google BigQuery, then a delete operation is processed as a delete. The handler sets the deleted column to true.

  • pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sets the deleted column to true for the row whose primary key is updated. This is followed by a separate insert operation with the new primary key and the deleted column set to false.

Do not toggle the audit log mode, because doing so forces the BigQuery Handler to abend; Google BigQuery cannot alter the schema of an existing table. The existing table must be deleted before switching audit log modes.

Note:

The BigQuery Handler does not support the truncate operation. It abends when it encounters a truncate operation.

2.1.4 Proxy Settings

To connect to BigQuery using a proxy server, you must configure the proxy host and the proxy port in the properties file as follows:

javawriter.bootoptions=-Dhttps.proxyHost=proxy_host_name -Dhttps.proxyPort=proxy_port_number

2.2 Setting Up and Running the BigQuery Handler

The Google BigQuery Handler uses the Java BigQuery client libraries to connect to BigQuery.

These client libraries are located using the following Maven coordinates:
  • Group ID: com.google.cloud
  • Artifact ID: google-cloud-bigquery
  • Version: 1.111.1

The BigQuery client libraries do not ship with Oracle GoldenGate for Big Data. Additionally, Google appears to have removed the direct download link for the BigQuery client libraries. You can download the client libraries using Maven and the Maven coordinates listed above; however, this requires proficiency with Maven. If you cannot download the Google BigQuery client libraries, contact Oracle Support for assistance. See Google BigQuery Dependencies.
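
If you are not a regular Maven user, the following single command is one possible approach, shown here only as a sketch, to resolve the client library and its transitive dependencies into your local Maven repository (by default, the .m2/repository directory under your home directory), from where the jar files can be copied to a location on the handler's classpath:

mvn dependency:get -Dartifact=com.google.cloud:google-cloud-bigquery:1.111.1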

2.2.1 Schema Mapping for BigQuery

The table schema name specified in the Replicat MAP statement is mapped to the BigQuery dataset name. For example: map QASOURCE.*, target "dataset_US".*;

This MAP statement replicates tables to the BigQuery dataset "dataset_US".

2.2.2 Understanding the BigQuery Handler Configuration

The following are the configurable values for the BigQuery Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).

To enable the selection of the BigQuery Handler, you must first configure the handler type by specifying gg.handler.name.type=bigquery and the other BigQuery properties as follows:

Each property below is listed with the following details, in order: Required/Optional, Legal Values, Default, and Explanation.

gg.handlerlist

Required

Any string

None

Provides a name for the BigQuery Handler. The BigQuery Handler name then becomes part of the property names listed in this table.

gg.handler.name.type=bigquery

Required

bigquery

None

Selects the BigQuery Handler for streaming change data capture into Google BigQuery.

gg.handler.name.credentialsFile

Optional

Relative or absolute path to the credentials file

None

The credentials file downloaded from Google BigQuery for authentication. If you do not specify the path to the credentials file, you need to set it as an environment variable, see Configuring Handler Authentication.

gg.handler.name.projectId

Required

Any string

None

The name of the project in Google BigQuery. The handler needs the project ID to connect to Google BigQuery.

gg.handler.name.batchSize

Optional

Any number

500

The maximum number of operations to be batched together. This is applicable for all target table batches.

gg.handler.name.batchFlushFrequency

Optional

Any number

1000

The maximum amount of time in milliseconds to wait before executing the next batch of operations. This is applicable for all target table batches.

gg.handler.name.skipInvalidRows

Optional

true | false

false

Sets whether to insert all valid rows of a request, even if invalid rows exist. If not set, the entire insert request fails if it contains an invalid row.

gg.handler.name.ignoreUnknownValues

Optional

true | false

false

Sets whether to accept rows that contain values that do not match the schema. If not set, rows with unknown values are considered to be invalid.

gg.handler.name.connectionTimeout

Optional

Positive integer

20000

The maximum amount of time, in milliseconds, to wait for the handler to establish a connection with Google BigQuery.

gg.handler.name.readTimeout

Optional

Positive integer

30000

The maximum amount of time in milliseconds to wait for the handler to read data from an established connection.

gg.handler.name.metaColumnsTemplate

Optional

A legal string

None

A legal string specifying the metaColumns to be included. If you set auditLogMode to true, it is important that you set the metaColumnsTemplate property to view the operation type for the row inserted in the audit log, see Setting Metacolumn Output.

gg.handler.name.auditLogMode

Optional

true | false

false

When set to true, the handler writes each record to the target without any primary key, and every operation is processed as an insert.

When set to false, the handler tries to merge incoming records into the target table when they have the same primary key. Primary keys are required for this mode, and the trail source records need to contain full image updates to merge correctly.

gg.handler.name.pkUpdateHandling

Optional

abend | delete-insert

abend

Sets how the handler handles update operations that change a primary key. Primary key operations can be problematic for the BigQuery Handler and require special consideration:

  • abend – indicates the process abends.

  • delete-insert – indicates the process treats the operation as a delete and an insert. The full before image is required for this option to work correctly; without full before and after row images, the insert data is incomplete. Oracle recommends this option.

gg.handler.name.adjustScale

Optional

true | false

false

The BigQuery numeric data type supports a maximum scale of 9 digits. If a field is mapped into a BigQuery numeric data type, then it fails if the scale is larger than 9 digits. Set this property to true to round fields mapped to BigQuery numeric data types to a scale of 9 digits. Enabling this property results in a loss of precision for source data values with a scale larger than 9.

gg.handler.name.includeDeletedColumn

Optional

true | false

false

Set to true to include a boolean column in the output called deleted. The value of this column is set to false for insert and update operations, and is set to true for delete operations.

gg.handler.name.enableAlter

Optional

true | false

false

Set to true to enable altering the target BigQuery table. This will allow the BigQuery Handler to add columns or metacolumns configured on the source, which are not currently in the target BigQuery table.

2.2.3 Review a Sample Configuration

The following is a sample configuration for the BigQuery Handler:

gg.handlerlist = bigquery

#The handler properties
gg.handler.bigquery.type = bigquery
gg.handler.bigquery.projectId = festive-athlete-201315
gg.handler.bigquery.credentialsFile = credentials.json
gg.handler.bigquery.auditLogMode = true
gg.handler.bigquery.pkUpdateHandling = delete-insert

gg.handler.bigquery.metaColumnsTemplate =${optype}, ${position}

2.2.4 Configuring Handler Authentication

You must configure BigQuery Handler authentication using the credentials in the JSON key file downloaded from the Google Cloud console.

Download the credentials file:

  1. Log in to your Google account at cloud.google.com.

  2. Click Console, and then go to the Dashboard, where you can select your project.

  3. From the navigation menu, click APIs & Services, and then select Credentials.

  4. From the Create Credentials menu, choose Service account key.

  5. Choose the JSON key type to download the JSON credentials file for your system.

Once you have the credentials file, you can authenticate the handler in one of these two ways (see the verification sketch after this list):

  • Specify the path to the credentials file in the properties file with the gg.handler.name.credentialsFile configuration property.

    The path to the credentials file must not include a wildcard. If you include the * wildcard in the path to the credentials file, the file is not recognized.

    Or

  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable on your system. For example:

    export GOOGLE_APPLICATION_CREDENTIALS=credentials.json

    Then restart the Oracle GoldenGate manager process.
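
Both options ultimately supply the same service account credentials to the Java BigQuery client library that the handler uses. The following standalone sketch, which is not part of the handler, can help verify that a credentials file or the GOOGLE_APPLICATION_CREDENTIALS environment variable authenticates successfully (the file path and project ID are hypothetical):

import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;

import java.io.FileInputStream;
import java.io.IOException;

public class CredentialsCheck {
    public static void main(String[] args) throws IOException {
        // Option 1: load the service account key file explicitly,
        // comparable to setting gg.handler.name.credentialsFile.
        GoogleCredentials credentials =
                GoogleCredentials.fromStream(new FileInputStream("credentials.json"));

        BigQuery bigquery = BigQueryOptions.newBuilder()
                .setProjectId("my-project-id")   // hypothetical project ID
                .setCredentials(credentials)
                .build()
                .getService();

        // Option 2: rely on the GOOGLE_APPLICATION_CREDENTIALS environment
        // variable instead; the default instance picks it up automatically.
        // BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Listing datasets is a simple call that confirms authentication works.
        bigquery.listDatasets().iterateAll()
                .forEach(dataset -> System.out.println(dataset.getDatasetId()));
    }
}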