2 Using the BigQuery Handler
Learn how to use the Google BigQuery Handler, which streams change data capture data from source trail files into Google BigQuery.
BigQuery is a RESTful web service that enables interactive analysis of very large datasets in conjunction with Google Cloud Storage. For more information, see https://cloud.google.com/bigquery/.
- Detailing the Functionality
- Setting Up and Running the BigQuery Handler
The Google BigQuery Handler uses the Java BigQuery client libraries to connect to BigQuery.
2.1 Detailing the Functionality
2.1.1 Data Types
The BigQuery Handler supports most of the standard SQL data types. A data type conversion is required from the column value in the trail file to the corresponding Java type representing the BigQuery column type in the BigQuery Handler.
The following data types are supported:
- STRING
- BYTES
- INTEGER
- FLOAT
- NUMERIC
- BOOLEAN
- TIMESTAMP
- DATE
- TIME
- DATETIME
The BigQuery Handler does not support complex data types, such as ARRAY and STRUCT.
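The conversion step described above can be sketched as follows. This is a hypothetical illustration of converting a raw trail-file value into the runtime type backing each supported BigQuery column type; it is not the handler's actual code, and the assumed ISO-8601 text format for temporal values is illustrative only.

```python
# Hypothetical sketch of trail-column-to-BigQuery type conversion.
from datetime import date, datetime, time
from decimal import Decimal

# Supported BigQuery types and the runtime type each converts to.
# Complex types such as ARRAY and STRUCT are intentionally absent.
BQ_TYPE_TO_PY = {
    "STRING": str, "BYTES": bytes, "INTEGER": int, "FLOAT": float,
    "NUMERIC": Decimal, "BOOLEAN": bool, "TIMESTAMP": datetime,
    "DATE": date, "TIME": time, "DATETIME": datetime,
}

def convert(value, bq_type):
    """Convert a raw trail-file column value (text) for the given BigQuery type."""
    if bq_type not in BQ_TYPE_TO_PY:
        raise ValueError(f"unsupported BigQuery type: {bq_type}")
    target = BQ_TYPE_TO_PY[bq_type]
    if target is bool:
        return value.lower() in ("true", "1", "y")
    if target in (datetime, date, time):
        # ISO-8601 text is assumed purely for illustration.
        return target.fromisoformat(value)
    if target is bytes:
        return value.encode("utf-8")
    return target(value)
```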
2.1.2 Operation Modes
You can configure the BigQuery Handler in one of these two modes:
- auditLogMode = true (gg.handler.name.auditLogMode=true)
When the handler is configured to run with audit log mode true, the data is pushed into Google BigQuery without a unique row identification key. As a result, Google BigQuery is not able to merge different operations on the same row. For example, a source row with an insert operation, two update operations, and then a delete operation would show up in BigQuery as four rows, one for each operation.
Also, the order in which the audit log is displayed in the BigQuery data set is not deterministic.
To overcome these limitations, specify optype and position in the meta columns template for the handler. This adds two columns with those names to the schema of the table in Google BigQuery. For example:
gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}
The optype value identifies the operation type for each row in the audit log. To view the audit log in the order in which operations were processed in the trail file, use position in the ORDER BY clause when querying the table in Google BigQuery. For example:
SELECT * FROM [projectId:datasetId.tableId] ORDER BY position
- auditLogMode = false (gg.handler.name.auditLogMode=false)
When the handler is configured to run with audit log mode false, the data is pushed into Google BigQuery using a unique row identification key. Google BigQuery can then merge different operations for the same row. However, the behavior is complex: Google BigQuery maintains a finite deduplication period in which it merges operations for a given row, so the results can be somewhat non-deterministic. The trail source needs to have a full image of the records in order to merge correctly.
Example 1
An insert operation is sent to BigQuery and, before the deduplication period expires, an update operation for the same row is sent to BigQuery. The result is a single row in BigQuery for the update operation.
Example 2
An insert operation is sent to BigQuery and, after the deduplication period expires, an update operation for the same row is sent to BigQuery. The result is that both the insert and the update operations show up in BigQuery.
This behavior has confounded many users, but it is the documented behavior of the BigQuery SDK and is a feature rather than a defect. The documented length of the deduplication period is at least one minute. However, Oracle testing has shown that the period can be significantly longer. Therefore, unless you can guarantee that all operations for a given row occur within a very short period, it is likely that there will be multiple entries for a given row in BigQuery. It is therefore just as important in this mode to configure meta columns with optype and position so that you can determine the latest state for a given row. For more information about audit log mode, see the Google BigQuery documentation: Streaming data into BigQuery.
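The two examples above can be captured in a small simulation. This is a toy model of the streaming deduplication window, using the documented one-minute minimum as the window length purely for illustration; it is not how BigQuery is implemented.

```python
# Toy model of BigQuery streaming deduplication: operations on the same row
# key merge only when they arrive within the deduplication window. The
# 60-second window is the documented minimum and is illustrative only.
DEDUP_WINDOW_SECONDS = 60

def visible_rows(ops):
    """ops: iterable of (arrival_seconds, row_key, payload) tuples.
    Returns the (row_key, payload) pairs left visible in the table."""
    visible = []  # entries of [first_arrival, row_key, payload]
    for t, key, payload in sorted(ops):
        for entry in visible:
            if entry[1] == key and t - entry[0] <= DEDUP_WINDOW_SECONDS:
                entry[2] = payload      # merged: the later operation wins
                break
        else:
            visible.append([t, key, payload])   # past the window: a new row
    return [(key, payload) for _, key, payload in visible]

# Example 1: the update arrives 30s after the insert -> one merged row.
# Example 2: the update arrives 120s after the insert -> two rows.
```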
2.1.3 Operation Processing Support
The BigQuery Handler pushes operations to Google BigQuery using the synchronous API. Insert, update, and delete operations are processed differently in BigQuery than in a traditional RDBMS.
The following explains how insert, update, and delete operations are interpreted by the handler depending on the mode of operation:
- auditLogMode = true
  - insert – Inserts the record with optype as an insert operation in the BigQuery table.
  - update – Inserts the record with optype as an update operation in the BigQuery table.
  - delete – Inserts the record with optype as a delete operation in the BigQuery table.
  - pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sends out a delete operation followed by an insert operation. Both of these rows have the same position in the BigQuery table, which identifies this as a primary key operation rather than a separate delete and insert operation.
- auditLogMode = false
  - insert – If the row does not already exist in Google BigQuery, then an insert operation is processed as an insert. If the row already exists, then the insert operation is processed as an update. The handler sets the deleted column to false.
  - update – If the row does not exist in Google BigQuery, then an update operation is processed as an insert. If the row already exists, then the update operation is processed as an update. The handler sets the deleted column to false.
  - delete – If the row does not exist in Google BigQuery, then a delete operation has no effect. If the row exists, then the delete operation is processed as a delete. The handler sets the deleted column to true.
  - pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sets the deleted column to true for the row whose primary key is updated. This is followed by a separate insert operation with the new primary key, with the deleted column set to false for that row.
Do not toggle the audit log mode, because doing so forces the BigQuery Handler to abend: Google BigQuery cannot alter the schema of an existing table. Delete the existing table before switching audit log modes.
Note: The BigQuery Handler does not support the truncate operation. It abends when it encounters a truncate operation.
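The auditLogMode=false semantics described above can be sketched as a simple upsert model. This is a conceptual illustration only, not the handler's implementation; the function name and row representation are hypothetical, while the deleted-column behavior follows the list above.

```python
# Conceptual model of auditLogMode=false processing: rows are upserted by a
# unique key and a boolean "deleted" column records delete operations.
def apply_op(table, optype, key, image):
    """Apply one source operation to table (a dict: row key -> row dict)."""
    if optype in ("insert", "update"):
        # Missing row -> insert; existing row -> update. Either way the
        # handler sets the deleted column to false.
        table[key] = dict(image, deleted=False)
    elif optype == "delete":
        if key in table:                    # deleting a missing row is a no-op
            table[key]["deleted"] = True
    elif optype == "pkUpdate":              # with pkUpdateHandling=delete-insert
        old_key, new_key = key
        if old_key in table:
            table[old_key]["deleted"] = True        # old key marked deleted
        table[new_key] = dict(image, deleted=False)  # row reappears on new key
    return table
```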
2.1.4 Proxy Settings
To connect to BigQuery using a proxy server, you must configure the proxy host and the proxy port in the properties file as follows:
javawriter.bootoptions= -Dhttps.proxyHost=proxy_host_name -Dhttps.proxyPort=proxy_port_number
2.2 Setting Up and Running the BigQuery Handler
The Google BigQuery Handler uses the Java BigQuery client libraries to connect to BigQuery.
- Group ID: com.google.cloud
- Artifact ID: google-cloud-bigquery
- Version: 1.111.1
The BigQuery client libraries do not ship with Oracle GoldenGate for Big Data. Additionally, Google appears to have removed the direct download link for the BigQuery client libraries. You can download them using Maven and the Maven coordinates listed above; however, this requires proficiency with Maven. If you cannot download the Google BigQuery client libraries, then contact Oracle Support for assistance. See Google BigQuery Dependencies.
- Schema Mapping for BigQuery
- Understanding the BigQuery Handler Configuration
- Review a Sample Configuration
- Configuring Handler Authentication
2.2.1 Schema Mapping for BigQuery
The table schema name specified in the Replicat MAP statement is mapped to the BigQuery dataset name. For example:

map QASOURCE.*, target "dataset_US".*;

This MAP statement replicates tables to the BigQuery dataset dataset_US.
2.2.2 Understanding the BigQuery Handler Configuration
The following are the configurable values for the BigQuery Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the BigQuery Handler, you must first configure the handler type by specifying gg.handler.name.type=bigquery, and then set the other BigQuery properties as follows:
| Properties | Required/Optional | Legal Values | Default | Explanation |
|---|---|---|---|---|
| gg.handlerlist | Required | Any string | None | Provides a name for the BigQuery Handler. The BigQuery Handler name then becomes part of the property names listed in this table. |
| gg.handler.name.type | Required | bigquery | None | Selects the BigQuery Handler for streaming change data capture into Google BigQuery. |
| gg.handler.name.credentialsFile | Optional | Relative or absolute path to the credentials file | None | The credentials file downloaded from Google BigQuery for authentication. If you do not specify the path to the credentials file, then you must set it as an environment variable; see Configuring Handler Authentication. |
| gg.handler.name.projectId | Required | Any string | None | The name of the project in Google BigQuery. The handler needs the project ID to connect to the Google BigQuery store. |
| gg.handler.name.batchSize | Optional | Any number | 500 | The maximum number of operations to be batched together. This applies to all target table batches. |
| gg.handler.name.batchFlushFrequency | Optional | Any number | 1000 | The maximum amount of time, in milliseconds, to wait before executing the next batch of operations. This applies to all target table batches. |
| gg.handler.name.skipInvalidRows | Optional | true \| false | false | Sets whether to insert all valid rows of a request, even if invalid rows exist. If not set, the entire insert request fails if it contains an invalid row. |
| gg.handler.name.ignoreUnknownValues | Optional | true \| false | false | Sets whether to accept rows that contain values that do not match the schema. If not set, rows with unknown values are considered invalid. |
| gg.handler.name.connectionTimeout | Optional | Positive integer | | The maximum amount of time, in milliseconds, to wait for the handler to establish a connection with Google BigQuery. |
| gg.handler.name.readTimeout | Optional | Positive integer | | The maximum amount of time, in milliseconds, to wait for the handler to read data from an established connection. |
| gg.handler.name.metaColumnsTemplate | Optional | A legal string | None | A legal string specifying the metacolumns to be included in the output, for example ${optype}, ${position}. |
| gg.handler.name.auditLogMode | Optional | true \| false | | Set to true to operate in audit log mode, in which every operation is written to BigQuery as a separate row without a unique row key. Set to false to write data with a unique row identification key so that BigQuery can merge operations for the same row. See Operation Modes. |
| gg.handler.name.pkUpdateHandling | Optional | abend \| delete-insert | abend | Sets how the handler handles update operations that change a primary key. Primary key operations can be problematic for the BigQuery Handler and require special consideration. When set to delete-insert, the handler processes a primary key update as a delete of the row with the old key followed by an insert of the row with the new key. |
| gg.handler.name.adjustScale | Optional | true \| false | false | The BigQuery numeric data type supports a maximum scale of 9 digits. If a field is mapped into a BigQuery numeric data type and its scale is larger than 9 digits, the operation fails. Set this property to true to round fields mapped to BigQuery numeric data types to a scale of 9 digits. Enabling this property results in a loss of precision for source data values with a scale larger than 9. |
| gg.handler.name.includeDeletedColumn | Optional | true \| false | false | Set to true to include a boolean column in the output called deleted. The value of this column is set to false for insert and update operations, and to true for delete operations. |
| gg.handler.name.enableAlter | Optional | true \| false | false | Set to true to enable altering the target BigQuery table. This allows the BigQuery Handler to add columns or metacolumns configured on the source that are not currently in the target BigQuery table. |
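The batching behavior described above (a batch of operations is sent when it reaches the maximum operation count, or when the flush interval elapses, whichever comes first) can be sketched as follows. The Batcher class, its parameter names, and the thresholds shown are illustrative, not the handler's actual implementation.

```python
# Illustrative sketch of operation batching: a batch is flushed when it
# reaches the maximum operation count or when the flush interval elapses,
# whichever comes first. Names and defaults here are hypothetical.
import time

class Batcher:
    def __init__(self, max_ops=500, flush_ms=1000, now=time.monotonic):
        self.max_ops = max_ops
        self.flush_s = flush_ms / 1000.0
        self.now = now                      # injectable clock, eases testing
        self.ops = []
        self.last_flush = now()
        self.flushed = []                   # batches "sent" to BigQuery

    def add(self, op):
        self.ops.append(op)
        if len(self.ops) >= self.max_ops or self.now() - self.last_flush >= self.flush_s:
            self.flush()

    def flush(self):
        if self.ops:
            self.flushed.append(self.ops)   # stand-in for a streaming insert
        self.ops = []
        self.last_flush = self.now()
```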
2.2.3 Review a Sample Configuration
The following is a sample configuration for the BigQuery Handler:
gg.handlerlist = bigquery

#The handler properties
gg.handler.bigquery.type = bigquery
gg.handler.bigquery.projectId = festive-athlete-201315
gg.handler.bigquery.credentialsFile = credentials.json
gg.handler.bigquery.auditLogMode = true
gg.handler.bigquery.pkUpdateHandling = delete-insert
gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}
2.2.4 Configuring Handler Authentication
You must configure BigQuery Handler authentication using the credentials in the JSON file downloaded from Google BigQuery.
Download the credentials file:
1. Log in to your Google account at cloud.google.com.
2. Click Console, and then go to the Dashboard, where you can select your project.
3. From the navigation menu, click APIs & Services, and then select Credentials.
4. From the Create Credentials menu, choose Service account key.
5. Choose the JSON key type to download the JSON credentials file for your system.
Once you have the credentials file, you can authenticate the handler in one of these two ways:
- Specify the path to the credentials file in the properties file with the gg.handler.name.credentialsFile configuration property. The path of the credentials file must not include a wildcard; if you include the * wildcard in the path to the credentials file, the file is not recognized.

Or

- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable on your system, and then restart the Oracle GoldenGate manager process. For example:

export GOOGLE_APPLICATION_CREDENTIALS=credentials.json