8.2.3 Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The purpose of the Redshift Event Handler is to apply operations into Redshift tables.


8.2.3.1 Detailed Functionality

Ensure that the Redshift Event handler is used as a downstream Event handler connected to the output of the S3 Event handler. The S3 Event handler loads files generated by the File Writer Handler into Amazon S3.

The Redshift Event handler uses the COPY SQL statement to bulk-load the operation data available in S3 into temporary Redshift staging tables. The staging table data is then used to update the target table. All SQL operations are performed in batches, providing better throughput.
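Conceptually, the stage-and-merge flow resembles the following SQL sketch. The table and column names, S3 object path, file format, and IAM role ARN are illustrative placeholders, and the SQL actually generated by the handler may differ.

-- Sketch: bulk-load staged operation data from S3 into a staging table.
COPY stage_tab
FROM 's3://<s3bucket>/path/to/staged-file'
IAM_ROLE 'arn:aws:iam::<aws_account_id>:role/<role_name>'
FORMAT AS AVRO 'auto';
-- Sketch: merge the staged rows into the target table (delete-then-insert).
DELETE FROM target_tab
USING stage_tab
WHERE target_tab.id = stage_tab.id;
INSERT INTO target_tab SELECT id, col1, col2 FROM stage_tab;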

8.2.3.2 Operation Aggregation

8.2.3.2.1 Aggregation In Memory

Before loading the operation data into S3, the operations in the trail file are aggregated. Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.

Table 8-3 Configuration Properties

Properties Required/Optional Legal Values Default Explanation
gg.aggregate.operations Optional true | false false Aggregate operations based on the primary key of the operation record.
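For example, to enable in-memory operation aggregation, set the following in the Replicat properties file:

gg.aggregate.operations=true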

8.2.3.2.2 Aggregation Using SQL After Loading Data into the Staging Table

With this mode of aggregation, in-memory operation aggregation need not be performed. The operation data loaded into the temporary staging table is aggregated using SQL queries, so that the staging table contains just one row per key.

Table 8-4 Configuration Properties

Properties Required/Optional Legal Values Default Explanation
gg.eventhandler.name.aggregateStagingTableRows Optional true | false false Use SQL to aggregate staging table data before updating the target table.
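The following is a minimal sketch of this style of SQL aggregation, assuming a hypothetical staging table stage_tab with a key column id and a monotonically increasing operation sequence column op_seq; the queries actually generated by the handler may differ.

-- Sketch: keep only the latest staged row per key.
-- stage_tab, id, and op_seq are illustrative names.
DELETE FROM stage_tab
WHERE op_seq < (
    SELECT MAX(s2.op_seq)
    FROM stage_tab s2
    WHERE s2.id = stage_tab.id
);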

8.2.3.3 Unsupported Operations and Limitations

The Redshift Event handler has the following unsupported operations and limitations:

  • DDL changes are not supported.
  • Timestamp and Timestamp with Time Zone data types: the maximum supported precision is microseconds; the nanoseconds portion is truncated. This is a limitation observed with the Redshift COPY SQL.
  • The Redshift COPY SQL limits the maximum size of a single input row from any source to 4 MB.

8.2.3.4 Uncompressed UPDATE records

It is mandatory that the trail files applied to Redshift contain uncompressed UPDATE operation records; that is, UPDATE operations must contain the full image of the row being updated.

If UPDATE records have missing columns, then such columns are updated in the target as null. By setting the parameter gg.abend.on.missing.columns=true, replicat can fail fast on detecting a compressed update trail record. This is the recommended setting.
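For example, in the Replicat properties file:

gg.abend.on.missing.columns=true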

8.2.3.5 Error During the Data Load Process

Staging the operation data from AWS S3 into temporary staging tables and updating the target table occur inside a single transaction. In case of any errors, the entire transaction is rolled back and the replicat process will ABEND.

If there are errors with the COPY SQL, then the Redshift system table stl_load_errors is also queried, and the error traces are made available in the handler log file.
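You can also query stl_load_errors directly to inspect recent load failures. A minimal example (the column list is abbreviated):

SELECT starttime, filename, line_number, colname, err_code, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;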

8.2.3.6 Troubleshooting and Diagnostics

  • Connectivity issues to Redshift
    • Validate JDBC connection URL, user name and password.
    • Check if an HTTP/HTTPS proxy is enabled. Generally, Redshift endpoints cannot be accessed through a proxy.
  • DDL and Truncate operations not applied on the target table: The Redshift handler will ignore DDL and truncate records in the source trail file.
  • Target table existence: It is expected that the Redshift target table exists before starting the apply process. Target tables need to be designed with primary keys, sort keys, and partition distribution key columns. Approximations based on the column metadata in the trail file may not always be correct. Therefore, Redshift apply will ABEND if the target table is missing.
  • Operation aggregation in memory (gg.aggregate.operations=true) is memory-intensive, whereas operation aggregation using SQL (gg.eventhandler.name.aggregateStagingTableRows=true) requires more SQL processing on the Redshift database. These configurations are mutually exclusive, and only one of them should be enabled at a time. Tests within Oracle have shown that in-memory operation aggregation delivers a better apply rate, but this may not be the case in all customer deployments.
  • Diagnostic information on the apply process is logged onto the handler log file.
    • Operation aggregation time (in milliseconds) in-memory:

INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Merge statistics ********START*********************************
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Number of update operations merged into an existing update operation: [232653]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Time spent aggregating operations : [22064]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Time spent flushing aggregated operations : [36382]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Merge statistics ********END***********************************
  • Stage and load processing time (in milliseconds) for SQL queries:

INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Stage and load statistics ********START*********************************
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Time spent for staging process [277093]
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Time spent for load process [32650]
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Stage and load statistics ********END***********************************
  • Stage time (in milliseconds) will also include additional statistics if operation aggregation using SQL is enabled.
  • Co-location of the components: The region of the machine where the replicat process runs, the AWS S3 bucket region, and the Redshift cluster region all affect the overall throughput of the apply process. The data flow is as follows: GoldenGate => AWS S3 => AWS Redshift. For best throughput, locate these components as close to each other as possible.

8.2.3.7 Classpath

Redshift apply relies on the upstream File Writer handler and the S3 Event handler.

Include the JAR files required to run the S3 Event handler in gg.classpath. See Amazon S3. The Redshift Event handler uses the Redshift JDBC driver. Ensure that the driver JAR file is included in gg.classpath, as shown in the following example:

gg.classpath=aws-java-sdk-1.11.356/lib/*:aws-java-sdk-1.11.356/third-party/lib/*:./RedshiftJDBC42-no-awssdk-1.2.8.1005.jar

8.2.3.8 Configuration

Automatic Configuration

AWS Redshift data warehouse replication involves configuring multiple components, such as the File Writer handler, the S3 Event handler, and the Redshift Event handler. The Automatic Configuration feature auto-configures these components so that you need to perform minimal configuration. The properties modified by auto-configuration are also logged in the handler log file.

To enable auto-configuration to replicate to a Redshift target, set the parameter gg.target=redshift.

gg.target
Required
Legal Value: redshift
Default: None
Explanation: Enables replication to a Redshift target.

When replicating to a Redshift target, customization of the S3 event handler name and the Redshift event handler name is not allowed.

File Writer Handler Configuration

The File Writer handler name is pre-set to the value redshift. The following is an example of editing a property of the File Writer handler: gg.handler.redshift.pathMappingTemplate=./dirout

S3 Event Handler Configuration

The S3 event handler name is pre-set to the value s3. The following is an example of editing a property of the S3 event handler: gg.eventhandler.s3.bucketMappingTemplate=bucket1

Redshift Event Handler Configuration

The Redshift event handler name is pre-set to the value redshift.

Table 8-5 Properties

Properties Required/Optional Legal Value Default Explanation
gg.eventhandler.redshift.connectionURL Required Redshift JDBC Connection URL None Sets the Redshift JDBC connection URL. Example: jdbc:redshift://aws-redshift-instance.cjoaij3df5if.us-east-2.redshift.amazonaws.com:5439/mydb

gg.eventhandler.redshift.UserName Required JDBC User Name None Sets the Redshift database user name.
gg.eventhandler.redshift.Password Required JDBC Password None Sets the Redshift database password.
gg.eventhandler.redshift.awsIamRole Optional AWS role ARN in the format: arn:aws:iam::<aws_account_id>:role/<role_name> None AWS IAM role ARN that the Redshift cluster uses for authentication and authorization for executing COPY SQL to access objects in AWS S3 buckets.
gg.eventhandler.redshift.useAwsSecurityTokenService Optional true | false Value is set from the configuration property of the upstream S3 Event handler: gg.eventhandler.s3.enableSTS. Use the AWS Security Token Service for authorization. For more information, see Redshift COPY SQL Authorization.
gg.eventhandler.redshift.awsSTSEndpoint Optional A valid HTTPS URL. Value is set from the configuration property of the upstream S3 Event handler: gg.eventhandler.s3.stsURL. The AWS STS endpoint string. For example: https://sts.us-east-1.amazonaws.com. For more information, see Redshift COPY SQL Authorization.
gg.eventhandler.redshift.awsSTSRegion Optional A valid AWS region. Value is set from the configuration property of the upstream S3 Event handler: gg.eventhandler.s3.stsRegion. The AWS STS region. For example: us-east-1. For more information, see Redshift COPY SQL Authorization.
gg.initialLoad Optional true | false false If set to true, initial load mode is enabled. See INSERTALLRECORDS Support.
gg.operation.aggregator.validate.keyupdate Optional true | false false If set to true, the Operation Aggregator validates key update operations (optype 115) and corrects them to normal updates if no key values have changed. Compressed key update operations do not qualify for merge.

End-to-End Configuration

The following is an end-to-end configuration example that uses auto-configuration for the File Writer handler and the S3 and Redshift Event handlers.

The sample properties are available at the following locations:

  • In an Oracle GoldenGate Classic install: <oggbd_install_dir>/AdapterExamples/big-data/redshift-via-s3/rs.props
  • In an Oracle GoldenGate Microservices install: <oggbd_install_dir>/opt/AdapterExamples/big-data/redshift-via-s3/rs.props
# Configuration to load GoldenGate trail operation records
# into Amazon Redshift by chaining
# File writer handler -> S3 Event handler -> Redshift Event handler. 
# Note: It is recommended to edit only the configuration marked as TODO
gg.target=redshift
#The S3 Event Handler
#TODO: Edit the AWS region
gg.eventhandler.s3.region=<aws region>
#TODO: Edit the AWS S3 bucket 
gg.eventhandler.s3.bucketMappingTemplate=<s3bucket>
#The Redshift Event Handler
#TODO: Edit ConnectionUrl
gg.eventhandler.redshift.connectionURL=jdbc:redshift://aws-redshift-instance.cjoaij3df5if.us-east-2.redshift.amazonaws.com:5439/mydb
#TODO: Edit Redshift user name
gg.eventhandler.redshift.UserName=<db user name>
#TODO: Edit Redshift password
gg.eventhandler.redshift.Password=<db password>
#TODO: Set the classpath to include the AWS Java SDK and Redshift JDBC driver.
gg.classpath=aws-java-sdk-1.11.356/lib/*:aws-java-sdk-1.11.356/third-party/lib/*:./RedshiftJDBC42-no-awssdk-1.2.8.1005.jar
jvm.bootoptions=-Xmx8g -Xms32m

8.2.3.9 INSERTALLRECORDS Support

Stage and merge targets support the INSERTALLRECORDS parameter.

See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm).

Setting this parameter directs the Replicat process to use bulk insert operations to load operation data into the target table. You can tune the batch size of the bulk inserts using the File Writer property gg.handler.redshift.maxFileSize; the default value is 1GB. The frequency of the bulk inserts can be tuned using the File Writer property gg.handler.redshift.fileRollInterval; the default value is 3m (three minutes).
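The following is a minimal sketch of a Replicat parameter file with INSERTALLRECORDS enabled; the Replicat name, properties file path, and schema names are illustrative.

REPLICAT rs
-- Sketch only: the Replicat name, properties file, and MAP statement are illustrative.
TARGETDB LIBFILE libggjava.so SET property=dirprm/rs.props
INSERTALLRECORDS
MAP source_schema.*, TARGET target_schema.*;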


8.2.3.10 Redshift COPY SQL Authorization

The Redshift event handler uses COPY SQL to read staged files in Amazon Web Services (AWS) S3 buckets. The COPY SQL query may need authorization credentials to access files in AWS S3.

Authorization can be provided by using an AWS Identity and Access Management (IAM) role that is attached to the Redshift cluster, or by providing an AWS access key and the secret for that access key. As a security consideration, it is best practice to use role-based access when possible.

AWS Key-Based Authorization

With key-based access control, you provide the access key ID and secret access key for an AWS IAM user that is authorized to access AWS S3. The access key ID and secret access key are retrieved by looking up the credentials as follows:

  1. Environment variables - AWS_ACCESS_KEY/AWS_ACCESS_KEY_ID and AWS_SECRET_KEY/AWS_SECRET_ACCESS_KEY.
  2. Java System Properties - aws.accessKeyId and aws.secretKey.
  3. Credential profiles file at the default location (~/.aws/credentials).
  4. Amazon Elastic Container Service (ECS) container credentials loaded from Amazon ECS if the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set.
  5. Instance profile credentials retrieved from Amazon Elastic Compute Cloud (EC2) metadata service.
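For example, the Java system properties listed in item 2 can be supplied through jvm.bootoptions in the Replicat properties file (a sketch; the key values are placeholders):

jvm.bootoptions=-Xmx8g -Daws.accessKeyId=<access key id> -Daws.secretKey=<secret key>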

Running Replicat on an AWS EC2 Instance

If the replicat process is started on an AWS EC2 instance, then the access key ID and secret access key are automatically retrieved by Oracle GoldenGate for Big Data, and no explicit user configuration is required.

Temporary Security Credentials using AWS Security Token Service (STS)

If you use key-based access control, you can further limit the access users have to your data by retrieving temporary security credentials using the AWS Security Token Service (STS). The auto-configure feature of the Redshift event handler automatically picks up the AWS STS configuration from the S3 event handler.

Table 8-6 S3 Event Handler Configuration and Redshift Event Handler Configuration

S3 Event Handler Configuration Redshift Event Handler Configuration
enableSTS useAwsSecurityTokenService
stsURL awsSTSEndpoint
stsRegion awsSTSRegion
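For example, if STS is enabled on the S3 event handler as follows, the Redshift event handler inherits the equivalent settings automatically (the endpoint and region values are placeholders):

gg.eventhandler.s3.enableSTS=true
gg.eventhandler.s3.stsURL=https://sts.us-east-1.amazonaws.com
gg.eventhandler.s3.stsRegion=us-east-1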

AWS IAM Role-based Authorization

With role-based authorization, the Redshift cluster temporarily assumes an IAM role when executing the COPY SQL. You need to provide the role Amazon Resource Name (ARN) as a configuration value using gg.eventhandler.redshift.awsIamRole. For example: gg.eventhandler.redshift.awsIamRole=arn:aws:iam::<aws_account_id>:role/<role_name>. The role needs to be authorized to read the respective S3 bucket. Ensure that the trust relationship of the role contains the AWS Redshift service, and attach the role to the Redshift cluster before starting the cluster. The following is an example AWS IAM policy that can be used in the trust relationship of the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "redshift.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
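The role can be attached to the cluster using the AWS console or the AWS CLI. The following is a sketch using the AWS CLI; the cluster identifier and role ARN are placeholders:

aws redshift modify-cluster-iam-roles \
    --cluster-identifier <cluster_identifier> \
    --add-iam-roles arn:aws:iam::<aws_account_id>:role/<role_name>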

If role-based authorization is configured (gg.eventhandler.redshift.awsIamRole), then it is given priority over key-based authorization.

8.2.3.11 Co-ordinated Apply Support

To enable co-ordinated apply for Redshift, ensure that the Redshift database's isolation level is set to SNAPSHOT. The Redshift SNAPSHOT ISOLATION option allows higher concurrency, where concurrent modifications to different rows in the same table can complete successfully.

SQL Query to Alter the Database's Isolation Level

ALTER DATABASE <sampledb> ISOLATION LEVEL SNAPSHOT;