8.4 Realtime Parquet Ingestion into AWS S3 buckets with Oracle GoldenGate for Distributed Applications and Analytics

Overview

This Quickstart shows, step by step, how to ingest Parquet files into AWS S3 buckets in real time with Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA).

Amazon Simple Storage Service (Amazon S3) is an object storage service provided by Amazon Web Services.

The GG for DAA S3 Event Handler works in conjunction with the File Writer Handler and, if Parquet output is required, the Parquet Event Handler. The File Writer Handler produces files locally, the Parquet Event Handler optionally converts them to Parquet format, and the S3 Event Handler loads them into S3 buckets.

8.4.1 Prerequisites

To successfully complete this Quickstart, you must have the following:

This Quickstart uses a sample trail file (named tr) that is shipped with GG for DAA. To follow along with the sample trail file, find it at GG_HOME/opt/AdapterExamples/trail/ in your GG for DAA instance.

8.4.2 Install Dependency Files

Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) uses client libraries in the replication process, and these libraries must be downloaded before you set up the replication process. You can use the Dependency Downloader to download the client libraries. The Dependency Downloader is a set of shell scripts that download dependency jar files from Maven and other repositories.

GG for DAA uses a three-step process to ingest Parquet into S3 buckets:

  • Generating local files from trail files
  • Converting local files to Parquet format
  • Loading files into AWS S3 buckets

To generate local Parquet files, the Replicat uses the File Writer Handler and the Parquet Event Handler. To load the Parquet files into AWS S3, GG for DAA uses the S3 Event Handler in conjunction with the File Writer Handler and the Parquet Event Handler.
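The chaining described above can be sketched in a few lines. The following is a minimal illustration, not GG for DAA code: it follows the eventHandler properties used later in this Quickstart to show which handler hands files to which. The function name handler_chain is hypothetical.

```python
def handler_chain(props: dict) -> list:
    """Follow the eventHandler links starting from the first handler in gg.handlerlist."""
    first = props["gg.handlerlist"].split(",")[0].strip()
    chain = [first]
    # The File Writer Handler names its event handler under gg.handler.<name>.
    nxt = props.get(f"gg.handler.{first}.eventHandler")
    while nxt and nxt not in chain:  # stop on a missing link or a cycle
        chain.append(nxt)
        # Event handlers chain to each other under gg.eventhandler.<name>.
        nxt = props.get(f"gg.eventhandler.{nxt}.eventHandler")
    return chain


# Only the three linking properties from the Quickstart properties file.
props = {
    "gg.handlerlist": "filewriter",
    "gg.handler.filewriter.eventHandler": "parquet",
    "gg.eventhandler.parquet.eventHandler": "s3",
}
print(" -> ".join(handler_chain(props)))  # filewriter -> parquet -> s3
```

Reading the chain this way makes it easier to see why each stage needs its own client libraries.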

GG for DAA uses three different sets of client libraries to create Parquet files and load them into AWS S3. To download them:
  1. In your GG for DAA VM, go to Dependency Downloader utility. It is located at: GG_HOME/opt/DependencyDownloader/
  2. Execute parquet.sh, hadoop.sh, and aws.sh with the required versions.

    Figure 8-19 Executing parquet.sh, hadoop.sh, and aws.sh with the required versions
  3. Three directories are created in GG_HOME/opt/DependencyDownloader/dependencies. Note the directories. For example:
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/aws_sdk_1.12.309/*
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.3.0/*
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/parquet_1.12.3/*
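
The noted directories are later joined into a single gg.classpath value. As a rough sketch, assuming the dependencies directory contains only the directories created by the downloader scripts, the value could be assembled as follows. The helper name build_classpath is hypothetical, and the root path is the example location shown above.

```python
from pathlib import Path


def build_classpath(dependencies_root: str) -> str:
    """Join every dependency directory under the root into a gg.classpath value."""
    root = Path(dependencies_root)
    # Assumption: each downloader script leaves one directory per library set.
    dirs = sorted(p for p in root.iterdir() if p.is_dir())
    # The trailing /* is the Java classpath wildcard for all jars in a directory.
    return ":".join(f"{d}/*" for d in dirs)


if __name__ == "__main__":
    # Example location from this Quickstart; replace with your own GG_HOME.
    root = "/u01/app/ogg/opt/DependencyDownloader/dependencies"
    if Path(root).is_dir():
        print("gg.classpath=" + build_classpath(root))
```

If the dependencies directory also holds libraries for other targets, pass only the three directories you need instead of scanning the whole root.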

8.4.3 Create a Replicat in Oracle GoldenGate for Distributed Applications and Analytics

To create a replicat in Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA):

  1. In the Oracle GoldenGate for Distributed Applications and Analytics UI, in the Administration Service tab, click the + sign to add a replicat.

    Figure 8-20 Click the Administration Service tab

  2. Select Classic Replicat as the Replicat Type and click Next. Two Replicat types are available: Classic and Coordinated. Classic Replicat is a single-threaded process, whereas Coordinated Replicat is a multithreaded process that applies transactions in parallel.

    Figure 8-21 Add Replicat

  3. Enter the Replicat information, and click Next:
    1. Replicat Trail: Name of the required trail file. For sample trail, enter tr.
    2. Target: Amazon S3

    Figure 8-22 Enter Replicat Details

  4. Leave Managed Options as is and click Next.

    Figure 8-23 Add Replicat - Managed Options

  5. Enter Parameter File details and click Next. In the Parameter File, you can either specify source to target mapping or leave it as-is with a wildcard selection. If Coordinated Replicat is selected as the Replicat Type, then an additional parameter needs to be provided: TARGETDB LIBFILE libggjava.so SET property=<ggbd-deployment_home>/etc/conf/ogg/your_replicat_name.properties

    Figure 8-24 Parameter File

  6. In the Properties File, remove all the pre-configured properties except the first row, which is marked with the replicat name (# Properties file for Replicat <replicat_name>). Copy and paste the following property list into the properties file, update the properties marked as #TODO, and click Create and Run.
    #The File Writer Handler – no need to change
    gg.handlerlist=filewriter
    gg.handler.filewriter.type=filewriter
    gg.handler.filewriter.mode=op
    gg.handler.filewriter.pathMappingTemplate=./dirout
    gg.handler.filewriter.stateFileDirectory=./dirsta
    gg.handler.filewriter.fileRollInterval=7m
    gg.handler.filewriter.inactivityRollInterval=5s
    gg.handler.filewriter.fileWriteActiveSuffix=.tmp
    gg.handler.filewriter.finalizeAction=delete
    
    
    ### Avro OCF – no need to change
    gg.handler.filewriter.format=avro_row_ocf
    gg.handler.filewriter.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.avro
    gg.handler.filewriter.format.pkUpdateHandling=delete-insert
    gg.handler.filewriter.format.metaColumnsTemplate=${optype},${position}
    gg.handler.filewriter.format.iso8601Format=false
    gg.handler.filewriter.partitionByTable=true
    gg.handler.filewriter.rollOnShutdown=true
    
    #The Parquet Event Handler – no need to change
    gg.handler.filewriter.eventHandler=parquet
    gg.eventhandler.parquet.type=parquet
    gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
    gg.eventhandler.parquet.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.parquet
    gg.eventhandler.parquet.writeToHDFS=false
    gg.eventhandler.parquet.finalizeAction=delete
    
    #Select S3 Event Handler – no need to change
    gg.eventhandler.parquet.eventHandler=s3
    
    #TODO Set S3 Event Handler- please update as needed
    gg.eventhandler.s3.type=s3
    gg.eventhandler.s3.region=<your-aws-region>
    gg.eventhandler.s3.bucketMappingTemplate=<target_s3_bucket_name>
    gg.eventhandler.s3.pathMappingTemplate=ogg/data/${fullyQualifiedTableName}
    gg.eventhandler.s3.accessKeyId=<provide_key>
    gg.eventhandler.s3.secretKey=<provide_secret>
    
    #TODO Set the classpath to the dependency directories noted in Install Dependency Files
    gg.classpath=path_to/aws_sdk_1.12.309/*:path_to/hadoop_3.3.0/*:path_to/parquet_1.12.3/*
    jvm.bootoptions=-Xmx512m -Xms32m
    
  7. If the Replicat starts successfully, it is in the running state. Go to Replicats/Statistics to see the replication statistics.

    Figure 8-25 Replication Statistics

  8. Go to the AWS S3 console and check the files.

    Figure 8-26 AWS S3 console

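
Before clicking Create and Run in step 6, it can help to confirm that no placeholder from the property list was left unfilled. The following is a minimal sketch of such a check, not part of GG for DAA. It assumes placeholders look like <...> or start with path_to/, as they do in the listing above. The function names and the bucket name my-bucket are hypothetical.

```python
import re

# Placeholders in this Quickstart are written as <...> or begin with path_to/.
PLACEHOLDER = re.compile(r"<[^<>]+>|path_to/")


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values such as -Xmx512m stay intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def unresolved(props: dict) -> list:
    """Return the property names whose values still contain a placeholder."""
    return sorted(k for k, v in props.items() if PLACEHOLDER.search(v))


sample = """
# Properties file for Replicat <replicat_name>
gg.eventhandler.s3.type=s3
gg.eventhandler.s3.region=<your-aws-region>
gg.eventhandler.s3.bucketMappingTemplate=my-bucket
gg.classpath=path_to/aws_sdk_1.12.309/*
"""
print(unresolved(parse_properties(sample)))
# ['gg.classpath', 'gg.eventhandler.s3.region']
```

Template keywords such as ${fullyQualifiedTableName} are deliberately not flagged, because they are resolved at run time.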

Note:

  • If the target S3 bucket does not exist, then it is created automatically by GG for DAA. You can use Template Keywords to dynamically assign the bucket names.
  • The S3 Event Handler can be configured to use a proxy server. For more information, see S3 Event Handler.
  • You can use different properties to control the behaviour of file writing, such as file sizes and inactivity periods. For more details, see the File Writer blog post.
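
To illustrate how template keywords shape the object names seen in the S3 console, the sketch below resolves the two templates from the properties file with Python's string.Template, which uses the same ${name} syntax. This only approximates the handler's behaviour. The group name, table name, and timestamp format are hypothetical example values, and joining the path and file name with a slash is an assumption.

```python
from string import Template


def resolve(template: str, **keywords: str) -> str:
    """Substitute ${keyword} placeholders; unused keywords are ignored."""
    return Template(template).substitute(**keywords)


# Hypothetical example values; the real values come from the Replicat at run time.
keywords = {
    "groupName": "RS3",
    "fullyQualifiedTableName": "QASOURCE.TCUSTMER",
    "currentTimestamp": "2024-01-01_00-00-00.000",
}

# Templates copied from the properties file in this Quickstart.
path = resolve("ogg/data/${fullyQualifiedTableName}", **keywords)
name = resolve(
    "${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.parquet",
    **keywords,
)

# Assumption: the object key is the resolved path followed by the file name.
key = f"{path}/{name}"
print(key)
# ogg/data/QASOURCE.TCUSTMER/RS3_QASOURCE.TCUSTMER_2024-01-01_00-00-00.000.parquet
```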