7.5 Realtime Parquet Ingestion into AWS S3 Buckets with Oracle GoldenGate for Big Data

This topic provides a step-by-step process that shows how to ingest Parquet files into AWS S3 buckets in real time with Oracle GoldenGate for Big Data.

7.5.1 Install Dependency Files

Oracle GoldenGate for Big Data uses client libraries in the replication process. You need to download these libraries by using the Dependency Downloader utility available in Oracle GoldenGate for Big Data before setting up the replication process. The Dependency Downloader is a set of shell scripts that download dependency JAR files from Maven and other repositories.

Oracle GoldenGate for Big Data uses a three-step process to ingest Parquet files into S3 buckets:
  • Generating local files from trail files (these local files are not accessible in OCI GoldenGate)
  • Converting the local files to Parquet format
  • Loading the files into AWS S3 buckets
To generate the local Parquet files, the Replicat uses the File Writer Handler (see Flat Files) and the Parquet Event Handler (see Parquet). To load the Parquet files into AWS S3, Oracle GoldenGate for Big Data uses the S3 Event Handler (see Amazon S3) in conjunction with the File Writer Handler and the Parquet Event Handler.
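
In the Replicat properties, this three-step pipeline is expressed by chaining the handlers. The following lines are excerpted from the full configuration in Create a Replicat in Oracle GoldenGate for Big Data:

    # The File Writer Handler generates local files, the Parquet Event Handler
    # converts them to Parquet, and the S3 Event Handler uploads the result.
    gg.handlerlist=filewriter
    gg.handler.filewriter.eventHandler=parquet
    gg.eventhandler.parquet.eventHandler=s3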

Oracle GoldenGate for Big Data uses three different sets of client libraries to create Parquet files and load them into AWS S3.

To install the required dependency files:

  1. Go to the installation location of the Dependency Downloader: GG_HOME/opt/DependencyDownloader/.
  2. Execute parquet.sh (see Parquet), hadoop.sh (see HDFS Event Handler), and aws.sh with the required versions, as shown in the example after this procedure.
    A directory is created in GG_HOME/opt/DependencyDownloader/dependencies. For example, /u01/app/ogg/opt/DependencyDownloader/dependencies/aws_sdk_1.12.309.

    Figure 7-26 Executing parquet.sh, hadoop.sh, and aws.sh with the required versions.

    The following directories are created in GG_HOME/opt/DependencyDownloader/dependencies:
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/aws_sdk_1.12.309
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.3.0
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/parquet_1.12.3

    Figure 7-27 S3 Directories

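For example, the following commands download the dependency versions listed above. This is a sketch that assumes each script accepts the version number as its single argument; run a script without arguments to check its usage before relying on this form.

    # Run from the Dependency Downloader installation location (step 1)
    cd GG_HOME/opt/DependencyDownloader
    ./parquet.sh 1.12.3      # Parquet client libraries
    ./hadoop.sh 3.3.0        # Hadoop client libraries
    ./aws.sh 1.12.309        # AWS SDK for Java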

7.5.2 Create a Replicat in Oracle GoldenGate for Big Data

To create a Replicat in Oracle GoldenGate for Big Data:

  1. In the Oracle GoldenGate for Big Data UI, in the Administration Service tab, click the + sign to add a Replicat.

    Figure 7-28 Click + in the Administration Service tab.

  2. Select the Replicat Type and click Next.

    There are two different Replicat types here: Classic and Coordinated. Classic Replicat is a single-threaded process, whereas Coordinated Replicat is a multithreaded process that applies transactions in parallel.

    Figure 7-29 Select the Replicat Type and click Next.

  3. Enter the basic information, and click Next:
    1. Process Name: Name of the Replicat
    2. Trail Name: Name of the required trail file
    3. Target: Do not fill this field.

      Figure 7-30 Process Name, Trail Name, and Target Names

  4. Enter the Parameter File details and click Next. In the parameter file, you can either specify the source-to-target mapping or leave it as is with a wildcard selection (a sample parameter file is shown after these steps). If you select Coordinated Replicat as the Replicat Type, then provide the following additional parameter: TARGETDB LIBFILE libggjava.so SET property=<ggbd-deployment_home>/etc/conf/ogg/your_replicat_name.properties

    Figure 7-31 Provide Parameter File details and click Next.

  5. Copy and paste the following property list into the properties file, update as needed, and click Create and Run.
    #The File Writer Handler – no need to change
    gg.handlerlist=filewriter
    gg.handler.filewriter.type=filewriter
    gg.handler.filewriter.mode=op
    gg.handler.filewriter.pathMappingTemplate=./dirout
    gg.handler.filewriter.stateFileDirectory=./dirsta
    gg.handler.filewriter.fileRollInterval=7m
    gg.handler.filewriter.inactivityRollInterval=30s
    gg.handler.filewriter.fileWriteActiveSuffix=.tmp
    gg.handler.filewriter.finalizeAction=delete
    
    ### Avro OCF – no need to change
    gg.handler.filewriter.format=avro_row_ocf
    gg.handler.filewriter.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.avro
    gg.handler.filewriter.format.pkUpdateHandling=delete-insert
    gg.handler.filewriter.format.metaColumnsTemplate=${optype},${position}
    gg.handler.filewriter.format.iso8601Format=false
    gg.handler.filewriter.partitionByTable=true
    gg.handler.filewriter.rollOnShutdown=true
    
    #The Parquet Event Handler – no need to change
    gg.handler.filewriter.eventHandler=parquet
    gg.eventhandler.parquet.type=parquet
    gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
    gg.eventhandler.parquet.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.parquet
    gg.eventhandler.parquet.writeToHDFS=false
    gg.eventhandler.parquet.finalizeAction=delete
    
    #TODO Select S3 Event Handler – no need to change
    gg.eventhandler.parquet.eventHandler=s3
    
    #TODO Set S3 Event Handler- please update as needed
    gg.eventhandler.s3.type=s3
    gg.eventhandler.s3.region=your-aws-region
    gg.eventhandler.s3.bucketMappingTemplate=your-target-s3-bucket-name
    gg.eventhandler.s3.pathMappingTemplate=ogg/data/${fullyQualifiedTableName}
    gg.eventhandler.s3.accessKeyId=XXXXXXXXX
    gg.eventhandler.s3.secretKey=XXXXXXXX
    
    #TODO Set the classpath to the Parquet client libraries and the Hadoop client with the paths you noted in step 1
    gg.classpath=
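    #Example (assumed layout): point gg.classpath at the dependency
    #directories created in step 1, including the AWS SDK:
    #gg.classpath=/u01/app/ogg/opt/DependencyDownloader/dependencies/parquet_1.12.3/*:/u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.3.0/*:/u01/app/ogg/opt/DependencyDownloader/dependencies/aws_sdk_1.12.309/*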
    jvm.bootoptions=-Xmx512m -Xms32m
    
  6. If the Replicat starts successfully, then it is in the Running state. Go to Action, then Details, and then Statistics to view the replication statistics.

    Figure 7-32 Replication Statistics

  7. Go to the AWS S3 console and check the bucket.

    Figure 7-33 S3 Bucket

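The following is a minimal sketch of a Replicat parameter file for this procedure. The process name PARQ is hypothetical; replace it with your own Replicat name, narrow the wildcard mapping if needed, and add the TARGETDB LIBFILE parameter from step 4 if you selected Coordinated Replicat:

    REPLICAT PARQ
    -- Wildcard mapping: replicate all source tables as is
    MAP *.*, TARGET *.*;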

Note:

  • If the target S3 bucket does not exist, then Oracle GoldenGate for Big Data creates it automatically. Use Template Keywords to dynamically assign S3 bucket names (see the example after these notes).
  • S3 Event Handler can be configured for the Proxy server. For more information, see Amazon S3.
  • You can use different properties to control the file-writing behavior, such as file sizes and inactivity periods. For more information, see the blog: Oracle GoldenGate for Big Data File Writer Handler Behavior.
  • For Kafka connection issues, see Oracle Support.
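
For example, the following hypothetical setting uses the ${groupName} template keyword to derive the bucket name at runtime. S3 bucket names must be lowercase, so make sure the resolved value is a valid bucket name:

    # Hypothetical: one bucket per Replicat group, for example ogg-parq
    gg.eventhandler.s3.bucketMappingTemplate=ogg-${groupName}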