7.5 Realtime Parquet Ingestion into AWS S3 Buckets with Oracle GoldenGate for Big Data

This topic provides a step-by-step process that shows how to ingest Parquet files into AWS S3 buckets in real time with Oracle GoldenGate for Big Data.

7.5.1 Install Dependency Files

Oracle GoldenGate for Big Data uses client libraries in the replication process. You need to download these libraries by using the Dependency Downloader utility available in Oracle GoldenGate for Big Data before setting up the replication process. The Dependency Downloader is a set of shell scripts that download dependency JAR files from Maven and other repositories.

Oracle GoldenGate for Big Data uses a three-step process to ingest Parquet files into S3 buckets:
  • Generating local files from trail files (these local files are not accessible in OCI GoldenGate)
  • Converting the local files to Parquet format
  • Loading the files into AWS S3 buckets
To generate the local Parquet files, the Replicat uses the File Writer Handler (see Flat Files) and the Parquet Event Handler (see Parquet). To load the Parquet files into AWS S3, Oracle GoldenGate for Big Data uses the S3 Event Handler (see Amazon S3) in conjunction with the File Writer Handler and the Parquet Event Handler.
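
In the Replicat properties, this three-step pipeline is expressed by chaining the handlers. The following lines are excerpted from the full configuration in Create a Replicat in Oracle GoldenGate for Big Data:

    # The File Writer Handler generates local files, the Parquet Event Handler
    # converts them to Parquet, and the S3 Event Handler uploads the result.
    gg.handlerlist=filewriter
    gg.handler.filewriter.eventHandler=parquet
    gg.eventhandler.parquet.eventHandler=s3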

Oracle GoldenGate for Big Data uses three different sets of client libraries to create Parquet files and load them into AWS S3.

To install the required dependency files:

  1. Go to the installation location of the Dependency Downloader: GG_HOME/opt/DependencyDownloader/.
  2. Execute parquet.sh (see Parquet), hadoop.sh (see HDFS Event Handler), and aws.sh with the required versions, as shown in the example after this procedure.
    A directory is created in GG_HOME/opt/DependencyDownloader/dependencies. For example, /u01/app/ogg/opt/DependencyDownloader/dependencies/aws_sdk_1.12.309.

    Figure 7-26 Executing parquet.sh, hadoop.sh, and aws.sh with the required versions.

    The following directories are created in GG_HOME/opt/DependencyDownloader/dependencies:
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/aws_sdk_1.12.309
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.3.0
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/parquet_1.12.3

    Figure 7-27 S3 Directories

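For example, the following commands download the dependency versions listed above. This is a sketch that assumes each script accepts the version number as its single argument; run a script without arguments to check its usage before relying on this form.

    # Run from the Dependency Downloader installation location (step 1)
    cd GG_HOME/opt/DependencyDownloader
    ./parquet.sh 1.12.3      # Parquet client libraries
    ./hadoop.sh 3.3.0        # Hadoop client libraries
    ./aws.sh 1.12.309        # AWS SDK for Java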

7.5.2 Create a Replicat in Oracle GoldenGate for Big Data

To create a Replicat in Oracle GoldenGate for Big Data:

  1. In the Oracle GoldenGate for Big Data UI, in the Administration Service tab, click the + sign to add a Replicat.

    Figure 7-28 Click + in the Administration Service tab.

  2. Select the Replicat Type and click Next.

    There are two different Replicat types here: Classic and Coordinated. Classic Replicat is a single-threaded process, whereas Coordinated Replicat is a multithreaded process that applies transactions in parallel.

    Figure 7-29 Select the Replicat Type and click Next.

  3. Enter the basic information, and click Next:
    1. Process Name: Name of the Replicat
    2. Trail Name: Name of the required trail file
    3. Target: Do not fill this field.

      Figure 7-30 Process Name, Trail Name, and Target Names

  4. Enter the Parameter File details and click Next. In the parameter file, you can either specify the source-to-target mapping or leave it as is with a wildcard selection (a sample parameter file is shown after these steps). If you select Coordinated Replicat as the Replicat Type, then provide the following additional parameter: TARGETDB LIBFILE libggjava.so SET property=<ggbd-deployment_home>/etc/conf/ogg/your_replicat_name.properties

    Figure 7-31 Provide Parameter File details and click Next.

  5. Copy and paste the following property list into the properties file, update as needed, and click Create and Run.
    #The File Writer Handler – no need to change
    gg.handlerlist=filewriter
    gg.handler.filewriter.type=filewriter
    gg.handler.filewriter.mode=op
    gg.handler.filewriter.pathMappingTemplate=./dirout
    gg.handler.filewriter.stateFileDirectory=./dirsta
    gg.handler.filewriter.fileRollInterval=7m
    gg.handler.filewriter.inactivityRollInterval=30s
    gg.handler.filewriter.fileWriteActiveSuffix=.tmp
    gg.handler.filewriter.finalizeAction=delete
    
    ### Avro OCF – no need to change
    gg.handler.filewriter.format=avro_row_ocf
    gg.handler.filewriter.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.avro
    gg.handler.filewriter.format.pkUpdateHandling=delete-insert
    gg.handler.filewriter.format.metaColumnsTemplate=${optype},${position}
    gg.handler.filewriter.format.iso8601Format=false
    gg.handler.filewriter.partitionByTable=true
    gg.handler.filewriter.rollOnShutdown=true
    
    #The Parquet Event Handler – no need to change
    gg.handler.filewriter.eventHandler=parquet
    gg.eventhandler.parquet.type=parquet
    gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
    gg.eventhandler.parquet.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.parquet
    gg.eventhandler.parquet.writeToHDFS=false
    gg.eventhandler.parquet.finalizeAction=delete
    
    #TODO Select S3 Event Handler – no need to change
    gg.eventhandler.parquet.eventHandler=s3
    
    #TODO Set S3 Event Handler- please update as needed
    gg.eventhandler.s3.type=s3
    gg.eventhandler.s3.region=your-aws-region
    gg.eventhandler.s3.bucketMappingTemplate=your-target-s3-bucket-name
    gg.eventhandler.s3.pathMappingTemplate=ogg/data/${fullyQualifiedTableName}
    gg.eventhandler.s3.accessKeyId=XXXXXXXXX
    gg.eventhandler.s3.secretKey=XXXXXXXX
    
    #TODO Set the classpath to the Parquet client libraries and the Hadoop client with the paths you noted in step 1
    gg.classpath=
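    #Example (assumed layout): point gg.classpath at the dependency
    #directories created in step 1, including the AWS SDK:
    #gg.classpath=/u01/app/ogg/opt/DependencyDownloader/dependencies/parquet_1.12.3/*:/u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.3.0/*:/u01/app/ogg/opt/DependencyDownloader/dependencies/aws_sdk_1.12.309/*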
    jvm.bootoptions=-Xmx512m -Xms32m
    
  6. If the Replicat starts successfully, then it is in the Running state. Go to Action, then Details, and then Statistics to view the replication statistics.

    Figure 7-32 Replication Statistics

  7. Go to the AWS S3 console and check the bucket.

    Figure 7-33 S3 Bucket

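The following is a minimal sketch of a Replicat parameter file for this procedure. The process name PARQ is hypothetical; replace it with your own Replicat name, narrow the wildcard mapping if needed, and add the TARGETDB LIBFILE parameter from step 4 if you selected Coordinated Replicat:

    REPLICAT PARQ
    -- Wildcard mapping: replicate all source tables as is
    MAP *.*, TARGET *.*;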

Note:

  • If the target S3 bucket does not exist, then Oracle GoldenGate for Big Data creates it automatically. Use Template Keywords to dynamically assign S3 bucket names (see the example after these notes).
  • S3 Event Handler can be configured for the Proxy server. For more information, see Amazon S3.
  • You can use different properties to control the file-writing behavior, such as file sizes and inactivity periods. For more information, see the blog: Oracle GoldenGate for Big Data File Writer Handler Behavior.
  • For Kafka connection issues, see Oracle Support.
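
For example, the following hypothetical setting uses the ${groupName} template keyword to derive the bucket name at runtime. S3 bucket names must be lowercase, so make sure the resolved value is a valid bucket name:

    # Hypothetical: one bucket per Replicat group, for example ogg-parq
    gg.eventhandler.s3.bucketMappingTemplate=ogg-${groupName}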