8.5 Realtime Parquet Ingestion into Azure Data Lake Storage with Oracle GoldenGate for Distributed Applications and Analytics

Overview

This Quickstart covers a step-by-step process showing how to ingest Parquet files into Azure Storage containers in real time with Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA).

Azure Data Lake Storage (ADLS) is a centralized repository provided by Azure where you can store all your data, both structured and unstructured.

The GG for DAA ADLS handler works in conjunction with the File Writer Handler and, if Parquet output is required, the Parquet Handler. The File Writer Handler produces files locally, the Parquet Handler optionally converts them to Parquet format, and the Azure Data Lake Storage (ADLS) Handler loads them into Azure Storage containers.

GG for DAA provides several alternatives for connecting to ADLS. This Quickstart uses the BLOB endpoint for the connection.
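Concretely, the three handlers are chained through event-handler properties. The following is a condensed sketch of that wiring, extracted from the full property list used in step 6 of Create a Replicat below:

```
# Local file generation, then Parquet conversion, then Azure upload
gg.handlerlist=filewriter
gg.handler.filewriter.eventHandler=parquet
gg.eventhandler.parquet.eventHandler=abs
```

Each stage hands its output files to the next stage as events, so no intermediate orchestration is needed.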

8.5.1 Prerequisites

To successfully complete this Quickstart, you must have the following:

  • Azure Storage Account and a Container
  • Azure Storage Account login credentials

In this Quickstart, a sample trail file (named tr), which is shipped with GG for DAA, is used. The sample trail file is located at GG_HOME/opt/AdapterExamples/trail/ in your GG for DAA instance.

8.5.2 Install Dependency Files

GG for DAA uses the Java SDKs provided by Azure. You can download the SDKs using the Dependency Downloader utility shipped with GG for DAA. Dependency Downloader is a set of shell scripts that download dependency JAR files from Maven and other repositories.

GG for DAA uses a three-step process to ingest Parquet into Azure Storage containers:

  • Generating local files from trail files
  • Converting local files to Parquet format
  • Loading files into Azure Storage containers

For generating local Parquet files, the replicat uses the File Writer Handler and the Parquet Event Handler. To load the Parquet files into Azure Storage, GG for DAA uses the Azure BLOB Handler in conjunction with the File Writer Handler and the Parquet Event Handler.

GG for DAA uses three different sets of client libraries to create Parquet files and load them into Azure Storage:
  1. In your GG for DAA VM, go to the Dependency Downloader utility. It is located at GG_HOME/opt/DependencyDownloader/
  2. Execute parquet.sh, hadoop.sh, and azure_blob_storage.sh with the required versions.

    Figure 8-27 Execute parquet.sh, hadoop.sh, and azure_blob_storage.sh with the required versions

    Execute parquet.sh, hadoop.sh, and azure_blob_storage.sh with the required versions.
  3. Three directories are created in GG_HOME/opt/DependencyDownloader/dependencies. Note the directory paths. For example:
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/azure_blob_storage_12.21.2
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.3.0
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/parquet_1.12.3
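The directory paths noted above are what the gg.classpath property in the replicat configuration must point to. As an illustration, a minimal shell sketch that assembles the classpath value from the example paths (adjust DEPS to your own installation):

```shell
# Assemble the gg.classpath value from the three downloaded dependency
# directories (example paths from this Quickstart).
DEPS=/u01/app/ogg/opt/DependencyDownloader/dependencies
CP="$DEPS/parquet_1.12.3/*:$DEPS/hadoop_3.3.0/*:$DEPS/azure_blob_storage_12.21.2/*"
echo "gg.classpath=$CP"
```

The trailing /* entries are kept literal so that the JVM, not the shell, expands them to all JARs in each directory.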

8.5.3 Create a Replicat in Oracle GoldenGate for Distributed Applications and Analytics

To create a replicat in Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA):

  1. In the GG for DAA UI, in the Administration Service tab, click the + sign to add a replicat.

    Figure 8-28 Click the Administration Service tab

    Click the Administration Service tab.
  2. Select the Classic Replicat Type and click Next. There are two different Replicat types available: Classic and Coordinated. Classic Replicat is a single-threaded process, whereas Coordinated Replicat is a multithreaded process that applies transactions in parallel.

    Figure 8-29 Add Replicat

    Add Replicat
  3. Enter the Replicat information, and click Next:
    1. Replicat Trail: Name of the required trail file. For sample trail, provide tr.
    2. Subdirectory: Enter GG_HOME/opt/AdapterExamples/trail/ if using the sample trail.
    3. Target: Azure Data Lake Storage

    Figure 8-30 Replicat Options

    Replicat Options
  4. Leave Managed Options as is and click Next.

    Figure 8-31 Managed Options

    Managed Options
  5. Enter the Parameter File details and click Next. In the Parameter File, you can either specify a source-to-target mapping or leave it as-is with a wildcard selection. If Coordinated Replicat is selected as the Replicat Type, then an additional parameter must be provided: TARGETDB LIBFILE libggjava.so SET property=<ggbd-deployment_home>/etc/conf/ogg/your_replicat_name.properties

    Figure 8-32 Parameter File

    Parameter File
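As a sketch, a complete parameter file for a Coordinated Replicat might look like the following, assuming a hypothetical replicat name rabs and a wildcard mapping:

```
REPLICAT rabs
TARGETDB LIBFILE libggjava.so SET property=<ggbd-deployment_home>/etc/conf/ogg/rabs.properties
MAP *.*, TARGET *.*;
```

For a Classic Replicat, the TARGETDB line is not required; only the REPLICAT and MAP statements are needed.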
  6. In the Properties File, remove all the pre-configured properties except the first line, which carries the replicat name (# Properties file for Replicat <replicat_name>). Copy and paste the property list below into the properties file, update the properties marked #TODO, and click Create and Run.
    #The File Writer Handler – no need to change
    gg.handlerlist=filewriter
    gg.handler.filewriter.type=filewriter
    gg.handler.filewriter.mode=op
    gg.handler.filewriter.pathMappingTemplate=./dirout
    gg.handler.filewriter.stateFileDirectory=./dirsta
    gg.handler.filewriter.fileRollInterval=7m
    gg.handler.filewriter.inactivityRollInterval=5s
    gg.handler.filewriter.fileWriteActiveSuffix=.tmp
    gg.handler.filewriter.finalizeAction=delete
    
    
    ### Avro OCF – no need to change 
    gg.handler.filewriter.format=avro_row_ocf
    gg.handler.filewriter.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.avro
    gg.handler.filewriter.format.pkUpdateHandling=delete-insert
    gg.handler.filewriter.format.metaColumnsTemplate=${optype},${position}
    gg.handler.filewriter.format.iso8601Format=false
    gg.handler.filewriter.partitionByTable=true
    gg.handler.filewriter.rollOnShutdown=true
    
    
    #The Parquet Event Handler – no need to change
    gg.handler.filewriter.eventHandler=parquet
    gg.eventhandler.parquet.type=parquet
    gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
    gg.eventhandler.parquet.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.parquet
    gg.eventhandler.parquet.writeToHDFS=false
    gg.eventhandler.parquet.finalizeAction=delete
    
    #Select the ABS Event Handler – no need to change
    gg.eventhandler.parquet.eventHandler=abs
    
    #TODO Set the ABS Event Handler - please update as needed
    gg.eventhandler.abs.type=abs
    gg.eventhandler.abs.bucketMappingTemplate=<abs-container-name>
    gg.eventhandler.abs.pathMappingTemplate=ogg/data/${fullyQualifiedTableName}
    gg.eventhandler.abs.accountName=<storage-account-name>
    #TODO: Edit the Azure storage account key if an access key is used
    #gg.eventhandler.abs.accountKey=<storage-account-key>
    #TODO: Edit the Azure shared access signature (SAS) token if SAS is used
    #gg.eventhandler.abs.sasToken=<sas-token>
    #TODO: Edit the tenant ID, client ID, and client secret of the application if Azure Active Directory authentication is used
    #gg.eventhandler.abs.tenantId=<azure-tenant-id>
    #gg.eventhandler.abs.clientId=<azure-client-id>
    #gg.eventhandler.abs.clientSecret=<azure-client-secret>
    
    #TODO Set the classpath to the dependency paths you noted in Install Dependency Files
    gg.classpath=<path_to>/parquet_1.12.3/*:<path_to>/hadoop_3.3.0/*:<path_to>/azure_blob_storage_12.21.2/*
    jvm.bootoptions=-Xmx512m -Xms32m
    
  7. If the replicat starts successfully, it will be in the Running state. You can go to Replicats > Statistics to see the replication statistics.

    Figure 8-33 Replicats Statistics

    Replicats Statistics
  8. Go to Azure Storage console and check the files.

    Figure 8-34 Azure Storage console

    Azure Storage console

Note:

  • If the target Azure container does not exist, it is auto-created by GG for DAA. You can use Template Keywords to dynamically assign container names.
  • The ABS Event Handler can be configured to use a proxy server. For more information, see Azure Blob Storage.
  • You can use different properties to control the file-writing behavior, such as file sizes and inactivity periods. For more details, see the File Writer blog post.
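For example, the container name could be derived per replicat group with a Template Keyword. This is a hypothetical sketch, not a required setting; note that Azure container names must consist of lowercase letters, numbers, and hyphens:

```
# Hypothetical: derive the container name from the replicat group name
gg.eventhandler.abs.bucketMappingTemplate=ogg-${groupName}
```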