8.3 Realtime Parquet Ingestion into Google Cloud Storage with GoldenGate for Distributed Applications and Analytics 23.8 and later

This Quickstart shows, step by step, how to ingest Parquet files into Google Cloud Storage buckets in real time with GoldenGate for Distributed Applications and Analytics (GG for DAA).

This Quickstart is valid for GG for DAA 23.7 and later versions.

Google Cloud Storage (GCS) is a service for storing objects in Google Cloud Platform.

The GG for DAA GCS Handler works in conjunction with the File Writer Handler and, if Parquet output is required, the Parquet Handler. The File Writer Handler produces files locally, the Parquet Handler optionally converts them to Parquet format, and the GCS Handler loads them into GCS buckets.

8.3.1 Prerequisites

To successfully complete this Quickstart, you must have the following:

  • Google Cloud Storage Bucket
  • Google Service Account Key with Bucket and Object Permissions. For more information, see Replicate Data.
  • Public access to your bucket. GG for DAA also supports private bucket access; see the GG for DAA documentation for details.

This Quickstart uses the sample trail file (named tr) that is shipped with GG for DAA. If you want to use the sample trail file, it is located at GG_HOME/opt/AdapterExamples/trail/ in your GG for DAA instance.

8.3.2 Install Required Dependency Files

GG for DAA uses client libraries in the replication process, and these libraries must be downloaded before you set up replication. You can use the Dependency Downloader to download them. The Dependency Downloader is a set of shell scripts that download dependency jar files from Maven and other repositories.

GG for DAA uses a 3-step process to ingest parquet into GCS buckets:

  • Generating local files from trail files
  • Converting local files to Parquet format
  • Loading files into GCS

To generate local Parquet files, the Replicat uses the File Writer Handler and the Parquet Handler. To load the Parquet files into GCS, GG for DAA uses the Google Cloud Storage Handler in conjunction with the File Writer Handler and the Parquet Event Handler.

  1. In your GG for DAA VM, go to the Dependency Downloader utility, located at GG_HOME/opt/DependencyDownloader/.
  2. Run parquet.sh, hadoop.sh, and gcs.sh with the required versions. You can check the available versions and any reported vulnerabilities in Maven Central.

    Figure 8-11 Install Required Dependency Files

  3. Three directories are created in GG_HOME/opt/DependencyDownloader/dependencies. Make a note of these directories. For example:
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/gcs_12.29.1
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.3.0
    • /u01/app/ogg/opt/DependencyDownloader/dependencies/parquet_1.12.3
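
The directories noted in step 3 are later referenced by the gg.classpath property in the Replicat properties file. As a minimal sketch (not part of GG for DAA), the following Python helper joins such directories into a classpath value and appends /* to each entry so that every jar file in the directory is picked up. The function name build_gg_classpath is hypothetical, and the paths are the example paths from step 3; substitute your own.

```python
from pathlib import PurePosixPath


def build_gg_classpath(dependency_dirs):
    """Join dependency directories into a gg.classpath value.

    Each directory gets a trailing /* so that all jar files inside it
    are included, following Java classpath wildcard rules.
    """
    entries = [str(PurePosixPath(d) / "*") for d in dependency_dirs]
    return ":".join(entries)


# Example directories from step 3; adjust them to your installation.
DEPENDENCY_DIRS = [
    "/u01/app/ogg/opt/DependencyDownloader/dependencies/gcs_12.29.1",
    "/u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.3.0",
    "/u01/app/ogg/opt/DependencyDownloader/dependencies/parquet_1.12.3",
]

# Print a line that can be pasted into the Replicat properties file.
print("gg.classpath=" + build_gg_classpath(DEPENDENCY_DIRS))
```

Generating the value from the noted directories avoids a common mistake: listing a directory without the trailing wildcard, in which case the jar files inside it are not loaded.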

8.3.3 Create a Replicat in Oracle GoldenGate for Distributed Applications and Analytics

  1. Go to Administration Service and click the + sign to add a Replicat.

    Figure 8-12 Administration Service

  2. Select the Replicat Type, enter the Process Name, and click Next. Two Replicat types are available: Classic and Coordinated. Classic Replicat is a single-threaded process, whereas Coordinated Replicat is multithreaded and applies transactions in parallel.

    Figure 8-13 Replicat Type

  3. Enter Replicat Options and click Next.
    • Replicat Trail: Name of the required trail file. For the sample trail, enter tr.
    • Subdirectory: Enter GG_HOME/opt/AdapterExamples/trail/ if you are using the sample trail.
    • Target: Google Cloud Storage
    • Format: Select the file format

    Figure 8-14 Replicat Options

  4. Leave Managed Options as is and click Next.

    Figure 8-15 Managed Options

  5. Provide Parameter File details and click Next. In the Parameter File, you can either specify source-to-target mappings or leave the default wildcard mapping as is. If you selected Coordinated Replicat as the Replicat Type, add the following parameter:

    TARGETDB LIBFILE libggjava.so SET property=<ggbd-deployment_home>/etc/conf/ogg/your_replicat_name.properties

    Figure 8-16 Parameter File

  6. In Properties File, update the properties marked as #TODO and click Create and Run. For example:
    # Properties file for Replicat 
    gg.target=gcs
    
    #TODO: format can be 'parquet' or 'orc' or one of the pluggable formatter types. Default is 'parquet'.
    gg.format=parquet
    
    #The GCS Event handler
    gg.eventhandler.gcs.pathMappingTemplate=${fullyQualifiedTableName}
    #TODO: Edit the GCS bucket name
    gg.eventhandler.gcs.bucketMappingTemplate=<gcs-bucket-name>
    #TODO: Edit the GCS credentialsFile
    gg.eventhandler.gcs.credentialsFile=/path/to/gcp/credentialsFile
    gg.eventhandler.gcs.finalizeAction=none
    
    #TODO: Edit to include the GCS Java SDK.
    gg.classpath=/path/to/gcs-deps/*:/path/to/hadoop-deps/*:/path/to/parquet_deps/*
    
  7. If the Replicat starts successfully, it is in the running state. Go to replicat/statistics to see the replication statistics.

    Figure 8-17 Replication Statistics

  8. Go to the GCP Cloud Storage bucket and check the files.

    Figure 8-18 Google Cloud Storage Bucket

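
The check in step 8 can also be done programmatically. The following is a minimal sketch, assuming that the google-cloud-storage Python client library is installed and that the uploaded objects end with a .parquet suffix (an assumption; adjust the suffix to match your file naming). The helper names list_objects_with_suffix and check_bucket are hypothetical, and the credentials path and bucket name are the same placeholders used in the properties file.

```python
def list_objects_with_suffix(client, bucket_name, prefix=None, suffix=".parquet"):
    """Return the sorted names of objects in the bucket that end with suffix.

    client is expected to behave like google.cloud.storage.Client,
    that is, to offer list_blobs(bucket_name, prefix=...).
    """
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return sorted(blob.name for blob in blobs if blob.name.endswith(suffix))


def check_bucket(credentials_file, bucket_name):
    """Print the matching object names using a real GCS client."""
    # Imported here so the helper above can be used without the library.
    from google.cloud import storage

    client = storage.Client.from_service_account_json(credentials_file)
    for name in list_objects_with_suffix(client, bucket_name):
        print(name)
```

Calling check_bucket("/path/to/gcp/credentialsFile", "<gcs-bucket-name>") with your own values prints one object name per line. An empty result right after the Replicat starts does not necessarily indicate a problem, because files are uploaded only once the File Writer Handler has finished writing them.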

Note:

  • If the target GCS bucket does not exist, GG for DAA creates it automatically. You can use Template Keywords to dynamically assign the bucket names.
  • The GCS Event Handler can be configured for a proxy server. For more information, see Google Cloud Storage Event Handler.
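
To illustrate how a template keyword such as ${fullyQualifiedTableName} turns into a concrete name, the following sketch mimics the substitution with Python's string.Template. It only illustrates the idea and is not the resolver that GG for DAA uses. The function resolve_template and the table name SALES.ORDERS are hypothetical, and the sketch assumes that the keyword resolves to a schema-qualified table name.

```python
from string import Template


def resolve_template(template, keywords):
    """Substitute ${keyword} placeholders with the supplied values."""
    return Template(template).substitute(keywords)


# Hypothetical value for one source table. GG for DAA supplies the real
# values at runtime from the operation being processed.
keywords = {"fullyQualifiedTableName": "SALES.ORDERS"}

# Same template as gg.eventhandler.gcs.pathMappingTemplate in step 6.
print(resolve_template("${fullyQualifiedTableName}", keywords))  # prints SALES.ORDERS
```

With a mapping template like this, data from each source table lands under its own path, which keeps the files of different tables apart in the bucket.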