7 Using the Optimized Row Columnar Event Handler

Use the Optimized Row Columnar (ORC) Event Handler to generate data files in ORC format.

This topic describes how to use the ORC Event Handler.

7.1 Overview

ORC is a row columnar format that can substantially improve data retrieval times and the performance of Big Data analytics. You can use the ORC Event Handler to write ORC files either to a local file system or directly to HDFS. For more information about ORC, see https://orc.apache.org/.

7.2 Detailing the Functionality

7.2.1 About the Upstream Data Format

The ORC Event Handler can only convert Avro Object Container Files (OCF) generated by the File Writer Handler. It cannot convert other formats to ORC data files. The format of the File Writer Handler must be avro_row_ocf or avro_op_ocf. See Using the File Writer Handler.

7.2.2 About the Library Dependencies

Generating ORC files requires both the Apache ORC libraries and the HDFS client libraries. See Optimized Row Columnar Event Handler Client Dependencies and HDFS Handler Client Dependencies.

Oracle GoldenGate for Big Data includes neither the Apache ORC libraries nor the HDFS client libraries. You must configure the gg.classpath variable to include the dependent libraries.

7.2.3 Requirements

The ORC Event Handler can write ORC files directly to HDFS. To do so, set the writeToHDFS property to true:

gg.eventhandler.orc.writeToHDFS=true

Ensure that the directory containing the HDFS core-site.xml file is in gg.classpath so that the file can be read at runtime and the HDFS connectivity information can be resolved. For example:

gg.classpath=/{HDFS_install_directory}/etc/hadoop
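One way to confirm that a directory on the classpath actually exposes core-site.xml is to look the file up as a classpath resource, which is how Hadoop client code typically finds it. The following is a minimal sketch, not part of the product: the class name CoreSiteCheck and the method locate are hypothetical, and the check is only meaningful if you run it with the same classpath entries that you put in gg.classpath.

```java
import java.net.URL;

public class CoreSiteCheck {

    // Returns the URL of the named resource if a classpath entry exposes it,
    // or null if no classpath entry contains it.
    static URL locate(String resource) {
        return CoreSiteCheck.class.getClassLoader().getResource(resource);
    }

    public static void main(String[] args) {
        URL url = locate("core-site.xml");
        if (url != null) {
            System.out.println("Found core-site.xml at " + url);
        } else {
            System.out.println("core-site.xml is not visible on the classpath");
        }
    }
}
```

For example, after compiling the class you might run it with java -cp .:/{HDFS_install_directory}/etc/hadoop CoreSiteCheck, substituting your own installation directory.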

If Kerberos authentication is enabled on the HDFS cluster, you must configure the Kerberos principal and the location of the keytab file so that the password can be resolved at runtime:

gg.eventhandler.name.kerberosPrincipal=principal
gg.eventhandler.name.kerberosKeytabFile=path_to_the_keytab_file

7.2.4 Using Templated Strings

Templated strings can contain a combination of string constants and keywords that are dynamically resolved at runtime. The ORC Event Handler makes extensive use of templated strings to generate the ORC directory names, data file names, and ORC bucket names. This gives you the flexibility to select where to write data files and the names of those data files.

Supported Templated Strings

Keyword Description
${fullyQualifiedTableName}

The fully qualified source table name delimited by a period (.). For example, MYCATALOG.MYSCHEMA.MYTABLE.

${catalogName}

The individual source catalog name. For example, MYCATALOG.

${schemaName}

The individual source schema name. For example, MYSCHEMA.

${tableName}

The individual source table name. For example, MYTABLE.

${groupName}

The name of the Replicat process (with the thread number appended if you’re using coordinated apply).

${emptyString}

Evaluates to an empty string. For example, “”.

${operationCount}

The total count of operations in the data file. Use this keyword only on rename or in the event handlers; anywhere else it resolves to zero (0) because nothing has been written yet. For example, “1024”.

${insertCount}

The total count of insert operations in the data file. Use this keyword only on rename or in the event handlers; anywhere else it resolves to zero (0) because nothing has been written yet. For example, “125”.

${updateCount}

The total count of update operations in the data file. Use this keyword only on rename or in the event handlers; anywhere else it resolves to zero (0) because nothing has been written yet. For example, “265”.

${deleteCount}

The total count of delete operations in the data file. Use this keyword only on rename or in the event handlers; anywhere else it resolves to zero (0) because nothing has been written yet. For example, “11”.

${truncateCount}

The total count of truncate operations in the data file. Use this keyword only on rename or in the event handlers; anywhere else it resolves to zero (0) because nothing has been written yet. For example, “5”.

${currentTimestamp}

The current timestamp. The default output format for the date and time is yyyy-MM-dd_HH-mm-ss.SSS. For example, 2017-07-05_04-31-23.123. Alternatively, you can customize the format of the current timestamp by placing the format inside square brackets, for example:

${currentTimestamp[MM-dd_HH]}

This format uses the syntax defined in the Java SimpleDateFormat class. See https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.

${toUpperCase[]}

Converts the contents inside the square brackets to uppercase. For example, ${toUpperCase[${fullyQualifiedTableName}]}.

${toLowerCase[]}

Converts the contents inside the square brackets to lowercase. For example, ${toLowerCase[${fullyQualifiedTableName}]}.
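Because the ${currentTimestamp} keyword takes SimpleDateFormat patterns, you can preview a pattern in plain Java before putting it in a template. The following is a minimal sketch: the class name TimestampFormatSketch is hypothetical, and the use of UTC and Locale.ROOT is an assumption made only to keep the output deterministic. The time zone that the handler itself applies is not stated in this topic.

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class TimestampFormatSketch {

    // Default pattern documented for ${currentTimestamp}.
    static final String DEFAULT_PATTERN = "yyyy-MM-dd_HH-mm-ss.SSS";

    // Formats a timestamp with a SimpleDateFormat pattern.
    // Assumption: UTC and Locale.ROOT, chosen here for repeatable output.
    static String format(String pattern, Date when) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.ROOT);
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(when);
    }

    public static void main(String[] args) {
        // Fixed sample instant matching the example in the keyword description.
        Date sample = Date.from(Instant.parse("2017-07-05T04:31:23.123Z"));

        // Default pattern: prints 2017-07-05_04-31-23.123
        System.out.println(format(DEFAULT_PATTERN, sample));

        // Pattern from ${currentTimestamp[MM-dd_HH]}: prints 07-05_04
        System.out.println(format("MM-dd_HH", sample));
    }
}
```

A coarser pattern such as MM-dd_HH changes only once per hour, which can be useful when the timestamp appears in a directory name rather than a file name.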

Templated strings can mix keywords and static strings to assemble path and data file names at runtime.

Path Configuration Example
/usr/local/${fullyQualifiedTableName}
Data File Configuration Example
${fullyQualifiedTableName}_${currentTimestamp}_${groupName}.txt
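To make the two examples above concrete, the following sketch substitutes sample values into them. It is an illustration of the idea only, not the handler's actual resolution logic: the class TemplateSketch, its resolve method, and the group name RORC are hypothetical, and the sketch handles plain ${keyword} substitution but not the bracketed forms such as ${toUpperCase[]} or ${currentTimestamp[...]}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateSketch {

    // Matches simple keywords of the form ${name}.
    private static final Pattern KEYWORD = Pattern.compile("\\$\\{(\\w+)\\}");

    // Replaces each ${name} with its value from the map.
    // Unknown keywords are left unchanged rather than guessed.
    static String resolve(String template, Map<String, String> values) {
        Matcher matcher = KEYWORD.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample values; RORC is a made-up Replicat group name.
        Map<String, String> values = new HashMap<>();
        values.put("fullyQualifiedTableName", "MYCATALOG.MYSCHEMA.MYTABLE");
        values.put("currentTimestamp", "2017-07-05_04-31-23.123");
        values.put("groupName", "RORC");

        // Prints /usr/local/MYCATALOG.MYSCHEMA.MYTABLE
        System.out.println(resolve("/usr/local/${fullyQualifiedTableName}", values));

        // Prints MYCATALOG.MYSCHEMA.MYTABLE_2017-07-05_04-31-23.123_RORC.txt
        System.out.println(resolve(
            "${fullyQualifiedTableName}_${currentTimestamp}_${groupName}.txt", values));
    }
}
```

Including ${currentTimestamp} in the file name keeps successive files for the same table from sharing a name, provided the timestamp pattern is fine-grained enough.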