10 Using the Parquet Event Handler

Learn how to use the Parquet Event Handler to convert files generated by the File Writer Handler to Parquet format and, optionally, write them directly to HDFS.

See Using the File Writer Handler.

10.1 Overview

The Parquet Event Handler enables you to generate data files in Parquet format. Parquet files can be written to either the local file system or directly to HDFS. Parquet is a columnar data format that can substantially improve data retrieval times and the performance of Big Data analytics. See https://parquet.apache.org/.

10.2 Detailing the Functionality

10.2.1 Configuring the Parquet Event Handler to Write to HDFS

The Apache Parquet framework supports writing directly to HDFS, so the Parquet Event Handler can write Parquet files directly to HDFS. The following additional configuration steps are required:

The Parquet Event Handler dependencies and considerations are the same as those of the HDFS Handler. See HDFS Additional Considerations.

Set the writeToHDFS property to true. In this example, the event handler is named parquet:

gg.eventhandler.parquet.writeToHDFS=true

Ensure that gg.classpath includes the HDFS client libraries.

Ensure that the directory containing the HDFS core-site.xml file is in gg.classpath. This is so the core-site.xml file can be read at runtime and the connectivity information to HDFS can be resolved. For example:

gg.classpath=/{HDFS_install_directory}/etc/hadoop

If Kerberos authentication is enabled on the HDFS cluster, you must configure the Kerberos principal and the location of the keytab file so that the password can be resolved at runtime:

gg.eventhandler.name.kerberosPrincipal=principal
gg.eventhandler.name.kerberosKeytabFile=path_to_the_keytab_file
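
Taken together, the steps above might look like the following fragment of the Java Adapter properties file. This is a sketch only, not a complete configuration: the event handler name parquet, the client library location (which assumes a stock Apache Hadoop directory layout), the principal, and the keytab path are all illustrative assumptions, and {HDFS_install_directory} remains a placeholder for your installation.

```
# Assumes the event handler is named "parquet"
gg.eventhandler.parquet.type=parquet
gg.eventhandler.parquet.writeToHDFS=true

# HDFS client libraries plus the directory that contains core-site.xml.
# The share/hadoop/... locations are an assumption; adjust them to your install.
gg.classpath=/{HDFS_install_directory}/share/hadoop/common/*:/{HDFS_install_directory}/share/hadoop/hdfs/*:/{HDFS_install_directory}/etc/hadoop

# Needed only when Kerberos authentication is enabled; both values are hypothetical
gg.eventhandler.parquet.kerberosPrincipal=ogg@EXAMPLE.COM
gg.eventhandler.parquet.kerberosKeytabFile=/etc/security/keytabs/ogg.keytab
```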

10.2.2 About the Upstream Data Format

The Parquet Event Handler can convert only Avro Object Container File (OCF) files generated by the File Writer Handler. It cannot convert other formats to Parquet data files. The format of the File Writer Handler must be avro_row_ocf or avro_op_ocf. See Using the File Writer Handler.
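
As a rough illustration of this requirement, the upstream File Writer Handler might be configured along these lines. The handler names filewriter and parquet are hypothetical, and the gg.handler property names are taken from the File Writer Handler configuration rather than from this chapter, so verify them there before use.

```
# Hypothetical handler names: "filewriter" (File Writer Handler) and "parquet" (Parquet Event Handler)
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter

# The upstream format must be an Avro OCF format: avro_row_ocf or avro_op_ocf
gg.handler.filewriter.format=avro_row_ocf

# Invoke the Parquet Event Handler when the File Writer Handler rolls a file
gg.handler.filewriter.eventHandler=parquet
gg.eventhandler.parquet.type=parquet
```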

10.2.3 Using Templated Strings

Templated strings can contain a combination of string constants and keywords that are dynamically resolved at runtime. The Parquet Event Handler makes extensive use of templated strings to generate the HDFS directory names and data file names. This gives you the flexibility to select where to write data files and the names of those data files.

Supported Templated Strings

Keyword Description
${fullyQualifiedTableName}

The fully qualified source table name delimited by a period (.). For example, MYCATALOG.MYSCHEMA.MYTABLE.

${catalogName}

The individual source catalog name. For example, MYCATALOG.

${schemaName}

The individual source schema name. For example, MYSCHEMA.

${tableName}

The individual source table name. For example, MYTABLE.

${groupName}

The name of the Replicat process (with the thread number appended if you’re using coordinated apply).

${emptyString}

Evaluates to an empty string. For example, “”.

${operationCount}

The total count of operations in the data file. Use this keyword only on rename or in the event handlers; otherwise it resolves to zero (0) because nothing has been written yet. For example, “1024”.

${insertCount}

The total count of insert operations in the data file. Use this keyword only on rename or in the event handlers; otherwise it resolves to zero (0) because nothing has been written yet. For example, “125”.

${updateCount}

The total count of update operations in the data file. Use this keyword only on rename or in the event handlers; otherwise it resolves to zero (0) because nothing has been written yet. For example, “265”.

${deleteCount}

The total count of delete operations in the data file. Use this keyword only on rename or in the event handlers; otherwise it resolves to zero (0) because nothing has been written yet. For example, “11”.

${truncateCount}

The total count of truncate operations in the data file. Use this keyword only on rename or in the event handlers; otherwise it resolves to zero (0) because nothing has been written yet. For example, “5”.

${currentTimestamp}

The current timestamp. The default output format is yyyy-MM-dd_HH-mm-ss.SSS, for example, 2017-07-05_04-31-23.123. You can customize the format by placing a format string inside square brackets, for example:

${currentTimestamp[MM-dd_HH]}

This format uses the syntax defined in the Java SimpleDateFormat class. See https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.

${toUpperCase[]}

Converts the contents inside the square brackets to uppercase. For example, ${toUpperCase[${fullyQualifiedTableName}]}.

${toLowerCase[]}

Converts the contents inside the square brackets to lowercase. For example, ${toLowerCase[${fullyQualifiedTableName}]}.

Configuration of template strings can use a mix of keywords and static strings to assemble path and data file names at runtime.

Path Configuration Example
/usr/local/${fullyQualifiedTableName}
Data File Configuration Example
${fullyQualifiedTableName}_${currentTimestamp}_${groupName}.txt
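
To illustrate how the keywords combine, the following sketch sets both templates for an event handler hypothetically named parquet. The resolved values in the comments assume a source table MYCATALOG.MYSCHEMA.MYTABLE, a Replicat group hypothetically named RPARQ, and the example date used above; the /ogg/data prefix and the .parquet suffix are illustrative choices, not requirements.

```
# Expected to resolve to /ogg/data/mycatalog.myschema.mytable
gg.eventhandler.parquet.pathMappingTemplate=/ogg/data/${toLowerCase[${fullyQualifiedTableName}]}

# Expected to resolve to something like MYTABLE_2017-07-05_RPARQ.parquet
gg.eventhandler.parquet.fileNameMappingTemplate=${tableName}_${currentTimestamp[yyyy-MM-dd]}_${groupName}.parquet
```

Putting the table name in the path and a timestamp in the file name keeps the files for each table together while still giving each rolled file a distinct name.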

10.3 Configuring the Parquet Event Handler

You configure the Parquet Event Handler operation using properties in the Java Adapter properties file (not in the Replicat properties file).

The Parquet Event Handler works only in conjunction with the File Writer Handler.

To select the Parquet Event Handler, first configure the handler type by specifying gg.eventhandler.name.type=parquet, and then set the other Parquet Event Handler properties as follows:

Table 10-1 Parquet Event Handler Configuration Properties

Properties, Required/Optional, Legal Values, Default, Explanation

gg.eventhandler.name.type

Required

parquet

None

Selects the Parquet Event Handler for use.

gg.eventhandler.name.writeToHDFS

Optional

true | false

false

Set to false to write to the local file system. Set to true to write directly to HDFS.

gg.eventhandler.name.pathMappingTemplate

Required

A string with resolvable keywords and constants used to dynamically generate the path to write generated Parquet files.

None

Use keywords interlaced with constants to dynamically generate unique path names at runtime. Typically, path names follow the format /ogg/data/${groupName}/${fullyQualifiedTableName}.

gg.eventhandler.name.fileNameMappingTemplate

Optional

A string with resolvable keywords and constants used to dynamically generate the Parquet file name at runtime.

None

Sets the Parquet file name. If not set, the upstream file name is used.

gg.eventhandler.name.compressionCodec

Optional

GZIP | LZO | SNAPPY | UNCOMPRESSED

UNCOMPRESSED

Sets the compression codec of the generated Parquet file.

gg.eventhandler.name.finalizeAction

Optional

none | delete

none

Indicates what the Parquet Event Handler should do at the finalize action.

none

Leave the data file in place.

delete

Delete the data file (for example, if the data file has been converted to another format or loaded to a third-party application).

gg.eventhandler.name.dictionaryEncoding

Optional

true | false

The Parquet default.

Set to true to enable Parquet dictionary encoding.

gg.eventhandler.name.validation

Optional

true | false

The Parquet default.

Set to true to enable Parquet validation.

gg.eventhandler.name.dictionaryPageSize

Optional

Integer

The Parquet default.

Sets the Parquet dictionary page size.

gg.eventhandler.name.maxPaddingSize

Optional

Integer

The Parquet default.

Sets the Parquet padding size.

gg.eventhandler.name.pageSize

Optional

Integer

The Parquet default.

Sets the Parquet page size.

gg.eventhandler.name.rowGroupSize

Optional

Integer

The Parquet default.

Sets the Parquet row group size.

gg.eventhandler.name.kerberosPrincipal

Optional

The Kerberos principal name.

None

Set to the Kerberos principal when writing directly to HDFS and Kerberos authentication is enabled.

gg.eventhandler.name.kerberosKeytabFile

Optional

The path to the Kerberos keytab file.

None

Set to the path to the Kerberos keytab file when writing directly to HDFS and Kerberos authentication is enabled.

gg.eventhandler.name.eventHandler

Optional

A unique string identifier cross-referencing a child event handler.

No event handler configured.

The event handler that is invoked on the file roll event. Event handlers can do file roll event actions like loading files to S3, converting to Parquet or ORC format, or loading files to HDFS.

gg.eventhandler.name.writerVersion

Optional

v1 | v2

The Parquet library default, which is v1 up through Parquet version 1.11.0.

Sets the Parquet writer version.
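
A minimal configuration built from the properties in Table 10-1 might look like the following sketch. The event handler name parquet and every value shown are illustrative assumptions; only type and pathMappingTemplate are required, and the remaining lines either restate a documented default or pick one of the legal values listed in the table.

```
# Required properties (assumes the event handler is named "parquet")
gg.eventhandler.parquet.type=parquet
gg.eventhandler.parquet.pathMappingTemplate=/ogg/data/${groupName}/${fullyQualifiedTableName}

# Optional: write to the local file system (the default) rather than HDFS
gg.eventhandler.parquet.writeToHDFS=false

# Optional: compress the generated Parquet files (default is UNCOMPRESSED)
gg.eventhandler.parquet.compressionCodec=SNAPPY

# Optional: leave the data file in place at finalize (the default)
gg.eventhandler.parquet.finalizeAction=none

# Optional: pin the Parquet writer version explicitly
gg.eventhandler.parquet.writerVersion=v1
```

Because fileNameMappingTemplate is not set here, the upstream file name is used for the generated Parquet file.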