5 Using the File Writer Handler
The File Writer Handler and its associated event handlers enable you to write data files to a local file system.
This chapter describes how to use the File Writer Handler.
5.1 Overview
You can use the File Writer Handler and the event handlers to transform data.
The File Writer Handler supports generating data files in delimited text, XML, JSON, Avro, and Avro Object Container File (OCF) formats. It is intended to fulfill an extraction, load, and transform use case. Data files are staged on your local file system. When writing to a data file is complete, a third-party application can read the file to perform additional processing.
The File Writer Handler also supports the event handler framework, which allows data files generated by the File Writer Handler to be transformed into other formats, such as Optimized Row Columnar (ORC) or Parquet, or loaded into third-party applications, such as HDFS or Amazon S3. The framework is extensible, so additional event handlers can be developed to perform other transformations or to load to other targets. Additionally, you can develop a custom event handler for your big data environment.
Oracle GoldenGate for Big Data provides two handlers that can write to HDFS. Oracle recommends choosing between them as follows:
- The HDFS Handler is designed to stream data directly to HDFS. Use it when:
  - No post-write processing occurs in HDFS. The handler does not change the contents of the file; it simply uploads the existing data to HDFS.
  - Analytical tools access data written to HDFS in real time, including data in files that are open and actively being written to.
- The File Writer Handler is designed to stage data on the local file system and then load completed data files to HDFS when writing to a file is complete. Use it when:
  - Analytic tools do not access data written to HDFS in real time.
  - Post-write processing occurs in HDFS to transform, reformat, merge, or move the data to a final location.
  - You want to write data files to HDFS in ORC or Parquet format.
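As a sketch, the staged-load pattern pairs the File Writer Handler with the HDFS Event Handler. The property values below are illustrative, not a complete or verified configuration, and the `hdfs` event handler name and type value are assumptions; see the HDFS Event Handler documentation for its actual properties:

```
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.pathMappingTemplate=./dirout
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileNameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}.txt
# Chain the HDFS Event Handler to upload each completed file (illustrative)
gg.handler.filewriter.eventHandler=hdfs
gg.eventhandler.hdfs.type=hdfs
gg.eventhandler.hdfs.pathMappingTemplate=/user/ogg/staging
gg.eventhandler.hdfs.finalizeAction=none
```

The streaming pattern, by contrast, uses only the HDFS Handler with no local staging step.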
5.1.1 Detailing the Functionality
- Using File Roll Events
- Automatic Directory Creation
- About the Active Write Suffix
- Maintenance of State
- Using Templated Strings
5.1.1.1 Using File Roll Events
A file roll event occurs when writing to a specific data file is completed. No more data is written to that specific data file.
Finalize Action Operation
You can configure the finalize action operation to clean up a specific data file after a successful file roll action using the finalizeAction property with the following options:
- none: Leave the data file in place (removing any active write suffix, see About the Active Write Suffix).
- delete: Delete the data file (for example, if the data file has been converted to another format or loaded to a third party application).
- move: Maintain the file name (removing any active write suffix), but move the file to the directory resolved using the movePathMappingTemplate property.
- rename: Maintain the current directory, but rename the data file using the fileRenameMappingTemplate property.
- move-rename: Rename the file using the file name generated by the fileRenameMappingTemplate property and move the file to the directory resolved using the movePathMappingTemplate property.
Typically, event handlers offer a subset of these same actions.
A sample configuration of a finalize action operation:
gg.handlerlist=filewriter
#The File Writer Handler
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
gg.handler.filewriter.pathMappingTemplate=./dirout/evActParamS3R
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileNameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}.txt
gg.handler.filewriter.fileRollInterval=7m
gg.handler.filewriter.finalizeAction=delete
gg.handler.filewriter.inactivityRollInterval=7m
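For comparison, a hypothetical move-rename configuration would replace the delete action with the two mapping templates that move-rename requires. The directory and template values here are illustrative:

```
gg.handler.filewriter.finalizeAction=move-rename
gg.handler.filewriter.movePathMappingTemplate=./dirdone/${fullyQualifiedTableName}
gg.handler.filewriter.fileRenameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}_final.txt
```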
File Rolling Actions
Any of the following events triggers a file roll:
- A metadata change event.
- The maximum configured file size is exceeded.
- The file roll interval is exceeded (the current time minus the time of the first file write is greater than the file roll interval).
- The inactivity roll interval is exceeded (the current time minus the time of the last file write is greater than the inactivity roll interval).
- The File Writer Handler is configured to roll on shutdown and the Replicat process is stopped.
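The roll triggers above can be sketched as a single predicate. This is an illustrative model, not GoldenGate code; the function and parameter names are assumptions:

```python
def should_roll(size_bytes, max_file_size,
                now_ms, first_write_ms, last_write_ms,
                roll_interval_ms, inactivity_interval_ms,
                metadata_changed=False, shutting_down=False,
                roll_on_shutdown=False):
    """Illustrative model of the File Writer Handler roll triggers."""
    if metadata_changed:                        # metadata change event
        return True
    if size_bytes > max_file_size:              # maximum file size exceeded
        return True
    if roll_interval_ms is not None and now_ms - first_write_ms > roll_interval_ms:
        return True                             # file roll interval exceeded
    if inactivity_interval_ms is not None and now_ms - last_write_ms > inactivity_interval_ms:
        return True                             # inactivity roll interval exceeded
    if shutting_down and roll_on_shutdown:      # rollOnShutdown=true and Replicat stops
        return True
    return False
```

A `None` interval models the default behavior in which time-based and inactivity-based rolling are turned off.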
Operation Sequence
The file roll event triggers a sequence of operations to occur. It is important that you understand the order of the operations that occur when an individual data file is rolled:
1. The active data file is switched to inactive, the data file is flushed, and the state file is flushed.
2. The configured event handlers are called in the sequence that you specified.
3. The finalize action is executed on all of the event handlers in the reverse of the order in which you configured them. Any finalize action that you configured is executed.
4. The finalize action is executed on the data file and the state file. If all actions are successful, the state file is removed.
For example, if you configured the File Writer Handler with the Parquet Event Handler and then the S3 Event Handler, the order for a roll event is:
1. The active data file is switched to inactive, the data file is flushed, and the state file is flushed.
2. The Parquet Event Handler is called to generate a Parquet file from the source data file.
3. The S3 Event Handler is called to load the generated Parquet file to S3.
4. The finalize action is executed on the S3 Event Handler. Any finalize action that you configured is executed.
5. The finalize action is executed on the Parquet Event Handler. Any finalize action that you configured is executed.
6. The finalize action is executed for the data file in the File Writer Handler.
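The sequence can be modeled as a forward processing pass through the configured event handlers followed by finalize actions that unwind in reverse configuration order. This sketch is illustrative only and does not reflect the actual event handler framework API:

```python
def run_roll_event(data_file, handlers):
    """Model of a file roll event: process the file through each event
    handler in configured order, then run finalize actions in reverse
    order, ending with the File Writer Handler's own finalize action."""
    log = []
    for h in handlers:                  # forward pass: convert/load
        log.append(f"{h} processes {data_file}")
    for h in reversed(handlers):        # finalize actions, reverse order
        log.append(f"{h} finalize")
    log.append("file writer finalize")  # data file finalized, state file removed
    return log
```

With the Parquet and S3 handlers configured in that order, the log mirrors the six documented steps.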
5.1.1.2 Automatic Directory Creation
The File Writer Handler automatically creates the directories resolved from the configured path mapping templates if they do not already exist, so you do not need to create output directories in advance.
5.1.1.3 About the Active Write Suffix
A common use case is a third-party application monitoring the write directory in order to read data files. A third-party application can only read a data file after writing to that file has completed, so it needs a way to determine whether writing to a data file is active or complete. The File Writer Handler allows you to configure an active write suffix using this property:
gg.handler.name.fileWriteActiveSuffix=.tmp
The value of this property is appended to the generated file name. When writing to the file is complete, the data file is renamed and the active write suffix is removed from the file name. You can set your third party application to monitor your data file names to identify when the active write suffix is removed.
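A downstream process can treat any file that lacks the active write suffix as complete. A minimal polling sketch, assuming the `.tmp` suffix from the example above:

```python
import os

def completed_files(directory, active_suffix=".tmp"):
    """Return data files that are safe to read: writing is complete once
    the rename at the finalize action removes the active write suffix."""
    return sorted(
        name for name in os.listdir(directory)
        if not name.endswith(active_suffix)
    )
```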
5.1.1.4 Maintenance of State
Previously, all Oracle GoldenGate for Big Data handlers were stateless. A stateless handler maintains state only in the context of the Replicat process in which it is running. If the Replicat process is stopped and restarted, all of that state is lost, and the handler begins writing with no contextual knowledge of the previous run.
The File Writer Handler can maintain state between invocations of the Replicat process. By default, on a restart:
- the saved state files are read,
- the state is restored,
- and appending to the active data files continues where the previous run stopped.
You can change this default action to require all files be rolled on shutdown by setting this property:
gg.handler.name.rollOnShutdown=true
5.1.1.5 Using Templated Strings
Templated strings can contain a combination of string constants and keywords that are dynamically resolved at runtime. The File Writer Handler makes extensive use of templated strings to generate directory names and data file names. Templated strings give you the flexibility to select where to write data files and what to name them. Exercise caution when choosing file and directory names to avoid naming collisions, which can result in an abend.
Supported Templated Strings

| Keyword | Description |
|---|---|
| `${fullyQualifiedTableName}` | The fully qualified source table name, delimited by a period (`.`). |
| `${catalogName}` | The individual source catalog name. |
| `${schemaName}` | The individual source schema name. |
| `${tableName}` | The individual source table name. |
| `${groupName}` | The name of the Replicat process (with the thread number appended if you're using coordinated apply). |
| `${emptyString}` | Evaluates to an empty string (`""`). |
| `${operationCount}` | The total count of operations in the data file. It must be used either on rename or by the event handlers or it will be zero (`0`). |
| `${insertCount}` | The total count of insert operations in the data file. It must be used either on rename or by the event handlers or it will be zero (`0`). |
| `${updateCount}` | The total count of update operations in the data file. It must be used either on rename or by the event handlers or it will be zero (`0`). |
| `${deleteCount}` | The total count of delete operations in the data file. It must be used either on rename or by the event handlers or it will be zero (`0`). |
| `${truncateCount}` | The total count of truncate operations in the data file. It must be used either on rename or by the event handlers or it will be zero (`0`). |
| `${currentTimestamp}` | The current timestamp. The output format uses the syntax defined in the Java `SimpleDateFormat` class. |
| `${toUpperCase[]}` | Converts the contents inside the square brackets to uppercase. |
| `${toLowerCase[]}` | Converts the contents inside the square brackets to lowercase. |
Configuration of template strings can use a mix of keywords and static strings to assemble path and data file names at runtime.
Requirements
The directory and file names generated using the templates must be legal on the file system being written to. File names must be unique to avoid a file name collision. You can avoid a collision by adding a current timestamp using the ${currentTimestamp} keyword. If you are using coordinated apply, then adding ${groupName} to the data file name is recommended.
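Resolution of a templated string can be pictured as simple keyword substitution. The sketch below is illustrative, not the GoldenGate implementation, and supports only a few of the keywords:

```python
import re
from datetime import datetime

def resolve_template(template, table, group, now=None):
    """Resolve a small subset of the templated-string keywords (illustrative)."""
    now = now or datetime.now()
    values = {
        "fullyQualifiedTableName": table,
        "schemaName": table.split(".")[0],
        "tableName": table.split(".")[-1],
        "groupName": group,
        "currentTimestamp": now.strftime("%Y-%m-%d_%H-%M-%S"),
        "emptyString": "",
    }
    # Resolve plain ${keyword} tokens first (unknown keywords are left as-is)
    result = re.sub(r"\$\{(\w+)\}",
                    lambda m: values.get(m.group(1), m.group(0)), template)
    # Then apply the case-conversion functions to their bracketed contents
    result = re.sub(r"\$\{toUpperCase\[(.*?)\]\}",
                    lambda m: m.group(1).upper(), result)
    result = re.sub(r"\$\{toLowerCase\[(.*?)\]\}",
                    lambda m: m.group(1).lower(), result)
    return result
```

For example, `${toLowerCase[${tableName}]}.txt` first resolves the inner keyword and then lowercases the bracketed contents.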
5.1.2 Configuring the File Writer Handler
Lists the configurable values for the File Writer Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the File Writer Handler, you must first configure the handler type by specifying gg.handler.name.type=filewriter and the other File Writer Handler properties as follows:
Table 5-1 File Writer Handler Configuration Properties

| Properties | Required/Optional | Legal Values | Default | Explanation |
|---|---|---|---|---|
| `gg.handler.name.type` | Required | `filewriter` | None | Selects the File Writer Handler for use. |
| `gg.handler.name.maxFileSize` | Optional | The default unit of measure is bytes. You can stipulate `k`, `m`, or `g` to signify kilobytes, megabytes, or gigabytes. | `1g` | Sets the maximum file size of files generated by the File Writer Handler. When the file size is exceeded, a roll event is triggered. |
| `gg.handler.name.fileRollInterval` | Optional | The default unit of measure is milliseconds. You can stipulate `ms`, `s`, `m`, or `h` to signify milliseconds, seconds, minutes, or hours. | File rolling on time is off. | The timer starts when a file is created. If the file is still open when the interval elapses, then a file roll event is triggered. |
| `gg.handler.name.inactivityRollInterval` | Optional | The default unit of measure is milliseconds. You can stipulate `ms`, `s`, `m`, or `h` to signify milliseconds, seconds, minutes, or hours. | File inactivity rolling is turned off. | The timer starts from the latest write to a generated file; new writes to a generated file restart the timer. If the file is still open when the timer elapses, a roll event is triggered. |
| `gg.handler.name.fileNameMappingTemplate` | Required | A string with resolvable keywords and constants used to dynamically generate File Writer Handler data file names at runtime. | None | Use keywords interlaced with constants to dynamically generate unique file names at runtime. Typically, file names follow the format `${fullyQualifiedTableName}_${currentTimestamp}.txt`. |
| `gg.handler.name.pathMappingTemplate` | Required | A string with resolvable keywords and constants used to dynamically generate the directory to which a file is written. | None | Use keywords interlaced with constants to dynamically generate unique path names at runtime. |
| `gg.handler.name.fileWriteActiveSuffix` | Optional | A string. | None | An optional suffix that is appended to files generated by the File Writer Handler to indicate that writing to the file is active. At the finalize action, the suffix is removed. |
| `gg.handler.name.stateFileDirectory` | Required | A directory on the local machine to store the state files of the File Writer Handler. | None | Sets the directory on the local machine to store the state files of the File Writer Handler. The group name is appended to the directory to ensure that the functionality works when operating in a coordinated apply environment. |
| `gg.handler.name.rollOnShutdown` | Optional | `true` or `false` | `false` | Set to `true` to roll all open files on a normal shutdown of the Replicat process. |
| `gg.handler.name.finalizeAction` | Optional | `none`, `delete`, `move`, `rename`, or `move-rename` | `none` | Indicates what the File Writer Handler should do at the finalize action. See Using File Roll Events for descriptions of the options. |
| `gg.handler.name.partitionByTable` | Optional | `true` or `false` | `true` | Set to `true` to partition the generated data files by source table. |
| `gg.handler.name.eventHandler` | Optional | A unique string identifier cross referencing an event handler. | No event handler configured. | A unique string identifier cross referencing an event handler. The event handler is invoked on the file roll event. Event handlers can perform file roll event actions such as loading files to S3, converting to Parquet or ORC format, or loading files to HDFS. |
| `gg.handler.name.fileRenameMappingTemplate` | Required if `finalizeAction` is `rename` or `move-rename`. | A string with resolvable keywords and constants used to dynamically generate File Writer Handler data file names for file renaming in the finalize action. | None | Use keywords interlaced with constants to dynamically generate unique file names at runtime. |
| `gg.handler.name.movePathMappingTemplate` | Required if `finalizeAction` is `move` or `move-rename`. | A string with resolvable keywords and constants used to dynamically generate the directory to which a file is moved. | None | Use keywords interlaced with constants to dynamically generate unique path names at runtime. |
| `gg.handler.name.format` | Required | `delimitedtext`, `json`, `json_row`, `xml`, `avro_row`, `avro_op`, `avro_row_ocf`, or `avro_op_ocf` | `delimitedtext` | Selects the formatter that determines how output data is formatted. If you want to use the Parquet or ORC Event Handlers, then the selected format must be `avro_row_ocf` or `avro_op_ocf`. |
| `gg.handler.name.bom` | Optional | An even number of hex characters. | None | Enter an even number of hex characters where every two characters correspond to a single byte in the byte order mark (BOM). For example, the string `efbbbf` represents the UTF-8 BOM. |
| `gg.handler.name.createControlFile` | Optional | `true` or `false` | `false` | Set to `true` to create a control file. A control file contains a list of the completed data files. |
| `gg.handler.name.controlFileDelimiter` | Optional | Any string | new line (`\n`) | Allows you to control the delimiter separating file names in the control file. |
| `gg.handler.name.controlFileDirectory` | Optional | A path to a directory to hold the control file. | A period (`.`) | Set to specify where you want to write the control file. |
| `gg.handler.name.createOwnerFile` | Optional | `true` or `false` | `false` | Set to `true` to create an owner file while the Replicat process is writing to a data file. |
| `gg.handler.name.atTime` | Optional | One or more times at which to trigger a roll action of all open files. | None | Configure one or more trigger times in the following format: `HH:MM,HH:MM,HH:MM`. Entries are based on a 24-hour clock. For example, an entry to configure roll actions at three discrete times of day is: `gg.handler.fw.atTime=03:30,21:00,23:51`. |
| | Optional | | No compression. | Enables the corresponding compression algorithm for generated Avro OCF files. The corresponding compression library must be added to the `gg.classpath` configuration. |
| | Optional | A positive integer >= 512. | | Sets the size of the write buffer. |
5.1.3 Review a Sample Configuration
This File Writer Handler configuration example uses the Parquet Event Handler to convert data files to Parquet, and then the S3 Event Handler to load the Parquet files into S3:
gg.handlerlist=filewriter
#The handler properties
gg.handler.name.type=filewriter
gg.handler.name.mode=op
gg.handler.name.pathMappingTemplate=./dirout
gg.handler.name.stateFileDirectory=./dirsta
gg.handler.name.fileNameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}.txt
gg.handler.name.fileRollInterval=7m
gg.handler.name.finalizeAction=delete
gg.handler.name.inactivityRollInterval=7m
gg.handler.name.format=avro_row_ocf
gg.handler.name.includetokens=true
gg.handler.name.partitionByTable=true
gg.handler.name.eventHandler=parquet
gg.handler.name.rollOnShutdown=true
gg.eventhandler.parquet.type=parquet
gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
gg.eventhandler.parquet.writeToHDFS=false
gg.eventhandler.parquet.finalizeAction=delete
gg.eventhandler.parquet.eventHandler=s3
gg.eventhandler.parquet.fileNameMappingTemplate=${tableName}_${currentTimestamp}.parquet
gg.eventhandler.s3.type=s3
gg.eventhandler.s3.region=us-west-2
gg.eventhandler.s3.proxyServer=www-proxy.us.oracle.com
gg.eventhandler.s3.proxyPort=80
gg.eventhandler.s3.bucketMappingTemplate=tomsfunbucket
gg.eventhandler.s3.pathMappingTemplate=thepath
gg.eventhandler.s3.finalizeAction=none
goldengate.userexit.writers=javawriter