This chapter explains the HDFS functionality, and includes examples that you can use to understand this functionality. The Oracle GoldenGate for Big Data Handler for HDFS is designed to stream change capture data into the Hadoop Distributed File System (HDFS).
This chapter includes the following sections:
Hadoop Distributed File System (HDFS) is the primary application for Big Data. Hadoop is typically installed on multiple machines which work together as a Hadoop cluster. Hadoop allows users to store very large amounts of data in the cluster that is horizontally scaled across the machines in the cluster. You can then perform analytics on that data using a variety of Big Data applications.
The Oracle GoldenGate for Big Data 12.2.0.1 release does not include a Hive Handler as was included in the Oracle GoldenGate for Big Data 12.1.2.1.x releases. The 12.1.2.1.x Hive Handler actually provided no direct integration with Hive. The functionality of the Hive Handler was to load operation data from the source trail file into HDFS, partitioned by table, in a Hive friendly delimited text format. The 12.2.0.1 HDFS Handler provides all of the functionality of the previous 12.1.2.1.x Hive Handler.
Hive integration to create tables and update table definitions in the case of DDL events is possible. This functionality is limited to only data formatted as Avro Object Container File format. For more information, see Writing in HDFS in Avro Object Container File Format and HDFS Handler Configuration.
The HDFS SequenceFile is a flat file consisting of binary key and value pairs. You can enable writing data in SequenceFile format by setting the gg.handler.name.format
property to sequencefile
. The key
part of the record is set to null and the actual data is set in the value
part.
For information about Hadoop SequenceFile, see https://wiki.apache.org/hadoop/SequenceFile.
DDL to create Hive tables should include STORED as sequencefile
for Hive to consume Sequence Files. Following is a sample create table script:
CREATE EXTERNAL TABLE table_name (
col1 string,
...
...
col2 string)
ROW FORMAT DELIMITED
STORED as sequencefile
LOCATION '/path/to/hdfs/file';
Note:
If files are intended to be consumed by Hive, then the gg.handler.name.partitionByTable
property should be set to true
.
The data written in the value
part of each record and is in delimited text format. All of the options described in the Delimited Text Formatter section are applicable to HDFS SequenceFile when writing data to it.
For example:
gg.handler.name.format=sequencefile gg.handler.name.format.includeColumnNames=true gg.handler.name.format.includeOpType=true gg.handler.name.format.includeCurrentTimestamp=true gg.handler.name.format.updateOpKey=U
In order to successfully run the HDFS Handler, a Hadoop single instance or Hadoop cluster must be installed, running, and network accessible from the machine running the HDFS Handler. Apache Hadoop is open source and available for download at http://hadoop.apache.org/
. Follow the Getting Started links for information on how to install a single-node cluster (also called pseudo-distributed operation mode) or a clustered setup (also called fully-distributed operation mode).
Two things must be configured in the gg.classpath
configuration variable in order for the HDFS Handler to connect to HDFS and run. The first thing is the HDFS core-site.xml
file and the second are the HDFS client jars. The HDFS client jars must match the version of HDFS that the HDFS Handler is connecting. For a listing of the required client JAR files by version, see HDFS Handler Client Dependencies.
The default location of the core-site.xml
file is the follow:
Hadoop_Home
/etc/hadoop
The default location of the HDFS client jars are the following directories:
Hadoop_Home
/share/hadoop/common/lib/*
Hadoop_Home
/share/hadoop/common/*
Hadoop_Home
/share/hadoop/hdfs/lib/
*
Hadoop_Home
/share/hadoop/hdfs/*
The gg.classpath
must be configured exactly as shown. Pathing to the core-site.xml
should simply contain the path to the directory containing the core-site.xml
file with no wild card appended. The inclusion of the * wildcard in the path to the core-site.xml
file will cause it not to be picked up. Conversely, pathing to the dependency jars should include the * wild card character in order to include all of the jar files in that directory in the associated classpath. Do not use *.jar
. An example of a correctly configured gg.classpath
variable is the following:
gg.classpath=/ggwork/hadoop/hadoop-2.6.0/etc/hadoop:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/lib/*
The HDFS configuration file hdfs-site.xml
is also required to be in the classpath if Kerberos security is enabled. The hdfs-site.xml
file is by default located in the Hadoop_Home
/etc/hadoop
directory. Either or both files can be copied to another machine if the HDFS Handler is not collocated with Hadoop.
The HDFS Handler supports all of the Big Data pluggable handlers including which includes:
JSON
Delimited Text
Avro Row
Avro Operation
Avro Object Container File Row
Avro Object Container File Operation
XML
For more information about formatters, see Using the Pluggable Formatters
The configuration properties of the Oracle GoldenGate for Big Data HDFS Handler are detailed in this section.
Table 2-1 HDFS Handler Configuration Properties
Property | Optional / Required | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
Any string |
None |
Provides a name for the HDFS Handler. The HDFS Handler name then becomes part of the property names listed in this table. |
|
Required |
|
|
Selects the HDFS Handler for streaming change data capture into HDFS. |
|
Optional |
|
|
Selects operation ( |
|
Optional |
Default unit of measure is bytes. You can stipulate |
|
Selects the maximum file size of created HDFS files. |
|
Optional |
Any path name legal in HDFS. |
|
The HDFS Handler will create subdirectories and files under this directory in HDFS to store the data streaming into HDFS. |
|
Optional |
The default unit of measure is milliseconds. You can stipulate |
File rolling on time is off. |
The timer starts when an HDFS file is created. If the file is still open when the interval elapses then the file will be closed. A new file will not be immediately opened. New HDFS files are created on a just in time basis. |
|
Optional |
The default unit of measure is milliseconds. You can stipulate |
File inactivity rolling on time is off. |
The timer starts from the latest write to an HDFS file. New writes to an HDFS file restart the counter. If the file is still open when the counter elapses the HDFS file will be closed. A new file will not be immediately opened. New HDFS files are created on a just in time basis. |
|
Optional |
Any string conforming to HDFS file name restrictions. |
|
This is a suffix that is added on to the end of the HDFS file names. File names typically follow the format, |
|
Optional |
|
|
Determines if data written into HDFS should be partitioned by table. If set to Must be set to |
|
Optional |
|
|
Determines if HDFS files should be rolled in the case of a metadata change. True means the HDFS file is rolled, false means the HDFS file is not rolled. Must be set to |
|
Optional |
|
|
Selects the formatter for the HDFS Handler for how output data will be formatted
|
|
Optional |
|
|
Set to |
Equals one or more column names separated by commas. |
Optional |
Fully qualified table name and column names must exist. |
|
This partitions the data into subdirectories in HDFS in the following format, |
|
Optional |
kerberos |
|
Setting this property to
kerberosenables Kerberos authentication. |
|
Optional (Required if
authType=Kerberos) |
Relative or absolute path to a Kerberos |
|
The |
|
Optional (Required if
authType=Kerberos) |
A legal Kerberos principal name like |
|
The Kerberos principal name for Kerberos authentication. |
|
Optional |
|
Set to a legal path in HDFS so that schemas (if available) are written in that HDFS directory. Schemas are currently only available for Avro and JSON formatters. In the case of a metadata change event, the schema will be overwritten to reflect the schema change. |
|
Applicable to Sequence File Format only. |
Optional |
|
|
Hadoop Sequence File Compression Type. applicable only if |
Applicable to Sequence File and writing to HDFS is Avro OCF formats only. |
Optional |
|
|
Hadoop Sequence File Compression Codec. applicable only if |
Optional |
|
|
Avro OCF Formatter Compression Code. This configuration controls the selection of the compression library to be used for Avro OCF files generated. Snappy includes native binaries in the Snappy JAR file and performs a Java-native traversal when performing compression or decompression. Use of Snappy may introduce runtime issue and platform porting issues that you may not experience when working with Java. You may need to perform additional testing to ensure Snappy works on all of your required platforms. Snappy is an open source library so Oracle cannot guarantee its ability to operate on all of your required platforms. |
|
|
Optional |
A legal URL for connecting to Hive using the Hive JDBC interface. |
|
Only applicable to the Avro Object Container File (OCF) Formatter. This configuration value provides a JDBC URL for connectivity to Hive through the Hive JDBC interface. Use of this property requires that you include the Hive JDBC library in the Hive JDBC connectivity can be secured through basic credentials, SSL/TLS, or Kerberos. Configuration properties are provided for the user name and password for basic credentials. See the Hive documentation for how to generate a Hive JDBC URL for SSL/TLS. See the Hive documentation for how to generate a Hive JDBC URL for Kerberos. (If Kerberos is used for Hive JDBC security, it must be enabled for HDFS connectivity. Then the Hive JDBC connection can piggyback on the HDFS Kerberos functionality by using the same Kerberos principal.) |
|
Optional |
A legal user name if the Hive JDBC connection is secured through credentials. |
Java call result from |
Only applicable to the Avro Object Container File (OCF) Formatter. This property is only relevant if the |
|
Optional |
The fully qualified Hive JDBC driver class name. |
|
Only applicable to the Avro Object Container File (OCF) Formatter. This property is only relevant if the |
The following is sample configuration for the HDFS Handler from the Java Adapter properties file:
gg.handlerlist=hdfs gg.handler.hdfs.type=hdfs gg.handler.hdfs.mode=tx gg.handler.hdfs.includeTokens=false gg.handler.hdfs.maxFileSize=1g gg.handler.hdfs.rootFilePath=/ogg gg.handler.hdfs.fileRollInterval=0 gg.handler.hdfs.inactivityRollInterval=0 gg.handler.hdfs.fileSuffix=.txt gg.handler.hdfs.partitionByTable=true gg.handler.hdfs.rollOnMetadataChange=true gg.handler.hdfs.authType=none gg.handler.hdfs.format=delimitedtext
A sample Replicat configuration and a Java Adapter Properties file for an HDFS integration can be found at the following directory:
GoldenGate_install_directory
/AdapterExamples/big-data/hdfs
Troubleshooting of the HDFS Handler begins with the contents for the Java log4j
file. Follow the directions in the Java Logging Configuration to configured the runtime to correctly generate the Java log4j
log file.
As previously stated, issues with the Java classpath are one of the most common problems. The usual indication of a Java classpath problem is a ClassNotFoundException
in the Java log4j
log file. The Java log4j
log file can be used to troubleshoot this issue. Setting the log level to DEBUG
allows for logging of each of the jars referenced in the gg.classpath
object to be logged to the log file. In this way, you can ensure that all of the required dependency jars are resolved. Simply enable DEBUG
level logging and search the log file for messages like the following:
2015-09-21 10:05:10 DEBUG ConfigClassPath:74 - ...adding to classpath: url="file:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/guava-11.0.2.jar
The contents of the HDFS core-site.xml
file (including default settings) are output to the Java log4j
log file when the logging level is set to DEBUG
or TRACE
. This will show the connection properties to HDFS. Search for the following in the Java log4j
log file:
2015-09-21 10:05:11 DEBUG HDFSConfiguration:58 - Begin - HDFS configuration object contents for connection troubleshooting.
If the fs.defaultFS
property is set as follows (pointing at the local file system) then the core-site.xml
file is not properly set in the gg.classpath
property.
Key: [fs.defaultFS] Value: [file:///].
This shows to the fs.defaultFS
property properly pointed at and HDFS host and port.
Key: [fs.defaultFS] Value: [hdfs://hdfshost:9000].
The Java log4j
log file contains information on the configuration state of the HDFS Handler and the selected formatter. This information is output at the INFO
log level. Sample output appears as follows:
2015-09-21 10:05:11 INFO AvroRowFormatter:156 - **** Begin Avro Row Formatter - Configuration Summary **** Operation types are always included in the Avro formatter output. The key for insert operations is [I]. The key for update operations is [U]. The key for delete operations is [D]. The key for truncate operations is [T]. Column type mapping has been configured to map source column types to an appropriate corresponding Avro type. Created Avro schemas will be output to the directory [./dirdef]. Created Avro schemas will be encoded using the [UTF-8] character set. In the event of a primary key update, the Avro Formatter will ABEND. Avro row messages will not be wrapped inside a generic Avro message. No delimiter will be inserted after each generated Avro message. **** End Avro Row Formatter - Configuration Summary **** 2015-09-21 10:05:11 INFO HDFSHandler:207 - **** Begin HDFS Handler - Configuration Summary **** Mode of operation is set to tx. Data streamed to HDFS will be partitioned by table. Tokens will be included in the output. The HDFS root directory for writing is set to [/ogg]. The maximum HDFS file size has been set to 1073741824 bytes. Rolling of HDFS files based on time is configured as off. Rolling of HDFS files based on write inactivity is configured as off. Rolling of HDFS files in the case of a metadata change event is enabled. HDFS partitioning information: The HDFS partitioning object contains no partitioning information. HDFS Handler Authentication type has been configured to use [none] **** End HDFS Handler - Configuration Summary ****
The HDFS Handler calls the HDFS flush method on the HDFS write stream to flush data to the HDFS datanodes at the end of each transaction in order to maintain write durability. This is an expensive call. Performance can be adversely affected especially in the case of transactions of one or few operations that results in numerous HDFS flush calls.
Performance of the HDFS Handler can be greatly improved by batching multiple small transactions into a single larger transaction. If you have requirements for high performance, you should configure batching functionality provided by either the Extract
process or the Replicat
process. For more information, see the Replicat Grouping section.
The HDFS client libraries spawn threads for every HDFS file stream opened by the HDFS Handler. The result is that the number threads executing in the JMV grows proportionally to the number HDFS file streams that are open. Performance of the HDFS Handler can degrade as more HDFS file streams are opened. Configuring the HDFS Handler to write to many HDFS files due to many source replication tables or extensive use of partitioning can result in degraded performance. If the use case requires writing to many tables, then you are advised to enable the roll on time or roll on inactivity features to close HDFS file streams. Closing an HDFS file stream causes the HDFS client threads to terminate and the associated resources can be reclaimed by the JVM.
The HDFS cluster can be secured using Kerberos authentication. Refer to the HDFS documentation for how to secure a Hadoop cluster using Kerberos. The HDFS Handler can connect to Kerberos secured cluster. The HDFS core-site.xml
should be in the handlers classpath with the hadoop.security.authentication
property set to kerberos
and hadoop.security.authorization
property set to true
. Additionally, you must set the following properties in the HDFS Handler Java configuration file:
gg.handler.name.authType=kerberos gg.handler.name.keberosPrincipalName=legal Kerberos principal name gg.handler.name.kerberosKeytabFile=path to a keytab file that contains the password for the Kerberos principal so that the HDFS Handler can programmatically perform the Kerberos kinit operations to obtain a Kerberos ticket
The HDFS Handler includes specialized functionality to write to HDFS in Avro Object Container File (OCF) format. This Avro OCF is part of the Avro specification and is detailed in the Avro Documentation at
https://avro.apache.org/docs/current/spec.html#Object+Container+Files
Avro OCF format may be a good choice for you because it
integrates with Apache Hive (raw Avro written to HDFS is not supported by Hive)
and provides good support for schema evolution. Configure the following to enable writing to HDFS in Avro OCF format.
To write row data to HDFS in Avro OCF format configure the gg.handler.name.format=avro_row_ocf
property.
To write operation data to HDFS is Avro OCF format configure the gg.handler.name.format=avro_op_ocf
property.
The HDFS/Avro OCF integration includes optional functionality to create the corresponding tables in Hive and update the schema for metadata change events. The configuration section provides information on the properties to enable integration with Hive. The Oracle GoldenGate Hive integration accesses Hive using the JDBC interface so the Hive JDBC server must be running to enable this integration.
The 12.2.0.1 Oracle GoldenGate for Big Data HDFS Handler is designed to work with the following versions of Apache Hadoop:
2.7.x
2.6.0
2.5.x
2.4.x
2.3.0
2.2.0
The HDFS Handler also works with the following versions of the Hortonworks Data Platform (HDP) that simply packages Apache Hadoop with it:
HDP 2.4 (HDFS 2.7.1)
HDP 2.3 (HDFS 2.7.1)
HDP 2.2 (HDFS 2.6.0)
HDP 2.1 (HDFS 2.4.0)
HDP 2.0 (HDFS 2.2.0)
The HDFS Handler also works with the following versions of Cloudera Distribution including Apache Hadoop (CDH):
CDH 5.7.x (HDFS 2.6.0)
CDH 5.6.x (HDFS 2.6.0)
CDH 5.5.x (HDFS 2.6.0)
CDH 5.4.x (HDFS 2.6.0)
CDH 5.3. (HDFS 2.5.0)
CDH 5.2.x (HDFS 2.5.0)
CDH 5.1.x (HDFS 2.3.0)
Metadata change events are now handled in the HDFS Handler. The default behavior of the HDFS Handler is to roll the current relevant file in the event of a metadata change event. This behavior allows for the results of metadata changes to at least be separated into different files. File rolling on metadata change is configurable and can be turned off.
To support metadata change events the process capturing changes in the source database must support both DDL changes and metadata in trail. Oracle GoldenGate does not support DDL replication for all database implementations. You should consult the Oracle GoldenGate documentation for their database implementation to understand if DDL replication is supported.
The HDFS Handler supports partitioning of table data by one or more column values. The configuration syntax to enable partitioning is the following:
gg.handler.name.partitioner.fully qualified table name=one mor more column names separated by commas
Consider the following example:
gg.handler.hdfs.partitioner.dbo.orders=sales_region
This example can result in the following breakdown of files in HDFS:
/ogg/dbo.orders/par_sales_region=west/data files /ogg/dbo.orders/par_sales_region=east/data files /ogg/dbo.orders/par_sales_region=north/data files /ogg/dbo.orders/par_sales_region=south/data files
Care should be exercised when choosing columns for partitioning. The key is to choose columns that contain only a few (10 or less) possible values and those values are also meaningful for the grouping and analysis of the data. An example of a good partitioning column might be sales regions. An example of a poor partitioning column might be customer date of birth. Configuring partitioning on a column that has many possible values can be problematic. A poor choice can result in hundreds of HDFS file streams being opened and performance can degrade for the reasons discussed in the Performance section. Additionally, poor partitioning can result in problems while performing analysis on the data. Apache Hive requires that all where clauses specify partition criteria if the Hive data is partitioned.
The most common problems encountered are Java classpath issues. The Oracle HDFS Handler requires certain HDFS client libraries to be resolved in its classpath as a prerequisite for streaming data to HDFS.
For a listing of the required client JAR files by version, see HDFS Handler Client Dependencies. The HDFS client jars do not ship with the Oracle GoldenGate for Big Data product. The HDFS Handler supports multiple versions of HDFS and it is required that the HDFS client jars be the same version as the HDFS version to which the HDFS Handler is connecting. The HDFS client jars are open source and freely available to download from sites such as the Apache Hadoop site or the maven central repository.
In order to establish connectivity to HDFS, the HDFS core-site.xml
file needs to be in the classpath of the HDFS Handler. If the core-site.xml
file is not in the classpath the HDFS client code defaults to a mode that attempts to write to the local file system. Writing to the local file system instead of HDFS can in fact be an advantageous for troubleshooting, building a point of contact (POC), or as a step in the process of building an HDFS integration.
Another common concern is that data streamed to HDFS using the HDFS Handler is often not immediately available to Big Data analytic tools such as Hive. This behavior commonly occurs when the HDFS Handler is in possession of an open write stream to an HDFS file. HDFS writes in blocks of 128MB by default. HDFS blocks under construction are not always visible to analytic tools. Additionally, inconsistencies between file sizes when using the -ls
, -cat
, and -get
commands in the HDFS shell are commonly seen. This is an anomaly of HDFS streaming and is discussed in the HDFS specification. This anomaly of HDFS leads to a potential 128MB per file blind spot in analytic data. This may not be an issue if you have a steady stream of Replication data and do not require low levels of latency for analytic data from HDFS. However, this may be a problem in some use cases. Closing the HDFS write stream causes the block writing to finalize. Data is immediately visible to analytic tools and file sizing metrics become consistent again. So the new file rolling feature in the HDFS Handler can be used to close HDFS writes streams thus making all data visible.
Caution:
The file rolling solution may present its own potential problems. Extensive use of file rolling can result in lots of small files in HDFS. Lots of small files in HDFS can be its own problem resulting in performance issues in analytic tools.
You may also notice the HDFS inconsistency problem in the following scenarios.
The HDFS Handler process crashes.
A forced shutdown is called on the HDFS Handler process.
A network outage or some other issue causes the HDFS Handler process to abend.
In each of these scenarios it is possible for the HDFS Handler to end without explicitly closing the HDFS write stream and finalizing the writing block. HDFS in its internal process will ultimately recognize that the write stream has been broken and HDFS will finalize the write block. However, in this scenario, users may experience a short term delay before the HDFS process finalizes the write block.
It is considered a Big Data best practice for the HDFS cluster to operate on dedicated servers called cluster nodes. Edge nodes are server machines that host the applications to stream data to and retrieve data from the HDFS cluster nodes. This physical architecture delineation between the HDFS cluster nodes and the edge nodes provides a number of benefits including the following:
The HDFS cluster nodes are not competing for resources with the applications interfacing with the cluster.
HDFS cluster nodes and edge nodes likely have different requirements. This physical topology allows the appropriate hardware to be tailored to the specific need.
It is a best practice for the HDFS Handler to be installed and running on an edge node and streaming data to the HDFS cluster using network connection. The HDFS Handler can run on any machine that has network visibility to the HDFS cluster. The installation of the HDFS Handler on an edge node requires that the core-site.xml
files and the dependency jars be copied to the edge node so that the HDFS Handler can access them. The HDFS Handler can also run collocated on a HDFS cluster node if required.