2 Using the HDFS Handler

This chapter explains the HDFS Handler functionality and includes examples that you can use to understand it. The Oracle GoldenGate for Big Data Handler for HDFS is designed to stream change capture data into the Hadoop Distributed File System (HDFS).

This chapter includes the following sections:

  • Overview

  • Hive Handler Support

  • Writing into HDFS in Sequence File Format

  • Runtime Prerequisites

  • Writing in HDFS in Avro Object Container File Format

  • HDFS Handler Certification Matrix

  • Metadata Change Events

  • Partitioning

  • Common Pitfalls

  • Best Practices

2.1 Overview

The Hadoop Distributed File System (HDFS) is the primary file system for Big Data. Hadoop is typically installed on multiple machines that work together as a Hadoop cluster. Hadoop allows you to store very large amounts of data in the cluster, horizontally scaled across the machines in the cluster. You can then perform analytics on that data using a variety of Big Data applications.

2.2 Hive Handler Support

The Oracle GoldenGate for Big Data 12.2.0.1 release does not include a Hive Handler, as was included in the Oracle GoldenGate for Big Data 12.1.2.1.x releases. The 12.1.2.1.x Hive Handler provided no direct integration with Hive. Its function was to load operation data from the source trail file into HDFS, partitioned by table, in a Hive-friendly delimited text format. The 12.2.0.1 HDFS Handler provides all of the functionality of the previous 12.1.2.1.x Hive Handler.

Hive integration to create tables and update table definitions in the case of DDL events is possible. This functionality is limited to data formatted in the Avro Object Container File format. For more information, see Writing in HDFS in Avro Object Container File Format and HDFS Handler Configuration.

2.3 Writing into HDFS in Sequence File Format

The HDFS SequenceFile is a flat file consisting of binary key and value pairs. You can enable writing data in SequenceFile format by setting the gg.handler.name.format property to sequencefile. The key part of the record is set to null and the actual data is set in the value part.

For information about Hadoop SequenceFile, see https://wiki.apache.org/hadoop/SequenceFile.

2.3.1 Integrating with Hive

DDL to create Hive tables should include STORED as sequencefile for Hive to consume Sequence Files. The following is a sample create table script:

CREATE EXTERNAL TABLE table_name (
  col1 string,
  ...
  ...
  col2 string)
ROW FORMAT DELIMITED
STORED as sequencefile
LOCATION '/path/to/hdfs/file';

Note:

If files are intended to be consumed by Hive, then the gg.handler.name.partitionByTable property should be set to true.
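
The following is a minimal sketch of the related Java Adapter properties for producing Hive-consumable SequenceFile output; the handler name hdfs and the root file path are illustrative assumptions:

gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.format=sequencefile
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rootFilePath=/ogg

The LOCATION clause in the Hive DDL would then typically point at the table subdirectory that the HDFS Handler creates under the configured root file path.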

2.3.2 Understanding the Data Format

The data is written in the value part of each record and is in delimited text format. All of the options described in the Delimited Text Formatter section apply to the HDFS SequenceFile when writing data to it.

For example:

gg.handler.name.format=sequencefile
gg.handler.name.format.includeColumnNames=true
gg.handler.name.format.includeOpType=true
gg.handler.name.format.includeCurrentTimestamp=true
gg.handler.name.format.updateOpKey=U

2.4 Runtime Prerequisites

In order to successfully run the HDFS Handler, a Hadoop single instance or Hadoop cluster must be installed, running, and network accessible from the machine running the HDFS Handler. Apache Hadoop is open source and available for download at http://hadoop.apache.org/. Follow the Getting Started links for information on how to install a single-node cluster (also called pseudo-distributed operation mode) or a clustered setup (also called fully-distributed operation mode).

2.4.1 Classpath Configuration

Two things must be configured in the gg.classpath configuration variable so that the HDFS Handler can connect to HDFS and run: the HDFS core-site.xml file and the HDFS client jars. The HDFS client jars must match the version of HDFS that the HDFS Handler is connecting to. For a listing of the required client JAR files by version, see HDFS Handler Client Dependencies.

The default location of the core-site.xml file is the following:

Hadoop_Home/etc/hadoop

The default locations of the HDFS client jars are the following directories:

Hadoop_Home/share/hadoop/common/lib/*

Hadoop_Home/share/hadoop/common/*

Hadoop_Home/share/hadoop/hdfs/lib/*

Hadoop_Home/share/hadoop/hdfs/*

The gg.classpath must be configured exactly as shown. The path to the core-site.xml file should contain only the path to the directory containing the core-site.xml file, with no wildcard appended. Including the * wildcard in the path to the core-site.xml file causes it not to be picked up. Conversely, the paths to the dependency jars should include the * wildcard character in order to include all of the jar files in those directories in the associated classpath. Do not use *.jar. An example of a correctly configured gg.classpath variable is the following:

gg.classpath=/ggwork/hadoop/hadoop-2.6.0/etc/hadoop:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/lib/*

The HDFS configuration file hdfs-site.xml must also be in the classpath if Kerberos security is enabled. By default, the hdfs-site.xml file is located in the Hadoop_Home/etc/hadoop directory. Either or both files can be copied to another machine if the HDFS Handler is not collocated with Hadoop.
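
As an illustration only, if the Hadoop configuration files and client jars are copied to a machine where the HDFS Handler runs remotely from Hadoop, the gg.classpath might point at the copied locations; the /ggwork/hdfsconfig and /ggwork/hdfsjars paths below are hypothetical:

gg.classpath=/ggwork/hdfsconfig:/ggwork/hdfsjars/*

The first entry is the directory containing the copied core-site.xml (and hdfs-site.xml if Kerberos is enabled) with no wildcard, and the second entry uses the * wildcard to pick up all of the copied HDFS client jars.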

2.4.2 Pluggable Formatters

The HDFS Handler supports all of the Big Data pluggable formatters, which include:

  • JSON

  • Delimited Text

  • Avro Row

  • Avro Operation

  • Avro Object Container File Row

  • Avro Object Container File Operation

  • XML

For more information about formatters, see Using the Pluggable Formatters.

2.4.3 HDFS Handler Configuration

The configuration properties of the Oracle GoldenGate for Big Data HDFS Handler are detailed in this section.

Table 2-1 HDFS Handler Configuration Properties

Each property entry lists the property name, whether the property is optional or required, its legal values, its default value, and an explanation.

gg.handlerlist

Required

Any string

None

Provides a name for the HDFS Handler. The HDFS Handler name then becomes part of the property names listed in this table.

gg.handler.name.type=hdfs

Required

-

-

Selects the HDFS Handler for streaming change data capture into HDFS.

gg.handler.name.mode

Optional

tx | op

op

Selects operation (op) mode or transaction (tx) mode for the handler. In almost all scenarios, transaction mode results in better performance.

gg.handler.name.maxFileSize

Optional

Default unit of measure is bytes. You can stipulate k, m, or g to signify kilobytes, megabytes, or gigabytes respectively. Examples of legal values include 10000, 10k, 100m, 1.1g.

1g

Selects the maximum file size of created HDFS files.

gg.handler.name.rootFilePath

Optional

Any path name legal in HDFS.

/ogg

The HDFS Handler will create subdirectories and files under this directory in HDFS to store the data streaming into HDFS.

gg.handler.name.fileRollInterval

Optional

The default unit of measure is milliseconds. You can stipulate ms, s, m, h to signify milliseconds, seconds, minutes, or hours respectively. Examples of legal values include 10000, 10000ms, 10s, 10m, or 1.5h. Values of 0 or less indicate that file rolling on time is turned off.

File rolling on time is off.

The timer starts when an HDFS file is created. If the file is still open when the interval elapses then the file will be closed. A new file will not be immediately opened. New HDFS files are created on a just in time basis.

gg.handler.name.inactivityRollInterval

Optional

The default unit of measure is milliseconds. You can stipulate ms, s, m, h to signify milliseconds, seconds, minutes, or hours respectively. Examples of legal values include 10000, 10000ms, 10s, 10.5m, or 1h. Values of 0 or less indicate that file inactivity rolling on time is turned off.

File inactivity rolling on time is off.

The timer starts from the latest write to an HDFS file. New writes to an HDFS file restart the counter. If the file is still open when the counter elapses the HDFS file will be closed. A new file will not be immediately opened. New HDFS files are created on a just in time basis.

gg.handler.name.fileSuffix

Optional

Any string conforming to HDFS file name restrictions.

.txt

This is a suffix that is added on to the end of the HDFS file names. File names typically follow the format, {fully qualified table name}{current time stamp}{suffix}.

gg.handler.name.partitionByTable

Optional

true | false

true (data is partitioned by table)

Determines whether data written into HDFS is partitioned by table. If set to true, data for different tables is written to different HDFS files. If set to false, data from different tables is interlaced in the same HDFS file.

Must be set to true to use the Avro Object Container File Formatter. Setting it to false results in a configuration exception at initialization.

gg.handler.name.rollOnMetadataChange

Optional

true | false

true (HDFS files are rolled on a metadata change event)

Determines whether HDFS files are rolled in the case of a metadata change. True means the HDFS file is rolled; false means it is not.

Must be set to true to use the Avro Object Container File Formatter. Setting it to false results in a configuration exception at initialization.

gg.handler.name.format

Optional

delimitedtext | json | xml | avro_row | avro_op | avro_row_ocf | avro_op_ocf | sequencefile

delimitedtext

Selects the formatter for the HDFS Handler, which determines how output data is formatted:

  • delimitedtext - Delimited text

  • json - JSON

  • xml - XML

  • avro_row - Avro in row compact format

  • avro_op - Avro in the more verbose operation format.

  • avro_row_ocf - Avro in the row compact format written into HDFS in the Avro Object Container File format.

  • avro_op_ocf - Avro in the more verbose format written into HDFS in the Avro Object Container File format.

  • sequencefile - Delimited text written to HDFS in Hadoop SequenceFile format.

gg.handler.name.includeTokens

Optional

true | false

false

Set to true to include the tokens field and tokens key/values in the output, false to suppress tokens output.

gg.handler.name.partitioner.fully_qualified_table_name

Equals one or more column names separated by commas.

Optional

Fully qualified table name and column names must exist.

-

This partitions the data into subdirectories in HDFS in the following format: par_{column name}={column value}.

gg.handler.name.authType

Optional

kerberos

none

Setting this property to kerberos enables Kerberos authentication.

gg.handler.name.kerberosKeytabFile

Optional (Required if authType=kerberos)

Relative or absolute path to a Kerberos keytab file.

-

The keytab file allows the HDFS Handler to access a password to perform a kinit operation for Kerberos security.

gg.handler.name.kerberosPrincipal

Optional (Required if authType=kerberos)

A legal Kerberos principal name like user/FQDN@MY.REALM.

-

The Kerberos principal name for Kerberos authentication.

gg.handler.name.schemaFilePath

Optional

 

null

Set to a legal path in HDFS so that schemas (if available) are written in that HDFS directory. Schemas are currently only available for Avro and JSON formatters. In the case of a metadata change event, the schema will be overwritten to reflect the schema change.

gg.handler.name.compressionType

Applicable to Sequence File Format only.

Optional

block | none | record

none

Hadoop Sequence File compression type. Applicable only if gg.handler.name.format is set to sequencefile.

gg.handler.name.compressionCodec

Applicable to writing to HDFS in Sequence File and Avro OCF formats only.

Optional

org.apache.hadoop.io.compress.DefaultCodec | org.apache.hadoop.io.compress.BZip2Codec | org.apache.hadoop.io.compress.SnappyCodec | org.apache.hadoop.io.compress.GzipCodec

org.apache.hadoop.io.compress.DefaultCodec

Hadoop Sequence File compression codec. Applicable only if gg.handler.name.format is set to sequencefile.

gg.handler.name.compressionCodec

Applicable to writing to HDFS in Avro OCF format only.

Optional

null | snappy | bzip2 | xz | deflate

null

Avro OCF Formatter compression codec. This configuration controls the selection of the compression library used for generated Avro OCF files.

Snappy includes native binaries in the Snappy JAR file and performs a Java-native traversal when performing compression or decompression. Use of Snappy may introduce runtime issues and platform porting issues that you may not experience when working with pure Java. You may need to perform additional testing to ensure that Snappy works on all of your required platforms. Snappy is an open source library, so Oracle cannot guarantee its ability to operate on all of your required platforms.

gg.handler.name.hiveJdbcUrl

Optional

A legal URL for connecting to Hive using the Hive JDBC interface.

null (Hive integration disabled)

Only applicable to the Avro Object Container File (OCF) Formatter.

This configuration value provides a JDBC URL for connectivity to Hive through the Hive JDBC interface. Use of this property requires that you include the Hive JDBC library in the gg.classpath.

Hive JDBC connectivity can be secured through basic credentials, SSL/TLS, or Kerberos. Configuration properties are provided for the user name and password for basic credentials.

See the Hive documentation for how to generate a Hive JDBC URL for SSL/TLS.

See the Hive documentation for how to generate a Hive JDBC URL for Kerberos. (If Kerberos is used for Hive JDBC security, it must be enabled for HDFS connectivity. Then the Hive JDBC connection can piggyback on the HDFS Kerberos functionality by using the same Kerberos principal.)

gg.handler.name.hiveJdbcUserName

Optional

A legal user name if the Hive JDBC connection is secured through credentials.

Java call result from System.getProperty(user.name)

Only applicable to the Avro Object Container File (OCF) Formatter.

This property is only relevant if the hiveJdbcUrl property is set. It may be required in your environment when the Hive JDBC connection is secured through credentials. Hive requires that Hive DDL operations be associated with a user. If you do not set the value, it defaults to the result of the Java call System.getProperty(user.name).

gg.handler.name.hiveJdbcDriver

Optional

The fully qualified Hive JDBC driver class name.

org.apache.hive.jdbc.HiveDriver

Only applicable to the Avro Object Container File (OCF) Formatter.

This property is only relevant if the hiveJdbcUrl property is set. The default is the Hive Hadoop2 JDBC driver name. Typically, this property does not require configuration and is provided for use if Apache Hive introduces a new JDBC driver class.

2.4.4 Sample Configuration

The following is a sample configuration for the HDFS Handler from the Java Adapter properties file:

gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.mode=tx
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/ogg
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.txt
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext

A sample Replicat configuration and a Java Adapter properties file for an HDFS integration can be found in the following directory:

GoldenGate_install_directory/AdapterExamples/big-data/hdfs

2.4.5 Troubleshooting the HDFS Handler

Troubleshooting of the HDFS Handler begins with the contents of the Java log4j file. Follow the directions in the Java Logging Configuration to configure the runtime to correctly generate the Java log4j log file.

2.4.5.1 Java Classpath

As previously stated, issues with the Java classpath are one of the most common problems. The usual indication of a Java classpath problem is a ClassNotFoundException in the Java log4j log file. The Java log4j log file can be used to troubleshoot this issue. Setting the log level to DEBUG causes each of the jars referenced in the gg.classpath object to be logged to the log file. In this way, you can ensure that all of the required dependency jars are resolved. Simply enable DEBUG level logging and search the log file for messages like the following:

2015-09-21 10:05:10 DEBUG ConfigClassPath:74 - ...adding to classpath: url="file:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/guava-11.0.2.jar
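
The exact logging setup depends on your environment (see the Java Logging Configuration documentation), but as a sketch, DEBUG logging is commonly enabled through the Java Adapter properties file with settings similar to the following:

gg.log=log4j
gg.log.level=DEBUG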

2.4.5.2 HDFS Connection Properties

The contents of the HDFS core-site.xml file (including default settings) are output to the Java log4j log file when the logging level is set to DEBUG or TRACE. This will show the connection properties to HDFS. Search for the following in the Java log4j log file:

2015-09-21 10:05:11 DEBUG HDFSConfiguration:58 - Begin - HDFS configuration object contents for connection troubleshooting.

If the fs.defaultFS property is set as follows (pointing at the local file system), then the core-site.xml file is not properly set in the gg.classpath property:

  Key: [fs.defaultFS] Value: [file:///].  

The following shows the fs.defaultFS property properly pointing at an HDFS host and port:

Key: [fs.defaultFS] Value: [hdfs://hdfshost:9000].

2.4.5.3 Handler and Formatter Configuration

The Java log4j log file contains information on the configuration state of the HDFS Handler and the selected formatter. This information is output at the INFO log level. Sample output appears as follows:

2015-09-21 10:05:11 INFO  AvroRowFormatter:156 - **** Begin Avro Row Formatter -
 Configuration Summary ****
  Operation types are always included in the Avro formatter output.
    The key for insert operations is [I].
    The key for update operations is [U].
    The key for delete operations is [D].
    The key for truncate operations is [T].
  Column type mapping has been configured to map source column types to an
 appropriate corresponding Avro type.
  Created Avro schemas will be output to the directory [./dirdef].
  Created Avro schemas will be encoded using the [UTF-8] character set.
  In the event of a primary key update, the Avro Formatter will ABEND.
  Avro row messages will not be wrapped inside a generic Avro message.
  No delimiter will be inserted after each generated Avro message.
**** End Avro Row Formatter - Configuration Summary ****
 
2015-09-21 10:05:11 INFO  HDFSHandler:207 - **** Begin HDFS Handler -
 Configuration Summary ****
  Mode of operation is set to tx.
  Data streamed to HDFS will be partitioned by table.
  Tokens will be included in the output.
  The HDFS root directory for writing is set to [/ogg].
  The maximum HDFS file size has been set to 1073741824 bytes.
  Rolling of HDFS files based on time is configured as off.
  Rolling of HDFS files based on write inactivity is configured as off.
  Rolling of HDFS files in the case of a metadata change event is enabled.
  HDFS partitioning information:
    The HDFS partitioning object contains no partitioning information.
HDFS Handler Authentication type has been configured to use [none]
**** End HDFS Handler - Configuration Summary ****

2.4.6 Performance Considerations

The HDFS Handler calls the HDFS flush method on the HDFS write stream to flush data to the HDFS datanodes at the end of each transaction in order to maintain write durability. This is an expensive call. Performance can be adversely affected, especially in the case of transactions of one or few operations, which result in numerous HDFS flush calls.

Performance of the HDFS Handler can be greatly improved by batching multiple small transactions into a single larger transaction. If you have requirements for high performance, you should configure batching functionality provided by either the Extract process or the Replicat process. For more information, see the Replicat Grouping section.

The HDFS client libraries spawn threads for every HDFS file stream opened by the HDFS Handler. The result is that the number of threads executing in the JVM grows proportionally to the number of HDFS file streams that are open. Performance of the HDFS Handler can degrade as more HDFS file streams are opened. Configuring the HDFS Handler to write to many HDFS files, due to many source replication tables or extensive use of partitioning, can result in degraded performance. If the use case requires writing to many tables, then you are advised to enable the roll on time or roll on inactivity features to close HDFS file streams. Closing an HDFS file stream causes the HDFS client threads to terminate, and the associated resources can be reclaimed by the JVM.
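
For example, a sketch of file rolling settings that bound how long HDFS file streams stay open is the following; the interval values are illustrative assumptions, not recommendations:

gg.handler.hdfs.fileRollInterval=1h
gg.handler.hdfs.inactivityRollInterval=30m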

2.4.7 Security

The HDFS cluster can be secured using Kerberos authentication. Refer to the HDFS documentation for how to secure a Hadoop cluster using Kerberos. The HDFS Handler can connect to a Kerberos-secured cluster. The HDFS core-site.xml should be in the handler's classpath with the hadoop.security.authentication property set to kerberos and the hadoop.security.authorization property set to true. Additionally, you must set the following properties in the HDFS Handler Java configuration file:

gg.handler.name.authType=kerberos
gg.handler.name.kerberosPrincipal=legal Kerberos principal name
gg.handler.name.kerberosKeytabFile=path to a keytab file that contains the password for the Kerberos principal so that the HDFS Handler can programmatically perform the Kerberos kinit operations to obtain a Kerberos ticket
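
The following is a sketch of a Kerberos configuration; the principal and keytab path are hypothetical placeholders to be replaced with values for your environment:

gg.handler.hdfs.authType=kerberos
gg.handler.hdfs.kerberosPrincipal=gguser/hdfshost.example.com@EXAMPLE.REALM
gg.handler.hdfs.kerberosKeytabFile=/etc/security/keytabs/gguser.keytab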

2.5 Writing in HDFS in Avro Object Container File Format

The HDFS Handler includes specialized functionality to write to HDFS in Avro Object Container File (OCF) format. This Avro OCF is part of the Avro specification and is detailed in the Avro Documentation at

https://avro.apache.org/docs/current/spec.html#Object+Container+Files

Avro OCF format may be a good choice for you because it:

  • integrates with Apache Hive (raw Avro written to HDFS is not supported by Hive)

  • provides good support for schema evolution

Configure the following to enable writing to HDFS in Avro OCF format.

To write row data to HDFS in Avro OCF format, configure the gg.handler.name.format=avro_row_ocf property.

To write operation data to HDFS in Avro OCF format, configure the gg.handler.name.format=avro_op_ocf property.

The HDFS/Avro OCF integration includes optional functionality to create the corresponding tables in Hive and to update the schemas for metadata change events. The configuration section provides information on the properties that enable integration with Hive. The Oracle GoldenGate Hive integration accesses Hive using the JDBC interface, so the Hive JDBC server must be running to enable this integration.
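
The following is a configuration sketch that enables Avro OCF output with the optional Hive integration; the Hive host, port, and schema directory are hypothetical placeholders:

gg.handler.hdfs.format=avro_row_ocf
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.schemaFilePath=/ogg/schemas
gg.handler.hdfs.hiveJdbcUrl=jdbc:hive2://hivehost:10000/default

Remember that use of the hiveJdbcUrl property requires the Hive JDBC library to be included in the gg.classpath.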

2.6 HDFS Handler Certification Matrix

The 12.2.0.1 Oracle GoldenGate for Big Data HDFS Handler is designed to work with the following versions of Apache Hadoop:

  • 2.7.x

  • 2.6.0

  • 2.5.x

  • 2.4.x

  • 2.3.0

  • 2.2.0

The HDFS Handler also works with the following versions of the Hortonworks Data Platform (HDP), which simply packages Apache Hadoop:

  • HDP 2.4 (HDFS 2.7.1)

  • HDP 2.3 (HDFS 2.7.1)

  • HDP 2.2 (HDFS 2.6.0)

  • HDP 2.1 (HDFS 2.4.0)

  • HDP 2.0 (HDFS 2.2.0)

The HDFS Handler also works with the following versions of Cloudera Distribution including Apache Hadoop (CDH):

  • CDH 5.7.x (HDFS 2.6.0)

  • CDH 5.6.x (HDFS 2.6.0)

  • CDH 5.5.x (HDFS 2.6.0)

  • CDH 5.4.x (HDFS 2.6.0)

  • CDH 5.3.x (HDFS 2.5.0)

  • CDH 5.2.x (HDFS 2.5.0)

  • CDH 5.1.x (HDFS 2.3.0)

2.7 Metadata Change Events

Metadata change events are now handled in the HDFS Handler. The default behavior of the HDFS Handler is to roll the current relevant file in the event of a metadata change event. This behavior allows for the results of metadata changes to at least be separated into different files. File rolling on metadata change is configurable and can be turned off.
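
For example, rolling on metadata change can be turned off with the following setting (keeping in mind that the Avro Object Container File formatter requires it to remain true):

gg.handler.hdfs.rollOnMetadataChange=false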

To support metadata change events, the process capturing changes in the source database must support both DDL changes and metadata in trail. Oracle GoldenGate does not support DDL replication for all database implementations. Consult the Oracle GoldenGate documentation for your database implementation to understand whether DDL replication is supported.

2.8 Partitioning

The HDFS Handler supports partitioning of table data by one or more column values. The configuration syntax to enable partitioning is the following:

gg.handler.name.partitioner.fully qualified table name=one or more column names separated by commas

Consider the following example:

gg.handler.hdfs.partitioner.dbo.orders=sales_region

This example can result in the following breakdown of files in HDFS:

/ogg/dbo.orders/par_sales_region=west/data files
/ogg/dbo.orders/par_sales_region=east/data files
/ogg/dbo.orders/par_sales_region=north/data files
/ogg/dbo.orders/par_sales_region=south/data files
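
As an illustrative sketch, partitioning by more than one column uses a comma-separated list of column names; the table and column names below are hypothetical:

gg.handler.hdfs.partitioner.dbo.orders=sales_region,order_type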

Care should be exercised when choosing columns for partitioning. The key is to choose columns that contain only a few (10 or fewer) possible values, and whose values are also meaningful for the grouping and analysis of the data. An example of a good partitioning column might be sales region. An example of a poor partitioning column might be customer date of birth. Configuring partitioning on a column that has many possible values can be problematic. A poor choice can result in hundreds of HDFS file streams being opened, and performance can degrade for the reasons discussed in the Performance Considerations section. Additionally, poor partitioning can result in problems while performing analysis on the data. Apache Hive requires that all WHERE clauses specify partition criteria if the Hive data is partitioned.

2.9 Common Pitfalls

The most common problems encountered are Java classpath issues. The Oracle HDFS Handler requires certain HDFS client libraries to be resolved in its classpath as a prerequisite for streaming data to HDFS.

For a listing of the required client JAR files by version, see HDFS Handler Client Dependencies. The HDFS client jars do not ship with the Oracle GoldenGate for Big Data product. The HDFS Handler supports multiple versions of HDFS, and the HDFS client jars must be the same version as the HDFS version to which the HDFS Handler is connecting. The HDFS client jars are open source and freely available to download from sites such as the Apache Hadoop site or the Maven Central repository.

In order to establish connectivity to HDFS, the HDFS core-site.xml file must be in the classpath of the HDFS Handler. If the core-site.xml file is not in the classpath, the HDFS client code defaults to a mode that attempts to write to the local file system. Writing to the local file system instead of HDFS can in fact be advantageous for troubleshooting, building a proof of concept (POC), or as a step in the process of building an HDFS integration.

Another common concern is that data streamed to HDFS using the HDFS Handler is often not immediately available to Big Data analytic tools such as Hive. This behavior commonly occurs when the HDFS Handler is in possession of an open write stream to an HDFS file. HDFS writes in blocks of 128 MB by default. HDFS blocks under construction are not always visible to analytic tools. Additionally, inconsistencies between file sizes when using the -ls, -cat, and -get commands in the HDFS shell are commonly seen. This is an anomaly of HDFS streaming and is discussed in the HDFS specification. This anomaly of HDFS leads to a potential 128 MB per file blind spot in analytic data. This may not be an issue if you have a steady stream of replication data and do not require low levels of latency for analytic data from HDFS. However, this may be a problem in some use cases. Closing the HDFS write stream causes the block writing to finalize. Data is immediately visible to analytic tools, and file sizing metrics become consistent again. So the file rolling feature in the HDFS Handler can be used to close HDFS write streams, thus making all data visible.

Caution:

The file rolling solution may present its own potential problems. Extensive use of file rolling can result in many small files in HDFS. Many small files in HDFS can be a problem in itself, resulting in performance issues in analytic tools.

You may also notice the HDFS inconsistency problem in the following scenarios.

  • The HDFS Handler process crashes.

  • A forced shutdown is called on the HDFS Handler process.

  • A network outage or some other issue causes the HDFS Handler process to abend.

In each of these scenarios, it is possible for the HDFS Handler to end without explicitly closing the HDFS write stream and finalizing the writing block. HDFS will ultimately recognize that the write stream has been broken and will finalize the write block. However, in this scenario, users may experience a short-term delay before the HDFS process finalizes the write block.

2.10 Best Practices

It is considered a Big Data best practice for the HDFS cluster to operate on dedicated servers called cluster nodes. Edge nodes are server machines that host the applications to stream data to and retrieve data from the HDFS cluster nodes. This physical architecture delineation between the HDFS cluster nodes and the edge nodes provides a number of benefits including the following:

  • The HDFS cluster nodes are not competing for resources with the applications interfacing with the cluster.

  • HDFS cluster nodes and edge nodes likely have different requirements. This physical topology allows the appropriate hardware to be tailored to the specific need.

It is a best practice for the HDFS Handler to be installed and running on an edge node and to stream data to the HDFS cluster using a network connection. The HDFS Handler can run on any machine that has network visibility to the HDFS cluster. The installation of the HDFS Handler on an edge node requires that the core-site.xml file and the dependency jars be copied to the edge node so that the HDFS Handler can access them. The HDFS Handler can also run collocated on an HDFS cluster node if required.