9.1 HDFS Formats

The HDFS file formats supported are JSON, Avro, Parquet, and Delimited. The format is specified by setting the Storage Format value, which can be found on the Storage tab of the data store; for all HDFS files, the storage format is therefore defined in the data store. The JSON, Avro, and Parquet formats can contain complex data types, such as array or Object. During the Reverse Engineer phase, the schema definition for each of these types is converted to Avro and stored in the Data Format column of the attribute with the complex data type. This information is used when flattening the data in mappings.
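To illustrate the flattening concept, here is a minimal sketch in plain Python (not ODI's actual mapping engine): a record containing a nested Object is expanded into flat, dotted column names. The record shape and field names are invented for the example.

```python
import json

def flatten(record, prefix=""):
    """Flatten nested objects into dotted column names.

    Arrays are left intact here; in a mapping they would
    typically be exploded into one row per element.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):  # complex type: Object
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

raw = json.loads('{"id": 1, "address": {"city": "Austin", "zip": "78701"}}')
print(flatten(raw))
# {'id': 1, 'address.city': 'Austin', 'address.zip': '78701'}
```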

For JSON, Avro, and Parquet, each format requires the location of a schema file to be entered. For Delimited, you need to specify the record and field separators and the number of header lines. If you are loading Avro files into Hive, you must also copy the Avro schema file (.avsc) into the same HDFS location as the HDFS data files.
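For reference, an Avro schema file (.avsc) is itself a JSON document. Below is a minimal hypothetical schema with one complex type (an array); the record and field names are invented for the example, and plain Python with the standard json module is used to parse it.

```python
import json

# A minimal hypothetical Avro schema (.avsc) with a complex type (array).
avsc = """
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}
"""

schema = json.loads(avsc)
print(schema["name"])                          # record name
print([f["name"] for f in schema["fields"]])   # field names
```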


Table 9-1 HDFS File Formats

File Format   Reverse Engineer        Complex Type Support   Load into Hive          Load into Spark   Write from Spark
Avro          Yes (Schema required)   Yes                    Yes (Schema required)   Yes               Yes
Delimited     No                      No                     Yes                     Yes               Yes
JSON          Yes (Schema required)   Yes                    Yes                     Yes               Yes
Parquet       Yes (Schema required)   Yes                    Yes                     Yes               Yes


Separate KMs for each file format are not required: you can create just one or two KMs for each target (a standard LKM and, where appropriate, a Direct Load LKM). The file can be either delimited or fixed format. The new LKM HDFS File to Hive supports loading only HDFS files into Hive; the file can be in JSON, Avro, Parquet, Delimited, or a similar format, including files with complex data.


Table 9-2 Complex Types

Avro     JSON     Hive     Parquet
Record   Object   Struct   Record
enum     NA       NA       enum
array    array    array    array
map      NA       map      map
union    NA       union    union
fixed    NA       NA       fixed
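The Avro-to-Hive column of Table 9-2 can be expressed programmatically. The following sketch is hand-written from the table (it is not an ODI or Hive API); "NA" entries are represented as None.

```python
# Avro -> Hive complex-type equivalents, transcribed from Table 9-2.
# None means the table lists NA (no Hive equivalent).
AVRO_TO_HIVE = {
    "record": "struct",
    "enum": None,
    "array": "array",
    "map": "map",
    "union": "union",
    "fixed": None,
}

def hive_equivalent(avro_type):
    """Return the Hive complex type for an Avro complex type."""
    try:
        return AVRO_TO_HIVE[avro_type.lower()]
    except KeyError:
        raise ValueError(f"not an Avro complex type: {avro_type}")

print(hive_equivalent("Record"))  # struct
```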