The supported HDFS file formats are JSON, Avro, Parquet, and Delimited. The format is specified in the Storage Format field on the Storage tab of the data store and applies to all files of that data store. The JSON, Avro, and Parquet formats can contain complex data types, such as array or object. During the Reverse Engineer phase, the schema definition for these types is converted to Avro and stored in the Data Format column of the attribute with the complex data type. This information is used when flattening the data in mappings.
The JSON, Avro, and Parquet formats each require the location of a schema file to be entered. For Delimited, you must specify the record and field separators and the number of heading lines. If you are loading Avro files into Hive, you must also copy the Avro schema file (.avsc) into the same HDFS location as the HDFS files.
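As a minimal sketch of what such a schema file contains, here is a hypothetical Avro schema (.avsc) with one complex field of type array; the record and field names are illustrative only, not taken from the product:

```python
import json

# Hypothetical .avsc content for illustration: a record with a simple
# "title" field and a complex "genres" field of Avro type "array".
avsc_text = """
{
  "type": "record",
  "name": "Movie",
  "fields": [
    {"name": "title",  "type": "string"},
    {"name": "year",   "type": "int"},
    {"name": "genres", "type": {"type": "array", "items": "string"}}
  ]
}
"""

schema = json.loads(avsc_text)

# List each field with its (possibly complex) type: complex Avro types
# appear as nested objects, primitive types as plain strings.
for field in schema["fields"]:
    t = field["type"]
    kind = t["type"] if isinstance(t, dict) else t
    print(field["name"], "->", kind)
```

Fields whose type is a nested object (here, `genres`) are the ones whose Avro definition is stored in the Data Format column during reverse-engineering.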
Table 9-1 HDFS File Formats
File Format | Reverse Engineer | Complex Type Support | Load into Hive | Load into Spark | Write from Spark
---|---|---|---|---|---
Avro | Yes (schema required) | Yes | Yes (schema required) | Yes | Yes
Delimited | No | No | Yes | Yes | Yes
JSON | Yes (schema required) | Yes | Yes | Yes | Yes
Parquet | Yes (schema required) | Yes | Yes | Yes | Yes
Separate KMs for each file format are not required. You can create just one or two KMs for each target (a standard LKM and, where appropriate, a Direct Load LKM). The file can be either delimited or fixed format. The new LKM HDFS File to Hive supports loading only HDFS files into Hive; the files can be in JSON, Avro, Parquet, Delimited, or similar formats, and can contain complex data.
Table 9-2 Complex Types
Avro | JSON | Hive | Parquet
---|---|---|---
record | object | struct | record
enum | NA | NA | enum
array | array | array | array
map | NA | map | map
union | NA | union | union
fixed | NA | NA | fixed
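The mapping in Table 9-2 can be sketched as a simple lookup; this is an illustrative helper, not part of the product, with the `hive_type` function name chosen here for the example:

```python
# Table 9-2 as a lookup: for each Avro complex type, the corresponding
# JSON, Hive, and Parquet types ("NA" where no equivalent exists).
COMPLEX_TYPE_MAP = {
    # avro:   (json,     hive,     parquet)
    "record": ("object", "struct", "record"),
    "enum":   ("NA",     "NA",     "enum"),
    "array":  ("array",  "array",  "array"),
    "map":    ("NA",     "map",    "map"),
    "union":  ("NA",     "union",  "union"),
    "fixed":  ("NA",     "NA",     "fixed"),
}

def hive_type(avro_type: str) -> str:
    """Return the Hive equivalent of an Avro complex type, or 'NA'."""
    return COMPLEX_TYPE_MAP[avro_type][1]

print(hive_type("record"))  # struct
print(hive_type("enum"))    # NA
```

For example, an Avro `record` flattens into a Hive `struct`, while Avro `enum` and `fixed` have no Hive equivalent.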