The supported HDFS file formats are JSON, Avro, Parquet, and Delimited. The format is specified in the Storage Format field on the Storage tab of the data store and applies to all files of that data store. The JSON, Avro, and Parquet formats can contain complex data types, such as array or object. During the Reverse Engineer phase, the schema definition for these types is converted to Avro and stored in the Data Format column of the attribute with the complex data type. This information is used when flattening the data in mappings.
The JSON, Avro, and Parquet formats each require the location of a schema file to be entered. For Delimited, you must specify the record and field separators and the number of heading lines. If you are loading Avro files into Hive, you must also copy the Avro schema file (.avsc) into the same HDFS location as the HDFS files.
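As a minimal sketch of what such a schema file contains, here is a hypothetical Avro schema (.avsc) with one complex field of type array; the record and field names are illustrative only, not taken from the product:

```python
import json

# Hypothetical .avsc content for illustration: a record with a simple
# "title" field and a complex "genres" field of Avro type "array".
avsc_text = """
{
  "type": "record",
  "name": "Movie",
  "fields": [
    {"name": "title",  "type": "string"},
    {"name": "year",   "type": "int"},
    {"name": "genres", "type": {"type": "array", "items": "string"}}
  ]
}
"""

schema = json.loads(avsc_text)

# List each field with its (possibly complex) type: complex Avro types
# appear as nested objects, primitive types as plain strings.
for field in schema["fields"]:
    t = field["type"]
    kind = t["type"] if isinstance(t, dict) else t
    print(field["name"], "->", kind)
```

Fields whose type is a nested object (here, `genres`) are the ones whose Avro definition is stored in the Data Format column during reverse-engineering.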
Table 9-1 HDFS File Formats
File Format | Reverse Engineer | Complex Type Support | Load into Hive | Load into Spark | Write from Spark
---|---|---|---|---|---
Avro | Yes (schema required) | Yes | Yes (schema required) | Yes | Yes
Delimited | No | No | Yes | Yes | Yes
JSON | Yes (schema required) | Yes | Yes | Yes | Yes
Parquet | Yes (schema required) | Yes | Yes | Yes | Yes
Separate KMs for each file format are not required. You can create just one or two KMs for each target (a standard LKM and, where appropriate, a Direct Load LKM). The file can be either delimited or fixed format. The new LKM HDFS File to Hive supports loading only HDFS files into Hive; the files can be in JSON, Avro, Parquet, Delimited, or similar formats, and can contain complex data.
Table 9-2 Complex Types
Avro | JSON | Hive | Parquet
---|---|---|---
record | object | struct | record
enum | NA | NA | enum
array | array | array | array
map | NA | map | map
union | NA | union | union
fixed | NA | NA | fixed
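The mapping in Table 9-2 can be sketched as a simple lookup; this is an illustrative helper, not part of the product, with the `hive_type` function name chosen here for the example:

```python
# Table 9-2 as a lookup: for each Avro complex type, the corresponding
# JSON, Hive, and Parquet types ("NA" where no equivalent exists).
COMPLEX_TYPE_MAP = {
    # avro:   (json,     hive,     parquet)
    "record": ("object", "struct", "record"),
    "enum":   ("NA",     "NA",     "enum"),
    "array":  ("array",  "array",  "array"),
    "map":    ("NA",     "map",    "map"),
    "union":  ("NA",     "union",  "union"),
    "fixed":  ("NA",     "NA",     "fixed"),
}

def hive_type(avro_type: str) -> str:
    """Return the Hive equivalent of an Avro complex type, or 'NA'."""
    return COMPLEX_TYPE_MAP[avro_type][1]

print(hive_type("record"))  # struct
print(hive_type("enum"))    # NA
```

For example, an Avro `record` flattens into a Hive `struct`, while Avro `enum` and `fixed` have no Hive equivalent.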