Adding a SerDe JAR to DP workflows

This topic describes how to add a custom Serializer-Deserializer (SerDe) JAR to the Data Processing (DP) classpath, so that DP workflows can use a SerDe class that is not shipped in the Data Processing package.

When you create a Hive table, you can specify a SerDe class of your choice. For example, consider the ROW FORMAT SERDE clause at the end of this statement:

CREATE TABLE samples_table(id INT, city STRING, country STRING, region STRING, population INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';

If that SerDe JAR is not included in the Data Processing package that is part of Big Data Discovery (BDD), a Data Processing run cannot read the Hive table, and the data therefore cannot be imported into the Dgraph. To solve this problem, you can integrate your custom SerDe into an Oozie Data Processing workflow.

This procedure has two prerequisites:

The BDD Data Processing artifacts must already be present in the CDH cluster (that is, they must be in the HDFS directory /home/username/oozieEdpLib, which is where the data_processing_CLI variable hdfsEdpLibPath should point).
The SerDe JAR must already be present on the CDH cluster's Hive node before you integrate it with Data Processing. To check this, create a table with this SerDe and verify that a SELECT * query on the table does not return an error, whether the query is sent from Hue or from the Hive CLI.
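
The second prerequisite can be checked from a script as well as interactively. The following is a minimal sketch, not part of the product: it assumes the hive command is on the PATH of the node where it runs and that a non-zero exit status signals a failed query. The function names are hypothetical, and the LIMIT 1 clause is added here only to keep the check cheap.

```python
import subprocess


def build_check_command(table):
    """Build a Hive CLI invocation that reads one row from the table."""
    # LIMIT 1 is an addition for this sketch; a plain SELECT * also works.
    query = "SELECT * FROM {} LIMIT 1;".format(table)
    return ["hive", "-e", query]


def table_is_readable(table):
    """Return True if the Hive CLI reads the table without an error.

    Assumption: the Hive CLI exits with a non-zero status when the query fails,
    for example because the SerDe class cannot be loaded.
    """
    result = subprocess.run(
        build_check_command(table), capture_output=True, text=True
    )
    return result.returncode == 0
```

For the example table above, table_is_readable("samples_table") would run the query through the Hive CLI and report whether it succeeded.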

To integrate a custom SerDe JAR into the Oozie Data Processing workflow:

Copy the SerDe JAR into the hdfsEdpLibPath directory (where all the cluster-side DP JARs are located).
In the hdfsEdpLibPath directory on HDFS, edit the spark_worker_files.txt and edp_classpath files to include the SerDe JAR name.
You can edit each file in the Hue file browser: click the file, and then use the edit file option that appears in the left pane.

Note that you do not need to edit the files on the client machine on which the Data Processing CLI is run.
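
As an alternative to the Hue file browser, the two steps above could be scripted with the standard hadoop fs commands. The following is a sketch under stated assumptions, not a supported tool: it assumes that spark_worker_files.txt and edp_classpath list one entry per line (inspect the existing files and adapt add_entry if they use a different layout), and that the hadoop command is on the PATH. The function names and the my-serde.jar file name used in the usage note are hypothetical.

```python
import os
import subprocess
import tempfile

# Value of hdfsEdpLibPath as given in the prerequisites.
HDFS_EDP_LIB = "/home/username/oozieEdpLib"
LIST_FILES = ("spark_worker_files.txt", "edp_classpath")


def add_entry(text, jar_name):
    """Return text with jar_name appended on its own line.

    Assumption: the file holds one entry per line. If the JAR is already
    listed, the text is returned unchanged, so the edit is safe to repeat.
    """
    lines = text.splitlines()
    if jar_name in (line.strip() for line in lines):
        return text
    lines.append(jar_name)
    return "\n".join(lines) + "\n"


def integrate(local_jar, lib_dir=HDFS_EDP_LIB):
    """Copy the SerDe JAR to HDFS and add its name to both list files."""
    jar_name = os.path.basename(local_jar)
    # Step 1: copy the JAR into the hdfsEdpLibPath directory.
    subprocess.run(["hadoop", "fs", "-put", local_jar, lib_dir + "/"], check=True)
    # Step 2: read each list file, add the JAR name, and write the file back.
    for name in LIST_FILES:
        remote = lib_dir + "/" + name
        current = subprocess.run(
            ["hadoop", "fs", "-cat", remote],
            check=True, capture_output=True, text=True,
        ).stdout
        updated = add_entry(current, jar_name)
        if updated == current:
            continue  # JAR already listed; nothing to upload
        with tempfile.TemporaryDirectory() as tmp:
            local_copy = os.path.join(tmp, name)
            with open(local_copy, "w") as handle:
                handle.write(updated)
            # -f overwrites the existing file on HDFS.
            subprocess.run(
                ["hadoop", "fs", "-put", "-f", local_copy, remote], check=True
            )
```

Calling integrate("my-serde.jar") from a machine with HDFS access would perform both steps. Because add_entry leaves a file unchanged when the JAR is already listed, the list files are not rewritten on a second run; the initial hadoop fs -put of the JAR, however, fails if the JAR already exists in the directory.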

As a result, the SerDe JAR is added to the Data Processing classpath. This means that the SerDe class will be used in all Data Processing workflows, whether they are initiated automatically, by Studio, or by running the Data Processing CLI.