Adding a SerDe JAR to DP workflows

This topic describes how to add a custom Serializer-Deserializer (SerDe) to the Data Processing (DP) classpath.

When customers create a Hive table, they can specify a SerDe class of their choice. For example, consider the last line of this statement:

CREATE TABLE samples_table(
   id INT, 
   city STRING, 
   country STRING, 
   region STRING, 
   population INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';

If that SerDe JAR is not packaged with the Data Processing package that is part of Big Data Discovery, then a Data Processing run cannot read the Hive table, which prevents the data from being imported into the Dgraph. To solve this problem, you can integrate your custom SerDe into the Data Processing workflow.

This procedure assumes the following prerequisite:

Before you integrate the SerDe JAR with Data Processing, the SerDe JAR should be present on the Hadoop cluster's HiveServer2 node and configured via the Hive Auxiliary Jars Directory property in the Hive service. To check this, verify that a SELECT * query on a table created with this SerDe completes without error. Run the query from both Hue and the Hive CLI to confirm that the SerDe was added properly.
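The Hive CLI half of this check can be scripted. The following is a minimal sketch, not part of the product: it assumes the hive command is on the PATH of the node where it runs, the function names are hypothetical, and the LIMIT 1 clause is an addition to keep the check fast. The Hue check still has to be done by hand.

```python
import re
import subprocess


def build_check_command(table_name):
    """Return the Hive CLI argument list for a smoke-test query on table_name."""
    # Accept only plain identifiers (optionally db.table) so that nothing
    # unexpected is passed to the CLI.
    pattern = r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?"
    if not re.fullmatch(pattern, table_name):
        raise ValueError("unexpected table name: %r" % table_name)
    # LIMIT 1 is an assumption of this sketch; the prerequisite itself
    # only asks for a SELECT * query.
    return ["hive", "-e", "SELECT * FROM %s LIMIT 1;" % table_name]


def serde_is_readable(table_name):
    """Run the query and report whether the Hive CLI exited without error."""
    result = subprocess.run(
        build_check_command(table_name), capture_output=True, text=True
    )
    return result.returncode == 0
```

For the table defined earlier, serde_is_readable("samples_table") would run the query through the Hive CLI and return True only if the command exits with status 0.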

To integrate a custom SerDe JAR into the Data Processing workflow:

1. Copy the SerDe JAR into the same location on each cluster node.
   Note that this location can be the same one used when adding the SerDe JAR to the HiveServer2 node.
2. Edit the DP CLI edp.properties file and add the path to the SerDe JAR to the extraJars property. This property should be a colon-separated list of paths to JARs. This allows DP jobs run from the CLI to pick up the SerDe JAR.
   By default, the edp.properties file is in the $BDD_HOME/dataprocessing/edp_cli/config directory.

   You should also update the DP_ADDITIONAL_JARS property in the installation version of the bdd.conf file with the path, in case you ever re-install BDD.
3. For Studio, edit the $DOMAIN_HOME/config/studio/portal-ext.properties file and add the path to the SerDe JAR to the dp.settings.extra.jars property. This property should be a colon-separated list of paths to JARs. This allows DP jobs run from Studio to pick up the SerDe JAR.

As a result, the SerDe JAR is added to the Data Processing classpath. This means that the SerDe class is available in all Data Processing workflows, whether they are initiated automatically by Studio or by running the Data Processing CLI.
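The property edits in the procedure above all follow the same pattern: append a path to a colon-separated list without duplicating it. The following is a minimal sketch of that pattern, assuming each property is written on one line in key=value form. The function name and the JAR path in the usage note are hypothetical, and the sketch rewrites text only, so reading and writing the actual file is left to the caller.

```python
def add_jar_to_property(properties_text, key, jar_path):
    """Return properties_text with jar_path in the colon-separated list under key."""
    lines = properties_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comment lines and anything that is not in key=value form.
        if stripped.startswith(("#", "!")) or "=" not in stripped:
            continue
        name, _, value = stripped.partition("=")
        if name.strip() != key:
            continue
        # Split the existing list, dropping empty entries.
        paths = [p for p in value.strip().split(":") if p]
        if jar_path not in paths:
            paths.append(jar_path)
        lines[i] = "%s=%s" % (key, ":".join(paths))
        break
    else:
        # The property was not found, so add it with the single path.
        lines.append("%s=%s" % (key, jar_path))
    return "\n".join(lines) + "\n"
```

For example, calling add_jar_to_property(text, "extraJars", "/opt/serde/my-serde.jar") on the contents of edp.properties, and the same call with the dp.settings.extra.jars key on the contents of portal-ext.properties, would cover the DP CLI and Studio steps. Calling it a second time with the same path leaves the text unchanged.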