Processing Hive tables with Snappy compression

This topic explains how to set up the Snappy libraries so that the DP CLI can process Hive tables with Snappy compression.

By default, the DP CLI cannot successfully process Hive tables with Snappy compression because the required Hadoop native libraries are not available in the JVM's library path. Therefore, you must add the Hadoop native libraries' path to the Workflow Manager's sparkContext.properties file, which is located in the $BDD_HOME/workflowmanager/dp/config directory. For information on this configuration file, see Spark configuration.

To configure workflows to use the Snappy libraries:

  1. Locate the source directory for the Hadoop native libraries in your Hadoop installation. Typical locations are:
    • CDH:
      /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
    • MapR:
      /opt/mapr/hadoop/hadoop-2.7.0/lib/native
    • HDP:
      /usr/hdp/2.4.2.0-212/hadoop/lib/native
  2. To make the native libraries available to Spark, add their path to the Workflow Manager's sparkContext.properties file, using these Spark properties:
    spark.executorEnv.LD_LIBRARY_PATH=<snappypath>
    spark.yarn.appMasterEnv.LD_LIBRARY_PATH=<snappypath>
    CDH example:
    spark.executorEnv.LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
    spark.yarn.appMasterEnv.LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
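
    The two steps above can be sketched as a shell snippet. This is a minimal illustration, not part of the product: the CDH library path is one of the examples from step 1, and the $BDD_HOME fallback directory used here is a stand-in so the sketch can run anywhere (in a real installation, the config directory already exists and you would not create it).

    ```shell
    # Sketch only: append the Snappy native library path to the Workflow
    # Manager's sparkContext.properties. Adjust SNAPPY_PATH for your
    # Hadoop distribution (CDH path shown as an example).
    SNAPPY_PATH="/opt/cloudera/parcels/CDH/lib/hadoop/lib/native"
    CONF="${BDD_HOME:-/tmp/bdd-demo}/workflowmanager/dp/config/sparkContext.properties"

    # Demo-only: in a real install this directory already exists.
    mkdir -p "$(dirname "$CONF")"

    # Note: re-running this sketch appends duplicate lines; in practice,
    # edit the file once.
    {
      echo "spark.executorEnv.LD_LIBRARY_PATH=$SNAPPY_PATH"
      echo "spark.yarn.appMasterEnv.LD_LIBRARY_PATH=$SNAPPY_PATH"
    } >> "$CONF"

    # Show what was written.
    grep LD_LIBRARY_PATH "$CONF"
    ```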

Once the paths are added to the Workflow Manager's properties file, all subsequent DP workflows should be able to process Hive tables with Snappy compression.
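
To confirm that Hadoop itself can load the Snappy native library, you can run the standard hadoop checknative command on a node where the hadoop CLI is installed; it reports whether each native library (including snappy) is loadable. The guard below is only so the sketch exits cleanly on machines without the hadoop CLI.

```shell
# Optional sanity check: report whether Hadoop can load the native
# Snappy library. Requires the hadoop CLI on PATH.
if command -v hadoop >/dev/null 2>&1; then
  hadoop checknative -a | grep -i snappy
else
  echo "hadoop CLI not on PATH; skipping check"
fi
```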