DP CLI Configuration

The DP CLI configuration properties are contained in the data_processing_CLI script.

To set the CLI configuration parameters, open the data_processing_CLI script with a text editor. Some of the default values for the parameters are populated from the bdd.conf configuration file used during the installation of Big Data Discovery.

In general, the settings below should match those in the Data Processing Settings panel on Studio's Control Panel. Parameters that must be the same are noted in the parameter descriptions below. For information on Studio's Data Processing Settings panel, see the Administrator's Guide.

Data Processing Defaults

The parameters in data_processing_CLI that set the Data Processing defaults are:
maxRecordsProcessed
   The maximum number of records to process for each Hive table (that is, the number of records sampled from the table). The default is 1000000. In effect, this sets the maximum number of records in a BDD data set. You can override this setting with the CLI --maxRecords flag.

runEnrichment
   Specifies whether to run the Data Enrichment modules. The default is true. You can override this setting with the CLI --runEnrichment flag.

defaultLanguage
   The language for all attributes in the created data set. The default language code is en (US English). For the supported language codes, see Supported languages.

edpDataDir
   Specifies the location of the HDFS directory where data ingest and transform operations are processed. The default location is the /user/bdd/edp/data directory. Must match the bdd.edpDataDir setting in Studio.
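
For reference, the following is a minimal sketch of how these defaults might appear in the data_processing_CLI script, assuming they are assigned as plain shell variables (the exact layout and any surrounding comments in your copy of the script may differ):

   # Data Processing defaults (values shown are the shipped defaults)
   maxRecordsProcessed=1000000
   runEnrichment=true
   defaultLanguage=en
   edpDataDir=/user/bdd/edp/data

To override a default for a single run rather than editing the script, use the corresponding CLI flag, for example --maxRecords 100000.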

Settings controlling access to the Dgraph Gateway

These parameters in data_processing_CLI configure access to the Dgraph Gateway that manages the Dgraph nodes:

endecaServerHost
   The name of the host on which the Dgraph Gateway is running. The default name is specified in the bdd.conf configuration file.

endecaServerPort
   The port on which the Dgraph Gateway is listening. The default is 7003.

endecaServerContextRoot
   The context root of the Dgraph Gateway when running on Managed Servers within WebLogic Server. The value should be set to /endeca-server.
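
Assuming the same shell-variable form as in the sketch above, these settings might look like the following (the hostname is a placeholder for your own Dgraph Gateway host):

   # Dgraph Gateway connection
   endecaServerHost=gateway01.example.com   # placeholder hostname
   endecaServerPort=7003
   endecaServerContextRoot=/endeca-server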

Settings controlling access to Hadoop

The parameters that define connections to CDH processes and resources are:
oozieHost
   Name of the host on which the Oozie server is running. The default value is set at BDD installation time. Must match the bdd.hadoopClusterHostname setting in Studio.

ooziePort
   Port on which the Oozie server is listening. The default value is set at BDD installation time. Must match the bdd.oozieServerPort setting in Studio.

oozieJobsDir
   Path to the working directory for Oozie Data Processing job files. The default location is the /user/bdd/edp/oozieJobs directory. Must match the bdd.edpOozieJobsDir setting in Studio.

oozieWorkerJavaExecPath
   Path to the java executable in the Java SDK on the Oozie worker that is used to launch the Data Processing process. Must match the bdd.javaPath setting in Studio.

hdfsEdpLibPath
   HDFS path to the Data Processing libraries directory. The default location is the /user/bdd/edp/lib directory. Must match the bdd.hdfsEdpLibPath setting in Studio.

hiveServerHost
   Name of the host on which the Hive server is running. The default value is set at BDD installation time. Must match the bdd.hadoopClusterHostname setting in Studio.

hiveServerPort
   Port on which the Hive server is listening. The default value is set at BDD installation time. Must match the bdd.hiveMetastoreServerPort setting in Studio.

sparkMasterHost
   Name of the host on which the Spark Master server is running. The default value is set at BDD installation time. Must match the bdd.hadoopClusterHostname setting in Studio.

sparkMasterPort
   Port on which the Spark Master server is listening. The default value is set at BDD installation time. Must match the bdd.sparkServerPort setting in Studio.

sparkExecutorMemory
   Amount of memory to use per executor process, in the same format as JVM memory strings (such as 512m, 2g, or 10g). The default is 48g. This setting must be less than or equal to Spark's Total Java Heap Sizes of Worker's Executors in Bytes (executor_total_max_heapsize) property in Cloudera Manager. To access this property in Cloudera Manager, select Clusters > Spark (Standalone), then click the Configuration tab; the property is in the Worker Default Group category (using the classic view).

edpJarDir
   Path to the directory on the cluster where the Data Processing JAR files for Spark workers are located. The default location is the /opt/bdd/edp/lib directory. Must match the bdd.edpJarDir setting in Studio.

clusterOltHome
   Path to the OLT directory on the Spark worker node. The default location is the /opt/bdd/edp/olt directory. Must match the bdd.clusterOlthome setting in Studio.

sparkMaxNumberCores
   Maximum number of CPU cores to use for a Spark job. The default is 0, which sets the number of cores equal to the number of HDFS blocks used by the target data.

kryoMode
   Specifies whether to enable (true) or disable (false) Kryo serialization. The default is false, which is the recommended setting for Data Processing workflows.

kryoBufferMemSizeMB
   Maximum object size (in MB) to allow within Kryo. (The library needs a buffer at least as large as the largest single object you will serialize.) The default is 1024. Increase this setting if you get a "buffer limit exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker.
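
As an illustration, again assuming plain shell-variable assignments, the Hadoop-related settings might look like the following. The hostnames are placeholders, and the ports shown are the usual defaults for those services (Oozie 11000, Hive metastore 9083, Spark Master 7077) rather than values guaranteed to match your bdd.conf:

   # Hadoop connections (hostnames, ports, and the Java path are placeholders)
   oozieHost=hadoop01.example.com
   ooziePort=11000
   oozieJobsDir=/user/bdd/edp/oozieJobs
   oozieWorkerJavaExecPath=/usr/java/latest/bin/java
   hdfsEdpLibPath=/user/bdd/edp/lib
   hiveServerHost=hadoop01.example.com
   hiveServerPort=9083
   sparkMasterHost=hadoop01.example.com
   sparkMasterPort=7077
   sparkExecutorMemory=48g      # must not exceed executor_total_max_heapsize in Cloudera Manager
   edpJarDir=/opt/bdd/edp/lib
   clusterOltHome=/opt/bdd/edp/olt
   sparkMaxNumberCores=0        # 0 = one core per HDFS block of the target data
   kryoMode=false               # recommended for Data Processing workflows
   kryoBufferMemSizeMB=1024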

JAVA_HOME setting

In addition to setting the CLI configuration properties, make sure that the JAVA_HOME environment variable points to the directory containing the specific version of Java that should be used when you run the Data Processing CLI.
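
For example, you might set the variable in the shell from which you run the CLI (the JDK path below is illustrative; use the Java installation appropriate for your environment):

   # Point JAVA_HOME at the JDK the DP CLI should use
   export JAVA_HOME=/usr/java/jdk1.7.0_67   # illustrative path
   export PATH=$JAVA_HOME/bin:$PATH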