DP CLI Configuration

The DP CLI configuration properties are contained in the data_processing_CLI script.

To set the CLI configuration parameters, open the data_processing_CLI script with a text editor. Some of the default values for the parameters are populated from the bdd.conf configuration file used during the installation of Big Data Discovery.

In general, the settings below should match those in the Data Processing Settings panel on Studio's Control Panel. Parameters that must be the same are noted in the parameter descriptions below. For information on Studio's Data Processing Settings panel, see the Administrator's Guide.

Data Processing Defaults

The parameters in data_processing_CLI that set the Data Processing defaults are:
maxRecordsProcessed
   The maximum number of records to process for each Hive table (that is, the number of records sampled from the table). The default is 1000000. In effect, this sets the maximum number of records in a BDD data set. You can override this setting with the CLI --maxRecords flag.

runEnrichment
   Specifies whether to run the Data Enrichment modules. The default is true. You can override this setting with the CLI --runEnrichment flag.

defaultLanguage
   The language for all attributes in the created data set. The default language code is en (US English). For the supported language codes, see Supported languages.

edpDataDir
   Specifies the location of the HDFS directory where data ingest and transform operations are processed. The default location is the /user/bdd/edp/data directory. Must match the bdd.edpDataDir setting in Studio.
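
For reference, the following is a minimal sketch of how these defaults might appear in the data_processing_CLI script, assuming they are assigned as plain shell variables (the exact layout and any surrounding comments in your copy of the script may differ):

   # Data Processing defaults (values shown are the shipped defaults)
   maxRecordsProcessed=1000000
   runEnrichment=true
   defaultLanguage=en
   edpDataDir=/user/bdd/edp/data

To override a default for a single run rather than editing the script, use the corresponding CLI flag, for example --maxRecords 100000.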

Settings controlling access to the Dgraph Gateway

These parameters in data_processing_CLI configure access to the Dgraph Gateway that manages the Dgraph nodes:

endecaServerHost
   The name of the host on which the Dgraph Gateway is running. The default name is specified in the bdd.conf configuration file.

endecaServerPort
   The port on which the Dgraph Gateway is listening. The default is 7003.

endecaServerContextRoot
   The context root of the Dgraph Gateway when running on Managed Servers within WebLogic Server. The value should be set to /endeca-server.
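
Assuming the same shell-variable form as in the sketch above, these settings might look like the following (the hostname is a placeholder for your own Dgraph Gateway host):

   # Dgraph Gateway connection
   endecaServerHost=gateway01.example.com   # placeholder hostname
   endecaServerPort=7003
   endecaServerContextRoot=/endeca-server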

Settings controlling access to Hadoop

The parameters that define connections to CDH processes and resources are:
oozieHost
   Name of the host on which the Oozie server is running. The default value is set at BDD installation time. Must match the bdd.hadoopClusterHostname setting in Studio.

ooziePort
   Port on which the Oozie server is listening. The default value is set at BDD installation time. Must match the bdd.oozieServerPort setting in Studio.

oozieJobsDir
   Path to the working directory for Oozie Data Processing job files. The default location is the /user/bdd/edp/oozieJobs directory. Must match the bdd.edpOozieJobsDir setting in Studio.

oozieWorkerJavaExecPath
   Path to the java executable in the Java SDK on the Oozie worker that is used to launch the Data Processing process. Must match the bdd.javaPath setting in Studio.

hdfsEdpLibPath
   HDFS path to the Data Processing libraries directory. The default location is the /user/bdd/edp/lib directory. Must match the bdd.hdfsEdpLibPath setting in Studio.

hiveServerHost
   Name of the host on which the Hive server is running. The default value is set at BDD installation time. Must match the bdd.hadoopClusterHostname setting in Studio.

hiveServerPort
   Port on which the Hive server is listening. The default value is set at BDD installation time. Must match the bdd.hiveMetastoreServerPort setting in Studio.

sparkMasterHost
   Name of the host on which the Spark Master server is running. The default value is set at BDD installation time. Must match the bdd.hadoopClusterHostname setting in Studio.

sparkMasterPort
   Port on which the Spark Master server is listening. The default value is set at BDD installation time. Must match the bdd.sparkServerPort setting in Studio.

sparkExecutorMemory
   Amount of memory to use per executor process, in the same format as JVM memory strings (such as 512m, 2g, or 10g). The default is 48g. This setting must be less than or equal to Spark's Total Java Heap Sizes of Worker's Executors in Bytes (executor_total_max_heapsize) property in Cloudera Manager. To access this property in Cloudera Manager, select Clusters > Spark (Standalone), then click the Configuration tab; the property is in the Worker Default Group category (using the classic view).

edpJarDir
   Path to the directory on the cluster where the Data Processing JAR files for Spark workers are located. The default location is the /opt/bdd/edp/lib directory. Must match the bdd.edpJarDir setting in Studio.

clusterOltHome
   Path to the OLT directory on the Spark worker node. The default location is the /opt/bdd/edp/olt directory. Must match the bdd.clusterOlthome setting in Studio.

sparkMaxNumberCores
   Maximum number of CPU cores to use for a Spark job. The default is 0, which sets the number of cores equal to the number of HDFS blocks used by the target data.

kryoMode
   Specifies whether to enable (true) or disable (false) Kryo serialization. The default is false, which is the recommended setting for Data Processing workflows.

kryoBufferMemSizeMB
   Maximum object size (in MB) to allow within Kryo. (The library needs a buffer at least as large as the largest single object you will serialize.) The default is 1024. Increase this setting if you get a "buffer limit exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker.
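
As an illustration, again assuming plain shell-variable assignments, the Hadoop-related settings might look like the following. The hostnames are placeholders, and the ports shown are the usual defaults for those services (Oozie 11000, Hive metastore 9083, Spark Master 7077) rather than values guaranteed to match your bdd.conf:

   # Hadoop connections (hostnames, ports, and the Java path are placeholders)
   oozieHost=hadoop01.example.com
   ooziePort=11000
   oozieJobsDir=/user/bdd/edp/oozieJobs
   oozieWorkerJavaExecPath=/usr/java/latest/bin/java
   hdfsEdpLibPath=/user/bdd/edp/lib
   hiveServerHost=hadoop01.example.com
   hiveServerPort=9083
   sparkMasterHost=hadoop01.example.com
   sparkMasterPort=7077
   sparkExecutorMemory=48g      # must not exceed executor_total_max_heapsize in Cloudera Manager
   edpJarDir=/opt/bdd/edp/lib
   clusterOltHome=/opt/bdd/edp/olt
   sparkMaxNumberCores=0        # 0 = one core per HDFS block of the target data
   kryoMode=false               # recommended for Data Processing workflows
   kryoBufferMemSizeMB=1024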

JAVA_HOME setting

In addition to setting the CLI configuration properties, make sure that the JAVA_HOME environment variable points to the directory containing the specific version of Java that should be used when you run the Data Processing CLI.
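
For example, you might set the variable in the shell from which you run the CLI (the JDK path below is illustrative; use the Java installation appropriate for your environment):

   # Point JAVA_HOME at the JDK the DP CLI should use
   export JAVA_HOME=/usr/java/jdk1.7.0_67   # illustrative path
   export PATH=$JAVA_HOME/bin:$PATH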