DP CLI configuration

The DP CLI has a configuration file, edp.properties, that sets its default properties.

By default, the edp.properties file is located in the $BDD_HOME/dataprocessing/edp_cli/config directory.

Some of the default values for the properties are populated from the bdd.conf installation configuration file. After installation, you can change the CLI configuration parameters by opening the edp.properties file with a text editor.
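
Before editing the file, it can be useful to inspect its current values from a script. The following is a minimal sketch, assuming that `edp.properties` uses plain `key=value` lines with `#` or `!` comment lines. The `parse_properties` helper is hypothetical and the sample values are illustrative only. The full Java properties format (line continuations, escapes, `:` separators) is not handled.

```python
def parse_properties(text):
    """Parse simple key=value lines into a dict.

    Blank lines and lines starting with '#' or '!' are skipped.
    This is a simplified reader, not a full Java .properties parser.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            props[key.strip()] = value.strip()
    return props


# Hypothetical excerpt; the values below are illustrative, not real defaults.
sample = """
# Data Processing defaults
maxRecordsForNewDataSet = 1000000
runEnrichment = false
edpDataDir = /user/bdd/edp/data
"""

props = parse_properties(sample)
for name, value in sorted(props.items()):
    print(name, "=", value)
```

To read the real file, pass the contents of `$BDD_HOME/dataprocessing/edp_cli/config/edp.properties` to the helper instead of the sample string.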

Data Processing defaults

The properties that set the Data Processing defaults are:

Data Processing Property	Description
`maxRecordsForNewDataSet`	Specifies the maximum number of records in the sample size of a new data set (that is, the number of sampled records from the source Hive table). In effect, this sets the maximum number of records in a BDD data set. Note that this setting controls the sample size for all new data sets and it also controls the sample size resulting from transform operations (such as during a Refresh update on a data set that contains a transformation script). The default is set by the `MAX_RECORDS` property in the `bdd.conf` file. The CLI `--maxRecords` flag can override this setting.
`runEnrichment`	Specifies whether to run the Data Enrichment modules. The default is set by the `ENABLE_ENRICHMENTS` property in the `bdd.conf` file. You can override this setting by using the CLI `--runEnrichment` flag. The CLI `--excludePlugins` flag can also be used to exclude some of the Data Enrichment modules.
`defaultLanguage`	The language for all attributes in the created data set. The default is set by the `LANGUAGE` property in the `bdd.conf` file. For the supported language codes, see Supported languages.
`edpDataDir`	Specifies the location of the HDFS directory where data ingest and transform operations are processed. The default location is the `/user/bdd/edp/data` directory.
`datasetAccessType`	Sets the access type for the data set, which determines which Studio users can access the data set in the Studio UI. This property takes one of these case-insensitive values: `public` means that all Studio users can access the data set. This is the default. `private` means that only designated Studio users and groups can access the data set. The users and groups are specified in attributes set in the data set's entry in the DataSet Inventory.
`notificationsServerUrl`	Specifies the URL of the Notification Service. This value is automatically set by the BDD installer and will have a value similar to this example: https://web14.example.com:7003/bdd/v1/api/workflows

Dgraph Gateway connectivity settings

These properties are used to control access to the Dgraph Gateway that is managing the Dgraph nodes:

Dgraph Gateway Property	Description
`endecaServerHost`	The name of the host on which the Dgraph Gateway is running. The default name is specified in the `bdd.conf` configuration file.
`endecaServerPort`	The port on which Dgraph Gateway is listening. The default is 7003.
`endecaServerContextRoot`	The context root of the Dgraph Gateway when running on Managed Servers within the WebLogic Server. The value should be set to: `/endeca-server`

Kerberos credentials

The DP CLI is enabled for Kerberos support at installation time if the ENABLE_KERBEROS property in the bdd.conf file is set to TRUE. The bdd.conf file also has parameters for specifying the name of the Kerberos principal, as well as the paths to the Kerberos keytab file and the Kerberos configuration file. The installation script populates the edp.properties file with the properties in the following table.

Kerberos Property	Description
`isKerberized`	Specifies whether Kerberos support should be enabled. The default value is set by the `ENABLE_KERBEROS` property in the `bdd.conf` file.
`localKerberosPrincipal`	The name of the Kerberos principal. The default name is set by the `KERBEROS_PRINCIPAL` property in the `bdd.conf` file.
`localKerberosKeytabPath`	Path to the Kerberos keytab file on the WebLogic Admin Server. The default path is set by the `KERBEROS_KEYTAB_PATH` property in the `bdd.conf` file.
`clusterKerberosPrincipal`	The name of the Kerberos principal. The default name is set by the `KERBEROS_PRINCIPAL` property in the `bdd.conf` file.
`clusterKerberosKeytabPath`	Path to the Kerberos keytab file on the WebLogic Admin Server. The default path is set by the `KERBEROS_KEYTAB_PATH` property in the `bdd.conf` file.
`krb5ConfPath`	Path to the `krb5.conf` configuration file. This file contains configuration information needed by the Kerberos V5 library. This includes information describing the default Kerberos realm, and the location of the Kerberos key distribution centers for known realms. The default path is set by the `KRB5_CONF_PATH` property in the `bdd.conf` file. However, you can specify a local, custom location for the `krb5.conf` file.

For further details on these parameters, see the Installation Guide.
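
If you edit the Kerberos properties by hand, a quick consistency check can catch a value that was left blank. The sketch below assumes the properties have already been loaded into a dictionary. It also assumes that all five principal and path properties should be non-empty when `isKerberized` is true, which this document does not state explicitly. The `missing_kerberos_settings` helper is hypothetical.

```python
KERBEROS_KEYS = (
    "localKerberosPrincipal",
    "localKerberosKeytabPath",
    "clusterKerberosPrincipal",
    "clusterKerberosKeytabPath",
    "krb5ConfPath",
)


def missing_kerberos_settings(props):
    """Return the Kerberos properties that are blank while Kerberos is enabled.

    Returns an empty list when isKerberized is not 'true' (case-insensitive).
    """
    enabled = props.get("isKerberized", "false").strip().lower() == "true"
    if not enabled:
        return []
    return [key for key in KERBEROS_KEYS if not props.get(key, "").strip()]


# Illustrative values only; the principal and paths are made up.
example = {
    "isKerberized": "TRUE",
    "localKerberosPrincipal": "bdd@EXAMPLE.COM",
    "localKerberosKeytabPath": "/etc/security/bdd.keytab",
    "clusterKerberosPrincipal": "bdd@EXAMPLE.COM",
    "clusterKerberosKeytabPath": "",
}
print(missing_kerberos_settings(example))
```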

Hadoop connectivity settings

The parameters that define connections to Hadoop environment processes and resources are:

Hadoop Parameter	Description
`hiveServerHost`	Name of the host on which the Hive server is running. The default value is set at BDD installation time.
`hiveServerPort`	Port on which the Hive server is listening. The default value is set at BDD installation time.
`clusterOltHome`	Path to the OLT directory on the Spark worker node. The default location is the `$BDD_HOME/common/edp/olt` directory.
`oltHome`	Both `clusterOltHome` and this parameter are required, and both must be set to the same value.
`hadoopClusterType`	The installation type, according to the Hadoop distribution. The value is set by the `INSTALL_TYPE` property in the `bdd.conf` file.
`hadoopTrustStore`	Path to the directory on the install machine where the certificates for HDFS, YARN, Hive, and the KMS are stored. Required for clusters with TLS/SSL enabled. The default path is set by the `HADOOP_CERTIFICATES_PATH` property in the `bdd.conf` file.
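
Since `oltHome` and `clusterOltHome` are both required and must hold the same value, that rule is easy to verify after an edit. This is a minimal sketch with a hypothetical helper name. It assumes the properties are available as a dictionary and compares the two values as plain strings after trimming whitespace.

```python
def olt_homes_consistent(props):
    """True only if oltHome and clusterOltHome are both set and identical."""
    olt = props.get("oltHome", "").strip()
    cluster = props.get("clusterOltHome", "").strip()
    return bool(olt) and olt == cluster


print(olt_homes_consistent({
    "oltHome": "/opt/bdd/common/edp/olt",         # illustrative path
    "clusterOltHome": "/opt/bdd/common/edp/olt",  # illustrative path
}))
```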

Spark environment settings

These parameters define settings for interactions with Spark workers:

Spark Properties	Description
`sparkMasterUrl`	Specifies the master URL of the Spark cluster. In Spark-on-YARN mode, the ResourceManager's address is picked up from the Hadoop configuration by simply specifying `yarn-cluster` for this parameter. The default value is set at the BDD installation time.
`sparkDynamicAllocation`	Indicates if Data Processing will dynamically compute the executor resources or use static executor resource configuration: If set to false, the values of the static resource parameters (`sparkDriverMemory`, `sparkDriverCores`, `sparkExecutorMemory`, `sparkExecutorCores`, and `sparkExecutors`) are required and are used. If set to true, the values for the executor resources are dynamically computed. This means that the static resource parameters are not required and will be ignored even if specified. The default is set by the `SPARK_DYNAMIC_ALLOCATION` property in the `bdd.conf` file.
`sparkDriverMemory`	Amount of memory to use for each Spark driver process, in the same format as JVM memory strings (such as 512m, 2g, 10g, and so on). The default is set by the `SPARK_DRIVER_MEMORY` property in the `bdd.conf` file.
`sparkDriverCores`	Maximum number of CPU cores that the Spark driver can use. The default is set by the `SPARK_DRIVER_CORES` property in the `bdd.conf` file.
`sparkExecutorMemory`	Amount of memory to use for each Spark executor process, in the same format as JVM memory strings (such as 512m, 2g, 10g, and so on). The default is set by the `SPARK_EXECUTOR_MEMORY` property in the `bdd.conf` file. This setting must be less than or equal to Spark's Total Java Heap Sizes of Worker's Executors in Bytes (`executor_total_max_heapsize`) property in Cloudera Manager. You can access this property in Cloudera Manager by selecting Clusters > Spark (Standalone), then clicking the Configuration tab. This property is in the Worker Default Group category (using the classic view).
`sparkExecutorCores`	Maximum number of CPU cores to use for each Spark executor. The default is set by the `SPARK_EXECUTOR_CORES` property in the `bdd.conf` file.
`sparkExecutors`	Total number of Spark executors to launch. The default is set by the `SPARK_EXECUTORS` property in the `bdd.conf` file.
`yarnQueue`	The YARN queue to which the Data Processing job is submitted. The default value is set by the `YARN_QUEUE` property in the `bdd.conf` file.
`maxSplitSizeMB`	The maximum partition size for Spark inputs, in MB. This controls the size of the blocks of data handled by Data Processing jobs. This property overrides the HDFS block size used in Hadoop. Partition size directly affects Data Processing performance: when partitions are smaller, more jobs run in parallel and cluster resources are used more efficiently, which improves both speed and stability. The default is set by the `MAX_INPUT_SPLIT_SIZE` property in the `bdd.conf` file (which is 32, unless changed by the user). The 32MB amount should be sufficient for most clusters, with a few exceptions: If your Hadoop cluster has a very large processing capacity and most of your data sets are small (around 1GB), you can decrease this value. In rare cases, when data enrichments are enabled, the enriched data set in a partition can become too large for its YARN container to handle. If this occurs, you can decrease this value to reduce the amount of memory each partition requires. If this property is empty, the DP CLI logs an error at start-up and uses a default value of 32MB.
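
Two of the rules in the table above lend themselves to a scripted check: the static resource parameters are required when `sparkDynamicAllocation` is false, and an empty `maxSplitSizeMB` falls back to 32. The sketch below illustrates both. The helper names are hypothetical, and treating a missing `sparkDynamicAllocation` value as false is an assumption.

```python
STATIC_RESOURCE_KEYS = (
    "sparkDriverMemory",
    "sparkDriverCores",
    "sparkExecutorMemory",
    "sparkExecutorCores",
    "sparkExecutors",
)


def missing_static_resources(props):
    """List static resource properties that are required but blank.

    With dynamic allocation enabled the static values are ignored,
    so nothing is reported as missing.
    """
    dynamic = props.get("sparkDynamicAllocation", "").strip().lower() == "true"
    if dynamic:
        return []
    return [key for key in STATIC_RESOURCE_KEYS if not props.get(key, "").strip()]


def effective_max_split_size_mb(props):
    """Return maxSplitSizeMB as an int, or 32 when the property is empty."""
    raw = props.get("maxSplitSizeMB", "").strip()
    return int(raw) if raw else 32


# Illustrative values only.
example = {
    "sparkDynamicAllocation": "false",
    "sparkDriverMemory": "2g",
    "sparkDriverCores": "1",
    "sparkExecutorMemory": "4g",
    "maxSplitSizeMB": "",
}
print(missing_static_resources(example))
print(effective_max_split_size_mb(example))
```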

Jar location settings

These properties specify the paths for jars used by workflows:

Jar Property	Description
`sparkYarnJar`	Path to JAR files used by Spark-on-YARN. The default path is set by the `SPARK_ON_YARN_JAR` property in the `bdd.conf` file. However, additional JARs (such as `edpLogging.jar`) are appended to the path by the installer.
`bddHadoopFatJar`	Path to the location of the Hadoop Shared Library (file name of `bddHadoopFatJar.jar`) on the cluster. The path is set by the installer and is typically the `$BDD_HOME/common/hadoop/lib` directory. Note that the `data_processing_CLI` script has a `BDD_HADOOP_FATJAR` property that specifies the location of the Hadoop Shared Library on the local file system of the DP CLI client.
`edpJarDir`	Path to the directory where the Data Processing JAR files for Spark workers are located on the cluster. The default location is the `$BDD_HOME/common/edp/lib` directory.
`extraJars`	Path to any extra JAR files to be used by customers, such as the path to a custom SerDe JAR. The default path is set by the `DP_ADDITIONAL_JARS` property in the `bdd.conf` file.

Kryo serialization settings

These properties define the use of Kryo serialization:

Kryo Property	Description
`kryoMode`	Specifies whether to enable (`true`) or disable (`false`) Kryo for serialization. Make sure that this property is set to `false` because Kryo serialization is not supported in BDD.
`kryoBufferMemSizeMB`	Maximum object size (in MBs) to allow within Kryo. This property, like the `kryoMode` property, is not supported by BDD workflows.

JAVA_HOME setting

In addition to setting the CLI configuration properties, make sure that the JAVA_HOME environment variable is set to the directory containing the specific version of Java that will be called when you run the Data Processing CLI.
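
A short pre-flight check can confirm that `JAVA_HOME` is set and points at a Java installation. The sketch below is an illustration rather than part of the DP CLI. The `java_executable` helper is hypothetical, and it assumes the usual JDK layout in which the launcher is at `bin/java` under `JAVA_HOME` on a Linux or UNIX host.

```python
import os
from pathlib import Path


def java_executable(env=None):
    """Return the path to JAVA_HOME/bin/java, or None if it cannot be found.

    The environment mapping defaults to os.environ; pass a dict for testing.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME", "").strip()
    if not java_home:
        return None
    candidate = Path(java_home) / "bin" / "java"
    return candidate if candidate.is_file() else None


java = java_executable()
if java is None:
    print("JAVA_HOME is not set or does not contain bin/java")
else:
    print("Java launcher found at", java)
```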