The DP CLI has a configuration file, edp.properties, that sets its default properties.
By default, the edp.properties file is located in the $BDD_HOME/dataprocessing/edp_cli/config directory.
Some of the default values for the properties are populated from the bdd.conf installation configuration file. After installation, you can change the CLI configuration parameters by opening the edp.properties file with a text editor.
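For example, on a default installation you can open the file for editing like this:

```sh
# Edit the DP CLI configuration ($BDD_HOME is your Big Data Discovery install root)
vi $BDD_HOME/dataprocessing/edp_cli/config/edp.properties
```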
Data Processing Property | Description |
---|---|
maxRecordsForNewDataSet | The maximum number of records to be processed for each new data set (that is, the number of sampled records from the source Hive table). In effect, this sets the maximum number of records in a BDD data set. The default is set by the MAX_RECORDS property in the bdd.conf file. The CLI --maxRecords flag can override this setting. |
runEnrichment | Specifies whether to run the Data Enrichment modules. The default is set by the ENABLE_ENRICHMENTS property in the bdd.conf file. You can override this setting by using the CLI --runEnrichment flag. The CLI --excludePlugins flag can also be used to exclude some of the Data Enrichment modules. |
defaultLanguage | The language for all attributes in the created data set. The default is set by the LANGUAGE property in the bdd.conf file. For the supported language codes, see Supported languages. |
edpDataDir | Specifies the location of the HDFS directory where data ingest and transform operations are processed. The default location is the /user/bdd/edp/data directory. |
datasetAccessType | Sets the access type for the data set, which determines which Studio users can access the data set in the Studio UI. This property takes one of a set of case-insensitive values. |
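To illustrate, a minimal excerpt of these data-processing properties in edp.properties might look like the following sketch; the values are examples only, not recommendations:

```properties
# Cap on sampled records per new data set (overridable with --maxRecords)
maxRecordsForNewDataSet=1000000
# Run the Data Enrichment modules (overridable with --runEnrichment)
runEnrichment=true
# Language code applied to all attributes in the data set
defaultLanguage=en
# HDFS working directory for ingest and transform operations (the shipped default)
edpDataDir=/user/bdd/edp/data
```

At run time, the --maxRecords and --runEnrichment flags take precedence over the values in the file.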
Dgraph Gateway Property | Description |
---|---|
endecaServerHost | The name of the host on which the Dgraph Gateway is running. The default name is specified in the bdd.conf configuration file. |
endecaServerPort | The port on which Dgraph Gateway is listening. The default is 7003. |
endecaServerContextRoot | The context root of the Dgraph Gateway when running on Managed Servers within the WebLogic Server. The value should be set to: /endeca-server |
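For instance, a sketch of the Dgraph Gateway settings, assuming a placeholder host name:

```properties
# Host running the Dgraph Gateway (placeholder)
endecaServerHost=web009.example.com
# Default Dgraph Gateway listening port
endecaServerPort=7003
# Context root on WebLogic Managed Servers
endecaServerContextRoot=/endeca-server
```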
The DP CLI is enabled for Kerberos support at installation time if the ENABLE_KERBEROS property in the bdd.conf file is set to TRUE. The bdd.conf file also has parameters for specifying the name of the Kerberos principal, as well as paths to the Kerberos keytab file and the Kerberos configuration file. The installation script populates the data_processing_CLI script with the properties in the following table.
Kerberos Property | Description |
---|---|
isKerberized | Specifies whether Kerberos support should be enabled. The default value is set by the ENABLE_KERBEROS property in the bdd.conf file. |
localKerberosPrincipal | The name of the Kerberos principal. The default name is set by the KERBEROS_PRINCIPAL property in the bdd.conf file. |
localKerberosKeytabPath | Path to the Kerberos keytab file on the WebLogic Admin Server. The default path is set by the KERBEROS_KEYTAB_PATH property in the bdd.conf file. |
clusterKerberosPrincipal | The name of the Kerberos principal used on the Hadoop cluster. The default name is set by the KERBEROS_PRINCIPAL property in the bdd.conf file. |
clusterKerberosKeytabPath | Path to the Kerberos keytab file on the Hadoop cluster. The default path is set by the KERBEROS_KEYTAB_PATH property in the bdd.conf file. |
krb5ConfPath | Path to the krb5.conf configuration file. This file contains configuration information needed by the Kerberos V5 library, including the default Kerberos realm and the locations of the Kerberos key distribution centers for known realms. The default path is set by the KRB5_CONF_PATH property in the bdd.conf file. However, you can specify a local, custom location for the krb5.conf file. |
For further details on these parameters, see the Installation and Deployment Guide.
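As a sketch, the Kerberos properties might look like this, assuming an EXAMPLE.COM realm and illustrative principal and keytab values:

```properties
isKerberized=true
# Kerberos principal (illustrative)
localKerberosPrincipal=bdd-service@EXAMPLE.COM
clusterKerberosPrincipal=bdd-service@EXAMPLE.COM
# Keytab locations (illustrative paths)
localKerberosKeytabPath=/localdisk/security/bdd.keytab
clusterKerberosKeytabPath=/opt/security/bdd.keytab
# Kerberos V5 client configuration (a common, but not guaranteed, location)
krb5ConfPath=/etc/krb5.conf
```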
Hadoop Parameter | Description |
---|---|
hiveServerHost | Name of the host on which the Hive server is running. The default value is set at BDD installation time. |
hiveServerPort | Port on which the Hive server is listening. The default value is set at BDD installation time. |
clusterOltHome | Path to the OLT directory on the Spark worker node. The default location is the /opt/bdd/edp-<version>/olt directory. |
oltHome | Both clusterOltHome and this parameter are required, and both must be set to the same value. |
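For example, assuming placeholder Hive connection values (note that oltHome must match clusterOltHome):

```properties
# Hive server connection (placeholder host and port)
hiveServerHost=hive001.example.com
hiveServerPort=10000
# OLT directory on the Spark worker nodes; <version> is your BDD version
clusterOltHome=/opt/bdd/edp-<version>/olt
# Required, and must be set to the same value as clusterOltHome
oltHome=/opt/bdd/edp-<version>/olt
```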
Spark Property | Description |
---|---|
sparkMasterUrl | Specifies the master URL of the Spark cluster. In Spark-on-YARN mode, specify yarn-cluster for this parameter; the ResourceManager's address is then picked up automatically from the Hadoop configuration. The default value is set at BDD installation time. |
sparkDynamicAllocation | Indicates whether Data Processing dynamically computes the executor resources or uses a static executor resource configuration. The default is set by the SPARK_DYNAMIC_ALLOCATION property in the bdd.conf file. |
sparkDriverMemory | Amount of memory to use for each Spark driver process, in the same format as JVM memory strings (such as 512m, 2g, 10g, and so on). The default is set by the SPARK_DRIVER_MEMORY property in the bdd.conf file. |
sparkDriverCores | Maximum number of CPU cores to use by the Spark driver. The default is set by the SPARK_DRIVER_CORES property in the bdd.conf file. |
sparkExecutorMemory | Amount of memory to use for each Spark executor process, in the same format as JVM memory strings (such as 512m, 2g, 10g, and so on). The default is set by the SPARK_EXECUTOR_MEMORY property in the bdd.conf file. This setting must be less than or equal to Spark's Total Java Heap Sizes of Worker's Executors in Bytes (executor_total_max_heapsize) property in Cloudera Manager. You can access this property in Cloudera Manager by selecting Clusters > Spark (Standalone), then clicking the Configuration tab; the property is in the Worker Default Group category (using the classic view). |
sparkExecutorCores | Maximum number of CPU cores to use for each Spark executor. The default is set by the SPARK_EXECUTOR_CORES property in the bdd.conf file. |
sparkExecutors | Total number of Spark executors to launch. The default is set by the SPARK_EXECUTORS property in the bdd.conf file. |
yarnQueue | The YARN queue to which the Data Processing job is submitted. The default value is set by the YARN_QUEUE property in the bdd.conf file. |
maxSplitSizeMB | The maximum partition size for Spark inputs, in MB. This controls the size of the blocks of data handled by Data Processing jobs and overrides the HDFS block size used in Hadoop. Partition size directly affects Data Processing performance: smaller partitions allow more tasks to run in parallel and use cluster resources more efficiently, which improves both speed and stability. The default is set by the MAX_INPUT_SPLIT_SIZE property in the bdd.conf file (which is 32, unless changed by the user). The 32MB default should be sufficient for most clusters, with a few exceptions. If this property is empty, the DP CLI logs an error at start-up and uses a default value of 32MB. |
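A sketch of the Spark properties, with sizes chosen purely for illustration; tune them to your cluster:

```properties
# Spark-on-YARN: the ResourceManager address comes from the Hadoop configuration
sparkMasterUrl=yarn-cluster
# Use the static executor configuration given below
sparkDynamicAllocation=false
# JVM memory strings, as described above (illustrative sizes)
sparkDriverMemory=2g
sparkDriverCores=1
sparkExecutorMemory=10g
sparkExecutorCores=2
sparkExecutors=4
# YARN queue for Data Processing jobs (illustrative)
yarnQueue=default
# Maximum Spark input partition size in MB (32 is the shipped default)
maxSplitSizeMB=32
```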
Jar Property | Description |
---|---|
sparkYarnJar | Path to JAR files used by Spark-on-YARN. The default path is set by the SPARK_ON_YARN_JAR property in the bdd.conf file. For CDH 5.4 installations, EdpOdlAppender.jar is appended to the path. |
bddHadoopFatJar | Path to the location of the Hadoop Shared Library (file name bddHadoopFatJar.jar) on the cluster. The path is set by the installer. Note that the data_processing_CLI script has a BDD_HADOOP_FATJAR property that specifies the location of the Hadoop Shared Library on the local file system of the DP CLI client. |
edpJarDir | Path to the directory where the Data Processing JAR files for Spark workers are located on the cluster. The default location is the /opt/bdd/edp-<version>/lib directory. |
extraJars | Path to any extra JAR files to be used by customers, such as the path to a custom SerDe JAR. The default path is set by the DP_ADDITIONAL_JARS property in the bdd.conf file. |
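For example, pointing extraJars at a custom SerDe (the JAR path is hypothetical):

```properties
# Extra JARs for customer use, such as a custom SerDe (hypothetical path)
extraJars=/opt/custom/lib/my-serde.jar
# Data Processing JARs for Spark workers; <version> is your BDD version
edpJarDir=/opt/bdd/edp-<version>/lib
```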
Kryo Property | Description |
---|---|
kryoMode | Specifies whether to enable (true) or disable (false) Kryo for serialization. The default is set by the kryoMode property in the bdd.conf file. Note that false is the recommended setting for Data Processing workflows. |
kryoBufferMemSizeMB | Maximum object size (in MB) to allow within Kryo. (The library needs to create a buffer at least as large as the largest single object you will serialize.) The default is set by the kryoBufferMemSizeMB property in the bdd.conf file. Increase this setting if you get a buffer limit exceeded exception inside Kryo. Note that there will be one buffer per core on each worker. |
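For illustration, with the recommended serialization setting:

```properties
# Kryo disabled, as recommended for Data Processing workflows
kryoMode=false
# Largest single serializable object, in MB (illustrative value)
kryoBufferMemSizeMB=64
```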
In addition to setting the CLI configuration properties, make sure that the JAVA_HOME environment variable is set to the directory containing the specific version of Java that will be called when you run the Data Processing CLI.
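For example, in the shell that invokes the CLI (the JDK path is a placeholder for your installation):

```sh
# Point JAVA_HOME at the JDK the DP CLI should use (placeholder path)
export JAVA_HOME=/usr/java/jdk1.8.0
```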