List of Data Processing Settings

The settings listed below must be set correctly in order for data processing tasks to run.

Many of the default values for these settings are populated based on the values specified in bdd.conf during the installation process.

In general, the settings below should match the Data Processing CLI configuration properties, which are contained in the script itself. Parameters that must be the same are noted as such below. For information about the Data Processing CLI configuration properties, see the Data Processing Guide.
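As a rough sketch of what such configuration properties look like as key=value pairs, using the setting names and default values documented in the list below (the layout is hypothetical; the actual property names inside the DP CLI script may differ, see the Data Processing Guide):

```
# Hypothetical excerpt; the setting names and defaults come from the
# list below, but the file layout itself is illustrative only.
bdd.enableEnrichments=true
bdd.sampleSize=1000000
bdd.maxSplitSize=32
```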

Important:

Except where noted, editing the Data Processing settings is not supported in Big Data Discovery Cloud Service.
bdd.enableEnrichments
Specifies whether to run data enrichments during the sampling phase of data processing. This setting controls the Language Detection, Term Extraction, Geocoding Address, Geocoding IP, and Reverse Geotagger modules. A value of true runs all of the data enrichment modules; a value of false runs none of them. You cannot enable an individual enrichment module. The default value is true.

Note:

Editing this setting is supported in BDD Cloud Service.
bdd.sampleSize
Specifies the maximum number of records in a data set's sample. This is a global setting: it controls both the sample size for all files uploaded using Studio and the sample size that results from transform operations such as Join, Aggregate, and FilterRows.

For example, if you upload a file that has 5,000,000 rows, you could restrict the total number of sampled records to 1,000,000.

The default value is 1,000,000. (This value is approximate. After data processing, the actual sample size may be slightly more or slightly less than this value.)

Note:

Editing this setting is supported in BDD Cloud Service.
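To illustrate why the final count is approximate: a common way to draw such a sample is to keep each record independently with probability sampleSize/totalRows, so the result lands near, but rarely exactly on, the target. A minimal sketch of that idea (illustrative only, not BDD's actual implementation):

```python
import random

def bernoulli_sample(rows, target_size, total_rows):
    # Keep each row independently with probability target_size / total_rows.
    # The expected sample size equals target_size, but the actual count
    # varies slightly around it, which is why the setting is approximate.
    p = min(1.0, target_size / total_rows)
    return [row for row in rows if random.random() < p]

sample = bernoulli_sample(range(5_000_000), 1_000_000, 5_000_000)
print(len(sample))  # close to 1,000,000, but rarely exactly 1,000,000
```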
bdd.maxSplitSize
The maximum partition size for Spark jobs, in MB. This controls the size of the blocks of data handled by Data Processing jobs.

Partition size directly affects Data Processing performance: when partitions are smaller, more tasks run in parallel and cluster resources are used more efficiently, which improves both speed and stability, as illustrated in the sketch below.

The default is set by the MAX_INPUT_SPLIT_SIZE property in the bdd.conf file (32, unless changed by the user). A 32MB split size should be sufficient for most clusters, with a few exceptions:
  • If your Hadoop cluster has a very large processing capacity and most of your data sets are small (around 1GB), you can decrease this value.
  • In rare cases, when data enrichments are enabled the enriched data set in a partition can become too large for its YARN container to handle. If this occurs, you can decrease this value to reduce the amount of memory each partition requires.

Note that this property overrides the HDFS block size used in Hadoop.
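As a back-of-the-envelope illustration of the parallelism trade-off described above (simple arithmetic, not a BDD or Spark API), the number of partitions for a data set is roughly its size divided by the split size:

```python
import math

def approx_partitions(dataset_size_mb: float, max_split_size_mb: float = 32.0) -> int:
    # Roughly one Spark partition per split of the input data.
    return math.ceil(dataset_size_mb / max_split_size_mb)

# A 1GB (1024MB) data set:
print(approx_partitions(1024, 32))  # 32 partitions
print(approx_partitions(1024, 16))  # 64 partitions: smaller splits, more parallel tasks
```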

Data Processing Topology

In addition to the configurable settings above, you can review the data processing topology by navigating to the Big Data Discovery > About Big Data Discovery page and expanding the Data Processing Topology drop-down. This exposes the following information:

Hadoop Admin Console
The hostname and Admin Console port of the machine that acts as the Master for your Hadoop cluster.

Name Node
The NameNode internal Web server and port.

Hive Metastore Server
The Hive Metastore listener and port.

Hive Server
The Hive server listener and port.

Hue Server
The Hue Web interface server and port.

Cluster OLT Home
The OLT home directory in the BDD cluster. The BDD installer detects this value and populates the setting.

Database Name
The name of the Hive database that stores the source data for Studio data sets.

EDP Data Directory
The directory that contains the contents of the edp_cluster_*.zip file on each worker node.

Sandbox
The HDFS directory in which to store the Avro files created when users export data from Big Data Discovery. The default value is /user/bdd.
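For example, assuming a standard Hadoop client is available, you could inspect the exported Avro files by listing that directory:

```
hdfs dfs -ls /user/bdd
```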