List of Data Processing Settings

The settings listed in the table below must be set correctly in order to perform data processing tasks.

Many of the default values for these settings are populated based on the values specified in bdd.conf during the installation process.

In general, the settings below should match the Data Processing CLI configuration properties, which are contained in the CLI script itself. Parameters that must be the same are noted as such in the table below. For information about the Data Processing CLI configuration properties, see the Data Processing Guide.

Important: Except where noted, editing the Data Processing settings is not supported in Big Data Discovery Cloud Service.
Hadoop Setting Description
bdd.enableEnrichments Specifies whether to run data enrichments during the sampling phase of data processing. This setting controls the Language Detection, Term Extraction, Geocoding Address, Geocoding IP, and Reverse Geotagger modules. A value of true runs all the data enrichment modules and false does not run them. You cannot enable an individual enrichment. The default value is true.
Note: Editing this setting is supported in BDD Cloud Service.
bdd.sampleSize Specifies the maximum number of records in the sample of a data set. This global setting controls the sample size for all files uploaded using Studio, as well as the sample size resulting from transform operations such as Join, Aggregate, and FilterRows.

For example, if you upload a file that has 5,000,000 rows, you could restrict the total number of sampled records to 1,000,000.

The default value is 1,000,000. (This value is approximate. After data processing, the actual sample size may be slightly more or slightly less than this value.)

Note: Editing this setting is supported in BDD Cloud Service.
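
The behavior of the two settings above can be illustrated with a minimal Python sketch. This is not Big Data Discovery code: the DataProcessingSettings class and the modules_to_run and target_sample_size functions are hypothetical names introduced only for illustration. The sketch assumes that the sample is simply capped at bdd.sampleSize and ignores the small variation in the actual sample size described above.

```python
from dataclasses import dataclass

# Enrichment modules controlled by bdd.enableEnrichments, as listed in the table.
ENRICHMENT_MODULES = (
    "Language Detection",
    "Term Extraction",
    "Geocoding Address",
    "Geocoding IP",
    "Reverse Geotagger",
)


@dataclass(frozen=True)
class DataProcessingSettings:
    """Hypothetical container mirroring the two settings; not a BDD API."""
    enable_enrichments: bool = True   # default of bdd.enableEnrichments
    sample_size: int = 1_000_000      # default of bdd.sampleSize


def modules_to_run(settings):
    """Return all modules or none: a single enrichment cannot be enabled alone."""
    return ENRICHMENT_MODULES if settings.enable_enrichments else ()


def target_sample_size(source_rows, settings):
    """Nominal cap on sampled records (assumption: a plain minimum)."""
    return min(source_rows, settings.sample_size)


if __name__ == "__main__":
    settings = DataProcessingSettings()
    # A 5,000,000-row upload with the default cap of 1,000,000 records.
    print(target_sample_size(5_000_000, settings))
    print(modules_to_run(settings))
```

With the default values, a 5,000,000-row file yields a nominal sample of 1,000,000 records, while a file smaller than the cap is sampled in full. Setting enable_enrichments to False in the sketch returns no modules, which mirrors the all-or-nothing behavior of bdd.enableEnrichments.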

Data Processing Topology

In addition to the configurable settings above, you can review the data processing topology by navigating to Big Data Discovery and then the About Big Data Discovery page, and expanding the Data Processing Topology drop-down. This exposes the following information:

Hadoop Setting Description
Hadoop Admin Console The hostname and Admin Console port of the machine that acts as the Master for your Hadoop cluster.
Name Node The NameNode internal Web server and port.
Hive metastore Server The Hive metastore listener and port.
Hive Server The Hive server listener and port.
Hue Server The Hue Web interface server and port.
Database Name The name of the Hive database that stores the source data for Studio data sets.
Sandbox The HDFS directory in which to store the Parquet files created when users export data from Big Data Discovery. The default value is /user/bdd.