The settings listed in the table below must be configured correctly in order to perform data processing tasks.
Many of the default values for these settings are populated based on the values specified in bdd.conf during the installation process.
In general, the settings below should match the Data Processing CLI configuration properties, which are contained in the script itself. Parameters that must be the same are noted as such in the table below. For information about the Data Processing CLI configuration properties, see the Data Processing Guide.
Hadoop Setting | Description |
---|---|
bdd.enableEnrichments | Specifies whether to run data enrichments during the sampling phase of data processing. This setting controls the Language Detection, Term Extraction, Geocoding Address, Geocoding IP, and Reverse Geotagger modules. A value of true runs all of the data enrichment modules; a value of false runs none of them. You cannot enable an individual enrichment. The default value is true. Note: Editing this setting is supported in BDD Cloud Service. |
bdd.sampleSize | Specifies the maximum number of records in the sample of a data set. This global setting controls both the sample size for all files uploaded using Studio and the sample size resulting from transform operations such as Join, Aggregate, and FilterRows. For example, if you upload a file that has 5,000,000 rows, you could restrict the total number of sampled records to 1,000,000. The default value is 1,000,000. (This value is approximate. After data processing, the actual sample size may be slightly more or slightly less than this value.) Note: Editing this setting is supported in BDD Cloud Service. |
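The note that the actual sample size is approximate is typical of per-record random sampling: if each record is kept independently with probability sampleSize / totalRecords, the resulting count clusters around the target without matching it exactly. The following is a minimal illustration in plain Python, not BDD code; the function name is hypothetical, and the row counts are scaled down from the 5,000,000-row example above:

```python
import random

def approximate_sample(records, target_size):
    # Keep each record independently with probability target_size / len(records).
    # The sample ends up close to target_size but rarely exactly equal,
    # mirroring the "slightly more or slightly less" behavior described
    # for bdd.sampleSize.
    p = min(1.0, target_size / len(records))
    return [r for r in records if random.random() < p]

random.seed(0)
rows = list(range(500_000))  # stand-in for a source file's rows
sample = approximate_sample(rows, 100_000)
print(len(sample))           # approximately 100,000, rarely exact
```

The real sampling logic belongs to the Data Processing component; this sketch only illustrates why an exact match to bdd.sampleSize is not guaranteed.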
In addition to the configurable settings above, you can review the data processing topology by navigating to Big Data Discovery, opening the About Big Data Discovery page, and expanding the Data Processing Topology drop-down. This exposes the following information:
Hadoop Setting | Description |
---|---|
Hadoop Admin Console | The hostname and Admin Console port of the machine that acts as the Master for your Hadoop cluster. |
Name Node | The NameNode internal Web server and port. |
Hive Metastore Server | The Hive metastore listener and port. |
Hive Server | The Hive server listener and port. |
Hue Server | The Hue Web interface server and port. |
Database Name | The name of the Hive database that stores the source data for Studio data sets. |
Sandbox | The HDFS directory in which to store the Parquet files created when users export data from Big Data Discovery. The default value is /user/bdd. |