Updating the configuration file

Once you have created the installation source directory, you must configure your deployment by updating the bdd.conf file, which is located in the /BDD_deployer/installer directory.

Important: The bdd.conf file defines the configuration of your BDD cluster and provides the orchestration script with the parameters it requires to run. Updating this file is the most important step of the installation and deployment process. If you don't modify the file, or if you modify it incorrectly, the orchestration script could fail or your cluster could be configured differently than you intended.

You can edit the configuration file using any text editor. Be sure to save your changes before closing.

The orchestration script validates the configuration file at runtime and fails if the file contains any invalid values. To avoid this, keep the following in mind when updating the file:
  • You must provide a value for all properties except DGRAPH_ADDITIONAL_ARG, which is only intended for use by Oracle Support.
  • The accepted values for some properties are case-sensitive and must be entered exactly as they appear in this document.
  • You must provide fully qualified hostnames.
  • Any symlinks included in paths must be identical on all nodes. If any are different, or do not exist, the installation may fail.
  • Each port setting must have a unique value. You cannot use the same port number more than once.
  • Some of the directories defined in the configuration file have location requirements. These are specified in this document.

The following sections describe the properties in the configuration file and any requirements or restrictions they have. The configuration file itself also provides some of this information. Be sure to read the following sections carefully before modifying any properties.

Global settings

The first section in bdd.conf configures global settings, which are relevant to all components and the installation and deployment process itself.

Configuration property Description
INSTALL_TYPE Sets the installation type according to the hardware you're installing on. This can be set to one of the following:
  • BDA: Use this value if you're installing on the Oracle Big Data Appliance.
  • GENERIC: Use this value if you're installing on general-purpose hardware. This is the default value.
  • OPC: Use this value if you're installing on the Oracle Public Cloud.

Note that this document does not cover BDA or Cloud installation. For information on installing on either platform, please contact Oracle Customer Support.

CLUSTER_MODE Determines whether you're deploying to a single machine or a cluster. Use TRUE if you're deploying to a cluster. This is the default value.

If you're deploying to a single machine, use FALSE. When deploying to a single machine, you should also be sure that the MANAGED_SERVERS, DGRAPH_SERVERS, and DETECTOR_SERVER properties are set to ${ADMIN_SERVER}, or the orchestration script will fail.

Note that this property only accepts UPPERCASE values.

FORCE Determines whether the orchestration script will remove files and directories left over from previous installations when it runs.

When set to TRUE, the orchestration script removes any previous installations from the ORACLE_HOME directory. Use this value if you're rerunning the script after a failed attempt.

When set to FALSE, the orchestration script does not remove any previous installations. If one exists, the script will fail. This is the default value.

Note that this property only accepts UPPERCASE values.

ORACLE_HOME The path to the BDD root directory, where BDD will be installed on all nodes in the cluster. The orchestration script creates this directory, so it must not already exist.
Important: You must ensure that this directory can be created on all nodes that BDD will be installed on, including CDH nodes that will host Data Processing.

On the Admin Server and nodes that will host WebLogic Server, this directory must contain at least 6GB of free space. Nodes that will host the Dgraph require 1GB of free space, and those that will host Data Processing require 2GB.

The default value is /localdisk/Oracle/Middleware.

ORACLE_INV_PTR The path to the Oracle inventory pointer file. This file can't be located in the ORACLE_HOME directory. The default value is /localdisk/Oracle/oraInst.loc.

If other Oracle software products are installed on the machine, this file already exists; in that case, update this value to point to the existing file.

JAVA_HOME The path to the JDK install directory. This must be the same on all BDD servers. Note that this property is not the same as the JAVA_PATH property. The default value is /usr/java/jdk1.7.0_67.
INSTALLER_PATH The path to the installation source directory on the Admin Server (the location you moved the installation packages to). This directory must contain at least 6GB of free space. The default value is /localdisk/BDD_deployer/packages.
BDD_HOME The path to the BDD install directory, which the orchestration script will create on all BDD servers. This directory must be inside ORACLE_HOME. The default value is ${ORACLE_HOME}/BDD1.0.
ENABLE_AUTOSTART Determines whether the BDD cluster will automatically restart after its servers are rebooted:
  • TRUE: WebLogic (including Studio and the Dgraph Gateway), the Dgraph, and the HDFS Agent will automatically restart after their host servers are rebooted. This is the default value.
  • FALSE: WebLogic, the Dgraph, and the HDFS Agent must be restarted manually.

Note that this property only accepts UPPERCASE values.

TEMP_FOLDER_PATH The temporary directory used on each node during the installation. The default value is /tmp.

On the Admin Server and nodes that will host WebLogic Server or the Dgraph, this directory must contain at least 10GB of free space. Nodes that will host Data Processing require 3GB of free space.
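
For illustration, a global-settings block for a single-machine installation on general-purpose hardware might look like the following. This is a sketch, not a complete file: it assumes bdd.conf uses shell-style NAME=value assignments (consistent with the ${...} references in the documented defaults), and every value shown is either a documented default or a placeholder you would replace with your own.

# Global settings (sketch; single machine, generic hardware)
INSTALL_TYPE=GENERIC
# Single machine; also set MANAGED_SERVERS, DGRAPH_SERVERS, and DETECTOR_SERVER to ${ADMIN_SERVER}
CLUSTER_MODE=FALSE
FORCE=FALSE
ORACLE_HOME=/localdisk/Oracle/Middleware
ORACLE_INV_PTR=/localdisk/Oracle/oraInst.loc
JAVA_HOME=/usr/java/jdk1.7.0_67
INSTALLER_PATH=/localdisk/BDD_deployer/packages
BDD_HOME=${ORACLE_HOME}/BDD1.0
ENABLE_AUTOSTART=TRUE
TEMP_FOLDER_PATH=/tmp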

WebLogic settings

The third section in bdd.conf configures the WebLogic Server, including the Admin Server and all Managed Servers. It does not configure Studio or the Dgraph Gateway.

Configuration property Description and possible settings
WLS_START_MODE Defines the mode WebLogic Server will start in.

If set to prod, the WebLogic Server starts in production mode, which requires a username and password when it starts. This is the default value.

If set to dev, it starts in development mode, which does not require a username or password. The orchestration script will still prompt you for a username and password at runtime, but these will not be required when starting WebLogic Server.

Note that this property only accepts lowercase values.

ADMIN_SERVER The fully qualified hostname of the machine that will become the WebLogic Admin Server. This should be the machine you are currently working on.

There is no default value for this property, so you must provide one; the orchestration script will fail if it is not set.

MANAGED_SERVERS A comma-separated list of the fully qualified hostnames of the WebLogic Managed Servers (the servers that will run WebLogic, Studio, and the Dgraph Gateway). This list must include the hostname for the Admin Server, and cannot contain duplicate values.

If you're installing on a single machine, this property should be set to ${ADMIN_SERVER}, or the orchestration script will fail.

WEBLOGIC_DOMAIN_NAME The name of the WebLogic domain, which Studio and the Dgraph Gateway run in. The default value is bdd_domain.
ADMIN_SERVER_PORT The port number used by the Admin Server. This number must be unique. The default value is 7001.
MANAGED_SERVER_PORT The port used by the Managed Server (i.e., Studio). This number must be unique. The default value is 7003.

This property is still required if you are installing on a single server.

WLS_CPU_CORES

This property does not set the number of CPU cores that the WebLogic Server will actually use.

Instead, the orchestration script uses this value at runtime to check whether each Managed Server has the minimum number of CPU cores required.

The value you enter should be less than or equal to the number of CPU cores available on the node. If you are unsure of how many cores a node has, check its node file.

If you enter a value that is greater than the total number of cores available on the node, the script issues a warning but continues to run. If you do not specify a number for this property, the script uses the default value of 4.

WLS_RAM_SIZE

This property does not set the RAM size that the WebLogic Server will actually use.

Instead, the orchestration script uses this value at runtime to check whether each Managed Server has the minimum amount of RAM required, in KB.

The value you enter (in KB) should be less than or equal to the total amount of RAM available on the node. If you are unsure of how much RAM a node has, check its node file.

If you enter a value that is greater than the total amount of RAM available on the node, the script issues a warning but continues to run. If you do not specify a value for this property, the script uses the default value of 2048000 KB.
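
If you are unsure how many cores or how much RAM a node actually has, standard Linux commands can report both. These are general operating-system commands, not part of the BDD installer:

grep -c ^processor /proc/cpuinfo   # number of logical CPU cores
grep MemTotal /proc/meminfo        # total RAM, reported in KB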

WLS_SECURE_MODE Enables and disables SSL for Studio's outward-facing ports.

This can be set to TRUE or FALSE. When set to TRUE, the Studio instances on the Admin Server and the Managed Servers listen for requests on the ADMIN_SERVER_SECURE_PORT and MANAGED_SERVER_SECURE_PORT, respectively.

The default value is TRUE. Note that this property does not enable SSL for any other BDD components.

ADMIN_SERVER_SECURE_PORT The secure port on the Admin Server on which Studio listens when WLS_SECURE_MODE is set to TRUE. This number must be unique. The default value is 7002.

Note that when SSL is enabled, Studio still listens on the non-secure ADMIN_SERVER_PORT for requests from the Dgraph Gateway.

MANAGED_SERVER_SECURE_PORT The secure port on the Managed Server on which Studio listens when WLS_SECURE_MODE is set to TRUE. This number must be unique. The default value is 7004.

Note that when SSL is enabled, Studio still listens on the non-secure MANAGED_SERVER_PORT for requests from the Dgraph Gateway.
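
As a sketch of the WebLogic block for a clustered deployment, assuming the same NAME=value syntax and three hypothetical hosts (web01.example.com through web03.example.com), the configuration might read as follows; the port, core, and RAM values are the documented defaults:

WLS_START_MODE=prod
ADMIN_SERVER=web01.example.com
# The Managed Server list must include the Admin Server and contain no duplicates
MANAGED_SERVERS=web01.example.com,web02.example.com,web03.example.com
WEBLOGIC_DOMAIN_NAME=bdd_domain
ADMIN_SERVER_PORT=7001
MANAGED_SERVER_PORT=7003
WLS_CPU_CORES=4
WLS_RAM_SIZE=2048000
WLS_SECURE_MODE=TRUE
ADMIN_SERVER_SECURE_PORT=7002
MANAGED_SERVER_SECURE_PORT=7004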

CDH settings

The second section in bdd.conf contains properties related to Cloudera Manager. The orchestration script uses the values you provide to query Cloudera Manager for information about the other CDH components, such as the URIs and names of their host servers.

Configuration property Description and possible settings
CM_HOST The hostname of the server running Cloudera Manager. The default value is ${ADMIN_SERVER}.
CM_PORT The port number used by the server running Cloudera Manager. The default value is 7180.
CM_CLUSTER_NAME The name of the CDH cluster, as it is listed in Cloudera Manager. Be sure to replace any spaces in the cluster name with %20. The default value is Cluster%201.
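
For example, if Cloudera Manager runs on a hypothetical host cm01.example.com on its default port, and the cluster appears in Cloudera Manager as "Cluster 1", this block would read (note the space encoded as %20):

CM_HOST=cm01.example.com
CM_PORT=7180
CM_CLUSTER_NAME=Cluster%201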

Dgraph Gateway settings

The fourth section in bdd.conf configures the Dgraph Gateway.

Configuration property Description and possible settings
ENDECA_SERVER_LOG_LEVEL The log level used by the Dgraph Gateway:
  • DEBUG
  • INFO
  • WARN
  • ERROR
  • FATAL

The default value is ERROR.

More information on Dgraph Gateway log levels is available in the Oracle Big Data Discovery Administrator's Guide.
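
For example, to run the Dgraph Gateway with more verbose logging than the default, you might set:

ENDECA_SERVER_LOG_LEVEL=WARN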

Studio settings

The fifth section in bdd.conf configures Studio.

Configuration property Description and possible settings
SERVER_TIMEOUT The timeout value (in milliseconds) used when responding to requests sent to all Dgraph Gateway web services except the Data Ingest Web Service. A value of 0 means there is no timeout. The default value is 300000.
SERVER_INGEST_TIMEOUT The timeout value (in milliseconds) used when responding to requests sent to the Data Ingest Web Service. A value of 0 means there is no timeout. The default value is 1680000.
SERVER_HEALTHCHECK_TIMEOUT The timeout value (in milliseconds) used when checking data source availability when connections are initialized. A value of 0 means there is no timeout. The default value is 10000.
STUDIO_JDBC_URL The JDBC URL for the database, which enables Studio to connect to it. There are three templates for this property, but only one can be used. The remaining two must be commented out with a hash symbol (#).
The first template is for MySQL 5.5.3 (or later) databases, which use the com.mysql.jdbc.Driver driver. This template is uncommented by default. If you are using a MySQL database, leave this template uncommented, make sure the other two are commented out, and update the URL as follows:
jdbc:mysql://<database hostname>:<port number>/<database name>?useUnicode=true&characterEncoding=UTF-8&useFastDateParsing=false
The second template is for Oracle 11g or 12c databases, which use the oracle.jdbc.OracleDriver driver. If you are using an Oracle database, uncomment this template, comment out the other two, and update the URL as follows:
jdbc:oracle:thin:@<database hostname>:<port number>:<database SID>

The third template is for Hypersonic databases, which use the org.hsqldb.jdbcDriver driver. Hypersonic is not supported for production environments, so you should only use this template if you are deploying to a demo environment. If you want the orchestration script to create a Hypersonic database for you, uncomment this template and comment out the other two. The orchestration script will create the database in the location defined by the URL.

Note: BDD does not currently support database migration. After deployment, the only ways to change to a different database are to reconfigure the database itself or reinstall BDD.
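
As an illustration, for a MySQL database on a hypothetical host db01.example.com (port 3306, database name studio), the MySQL template would be filled in as follows, with the Oracle and Hypersonic templates left commented out:

jdbc:mysql://db01.example.com:3306/studio?useUnicode=true&characterEncoding=UTF-8&useFastDateParsing=false

The equivalent Oracle form, with a hypothetical SID of orcl on the default listener port, would be:

jdbc:oracle:thin:@db01.example.com:1521:orcl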

Dgraph and HDFS Agent settings

The sixth section in bdd.conf configures the Dgraph and HDFS Agent.

Configuration property Description and possible settings
DGRAPH_SERVERS A comma-separated list of the fully qualified hostnames of all Dgraph nodes in the cluster. The orchestration script will install and deploy the Dgraph to these nodes.

This list cannot contain duplicate values. Additionally, as Oracle does not recommend cohosting the Dgraph with Spark, this list should not contain hostnames of Spark nodes.

If you're installing on a single machine, this property should be set to ${ADMIN_SERVER}, or the orchestration script will fail.

DGRAPH_CPU_CORES

This property does not set the number of cores that the Dgraph will actually use.

Instead, the orchestration script uses this value at runtime to check whether each Dgraph node has the minimum number of CPU cores required for hosting the Dgraph and the HDFS Agent.

The value you enter should be less than or equal to the number of cores available on the machine.

If you enter a value that is greater than the total number of cores available on the Dgraph nodes, the script issues a warning but continues to run.

If you do not specify a number for this property, the orchestration script uses the default value of 2 cores.

DGRAPH_RAM_SIZE

This property does not set the RAM size that the Dgraph will actually use.

Instead, the orchestration script uses this value (in KB) at runtime to check whether each Dgraph node has the minimum amount of RAM required for hosting the Dgraph and the HDFS Agent.

The value you enter should be less than or equal to the total amount of RAM available on the node.

If you enter a value that is greater than the total amount of RAM available on the Dgraph nodes, the script issues a warning but continues to run.

If you do not specify a number for this property, the orchestration script uses the default value of 2048000 KB.

DGRAPH_OUT_FILE The path to the Dgraph's stdout/stderr file. The default value is ${BDD_HOME}/logs/dgraph.out.
DGRAPH_INDEX_DIR The path to the directory on the shared NFS in which the Dgraph index (named by DGRAPH_INDEX_NAME) will be located. The orchestration script will create this directory if it does not already exist.
The default value is /share/bdd_dgraph_index. If you are installing with an existing index, be sure to change the value of this property to the name of the directory the index is located in.
Important: If DGRAPH_INDEX_NAME is set to base, the orchestration script will delete any files in this location and replace them with an empty index.
DGRAPH_INDEX_NAME The name of the Dgraph index, which will be located in the directory defined by DGRAPH_INDEX_DIR. The default value is base.
Important: If you do not change this value, the orchestration script will delete all files in the DGRAPH_INDEX_DIR and create an empty index named base. Only use this value if you want to install with an empty index.

If you are installing with an existing index, move the index to the directory defined by DGRAPH_INDEX_DIR and change the value of this property to the name of the index you are using. If the index does not exist in the DGRAPH_INDEX_DIR location, the orchestration script will fail.

Do not include _indexes in the index's name. For example, if you have an index named product_indexes, you should only specify product.

DGRAPH_THREADS The number of threads the Dgraph starts with. There is no default value for this property, so you must provide one. Oracle recommends the following:
  • For machines running only the Dgraph, the number of threads should be equal to the number of CPU cores on the machine.
  • For machines running the Dgraph and other BDD components, the number of threads should be the number of CPU cores minus 2. For example, a machine with 4 cores should have 2 threads.

Be sure that the number you use is in compliance with the licensing agreement.

DGRAPH_CACHE The size of the Dgraph cache, in MB. There is no default value for this property, so you must provide one.

You only need to specify the number of MB to allocate to the cache. For example, a value of 50 sets the cache size to 50MB.

For enhanced performance, Oracle recommends allocating at least 50% of the node's available RAM to the Dgraph cache. If you later find that queries are getting cancelled because there is not enough available memory to process them, experiment with gradually decreasing this amount.

DGRAPH_WS_PORT The port number the Dgraph web service runs on. This number must be unique. The default value is 7010.
DGRAPH_BULKLOAD_PORT The port on which the Dgraph listens for bulk load ingest requests. This number must be unique. The default value is 7019.
COORDINATOR_INDEX The index of the Dgraph cluster in the ZooKeeper ensemble. ZooKeeper uses this value to identify the cluster. The default value is cluster1.

Note that this property is not related to the Dgraph index.

DGRAPH_ADDITIONAL_ARG Defines one or more flags to start the Dgraph with. More information on Dgraph flags is available in the Oracle Big Data Discovery Administrator's Guide.
Note: This property is only intended for use by Oracle Support. Do not provide a value for this property when installing BDD.
AGENT_PORT The port on which the HDFS Agent listens for HTTP requests. This number must be unique. The default value is 7102.
AGENT_EXPORT_PORT The port on which the HDFS Agent listens for requests from the Dgraph. This number must be unique. The default value is 7101.
AGENT_OUT_FILE The path to the HDFS Agent's stdout/stderr file. The default value is ${BDD_HOME}/logs/dgraphHDFSAgent.out.
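
To make the sizing guidance concrete: on a hypothetical Dgraph node with 8 CPU cores and 64GB of RAM that also hosts other BDD components, the recommendations above work out to DGRAPH_THREADS=6 (8 cores minus 2) and DGRAPH_CACHE=32768 (half of 64GB, expressed in MB). A sketch of this section for two hypothetical Dgraph hosts, again assuming NAME=value syntax and keeping the documented defaults elsewhere, might look like this:

DGRAPH_SERVERS=dgraph01.example.com,dgraph02.example.com
DGRAPH_CPU_CORES=2
DGRAPH_RAM_SIZE=2048000
DGRAPH_OUT_FILE=${BDD_HOME}/logs/dgraph.out
DGRAPH_INDEX_DIR=/share/bdd_dgraph_index
# base creates an empty index and removes any existing files in DGRAPH_INDEX_DIR
DGRAPH_INDEX_NAME=base
DGRAPH_THREADS=6
DGRAPH_CACHE=32768
DGRAPH_WS_PORT=7010
DGRAPH_BULKLOAD_PORT=7019
COORDINATOR_INDEX=cluster1
AGENT_PORT=7102
AGENT_EXPORT_PORT=7101
AGENT_OUT_FILE=${BDD_HOME}/logs/dgraphHDFSAgent.out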

Data Processing settings

The seventh section in bdd.conf configures Data Processing and the Hive Table Detector.

Configuration property Description and possible settings
HDFS_DP_USER_DIR The location within the HDFS /user directory that stores the Avro files created when users export data from BDD. The orchestration script will create this directory if it does not already exist. The name of this directory must not include spaces.

The default value is bdd.

ENABLE_HIVE_TABLE_DETECTOR Enables and disables the Hive Table Detector. When set to TRUE, the Hive Table Detector runs automatically on the server defined by DETECTOR_SERVER. When set to FALSE, the Hive Table Detector does not run. The default value is FALSE.
DETECTOR_SERVER The fully qualified hostname of the server the Hive Table Detector runs on. This must be one of the WebLogic Managed Servers. The default value is ${ADMIN_SERVER}.

If you are installing on a single machine, this property should be set to ${ADMIN_SERVER}, or the orchestration script will fail.

DETECTOR_HIVE_DATABASE The name of the Hive database that the Hive Table Detector monitors.

The default value is default. This is the same as the default value of HIVE_DATABASE_NAME, which is used by Studio and the CLI. It is possible to use different databases for these two properties, but for a first-time installation it is recommended that you use the same one.

DETECTOR_MAXIMUM_WAIT_TIME The maximum amount of time (in seconds) that the Hive Table Detector waits between update jobs. The default value is 1800.
DETECTOR_SCHEDULE A cron-format schedule that specifies how often the Hive Table Detector runs. The value must be enclosed in quotes. The default value is "0 0 * * *", which runs the Hive Table Detector at midnight every day.
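
A sketch of this section with the Hive Table Detector enabled on the Admin Server, using the documented defaults elsewhere, might look like the following; the quoted schedule fields follow the standard cron order (minute, hour, day of month, month, day of week):

HDFS_DP_USER_DIR=bdd
ENABLE_HIVE_TABLE_DETECTOR=TRUE
DETECTOR_SERVER=${ADMIN_SERVER}
DETECTOR_HIVE_DATABASE=default
DETECTOR_MAXIMUM_WAIT_TIME=1800
DETECTOR_SCHEDULE="0 0 * * *"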

CLI settings

The final section in bdd.conf configures the CLI. These properties are used in both Studio and Data Processing.

Configuration property Description and possible settings
ENABLE_ENRICHMENTS Determines whether data enrichments are run during the sampling phase of data processing. This setting controls the Language Detection, Term Extraction, Geocoding Address, Geocoding IP, and Reverse Geotagger modules.

When set to true, all of the data enrichments run, and when set to false, none of them run. The default value is true.

For more information on data enrichments, see the Data Processing Guide.

JAVA_PATH The path to the Java binaries within the Java installation, which should be in the same location on each server in the cluster. The default value is ${JAVA_HOME}/bin/java.

Note that this property is not the same as JAVA_HOME.

MAX_RECORDS The maximum number of records included in a data set. For example, if a Hive table has 1,000,000 records, you could restrict the total number of sampled records to 100,000.

Note that the actual number of records in each data set will sometimes be slightly more than or slightly less than the value of MAX_RECORDS.

The default value is 1000000.

SPARK_EXECUTOR_MEMORY The amount of memory that Data Processing jobs request from the Spark worker nodes. The default value is 48g. You should increase this value if you plan on processing very large Hive tables.

The value of this property must be equal to or less than the value of Spark's Total Java Heap Sizes of Worker's Executors in Bytes (executor_total_max_heapsize) property. To access the executor_total_max_heapsize property, open Cloudera Manager and select Clusters > Spark (Standalone), click the Configuration tab, and select the Worker Default Group category.

SANDBOX_PATH The path to the HDFS directory in which the Avro files created when users export data from BDD are stored. The default value is /user/${HDFS_DP_USER_DIR}.
LANGUAGE Specifies either a supported ISO-639 language code (en, de, fr, etc.) or a value of unknown to set the language property for all attributes in the data set. This controls whether Oracle Language Technology (OLT) libraries are invoked during indexing.

A language code requires more processing but produces better processing and indexing results by using OLT libraries for the specified language. If the value is unknown, the processing time is faster but the processing and indexing results are more generic and OLT is not invoked.

The default value is unknown.

HIVE_DATABASE_NAME The name of the Hive database that stores the source data for Studio data sets. This is used by Studio as well as the CLI.

The default value is default. This is the same as the default value of DETECTOR_HIVE_DATABASE, which is used by the Hive Table Detector. It is possible to use different databases for these two properties, but for a first-time installation it is recommended that you use the same one.
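
Finally, a sketch of the CLI section that caps data sets at 100,000 records and otherwise keeps the documented defaults might look like the following, again assuming NAME=value syntax:

ENABLE_ENRICHMENTS=true
JAVA_PATH=${JAVA_HOME}/bin/java
MAX_RECORDS=100000
SPARK_EXECUTOR_MEMORY=48g
SANDBOX_PATH=/user/${HDFS_DP_USER_DIR}
LANGUAGE=unknown
HIVE_DATABASE_NAME=default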