Once you have created the installation source directory, you must configure your deployment by updating the bdd.conf file, which is located in the /BDD_deployer/installer directory.
You can edit the configuration file using any text editor. Be sure to save your changes before closing.
The following sections describe the properties in the configuration file and any requirements or restrictions they have. The configuration file itself also provides some of this information. Be sure to read the following sections carefully before modifying any properties.
The first section in bdd.conf configures global settings, which are relevant to all components and the installation and deployment process itself.
| Configuration property | Description |
|---|---|
| INSTALL_TYPE | Sets the installation type according to the hardware you're installing on. Note that this document does not cover BDA or Cloud installation; for information on installing on either platform, contact Oracle Customer Support. |
| CLUSTER_MODE | Determines whether you're deploying to a single machine or a cluster. Use TRUE (the default) if you're deploying to a cluster. Use FALSE if you're deploying to a single machine. When deploying to a single machine, also be sure that the MANAGED_SERVERS, DGRAPH_SERVERS, and DETECTOR_SERVER properties are set to ${ADMIN_SERVER}, or the orchestration script will fail. Note that this property only accepts UPPERCASE values. |
| FORCE | Determines whether the orchestration script removes files and directories left over from previous installations when it runs. When set to TRUE, the orchestration script removes any previous installations from the ORACLE_HOME directory; use this value if you're rerunning the script after a failed attempt. When set to FALSE (the default), the orchestration script does not remove any previous installations, and if one exists, the script will fail. Note that this property only accepts UPPERCASE values. |
| ORACLE_HOME | The path to the BDD root directory, where BDD will be installed on all nodes in the cluster. The orchestration script creates this directory, so it must not already exist. Important: You must ensure that this directory can be created on all nodes that BDD will be installed on, including CDH nodes that will host Data Processing. On the Admin Server and nodes that will host WebLogic Server, this directory must contain at least 6GB of free space; nodes that will host the Dgraph require 1GB of free space, and those that will host Data Processing require 2GB. The default value is /localdisk/Oracle/Middleware. |
| ORACLE_INV_PTR | The path to the Oracle inventory pointer file. This file can't be located in the ORACLE_HOME directory. If any other Oracle software products are installed on the machine, this file will already exist; update this value to point to that file. The default value is /localdisk/Oracle/oraInst.loc. |
| JAVA_HOME | The path to the JDK install directory. This must be the same on all BDD servers. Note that this property is not the same as the JAVA_PATH property. The default value is /usr/java/jdk1.7.0_67. |
| INSTALLER_PATH | The path to the installation source directory on the Admin Server (the location you moved the installation packages to). This directory must contain at least 6GB of free space. The default value is /localdisk/BDD_deployer/packages. |
| BDD_HOME | The path to the BDD install directory, which the orchestration script will create on all BDD servers. This directory must be inside ORACLE_HOME. The default value is ${ORACLE_HOME}/BDD1.0. |
| ENABLE_AUTOSTART | Determines whether the BDD cluster automatically restarts after its servers are rebooted. Note that this property only accepts UPPERCASE values. |
| TEMP_FOLDER_PATH | The temporary directory used on each node during the installation. On the Admin Server and nodes that will host WebLogic Server or the Dgraph, this directory must contain at least 10GB of free space; nodes that will host Data Processing require 3GB. The default value is /tmp. |
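Taken together, the global settings for a single-machine deployment might look like the following sketch. This is illustrative, not a complete section: the values shown are the documented defaults, except CLUSTER_MODE, which is switched to FALSE for a one-machine install.

```shell
# Global settings (illustrative; defaults from the table above)
CLUSTER_MODE=FALSE                              # single-machine deployment
FORCE=FALSE                                     # fail if a previous installation exists
ORACLE_HOME=/localdisk/Oracle/Middleware        # must not already exist
ORACLE_INV_PTR=/localdisk/Oracle/oraInst.loc
JAVA_HOME=/usr/java/jdk1.7.0_67
INSTALLER_PATH=/localdisk/BDD_deployer/packages
BDD_HOME=${ORACLE_HOME}/BDD1.0                  # must be inside ORACLE_HOME
TEMP_FOLDER_PATH=/tmp
```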
The third section in bdd.conf configures the WebLogic Server, including the Admin Server and all Managed Servers. It does not configure Studio or the Dgraph Gateway.
| Configuration property | Description and possible settings |
|---|---|
| WLS_START_MODE | Defines the mode WebLogic Server starts in. If set to prod (the default), WebLogic Server starts in production mode, which requires a username and password at startup. If set to dev, it starts in development mode, which does not; the orchestration script will still prompt you for a username and password at runtime, but these are not required when starting WebLogic Server. Note that this property only accepts lowercase values. |
| ADMIN_SERVER | The fully qualified hostname of the machine that will become the WebLogic Admin Server. This should be the machine you are currently working on. There is no default value, so you must provide one; the installation script will fail if this property is not set. |
| MANAGED_SERVERS | A comma-separated list of the fully qualified hostnames of the WebLogic Managed Servers (the servers that will run WebLogic, Studio, and the Dgraph Gateway). This list must include the hostname of the Admin Server and cannot contain duplicate values. If you're installing on a single machine, this property must be set to ${ADMIN_SERVER}, or the orchestration script will fail. |
| WEBLOGIC_DOMAIN_NAME | The name of the WebLogic domain, which Studio and the Dgraph Gateway run in. The default value is bdd_domain. |
| ADMIN_SERVER_PORT | The port number used by the Admin Server. This number must be unique. The default value is 7001. |
| MANAGED_SERVER_PORT | The port used by the Managed Server (i.e., Studio). This number must be unique. This property is still required if you are installing on a single server. The default value is 7003. |
| WLS_CPU_CORES | The minimum number of CPU cores required on each Managed Server. This property does not set the number of CPU cores WebLogic Server actually uses; instead, the orchestration script uses this value at runtime to check whether each Managed Server machine has the required number of cores. The value you enter should be less than or equal to the number of CPU cores available on the node; if you are unsure how many cores a node has, check its node file. If you enter a value greater than the total number of cores available on the node, the script issues a warning but continues to run. If you do not specify a value, the script uses the default of 4. |
| WLS_RAM_SIZE | The minimum amount of RAM (in KB) required on each Managed Server. This property does not set the RAM size WebLogic Server actually uses; instead, the orchestration script uses this value at runtime to check whether each Managed Server machine has the required amount of RAM. The value you enter (in KB) should be less than or equal to the total amount of RAM available on the node; if you are unsure how much RAM a node has, check its node file. If you enter a value greater than the total amount of RAM available on the node, the script issues a warning but continues to run. If you do not specify a value, the script uses the default of 2048000 KB. |
| WLS_SECURE_MODE | Enables and disables SSL for Studio's outward-facing ports. This can be set to TRUE or FALSE. When set to TRUE (the default), the Studio instances on the Admin Server and the Managed Servers listen for requests on the ADMIN_SERVER_SECURE_PORT and MANAGED_SERVER_SECURE_PORT, respectively. Note that this property does not enable SSL for any other BDD components. |
| ADMIN_SERVER_SECURE_PORT | The secure port on the Admin Server on which Studio listens when WLS_SECURE_MODE is set to TRUE. This number must be unique. Note that when SSL is enabled, Studio still listens on the non-secure ADMIN_SERVER_PORT for requests from the Dgraph Gateway. The default value is 7002. |
| MANAGED_SERVER_SECURE_PORT | The secure port on the Managed Server on which Studio listens when WLS_SECURE_MODE is set to TRUE. This number must be unique. Note that when SSL is enabled, Studio still listens on the non-secure MANAGED_SERVER_PORT for requests from the Dgraph Gateway. The default value is 7004. |
The second section in bdd.conf contains properties related to Cloudera Manager. The orchestration script uses the values you provide to query Cloudera Manager for information about the other CDH components, such as the URIs and names of their host servers.
| Configuration property | Description and possible settings |
|---|---|
| CM_HOST | The hostname of the server running Cloudera Manager. The default value is ${ADMIN_SERVER}. |
| CM_PORT | The port number used by the server running Cloudera Manager. The default value is 7180. |
| CM_CLUSTER_NAME | The name of the CDH cluster, which is listed in the Cloudera Manager. Be sure to replace any spaces in the cluster name with %20. The default value is Cluster%201. |
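For example, a CDH cluster named "Cluster 1" in Cloudera Manager would be configured with the space percent-encoded (the host and port shown are the documented defaults):

```shell
CM_HOST=${ADMIN_SERVER}
CM_PORT=7180
CM_CLUSTER_NAME=Cluster%201    # "Cluster 1", with the space replaced by %20
```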
The fourth section in bdd.conf configures the Dgraph Gateway.
| Configuration Property | Description and possible settings |
|---|---|
| ENDECA_SERVER_LOG_LEVEL | The log level used by the Dgraph Gateway. The default value is ERROR. More information on Dgraph Gateway log levels is available in the Oracle Big Data Discovery Administrator's Guide. |
The fifth section in bdd.conf configures Studio.
| Configuration property | Description and possible settings |
|---|---|
| SERVER_TIMEOUT | The timeout value (in milliseconds) used when responding to requests sent to all Dgraph Gateway web services except the Data Ingest Web Service. A value of 0 means there is no timeout. The default value is 300000. |
| SERVER_INGEST_TIMEOUT | The timeout value (in milliseconds) used when responding to requests sent to the Data Ingest Web Service. A value of 0 means there is no timeout. The default value is 1680000. |
| SERVER_HEALTHCHECK_TIMEOUT | The timeout value (in milliseconds) used when checking data source availability when connections are initialized. A value of 0 means there is no timeout. The default value is 10000. |
| STUDIO_JDBC_URL | The JDBC URL for the database, which enables Studio to connect to it. There are three templates for this property, but only one can be used; the remaining two must be commented out with a hash symbol (#). The first template is for MySQL 5.5.3 (or later) databases, which use the com.mysql.jdbc.Driver driver. This template is uncommented by default. If you are using a MySQL database, leave this template uncommented, make sure the other two are commented out, and update the URL as follows: jdbc:mysql://<database hostname>:<port number>/<database name>?useUnicode=true&characterEncoding=UTF-8&useFastDateParsing=false. The second template is for Oracle 11g or 12c databases, which use the oracle.jdbc.OracleDriver driver. If you are using an Oracle database, uncomment this template, comment out the other two, and update the URL as follows: jdbc:oracle:thin:@<database hostname>:<port number>:<database SID>. The third template is for Hypersonic databases, which use the org.hsqldb.jdbcDriver driver. Hypersonic is not supported for production environments, so you should only use this template if you are deploying to a demo environment. If you want the orchestration script to create a Hypersonic database for you, uncomment this template and comment out the other two; the script will create the database in the location defined by the URL. Note: BDD does not currently support database migration. After deployment, the only ways to change to a different database are to reconfigure the database itself or reinstall BDD. |
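Only one STUDIO_JDBC_URL template may be active at a time. A hypothetical MySQL configuration (the hostname, port, and database name below are placeholders) would leave the MySQL line uncommented and keep the other templates commented out:

```shell
# MySQL template (active) -- placeholder host/port/database:
STUDIO_JDBC_URL=jdbc:mysql://db.example.com:3306/bddstudio?useUnicode=true&characterEncoding=UTF-8&useFastDateParsing=false
# Oracle template (commented out) -- placeholder host/port/SID:
#STUDIO_JDBC_URL=jdbc:oracle:thin:@db.example.com:1521:orcl
```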
The sixth section in bdd.conf configures the Dgraph and HDFS Agent.
| Configuration property | Description and possible settings |
|---|---|
| DGRAPH_SERVERS | A comma-separated list of the fully qualified hostnames of all Dgraph nodes in the cluster. The orchestration script will install and deploy the Dgraph to these nodes. This list cannot contain duplicate values. Additionally, because Oracle does not recommend cohosting the Dgraph with Spark, this list should not contain hostnames of Spark nodes. If you're installing on a single machine, this property must be set to ${ADMIN_SERVER}, or the orchestration script will fail. |
| DGRAPH_CPU_CORES | The minimum number of CPU cores required on nodes hosting the Dgraph and the HDFS Agent. This property does not set the number of cores the Dgraph actually uses; instead, the orchestration script uses this value at runtime to check whether each machine has the required number of cores. The value you enter should be less than or equal to the number of cores available on the machine. If you enter a value greater than the total number of cores available on the Dgraph nodes, the script issues a warning but continues to run. If you do not specify a value, the orchestration script uses the default of 2 cores. |
| DGRAPH_RAM_SIZE | The minimum amount of RAM (in KB) required on nodes hosting the Dgraph and the HDFS Agent. This property does not set the RAM size the Dgraph actually uses; instead, the orchestration script uses this value at runtime to check whether each machine has the required amount of RAM. The value you enter should be less than or equal to the total amount of RAM available on the node. If you enter a value greater than the total amount of RAM available on the Dgraph nodes, the script issues a warning but continues to run. If you do not specify a value, the orchestration script uses the default of 2048000 KB. |
| DGRAPH_OUT_FILE | The path to the Dgraph's stdout/stderr file. The default value is ${BDD_HOME}/logs/dgraph.out. |
| DGRAPH_INDEX_DIR | The path to the directory on the shared NFS that will contain the Dgraph index (whose name is defined by DGRAPH_INDEX_NAME). The orchestration script will create this directory if it does not already exist. If you are installing with an existing index, be sure to change the value of this property to the directory the index is located in. Important: If DGRAPH_INDEX_NAME is set to base, the orchestration script will delete any files in this location and replace them with an empty index. The default value is /share/bdd_dgraph_index. |
| DGRAPH_INDEX_NAME | The name of the Dgraph index, which will be located in the directory defined by DGRAPH_INDEX_DIR. The default value is base. Important: If you do not change this value, the orchestration script will delete all files in the DGRAPH_INDEX_DIR and create an empty index named base. Only use this value if you want to install with an empty index. If you are installing with an existing index, move the index to the directory defined by DGRAPH_INDEX_DIR and change the value of this property to the name of the index you are using. If the index does not exist in the DGRAPH_INDEX_DIR location, the orchestration script will fail. Do not include _indexes in the index's name; for example, if you have an index named product_indexes, you should only specify product. |
| DGRAPH_THREADS | The number of threads the Dgraph starts with. There is no default value for this property, so you must provide one. Be sure that the number you use complies with your licensing agreement. |
| DGRAPH_CACHE | The size of the Dgraph cache, in MB. There is no default value for this property, so you must provide one. Specify only the number of MB to allocate; for example, a value of 50 sets the cache size to 50MB. For enhanced performance, Oracle recommends allocating at least 50% of the node's available RAM to the Dgraph cache. If you later find that queries are being cancelled because there is not enough available memory to process them, experiment with gradually decreasing this amount. |
| DGRAPH_WS_PORT | The port number the Dgraph web service runs on. This number must be unique. The default value is 7010. |
| DGRAPH_BULKLOAD_PORT | The port on which the Dgraph listens for bulk load ingest requests. This number must be unique. The default value is 7019. |
| COORDINATOR_INDEX | The index of the Dgraph cluster in the ZooKeeper ensemble, which ZooKeeper uses to identify the cluster. Note that this property is not related to the Dgraph index. The default value is cluster1. |
| DGRAPH_ADDITIONAL_ARG | Defines one or more flags to start the Dgraph with. More information on Dgraph flags is available in the Oracle Big Data Discovery Administrator's Guide. Note: This property is only intended for use by Oracle Support. Do not provide a value for this property when installing BDD. |
| AGENT_PORT | The port on which the HDFS Agent listens for HTTP requests. This number must be unique. The default value is 7102. |
| AGENT_EXPORT_PORT | The port on which the HDFS Agent listens for requests from the Dgraph. This number must be unique. The default value is 7101. |
| AGENT_OUT_FILE | The path to the HDFS Agent's stdout/stderr file. The default value is ${BDD_HOME}/logs/dgraphHDFSAgent.out. |
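Pulling the documented defaults together, a single-machine Dgraph section might look like this sketch. DGRAPH_THREADS and DGRAPH_CACHE have no defaults, so the values shown for them are placeholders you must size for your own hardware and license.

```shell
DGRAPH_SERVERS=${ADMIN_SERVER}       # required for a single-machine install
DGRAPH_CPU_CORES=2
DGRAPH_RAM_SIZE=2048000              # KB
DGRAPH_OUT_FILE=${BDD_HOME}/logs/dgraph.out
DGRAPH_INDEX_DIR=/share/bdd_dgraph_index
DGRAPH_INDEX_NAME=base               # empty index; existing files here are deleted
DGRAPH_THREADS=2                     # placeholder -- no default; check your license
DGRAPH_CACHE=4096                    # placeholder MB -- Oracle suggests >= 50% of node RAM
DGRAPH_WS_PORT=7010
DGRAPH_BULKLOAD_PORT=7019
COORDINATOR_INDEX=cluster1
AGENT_PORT=7102
AGENT_EXPORT_PORT=7101
AGENT_OUT_FILE=${BDD_HOME}/logs/dgraphHDFSAgent.out
```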
The seventh section in bdd.conf configures Data Processing and the Hive Table Detector.
| Configuration property | Description and possible settings |
|---|---|
| HDFS_DP_USER_DIR | The location within the HDFS /user directory that stores the Avro files created when users export data from BDD. The orchestration script will create this directory if it does not already exist. The name of this directory must not include spaces. The default value is bdd. |
| ENABLE_HIVE_TABLE_DETECTOR | Enables and disables the Hive Table Detector. When set to TRUE, the Hive Table Detector runs automatically on the server defined by DETECTOR_SERVER. When set to FALSE, the Hive Table Detector is not created. The default value is FALSE. |
| DETECTOR_SERVER | The fully qualified hostname of the server the Hive Table Detector runs on. This must be one of the WebLogic Managed Servers. If you are installing on a single machine, this property must be set to ${ADMIN_SERVER}, or the orchestration script will fail. The default value is ${ADMIN_SERVER}. |
| DETECTOR_HIVE_DATABASE | The name of the Hive database that the Hive Table Detector monitors. The default value is default, which is the same as the default value of HIVE_DATABASE_NAME (used by Studio and the CLI). It is possible to use different databases for these properties, but for a first-time installation it is recommended that you start with one. |
| DETECTOR_MAXIMUM_WAIT_TIME | The maximum amount of time (in seconds) that the Hive Table Detector waits between update jobs. The default value is 1800. |
| DETECTOR_SCHEDULE | A Cron format schedule that specifies how often the Hive Table Detector runs. This must be enclosed in quotes. The default value is "0 0 * * *", which means the Hive Table Detector runs at midnight, every day of every month. |
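The DETECTOR_SCHEDULE value uses the standard five-field cron layout (minute, hour, day of month, month, day of week). A few illustrative schedules, with only the default left active:

```shell
DETECTOR_SCHEDULE="0 0 * * *"     # midnight every day (the default)
#DETECTOR_SCHEDULE="0 */6 * * *"  # every six hours, on the hour
#DETECTOR_SCHEDULE="30 2 * * 0"   # 02:30 every Sunday
```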
The final section in bdd.conf configures the CLI. These properties are used in both Studio and Data Processing.
| Configuration property | Description and possible settings |
|---|---|
| ENABLE_ENRICHMENTS | Determines whether data enrichments are run during the sampling phase of data processing. This setting controls the Language Detection, Term Extraction, Geocoding Address, Geocoding IP, and Reverse Geotagger modules. When set to true, all of the data enrichments run; when set to false, none of them run. The default value is true. For more information on data enrichments, see the Data Processing Guide. |
| JAVA_PATH | The path to the Java binaries within the Java installation, which should be in the same location on each server in the cluster. Note that this property is not the same as JAVA_HOME. The default value is ${JAVA_HOME}/bin/java. |
| MAX_RECORDS | The maximum number of records included in a data set. For example, if a Hive table has 1,000,000 records, you could restrict the total number of sampled records to 100,000. Note that the actual number of records in each data set may be slightly more or less than the value of MAX_RECORDS. The default value is 1000000. |
| SPARK_EXECUTOR_MEMORY | The amount of memory that Data Processing jobs request from the Spark worker nodes. Increase this value if you plan on processing very large Hive tables. The value of this property must be less than or equal to the value of Spark's Total Java Heap Sizes of Worker's Executors in Bytes (executor_total_max_heapsize) property. To access this property, open Cloudera Manager, select Clusters > Spark (Standalone), click the Configuration tab, and select the Worker Default Group category. The default value is 48g. |
| SANDBOX_PATH | The path to the HDFS directory in which the Avro files created when users export data from BDD are stored. The default value is /user/${HDFS_DP_USER_DIR}. |
| LANGUAGE | Specifies either a supported ISO-639 language code (en, de, fr, etc.) or a value of unknown to set the language property for all attributes in the data set. This controls whether Oracle Language Technology (OLT) libraries are invoked during indexing. A language code requires more processing time but produces better processing and indexing results by using the OLT libraries for the specified language. If the value is unknown, processing is faster, but the results are more generic and OLT is not invoked. The default value is unknown. |
| HIVE_DATABASE_NAME | The name of the Hive database that stores the source data for Studio data sets. This is used by Studio as well as the CLI. The default value is default, which is the same as the default value of DETECTOR_HIVE_DATABASE (used by the Hive Table Detector). It is possible to use different databases for these properties, but for a first-time installation it is recommended that you start with one. |
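As an illustration, a CLI section left at its documented defaults would read:

```shell
ENABLE_ENRICHMENTS=true              # note: lowercase, unlike the UPPERCASE TRUE/FALSE properties
JAVA_PATH=${JAVA_HOME}/bin/java
MAX_RECORDS=1000000
SPARK_EXECUTOR_MEMORY=48g
SANDBOX_PATH=/user/${HDFS_DP_USER_DIR}
LANGUAGE=unknown
HIVE_DATABASE_NAME=default
```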