Oracle Commerce Guided Search - Data Ingest configuration with Forge

Data Ingest configuration with Forge

One or more Forge components are defined for baseline update processing and partial update processing, depending on the deployment type you choose.

If necessary, you can define a Forge cluster component to apply actions to an entire cluster of Forges, rather than manually iterating over a number of Forges. You could use this feature to run several instances of Forge in parallel to process large joins.

In addition, the Forge cluster object contains logic for executing Forges in parallel based on Forge groups, which are described below. Multiple Forge clusters can be defined, with no restriction on which Forges belong to each cluster or on how many clusters a Forge belongs to.

A Forge cluster is configured with references to all Forges that belong to that cluster. In addition, the cluster can be configured to copy data in parallel or serially through its getDataInParallel attribute. This setting applies to the copies that retrieve source data and configuration to each server that hosts a Forge component. By default, the template sets getDataInParallel to true.

<!--
########################################################################
# Forge Cluster
#
-->
<forge-cluster id="ForgeCluster" getDataInParallel="true">
  <forge ref="ForgeServer" />
  <forge ref="ForgeClient1" />
  <forge ref="ForgeClient2" />
</forge-cluster>
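
Because a Forge can belong to more than one cluster, a second cluster can reuse components from the first. The following is a minimal sketch rather than part of the template: the cluster id PartialForgeCluster is a hypothetical name, and setting getDataInParallel to "false" is assumed to select serial copying, as the description above implies.

```xml
<!--
########################################################################
# Hypothetical second Forge cluster (illustration only)
#
-->
<forge-cluster id="PartialForgeCluster" getDataInParallel="false">
  <!-- ForgeClient1 and ForgeClient2 also belong to ForgeCluster above -->
  <forge ref="ForgeClient1" />
  <forge ref="ForgeClient2" />
</forge-cluster>
```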

In addition to standard Forge configuration settings and process arguments, the Deployment Template uses several configurable properties and custom directories during processing:

numLogBackups - Number of log directory backups to store.
numStateBackups - Number of autogen state directory backups to store.
numPartialsBackups - Number of cumulative partials directory backups to store. Consider increasing the default value of 5, because the files in the Dgraph's updates directory are automatically deleted after partials are applied to the Dgraph. The number you choose depends on how often you run partial updates and how many copies you want to keep.
incomingDataHost - Host to which source data files are extracted.
incomingDataDir - Directory to which source data files are extracted.
incomingDataFileName - Filename of the source data files that are extracted.
configHost - Host from which configuration files and dimensions are retrieved for Forge to process.
configDir - Directory from which configuration files and dimensions are retrieved for Forge to process.
cumulativePartialsDir - Directory where partial updates are accumulated between baseline updates.
wsTempDir - Temporary Oracle Commerce Workbench directory to which post-Forge dimensions are copied before they are uploaded to the Workbench.
skipTestingForFilesDuringCleanup - Used for directory-cleaning operations. If set to "true", the cleanup skips the directory-contents test and proceeds directly to cleaning the directory. The default behavior is to test the directory contents and skip cleanup if the directory is not empty.
The properties documented in the "Fault tolerance and polling interval properties" topic.

This excerpt combines properties from both the baseline and partial update Forge to demonstrate the use of all of these configuration settings.

<properties>
  <property name="forgeGroup" value="A" />
  <property name="incomingDataHost">ITLHost</property>
  <property name="incomingDataFileName">project_name-part0-*</property>
  <property name="configHost">ITLHost</property>
  <property name="numStateBackups" value="10" />
  <property name="numLogBackups" value="10" />
  <property name="numPartialsBackups" value="5" />
  <property name="skipTestingForFilesDuringCleanup" value="true" />
</properties>
<directories>
  <directory name="incomingDataDir">./data/partials/incoming</directory>
  <directory name="configDir">./config/pipeline</directory>
  <directory name="cumulativePartialsDir">./data/partials/cumulative_partials</directory>
  <directory name="wsTempDir">./data/web_studio/temp</directory>
</directories>
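
As a sizing illustration for numPartialsBackups: if partial updates run once an hour and you want to keep two days of cumulative partials, 24 x 2 = 48 backups would be needed. This sketch assumes that each partial update run produces one backup; the schedule and retention period are hypothetical and should be replaced with your own.

```xml
<properties>
  <!-- Assumed schedule: hourly partial updates, 2 days of history: 24 * 2 = 48 -->
  <property name="numPartialsBackups" value="48" />
</properties>
```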

In addition to standard Forge configuration and process arguments, Forge processes add a custom property used to define which Forge processes run in parallel with each other when they belong to a Forge cluster.

forgeGroup - Indicates the Forge's membership in a Forge group. When the run method on a Forge cluster is executed, Forge processes within the same Forge group are run in parallel. Forge group values are arbitrary strings. The Forge cluster iterates through the groups in alphabetical order, though non-standard characters may result in groups being updated in an unexpected order.
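
To illustrate Forge groups with the three Forges referenced by ForgeCluster, the sketch below places ForgeClient1 and ForgeClient2 in group A and ForgeServer in group B. With this assignment, running the cluster would be expected to run both clients in parallel and to process the server's group afterwards, because A sorts before B. The id attribute on each forge element is an assumption (chosen to match the ref values used in the cluster), and all other settings of each Forge component are omitted.

```xml
<!-- Group A: these two Forges run in parallel with each other -->
<forge id="ForgeClient1">
  <properties>
    <property name="forgeGroup" value="A" />
  </properties>
</forge>
<forge id="ForgeClient2">
  <properties>
    <property name="forgeGroup" value="A" />
  </properties>
</forge>
<!-- Group B: processed after group A, because groups are visited alphabetically -->
<forge id="ForgeServer">
  <properties>
    <property name="forgeGroup" value="B" />
  </properties>
</forge>
```

Keeping group names to plain letters or digits avoids the ordering surprises that non-standard characters can cause.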

Defining indexers

If necessary, you can define a Dgidx cluster (an indexing cluster) to apply actions to an entire cluster of Dgidxs, rather than manually iterating over a number of Dgidxs. In addition, the indexing cluster object contains logic for executing Dgidxs in parallel based on Dgidx groups, which are described below. Multiple indexing clusters can be defined, with no restriction on which Dgidxs belong to each cluster or on how many clusters a Dgidx belongs to.

An indexing cluster is configured with references to all Dgidxs that belong to that cluster. In addition, the cluster can be configured to copy data in parallel or serially through its getDataInParallel attribute. This setting applies to the copies that retrieve source data and configuration to each server that hosts a Dgidx component. By default, the template sets getDataInParallel to true.

<!--
########################################################################
# Indexing Cluster
#
-->
<indexing-cluster id="IndexingCluster" getDataInParallel="true">
  <dgidx ref="Dgidx1" />
  <dgidx ref="Dgidx2" />
</indexing-cluster>

In addition to standard Dgidx configuration settings and process arguments, the Deployment Template uses several configurable properties and custom directories during processing:

numLogBackups - Number of log directory backups to store.
numIndexBackups - Number of index backups to store.
incomingDataHost - Host to which source data files are extracted.
incomingDataDir - Directory to which source data files are extracted.
incomingDataFileName - Filename of the source data files that are extracted.
configHost - Host from which configuration files and dimensions are retrieved for Dgidx to process.
configDir - Directory from which configuration files and dimensions are retrieved for Dgidx to process.
configFileName - Filename of the configuration files and dimensions that are retrieved for Dgidx to process.
skipTestingForFilesDuringCleanup - Used for directory-cleaning operations. If set to "true", the cleanup skips the directory-contents test and proceeds directly to cleaning the directory. The default behavior is to test the directory contents and skip cleanup if the directory is not empty.
The properties documented in the "Fault tolerance and polling interval properties" topic.
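
The Dgidx settings follow the same pattern as the Forge excerpt shown earlier. The following is a minimal sketch: the host name ITLHost is reused from the Forge example, the directory paths are hypothetical, and "false" is assumed to correspond to the default cleanup behavior described above.

```xml
<properties>
  <property name="incomingDataHost">ITLHost</property>
  <property name="configHost">ITLHost</property>
  <property name="numLogBackups" value="10" />
  <!-- Assumed to match the default: test directory contents before cleanup -->
  <property name="skipTestingForFilesDuringCleanup" value="false" />
</properties>
<directories>
  <!-- Hypothetical paths; point these at your Forge output location -->
  <directory name="incomingDataDir">./data/forge_output</directory>
  <directory name="configDir">./data/forge_output</directory>
</directories>
```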

In addition to standard Dgidx configuration and process arguments, Dgidx processes add a custom property used to define which Dgidx processes run in parallel with each other when they belong to an indexing cluster.

dgidxGroup - Indicates the Dgidx's membership in a Dgidx group. When the run method on an indexing cluster is executed, Dgidx processes within the same Dgidx group are run in parallel. Dgidx group values are arbitrary strings. The indexing cluster iterates through the groups in alphabetical order, though non-standard characters may result in groups being updated in an unexpected order.
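
As an illustration of Dgidx groups, the sketch below assigns Dgidx1 and Dgidx2, the two components referenced by IndexingCluster, to different groups so that the cluster processes them one after the other. The id attribute on each dgidx element is an assumption (chosen to match the ref values used in the cluster), and all other settings of each Dgidx component are omitted.

```xml
<!-- Group A: visited first when the indexing cluster runs -->
<dgidx id="Dgidx1">
  <properties>
    <property name="dgidxGroup" value="A" />
  </properties>
</dgidx>
<!-- Group B: visited after group A -->
<dgidx id="Dgidx2">
  <properties>
    <property name="dgidxGroup" value="B" />
  </properties>
</dgidx>
```

Giving both components the same dgidxGroup value would instead run them in parallel.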

Guided Search Administrator's Guide