It is important to have a clear understanding of how the Data Foundry works with source records before you begin building your instance configuration. Read the following sections for a behind-the-scenes look at the data processing and indexing functions in the Data Foundry.
The data processing workflow in the Data Foundry is defined in your pipeline and typically follows a specific path.
The Forge and Dgidx programs do the actual data processing, but the components you have defined in the pipeline dictate which tasks are performed and when. The Data Foundry attempts to utilize all of the hardware resources available to it, both by processing records in multiple components simultaneously, and by processing multiple records simultaneously within the same component.
The data processing workflow typically follows this path:
Standardize each source record’s properties and property values to create consistency across records.
Map the source record’s properties into Guided Search properties and/or dimensions.
Write the tagged Guided Search records, along with any dimension hierarchy and index configuration, as finished data that is ready for indexing.
Index the finished data and create the proprietary indices used by the MDEX Engine.
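The four steps above can be sketched in miniature. This is only an illustration of the flow, not the actual Forge/Dgidx implementation; all function names here (standardize, map_properties, build_index) are hypothetical stand-ins.

```python
def standardize(record):
    # 1. Normalize property names and trim values (placeholder for real cleansing).
    return {k.strip().lower(): v.strip() for k, v in record.items()}

def map_properties(record, mapping):
    # 2. Rename source properties to Guided Search property/dimension names.
    return {mapping.get(k, k): v for k, v in record.items()}

def build_index(finished, dimensions):
    # 4. Stand-in for indexing: group record positions by dimension value.
    index = {}
    for i, rec in enumerate(finished):
        for dim in dimensions:
            index.setdefault((dim, rec.get(dim)), []).append(i)
    return index

def process_source_records(source_records, mapping, dimensions):
    # 3. Standardize and map each record, collect the finished data, then index it.
    finished = [map_properties(standardize(r), mapping) for r in source_records]
    return build_index(finished, dimensions)
```

In the real system, the finished data and index are written to disk for the MDEX Engine rather than held in memory.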
You can load source data from a variety of formats using the Content Acquisition System components.
Your Guided Search applications will most often read data directly from one or more database systems, or from database extracts. Input components load records in a variety of formats including delimited, JDBC, and XML. Each input component has its own set of configuration properties. One of the most commonly used types of input component loads data stored in delimited format.
Source data may be loaded into the Data Foundry from a variety of formats. The easiest format to use is a two-dimensional format similar to the tables found in database management systems.
Database tables are organized into rows of records, with columns that represent the source properties and property values for each record. The illustration below shows a simple example of source data in a two-dimensional format.
You specify the location and format of the source data to be loaded in the pipeline. Forge loads and processes one source record at a time, in sequential order. When Forge loads a source record, it transforms the record into a series of property/property value pairs.
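As a rough sketch of that transformation, consider a small delimited source file (the data values here are invented for illustration); reading one row at a time and pairing each value with its column name yields the property/property value pairs described above:

```python
import csv
import io

# Hypothetical pipe-delimited source data: one header row, then one record per row.
raw = "id|name|flavor\n1|Mint Tea|mint\n2|Ginger Tea|ginger\n"

# Each row becomes a sequence of (property, property value) pairs.
reader = csv.DictReader(io.StringIO(raw), delimiter="|")
records = [list(row.items()) for row in reader]
# records[0] == [("id", "1"), ("name", "Mint Tea"), ("flavor", "mint")]
```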
You specify any standardization of source properties and property values in the pipeline. Standardization cleanses the data so that it is as consistent as possible before mapping begins.
You can take the following steps to standardize your data:
Note
The functionality described below supports limited data cleansing. If you have an existing data cleansing infrastructure, it may be more advantageous to use that facility instead.
Fix misspellings in your source properties and property values.
Explicitly specify the encoding type (e.g., UTF-8, CP-1252, or Latin-1) of the source data when Forge reads it into a pipeline. If you are loading text-based source data in a Record Adapter, you specify the encoding type in the Encoding field of the General tab. If an incorrect encoding is specified, Forge generates warnings about any characters that are invalid in the specified encoding. For example, in the ASCII encoding, any character with a value above 127 is considered invalid. Invalid characters are replaced with strings prefixed by %X, so the invalid characters themselves are not loaded into Forge.
Remove unsupported characters.
The only legal Unicode characters are U+09, U+0D, U+0A, U+20-U+7E, U+85, U+A0-U+D7FF, and U+E000-U+FFFD. In particular, source data should not contain Unicode characters from the range 0x00 through 0x1F with the exceptions of 0x09 (tab), 0x0A (newline), and 0x0D (carriage return). For example, records based on databases may use 0x00 (null) as a default empty value. Other characters that are often in existing database sources are 0x1C (field separator), 0x1E (record separator), and 0x1F (unit separator).
If a data source contains additional control characters as defined by the chosen encoding, remove or replace the control characters. For example, Windows-1252 specifies 0x7F-0x81, 0x8D-0x90, and 0x9D-0x9E as control characters, and Latin-1 specifies 0x7F-0x9F as control characters.
The following are some notes and suggestions for dealing with control characters:
The default input adapter encoding (LATIN-1) for delimited and vertical record input adapters in Forge makes the assumption, for throughput efficiency, that input data does not contain control characters (i.e. x00-x1F [except x09, x0A, x0D] and x7F-x9F).
For data sources that contain control characters because of character data in a non-Latin encoding (e.g., UTF-8 or Windows-1252), the recommended and best practice solution is to explicitly specify the encoding type (e.g., "UTF-8" or "Windows-1252").
For data sources that contain character data in more than one non-Latin encoding (e.g., a mixture of UTF-8 and Windows-1252), the recommended and best practice solution is to explicitly specify the more conservative encoding type (e.g., UTF-8).
For data sources where the data-cleanliness assumption is not satisfied because of real control characters (i.e., x00-x1F [except x09, x0A, x0D] and x7F), the recommended and best practice solution is to clean the data ahead of time to remove or replace those control characters. If data sources contain additional control characters as defined by the chosen encoding, these should also be removed or replaced.
Edit source property values to use a consistent format (for example, USA instead of United States or U.S.).
Re-assign similar source properties to one common property (for example, you could assign a Flavor1 property and a Flavor2 property to a generic Flavors property).
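A minimal sketch of the standardization steps above, written outside the pipeline: it strips control characters outside the legal Unicode ranges listed earlier, rewrites values to a consistent format, and re-assigns Flavor1/Flavor2-style properties to a common Flavors property. The value map and property names are assumptions for illustration, not part of any Data Foundry API.

```python
import re

# Control characters outside the legal ranges (tab, newline, and carriage
# return are kept; 0x85 is legal per the ranges above).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]")

# Assumed map of inconsistent value forms to one canonical form.
COUNTRY_FORMS = {"United States": "USA", "U.S.": "USA"}

def standardize(record):
    out = {}
    for prop, value in record.items():
        value = CONTROL_CHARS.sub("", value)     # remove unsupported characters
        value = COUNTRY_FORMS.get(value, value)  # consistent value format
        if re.fullmatch(r"Flavor\d+", prop):
            # Re-assign similar properties (Flavor1, Flavor2, ...) to one property.
            out.setdefault("Flavors", []).append(value)
        else:
            out[prop] = value
    return out
```

In practice this kind of cleansing is configured in the pipeline (or done in an existing data cleansing infrastructure, as the note above suggests) rather than hand-coded per record.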
After a source record has been standardized, Forge maps the record’s source properties to dimensions and Guided Search properties.
Mapping a source property to a dimension indicates that the record should be tagged with a dimension value ID from within that dimension. This enables navigation on the property.
Mapping a source property to a Guided Search property indicates that the property should be retained for display and search.
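The distinction between the two kinds of mapping can be sketched as follows. The dimension value IDs, dimension names, and property names here are all hypothetical; the real mapping is defined in the pipeline configuration.

```python
# Assumed dimension hierarchy: dimension name -> {value: dimension value ID}.
DIMENSIONS = {"flavor": {"mint": 101, "ginger": 102}}
# Source properties mapped to Guided Search properties (retained for display/search).
PROPERTIES = {"name", "description"}

def map_record(record):
    tagged = {"properties": {}, "dimension_value_ids": []}
    for prop, value in record.items():
        if prop in DIMENSIONS:
            # Dimension mapping: tag the record with a dimension value ID,
            # which enables navigation on this property.
            tagged["dimension_value_ids"].append(DIMENSIONS[prop][value])
        elif prop in PROPERTIES:
            # Property mapping: retain the value for display and search.
            tagged["properties"][prop] = value
    return tagged
```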