The text that an Oracle Commerce Guided Search application displays to its users is stored in memory as char arrays, C strings, Pascal strings, or other data structures. Before the application can display the text, it must convert it into a format that renders correctly and legibly. This conversion process is known as encoding.
Any process that reads and writes data must both encode and decode it. In particular, data must be encoded or decoded during I/O operations.
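Encoding and decoding can be sketched with the JDK's standard charset support. This is an illustrative example (the sample string is not from the source document):

```java
import java.nio.charset.StandardCharsets;

public class EncodeDecodeSketch {
    public static void main(String[] args) {
        String text = "café";  // illustrative text containing a non-ASCII character

        // Encoding: convert chars to bytes using an explicit charset.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // Decoding: convert the bytes back to chars with the same charset.
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);

        // "é" occupies two bytes in UTF-8, so the byte count exceeds the char count.
        System.out.println(utf8Bytes.length);      // 5
        System.out.println(decoded.equals(text));  // true
    }
}
```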
Choosing the right encoding for text can minimize loss of information and ensure that your application renders text correctly and legibly.
Unless you have a reason to use another encoding, choose UTF-8.
Note
Use the same encoding across all of your Endeca data processing/indexing components.
When to Use Encodings Other Than UTF-8
Use encodings other than UTF-8 only for reasons such as the following:
Your data is in Hindi, Arabic, Chinese, Japanese, Korean, or another language for which UTF-8 is not a suitable (or even a possible) encoding. For example, some Korean glyphs are not supported by Unicode.
Encodings such as EUC, Shift JIS, HZ, and GB2312 have lower memory and conversion costs than UTF-8 for Chinese, Japanese, and Korean, as well as for certain cell phones.
Encodings other than UTF-8 can reduce consumption of disk space for Chinese, Japanese, and Korean languages.
You need to debug the indexing process using editors that support only EUC or Shift JIS.
Know the Encoding of Your Source Data
Make sure you know (or can determine) the encoding of all of your source data. Note the following:
Web pages from web crawls can be in any of a wide variety of encodings.
Some applications encode text in CP1252 and variants of the ISO-8859 encodings.
Some documents are stored in encodings other than the ones that they declare; for example, web pages that declare their charset to be UTF-8 may in fact have been saved in ISO-8859-1 or CP1252.
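One way to guard against mis-declared charsets is to attempt a strict UTF-8 decode and fall back to a legacy encoding when the bytes are not valid UTF-8. The following is a sketch using only JDK classes; the choice of windows-1252 as the fallback is an assumption you would tune for your own data:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackSketch {

    /** Decode as UTF-8 if the bytes are valid UTF-8; otherwise assume windows-1252. */
    static String decodeWithFallback(byte[] bytes) {
        try {
            // REPORT makes malformed sequences raise an exception instead of
            // being silently replaced with U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back to a single-byte legacy encoding.
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "café".getBytes(StandardCharsets.UTF_8);
        byte[] cp1252 = { 'c', 'a', 'f', (byte) 0xE9 };  // "café" in CP1252

        System.out.println(decodeWithFallback(utf8));    // café
        System.out.println(decodeWithFallback(cp1252));  // café
    }
}
```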
Note
Make sure that all input sources, such as CAS, encode any text that they read from external sources using the same encoding that the external sources use for the text.
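Converting an external source to your pipeline's common encoding amounts to a transcoding pass: read with the source's encoding, write with the target encoding. A minimal sketch using only JDK classes, with hypothetical file names and ISO-8859-1 assumed as the source encoding:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical file names; the source is assumed to be ISO-8859-1.
        File source = new File("source-latin1.txt");
        File target = new File("target-utf8.txt");

        // Write a sample file in the source encoding for the demonstration.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(source), StandardCharsets.ISO_8859_1)) {
            w.write("café");
        }

        // Read with the source's encoding, write with UTF-8: a transcoding pass.
        try (Reader in = new InputStreamReader(
                     new FileInputStream(source), StandardCharsets.ISO_8859_1);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream(target), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }

        // The UTF-8 file holds the same characters in a different byte form.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(target), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine());  // café
        }
    }
}
```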
You must specify the encoding for characters displayed in your application's user interface through a Java Manipulator component of the Forge pipeline.
Java Manipulators
In a Java Manipulator, you can specify Java routines that read your source data as UTF-8, as follows:
File f = new File(fileName);
FileInputStream fis = new FileInputStream(f);
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
BufferedReader r = new BufferedReader(isr);
Guided Search saves characters as UTF-8 by default.
For detailed information about how to create and configure Java Manipulators, refer to the Developer Studio Online Help.
You must ensure that search terms are properly encoded when users of your Guided Search application submit them through a form.
Specify the encoding (such as UTF-8) for search terms in the following calls to the Presentation API:
Statements that retrieve information from the HttpServletRequest object
Statements that construct URL query strings
Statements that create queries to the MDEX Engine
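For the HttpServletRequest case, calling the servlet API's request.setCharacterEncoding("UTF-8") before reading any parameters is the usual approach. For constructing URL query strings, the JDK's URLEncoder percent-encodes the term; a sketch with an illustrative search term:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryStringSketch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String searchTerm = "vin rosé";  // illustrative term with a non-ASCII character

        // Percent-encode the term as UTF-8 before placing it in a query string.
        String query = "search.jsp?term=" + URLEncoder.encode(searchTerm, "UTF-8");

        System.out.println(query);  // search.jsp?term=vin+ros%C3%A9
    }
}
```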
For information about how to invoke the Presentation API to create and manage queries, refer to the MDEX Engine Development Guide.
During indexing, text is normalized to NFC (Normalization Form Composition); that is, equivalent sequences of characters are converted to the same sequence of code points. For best recall, be sure to normalize your search terms to NFC before they are used in queries.
To normalize text, use a Normalizer object such as the one provided with the IBM International Components for Unicode (ICU) library:
import com.ibm.icu.text.Normalizer;
String nfc = Normalizer.normalize(searchTerms, Normalizer.NFC);
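If the ICU library is not available, the JDK's built-in java.text.Normalizer provides the same NFC normalization. A sketch with an illustrative decomposed string:

```java
import java.text.Normalizer;

public class NfcSketch {
    public static void main(String[] args) {
        // "é" written as a base letter plus a combining acute accent (decomposed form).
        String decomposed = "cafe\u0301";

        // NFC composes the pair into the single precomposed code point U+00E9.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(nfc.equals("caf\u00e9"));  // true
        System.out.println(nfc.length());             // 4 code units after composition
    }
}
```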
Uppercase characters in search terms are automatically mapped to lowercase characters. For example, searching for WINES is equivalent to searching for wines.
In some cases, uppercase characters can be converted to lowercase characters in more than one way, given a variety of local spelling conventions. For example, the German word FLUSS (river) can be converted either to fluss or to fluß.
You can pre-process the search terms in application code to conform to local spelling conventions before the search term is submitted.
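A minimal sketch of such pre-processing with locale-aware case folding; the choice of Locale.GERMAN is an assumption for a German-language application:

```java
import java.util.Locale;

public class CaseFoldSketch {
    public static void main(String[] args) {
        String term = "FLUSS";

        // Lowercase with an explicit locale so language-specific rules apply.
        String folded = term.toLowerCase(Locale.GERMAN);
        System.out.println(folded);  // fluss

        // The mapping is not unique in reverse: uppercasing "fluß" also yields
        // "FLUSS", because ß (U+00DF) uppercases to "SS".
        System.out.println("flu\u00df".toUpperCase(Locale.GERMAN));  // FLUSS
    }
}
```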
In each HTML page that your application displays, you must specify the correct character encoding using a Content-Type META tag. In addition, any links in the page must encode their query strings properly.
The following example illustrates how to specify the character encoding for an HTML page, and how to encode a link's query string using the Java URLEncoder class:
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<a href="search.jsp?term=<%=URLEncoder.encode(searchTerm,"UTF-8") %>">