The text that an Oracle Commerce Guided Search application displays to its users is stored in memory as char arrays, C strings, Pascal strings, or other data structures. Before the application can display the text, it must convert it into a format that renders correctly and legibly. This conversion process is known as encoding.
Any process that reads and writes data must both encode and decode it. In particular, data must be encoded or decoded during I/O operations.
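Encoding and decoding can be sketched with the JDK's standard charset support. This is an illustrative example (the sample string is not from the source document):

```java
import java.nio.charset.StandardCharsets;

public class EncodeDecodeSketch {
    public static void main(String[] args) {
        String text = "café";  // illustrative text containing a non-ASCII character

        // Encoding: convert chars to bytes using an explicit charset.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // Decoding: convert the bytes back to chars with the same charset.
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);

        // "é" occupies two bytes in UTF-8, so the byte count exceeds the char count.
        System.out.println(utf8Bytes.length);      // 5
        System.out.println(decoded.equals(text));  // true
    }
}
```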
Choosing the right encoding for text can minimize loss of information and ensure that your application renders text correctly and legibly.
Unless you have a reason to use another encoding, choose UTF-8.
Note
Use the same encoding across all of your Endeca data processing/indexing components.
When to Use Encodings Other Than UTF-8
Use encodings other than UTF-8 only for reasons such as the following:
Your data is in Hindi, Arabic, Chinese, Japanese, Korean, or another language for which UTF-8 is not a suitable (or even a possible) encoding. For example, some Korean glyphs are not supported by Unicode.
Encodings such as EUC, Shift JIS, HZ, and GB2312 have lower memory and conversion costs than UTF-8 for Chinese, Japanese, and Korean, as well as for certain cell phones.
Encodings other than UTF-8 can reduce consumption of disk space for Chinese, Japanese, and Korean languages.
You need to debug the indexing process using editors that support only EUC or Shift JIS.
Know the Encoding of Your Source Data
Make sure you know (or can determine) the encoding of all of your source data. Note the following:
Web pages from web crawls can be in any of a wide variety of encodings.
Some applications encode text in CP1252 and variants of the ISO-8859 encodings.
Some documents are stored in encodings other than the ones that they declare; for example, web pages that declare their charset to be UTF-8 may in fact have been saved in ISO-8859-1 or CP1252.
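One way to guard against mis-declared charsets is to attempt a strict UTF-8 decode and fall back to a legacy encoding when the bytes are not valid UTF-8. The following is a sketch using only JDK classes; the choice of windows-1252 as the fallback is an assumption you would tune for your own data:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackSketch {

    /** Decode as UTF-8 if the bytes are valid UTF-8; otherwise assume windows-1252. */
    static String decodeWithFallback(byte[] bytes) {
        try {
            // REPORT makes malformed sequences raise an exception instead of
            // being silently replaced with U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back to a single-byte legacy encoding.
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "café".getBytes(StandardCharsets.UTF_8);
        byte[] cp1252 = { 'c', 'a', 'f', (byte) 0xE9 };  // "café" in CP1252

        System.out.println(decodeWithFallback(utf8));    // café
        System.out.println(decodeWithFallback(cp1252));  // café
    }
}
```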
Note
Make sure that all input sources, such as CAS, encode any text that they read from external sources using the same encoding that the external sources use for the text.
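Converting an external source to your pipeline's common encoding amounts to a transcoding pass: read with the source's encoding, write with the target encoding. A minimal sketch using only JDK classes, with hypothetical file names and ISO-8859-1 assumed as the source encoding:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical file names; the source is assumed to be ISO-8859-1.
        File source = new File("source-latin1.txt");
        File target = new File("target-utf8.txt");

        // Write a sample file in the source encoding for the demonstration.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(source), StandardCharsets.ISO_8859_1)) {
            w.write("café");
        }

        // Read with the source's encoding, write with UTF-8: a transcoding pass.
        try (Reader in = new InputStreamReader(
                     new FileInputStream(source), StandardCharsets.ISO_8859_1);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream(target), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }

        // The UTF-8 file holds the same characters in a different byte form.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(target), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine());  // café
        }
    }
}
```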
You must specify the encoding for characters displayed in your application's user interface through a Java Manipulator component of the Forge pipeline.
Java Manipulators
In a Java Manipulator, you can specify Java routines that read your source data as UTF-8, as follows:
File f = new File(fileName);
FileInputStream fis = new FileInputStream(f);
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
BufferedReader r = new BufferedReader(isr);
Guided Search saves characters as UTF-8 by default.
For detailed information about how to create and configure Java Manipulators, refer to the Developer Studio Online Help.
You must ensure that search terms are properly encoded when users of your Guided Search application submit them through a form.
Specify the encoding (such as UTF-8) for search terms in the following calls to the Presentation API:
Statements that retrieve information from the HttpServletRequest object
Statements that construct URL query strings
Statements that create queries to the MDEX Engine
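For the HttpServletRequest case, calling the servlet API's request.setCharacterEncoding("UTF-8") before reading any parameters is the usual approach. For constructing URL query strings, the JDK's URLEncoder percent-encodes the term; a sketch with an illustrative search term:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryStringSketch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String searchTerm = "vin rosé";  // illustrative term with a non-ASCII character

        // Percent-encode the term as UTF-8 before placing it in a query string.
        String query = "search.jsp?term=" + URLEncoder.encode(searchTerm, "UTF-8");

        System.out.println(query);  // search.jsp?term=vin+ros%C3%A9
    }
}
```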
For information about how to invoke the Presentation API to create and manage queries, refer to the MDEX Engine Development Guide.
During indexing, text is normalized to NFC (Normalization Form Composition); that is, equivalent sequences of characters are converted to the same sequence of code points. For best recall, be sure to normalize your search terms to NFC before they are used in queries.
To normalize text, use a Normalizer object such as the one provided with the IBM International Components for Unicode (ICU) library:
import com.ibm.icu.text.Normalizer;
String nfc = Normalizer.normalize(searchTerms, Normalizer.NFC);
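If the ICU library is not available, the JDK's built-in java.text.Normalizer provides the same NFC normalization. A sketch with an illustrative decomposed string:

```java
import java.text.Normalizer;

public class NfcSketch {
    public static void main(String[] args) {
        // "é" written as a base letter plus a combining acute accent (decomposed form).
        String decomposed = "cafe\u0301";

        // NFC composes the pair into the single precomposed code point U+00E9.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(nfc.equals("caf\u00e9"));  // true
        System.out.println(nfc.length());             // 4 code units after composition
    }
}
```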
Uppercase characters in search terms are automatically mapped to lowercase characters. For example, searching for WINES is equivalent to searching for wines.
In some cases, uppercase characters can be converted to lowercase characters in more than one way, given a variety of local spelling conventions. For example, the German word FLUSS (river) can be converted either to fluss or to fluß.
You can pre-process the search terms in application code to conform to local spelling conventions before the search term is submitted.
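A minimal sketch of such pre-processing with locale-aware case folding; the choice of Locale.GERMAN is an assumption for a German-language application:

```java
import java.util.Locale;

public class CaseFoldSketch {
    public static void main(String[] args) {
        String term = "FLUSS";

        // Lowercase with an explicit locale so language-specific rules apply.
        String folded = term.toLowerCase(Locale.GERMAN);
        System.out.println(folded);  // fluss

        // The mapping is not unique in reverse: uppercasing "fluß" also yields
        // "FLUSS", because ß (U+00DF) uppercases to "SS".
        System.out.println("flu\u00df".toUpperCase(Locale.GERMAN));  // FLUSS
    }
}
```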
In each HTML page that your application displays, you must specify the correct character encoding using a Content-Type META tag. In addition, any links in the page must encode their query strings properly.
The following example illustrates how to specify the character encoding for an HTML page, and how to encode a link's query string using the Java URLEncoder class:
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<a href="search.jsp?term=<%=URLEncoder.encode(searchTerm,"UTF-8") %>">