Oracle® Content Database Administrator's Guide for Oracle WebCenter Suite 10g (10.1.3.2) Part Number B32191-01
Oracle Content DB globalization support lets users store and search documents of heterogeneous character sets and languages in a single Oracle Content DB instance. The globalization infrastructure ensures that the resource strings, error messages, sort order, date, time, numeric, and calendar conventions adapt automatically to any native language and locale.
Globalization support is provided in the Oracle Content DB repository so that the other dependent processes, such as the protocol servers, can share and use this support. The major globalization goal for the repository is to ensure efficient storage of documents of heterogeneous character sets and languages, and to allow effective update, retrieval, and search operations on these documents.
This appendix provides information about the following topics:
How to Choose the Database Character Set for Oracle Content DB
How to Ensure Documents Are Properly Indexed in Oracle Content DB
In the repository, all metadata strings, such as the name or description of a document, are stored in the VARCHAR2 data type of Oracle Database. Strings stored in this data type are encoded in the database character set, which is specified when the database is created. The document itself, however, is unstructured data and is stored in one of the large object data types of Oracle Database, specifically the BLOB data type. The BLOB data type stores content as is, avoiding any character set conversion of document content. The LONG and CLOB data types, by contrast, store content in the database character set, which requires character set conversion. Such conversions can compromise data integrity because characters may be converted incorrectly or lost.
The full-text search index built on the document content is encoded in the database character set. When the content of a document is indexed, the BLOB data is converted from the character set of the content to the database character set so that the index text tokens can be created. If the character set of the content is not a subset of the database character set, the conversion yields garbage tokens. For example, a database whose character set is ISO-8859-1 (Western European languages) cannot correctly index a Shift-JIS (Japanese) document. To search content effectively, consider the character sets of the documents that users will store when you select the database character set.
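The effect of converting content into a character set that is not a superset of it can be sketched in a few lines of Python. This example uses only Python's standard codecs and does not call Oracle Text; the sample string and the `convert` helper are assumptions made for illustration.

```python
# Sample Japanese text, as it might appear in a Shift-JIS document.
text = "日本語の文書"

# The bytes as they would be stored, unchanged, in a BLOB.
sjis_bytes = text.encode("shift_jis")


def convert(content: bytes, source_charset: str, target_charset: str) -> str:
    """Convert content to the target character set and return what survives.

    Characters that have no mapping in the target are replaced with '?'.
    """
    decoded = content.decode(source_charset)
    converted = decoded.encode(target_charset, errors="replace")
    return converted.decode(target_charset)


# ISO-8859-1 contains no Japanese characters, so every character is lost.
print(convert(sjis_bytes, "shift_jis", "iso-8859-1"))  # ??????

# UTF-8 covers all of Shift-JIS, so the text survives the conversion intact.
print(convert(sjis_bytes, "shift_jis", "utf-8") == text)  # True
```

Tokens built from the first result would be meaningless, which is why the database character set should cover every character set that users are expected to store.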
If your Oracle Content DB instance will contain multilingual documents, AL32UTF8 is the recommended database character set. AL32UTF8 supports the characters defined in the Unicode standard, which solves the problem of handling many different languages in the same application or database. Unicode is a single, global character set that contains all major living scripts and conforms to international standards. It provides a unique code value for every character, regardless of the platform, program, or language. AL32UTF8 is Oracle's name for UTF-8, the 8-bit encoding of Unicode. It is a variable-width encoding and a strict superset of ASCII: one Unicode character occupies 1, 2, 3, or 4 bytes. Characters from the European scripts are represented in either 1 or 2 bytes, characters from most Asian scripts in 3 bytes, and supplementary characters in 4 bytes. With a Unicode-based file system, document content and metadata in different languages can be shared by users with different language preferences in one system.
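The variable-width behavior described above can be checked with Python's built-in UTF-8 codec. The sample characters are assumptions picked to represent each width class; the byte counts come from the encoding itself and do not depend on Oracle Database.

```python
# One sample character from each UTF-8 width class.
samples = {
    "A": "ASCII letter",
    "\u00e9": "European script (e with acute accent)",
    "\u65e5": "Asian script (CJK ideograph)",
    "\U00020000": "supplementary character (CJK Extension B)",
}


def utf8_width(ch: str) -> int:
    """Return the number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_width(ch)} byte(s)")

# UTF-8 is a strict superset of ASCII: ASCII text has identical bytes in both.
print("report.txt".encode("ascii") == "report.txt".encode("utf-8"))  # True
```

When sizing VARCHAR2 columns for metadata, the practical consequence is that a string of N characters can need up to 4 × N bytes.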
To support documents in different character sets and languages in a single file system, the repository associates two globalization attributes with each document. They are the character set and language attributes.
The character set of a document is used in several situations. When the document content is rendered to a file, the character set of the document is used as the character encoding of the file. When the document is displayed in the browser, the character set of the document is set in the HTTP Content-Type header. Finally, when a full-text search index is built on a text document, Oracle Text uses the character set of the document to convert the data into the database character set before building the index. When the character set of a document is updated, the content is reindexed.
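As a rough illustration of the first two uses, the following Python sketch renders text to bytes in a document's character set and builds the matching HTTP header. The helper names `render_to_bytes` and `content_type_header` are hypothetical and are not part of the Oracle Content DB API.

```python
def render_to_bytes(text: str, charset: str) -> bytes:
    """Encode document text using the document's own character set."""
    return text.encode(charset)


def content_type_header(mime_type: str, charset: str) -> str:
    """Build an HTTP Content-Type header that declares the character set."""
    return f"Content-Type: {mime_type}; charset={charset}"


# A two-character Japanese string occupies four bytes in Shift-JIS.
body = render_to_bytes("\u65e5\u672c", "shift_jis")
print(len(body))  # 4

print(content_type_header("text/html", "shift_jis"))
# Content-Type: text/html; charset=shift_jis
```

If the declared character set does not match the bytes actually sent, the browser decodes the content incorrectly, which is why the attribute is stored with each document.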
If no character set is specified when a document is inserted, the repository determines a default character set by using the character set of the user's LibrarySession stored in the Localizer object. This is obtained from the PrimaryUserProfile information of the user when the LibrarySession of the user is initialized.
The language of a document is used as a criterion to limit the search for documents of a particular language. It is also used to build a full-text search index on the document with Oracle Text. The multilexer feature of Oracle Text uses the language to identify the specific lexer to parse the document for searchable words. The language-specific lexers need to be defined and associated with a language before the index is built. Table F-1 describes the language-specific lexers.
Table F-1 Language-Specific Lexers
Language | Lexer | Lexer Option |
---|---|---|
Brazilian Portuguese | BASIC_LEXER | BASE LETTER |
Canadian French | BASIC_LEXER | BASE LETTER, INDEX THEME |
Danish | BASIC_LEXER | BASE LETTER, DANISH ALTERNATE SPELLING |
Dutch | BASIC_LEXER | BASE LETTER |
Finnish | BASIC_LEXER | BASE LETTER |
French | BASIC_LEXER | BASE LETTER, INDEX THEME, THEME LANGUAGE=FRENCH |
German | BASIC_LEXER | BASE LETTER, GERMAN ALTERNATE SPELLING |
Italian | BASIC_LEXER | BASE LETTER |
Japanese | JAPANESE_VGRAM_LEXER | Not applicable |
Korean | KOREAN_LEXER | Not applicable |
Latin American | BASIC_LEXER | BASE LETTER |
Spanish Portuguese | BASIC_LEXER | BASE LETTER |
Simplified Chinese | CHINESE_VGRAM_LEXER | Not applicable |
Swedish | BASIC_LEXER | BASE LETTER, SWEDISH ALTERNATE SPELLING |
Traditional Chinese | CHINESE_VGRAM_LEXER | Not applicable |
Others | BASIC_LEXER | INDEX THEME, THEME LANGUAGE=ENGLISH, INDEX TEXT |
The BASIC_LEXER is used for single-byte languages that use white space as a word separator. Asian languages do not separate words with white space, so their lexers instead use a V-gram algorithm to parse documents for searchable keys. Languages that are not supported by Oracle Text are parsed as English. The multilexer that Oracle Content DB uses is a global lexer that contains German, Danish, Swedish, Japanese, Simplified Chinese, Traditional Chinese, and Korean sublexers.
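The following Python sketch illustrates two ideas from this section: choosing a lexer by document language with a fallback, and why white-space splitting fails for text that has no word separators. It is not Oracle Text code. The dictionary copies a few rows of Table F-1, the function names are hypothetical, and the tokenizer uses fixed overlapping two-character tokens (bigrams) as a simplified stand-in for the V-gram algorithm.

```python
# A few rows of Table F-1: document language -> lexer name.
LEXER_BY_LANGUAGE = {
    "German": "BASIC_LEXER",
    "Japanese": "JAPANESE_VGRAM_LEXER",
    "Korean": "KOREAN_LEXER",
    "Traditional Chinese": "CHINESE_VGRAM_LEXER",
}


def select_lexer(language: str) -> str:
    """Pick the lexer for a language; unlisted languages fall back to BASIC_LEXER."""
    return LEXER_BY_LANGUAGE.get(language, "BASIC_LEXER")


def whitespace_tokens(text: str) -> list:
    """Split on white space, as is suitable for space-delimited languages."""
    return text.split()


def bigram_tokens(text: str) -> list:
    """Return overlapping two-character tokens, ignoring white space."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


print(select_lexer("Japanese"))  # JAPANESE_VGRAM_LEXER
print(select_lexer("Hungarian"))  # BASIC_LEXER

# A Japanese phrase with no spaces yields a single, unsearchable token
# when split on white space, but three overlapping tokens as bigrams.
phrase = "東京都庁"
print(len(whitespace_tokens(phrase)))  # 1
print(len(bigram_tokens(phrase)))  # 3
```

Overlapping tokens let a query for any two adjacent characters match the document without requiring a dictionary of word boundaries.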
If no language is specified when a document is inserted, the repository determines a default language as follows:
If the character set has been set, the language can most likely be obtained from a best-guess algorithm based on the character set value. For example, a document with a character set of Shift-JIS will most likely be in Japanese.
The default language is obtained from the Localizer of the user's LibrarySession. During initialization of the LibrarySession, the default language is obtained from the PrimaryUserProfile of the user.
The default language and default character set are specified when a new user is created in Oracle Internet Directory.
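The defaulting rules for both attributes can be summarized in a short Python sketch. The `UserDefaults` class stands in for the values held by the Localizer of the user's LibrarySession, the best-guess table is a hypothetical sample, and the order of the checks is an assumption based on the description above rather than the repository's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical best-guess table: character set -> most likely language.
LANGUAGE_BY_CHARSET = {
    "shift_jis": "Japanese",
    "euc-jp": "Japanese",
    "euc-kr": "Korean",
    "big5": "Traditional Chinese",
    "gb2312": "Simplified Chinese",
}


@dataclass
class UserDefaults:
    """Stand-in for the defaults taken from the user's profile."""
    language: str
    charset: str


def resolve_charset(explicit: Optional[str], user: UserDefaults) -> str:
    """Use the character set given at insert time, else the user's default."""
    return explicit or user.charset


def resolve_language(explicit: Optional[str], charset: Optional[str],
                     user: UserDefaults) -> str:
    """Use the explicit language, else guess from an explicitly set
    character set, else fall back to the user's default language."""
    if explicit:
        return explicit
    if charset:
        guess = LANGUAGE_BY_CHARSET.get(charset.lower())
        if guess:
            return guess
    return user.language


user = UserDefaults(language="French", charset="iso-8859-1")
print(resolve_charset(None, user))  # iso-8859-1
print(resolve_language(None, "Shift_JIS", user))  # Japanese
print(resolve_language(None, None, user))  # French
```

A character set such as ISO-8859-1 covers many languages, so no guess is possible for it in this sketch and the user's default language applies.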
Oracle Content DB identifies languages using Oracle Globalization Support language abbreviations. See "Document Languages Supported in Oracle Content DB" for a list of Oracle Content DB-supported languages.
Some protocols do not support multibyte user names. Access through WebDAV and HTTP is not available for user names that contain multibyte characters. In addition, some protocols require that user passwords be in ASCII format.
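An administrator who provisions accounts in bulk might want to flag names and passwords that these restrictions affect. The following Python sketch shows one way to do so. The function names are hypothetical, and treating "multibyte" as "needs more than one byte in UTF-8" is an assumption; the result depends on the character set in which the name is actually stored.

```python
def has_multibyte_characters(value: str, charset: str = "utf-8") -> bool:
    """Return True if any character needs more than one byte in the charset."""
    return any(len(ch.encode(charset)) > 1 for ch in value)


def is_ascii(value: str) -> bool:
    """Return True if every character is in the 7-bit ASCII range."""
    return value.isascii()


print(has_multibyte_characters("jsmith"))  # False
print(has_multibyte_characters("\u5c71\u7530"))  # True (a two-kanji name)

# The same accented letter is multibyte in UTF-8 but single-byte in ISO-8859-1.
print(has_multibyte_characters("ren\u00e9"))  # True
print(has_multibyte_characters("ren\u00e9", "iso-8859-1"))  # False

print(is_ascii("s3cret!"))  # True
print(is_ascii("pa\u00dfwort"))  # False
```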
Table F-2 is a summary of the character sets supported in Oracle Content DB.
Table F-2 Character Sets Supported in Oracle Content DB
Language | IANA Preferred MIME Character Set | IANA Additional Aliases | Java Encodings | Oracle Character Set |
---|---|---|---|---|
Arabic (ISO) | iso-8859-6 | ISO_8859-6:1987, iso-ir-127, ISO_8859-6, ECMA-114, ASMO-708, arabic, csISOLatinArabic | ISO8859_6 | AR8ISO8859P6 |
Arabic (Windows) | windows-1256 | none | Cp1256 | AR8MSWIN1256 |
Baltic (ISO) | iso-8859-4 | csISOLatin4, iso-ir-110, ISO_8859-4, ISO_8859-4:1988, l4, latin4 | ISO8859_4 | NEE8ISO8859P4 |
Baltic (Windows) | windows-1257 | none | Cp1257 | BLT8MSWIN1257 |
Central European (DOS) | ibm852 | cp852, 852, csPcp852 | Cp852 | EE8PC852 |
Central European (ISO) | iso-8859-2 | csISOLatin2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2 | ISO8859_2 | EE8ISO8859P2 |
Central European (Windows) | windows-1250 | x-cp1250 | Cp1250 | EE8MSWIN1250 |
Chinese | iso-2022-cn (not defined by IANA, but used in MIME documents) | csISO2022CN | ISO2022CN | ISO2022-CN |
Chinese Simplified (GB2312) | gb2312 | chinese, csGB2312, csISO58GB231280, GB2312, GB_2312-80, iso-ir-58 | EUC_CN | ZHS16CGB231280 |
Chinese Simplified (GB18030) | GB18030 | none | GB18030 | ZHS32GB18030 |
Chinese Simplified (Windows) | GBK | windows-936 | GBK | ZHS16GBK |
Chinese Traditional | big5 | csbig5, x-x-big5 | Big5 | ZHT16BIG5 |
Chinese Traditional | windows-950 | none | MS950 | ZHT16MSWIN950 |
Chinese Traditional (EUC-TW) | EUC-TW | none | EUC_TW | ZHT32EUC |
Chinese Traditional (Big5-HKSCS) | Big5-HKSCS | none | Big5_HKSCS | ZHT16HKSCS |
Cyrillic (DOS) | ibm866 | cp866, 866, csIBM866 | Cp866 | RU8PC866 |
Cyrillic (ISO) | iso-8859-5 | csISOLatinCyrillic, cyrillic, iso-ir-144, ISO_8859-5, ISO_8859-5:1988 | ISO8859_5 | CL8ISO8859P5 |
Cyrillic (KOI8-R) | koi8-r | csKOI8R, koi | KOI8_R | CL8KOI8R |
Cyrillic Alphabet (Windows) | windows-1251 | x-cp1251 | Cp1251 | CL8MSWIN1251 |
Greek (ISO) | iso-8859-7 | csISOLatinGreek, ECMA-118, ELOT_928, greek, greek8, iso-ir-126, ISO_8859-7, ISO_8859-7:1987 | ISO8859_7 | EL8ISO8859P7 |
Greek (Windows) | windows-1253 | none | Cp1253 | EL8MSWIN1253 |
Hebrew (ISO) | iso-8859-8 | csISOLatinHebrew, hebrew, iso-ir-138, ISO_8859-8, visual, ISO-8859-8 Visual, ISO_8859-8:1988 | ISO8859_8 | IW8ISO8859P8 |
Hebrew (Windows) | windows-1255 | none | Cp1255 | IW8MSWIN1255 |
Japanese (JIS) | iso-2022-jp | csISO2022JP | ISO2022JP | ISO2022-JP |
Japanese (EUC) | euc-jp | csEUCPkdFmtJapanese, Extended_UNIX_Code_Packed_Format_for_Japanese, x-euc, x-euc-jp | EUC_JP | JA16EUC |
Japanese (Shift-JIS) | shift_jis | csShiftJIS, csWindows31J, ms_Kanji, shift-jis, x-ms-cp932, x-sjis | MS932 | JA16SJIS |
Korean | ks_c_5601-1987 | csKSC56011987, korean, ks_c_5601, euc-kr, csEUCKR | EUC_KR | KO16KSC5601 |
Korean (ISO) | iso-2022-kr | csISO2022KR | ISO2022KR | ISO2022-KR |
Korean (Windows) | windows-949 | none | MS949 | KO16MSWIN949 |
South European (ISO) | iso-8859-3 | ISO_8859-3, ISO_8859-3:1988, iso-ir-109, latin3, l3, csISOLatin3 | ISO8859_3 | SE8ISO8859P3 |
Thai | TIS-620 | windows-874 | TIS620 | TH8TISASCII |
Turkish (Windows) | windows-1254 | none | Cp1254 | TR8MSWIN1254 |
Turkish (ISO) | iso-8859-9 | latin5, l5, csISOLatin5, ISO_8859-9, iso-ir-148, ISO_8859-9:1989 | ISO8859_9 | WE8ISO8859P9 |
Universal (UTF-8) | utf-8 | unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8 | UTF8 | UTF8 |
Unicode (UTF-16BE) | UTF-16BE | none | UTF-16BE | AL16UTF16 |
Unicode (UTF-16LE) | UTF-16LE | none | UTF-16LE | AL16UTF16LE |
Vietnamese (Windows) | windows-1258 | none | Cp1258 | VN8MSWIN1258 |
Western Alphabet | iso-8859-1 | cp819, ibm819, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, latin1, l1, csISOLatin1 | ISO8859_1 | WE8ISO8859P1 |
Western Alphabet (DOS) | ibm850 | cp850, 850, csIBM850 | Cp850 | WE8PC850 |
Western Alphabet (Windows) | windows-1252 | x-ansi | Cp1252 | WE8MSWIN1252 |
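When a character set name arrives from a MIME header or a client, it may be the preferred name or any of its aliases, in any letter case. The following Python sketch shows one way to resolve such a name to the Java encoding and Oracle character set listed in Table F-2. Only three rows are included, the aliases are a subset of those in the table, and the `CharsetNames` type and `lookup` function are hypothetical.

```python
from typing import NamedTuple, Optional


class CharsetNames(NamedTuple):
    iana: str
    java: str
    oracle: str


# Three rows copied from Table F-2, each with a few of its IANA aliases.
_ROWS = [
    (CharsetNames("shift_jis", "MS932", "JA16SJIS"),
     ["csShiftJIS", "ms_Kanji", "x-sjis"]),
    (CharsetNames("iso-8859-1", "ISO8859_1", "WE8ISO8859P1"),
     ["latin1", "l1", "cp819"]),
    (CharsetNames("utf-8", "UTF8", "UTF8"),
     ["unicode-1-1-utf-8"]),
]

# Index every preferred name and alias in lowercase for case-insensitive lookup.
_INDEX = {}
for names, aliases in _ROWS:
    for key in [names.iana, *aliases]:
        _INDEX[key.lower()] = names


def lookup(name: str) -> Optional[CharsetNames]:
    """Resolve a preferred name or alias to its row, or None if unknown."""
    return _INDEX.get(name.strip().lower())


print(lookup("Latin1").oracle)  # WE8ISO8859P1
print(lookup("X-SJIS").java)  # MS932
print(lookup("no-such-charset"))  # None
```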
Table F-3 is a summary of the document languages supported in Oracle Content DB. Note that the supported document languages are different from the languages supported in the Oracle Content DB application.
Table F-3 Document Languages Supported in Oracle Content DB
Oracle Language Name | Java Locale | ISO Locale |
---|---|---|
Arabic | ar | ar |
Bengali | bn | bn |
Brazilian Portuguese | pt_BR | pt-br |
Bulgarian | bg | bg |
Canadian French | fr_CA | fr-ca |
Catalan | ca | ca |
Croatian | hr | hr |
Czech | cs | cs |
Danish | da | da |
Dutch | nl | nl |
Egyptian | ar_EG | ar-eg |
American | en | en |
English | en_GB | en-gb |
Estonian | et | et |
Finnish | fi | fi |
French | fr | fr |
German | de | de |
Greek | el | el |
Hebrew | he | he |
Hungarian | hu | hu |
Icelandic | is | is |
Indonesian | id | id |
Italian | it | it |
Japanese | ja | ja |
Korean | ko | ko |
Latin American Spanish | es | es |
Latvian | lv | lv |
Lithuanian | lt | lt |
Malay | ms | ms |
Mexican Spanish | es_MX | es-mx |
Norwegian | no | no |
Polish | pl | pl |
Portuguese | pt | pt |
Romanian | ro | ro |
Russian | ru | ru |
Simplified Chinese | zh_CN | zh-cn |
Slovak | sk | sk |
Slovenian | sl | sl |
Spanish | es_ES | es-es |
Swedish | sv | sv |
Thai | th | th |
Traditional Chinese | zh_TW | zh-tw |
Turkish | tr | tr |
Ukrainian | uk | uk |
Vietnamese | vi | vi |
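The two locale columns in Table F-3 differ only in separator and letter case, so one form can be derived from the other. The following Python sketch shows the conversion in both directions. The function names are hypothetical, and the lowercase, hyphenated ISO form is assumed from the pattern that most rows of the table follow.

```python
def java_to_iso(java_locale: str) -> str:
    """Convert a Java-style locale such as 'pt_BR' to the form 'pt-br'."""
    return java_locale.replace("_", "-").lower()


def iso_to_java(iso_locale: str) -> str:
    """Convert a locale such as 'zh-tw' to the Java-style form 'zh_TW'."""
    language, _, region = iso_locale.partition("-")
    if region:
        return f"{language.lower()}_{region.upper()}"
    return language.lower()


print(java_to_iso("pt_BR"))  # pt-br
print(java_to_iso("ja"))  # ja
print(iso_to_java("zh-tw"))  # zh_TW
print(iso_to_java("de"))  # de
```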