Oracle® Content Services Administrator's Guide 10g Release 1 (10.1.1) Part Number B14493-01
Oracle Content Services globalization support enables users to store and search documents of heterogeneous character sets and languages in a single Oracle Content Services instance. The globalization infrastructure ensures that the resource strings, error messages, sort order, date, time, numeric, and calendar conventions adapt automatically to any native language and locale.
Globalization support is provided in the Oracle Content Services repository so that the other dependent processes, such as the protocol servers, can share and utilize this support. The major globalization goal for the repository is to ensure efficient storage of documents of heterogeneous character sets and languages, and to allow effective update, retrieval, and search on these documents.
This appendix covers the following topics:
How to Choose the Database Character Set for Oracle Content Services
How to Make Sure Documents Are Properly Indexed in Oracle Content Services
In the repository, all metadata strings, such as the name of the document or the description, are stored in the VARCHAR2 data type of the Oracle database. Strings stored in this data type are encoded in the database character set specified when a database is created. The document itself, however, is unstructured data and stored in one of the large object data types of the Oracle database, particularly the BLOB data type. The BLOB data type stores content as-is, avoiding any character set conversion on document content. The LONG and CLOB data types store content in the database character set, which requires character set conversion. Conversions can compromise the data integrity and have the potential to convert incorrectly or lose characters.
The full-text search index built on the document content is encoded in the database character set. When a document's content is indexed, the BLOB data is converted from the content's character set to the database character set to create the index text tokens. If the content's character set is not a subset of the database character set, the conversion yields garbage tokens. For example, a database whose character set is ISO-8859-1 (Western European languages) cannot correctly index a Shift-JIS (Japanese) document. To search content effectively, consider the character sets of the documents that users will store when selecting the database character set.
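The subset relationship described above can be illustrated with a short Java sketch. It uses only the standard `java.nio.charset` API, not Oracle's own conversion routines, and the class and method names are hypothetical. It checks whether a piece of text can be represented in a target character set at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetCoverageCheck {

    /** Returns true if every character of text can be encoded in the target charset. */
    static boolean canRepresent(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Three Japanese ideographs, written as Unicode escapes so the source stays ASCII.
        String japanese = "\u65E5\u672C\u8A9E";

        // ISO-8859-1 has no code points for these characters.
        System.out.println(canRepresent(japanese, StandardCharsets.ISO_8859_1)); // false

        // A Unicode encoding covers them.
        System.out.println(canRepresent(japanese, StandardCharsets.UTF_8));      // true
    }
}
```

When the check returns false, any conversion into that target character set has to drop or substitute characters, which is the situation that produces unusable index tokens.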
If your Oracle Content Services instance will contain multilingual documents, UTF8 is the recommended database character set. UTF8 supports characters defined in the Unicode standard. The Unicode standard solves the problem of many different languages in the same application or database. Unicode is a single, global character set which contains all major living scripts and conforms to international standards. Unicode provides a unique code value for every character, regardless of the platform, program, or language. UTF8 is the 8-bit encoding of Unicode. It is a variable-width encoding and a strict superset of ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in UTF8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes. By using a Unicode-based file system, document content and metadata of different languages can be shared by users with different language preferences in one system.
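The variable-width behavior can be observed with a minimal Java sketch. It uses Java's standard UTF-8 encoder as a stand-in for the encoding form. It does not exercise the Oracle database character set named UTF8, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Widths {

    /** Number of bytes the given Unicode code point occupies in standard UTF-8. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x0041));  // Latin capital A (ASCII): 1
        System.out.println(utf8Length(0x00E9));  // Latin small e with acute: 2
        System.out.println(utf8Length(0x65E5));  // CJK ideograph: 3
        System.out.println(utf8Length(0x20000)); // supplementary CJK ideograph: 4
    }
}
```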
The Oracle9i database introduces a new character set, AL32UTF8. In Release 1, AL32UTF8 was the default character set for Unicode 3.0 deployment. In Release 2, AL32UTF8 is compliant with the Unicode 3.1 standard, which contains the supplementary characters, particularly additional Chinese, Japanese, and Korean ideographs. AL32UTF8 is the default character set of an Oracle9i database installation.
Note:
Oracle Content Services does not support an AL32UTF8 database because Oracle Text does not support Chinese, Japanese, and Korean lexers on an AL32UTF8 database. UTF8 is the recommended database character set for a Unicode-based file system. If Oracle Content Services is installed in an AL32UTF8 database, Chinese, Japanese, and Korean documents will not be indexed and, therefore, will not be searchable.
Oracle Content Services configuration will fail in a Chinese, Japanese, or Korean locale against an AL32UTF8 database. This is because Oracle Text behaves differently when the database session language is initialized to an Asian language as opposed to American. JDBC initializes the database session language according to the locale of the running application, which in this case is the configuration tool.
To support documents in different character sets and languages in a single file system, the repository associates two globalization attributes with each document. They are the character set and language attributes.
The character set of a document is used in several situations. When the document content is rendered to a file, the character set of the document is used as the character encoding of the file. When the document is displayed in the browser, the character set of the document is set in the HTTP Content-Type header. Finally, when a full-text search index is built on a text document, Oracle Text uses the character set of the document to convert the data into the database character set before building the index. When the character set of a document is updated, the content is reindexed.
If no character set is specified upon insertion of a document, the repository determines a default character set as follows: the character set of the user's LibrarySession stored in the Localizer object is first used. This is obtained from the user's PrimaryUserProfile information at initialization of the user's LibrarySession.
The language of a document is used as a criterion to limit the search for documents of a particular language. It is also used to build a full-text search index on the document with Oracle Text. Oracle Text's multilexer feature uses the language to identify the specific lexer to parse the document for searchable words. The language-specific lexers need to be defined and associated with a language before the index is built. They are defined as follows:
Table G-1 Language-Specific Lexers
Language | Lexer | Lexer Option |
---|---|---|
Brazilian Portuguese | BASIC_LEXER | BASE LETTER |
Canadian French | BASIC_LEXER | BASE LETTER INDEX THEME |
Danish | BASIC_LEXER | BASE LETTER DANISH ALTERNATE SPELLING |
Dutch | BASIC_LEXER | BASE LETTER |
Finnish | BASIC_LEXER | BASE LETTER |
French | BASIC_LEXER | BASE LETTER INDEX THEME THEME LANGUAGE=FRENCH |
German | BASIC_LEXER | BASE LETTER GERMAN ALTERNATE SPELLING |
Italian | BASIC_LEXER | BASE LETTER |
Japanese | JAPANESE_VGRAM_LEXER | N/A |
Korean | KOREAN_LEXER | N/A |
Latin American | BASIC_LEXER | BASE LETTER |
Spanish Portuguese | BASIC_LEXER | BASE LETTER |
Simplified Chinese | CHINESE_VGRAM_LEXER | N/A |
Swedish | BASIC_LEXER | BASE LETTER SWEDISH ALTERNATE SPELLING |
Traditional Chinese | CHINESE_VGRAM_LEXER | N/A |
Others | BASIC_LEXER | INDEX THEME THEME LANGUAGE=ENGLISH INDEX TEXT |
The BASIC_LEXER is used for single-byte languages using white space as a word separator. Asian language lexers cannot use white space as word separators. Instead, they use a V-gram algorithm to parse the documents for searchable keys. Languages that have not been supported by Oracle Text are parsed as English. Oracle Content Services uses the multilexer feature of Oracle Text. It is a global lexer containing German, Danish, Swedish, Japanese, Simplified Chinese, Traditional Chinese, and Korean sublexers.
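The idea of choosing a lexer by document language, with unsupported languages falling back to the "Others" entry, can be sketched in Java as a simple lookup. This is only an illustration of the selection rule, not how Oracle Text implements its multilexer. The class name, method name, and the subset of table rows included here are assumptions.

```java
import java.util.Map;

final class LexerLookup {

    // A few rows taken from the language-specific lexer table (not the complete list).
    private static final Map<String, String> LEXERS = Map.of(
            "German", "BASIC_LEXER",
            "Japanese", "JAPANESE_VGRAM_LEXER",
            "Korean", "KOREAN_LEXER",
            "Simplified Chinese", "CHINESE_VGRAM_LEXER",
            "Traditional Chinese", "CHINESE_VGRAM_LEXER");

    /** Returns the lexer for a language, or BASIC_LEXER (the "Others" row) if none is defined. */
    static String lexerFor(String language) {
        return LEXERS.getOrDefault(language, "BASIC_LEXER");
    }

    public static void main(String[] args) {
        System.out.println(lexerFor("Japanese")); // JAPANESE_VGRAM_LEXER
        System.out.println(lexerFor("Thai"));     // BASIC_LEXER (parsed as English)
    }
}
```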
If no language is specified upon insertion of a document, the repository determines a default language as follows.
If the character set has been set, the language can most likely be obtained from a 'best-guess' algorithm based on the character set value. For example, a document with a character set of Shift-JIS will most likely be in Japanese.
Otherwise, the default language is obtained from the Localizer of the user's LibrarySession. During initialization of the LibrarySession, the default language is obtained from the user's PrimaryUserProfile.
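The fallback order for the document language can be sketched as follows. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, and the best-guess table is a made-up sample, because the actual algorithm used by the repository is not described here.

```java
import java.util.Locale;
import java.util.Map;

final class DefaultLanguageResolver {

    // Hypothetical best-guess table from IANA character set name to Oracle language name.
    private static final Map<String, String> BEST_GUESS = Map.of(
            "shift_jis", "Japanese",
            "euc-jp", "Japanese",
            "euc-kr", "Korean",
            "gb2312", "Simplified Chinese",
            "big5", "Traditional Chinese");

    /**
     * Picks a language: the explicit value if given, else a guess from the
     * character set, else the session default (any argument may be null).
     */
    static String resolve(String explicitLanguage, String characterSet, String sessionDefault) {
        if (explicitLanguage != null) {
            return explicitLanguage;
        }
        if (characterSet != null) {
            String guess = BEST_GUESS.get(characterSet.toLowerCase(Locale.ROOT));
            if (guess != null) {
                return guess;
            }
        }
        return sessionDefault;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "Shift_JIS", "American")); // Japanese
        System.out.println(resolve(null, "utf-8", "American"));     // American
    }
}
```

A character set such as UTF-8 says nothing about the language, which is why the sketch falls through to the session default in that case.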
The defaults for both language and character set are specified by the Subscriber Administrator when a new user is created.
Oracle Content Services identifies languages using Oracle Globalization Support language abbreviations. See "Document Languages Supported in Oracle Content Services" for a list of Oracle Content Services-supported languages.
There are two service configuration properties that hold default character set and language values for Oracle Content Services Subscribers. The properties are:
IFS.SERVICE.DefaultCharacterSet
IFS.SERVICE.DefaultLanguage
These two properties are initialized during Oracle Content Services configuration and can later be modified through the Oracle Collaboration Suite Control. The Oracle Content Services default character set should be the same as, or a subset of, the database character set. The character set should be specified in accordance with the IANA standard naming convention. The language should be specified in accordance with Oracle naming for languages. See "Character Sets Supported in Oracle Content Services" and "Document Languages Supported in Oracle Content Services" for a list of Oracle Content Services-supported character sets and languages.
Some protocols do not support multibyte user names. Access through WebDAV and HTTP is not available for user names that contain multibyte characters. FTP allows multibyte user names. In addition, some protocols require that user passwords be in ASCII.
You can use a protocol command character set that is different from the default document character set. A protocol command character set is the character set you use to type commands in FTP or other protocols.
Oracle Content Services provides the following server configuration properties to specify the default FTP command character set:
IFS.SERVER.PROTOCOL.FTP.DefaultCommandCharacterSet
IFS.SERVER.PROTOCOL.FTP.CommandCharacterSetIsUserCharacterSet
The following precedence model determines a session's FTP command character encoding:
Explicitly specified (using quote setcommandcharacterset).
If IFS.SERVER.PROTOCOL.FTP.CommandCharacterSetIsUserCharacterSet is true, use the Default Character Set specified by the user in Globalization Preferences.
If IFS.SERVER.PROTOCOL.FTP.CommandCharacterSetIsUserCharacterSet is false, use the value of IFS.SERVER.PROTOCOL.FTP.DefaultCommandCharacterSet.
If no character set is found, use the service-wide default, IFS.SERVICE.DefaultCharacterSet.
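The precedence model above can be sketched as a small Java function. The class, method, and parameter names are hypothetical. The parameters simply stand in for the explicit quote setting, the CommandCharacterSetIsUserCharacterSet flag, the user's Default Character Set, the DefaultCommandCharacterSet property, and the service-wide default. This is a reading of the listed rules, not the server's actual code.

```java
final class FtpCommandCharsetResolver {

    /**
     * Resolves the command character set for an FTP session.
     * Any of the String arguments except serviceDefault may be null,
     * meaning "not set".
     */
    static String resolve(String explicit,
                          boolean commandCharsetIsUserCharset,
                          String userDefault,
                          String serverDefault,
                          String serviceDefault) {
        // 1. An explicitly specified character set always wins.
        if (explicit != null) {
            return explicit;
        }
        // 2 and 3. The flag selects between the user's default and the server default.
        String candidate = commandCharsetIsUserCharset ? userDefault : serverDefault;
        if (candidate != null) {
            return candidate;
        }
        // 4. Nothing found: fall back to the service-wide default.
        return serviceDefault;
    }

    public static void main(String[] args) {
        System.out.println(resolve("shift_jis", true, "utf-8", "iso-8859-1", "utf-8")); // shift_jis
        System.out.println(resolve(null, false, "utf-8", "iso-8859-1", "utf-8"));       // iso-8859-1
        System.out.println(resolve(null, false, "utf-8", null, "windows-1252"));        // windows-1252
    }
}
```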
The standard FTP protocol does not define the character set of the file names or directory names that are usually passed as arguments of FTP commands. The FTP server is responsible for interpreting the byte sequence of the FTP commands. To allow users to access documents of different character sets and languages, and to allow users to set and view the protocol command character set, the Oracle Content Services FTP server provides the following QUOTE commands:
Ftp> quote setcommandcharacterset: Allows users to specify the command character set for the FTP session. This character set specifies the character encoding to be used in subsequent FTP commands. The FTP protocol server converts FTP commands from this character encoding to Java String and vice versa. When the FTP session is first created, the FTP server uses the default character set of the session. The IANA naming standards should be used to specify the character set.
Ftp> quote setcharacterset: Allows users to specify the character set of the documents to be uploaded. Called setcharencoding in previous releases of Oracle Content Services. The IANA naming standards should be used to specify the character set.
Ftp> quote showcharacterset: Displays both the current command character set and the current document character set of the FTP session. Called showcharencoding in previous releases of Oracle Content Services. The character set is displayed using the IANA naming standards.
Ftp> quote setlanguage: Allows users to specify the language for the FTP session. The language of an FTP session is then associated with the documents that are uploaded. Oracle Text uses the language information to determine the appropriate lexer to use to index the document. When the FTP session is first created, the FTP server uses the default language of the session. Oracle language names should be used.
Ftp> quote showlanguage: Displays the current language of the FTP session. The language is displayed using the Oracle naming standard.
When a quote command is issued to change the character set or language of the FTP session, the FTP server actually updates the settings in the Localizer object of the current LibrarySession. Because quote commands cannot be issued until an FTP session is established, only user names in the FTP server's default character set, or a subset of it, can be used to log in to the FTP server. See Appendix F, "FTP Quote Command Reference" for more information about quote commands.
Users can specify the character sets and languages of their environments using standard command-line FTP clients. Browser-based FTP clients, such as Internet Explorer or Netscape, do not allow issuance of quote commands, so the FtpSession defaults are used.
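Why the command character set matters can be seen in a short Java sketch. Because the FTP protocol leaves the encoding of file names undefined, the same command produces different byte sequences under different character sets, and a server that assumes the wrong one recovers a different name. The class and method names below are hypothetical, and the sketch only uses the standard Java charset API, not the Oracle Content Services FTP server.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CommandBytes {

    /** Encodes an FTP command line, terminated by CRLF, in the client's command character set. */
    static byte[] encode(String commandLine, Charset commandCharset) {
        return (commandLine + "\r\n").getBytes(commandCharset);
    }

    /** Decodes received command bytes using the character set the server assumes. */
    static String decode(byte[] wire, Charset assumedCharset) {
        return new String(wire, assumedCharset).trim();
    }

    public static void main(String[] args) {
        // File name with two accented letters, written with Unicode escapes.
        String command = "RETR r\u00E9sum\u00E9.txt";
        byte[] wire = encode(command, StandardCharsets.UTF_8);

        // Same character set on both sides: the command round-trips.
        System.out.println(decode(wire, StandardCharsets.UTF_8).equals(command));      // true

        // Server assumes a different character set: the file name is misread.
        System.out.println(decode(wire, StandardCharsets.ISO_8859_1).equals(command)); // false
    }
}
```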
Oracle Content Services provides the following server configuration properties to specify the default WebDAV command character set:
IFS.SERVER.PROTOCOL.DAV.Webfolders.DefaultCommandCharacterSet
IFS.SERVER.PROTOCOL.DAV.Webfolders.CommandCharacterSetIsUserCharacterSet
The following precedence model determines a session's WebDAV command character encoding:
If IFS.SERVER.PROTOCOL.DAV.Webfolders.CommandCharacterSetIsUserCharacterSet is true, use the Default Character Set specified by the user in Globalization Preferences.
If IFS.SERVER.PROTOCOL.DAV.Webfolders.CommandCharacterSetIsUserCharacterSet is false, use the value of IFS.SERVER.PROTOCOL.DAV.Webfolders.DefaultCommandCharacterSet.
If no character set is found, use the service-wide default, IFS.SERVICE.DefaultCharacterSet.
The following table summarizes the character sets supported in Oracle Content Services.
Table G-2 Character Sets Supported in Oracle Content Services
Language | IANA Preferred MIME Charset | IANA Additional Aliases | Java Encodings | Oracle Charset |
---|---|---|---|---|
Arabic (ISO) | iso-8859-6 | ISO_8859-6:1987, iso-ir-127, ISO_8859-6, ECMA-114, ASMO-708, arabic, csISOLatinArabic | ISO8859_6 | AR8ISO8859P6 |
Arabic (Windows) | windows-1256 | none | Cp1256 | AR8MSWIN1256 |
Baltic (ISO) | iso-8859-4 | csISOLatin4, iso-ir-110, ISO_8859-4, ISO_8859-4:1988, l4, latin4 | ISO8859_4 | NEE8ISO8859P4 |
Baltic (Windows) | windows-1257 | none | Cp1257 | BLT8MSWIN1257 |
Central European (DOS) | ibm852 | cp852, 852, csPcp852 | Cp852 | EE8PC852 |
Central European (ISO) | iso-8859-2 | csISOLatin2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2 | ISO8859_2 | EE8ISO8859P2 |
Central European (Windows) | windows-1250 | x-cp1250 | Cp1250 | EE8MSWIN1250 |
Chinese | iso-2022-cn (not defined by IANA, but used in MIME documents) | csISO2022CN | ISO2022CN | ISO2022-CN |
Chinese Simplified (GB2312) | gb2312 | chinese, csGB2312, csISO58GB231280, GB2312, GB_2312-80, iso-ir-58 | EUC_CN | ZHS16CGB231280 |
Chinese Simplified (GB18030) | GB18030 | none | GB18030 | ZHS32GB18030 |
Chinese Simplified (Windows) | GBK | windows-936 | GBK | ZHS16GBK |
Chinese Traditional | big5 | csbig5, x-x-big5 | Big5 | ZHT16BIG5 |
Chinese Traditional | windows-950 | none | MS950 | ZHT16MSWIN950 |
Chinese Traditional (EUC-TW) | EUC-TW | none | EUC_TW | ZHT32EUC |
Chinese Traditional (Big5-HKSCS) | Big5-HKSCS | none | Big5_HKSCS | ZHT16HKSCS |
Cyrillic (DOS) | ibm866 | cp866, 866, csIBM866 | Cp866 | RU8PC866 |
Cyrillic (ISO) | iso-8859-5 | csISOLatinCyrillic, cyrillic, iso-ir-144, ISO_8859-5, ISO_8859-5:1988 | ISO8859_5 | CL8ISO8859P5 |
Cyrillic (KOI8-R) | koi8-r | csKOI8R, koi | KOI8_R | CL8KOI8R |
Cyrillic Alphabet (Windows) | windows-1251 | x-cp1251 | Cp1251 | CL8MSWIN1251 |
Greek (ISO) | iso-8859-7 | csISOLatinGreek, ECMA-118, ELOT_928, greek, greek8, iso-ir-126, ISO_8859-7, ISO_8859-7:1987 | ISO8859_7 | EL8ISO8859P7 |
Greek (Windows) | windows-1253 | none | Cp1253 | EL8MSWIN1253 |
Hebrew (ISO) | iso-8859-8 | csISOLatinHebrew, hebrew, iso-ir-138, ISO_8859-8, visual, ISO-8859-8 Visual, ISO_8859-8:1988 | ISO8859_8 | IW8ISO8859P8 |
Hebrew (Windows) | windows-1255 | none | Cp1255 | IW8MSWIN1255 |
Japanese (JIS) | iso-2022-jp | csISO2022JP | ISO2022JP | ISO2022-JP |
Japanese (EUC) | euc-jp | csEUCPkdFmtJapanese, Extended_UNIX_Code_Packed_Format_for_Japanese, x-euc, x-euc-jp | EUC_JP | JA16EUC |
Japanese (Shift-JIS) | shift_jis | csShiftJIS, csWindows31J, ms_Kanji, shift-jis, x-ms-cp932, x-sjis | MS932 | JA16SJIS |
Korean | ks_c_5601-1987 | csKSC56011987, korean, ks_c_5601, euc-kr, csEUCKR | EUC_KR | KO16KSC5601 |
Korean (ISO) | iso-2022-kr | csISO2022KR | ISO2022KR | ISO2022-KR |
Korean (Windows) | windows-949 | none | MS949 | KO16MSWIN949 |
South European (ISO) | iso-8859-3 | ISO_8859-3, ISO_8859-3:1988, iso-ir-109, latin3, l3, csISOLatin3 | ISO8859_3 | SE8ISO8859P3 |
Thai | TIS-620 | windows-874 | TIS620 | TH8TISASCII |
Turkish (Windows) | windows-1254 | none | Cp1254 | TR8MSWIN1254 |
Turkish (ISO) | iso-8859-9 | latin5, l5, csISOLatin5, ISO_8859-9, iso-ir-148, ISO_8859-9:1989 | ISO8859_9 | WE8ISO8859P9 |
Universal (UTF-8) | utf-8 | unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8 | UTF8 | UTF8 |
Unicode (UTF-16BE) | UTF-16BE | none | UTF-16BE | AL16UTF16 |
Unicode (UTF-16LE) | UTF-16LE | none | UTF-16LE | AL16UTF16LE |
Vietnamese (Windows) | windows-1258 | none | Cp1258 | VN8MSWIN1258 |
Western Alphabet | iso-8859-1 | cp819, ibm819, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, latin1, l1, csISOLatin1 | ISO8859_1 | WE8ISO8859P1 |
Western Alphabet (DOS) | ibm850 | cp850, 850, csIBM850 | Cp850 | WE8PC850 |
Western Alphabet (Windows) | windows-1252 | x-ansi | Cp1252 | WE8MSWIN1252 |
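The relationship between an IANA preferred name, its aliases, and the Java encoding name can be checked with the standard `java.nio.charset.Charset` class, as in the following sketch. The class and method names are hypothetical, and the sketch covers only the Java side of the mapping. Oracle character set names such as WE8ISO8859P1 are not known to the Java runtime.

```java
import java.nio.charset.Charset;

final class IanaNames {

    /** Returns the canonical Java charset name for an IANA name, alias, or Java encoding name. */
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    /** Returns true if the running Java platform recognizes the (syntactically legal) name. */
    static boolean isKnown(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("latin1"));      // ISO-8859-1
        System.out.println(canonicalName("csISOLatin1")); // ISO-8859-1
        System.out.println(canonicalName("ISO8859_1"));   // ISO-8859-1
        System.out.println(isKnown("no-such-charset"));   // false
    }
}
```

A check of this kind can be useful before entering a value for a default character set property, since a misspelled name would otherwise only surface later.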
The following table summarizes the document languages supported in Oracle Content Services. Note that the supported document languages are different from the languages supported in the Oracle Content Services application.
Table G-3 Document Languages Supported in Oracle Content Services
Oracle Language Name | Java Locale | ISO Locale |
---|---|---|
Arabic | ar | ar |
Bengali | bn | bn |
Brazilian Portuguese | pt_BR | pt-br |
Bulgarian | bg | bg |
Canadian French | fr_CA | fr-ca |
Catalan | ca | ca |
Croatian | hr | hr |
Czech | cs | cs |
Danish | da | da |
Dutch | nl | nl |
Egyptian | ar_EG | ar-eg |
American | en | en |
English | en_GB | en-gb |
Estonian | et | et |
Finnish | fi | fi |
French | fr | fr |
German | de | de |
Greek | el | el |
Hebrew | he | he |
Hungarian | hu | hu |
Icelandic | is | is |
Indonesian | id | in |
Italian | it | it |
Japanese | ja | ja |
Korean | ko | ko |
Latin American Spanish | es | es |
Latvian | lv | lv |
Lithuanian | lt | lt |
Malay | ms | ms |
Mexican Spanish | es_MX | es-mx |
Norwegian | no | no |
Polish | pl | pl |
Portuguese | pt | pt |
Romanian | ro | ro |
Russian | ru | ru |
Simplified Chinese | zh_CN | zh-cn |
Slovak | sk | sk |
Slovenian | sl | sl |
Spanish | es_ES | es-es |
Swedish | sv | sv |
Thai | th | th |
Traditional Chinese | zh_TW | zh-tw |
Turkish | tr | tr |
Ukrainian | uk | uk |
Vietnamese | vi | vi |