G Globalization Support

This appendix provides information about Oracle Files globalization support. Topics include:

Globalization Support Overview
The Repository
Oracle Files Protocols
Character Sets Supported in Oracle Files
Languages Supported In Oracle Files

Globalization Support Overview

Oracle Files globalization support enables users to store and search documents of heterogeneous character sets and languages in a single Oracle Files instance. The globalization infrastructure ensures that the resource strings, error messages, sort order, date, time, numeric, and calendar conventions adapt automatically to any native language and locale.

The Repository

The repository is the implementation of the core of Oracle Files, on which the protocol servers and applications are built. Globalization support is provided in the repository so that the other dependent components can share and utilize this support. The major globalization goal for the repository is to ensure efficient storage of documents of heterogeneous character sets and languages, and to allow effective update, retrieval, and search on these documents.

How to Choose the Database Character Set for Oracle Files

In the repository, all metadata strings, such as the name or description of a document, are stored in the VARCHAR2 data type of the Oracle9i database. Strings stored in this data type are encoded in the database character set, which is specified when the database is created. The document itself, however, is unstructured data and is stored in one of the large object data types of the Oracle9i database, specifically the BLOB data type. The BLOB data type stores content as-is, avoiding any character set conversion on document content. The LONG and CLOB data types store content in the database character set, which requires character set conversion. Such conversions can compromise data integrity, because characters may be converted incorrectly or lost.

The full-text search index built on the document content is encoded in the database character set. When a document's content is indexed, the BLOB data is converted from the content's character set to the database character set for creation of the index text tokens. If the content's character set is not a subset of the database character set, the conversion will yield garbage tokens. For example, a database character set of ISO-8859-1 (Western European languages) will not be able to index correctly a Shift-JIS (Japanese) document. To be able to search content effectively, the character set of the documents stored by the users should be considered when selecting the database character set.
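The effect of converting content into a character set that cannot represent it can be sketched in a few lines of Python. This is only an illustration of character set conversion in general, not of Oracle Text internals; the `convert` helper and the sample text are assumptions made for the example.

```python
def convert(content: bytes, source_charset: str, target_charset: str) -> bytes:
    """Re-encode content, replacing characters the target cannot represent with '?'."""
    return content.decode(source_charset).encode(target_charset, errors="replace")


# Japanese text stored as-is in Shift-JIS, as BLOB content would be.
text = "日本語の文書"
content = text.encode("shift_jis")

# UTF-8 is a superset of the document's characters, so nothing is lost.
utf8_copy = convert(content, "shift_jis", "utf-8")

# ISO-8859-1 has no Japanese characters, so every character degrades.
latin1_copy = convert(content, "shift_jis", "iso-8859-1")

print(utf8_copy.decode("utf-8") == text)  # True
print(latin1_copy)                         # b'??????'
```

Tokens built from the second result would be meaningless, which is the situation the paragraph above describes.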

If your Oracle Files instance will contain multilingual documents, UTF-8 is the recommended database character set. UTF-8 supports characters defined in the Unicode standard. The Unicode standard solves the problem of many different languages in the same application or database. Unicode is a single, global character set which contains all major living scripts and conforms to international standards. Unicode provides a unique code value for every character, regardless of the platform, program, or language. UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding and a strict superset of ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in UTF-8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes. By using a Unicode-based file system, document content and metadata of different languages can be shared by users with different language preferences in one system.
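The byte widths described above can be checked directly. The following short Python sketch uses sample characters chosen for this example: one ASCII letter, one accented European letter, one CJK ideograph, and one supplementary character.

```python
def utf8_width(char: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(char.encode("utf-8"))


samples = {
    "A": "ASCII letter",
    "\u00e9": "accented European letter (U+00E9)",
    "\u65e5": "CJK ideograph (U+65E5)",
    "\U00020000": "supplementary ideograph (U+20000)",
}

for char, description in samples.items():
    print(f"{description}: {utf8_width(char)} byte(s)")

# ASCII text is byte-for-byte identical in UTF-8.
print("plain text".encode("utf-8") == "plain text".encode("ascii"))
```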

The Oracle9i database introduces the new character set, AL32UTF8. In Release 1, AL32UTF8 was the default character set for Unicode 3.0 deployment. In Release 2, AL32UTF8 is compliant with the Unicode 3.1 standard, which contains the supplementary characters, particularly additional Chinese, Japanese, and Korean ideographs. AL32UTF8 is the default character set of an Oracle9i database installation.

Note:

Oracle Files does not support an AL32UTF8 database because Oracle Text does not support Chinese, Japanese, and Korean lexers on an AL32UTF8 database. UTF-8 is the recommended database character set for a Unicode-based file system. If Oracle Files is installed in an AL32UTF8 database, Chinese, Japanese, and Korean documents will not be indexed and therefore will not be searchable.

The Oracle Files Configuration Assistant will fail in a Chinese, Japanese, or Korean locale against an AL32UTF8 database. This is because Oracle Text behaves differently when the database session language is initialized to an Asian language rather than American. JDBC initializes the database session language according to the locale of the running application, which in this case is the configuration tool.

How to Make Sure Documents Are Properly Indexed in Oracle Files

To support documents in different character sets and languages in a single file system, the repository associates two globalization attributes with each document: the character set and the language.

Character Set

The character set of a document is used in several situations. When the document content is rendered to a file, the character set of the document is used as the character encoding of the file. When the document is displayed in the browser, the character set of the document is set in the HTTP content-type header. Finally, when a full-text search index is built on a text document, Oracle Text uses the character set of the document to convert the data into the database character set before building the index. When a document's character set is updated, the content is reindexed.

If no character set is specified when a document is inserted, the repository uses the character set of the user's LibrarySession, which is stored in the Localizer object. This value is obtained from the user's PrimaryUserProfile when the LibrarySession is initialized.

Language

The language of a document is used as a criterion to limit the search for documents of a particular language. It is also used to build a full-text search index on the document with Oracle Text. Oracle Text's multilexer feature uses the language to identify the specific lexer to parse the document for searchable words. The language-specific lexers need to be defined and associated with a language before the index is built. They are defined as follows:

Table G-1 Language-Specific Lexers

Language	Lexer	Lexer Option
Brazilian Portuguese	BASIC_LEXER	BASE LETTER
Canadian French	BASIC_LEXER	BASE LETTER INDEX THEME
Danish	BASIC_LEXER	BASE LETTER DANISH ALTERNATE SPELLING
Dutch	BASIC_LEXER	BASE LETTER
Finnish	BASIC_LEXER	BASE LETTER
French	BASIC_LEXER	BASE LETTER INDEX THEME THEME LANGUAGE=FRENCH
German	BASIC_LEXER	BASE LETTER GERMAN ALTERNATE SPELLING
Italian	BASIC_LEXER	BASE LETTER
Japanese	JAPANESE_VGRAM_LEXER	N/A
Korean	KOREAN_LEXER	N/A
Latin American Spanish	BASIC_LEXER	BASE LETTER
Portuguese	BASIC_LEXER	BASE LETTER
Simplified Chinese	CHINESE_VGRAM_LEXER	N/A
Swedish	BASIC_LEXER	BASE LETTER SWEDISH ALTERNATE SPELLING
Traditional Chinese	CHINESE_VGRAM_LEXER	N/A
Others	BASIC_LEXER	INDEX THEME THEME LANGUAGE=ENGLISH INDEX TEXT

The BASIC_LEXER is used for single-byte languages that use white space as a word separator. Asian language lexers cannot rely on white space as a word separator. Instead, they use a V-gram algorithm to parse documents for searchable keys. Languages that Oracle Text does not support are parsed as English. Oracle Files uses the multilexer feature of Oracle Text: a global lexer containing German, Danish, Swedish, Japanese, Simplified Chinese, Traditional Chinese, and Korean sublexers.
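The following Python sketch illustrates why the language attribute matters for indexing. It is not Oracle Text's algorithm: `bigram_tokens` uses fixed two-character grams as a simplified stand-in for V-gram parsing, and the `SUB_LEXERS` mapping and all function names are assumptions made for this example. Only the lexer names are taken from the documentation.

```python
# Sublexers of the global lexer, keyed by language name (illustrative mapping).
SUB_LEXERS = {
    "German": "BASIC_LEXER",
    "Danish": "BASIC_LEXER",
    "Swedish": "BASIC_LEXER",
    "Japanese": "JAPANESE_VGRAM_LEXER",
    "Simplified Chinese": "CHINESE_VGRAM_LEXER",
    "Traditional Chinese": "CHINESE_VGRAM_LEXER",
    "Korean": "KOREAN_LEXER",
}
DEFAULT_LEXER = "BASIC_LEXER"
GRAM_LEXERS = {"JAPANESE_VGRAM_LEXER", "CHINESE_VGRAM_LEXER"}


def lexer_for(language: str) -> str:
    """Pick the sublexer for a language, falling back to the default lexer."""
    return SUB_LEXERS.get(language, DEFAULT_LEXER)


def whitespace_tokens(text: str) -> list:
    """Split on white space, as a lexer for space-delimited languages can."""
    return text.split()


def bigram_tokens(text: str) -> list:
    """Emit overlapping two-character grams (simplified, fixed-length grams)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def tokenize(text: str, language: str) -> list:
    """Tokenize text with the strategy that matches the language's lexer."""
    if lexer_for(language) in GRAM_LEXERS:
        return bigram_tokens(text)
    return whitespace_tokens(text)


print(tokenize("full text search", "German"))
print(tokenize("東京都に住む", "Japanese"))
```

Splitting the Japanese sample on white space would yield a single token for the whole sentence, so a search for a word inside it would not match. The gram-based split produces tokens that shorter queries can hit.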

If no language is specified upon insertion of a document, the repository determines a default language as follows.

If the character set has been set, the language can most likely be obtained from a 'best-guess' algorithm based on the character set value. For example, a document with a character set of Shift-JIS will most likely be in Japanese.
The default language is obtained from the Localizer of the user's LibrarySession. During initialization of the LibrarySession, the default language is obtained from the user's PrimaryUserProfile.
The defaults for both language and character set are specified by the Subscriber Administrator when a new user is created.
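The fallback order described above can be sketched as a small Python function. The `CHARSET_TO_LANGUAGE` table is a hypothetical stand-in for the best-guess algorithm, whose actual rules are not documented here, and `resolve_language` is a name invented for this example.

```python
from typing import Optional

# Hypothetical best-guess table: character sets that strongly imply a language.
CHARSET_TO_LANGUAGE = {
    "shift_jis": "Japanese",
    "euc-jp": "Japanese",
    "iso-2022-jp": "Japanese",
    "gb2312": "Simplified Chinese",
    "big5": "Traditional Chinese",
    "euc-kr": "Korean",
}


def resolve_language(explicit: Optional[str],
                     charset: Optional[str],
                     session_default: str) -> str:
    """Choose a document language: explicit value, then charset guess, then session default."""
    if explicit:
        return explicit
    if charset:
        guess = CHARSET_TO_LANGUAGE.get(charset.lower())
        if guess:
            return guess
    return session_default


print(resolve_language(None, "Shift_JIS", "American"))   # Japanese
print(resolve_language(None, "iso-8859-1", "French"))    # French
print(resolve_language("Korean", "Shift_JIS", "French")) # Korean
```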

Oracle Files identifies languages using Oracle NLS language abbreviations. See "Languages Supported In Oracle Files" for a list of Oracle Files-supported languages.

Service Configuration Properties

There are two service configuration properties that hold default character set and language values for Oracle Files Subscribers. The properties are:

IFS.SERVICE.DefaultCharacterSet
IFS.SERVICE.DefaultLanguage

These two properties are initialized by the Oracle Files Configuration Assistant and can later be modified through the Oracle Enterprise Manager Web site. The Oracle Files default character set should be the same as, or a subset of, the database character set. The character set should be specified using its IANA standard name. The language should be specified using its Oracle language name. See "Character Sets Supported in Oracle Files" and "Languages Supported In Oracle Files" for a list of Oracle Files-supported character sets and languages.
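One rough way to test the subset requirement outside the product is to check whether every character of the candidate default character set can be encoded in the database character set. The Python sketch below does this for single-byte candidate sets only; `is_single_byte_subset` is a name invented for this example, and Python codec names stand in for the database character set, which is an assumption rather than an Oracle facility.

```python
def is_single_byte_subset(candidate: str, database: str) -> bool:
    """Return True if every character defined in `candidate` can be encoded in `database`.

    Only meaningful when `candidate` is a single-byte character set.
    """
    for value in range(256):
        try:
            char = bytes([value]).decode(candidate)
        except UnicodeDecodeError:
            continue  # byte is undefined in the candidate set; nothing to check
        try:
            char.encode(database)
        except UnicodeEncodeError:
            return False  # the database set cannot represent this character
    return True


print(is_single_byte_subset("iso-8859-1", "utf-8"))       # True
print(is_single_byte_subset("iso-8859-1", "iso-8859-5"))  # False
```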

Oracle Files Protocols

Oracle Files does not support multibyte user names for certain protocols. Access through WebDAV (Web Folders and Oracle FileSync), HTTP, and SMB is not available for user names that contain multibyte characters. FTP allows multibyte user names. In addition, some protocols require that user passwords be in ASCII.

FTP

The standard FTP protocol does not define the character set of the file names or directory names that are usually passed as arguments of FTP commands. The FTP server is responsible for interpreting the byte sequence of the FTP commands. To allow users to access documents of different character sets and languages, the Oracle Files FTP server provides the following QUOTE commands:

Ftp> quote setcharencoding: Allows users to specify the character set for the FTP session. This character set specifies the character encoding to be used in subsequent FTP commands and the character set of the documents to be uploaded. The FTP protocol server converts FTP commands from this character encoding to Java String and vice versa. When the FTP session is first created, the FTP server uses the default character set of the session. The IANA naming standards should be used to specify the character set.
Ftp> quote showcharencoding: Displays the current character set of the FTP session. The character set is displayed in the IANA naming standards.
Ftp> quote setlanguage: Allows users to specify the language for the FTP session. The language of an FTP session is then associated with the documents that are uploaded. Oracle Text uses the language information to determine the appropriate lexer to use to index the document. When the FTP session is first created, the FTP server uses the default language of the session. Oracle language names should be used.
Ftp> quote showlanguage: Displays the current language of the FTP session. The language is displayed with the Oracle naming standard.
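Because FTP carries file names as raw bytes, the same bytes read differently depending on the character encoding assumed for the session. The Python sketch below shows the effect with a sample file name chosen for this example; it involves no FTP server.

```python
# A Japanese file name as a client using Shift-JIS would put it on the wire.
name = "報告書.txt"
wire_bytes = name.encode("shift_jis")

# Interpreted with the matching session encoding, the name is recovered.
as_shift_jis = wire_bytes.decode("shift_jis")

# Interpreted as ISO-8859-1, decoding still succeeds (every byte is valid),
# but the result is a different, unreadable name.
as_latin1 = wire_bytes.decode("iso-8859-1")

print(as_shift_jis == name)  # True
print(as_latin1 == name)     # False
print(len(wire_bytes))       # 10
```

The mismatch raises no error, which is why the session character encoding has to be set explicitly rather than detected.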

When a quote command is issued to change the character set or language of the FTP session, the FTP server updates the settings in the Localizer object of the current LibrarySession. Because quote commands cannot be issued until an FTP session is established, only user names in the FTP server's default character set, or a subset of it, can be used to log in to the FTP server. See Appendix F, "FTP Quote Command Reference," for more information about quote commands.

Users can specify the character sets and languages of their environments using standard command-line FTP clients.
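The same commands can be sent from a script. The sketch below uses Python's standard `ftplib`, whose `sendcmd` method sends a raw command line much as `quote` does in a command-line client. The argument syntax (command name followed by one value) is an assumption, since the exact form is defined in the quote command reference; `configure_session`, the host name, and the credentials in the usage comment are hypothetical.

```python
import ftplib


def configure_session(ftp, charset: str, language: str) -> list:
    """Set the session character encoding and language, returning the server replies.

    `ftp` is expected to be a logged-in ftplib.FTP instance. The argument
    syntax of both commands is assumed, not taken from the command reference.
    """
    replies = [
        ftp.sendcmd(f"setcharencoding {charset}"),
        ftp.sendcmd(f"setlanguage {language}"),
    ]
    # Encode subsequent commands and path names on the client side
    # in the same character set the server now expects.
    ftp.encoding = charset
    return replies


# Usage (not executed here; host and credentials are placeholders):
#   ftp = ftplib.FTP("files.example.com")
#   ftp.login("jsmith", "password")
#   configure_session(ftp, "Shift_JIS", "Japanese")
#   with open("report.txt", "rb") as source:
#       ftp.storbinary("STOR report.txt", source)
#   ftp.quit()
```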

SMB

The Server Message Block (SMB) protocol server implements the SMB protocol to allow mounting of Oracle Files as a disk drive in Microsoft Windows Explorer. Microsoft has included Unicode support for the SMB protocol since LanManager Version 0.12.

The SMB protocol does not allow users to pass character set and language information to the server. The session defaults are used for documents inserted into the repository through the SMB protocol.

Character Sets Supported in Oracle Files

The following table summarizes the character sets supported in Oracle Files.

Table G-2 Character Sets Supported in Oracle Files

Language	IANA Preferred MIME Charset	IANA Additional Aliases	Java Encodings	Oracle Charset
Arabic (ISO)	iso-8859-6	ISO_8859-6:1987, iso-ir-127, ISO_8859-6, ECMA-114, ASMO-708, arabic, csISOLatinArabic	ISO8859_6	AR8ISO8859P6
Arabic (Windows)	windows-1256	none	Cp1256	AR8MSWIN1256
Baltic (ISO)	iso-8859-4	csISOLatin4, iso-ir-110, ISO_8859-4, ISO_8859-4:1988, l4, latin4	ISO8859_4	NEE8ISO8859P4
Baltic (Windows)	windows-1257	none	Cp1257	BLT8MSWIN1257
Central European (DOS)	ibm852	cp852, 852, csPcp852	Cp852	EE8PC852
Central European (ISO)	iso-8859-2	csISOLatin2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2	ISO8859_2	EE8ISO8859P2
Central European (Windows)	windows-1250	x-cp1250	Cp1250	EE8MSWIN1250
Chinese	iso-2022-cn (not defined by IANA, but used in MIME documents)	csISO2022CN	ISO2022CN	ISO2022-CN
Chinese Simplified (GB2312)	gb2312	chinese, csGB2312, csISO58GB231280, GB2312, GB_2312-80, iso-ir-58	EUC_CN	ZHS16CGB231280
Chinese Simplified (Windows)	GBK	windows-936	GBK	ZHS16GBK
Chinese Traditional	big5	csbig5, x-x-big5	Big5	ZHT16BIG5
Chinese Traditional	windows-950	none	MS950	ZHT16MSWIN950
Chinese Traditional (EUC-TW)	EUC-TW	none	EUC_TW	ZHT32EUC
Cyrillic (DOS)	ibm866	cp866, 866, csIBM866	Cp866	RU8PC866
Cyrillic (ISO)	iso-8859-5	csISOLatinCyrillic, cyrillic, iso-ir-144, ISO_8859-5, ISO_8859-5:1988	ISO8859_5	CL8ISO8859P5
Cyrillic (KOI8-R)	koi8-r	csKOI8R, koi	KOI8_R	CL8KOI8R
Cyrillic Alphabet (Windows)	windows-1251	x-cp1251	Cp1251	CL8MSWIN1251
Greek (ISO)	iso-8859-7	csISOLatinGreek, ECMA-118, ELOT_928, greek, greek8, iso-ir-126, ISO_8859-7, ISO_8859-7:1987	ISO8859_7	EL8ISO8859P7
Greek (Windows)	windows-1253	none	Cp1253	EL8MSWIN1253
Hebrew (ISO)	iso-8859-8	csISOLatinHebrew, hebrew, iso-ir-138, ISO_8859-8, visual, ISO-8859-8 Visual, ISO_8859-8:1988	ISO8859_8	IW8ISO8859P8
Hebrew (Windows)	windows-1255	none	Cp1255	IW8MSWIN1255
Japanese (JIS)	iso-2022-jp	csISO2022JP	ISO2022JP	ISO2022-JP
Japanese (EUC)	euc-jp	csEUCPkdFmtJapanese, Extended_UNIX_Code_Packed_Format_for_Japanese, x-euc, x-euc-jp	EUC_JP	JA16EUC
Japanese (Shift-JIS)	shift_jis	csShiftJIS, csWindows31J, ms_Kanji, shift-jis, x-ms-cp932, x-sjis	MS932	JA16SJIS
Korean	ks_c_5601-1987	csKSC56011987, korean, ks_c_5601, euc-kr, csEUCKR	EUC_KR	KO16KSC5601
Korean (ISO)	iso-2022-kr	csISO2022KR	ISO2022KR	ISO2022-KR
Korean (Windows)	windows-949	none	MS949	KO16MSWIN949
South European (ISO)	iso-8859-3	ISO_8859-3, ISO_8859-3:1988, iso-ir-109, latin3, l3, csISOLatin3	ISO8859_3	SE8ISO8859P3
Thai	TIS-620	windows-874	TIS620	TH8TISASCII
Turkish (Windows)	windows-1254	none	Cp1254	TR8MSWIN1254
Turkish (ISO)	iso-8859-9	latin5, l5, csISOLatin5, ISO_8859-9, iso-ir-148, ISO_8859-9:1989	ISO8859_9	WE8ISO8859P9
Universal (UTF-8)	utf-8	unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8	UTF8	UTF8
Vietnamese (Windows)	windows-1258	none	Cp1258	VN8MSWIN1258
Western Alphabet	iso-8859-1	cp819, ibm819, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, latin1, l1, csISOLatin1	ISO8859_1	WE8ISO8859P1
Western Alphabet (DOS)	ibm850	cp850, 850, csIBM850	Cp850	WE8PC850
Western Alphabet (Windows)	windows-1252	x-ansi	Cp1252	WE8MSWIN1252
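A table like this is typically used to translate between naming schemes. The Python sketch below copies four rows from Table G-2 and resolves an IANA name or alias, case-insensitively, to the corresponding Java encoding and Oracle character set. The `CharsetRow` type and `find_charset` function are names invented for this example, and only a subset of aliases is included.

```python
from typing import NamedTuple, Optional


class CharsetRow(NamedTuple):
    iana: str       # IANA preferred MIME charset
    aliases: tuple  # additional IANA aliases (subset)
    java: str       # Java encoding name
    oracle: str     # Oracle character set name


ROWS = (
    CharsetRow("iso-8859-1", ("latin1", "l1", "cp819", "ibm819"), "ISO8859_1", "WE8ISO8859P1"),
    CharsetRow("shift_jis", ("csShiftJIS", "x-sjis", "ms_Kanji"), "MS932", "JA16SJIS"),
    CharsetRow("koi8-r", ("csKOI8R", "koi"), "KOI8_R", "CL8KOI8R"),
    CharsetRow("utf-8", ("unicode-1-1-utf-8",), "UTF8", "UTF8"),
)

# Index every preferred name and alias in lower case for lookup.
_INDEX = {}
for row in ROWS:
    for name in (row.iana, *row.aliases):
        _INDEX[name.lower()] = row


def find_charset(name: str) -> Optional[CharsetRow]:
    """Look up a row by IANA name or alias, ignoring case and surrounding spaces."""
    return _INDEX.get(name.strip().lower())


row = find_charset("Latin1")
print(row.iana, row.java, row.oracle)
```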

Languages Supported In Oracle Files

The following table summarizes the languages supported in Oracle Files.

Table G-3 Languages Supported in Oracle Files

Oracle Language Name	Java Locale	ISO Locale
Arabic	ar	ar
Bengali	bn	bn
Brazilian Portuguese	pt_BR	pt-br
Bulgarian	bg	bg
Canadian French	fr_CA	fr-ca
Catalan	ca	ca
Croatian	hr	hr
Czech	cs	cs
Danish	da	da
Dutch	nl	nl
Egyptian	ar_EG	ar-eg
American	en	en
English	en_GB	en-gb
Estonian	et	et
Finnish	fi	fi
French	fr	fr
German	de	de
Greek	el	el
Hebrew	he	he
Hungarian	hu	hu
Icelandic	is	is
Indonesian	in	id
Italian	it	it
Japanese	ja	ja
Korean	ko	ko
Latin American Spanish	es	es
Latvian	lv	lv
Lithuanian	lt	lt
Malay	ms	ms
Mexican Spanish	es_MX	es-mx
Norwegian	no	no
Polish	pl	pl
Portuguese	pt	pt
Romanian	ro	ro
Russian	ru	ru
Simplified Chinese	zh_CN	zh-cn
Slovak	sk	sk
Slovenian	sl	sl
Spanish	es_ES	es-es
Swedish	sv	sv
Thai	th	th
Traditional Chinese	zh_TW	zh-tw
Turkish	tr	tr
Ukrainian	uk	uk
Vietnamese	vi	vi
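For most rows in Table G-3, the ISO locale is the Java locale with the underscore replaced by a hyphen and the country part lowercased. The Python sketch below applies that pattern to a few rows copied from the table. The function names are invented for this example, and the rule is an observation about the table, not a documented Oracle Files conversion.

```python
# A few rows from Table G-3: Oracle language name -> Java locale.
JAVA_LOCALES = {
    "Brazilian Portuguese": "pt_BR",
    "Japanese": "ja",
    "Simplified Chinese": "zh_CN",
    "Traditional Chinese": "zh_TW",
}


def iso_locale(java_locale: str) -> str:
    """Convert a Java-style locale (pt_BR) to the hyphenated lower-case form (pt-br)."""
    return java_locale.replace("_", "-").lower()


def iso_locale_for(language: str) -> str:
    """Return the ISO-style locale for an Oracle language name in JAVA_LOCALES."""
    return iso_locale(JAVA_LOCALES[language])


for language in JAVA_LOCALES:
    print(language, JAVA_LOCALES[language], iso_locale_for(language))
```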