Oracle Files Administration Guide
9.0.3

Part Number A97358-01

G
Globalization Support

Globalization Support Overview

Oracle Files globalization support enables users to store and search documents of heterogeneous character sets and languages in a single Oracle Files instance. It also provides the foundation for developing multilingual applications that can be accessed and run from anywhere in the world simultaneously. The user interface is rendered, and data is processed, in each user's native language and locale preferences. The globalization infrastructure ensures that the resource strings, error messages, sort order, date, time, monetary, numeric, and calendar conventions automatically adapt to any native language and locale.

The Repository

The repository is the implementation of the core of Oracle Files, on which the protocol servers and applications are built. Globalization support is provided in the repository so that the other dependent components can share and utilize this support. The major globalization goal for the repository is to ensure efficient storage of documents of heterogeneous character sets and languages, and to allow effective update, retrieval, and search on these documents.

How to Choose the Database Character Set for Oracle Files

In the repository, all metadata strings, such as the name of the document, description, etc., are stored in the VARCHAR2 data type of the Oracle9i database. Strings stored in this data type are encoded in the database character set specified when a database is created. The document itself, however, is unstructured data and stored in one of the large object data types of the Oracle9i database, particularly the BLOB data type. The BLOB data type stores content as-is, avoiding any character set conversion on document content. The LONG and CLOB data types store content in the database character set, which requires character set conversion. Conversions can compromise the data integrity and have the potential to convert incorrectly or lose characters.

The full-text search index built on the document content is encoded in the database character set. When a document's content is indexed, the BLOB data is converted from the content's character set to the database character set for creation of the index text tokens. If the content's character set is not a subset of the database character set, the conversion will yield garbage tokens. For example, a database character set of ISO-8859-1 (Western European languages) will not be able to index correctly a Shift-JIS (Japanese) document. To be able to search content effectively, the character set of the documents stored by the users should be considered when selecting the database character set.
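The effect of a mismatched database character set can be illustrated outside the database. The following is a minimal Python sketch, not Oracle Text code; the helper name `survives_conversion` is hypothetical, and Python's codecs stand in for the database conversion.

```python
def survives_conversion(text: str, database_charset: str) -> bool:
    """Return True if every character of text exists in database_charset."""
    try:
        text.encode(database_charset)
        return True
    except UnicodeEncodeError:
        return False


japanese = "\u65e5\u672c\u8a9e"            # the word "Nihongo" in kanji
stored = japanese.encode("shift_jis")      # bytes as they would sit in a BLOB

# Reading Shift-JIS bytes as if they were ISO-8859-1 never raises an error;
# it silently yields meaningless characters, the "garbage tokens" case.
misread = stored.decode("iso-8859-1")

print(survives_conversion(japanese, "iso-8859-1"))  # Japanese is not in Latin-1
print(survives_conversion(japanese, "utf-8"))       # Unicode covers it
print(misread == japanese)                          # the misread text differs
```

The same check against a Unicode target succeeds, which is the reason a Unicode database character set is suggested for multilingual content.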

If your Oracle Files instance will contain multilingual documents, UTF-8 is the recommended database character set. UTF-8 supports characters defined in the Unicode standard. The Unicode standard solves the problem of many different languages in the same application or database. Unicode is a single, global character set which contains all major living scripts and conforms to international standards. Unicode provides a unique code value for every character, regardless of the platform, program, or language. UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding and a strict superset of ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in UTF-8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes. By using a Unicode-based file system, document content and metadata of different languages can be shared by users with different language preferences in one system.
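The variable width of UTF-8 is easy to confirm with a short Python sketch. The sample characters are chosen only for illustration, and the helper name `utf8_width` is hypothetical.

```python
def utf8_width(character: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(character.encode("utf-8"))


samples = [
    ("A", "ASCII letter"),
    ("\u00e9", "Western European letter (e with acute accent)"),
    ("\u65e5", "CJK ideograph"),
    ("\U00020000", "supplementary ideograph"),
]

for character, description in samples:
    print(f"U+{ord(character):04X} {description}: {utf8_width(character)} byte(s)")
```

When sizing VARCHAR2 metadata columns in a UTF-8 database, this means a name of 10 Asian characters can occupy about 30 bytes.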

The Oracle9i database introduces the new character set, AL32UTF-8. In Release 1, AL32UTF-8 was the default character set for Unicode 3.0 deployment. In Release 2, AL32UTF-8 is conformant to the latest Unicode 3.1 standard, which contains the supplementary characters, particularly additional Chinese, Japanese, and Korean ideographs. AL32UTF-8 is the default character set of an Oracle9i database installation.


Note:

Oracle Files does not support an AL32UTF-8 database because Oracle Text does not support Chinese, Japanese, and Korean lexers on an AL32UTF-8 database. UTF-8 is the recommended database character set for a Unicode-based file system. If Oracle Files is installed in an AL32UTF-8 database, Chinese, Japanese, and Korean documents will not be indexed and, therefore, will not be searchable.

The Oracle Files Configuration Assistant will fail in a Chinese, Japanese, or Korean locale against an AL32UTF-8 database. This is because Oracle Text behaves differently when the database session language is initialized to an Asian language as opposed to American. JDBC initializes the database session language according to the locale of the running application, which in this case is the configuration tool.




How to Make Sure Documents Are Properly Indexed in Oracle Files

To support documents in different character sets and languages in a single file system, the repository associates two globalization attributes with each document: the character set and the language.

Character Set

The character set of a document is used in several situations. When the document content is rendered to a file, the character set of the document is used as the character encoding of the file. When the document is displayed in the browser, the character set of the document is set in the HTTP content-type header. Finally, when a full-text search index is built on a text document, Oracle Text uses the character set of the document to convert the data into the database character set before building the index. When the character set of a document is updated, the content is reindexed.

If no character set is specified upon insertion of a document, the repository determines a default character set as follows: The character set of the user's LibrarySession stored in the Localizer object is first used. This is obtained from the user's PrimaryUserProfile information at initialization of the user's LibrarySession. During initialization, if a character set default value is not found in the user's PrimaryUserProfile, the default is obtained from the Oracle Files systemwide default character set, which is specified in the service configuration property, IFS.SERVICE.DefaultCharacterSet.
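The fallback order described above can be sketched as a simple chain. This is an illustrative Python sketch, not the Oracle Files Java API; the function and parameter names are hypothetical, and only the property name IFS.SERVICE.DefaultCharacterSet comes from this guide.

```python
from typing import Optional


def default_character_set(
    explicit: Optional[str],
    profile_default: Optional[str],
    service_default: Optional[str],
) -> str:
    """Pick the first character set that is set, in order of precedence.

    explicit         -- value supplied when the document is inserted
    profile_default  -- value from the user's PrimaryUserProfile (held in the
                        session's Localizer)
    service_default  -- value of IFS.SERVICE.DefaultCharacterSet
    """
    for candidate in (explicit, profile_default, service_default):
        if candidate:
            return candidate
    raise ValueError("no character set available at any level")


print(default_character_set(None, None, "utf-8"))
print(default_character_set(None, "shift_jis", "utf-8"))
print(default_character_set("euc-kr", "shift_jis", "utf-8"))
```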

The various naming conventions for character sets are not trivial. Oracle Files Java API standardizes on the Java naming convention for character sets. Any GUI end-user application should expose the more publicly known IANA naming convention. See Table G-2, " Character Sets Supported in Oracle Files" for character set names.
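Because the same character set has a different name under each convention, an application typically needs a translation table. The Python sketch below shows one possible shape for such a table, using a handful of rows copied from Table G-2; the structure and the function name `names_for` are assumptions for illustration, not part of Oracle Files.

```python
# IANA preferred name -> (Java encoding, Oracle character set), from Table G-2.
CHARSET_NAMES = {
    "iso-8859-1": ("ISO8859_1", "WE8ISO8859P1"),
    "windows-1252": ("Cp1252", "WE8MSWIN1252"),
    "shift_jis": ("MS932", "JA16SJIS"),
    "euc-jp": ("EUC_JP", "JA16EUC"),
    "utf-8": ("UTF8", "UTF8"),
}

# A few IANA aliases from Table G-2, mapped to the preferred name.
ALIASES = {
    "latin1": "iso-8859-1",
    "x-ansi": "windows-1252",
    "x-sjis": "shift_jis",
    "x-euc-jp": "euc-jp",
}


def names_for(iana_name: str) -> tuple:
    """Return (Java encoding, Oracle charset) for an IANA name or alias."""
    key = iana_name.strip().lower()
    key = ALIASES.get(key, key)
    return CHARSET_NAMES[key]


print(names_for("Shift_JIS"))
print(names_for("latin1"))
```

IANA names are case-insensitive, so the lookup normalizes case before matching.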

Language

The language of a document is used mainly in two ways. First, it serves as a criterion to limit a search to documents in a particular language. The more significant usage, however, is for building a full-text search index on the document with Oracle Text. Oracle Text's multilexer feature uses the language to identify the specific lexer to parse the document for searchable words. The language-specific lexers need to be defined and associated with a language before the index is built. They are defined as follows:

Table G-1  Language-Specific Lexers
Language | Lexer | Lexer Option
Brazilian Portuguese | BASIC_LEXER | BASE LETTER
Canadian French | BASIC_LEXER | BASE LETTER, INDEX THEME
Danish | BASIC_LEXER | BASE LETTER, DANISH ALTERNATE SPELLING
Dutch | BASIC_LEXER | BASE LETTER
Finnish | BASIC_LEXER | BASE LETTER
French | BASIC_LEXER | BASE LETTER, INDEX THEME, THEME LANGUAGE=FRENCH
German | BASIC_LEXER | BASE LETTER, GERMAN ALTERNATE SPELLING
Italian | BASIC_LEXER | BASE LETTER
Japanese | JAPANESE_VGRAM_LEXER |
Korean | KOREAN_LEXER |
Latin American Spanish | BASIC_LEXER | BASE LETTER
Portuguese | BASIC_LEXER | BASE LETTER
Simplified Chinese | CHINESE_VGRAM_LEXER |
Swedish | BASIC_LEXER | BASE LETTER, SWEDISH ALTERNATE SPELLING
Traditional Chinese | CHINESE_VGRAM_LEXER |
Others | BASIC_LEXER | INDEX THEME, THEME LANGUAGE=ENGLISH, INDEX TEXT

The BASIC_LEXER is used for single-byte languages that use white space as a word separator. Asian language lexers cannot use white space as a word separator. Instead, they use a V-gram algorithm to parse the documents for searchable keys. Languages that are not supported by Oracle Text are parsed as English. Oracle Files uses the multilexer feature of Oracle Text. It is a global lexer containing German, Danish, Swedish, Japanese, Simplified Chinese, Traditional Chinese, and Korean sublexers.
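Conceptually, the multilexer acts as a lookup from a document's language to a sublexer, with a default for everything else. The following Python sketch models only that lookup, using lexer names from Table G-1; it is not how Oracle Text is configured, and the names `SUBLEXERS` and `lexer_for` are hypothetical.

```python
# Languages that need a dedicated lexer type, per Table G-1.
SUBLEXERS = {
    "Japanese": "JAPANESE_VGRAM_LEXER",
    "Korean": "KOREAN_LEXER",
    "Simplified Chinese": "CHINESE_VGRAM_LEXER",
    "Traditional Chinese": "CHINESE_VGRAM_LEXER",
}

# White-space-delimited languages, and unsupported ones, use the basic lexer.
DEFAULT_LEXER = "BASIC_LEXER"


def lexer_for(language: str) -> str:
    """Return the lexer type that would parse a document in this language."""
    return SUBLEXERS.get(language, DEFAULT_LEXER)


for language in ("German", "Japanese", "Traditional Chinese", "Klingon"):
    print(f"{language}: {lexer_for(language)}")
```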

If no language is specified upon insertion of a document, the repository determines a default language as follows.

  1. If the character set has been set, the language can most likely be obtained from a "best-guess" algorithm based on the character set value. For example, a document with a character set of Shift-JIS will most likely be in Japanese.
  2. Otherwise, the default language is obtained from the Localizer of the user's LibrarySession. During initialization of the LibrarySession, the default language is obtained from the user's PrimaryUserProfile. The last resort is the Oracle Files system-wide default language, which is specified in the service configuration property IFS.SERVICE.DefaultLanguage.
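The language fallback can be sketched in the same way as the character set fallback, with one extra step for the character-set-based guess. This is a hedged Python sketch: the guess table below is an assumption made for illustration (the guide does not list the actual best-guess rules), and the function and parameter names are hypothetical.

```python
from typing import Optional

# Assumed guesses: character sets that are used for essentially one language.
CHARSET_TO_LANGUAGE = {
    "shift_jis": "Japanese",
    "euc-jp": "Japanese",
    "iso-2022-jp": "Japanese",
    "iso-2022-kr": "Korean",
    "big5": "Traditional Chinese",
    "gb2312": "Simplified Chinese",
}


def default_language(
    explicit: Optional[str],
    character_set: Optional[str],
    session_language: Optional[str],
    service_default: str,
) -> str:
    """Resolve a document language in the order described in the text."""
    if explicit:
        return explicit
    if character_set:
        guess = CHARSET_TO_LANGUAGE.get(character_set.lower())
        if guess:
            return guess
    # session_language stands for the Localizer value, which was itself
    # initialized from the PrimaryUserProfile; service_default stands for
    # IFS.SERVICE.DefaultLanguage.
    return session_language or service_default


print(default_language(None, "Shift_JIS", "German", "American"))
print(default_language(None, "utf-8", "German", "American"))
print(default_language(None, None, None, "American"))
```

A multilingual character set such as UTF-8 gives no hint about the language, so the sketch falls through to the session and service defaults in that case.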

Only one naming convention is used for languages: the Oracle NLS language abbreviation. See Table G-3, "Languages Supported in Oracle Files" for a list of Oracle Files-supported languages.

Service Configuration Properties

There are two service configuration properties that hold default character set and language values for the Oracle Files instance. The properties are IFS.SERVICE.DefaultCharacterSet and IFS.SERVICE.DefaultLanguage.

These two properties are initialized with the Oracle Files Configuration Assistant and can be modified later through the Oracle Files Manager. The Oracle Files default character set should be the same as, or a subset of, the database character set. The character set should be specified in accordance with the IANA standard naming convention. The language should be specified in accordance with Oracle naming for languages. See Table G-2, "Character Sets Supported in Oracle Files" and Table G-3, "Languages Supported in Oracle Files" for a list of Oracle Files-supported character sets and languages.

How to Search for Multilingual Documents

To accurately search for a document based on linguistic characteristics, Oracle Text needs to know the language of the string to be searched. In this regard, the Oracle Files Java API provides a new method to the oracle.ifs.bean.search class, namely search.open(String language), to allow applications to specify the language of the search string.

When a language is specified to the search.open() method to indicate a language-sensitive search, the Oracle Files repository issues the following SQL statement to the database to alter the NLS_LANGUAGE session parameter before issuing the SELECT statement to start the context search.

ALTER SESSION SET NLS_LANGUAGE=<nls_language> 
 

Oracle Text looks at the NLS_LANGUAGE variable and determines the language on which the search string should be parsed. After the search has been completed or the search.close() method is called, the Oracle Files repository will issue another ALTER SESSION SQL command to change the NLS_LANGUAGE session parameter back to its original value.

A query is parsed using the sublexer appropriate to the database session language. If the database session language is German, for example, the contains query gets parsed using the German sublexer preferences.
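The set-then-restore sequence around a language-sensitive search can be modeled as follows. This Python sketch uses a stand-in session object that only records statements; `FakeSession`, `search_language`, and the SELECT text (including the table and column names) are hypothetical and do not reflect the actual repository code or schema.

```python
from contextlib import contextmanager

PREFIX = "ALTER SESSION SET NLS_LANGUAGE="


class FakeSession:
    """Stand-in for a database session: records SQL instead of running it."""

    def __init__(self, nls_language: str = "AMERICAN") -> None:
        self.nls_language = nls_language
        self.statements = []

    def execute(self, sql: str) -> None:
        self.statements.append(sql)
        if sql.startswith(PREFIX):
            self.nls_language = sql[len(PREFIX):]


@contextmanager
def search_language(session: FakeSession, language: str):
    """Switch NLS_LANGUAGE for one search and always restore the old value."""
    original = session.nls_language
    session.execute(PREFIX + language)
    try:
        yield session
    finally:
        session.execute(PREFIX + original)


session = FakeSession()
with search_language(session, "GERMAN"):
    # Hypothetical context query; it would be parsed with the German sublexer.
    session.execute("SELECT id FROM docs WHERE CONTAINS(content, 'Haus') > 0")

for statement in session.statements:
    print(statement)
```

Restoring the value in a `finally` clause mirrors the requirement that the session language returns to its original setting even when the search fails or is closed early.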

Oracle Files Protocols

Oracle Files does not support multi-byte user names for certain protocols. Access through WebDAV (Web Folders and File Sync), HTTP, and SMB is not available for user names that contain multi-byte characters. FTP allows multi-byte user names.

FTP

The standard FTP protocol does not define the character set of the file names or directory names that are usually passed as arguments of FTP commands. The FTP server is responsible for interpreting the byte sequence of the FTP commands. To allow users to access documents of different character sets and languages, the Oracle Files FTP server provides the following QUOTE commands to support this.

When a quote command is issued to change the character set or language of the FTP session, the FTP server actually updates the settings in the Localizer object of the current LibrarySession. Consequently, because quote commands cannot be issued until an FTP session is established, only user names in the FTP server's default character set, or a subset of it, can be used to log in to the FTP server.

Users can specify the character sets and languages of their environments using standard command-line FTP clients. Browser-based FTP clients, such as Internet Explorer or Netscape, do not allow quote commands to be issued. In that case, the FTP session defaults are used.

SMB

The Server Message Block (SMB) protocol server implements the SMB protocol to allow mounting of Oracle Files as a disk drive in Microsoft Windows Explorer. Microsoft has included Unicode support for the SMB protocol since LanManager Version 0.12.

The SMB protocol does not allow users to pass the character set and language information to the server. The session defaults are used for documents inserted into the repository through the SMB protocol.

Table G-2  Character Sets Supported in Oracle Files
Language | IANA Preferred MIME Charset | IANA Additional Aliases | Java Encodings | Oracle Charset
Arabic (ISO) | iso-8859-6 | ISO_8859-6:1987, iso-ir-127, ISO_8859-6, ECMA-114, ASMO-708, arabic, csISOLatinArabic | ISO8859_6 | AR8ISO8859P6
Arabic (Windows) | windows-1256 | | Cp1256 | AR8MSWIN1256
Baltic (ISO) | iso-8859-4 | csISOLatin4, iso-ir-110, ISO_8859-4, ISO_8859-4:1988, l4, latin4 | ISO8859_4 | NEE8ISO8859P4
Baltic (Windows) | windows-1257 | | Cp1257 | BLT8MSWIN1257
Central European (DOS) | ibm852 | cp852, 852, csPcp852 | Cp852 | EE8PC852
Central European (ISO) | iso-8859-2 | csISOLatin2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2 | ISO8859_2 | EE8ISO8859P2
Central European (Windows) | windows-1250 | x-cp1250 | Cp1250 | EE8MSWIN1250
Chinese Simplified (GB2312) | gb2312 | chinese, csGB2312, csISO58GB231280, GB2312, GB_2312-80, iso-ir-58 | EUC_CN | ZHS16CGB231280
Chinese Simplified (Windows) | GBK | windows-936 | GBK | ZHS16GBK
Chinese Traditional | big5 | csbig5, x-x-big5 | Big5 | ZHT16BIG5
Chinese Traditional | windows-950 | | MS950 | ZHT16MSWIN950
Chinese | iso-2022-cn (not defined in IANA, but used in MIME documents) | csISO2022CN | ISO2022CN | ISO2022-CN
Chinese Traditional (EUC-TW) | EUC-TW | | EUC_TW | ZHT32EUC
Cyrillic (DOS) | ibm866 | cp866, 866, csIBM866 | Cp866 | RU8PC866
Cyrillic (ISO) | iso-8859-5 | csISOLatinCyrillic, cyrillic, iso-ir-144, ISO_8859-5, ISO_8859-5:1988 | ISO8859_5 | CL8ISO8859P5
Cyrillic (KOI8-R) | koi8-r | csKOI8R, koi | KOI8_R | CL8KOI8R
Cyrillic Alphabet (Windows) | windows-1251 | x-cp1251 | Cp1251 | CL8MSWIN1251
Greek (ISO) | iso-8859-7 | csISOLatinGreek, ECMA-118, ELOT_928, greek, greek8, iso-ir-126, ISO_8859-7, ISO_8859-7:1987 | ISO8859_7 | EL8ISO8859P7
Greek (Windows) | windows-1253 | | Cp1253 | EL8MSWIN1253
Hebrew (ISO) | iso-8859-8 | csISOLatinHebrew, hebrew, iso-ir-138, ISO_8859-8, visual, ISO-8859-8 Visual, ISO_8859-8:1988 | ISO8859_8 | IW8ISO8859P8
Hebrew (Windows) | windows-1255 | | Cp1255 | IW8MSWIN1255
Japanese (JIS) | iso-2022-jp | csISO2022JP | ISO2022JP | ISO2022-JP
Japanese (EUC) | euc-jp | csEUCPkdFmtJapanese, Extended_UNIX_Code_Packed_Format_for_Japanese, x-euc, x-euc-jp | EUC_JP | JA16EUC
Japanese (Shift-JIS) | shift_jis | csShiftJIS, csWindows31J, ms_Kanji, shift-jis, x-ms-cp932, x-sjis | MS932 | JA16SJIS
Korean | ks_c_5601-1987 | csKSC56011987, korean, ks_c_5601, euc-kr, csEUCKR | EUC_KR | KO16KSC5601
Korean (ISO) | iso-2022-kr | csISO2022KR | ISO2022KR | ISO2022-KR
Korean (Windows) | windows-949 | | MS949 | KO16MSWIN949
South European (ISO) | iso-8859-3 | ISO_8859-3, ISO_8859-3:1988, iso-ir-109, latin3, l3, csISOLatin3 | ISO8859_3 | SE8ISO8859P3
Thai | TIS-620 | windows-874 | TIS620 | TH8TISASCII
Turkish (Windows) | windows-1254 | | Cp1254 | TR8MSWIN1254
Turkish (ISO) | iso-8859-9 | latin5, l5, csISOLatin5, ISO_8859-9, iso-ir-148, ISO_8859-9:1989 | ISO8859_9 | WE8ISO8859P9
Universal (UTF-8) | utf-8 | unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8 | UTF8 | UTF8
Vietnamese (Windows) | windows-1258 | | Cp1258 | VN8MSWIN1258
Western Alphabet (Windows) | windows-1252 | x-ansi | Cp1252 | WE8MSWIN1252
Western Alphabet | iso-8859-1 | cp819, ibm819, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, latin1, l1, csISOLatin1 | ISO8859_1 | WE8ISO8859P1
Western Alphabet (DOS) | ibm850 | cp850, 850, csIBM850 | Cp850 | WE8PC850

Table G-3  Languages Supported in Oracle Files
Oracle Language Name | Java Locale | ISO Locale
Arabic | ar | ar
Bengali | bn | bn
Brazilian Portuguese | pt_BR | pt-br
Bulgarian | bg | bg
Canadian French | fr_CA | fr-CA
Catalan | ca | ca
Croatian | hr | hr
Czech | cs | cs
Danish | da | da
Dutch | nl | nl
Egyptian | ar_EG | ar-eg
American | en | en
English | en_GB | en-gb
Estonian | et | et
Finnish | fi | fi
French | fr | fr
German | de | de
Greek | el | el
Hebrew | he | he
Hungarian | hu | hu
Icelandic | is | is
Indonesian | id | in
Italian | it | it
Japanese | ja | ja
Korean | ko | ko
Latin American Spanish | es | es
Latvian | lv | lv
Lithuanian | lt | lt
Malay | ms | ms
Mexican Spanish | es_MX | es-mx
Norwegian | no | no
Polish | pl | pl
Portuguese | pt | pt
Romanian | ro | ro
Russian | ru | ru
Simplified Chinese | zh_CN | zh-cn
Slovak | sk | sk
Slovenian | sl | sl
Spanish | es_ES | es-es
Swedish | sv | sv
Thai | th | th
Traditional Chinese | zh_TW | zh-tw
Turkish | tr | tr
Ukrainian | uk | uk
Vietnamese | vi | vi