Oracle Files Administrator's Guide Release 2 (9.0.4.1) Part Number B10872-01
This appendix provides information about Oracle Files globalization support. Topics include:
Oracle Files globalization support enables users to store and search documents of heterogeneous character sets and languages in a single Oracle Files instance. The globalization infrastructure ensures that the resource strings, error messages, sort order, date, time, numeric, and calendar conventions automatically adapt to any native language and locale.
The repository is the implementation of the core of Oracle Files, on which the protocol servers and applications are built. Globalization support is provided in the repository so that the other dependent components can share and utilize this support. The major globalization goal for the repository is to ensure efficient storage of documents of heterogeneous character sets and languages, and to allow effective update, retrieval, and search on these documents.
In the repository, all metadata strings, such as the document name and description, are stored in the VARCHAR2 data type of the Oracle9i database. Strings stored in this data type are encoded in the database character set specified when the database is created. The document itself, however, is unstructured data and is stored in one of the large object data types of the Oracle9i database, specifically the BLOB data type. The BLOB data type stores content as-is, avoiding any character set conversion on document content. The LONG and CLOB data types, by contrast, store content in the database character set, which requires character set conversion. Such conversions can compromise data integrity, because characters may be converted incorrectly or lost.
The full-text search index built on the document content is encoded in the database character set. When a document's content is indexed, the BLOB data is converted from the content's character set to the database character set to create the index text tokens. If the content's character set is not a subset of the database character set, the conversion yields garbage tokens. For example, a database whose character set is ISO-8859-1 (Western European languages) cannot correctly index a Shift-JIS (Japanese) document. To search content effectively, consider the character sets of the documents that users will store when selecting the database character set.
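The subset requirement described above can be illustrated with a short Python sketch. It is not Oracle Text's conversion code; the function name `can_index` and the sample text are assumptions made for illustration. It only checks whether every character of a document survives conversion into a given database character set.

```python
# A Japanese document as it would sit in a BLOB: raw Shift-JIS bytes.
sample_text = "日本語の文書"
blob = sample_text.encode("shift_jis")


def can_index(content: bytes, content_charset: str, db_charset: str) -> bool:
    """Return True if every character survives conversion to db_charset.

    Hypothetical helper: it mimics the 'content character set must be a
    subset of the database character set' rule, nothing more.
    """
    decoded = content.decode(content_charset)
    try:
        decoded.encode(db_charset)
    except UnicodeEncodeError:
        return False
    return True


# ISO-8859-1 has no Japanese characters, so the conversion fails.
latin1_ok = can_index(blob, "shift_jis", "iso-8859-1")
# UTF-8 covers all of Unicode, so the same document converts cleanly.
utf8_ok = can_index(blob, "shift_jis", "utf-8")
print(latin1_ok, utf8_ok)
```

Python raises an error where a lossy converter would silently emit replacement characters, which makes the failure easy to see. The garbage tokens mentioned above are the silent form of the same problem.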
If your Oracle Files instance will contain multilingual documents, UTF-8 is the recommended database character set. UTF-8 supports characters defined in the Unicode standard. The Unicode standard solves the problem of many different languages in the same application or database. Unicode is a single, global character set which contains all major living scripts and conforms to international standards. Unicode provides a unique code value for every character, regardless of the platform, program, or language. UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding and a strict superset of ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in UTF-8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes. By using a Unicode-based file system, document content and metadata of different languages can be shared by users with different language preferences in one system.
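As a small illustration of the variable-width encoding described above, the following Python sketch measures how many bytes a few characters occupy in UTF-8. The sample characters are chosen for illustration only.

```python
# One sample character per width class described in the text.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin small letter e with acute (European script)",
    "\u65e5": "CJK ideograph (Asian script)",
    "\U00020000": "supplementary character (CJK Extension B)",
}

# Number of bytes each character needs in UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {widths[ch]} byte(s)")

# UTF-8 is a strict superset of ASCII: ASCII text encodes to identical bytes.
ascii_compatible = "Oracle Files".encode("ascii") == "Oracle Files".encode("utf-8")
```

The four samples occupy 1, 2, 3, and 4 bytes respectively, matching the ranges given above.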
The Oracle9i database introduces a new character set, AL32UTF8. In Oracle9i Release 1, AL32UTF8 was the default character set for Unicode 3.0 deployment. In Release 2, AL32UTF8 is compliant with the Unicode 3.1 standard, which adds the supplementary characters, particularly additional Chinese, Japanese, and Korean ideographs. AL32UTF8 is the default character set of an Oracle9i database installation.
To support documents in different character sets and languages in a single file system, the repository associates two globalization attributes per document. They are the character set and language attributes.
The character set of a document is used in several situations. When the document content is rendered to a file, the character set of the document is used as the character encoding of the file. When the document is displayed in the browser, the character set of the document is set in the HTTP content-type header. Finally, when a full-text search is built on a text document, Oracle Text uses the character set of the document to convert the data into the database character set before building the index. When a character set is updated, the content is reindexed.
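The first two uses of the character set attribute can be sketched in Python as follows. The `Document` class and both helper functions are hypothetical stand-ins, not part of the Oracle Files API. They show only how a per-document character set could drive the HTTP content-type header and the decoding of stored bytes.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a repository document and its attributes."""
    mime_type: str
    charset: str    # IANA character set name, for example "Shift_JIS"
    content: bytes  # stored as-is, as in a BLOB


def content_type_header(doc: Document) -> str:
    """Build the Content-Type header value sent to the browser."""
    return f"{doc.mime_type}; charset={doc.charset}"


def decode_for_display(doc: Document) -> str:
    """Interpret the stored bytes using the document's own character set."""
    return doc.content.decode(doc.charset)


doc = Document("text/plain", "Shift_JIS", "こんにちは".encode("shift_jis"))
header = content_type_header(doc)
text = decode_for_display(doc)
print(header)
```

Because the bytes are never converted on storage, the character set attribute is the only record of how to interpret them. An incorrect attribute value produces unreadable output even though the stored bytes are intact.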
If no character set is specified upon insertion of a document, the repository determines a default character set as follows: the character set of the user's LibrarySession stored in the Localizer object is first used. This is obtained from the user's PrimaryUserProfile information at initialization of the user's LibrarySession.
The language of a document is used mainly in two ways. First, it serves as a criterion to limit a search to documents in a particular language. The more significant use, however, is in building a full-text search index on the document with Oracle Text. Oracle Text's multilexer feature uses the language to identify the specific lexer that parses the document for searchable words. The language-specific lexers need to be defined and associated with a language before the index is built. They are defined as follows:
The BASIC_LEXER is used for single-byte languages that use white space as a word separator. Asian language lexers cannot use white space as a word separator. Instead, they use a V-gram algorithm to parse the documents for searchable keys. Languages that are not supported by Oracle Text are parsed as English. Oracle Files uses the multilexer feature of Oracle Text. It is a global lexer containing German, Danish, Swedish, Japanese, Simplified Chinese, Traditional Chinese, and Korean sublexers.
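The difference between the two lexing strategies can be sketched in Python. This is a simplified illustration, not Oracle Text's implementation: the function names are hypothetical, and a fixed-length bigram split stands in for the V-gram algorithm, whose details are not described here.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on white space, as a lexer for space-delimited languages can."""
    return text.split()


def ngram_tokens(text: str, n: int = 2) -> list[str]:
    """Emit overlapping n-character keys (simplified stand-in for V-gram)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


english = "full text search"
japanese = "全文検索"  # "full-text search", written without spaces

print(whitespace_tokens(english))   # three separate words
print(whitespace_tokens(japanese))  # one unsplit run: useless for search
print(ngram_tokens(japanese))       # overlapping two-character keys
```

White-space splitting leaves the Japanese phrase as a single token, so a query for any word inside it would not match. The overlapping keys make every adjacent character pair searchable, at the cost of a larger index.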
If no language is specified upon insertion of a document, the repository determines a default language as follows.
Languages follow a single naming convention: the Oracle NLS language abbreviation. See Table F-3, "Languages Supported in Oracle Files" for a list of Oracle Files-supported languages.
There are two service configuration properties that hold default character set and language values for Oracle Files Subscribers. The properties are:
These two properties are initialized with the Oracle Files Configuration Assistant and can later be modified through Oracle Enterprise Manager. The Oracle Files default character set should be the same as, or a subset of, the database character set. The character set should be specified in accordance with the IANA standard naming convention. The language should be specified in accordance with Oracle naming for languages. See Table F-2, "Character Sets Supported in Oracle Files" and Table F-3, "Languages Supported in Oracle Files" for a list of Oracle Files-supported character sets and languages.
Oracle Files does not support multibyte user names for certain protocols. Access through WebDAV (Web Folders and Oracle FileSync), HTTP, and SMB is not available for user names that contain multibyte characters. FTP allows multibyte user names.
The standard FTP protocol does not define the character set of the file names or directory names that are usually passed as arguments of FTP commands. The FTP server is responsible for interpreting the byte sequence of the FTP commands. To allow users to access documents of different character sets and languages, the Oracle Files FTP server provides the following QUOTE commands to support this.
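The ambiguity described above can be seen in a short Python sketch. The function `decode_ftp_argument` and the sample file name are hypothetical. The point is that the same bytes on the wire become different file names depending on the character set the server assumes for the session.

```python
def decode_ftp_argument(raw: bytes, session_charset: str) -> str:
    """Interpret a command argument using the session's character set.

    Hypothetical helper: the FTP protocol itself carries no charset
    information, so the server must apply a session-level setting.
    """
    return raw.decode(session_charset)


file_name = "報告書.txt"
wire_bytes = file_name.encode("shift_jis")  # what a Shift-JIS client sends

# Session character set matches the client: the name is recovered.
as_shift_jis = decode_ftp_argument(wire_bytes, "shift_jis")

# Session character set does not match: decoding succeeds but the name is wrong.
as_latin1 = decode_ftp_argument(wire_bytes, "iso-8859-1")

print(as_shift_jis == file_name, as_latin1 == file_name)
```

The mismatched case raises no error, because every byte value is valid in ISO-8859-1. The server would simply store or look up the wrong name, which is why the session's character set must be set explicitly.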
When a quote command is issued to change the character set or language of the FTP session, the FTP server actually updates the settings in the Localizer object of the current LibrarySession. Consequently, because quote commands cannot be issued until an FTP session is established, only user names in the FTP server's default character set, or in a subset of it, can be used to log in to the FTP server.
Users can specify the character sets and languages of their environments using standard command-line FTP clients. Browser-based FTP clients, such as Internet Explorer or Netscape, do not allow quote commands to be issued, so the FtpSession defaults are used.
The Server Message Block (SMB) protocol server implements the SMB protocol to allow mounting of Oracle Files as a disk drive in Microsoft Windows Explorer. Microsoft has included Unicode support for the SMB protocol since LanManager Version 0.12.
The SMB protocol does not allow users to pass character set and language information to the server. The session defaults are used for documents inserted into the repository through the SMB protocol.
The following table summarizes the character sets supported in Oracle Files.
The following table summarizes the languages supported in Oracle Files.