4 About Globalization and Multibyte Support

This chapter introduces Oracle Access Manager 10g (10.1.4.0.1) globalization, localization for international languages, and multibyte support through the use of Unicode to enable processing of internationalized data. The following topics are included:

Oracle Access Manager Globalization and Localization
Languages For Localized Messages in Oracle Access Manager
Oracle Access Manager and the Unicode Standard
Oracle Unicode Character Sets
Oracle Access Manager and Latin-1 Encoding
Looking Ahead

4.1 Oracle Access Manager Globalization and Localization

Oracle Access Manager 10g (10.1.4.0.1) has undergone a globalization process. Globalization means providing multilingual applications and software products that can be accessed and run anywhere simultaneously, without modification, while rendering content in the user's native language and according to the user's locale preferences.

A locale is the linguistic and cultural environment in which a system or program is running. Data associated with a locale provides support for formatting and parsing of dates, times, numbers, currencies, and the like, based on the linguistic and cultural requirements that correspond to a given language and country.

Oracle product globalization is a two-part process that includes internationalization and localization. Internationalization (sometimes shortened to "I18N", meaning "I, eighteen letters, N") requires that software products and applications be usable on a machine running any supported operating system (in any supported language), with non-US keyboards or other country-specific hardware. Oracle applications do not have hard-coded dependencies on language strings, and they interoperate with non-US versions of other products. Oracle applications can handle multibyte characters and differences in a distributed environment, and can detect the user's desired locale. Oracle Access Manager 10g (10.1.4.0.1) meets these requirements and conforms to Unicode Standard 4.0, discussed in "Oracle Access Manager and the Unicode Standard".

Localization includes translation of separated file text. In Oracle products, including Oracle Access Manager, information is presented in a manner that is consistent with the user's local cultural conventions, including data formatting, collation, currency, date, time, and directionality of text (right-to-left or left-to-right), as discussed next.

4.2 Languages For Localized Messages in Oracle Access Manager

Translatable information can be categorized into two types: end-user information (accessible to all users) and administrative information (for users with administrator privileges). When you install Oracle Access Manager 10g (10.1.4.0.1) without a Language Pack, English is the default language for Administrators and end users. When you install 10g (10.1.4.0.1) with Oracle-provided Language Packs, you can choose the language to be used as the default for Administrative activities. Regardless of the default Administrator language you choose during installation, English is always installed.

For end users, Oracle Access Manager 10g (10.1.4.0.1) enables the display of static application data, such as error messages and display names for tabs, panels, and attributes, in the End User languages identified in Table 4-1. Administrative information can be displayed only in the Administrators languages listed in Table 4-1. If administrative pages are requested in any other language (by the browser setting), the language that was selected as the default during product installation is used to display the pages.

Table 4-1 Languages for Localized Messages in Oracle Access Manager

Language Tag for Installation Directory	End User Information	Administrators
en-us	English	English
ar-ar	Arabic
pt-br	Brazilian Portuguese	Brazilian Portuguese
fr-ca	Canadian French	Canadian French
cs-cs	Czech
da-dk	Danish
nl-nl	Dutch
fi-fi	Finnish
fr-fr	French	French
de-de	German	German
el-gr	Greek
he-il	Hebrew
hu-hu	Hungarian
it-it	Italian	Italian
ja-jp	Japanese	Japanese
ko-kr	Korean	Korean
es-mx	Latin American Spanish	Latin American Spanish
no-no	Norwegian
pl-pl	Polish
pt-pt	Portuguese
ro-ro	Romanian
ru-ru	Russian
zh-cn	Simplified Chinese	Simplified Chinese
sk-sk	Slovak
es-es	Spanish/Spain	Spanish
sv-sv	Swedish
th-th	Thai
zh-tw	Traditional Chinese	Traditional Chinese
tr-tr	Turkish
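
The fallback rule described before Table 4-1 can be sketched in a few lines. This is only an illustration, not product code: the function name, its parameters, and the assumption that every Administrators language pack is installed are all hypothetical. The set of tags is copied from the Administrators column of Table 4-1.

```python
# Hypothetical sketch of the administrative-page language fallback.
# Tags are taken from the Administrators column of Table 4-1; the sketch
# assumes (for illustration) that all of these language packs are installed.
ADMIN_LANGUAGES = {
    "en-us", "pt-br", "fr-ca", "fr-fr", "de-de", "it-it",
    "ja-jp", "ko-kr", "es-mx", "zh-cn", "es-es", "zh-tw",
}


def pick_admin_language(browser_language: str, install_default: str = "en-us") -> str:
    """Return the language tag used to render administrative pages.

    If the browser requests a language that is not an Administrators
    language, fall back to the default chosen at installation time.
    """
    tag = browser_language.strip().lower()
    if tag in ADMIN_LANGUAGES:
        return tag
    return install_default


if __name__ == "__main__":
    print(pick_admin_language("fr-fr"))           # an Administrators language
    print(pick_admin_language("ar-ar"))           # end-user only, falls back
    print(pick_admin_language("he-il", "de-de"))  # falls back to chosen default
```

Arabic and Hebrew appear only in the End User Information column, so a request for administrative pages in either language resolves to the installation default in this sketch.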

4.2.1 Bi-directional Language Support

Most Western languages are written left to right (LTR), from the top of the page to the bottom. East Asian languages are usually written top to bottom, from the right side of the page to the left (RTL)—although exceptions are frequently made for technical books translated from Western languages.

Some languages, such as Hebrew and Arabic, are written and read predominantly from right to left. Numbers reverse direction in Arabic and Hebrew. While the text is written right to left, numbers within the sentence are written left to right with the most significant digit on the left, as in European and other LTR languages.

When LTR languages are mixed in with RTL languages, the complete document or content is considered bi-directional. Oracle Access Manager can support bi-directional languages. If the browser on the host machine is configured to use any bi-directional language, then Oracle Access Manager will handle it properly.

Note:

No administrative languages require bi-directional support.

To provide support for multiple languages and bi-directional languages, Oracle Access Manager 10g (10.1.4.0.1) supports the Unicode standard for encoding.

Note:

Writing direction does not affect the encoding of a character. Regardless of the writing direction, Oracle stores data in logical order—the order used by someone typing a language—rather than the order in which it is presented on the screen.
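
As a small illustration of logical order, the following Python sketch stores a Hebrew word followed by a number and inspects the bidirectional class that Unicode assigns to each character. It uses only the standard `unicodedata` module; the sample text and variable names are chosen for illustration and do not come from Oracle Access Manager.

```python
import unicodedata

# The Hebrew word "shalom" followed by a space and the number 123,
# written here in the order a person would type it (logical order).
text = "\u05E9\u05DC\u05D5\u05DD 123"

# The first stored character is the first letter typed, regardless of
# where a renderer eventually draws it on the screen.
first_typed = text[0]

# Each character carries a bidirectional class that a renderer consults:
# "R" is a right-to-left letter, "EN" a European number, "WS" whitespace.
bidi_classes = [unicodedata.bidirectional(ch) for ch in text]

if __name__ == "__main__":
    print(unicodedata.name(first_typed))
    print(bidi_classes)
```

The stored sequence never changes with display direction. Only the renderer reorders characters for presentation, which is why the digits keep their most-significant-first order inside right-to-left text.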

4.3 Oracle Access Manager and the Unicode Standard

Oracle Access Manager 10g (10.1.4.0.1) supports the Unicode standard, which has been adopted by many software and hardware vendors. Many operating systems and browsers now support Unicode, and Unicode is required by modern standards such as XML, Java, JavaScript, LDAP, CORBA 3.0, WML, Windows XP, and others. The Unicode standard is also synchronized with the ISO/IEC 10646 standard.

Computers represent all characters as numbers. A character set is the mapping of individual characters to binary values. This mapping is also known as an encoding scheme. There are dozens of different encoding schemes for characters.

Unicode is a universal encoding scheme that defines codes for the characters used in every major language written today. Unicode enables information from any language to be stored using a single character set. The Unicode standard assigns a distinct numeric value, called a codepoint, to each character and organizes characters into blocks of related characters. Unicode codepoints are commonly expressed in hexadecimal form. Unicode provides a unique code value for every character, whatever the platform, program, or language.
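
To make the idea of a codepoint concrete, the following Python sketch prints the hexadecimal codepoint and the official Unicode name of a few characters. The helper name `describe` and the sample characters are illustrative choices only.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' using its Unicode codepoint."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# A Latin letter, an accented Latin letter, a Greek letter, and a CJK ideograph.
samples = ["A", "\u00E9", "\u03A9", "\u4E2D"]
descriptions = [describe(ch) for ch in samples]

if __name__ == "__main__":
    for line in descriptions:
        print(line)
```

Each character maps to exactly one codepoint, and that value is the same on every platform and in every program.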

The Unicode standard primarily encodes scripts rather than languages. A script is a collection of symbols used in one or more related languages. When more than one language shares a set of symbols that have an historically-related derivation, the set of symbols of each such language is unified into a single collection identified as a single script. For example, the Cyrillic script is a superset of all the characters used by the different Cyrillic alphabets.
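
The Cyrillic example can be checked with a short Python sketch. The three letters below are illustrative picks associated with different Cyrillic alphabets; the dictionary and constant names are assumptions made for this example.

```python
import unicodedata

# One letter each from three alphabets that share the Cyrillic script.
letters = {
    "Russian": "\u042F",    # CYRILLIC CAPITAL LETTER YA
    "Ukrainian": "\u0407",  # CYRILLIC CAPITAL LETTER YI
    "Serbian": "\u0409",    # CYRILLIC CAPITAL LETTER LJE
}

# The basic Cyrillic block spans U+0400 through U+04FF.
CYRILLIC_BLOCK = range(0x0400, 0x0500)

in_block = {lang: ord(ch) in CYRILLIC_BLOCK for lang, ch in letters.items()}
names = {lang: unicodedata.name(ch) for lang, ch in letters.items()}

if __name__ == "__main__":
    for lang in letters:
        print(lang, names[lang], in_block[lang])
```

All three letters are named and encoded as Cyrillic characters; nothing in the codepoint says which language a letter belongs to.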

As with many technologies, Unicode has more than one implementation standard: the fixed-width UCS-2 form (or its superset, UTF-16) and the reasonably compact, variable-width UTF-8 form used by Oracle.

Note:

The canonical UCS-2 form (used internally in Windows NT, for example), uses 2 bytes for each character (including ASCII characters).
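
The size difference between the two forms can be seen with a short Python sketch. Python has no codec named "ucs-2", so this example assumes the "utf-16-le" codec as a stand-in: for characters in the Basic Multilingual Plane, UTF-16 produces the same two bytes as UCS-2. The function name is hypothetical.

```python
def byte_widths(ch: str) -> tuple[int, int]:
    """Return (bytes in UCS-2/UTF-16, bytes in UTF-8) for one BMP character.

    "utf-16-le" is used as a stand-in for UCS-2; the two agree for every
    character in the Basic Multilingual Plane.
    """
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in ["A", "\u00E9", "\u4E2D"]:
        print(f"U+{ord(ch):04X}", byte_widths(ch))
```

An ASCII letter costs 2 bytes in the fixed-width form but only 1 byte in UTF-8, while a Chinese ideograph costs 2 bytes in the fixed-width form and 3 bytes in UTF-8.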

4.3.1 UTF-8 Encoding

Oracle products, including Oracle Access Manager 10g (10.1.4.0.1), support UTF-8 encoding. Incoming data uses UTF-8 encoding. Outgoing text, such as the HTML pages and information from Oracle Access Manager, uses UTF-8 encoding. This means that outgoing characters are displayed in the appropriate language (LTR, RTL, or bi-directional), as needed.

Note:

Earlier releases of Oracle Access Manager supported only the Latin-1 encoding standard, as discussed in "Oracle Access Manager and Latin-1 Encoding". With 10g (10.1.4.0.1), Unicode is provided in new installations while backward compatibility is provided for customizations in older installations that have been upgraded to 10g (10.1.4.0.1). For more information, see Chapter 5, "Overview of 10g (10.1.4.0.1) Behaviors".

Most modern character encodings have an historical grounding in ASCII, which is the most common format for text files in computers and on the Internet. In an ASCII file, each alphabetic, numeric, or special character is represented with a 7-bit binary number. Unix and DOS-based operating systems use ASCII for text files. Windows NT, 2000, and XP use Unicode.

UTF-8 is the Unicode 8-bit encoding standard, which is a strict superset of 7-bit ASCII. This means that each 7-bit ASCII character is available in UTF-8 with the same corresponding codepoint value. While each 7-bit ASCII character occupies 1 byte, each UTF-8 codepoint value generates a bit pattern that is distributed over one to four bytes. This means that in UTF-8 encoding, a single Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes.
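
How a codepoint value is distributed over one to four bytes can be sketched by hand. The following Python function is a minimal illustration of the standard UTF-8 bit layout, written for this explanation only; it is not the code used by Oracle Access Manager, and in practice the built-in `str.encode("utf-8")` does the same work.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 (illustrative sketch)."""
    if 0xD800 <= codepoint <= 0xDFFF or not 0 <= codepoint <= 0x10FFFF:
        raise ValueError(f"not an encodable codepoint: {codepoint:#x}")
    if codepoint < 0x80:
        # 1 byte: 0xxxxxxx (identical to 7-bit ASCII)
        return bytes([codepoint])
    if codepoint < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x4E2D, 0x10400):
        # The hand-written encoder should agree with Python's built-in codec.
        print(f"U+{cp:04X}", utf8_encode(cp).hex(), chr(cp).encode("utf-8").hex())
```

The first branch shows why UTF-8 is a superset of 7-bit ASCII: any codepoint below 0x80 is emitted as a single unchanged byte.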

In the UTF-8 form of Unicode supported by Oracle, accented Latin characters as well as Greek, Cyrillic, Arabic, and Hebrew characters occupy 2 bytes each. All other characters, including Chinese, Japanese, Korean, and Indic characters, occupy 3 bytes each. Supplementary characters occupy 4 bytes each.
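
These byte counts can be confirmed with Python's built-in UTF-8 codec. The sample characters and labels below are illustrative choices, one per category named above.

```python
# One sample character for each category mentioned in the text.
samples = {
    "ASCII letter": "A",
    "accented Latin": "\u00E9",        # e with acute
    "Greek": "\u03A9",                 # capital omega
    "Cyrillic": "\u0416",              # capital zhe
    "Arabic": "\u0639",                # ain
    "Hebrew": "\u05D0",                # alef
    "Chinese": "\u4E2D",
    "Japanese": "\u3042",              # hiragana a
    "Korean": "\uD55C",                # hangul syllable han
    "Indic (Devanagari)": "\u0905",    # letter a
    "supplementary": "\U00010400",     # a character outside the BMP
}

# Number of bytes each character occupies when encoded as UTF-8.
utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

if __name__ == "__main__":
    for label, length in utf8_lengths.items():
        print(f"{label}: {length} byte(s)")
```

When sizing fields or buffers for multibyte data, the count that matters is bytes rather than characters, so a 10-character Japanese string needs 30 bytes in UTF-8.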

4.4 Oracle Unicode Character Sets

Historically, vendors have defined different character sets for their hardware and software, primarily because there were no official standards. Different character sets support different character inventories (known as repertoires). When character sets were first developed, they had limited repertoires. For example, the ASCII 7-bit character set provided only 128 symbols and the Unix ROMAN8 8-bit character set provided only 256 symbols. UTF8, an older multibyte character set developed by Oracle, is based on an older Unicode standard.

Unicode (UTF-8) is a multibyte character set with the capability to define over a million characters. Because character sets are typically based on a particular writing script, one character set can support more than one language.

The Oracle equivalent for the Unicode UTF-8 standard is the AL32UTF8 character set. The code used to process this character set resides within the libraries bundled with each Oracle Access Manager component and installed automatically with the product.

Oracle Access Manager 10g (10.1.4.0.1) uses the AL32UTF8 character set to process all data. Data coming in to Oracle Access Manager is UTF-8 encoded. Outgoing data is UTF-8 encoded.

4.4.1 Background on Oracle AL32UTF8 and Other Oracle Character Sets

The following background is provided for reference only:

Oracle AL32UTF8 and UTF8 Character Sets
Older Oracle Unicode Character Sets

4.4.1.1 Oracle AL32UTF8 and UTF8 Character Sets

With Oracle9i, Oracle introduced the AL32UTF8 Unicode character set, with enhancements based on Unicode standard 3.0. Starting with Oracle 10g Release 2 (10.1.2), AL32UTF8 maps to the latest version of the Unicode Standard (Unicode 4.0) and provides support for newly defined supplementary characters. All supplementary characters are stored as 4 bytes. As the UTF-8 standard evolves, so too will the AL32UTF8 character set.

With Oracle8 and 8i, Oracle introduced UTF8 as the UTF-8 encoded character set (based on Unicode version 2.1). Oracle9i included an updated version of the Oracle UTF8 character set to support Unicode standard 3.0. To maintain compatibility with existing installations, the UTF8 character set will remain at Unicode version 3.0.

4.4.1.2 Older Oracle Unicode Character Sets

Oracle began supporting Unicode as a database character set starting with Oracle database version 7. AL24UTFFSS was the first Unicode character set supported by Oracle. AL24UTFFSS is an acronym for the multibyte Unicode character encoding scheme UTF-FSS. The naming convention <Language><bit size><encoding> was used, where AL24UTFFSS represents All Languages 24 bits size UTFFSS encoding. AL24UTFFSS was based on Unicode standard 1.1, which is now obsolete. AL24UTFFSS support was dropped as of Oracle9i.
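
The <Language><bit size><encoding> naming convention can be illustrated by splitting a character set name into its three parts. The regular expression and function below are assumptions made for this sketch, not an Oracle API, and they cover only names that follow the convention.

```python
import re

# <Language><bit size><encoding>: leading letters, then digits, then the rest.
NAME_PATTERN = re.compile(r"(?P<language>[A-Z]+)(?P<bits>\d+)(?P<encoding>.+)")


def split_charset_name(name: str) -> tuple[str, int, str]:
    """Split a character set name into (language, bit size, encoding)."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return match["language"], int(match["bits"]), match["encoding"]


if __name__ == "__main__":
    print(split_charset_name("AL24UTFFSS"))
    print(split_charset_name("AL32UTF8"))
```

Here "AL" stands for All Languages, so AL24UTFFSS reads as All Languages, 24 bits, UTFFSS encoding. The older name UTF8 has no language prefix or bit size, so this sketch rejects it.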

4.5 Oracle Access Manager and Latin-1 Encoding

Earlier releases of Oracle Access Manager supported only Latin-1 encoding, which allowed the product to process a subset of European languages.

Latin-1 encoding was developed jointly by the International Organization for Standardization (ISO, which is not an acronym) and the International Electrotechnical Commission (IEC). This 8-bit character encoding standard for computers is known formally as ISO/IEC 8859 and informally as ISO 8859. The standard is divided into numbered parts; each part is published separately. Part 1 (also known as ISO 8859-1) is the most widely used and encompasses Latin-1 encodings.

ISO 8859-1 (Latin-1) encodings can be represented in a single byte (8 bits) in computer memory. They support various Western European languages, such as English, French, and German, some Eastern European languages, such as Albanian, and also Afrikaans and Swahili.

ISO-8859-1 is the default encoding used for legacy HTML documents and for documents transmitted through MIME messages, such as HTTP responses when the document's media type is "text" (for example, "text/html").

The eight-bit ISO 8859 standard was developed as a true extension of ASCII. ISO 8859-1 (Latin-1) leaves the original ASCII character mapping intact and adds values above the 7-bit range. ISO/IEC 8859-1 and the original 7-bit ASCII remain the most common character encodings in use today.
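
The relationship between ASCII, Latin-1, and UTF-8 can be illustrated with Python's built-in codecs. The variable and function names below are chosen for this sketch only; it shows general encoding behavior, not how Oracle Access Manager handles upgraded data.

```python
e_acute = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

# Latin-1 stores the character in a single byte; UTF-8 needs two bytes.
latin1_bytes = e_acute.encode("latin-1")
utf8_bytes = e_acute.encode("utf-8")

# Pure ASCII text produces identical bytes in all three encodings.
word = "Access"
ascii_unchanged = word.encode("ascii") == word.encode("latin-1") == word.encode("utf-8")

# Reading UTF-8 bytes as if they were Latin-1 yields two wrong characters
# instead of one, which is why the two encodings must not be mixed up.
misread = utf8_bytes.decode("latin-1")


def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(latin1_bytes, utf8_bytes, ascii_unchanged)
    print(repr(misread))
    print(fits_latin1("caf\u00E9"), fits_latin1("\u4E2D"))
```

Characters outside the Latin-1 repertoire, such as Chinese ideographs, cannot be represented at all in Latin-1, which is the limitation that UTF-8 support removes.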

Backward compatibility is automatic in upgraded environments and Latin-1 encoding remains the default in older installations that you upgrade to 10g (10.1.4.0.1).

Note:

When you add a 10g (10.1.4.0.1) Identity or Access Server to an upgraded environment, you must manually set flags to enable backward compatibility. For details, see Chapter 5, "Overview of 10g (10.1.4.0.1) Behaviors".

4.6 Looking Ahead

Other chapters in this guide provide a more in-depth look at concepts, behaviors, manuals, and terminology: