International Language Environments Guide

Unicode Overview

Unicode is the universal character encoding standard used to represent text for computer processing. Unicode is fully compatible with the international standards ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001, and contains all the same characters and code points as ISO/IEC 10646. The Unicode Standard provides additional information about the characters and their use. Any implementation that conforms to Unicode also conforms to ISO/IEC 10646.

Unicode provides a consistent way of encoding multilingual plain text and facilitates exchanging international text files. Computer users who deal with multilingual text, business people, linguists, researchers, scientists, and others find that the Unicode Standard greatly simplifies their work. Mathematicians and technicians who regularly use mathematical symbols and other technical characters also find the Unicode Standard valuable.

Unicode can support a maximum of 1,114,112 code points, organized as seventeen 16-bit planes. Each plane holds 65,536 code points.

Among the more than one million code points that Unicode can support, version 4.0 currently defines 96,382 characters in planes 0, 1, 2, and 14. Planes 15 and 16 are for private use characters, also known as user-defined characters. Together, planes 15 and 16 can support a total of 131,068 user-defined characters.
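
The figures above follow from simple arithmetic. The following Python sketch reproduces them. The constant and function names are illustrative only, and the private-use count assumes that the last two code points of each plane are reserved as noncharacters, which is why the total is 131,068 rather than 131,072.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16  # 65,536 code points in each 16-bit plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE
max_code_point = total_code_points - 1  # highest code point value, U+10FFFF

# Assumption: the last two code points of every plane (xxFFFE and xxFFFF)
# are noncharacters and are therefore not available for private use.
NONCHARACTERS_PER_PLANE = 2
PRIVATE_USE_PLANES = (15, 16)
private_use_total = len(PRIVATE_USE_PLANES) * (
    CODE_POINTS_PER_PLANE - NONCHARACTERS_PER_PLANE
)


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a code point."""
    if not 0 <= code_point <= max_code_point:
        raise ValueError("code point out of Unicode range")
    return code_point >> 16


print(total_code_points)    # 1114112
print(hex(max_code_point))  # 0x10ffff
print(private_use_total)    # 131068
```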

Unicode can be encoded using any of the following character encoding schemes:

UTF-8
UTF-16
UTF-32

UTF-8 is a variable-length encoding form of Unicode that preserves ASCII character code values transparently. This form is used as file code in Oracle Solaris Unicode locales.
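
As a minimal illustration of these two properties (variable length and ASCII transparency), the Python sketch below compares the UTF-8 bytes of a few sample characters. The helper name `utf8_length` and the sample characters are chosen for illustration and are not part of any Oracle Solaris interface.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs to encode one code point."""
    if code_point < 0x80:
        return 1  # ASCII range: one byte, same value as in ASCII
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4      # code points beyond plane 0


# ASCII text has identical bytes in ASCII and in UTF-8.
ascii_text = "Solaris"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Sample characters: A, e with acute accent, euro sign, musical G clef.
for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({len(encoded)} bytes)")
```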

UTF-16 is a 16-bit encoding form of Unicode. In UTF-16, code points up to 65,535 (U+FFFF) are encoded as single 16-bit values. Code points from 65,536 through 1,114,111 (U+10000 through U+10FFFF) are encoded as pairs of 16-bit values (surrogates).
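
The surrogate mechanism can be sketched in a few lines of Python. The function name `to_utf16_units` is hypothetical. The arithmetic follows the usual UTF-16 definition: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves that are added to 0xD800 and 0xDC00.

```python
from typing import List


def to_utf16_units(code_point: int) -> List[int]:
    """Return the 16-bit code units that UTF-16 uses for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not encodable characters")
    if code_point <= 0xFFFF:
        return [code_point]             # single 16-bit value
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return [high, low]                  # surrogate pair


# The euro sign fits in one unit; U+1D11E needs a surrogate pair.
print([f"{u:04X}" for u in to_utf16_units(0x20AC)])   # ['20AC']
print([f"{u:04X}" for u in to_utf16_units(0x1D11E)])  # ['D834', 'DD1E']
```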

UTF-32 is a fixed-length, 21-bit encoding form of Unicode usually represented in a 32-bit container or data type. This form is used as the process code (wide-character code) in Oracle Solaris Unicode locales.
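
The following Python sketch illustrates the fixed-length property: each character occupies exactly one 32-bit unit, and the value in that unit never needs more than 21 bits. The function name `utf32_values` and the sample string are illustrative. The sketch uses Python's own codec and does not call the Oracle Solaris wide-character interfaces.

```python
import struct
from typing import Tuple


def utf32_values(text: str) -> Tuple[int, ...]:
    """Return the 32-bit values that UTF-32 (big-endian) stores for text."""
    raw = text.encode("utf-32-be")      # no byte order mark in this variant
    count = len(raw) // 4               # always four bytes per character
    return struct.unpack(f">{count}I", raw)


sample = "A\u20ac\U0001D11E"            # A, euro sign, musical G clef
values = utf32_values(sample)

# One 32-bit unit per character, regardless of the plane it belongs to.
assert len(values) == len(sample)

# Every value is a code point, so 21 bits are always enough.
assert all(v.bit_length() <= 21 for v in values)

print([f"{v:08X}" for v in values])
# ['00000041', '000020AC', '0001D11E']
```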

For more details on the Unicode Standard and ISO/IEC 10646 and their various representative forms, refer to the following sources:

The Unicode Standard, Version 4.0 from the Unicode Consortium
ISO/IEC 10646-1:2000, Information Technology-Universal Multiple-Octet Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane
ISO/IEC 10646-2: Information Technology-Universal Multiple-Octet Character Set (UCS) - Part 2: Secondary Multilingual Plane for Scripts and Symbols, Supplementary Plane for CJK Ideographs, Special Purpose Plane
The Unicode Consortium web site at http://www.unicode.org/.

Unicode Locale: `en_US.UTF-8` Support

The Unicode/UTF-8 locales support Unicode 4.0. The en_US.UTF-8 locale provides multiscript processing support by using UTF-8 as its codeset. This locale handles processing of input and output text in multiple scripts, and was the first locale with this capability in the Oracle Solaris operating system. The capabilities of other UTF-8 locales are similar to those of en_US.UTF-8. The discussion of en_US.UTF-8 that follows applies equally to these locales.

Note –

UTF-8 is a file-system safe Universal Character Set Transformation Format of Unicode/ISO/IEC 10646-1. It was formulated by the X/Open-Uniforum Joint Internationalization Working Group (XoJIG) in 1992 and approved by ISO and IEC as Amendment 2 to ISO/IEC 10646-1:1993 in 1996. The Unicode Consortium, the International Organization for Standardization (ISO), and the International Electrotechnical Commission (IEC) have adopted this standard as part of Unicode 4.0 and ISO/IEC 10646-1.

Unicode locales in the Oracle Solaris environment support the processing of every code point value that is defined in Unicode 4.0 and ISO/IEC 10646-1 and 10646-2. Supported scripts include pan-European and Asian scripts and also complex text layout scripts for the Arabic, Hebrew, Indic, and Thai languages.

Note –

Some Unicode locales, notably the Asian locales, include more Kanji or Hanzi glyphs.

Due to limited font resources, the current Oracle Solaris Unicode locales include character glyphs from the following character sets.

ISO 8859-1 (most Western European languages, such as English, French, Spanish, and German)
ISO 8859-2 (most Central European languages, such as Czech, Polish, and Hungarian)
ISO 8859-4 (Scandinavian and Baltic languages)
ISO 8859-5 (Russian)
ISO 8859-6 (Arabic, including many more presentation-form character glyphs)
ISO 8859-7 (Greek)
ISO 8859-8 (Hebrew)
ISO 8859-9 (Turkish)
TIS 620.2533 (Thai, including many more presentation-form character glyphs)
ISO 8859-15 (most Western European languages with euro sign)
GB 2312-1980 (Simplified Chinese)
JIS X 0201-1976, JIS X 0208-1990 (Japanese)
KSC 5601-1992 Annex 3 (Korean)
GB 18030 (Simplified Chinese)
HKSCS (Traditional Chinese, Hong Kong)
Big5 (Traditional Chinese, Taiwan)
IS 13194.1991, also known as ISCII (Hindi, including many more presentation-form character glyphs)

If you try to view characters for which the en_US.UTF-8 locale does not have corresponding glyphs, the locale displays a no-glyph glyph instead, as shown in the following illustration:

The locale is selectable at installation time and may be designated as the system default locale.

The same level of en_US.UTF-8 locale support is provided for both 64-bit and 32-bit Oracle Solaris systems.
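
To check from a program whether a locale such as en_US.UTF-8 is available and which codeset it reports, something like the following Python sketch can be used. It is an illustration under stated assumptions, not an Oracle Solaris tool: the function names are hypothetical, the name parser assumes the conventional `language_territory.codeset@modifier` layout, and the result of the second function depends on which locales are installed on the system.

```python
import locale
from typing import Optional


def codeset_in_name(locale_name: str) -> Optional[str]:
    """Extract the codeset part of a locale name, or None if absent.

    Assumes the conventional language_territory.codeset@modifier layout.
    """
    name = locale_name.split("@", 1)[0]   # drop any modifier
    if "." not in name:
        return None
    return name.split(".", 1)[1]


def codeset_after_selecting(locale_name: str) -> Optional[str]:
    """Try to select a locale for LC_CTYPE and report its codeset.

    Returns None when the locale is not installed or when the platform
    does not provide nl_langinfo.
    """
    previous = locale.setlocale(locale.LC_CTYPE)   # query current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        if hasattr(locale, "nl_langinfo"):
            return locale.nl_langinfo(locale.CODESET)
        return None
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, previous)  # restore the setting


print(codeset_in_name("en_US.UTF-8"))          # UTF-8
print(codeset_after_selecting("en_US.UTF-8"))  # depends on installed locales
```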

Note –

Motif and CDE desktop applications and libraries support the en_US.UTF-8 locale. However, XView™ and OLIT libraries do not support the en_US.UTF-8 locale.
