Unicode Overview

Unicode is the universal character encoding standard for representing text in computer processing. It provides a consistent way to encode multilingual text and makes it easier to exchange international text files.

The international standard for coding multilingual text is ISO/IEC 10646. Although ISO/IEC 10646 and the Unicode Standard contain the same characters and code points, the Unicode Standard provides additional information about the characters and their use.

Oracle Solaris 11 provides system-level support for the Unicode Standard Version 6.0 and ISO/IEC 10646:2011.

Each Unicode character is mapped to a code point, which is an integer between 0 and 1,114,111 (hexadecimal 10FFFF). Code points are written in the form U+nnnn, where nnnn is the code point's hexadecimal number, or identified by a text string that names the character. For example, the lowercase letter "a" can be represented by U+0061 or by the text string "LATIN SMALL LETTER A".
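The relationship between a character, its code point, and its name can be sketched in a few lines of Python. This is only an illustration, not part of Oracle Solaris: the helper name describe and the constant MAX_CODE_POINT are made up for this example, and the names come from the Unicode database bundled with the Python interpreter, which may be newer than Version 6.0.

```python
import unicodedata

# Highest valid Unicode code point: 1,114,111 decimal, 10FFFF hexadecimal.
MAX_CODE_POINT = 0x10FFFF


def describe(ch: str) -> str:
    """Return 'U+nnnn NAME' for a single character (hypothetical helper)."""
    code_point = ord(ch)  # integer code point of the character
    # At least four hexadecimal digits, as in the U+nnnn notation.
    return f"U+{code_point:04X} {unicodedata.name(ch)}"


print(describe("a"))  # U+0061 LATIN SMALL LETTER A
```

Passing an integer above MAX_CODE_POINT to chr() raises ValueError, which reflects the upper bound of the code point range.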

Code points can be encoded using different character encoding schemes. In Oracle Solaris Unicode locales, the UTF-8 form is used. UTF-8 is a variable-length encoding form of Unicode that preserves ASCII character code values transparently (see UTF-8 Overview).
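As a rough illustration of the variable-length property, the following Python sketch encodes four sample characters to UTF-8 and shows the resulting bytes. The sample characters and the helper name utf8_hex are chosen only for this example. Note that the ASCII letter "a" keeps its single-byte value 61 (hexadecimal), while characters outside ASCII take two, three, or four bytes.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as hex (hypothetical helper)."""
    return ch.encode("utf-8").hex(" ").upper()


samples = [
    "a",           # U+0061, ASCII: 1 byte, same value as in ASCII
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE: 2 bytes
    "\u20ac",      # U+20AC EURO SIGN: 3 bytes
    "\U0001F600",  # U+1F600 GRINNING FACE: 4 bytes
]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

# U+0061 -> 61
# U+00E9 -> C3 A9
# U+20AC -> E2 82 AC
# U+1F600 -> F0 9F 98 80
```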

For more details on the Unicode Standard and ISO/IEC 10646 and their various representative forms, refer to the following sources:

The Unicode Standard, Version 6.0 from the Unicode Consortium
ISO/IEC 10646:2011, Information Technology - Universal Multiple-Octet Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane
The Unicode Consortium web site