Working With Unicode

This section provides an introduction to Unicode and also discusses working with UTF-8.

Unicode Overview

Unicode is the universal character encoding standard for representing text in computer processing. It provides a consistent way to encode multilingual text and makes it easier to exchange text files internationally.

The international standard for coding multilingual text is ISO/IEC 10646. Although the ISO/IEC 10646 and Unicode standards contain the same characters and code points, the Unicode standard provides additional information about the characters and their use.

Oracle Solaris 11.3 provides system-level support for the Unicode Standard Version 6.0 and ISO/IEC 10646:2011.

Code points can be encoded using different character encoding schemes. In Oracle Solaris Unicode locales, the UTF-8 form is used. UTF-8 is a variable-length encoding form of Unicode that preserves ASCII character code values transparently (see UTF-8 Overview).
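As a small illustration of the ASCII-preserving, variable-length behavior described above, the following Python sketch encodes a few sample characters with Python's built-in UTF-8 codec. The sample characters and the helper name `utf8_length` are chosen for illustration only and are not part of Oracle Solaris.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# ASCII characters keep their code values as single bytes in UTF-8.
assert "a".encode("utf-8") == b"\x61"
assert "a".encode("utf-8") == "a".encode("ascii")

# Illustrative samples: Latin small a, e with acute, euro sign, grinning face.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")
```

The four samples occupy one, two, three, and four bytes respectively, which shows the variable-length nature of the encoding.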

Each Unicode character is mapped to a code point, which is an integer between 0 and 1,114,111. Unicode code points are referred to using notation in the form U+nnnn or U+nnnnnn, where the n characters together represent the code point's hexadecimal number, or by a text string describing the code point. For example, the lowercase letter "a" can be represented by U+0061 or by the text string "LATIN SMALL LETTER A".
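The relationship between a character, its code point notation, and its descriptive name can be sketched in Python using the standard `unicodedata` module. The helper name `describe` is hypothetical, and the sketch pads to a minimum of four hexadecimal digits, which is one common convention for the U+ notation.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+nnnn NAME' for a single character."""
    code_point = ord(ch)
    # At least four hex digits; larger code points simply use more digits.
    notation = f"U+{code_point:04X}"
    # Fall back to a placeholder for code points that have no name.
    name = unicodedata.name(ch, "<unnamed>")
    return f"{notation} {name}"


print(describe("a"))           # U+0061 LATIN SMALL LETTER A
print(describe("\U0001D11E"))  # U+1D11E MUSICAL SYMBOL G CLEF
print(0x10FFFF)                # 1114111, the largest code point
```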

For more details on the Unicode Standard and ISO/IEC 10646 and their various representative forms, refer to the following sources:

The Unicode Standard, Version 6.0 from the Unicode Consortium
ISO/IEC 10646:2011, Information Technology-Universal Multiple-Octet Character Set (UCS) – Part 1: Architecture and Basic Multilingual Plane
The Unicode Consortium web site

UTF-8 Overview

UTF-8 is a variable-length encoding form of Unicode. This form is used in Oracle Solaris Unicode locales.

The advantage of this form is that it is backward compatible with the ASCII encoding scheme and, because it is defined as a sequence of single bytes, avoids the complications of endianness. In UTF-8, each Unicode code point is represented by one to four 8-bit bytes. The following table specifies the bit distribution for UTF-8, showing the ranges of Unicode code points corresponding to one-byte, two-byte, three-byte, and four-byte sequences.

Table 7 Bit Distribution of UTF-8

Code Point Range	Code Point (binary)	1st Byte	2nd Byte	3rd Byte	4th Byte
`U+0000`..`U+007F`	`0xxxxxxx`	`0xxxxxxx`
`U+0080`..`U+07FF`	`00000yyy yyxxxxxx`	`110yyyyy`	`10xxxxxx`
`U+0800`..`U+FFFF`	`zzzzyyyy yyxxxxxx`	`1110zzzz`	`10yyyyyy`	`10xxxxxx`
`U+010000`..`U+10FFFF`	`000uuuuu zzzzyyyy yyxxxxxx`	`11110uuu`	`10uuzzzz`	`10yyyyyy`	`10xxxxxx`
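The bit distribution in the table above can be sketched as a small hand-written encoder in Python. This is a minimal illustration, not how Oracle Solaris implements UTF-8; the function name `utf8_encode` is hypothetical. The sketch also rejects the surrogate range U+D800..U+DFFF, which well-formed UTF-8 excludes even though it falls numerically inside the three-byte row.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the bit distribution table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Cross-check each row of the table against Python's built-in codec.
for cp in (0x61, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} ->", " ".join(f"{b:02X}" for b in encoded))
```

Comparing against the built-in codec is a simple way to confirm that the shifts and masks match each row of the table.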

For more details about the UTF-8 encoding form, refer to the following sources:

The Unicode Standard, Version 6.0, Chapter 3 (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf), Section 3.9 "Unicode Encoding Forms", pp. 93 - 94
The Unicode Consortium web site