
International Language Environments Guide for Oracle® Solaris 11.3


Updated: December 2018
 
 

Working With Unicode

This section provides an introduction to Unicode and also discusses working with UTF-8.

Unicode Overview

Unicode is the universal character encoding standard for representing text in computer processing. Unicode provides a consistent way of encoding multilingual text and facilitates the exchange of international text files.

The international standard for encoding multilingual text is ISO/IEC 10646. Although the ISO/IEC 10646 and Unicode standards contain the same characters and code points, the Unicode standard provides additional information about the characters and their use.

Oracle Solaris 11.3 provides system-level support for the Unicode Standard Version 6.0 and ISO/IEC 10646:2011.

Code points can be encoded using different character encoding schemes. In Oracle Solaris Unicode locales, the UTF-8 form is used. UTF-8 is a variable-length encoding form of Unicode that preserves ASCII character code values transparently (see UTF-8 Overview).
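As a brief illustration of the variable-length property and the ASCII transparency described above, the following Python sketch counts the UTF-8 bytes used by each character of a string. The helper name utf8_lengths is hypothetical and chosen only for this example; it relies on Python's built-in UTF-8 codec, not on any Oracle Solaris interface.

```python
def utf8_lengths(text):
    """Return a list of (character, number of UTF-8 bytes) pairs."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_text = "Solaris"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII characters take two, three, or four bytes.
print(utf8_lengths("a\u00e9\u20ac\U0001F600"))
```

Only the first character in the sample string is in the ASCII range, so it is the only one stored in a single byte.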

Each Unicode character is mapped to a code point, which is an integer between 0 and 1,114,111 (hexadecimal 10FFFF). Unicode code points are referred to using notation in the form U+nnnn or U+nnnnnn, where the n characters together represent the code point's hexadecimal number, or by a text string describing the code point. For example, the lowercase letter “a” can be represented by U+0061 or the text string "LATIN SMALL LETTER A".
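The two ways of referring to a code point can be reproduced with a short Python sketch. It uses the standard unicodedata module; the function name describe is hypothetical, and the choice to pad code points above U+FFFF to six hexadecimal digits is an assumption made to match the U+nnnnnn form used in this section.

```python
import unicodedata


def describe(ch):
    """Return (U+ notation, character name) for a single character."""
    cp = ord(ch)
    # Four hex digits up to U+FFFF, six digits above (assumed convention).
    width = 4 if cp <= 0xFFFF else 6
    return f"U+{cp:0{width}X}", unicodedata.name(ch)


print(describe("a"))  # ('U+0061', 'LATIN SMALL LETTER A')
```

The character names come from the Unicode database bundled with the Python interpreter, which may correspond to a different Unicode version than the one supported by the operating system.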

For more details on the Unicode Standard and ISO/IEC 10646 and their various representative forms, refer to the following sources:

UTF-8 Overview

UTF-8 is a variable-length encoding form of Unicode. This form is used in Oracle Solaris Unicode locales.

The advantage of this form is that it is backward compatible with the ASCII encoding scheme and, because it is defined as a sequence of single bytes, avoids the complications of endianness and byte order. In UTF-8, each Unicode code point is represented by one to four 8-bit bytes. The following table specifies the bit distribution for UTF-8, showing the ranges of Unicode code points corresponding to one-byte, two-byte, three-byte, and four-byte sequences.

Table 7  Bit Distribution of UTF-8

Code Point Range      Code Point (binary)           1st Byte   2nd Byte   3rd Byte   4th Byte
U+0000..U+007F        0xxxxxxx                      0xxxxxxx
U+0080..U+07FF        00000yyy yyxxxxxx             110yyyyy   10xxxxxx
U+0800..U+FFFF        zzzzyyyy yyxxxxxx             1110zzzz   10yyyyyy   10xxxxxx
U+010000..U+10FFFF    000uuuuu zzzzyyyy yyxxxxxx    11110uuu   10uuzzzz   10yyyyyy   10xxxxxx
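
The bit distribution in the table can be sketched as a small Python encoder. This is a minimal illustration, not the implementation used by Oracle Solaris; the function name utf8_encode is hypothetical, and the rejection of surrogate code points (U+D800..U+DFFF) is an addition that the table does not show but that well-formed UTF-8 requires.

```python
def utf8_encode(cp):
    """Encode one code point into UTF-8 bytes, following the bit distribution table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        # Not shown in the table: surrogates are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x20AC).hex())  # e282ac (EURO SIGN, three bytes)
```

For any valid code point, the result can be checked against Python's built-in codec with chr(cp).encode("utf-8").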

For more details about the UTF-8 encoding form, refer to the following sources: