UTF-8 Overview - International Language Environments Guide for Oracle® Solaris 11.2

`UTF-8` Overview

UTF-8 is a variable-length encoding form of Unicode. This form is used in Oracle Solaris Unicode locales.

The advantage of this form is that it is backward compatible with the ASCII encoding scheme and, because it is defined as a sequence of single bytes, avoids the complications of endianness (byte order). In UTF-8, each Unicode code point is represented by one to four 8-bit bytes. The following table specifies the bit distribution for UTF-8, showing the ranges of Unicode code points corresponding to one-byte, two-byte, three-byte, and four-byte sequences. The letters `x`, `y`, `z`, and `u` mark where each bit of the code point is placed in the encoded bytes.

Table 2-1 Bit Distribution of UTF-8

Code Point Range	Code Point (binary)	1st Byte	2nd Byte	3rd Byte	4th Byte
`U+0000`..`U+007F`	`0xxxxxxx`	`0xxxxxxx`
`U+0080`..`U+07FF`	`00000yyy yyxxxxxx`	`110yyyyy`	`10xxxxxx`
`U+0800`..`U+FFFF`	`zzzzyyyy yyxxxxxx`	`1110zzzz`	`10yyyyyy`	`10xxxxxx`
`U+010000`..`U+10FFFF`	`000uuuuu zzzzyyyy yyxxxxxx`	`11110uuu`	`10uuzzzz`	`10yyyyyy`	`10xxxxxx`
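
The bit distribution in Table 2-1 can be sketched as a small Python function. This is an illustrative sketch, not part of Oracle Solaris or its libraries: the name `utf8_encode` is hypothetical, and the rejection of surrogate code points (`U+D800`..`U+DFFF`) is an assumption taken from the Unicode Standard rather than from the table itself.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the bit layout of Table 2-1.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # Assumption from the Unicode Standard (not shown in the table):
    # surrogate code points have no UTF-8 representation.
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:
        # One byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110yyyyy 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:
        # Three bytes: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # Four bytes: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# One example per row of the table, checked against Python's built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Because every byte is produced in a fixed order from the most significant bits down, the result is the same on big-endian and little-endian machines, and code points up to `U+007F` come out as their plain ASCII byte.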

For more details about the UTF-8 encoding form, refer to the following sources:

The Unicode Standard, Version 6.0, Chapter 3 (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf), Section 3.9 “Unicode Encoding Forms”, pp. 93-94
The Unicode Consortium web site