International Language Environments Guide for Oracle Solaris 11.1     Oracle Solaris 11.1 Information Library
Unicode Overview

Unicode is the universal character encoding standard used to represent text for computer processing. Unicode provides a consistent way of encoding multilingual text and facilitates the exchange of international text files.

The standard for coding multilingual text is ISO/IEC 10646. Although the ISO/IEC 10646 and Unicode standards contain the same characters and code points, the Unicode standard provides additional information about the characters and their use.

Oracle Solaris 11 provides system-level support for the Unicode Standard Version 6.0 and ISO/IEC 10646:2011.

Each Unicode character is mapped to a code point, which is an integer between 0 and 1,114,111 (hexadecimal 10FFFF). Unicode code points are referred to using notation in the form U+nnnn, where nnnn is the code point's hexadecimal number, or by a text string describing the code point. For example, the lowercase letter "a" can be represented by U+0061 or by the text string "LATIN SMALL LETTER A".
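The U+nnnn notation and the descriptive names can be inspected from a script. The following is a minimal sketch in Python, which is not part of the original guide; the helper name `describe` is hypothetical, and the names come from the Unicode data bundled with the Python interpreter, which might be a different Unicode version than the one supported by the operating system.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+nnnn NAME' for a single character (hypothetical helper)."""
    code_point = ord(ch)  # the integer code point of the character
    # Code points are conventionally written with at least four hex digits.
    return f"U+{code_point:04X} {unicodedata.name(ch)}"


print(describe("a"))   # U+0061 LATIN SMALL LETTER A
print(0x10FFFF)        # 1114111, the largest code point
```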

Code points can be encoded using different character encoding schemes. In Oracle Solaris Unicode locales, the UTF-8 form is used. UTF-8 is a variable-length encoding form of Unicode that preserves ASCII character code values transparently (see UTF-8 Overview).
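As an illustration of the variable-length property and of ASCII preservation, the following Python sketch (not taken from the original guide; the sample characters and the helper name `utf8_hex` are arbitrary choices) prints the UTF-8 bytes of a few characters and checks that ASCII text encodes to identical bytes in both encodings.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as hex (hypothetical helper)."""
    return ch.encode("utf-8").hex(" ")


# One sample character for each UTF-8 sequence length.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

# ASCII text produces the same bytes in ASCII and in UTF-8.
text = "plain ASCII"
print(text.encode("utf-8") == text.encode("ascii"))
```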

For more details on the Unicode Standard and ISO/IEC 10646 and their various representative forms, refer to the following sources:

UTF-8 Overview

UTF-8 is a variable-length encoding form of Unicode. This form is used in Oracle Solaris Unicode locales.

The advantage of this form is that it is backward compatible with the ASCII encoding scheme and avoids the complications of byte order (endianness). In UTF-8, each Unicode code point is represented by one to four 8-bit bytes. The following table specifies the bit distribution for UTF-8, showing the ranges of Unicode code points corresponding to one-byte, two-byte, three-byte, and four-byte sequences.

Table 2-1 Bit Distribution of UTF-8

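| Code Point Range   | Code Point (binary)        | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|--------------------|----------------------------|----------|----------|----------|----------|
| U+0000..U+007F     | 0xxxxxxx                   | 0xxxxxxx |          |          |          |
| U+0080..U+07FF     | 00000yyy yyxxxxxx          | 110yyyyy | 10xxxxxx |          |          |
| U+0800..U+FFFF     | zzzzyyyy yyxxxxxx          | 1110zzzz | 10yyyyyy | 10xxxxxx |          |
| U+010000..U+10FFFF | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |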
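The bit distribution in Table 2-1 can be turned into a small encoder. The following Python sketch is an illustration only, not the implementation used by Oracle Solaris; the function name `utf8_encode` is hypothetical. It rejects the surrogate range U+D800..U+DFFF, which the table does not single out but which UTF-8 does not permit.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 following the bit layout of Table 2-1."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption beyond the table: surrogates are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # One byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # Two bytes: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # Three bytes: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x20AC).hex(" "))  # e2 82 ac
```

Comparing the result with Python's built-in `str.encode("utf-8")` for a code point at each range boundary is a quick way to check the sketch.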

For more details about the UTF-8 encoding form, refer to the following sources: