Bookshelf v7.7: Unicode Character Set

Unicode Character Set

To meet the needs of global operations, a number of software and hardware providers founded the Unicode Consortium and created the Unicode standard during the 1990s. The repertoire of this international character code for information processing includes characters for the major scripts of the world, as well as technical symbols in common use. A single 16-bit Unicode value can represent more than 65,000 characters, and the surrogate mechanism described under UTF-16 extends that range further. Unicode character encoding treats alphabetic characters, ideographic characters (such as Kanji), and symbols identically, which means that they can be used in any mixture with equal facility.

A number of competing encodings have emerged during this period. These encodings are storage formats only. They do not affect the assignment of characters to code points, which is one of the reasons that Unicode is so useful. The two most popular encodings today are UCS-2 and UTF-8.
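
The separation between code points and encodings can be illustrated with a short Python sketch. The sample character and the variable names are chosen for illustration only and do not come from any Siebel product. The built-in codec utf-16-be is used here as a stand-in for a 16-bit encoding.

```python
# Illustrative sketch: one code point, two different byte-level encodings.
ch = "é"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)

utf8_bytes = ch.encode("utf-8")       # two bytes: C3 A9
utf16_bytes = ch.encode("utf-16-be")  # two bytes: 00 E9

# Decoding either byte sequence yields the same character, because the
# encoding changes only the stored bytes, not the code point assignment.
same = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == ch

print(hex(code_point), utf8_bytes.hex(), utf16_bytes.hex(), same)
```

The stored bytes differ between the two encodings, but both decode to the same code point.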

UCS-2

UCS-2 stands for Universal Character Set - 2 Bytes. In this encoding, every character is represented by exactly two bytes, regardless of the script to which it belongs.
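
The fixed two-byte width can be sketched in Python as follows. Python has no built-in codec named UCS-2, so encode_ucs2 is a hypothetical helper written for this illustration. The big-endian byte order is also an assumption, not a statement about how any particular database stores its data.

```python
def encode_ucs2(text: str) -> bytes:
    """Hypothetical helper: encode text as big-endian UCS-2."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # A fixed two-byte encoding cannot hold values above 0xFFFF.
            raise ValueError(f"U+{cp:04X} does not fit in two bytes")
        out += cp.to_bytes(2, "big")
    return bytes(out)


sample = "Aé漢"  # ASCII letter, accented letter, Kanji
encoded = encode_ucs2(sample)

# Every character occupies two bytes, whatever its script.
print(len(sample), len(encoded), encoded.hex())
```

In this sketch, a character outside the 16-bit range raises an error, which is the limitation that UTF-16 addresses.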

The UCS-2 standard is supported on the Microsoft SQL Server database and on the IBM database, DB2 UDB.

UTF-8

UTF-8 stands for Unicode Transformation Format, 8-bit Encoding. It was initially developed to address the use of Unicode character data in 8-bit UNIX environments. UTF-8 is a variable-length encoding of Unicode that stores English (ASCII) text most efficiently, whereas data in other languages is expanded and can take up to four bytes per character. For example, English (ASCII) characters are stored using one byte per character, accented European characters use two bytes, and most Asian-language characters use three bytes. Characters outside the first 65,536 Unicode values use four bytes.
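
The byte counts described above can be checked with a minimal Python sketch. The sample characters and labels are illustrative choices, one for each length class.

```python
# Illustrative sketch: UTF-8 byte length for one character of each class.
samples = {
    "A": "English (ASCII) letter",
    "é": "accented European letter",
    "漢": "Asian ideograph (Kanji)",
    "\U00020000": "ideograph beyond the first 65,536 values",
}

for ch, label in samples.items():
    size = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {label}: {size} byte(s)")

# Collected in insertion order for a quick check.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
```

The four sample characters occupy one, two, three, and four bytes respectively.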

The UTF-8 standard is supported on some of the Oracle databases and on the IBM database, DB2 UDB.

NOTE: Although the UTF-8 standard is supported on IBM and Oracle databases, this does not imply that Siebel Systems supports deployment on these particular databases. For more information, see System Requirements and Supported Platforms.

UTF-16

UTF-16 is an extension of UCS-2 and may replace the UCS-2 standard in the future. It represents the most common characters as a single 16-bit unit, exactly as UCS-2 does, and all other characters as a pair of 16-bit units. This encoding was developed after it became clear that 16 bits would not be sufficient to represent all of the characters in use. UTF-16 can access approximately 63,000 characters as single Unicode 16-bit units and approximately one million additional characters through a mechanism known as surrogate pairs. For surrogate pairs, two ranges of Unicode code values are reserved for the high (first) and low (second) values of these pairs. High values are from 0xD800 to 0xDBFF, and low values are from 0xDC00 to 0xDFFF. The number of characters requiring surrogate pairs should be fairly limited, because the most common characters are already included in the first 65,536 values.
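
The surrogate pair mechanism can be sketched in Python as follows. The function name to_surrogate_pair and the sample code point are hypothetical and chosen for illustration. The arithmetic follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Hypothetical helper: split a code point above 0xFFFF into a pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000     # 20-bit value
    high = 0xD800 + (offset >> 10)    # upper 10 bits: 0xD800 to 0xDBFF
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits: 0xDC00 to 0xDFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's built-in UTF-16 codec.
codec_bytes = "\U0001F600".encode("utf-16-be")
manual_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The two reserved ranges hold 1,024 values each, which gives 1,024 x 1,024 = 1,048,576 additional characters.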

NOTE: Although the UTF-16 standard is supported on IBM and Oracle databases, this does not imply that Siebel Systems supports deployment on these particular databases. For more information, see System Requirements and Supported Platforms.