Bookshelf v8.0: Unicode Character Sets


Unicode Character Sets

This topic is part of About Supported Character Sets.

To meet the needs of global operations, a number of software and hardware providers started the Unicode Consortium and created the Unicode standard during the 1990s. The repertoire of this international character code for information processing includes characters for the major scripts of the world, as well as technical symbols in common use. The Unicode code space is divided into 17 planes of 65,536 code points each, for a total of 1,114,112 code points. Unicode character encoding treats alphabetic characters, ideographic characters (such as Kanji), and symbols identically, which means that they can be used in any mixture with equal facility.

The original Unicode standard (1.0) defined a 16-bit entity as the basic unit to represent a character. This standard became the basis of the UCS-2 encoding of Unicode, which specifies 16 bits per character, regardless of which language it may represent.

However, UCS-2 treats a byte whose 8 bits are all zero as valid data, whereas programs written in C interpret such a byte as the end of a string. Because most Web and communications software was written in C at the time the Unicode standard was introduced, an alternative encoding of Unicode called UTF-8 became popular. It encodes exactly the same set of characters, but avoids the null byte problem. To do this, it represents each character in a variable number of bytes: 1, 2, or 3, depending on the character.
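The null byte problem can be illustrated with a minimal Python sketch. The helper `c_string_length` is hypothetical and only mimics how a C routine such as `strlen` stops at the first zero byte. Python has no codec named UCS-2, so the sketch assumes `utf-16-be`, which produces the same bytes for characters that fit in 16 bits.

```python
def c_string_length(data: bytes) -> int:
    """Mimic C's strlen: count the bytes before the first zero byte."""
    end = data.find(b"\x00")
    return len(data) if end == -1 else end


text = "Siebel"

# Assumption: big-endian UTF-16 without a byte-order mark stands in for UCS-2.
ucs2_bytes = text.encode("utf-16-be")
utf8_bytes = text.encode("utf-8")

print(ucs2_bytes.hex())              # 005300690065006200650006c with zero bytes interleaved
print(c_string_length(ucs2_bytes))   # a C routine would see an empty string
print(c_string_length(utf8_bytes))   # the full length survives in UTF-8
```

In this sketch every ASCII letter in the 16-bit form carries a zero byte, so a C-style routine stops immediately, while the UTF-8 form contains no zero byte at all.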

Today the Unicode standard has advanced further, and has defined an extension mechanism to encode more than 16 bits' worth of information. This revised encoding is now referred to as UTF-16. UTF-8 has remained popular on the Web, and now uses up to four bytes per character to cover the Unicode extension mechanism. Today there are two forms of Unicode in active use, UTF-16 and UTF-8, and Siebel Business Applications use both of them.

For more information about databases and character sets supported by Siebel Business Applications, see Siebel System Requirements and Supported Platforms on Siebel SupportWeb.

UCS-2

UCS-2 stands for Universal Character Set - 2 Bytes. In this standard, every character is represented by exactly two bytes (16 bits), regardless of the language or script it belongs to.
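The fixed two-byte width can be checked with a short Python sketch. The function `ucs2_encode` is a hypothetical helper, not part of any library: Python ships no UCS-2 codec, so it assumes big-endian UTF-16 and rejects any character that does not fit in 16 bits.

```python
def ucs2_encode(text: str) -> bytes:
    """Encode text as big-endian UCS-2 (hypothetical helper)."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character does not fit in 16 bits")
    # For 16-bit characters, UTF-16 big-endian bytes match UCS-2.
    return text.encode("utf-16-be")


# An ASCII letter, an accented letter, and a Kanji character.
for ch in ["A", "\u00e9", "\u6f22"]:
    encoded = ucs2_encode(ch)
    print(f"U+{ord(ch):04X}: {len(encoded)} bytes ({encoded.hex()})")
```

Each of the three sample characters occupies two bytes, so a string of n such characters always occupies 2n bytes.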

UTF-8

UTF-8 stands for Unicode Transformation Format, 8-bit Encoding. UTF-8 is a variable-length encoding of Unicode that stores English (ASCII) text most efficiently, at one byte per character. Data in other languages takes more space and can require up to four bytes per character.

For example, English (ASCII) characters use one byte per character, accented European characters use two bytes, and most characters in Asian languages, such as Kanji, use three bytes. Characters outside the first 65,536 Unicode values use four bytes.
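These byte counts can be confirmed with a small Python sketch. The sample characters and the helper name `utf8_length` are illustrative choices, not taken from the Siebel documentation.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


samples = [
    ("A", "English (ASCII) letter"),
    ("\u00e9", "accented European letter"),
    ("\u6f22", "Kanji character"),
    ("\U0001F600", "character outside the first 65,536 values"),
]

for ch, description in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s), {encoded.hex()}")
```

The samples are written as escape sequences so that the accented letter is the single precomposed character U+00E9 rather than a letter followed by a combining accent.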

UTF-16

UTF-16 replaces the original UCS-2.

UTF-16 can represent approximately 63,000 characters as single 16-bit units and approximately one million additional characters (1,048,576) through a mechanism known as surrogate pairs. For surrogate pairs, two ranges of 16-bit values are reserved for the high (first) and low (second) values of these pairs. High values are from 0xD800 to 0xDBFF, and low values are from 0xDC00 to 0xDFFF. In practice, few characters require surrogate pairs, because the most common characters are already included in the first 64,000 values.
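The surrogate pair mechanism can be sketched in Python as follows. The function `to_surrogate_pair` is a hypothetical helper written for illustration. It applies the standard UTF-16 arithmetic: subtract 0x10000 from the code point, then place the upper 10 bits in the high range and the lower 10 bits in the low range.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above 0xFFFF into (high, low) surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # upper 10 bits: 0xD800 to 0xDBFF
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits: 0xDC00 to 0xDFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Capacity implied by the reserved ranges.
single_units = 0x10000 - (0xE000 - 0xD800)   # 16-bit values minus surrogates
pair_combinations = 0x400 * 0x400            # 1,024 high values x 1,024 low values
print(single_units, pair_combinations)
```

The two capacity figures at the end come directly from the reserved ranges: 2,048 of the 65,536 16-bit values are set aside as surrogates, and each of the 1,024 high values can combine with each of the 1,024 low values.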

Siebel Global Deployment Guide		Copyright © 2007, Oracle. All rights reserved.