Understanding Encoding Schemes

About encoded character sets

Computers handle character data as numeric codes rather than as graphical representations. A group of characters and symbols that are typically used together can be mapped to numeric codes; the resulting mapping is an encoded character set. Two of the earliest widely used character sets were ASCII and EBCDIC, which are two different ways of encoding the letters A to Z and a to z, the digits 0 to 9, and various punctuation marks and other symbols commonly used in English.

For example, the letter "A" is represented in ASCII as 0x41, and the letter "a" is represented as 0x61. Although many encodings retain the basic ASCII codes, a different character set could use 0x41 and 0x61 to represent mathematical symbols, or escape sequences, or two unrelated letters of a different alphabet. Without knowing which encoding is in use, you cannot know for sure what character is represented by a particular number.
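The point can be illustrated with a short Python sketch. It assumes that Python's built-in "ascii" and "cp037" codecs are acceptable stand-ins for ASCII and for one common EBCDIC code page; the variable names are illustrative only.

```python
# Interpret the same byte value under two different encodings.
raw = bytes([0x41])

as_ascii = raw.decode("ascii")    # ASCII maps 0x41 to the letter "A"
as_ebcdic = raw.decode("cp037")   # EBCDIC code page 037 maps 0x41 to a different character

# Going the other way, the same letter receives a different numeric code.
a_in_ascii = "A".encode("ascii")    # one byte, 0x41
a_in_ebcdic = "A".encode("cp037")   # one byte, 0xC1

print(repr(as_ascii), repr(as_ebcdic))
print(a_in_ascii.hex(), a_in_ebcdic.hex())
```

The byte 0x41 is only meaningful once the encoding is known, which is why the encoding must travel with the data or be agreed on in advance.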

To make their products available beyond the English-speaking market, hardware and software manufacturers developed their own encodings for other languages and alphabets. In the short term, they succeeded in bringing their products to an international market. In the long term, these individual initiatives created a hodgepodge of hundreds of encodings that makes accurate communication across hardware platforms and software products a challenge.

What is Unicode?

Unicode is an emerging standard for encoding text in all of the written languages of the world. It provides a unique number for every character, regardless of language, program, or platform. Products that use Unicode, such as the Oracle RDBMS and OLAP Services, can be deployed across multiple platforms, languages, and countries without re-engineering, and can handle text in multiple languages without data loss.
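As a rough illustration of "a unique number for every character", the following Python sketch prints the Unicode code point and official character name for a few sample characters. The helper name describe and the choice of samples are assumptions made for this example; ord and unicodedata.name are part of the Python standard library.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point (U+XXXX) and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Samples from different scripts, written as escapes so the source stays ASCII.
samples = ["A", "\u00e9", "\u20ac", "\u4e2d"]

for ch in samples:
    print(describe(ch))
```

Each character has one code point regardless of the platform or program that handles it; how that number is stored as bytes is a separate question, answered by an encoding such as UTF-8.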

Use of UTF-8 encoding

Oracle supports Unicode using the UTF-8 standard, which is a format that transforms each Unicode character into a variable-length sequence of bytes. A character in the Basic Multilingual Plane occupies one, two, or three bytes in this encoding. Supplementary characters (those that UTF-16 represents as surrogate pairs) require four bytes. Unicode 3.0 and earlier define no such characters; Unicode 3.1 and later do.
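The variable-length behavior can be observed with a minimal Python sketch. It uses Python's standard "utf-8" codec, not any Oracle component, and the helper utf8_length is a name invented for this example. The fourth sample is a character outside the Basic Multilingual Plane, of the kind that UTF-16 represents as a surrogate pair.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

samples = [
    "A",           # U+0041, in the ASCII range
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U00010400",  # U+10400, a supplementary character
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), hex {encoded.hex()}")
```

The samples occupy one, two, three, and four bytes respectively, so the storage needed for a string depends on which characters it contains, not only on how many.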

Every ASCII character has the same single-byte representation in UTF-8, so existing English ASCII data does not require conversion.
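A quick way to check this compatibility is to encode the same text both ways and compare the bytes. The sample string below is hypothetical, and the sketch relies only on Python's standard codecs.

```python
# Hypothetical sample of existing English data.
ascii_text = "Total sales: 1,250 units"

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

# The two byte sequences are identical, so stored ASCII data is already valid UTF-8.
print(as_ascii == as_utf8)

# Bytes written as ASCII can be read back directly as UTF-8.
print(as_ascii.decode("utf-8") == ascii_text)
```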

Unicode also has a UTF-16 encoding, which is used by Java and Windows NT, the platforms on which many client applications run. Converting between UTF-8 (used by OLAP Services) and UTF-16 (used by client applications) is a direct algorithmic transformation that loses no data, and so does not noticeably affect performance.
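The conversion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the conversion code used by OLAP Services or by any client: the function names are invented here, and big-endian UTF-16 without a byte order mark is chosen arbitrarily.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 bytes to big-endian UTF-16 bytes (no byte order mark)."""
    return data.decode("utf-8").encode("utf-16-be")

def utf16_to_utf8(data: bytes) -> bytes:
    """Convert big-endian UTF-16 bytes back to UTF-8 bytes."""
    return data.decode("utf-16-be").encode("utf-8")

# Hypothetical sample mixing ASCII with two-byte and three-byte UTF-8 characters.
original = "Caf\u00e9 \u20ac5".encode("utf-8")

utf16 = utf8_to_utf16(original)
round_trip = utf16_to_utf8(utf16)

print(len(original), len(utf16))  # byte counts differ between the two encodings
print(round_trip == original)     # the round trip restores the original bytes
```

Because both encodings represent the same Unicode code points, the round trip restores the original bytes exactly; only the byte layout changes.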

Related information

Additional information about the Unicode standard is available in the Oracle9i Globalization and National Language Support Guide and at the Unicode Consortium Web site at www.unicode.org.