Oracle8i National Language Support Guide
Release 2 (8.1.6)

Part Number A76966-01


3
Choosing a Character Set

This chapter explains NLS topics that you need to understand when choosing a character set. These topics are:

What is an Encoded Character Set?
Which Characters to Encode?
How Many Languages does a Character Set Support?
How are These Characters Encoded?
Oracle's Naming Convention for Character Sets
Tips on Choosing an Oracle Database Character Set
Tips on Choosing an Oracle NCHAR Character Set
Considerations for Different Encoding Schemes
Summary of Data Types and Supported Encoding Schemes
Changing the Character Set After Database Creation
Customizing Character Sets
Monolingual Database Example
Multilingual Database Example

What is an Encoded Character Set?

An encoded character set is specified when creating a database, and your choice of character set determines what languages can be represented in the database. This choice also influences how you create the database schema and develop applications that process character data. It also influences interoperability with operating system resources and database performance.

When processing characters, computer systems handle character data as numeric codes rather than as their graphical representation. For instance, when the database stores the letter "A", it actually stores a numeric code that is interpreted by software as that letter.
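For example, you can inspect the numeric code stored for a character with the ASCII and CHR SQL functions (a minimal sketch; on an EBCDIC-based database, ASCII returns the EBCDIC code value rather than 0x41):

-- ASCII() returns the numeric code of the first character of its argument
-- in the database character set; CHR() performs the reverse mapping.
SELECT ASCII('A') AS code_value,   -- 65 (0x41) in an ASCII-based database
       CHR(65)    AS stored_char   -- 'A'
FROM   dual;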

A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as a coded character set. A coded character set assigns a unique numeric code to each character in the character repertoire. Table 3-1 shows examples of characters and their assigned numeric code values.

Table 3-1 Encoded Characters in the ASCII Character Set

Character   Description        Code Value
!           Exclamation Mark   0x21
#           Number Sign        0x23
$           Dollar Sign        0x24
1           The Number 1       0x31
2           The Number 2       0x32
3           The Number 3       0x33
A           An Uppercase A     0x41
B           An Uppercase B     0x42
C           An Uppercase C     0x43
a           A Lowercase a      0x61
b           A Lowercase b      0x62
c           A Lowercase c      0x63

There are many different coded character sets used throughout the computer industry and supported by Oracle. Oracle supports most national, international, and vendor-specific encoded character set standards. The complete list of character sets supported by Oracle is included in Appendix A, "Locale Data". Character sets differ in the characters they make available (the character repertoire), the languages and scripts they can represent, and the encoding scheme used to map characters to numeric codes.

These differences are discussed throughout this chapter.

Which Characters to Encode?

The first consideration when choosing a character set is which languages you need to store in the database. The characters that are encoded in a character set depend on the writing systems that are represented.

Writing Systems

A writing system can be used to represent a language or group of languages. For the purposes of this book, writing systems can be classified into two broad categories, phonetic and ideographic.

Phonetic Writing Systems

Phonetic writing systems consist of symbols which represent different sounds associated with a language. Greek, Latin, Cyrillic, and Devanagari are all examples of phonetic writing systems based on alphabets. Note that alphabets can represent more than one language. For example, the Latin alphabet can represent many Western European languages such as French, German, and English.

Characters associated with a phonetic writing system (alphabet) can typically be encoded in one byte since the character repertoire is usually smaller than 256 characters.

Ideographic Writing Systems

Ideographic writing systems, in contrast, consist of ideographs or pictographs that represent the meaning of a word, not the sounds of a language. Chinese and Japanese are examples of ideographic writing systems that are based on tens of thousands of ideographs. Languages that use ideographic writing systems may use a syllabary as well. Syllabaries provide a mechanism for communicating phonetic information along with the pictographs when necessary. For instance, Japanese has two syllabaries, Hiragana, normally used for grammatical elements, and Katakana, normally used for foreign and onomatopoeic words.

Characters associated with an ideographic writing system must typically be encoded in more than one byte because the character repertoire can be as large as tens of thousands of characters.

Punctuation, Control Characters, Numbers, and Symbols

In addition to encoding the script of a language, other characters need to be encoded: punctuation marks (for example, commas, periods, and apostrophes), numbers (for example, the Arabic digits 0-9), special symbols (for example, currency symbols and math operators), and control characters for computers (for example, carriage returns, tabs, and NULL).

Writing Direction

Most Western languages are written left-to-right from the top to the bottom of the page. East Asian languages are usually written top-to-bottom from the right to the left of the page. Exceptions are frequently made for technical books translated from Western languages. Arabic and Hebrew are written right-to-left from the top to the bottom.

Another consideration is that numbers reverse direction in Arabic and Hebrew. So, even though the text is written right-to-left, numbers within the sentence are written left-to-right. For example, "I wrote 32 books" would be written as "skoob 32 etorw I". Irrespective of the writing direction, Oracle stores the data in logical order. Logical order means the order used by someone typing a language, not how it looks on the screen.

How Many Languages does a Character Set Support?

Different character sets support different character repertoires. Because character sets are typically based on a particular writing script, they support different languages. When character sets were first developed in the United States, they had a limited character repertoire; even now there can be problems using certain characters across platforms. The following CHAR and VARCHAR characters are representable in all Oracle database character sets and transportable to any platform:

If you are using

take care that your data is in well-formed strings.

During conversion from one character set to another, Oracle expects CHAR and VARCHAR items to be well-formed strings encoded in the declared database character set. If you put other values into the string (for example, using the CHR or CONVERT function), the values may be corrupted when they are sent to a database with a different character set.

If you are currently using only two or three well-established character sets, you may not have experienced any problems with character conversion. However, as your enterprise grows and becomes more global, problems may arise with such conversions. Therefore, Oracle Corporation recommends that you store any values other than well-formed strings in RAW columns rather than CHAR or VARCHAR columns.
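The following sketch illustrates that recommendation; the table and column names are hypothetical:

-- Keep well-formed character data in character columns and arbitrary
-- byte values in a RAW column, so the bytes are never run through
-- character set conversion.
CREATE TABLE app_messages (
  id        NUMBER,
  body_text VARCHAR2(200),   -- well-formed strings in the database character set
  body_raw  RAW(200)         -- raw byte values, passed through unchanged
);

INSERT INTO app_messages (id, body_text, body_raw)
VALUES (1, 'well-formed text', HEXTORAW('84FF00A1'));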

ASCII Encoding

The ASCII and IBM EBCDIC character sets support a similar character repertoire, but assign different code values to some of the characters. Table 3-2 shows how ASCII is encoded. Row and column headings denote hexadecimal digits. To find the encoded value of a character, read the column number followed by the row number. For example, the value of the character A is 0x41.

Table 3-2 7-Bit ASCII Coded Character Set

     0    1    2    3    4    5    6    7
0    NUL  DLE  SP   0    @    P    `    p
1    SOH  DC1  !    1    A    Q    a    q
2    STX  DC2  "    2    B    R    b    r
3    ETX  DC3  #    3    C    S    c    s
4    EOT  DC4  $    4    D    T    d    t
5    ENQ  NAK  %    5    E    U    e    u
6    ACK  SYN  &    6    F    V    f    v
7    BEL  ETB  '    7    G    W    g    w
8    BS   CAN  (    8    H    X    h    x
9    TAB  EM   )    9    I    Y    i    y
A    LF   SUB  *    :    J    Z    j    z
B    VT   ESC  +    ;    K    [    k    {
C    FF   FS   ,    <    L    \    l    |
D    CR   GS   -    =    M    ]    m    }
E    SO   RS   .    >    N    ^    n    ~
F    SI   US   /    ?    O    _    o    DEL

Over the years, character sets evolved to support more than just monolingual English in order to meet the growing needs of users around the world. New character sets were quickly created to support other languages. Typically, these new character sets supported a group of related languages, based on the same script. For example, the ISO 8859 character set series was created based on many national or regional standards to support different European languages.

Table 3-3 ISO 8859 Character Sets

Standard      Languages Supported
ISO 8859-1    Western European (Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish)
ISO 8859-2    Eastern European (Albanian, Croatian, Czech, English, German, Hungarian, Latin, Polish, Romanian, Slovak, Slovenian, Serbian)
ISO 8859-3    Southeastern European (Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, Turkish)
ISO 8859-4    Northern European (Danish, English, Estonian, Finnish, German, Greenlandic, Latin, Latvian, Lithuanian, Norwegian, Sámi, Slovenian, Swedish)
ISO 8859-5    Eastern European (Cyrillic-based: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian)
ISO 8859-6    Arabic
ISO 8859-7    Greek
ISO 8859-8    Hebrew
ISO 8859-9    Western European (Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Finnish, French, Frisian, Galician, German, Greenlandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish, Turkish)
ISO 8859-10   Northern European (Danish, English, Estonian, Faeroese, Finnish, German, Greenlandic, Icelandic, Irish Gaelic, Latin, Lithuanian, Norwegian, Sámi, Slovenian, Swedish)
ISO 8859-15   Western European (Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish)

Character sets evolved to provide restricted multilingual support: restricted in the sense that each set was limited to a group of languages based on similar scripts.

More recently, there has been a push to remove boundaries and limitations on the character data that can be represented through the use of an unrestricted or universal character set. Unicode is one such universal character set that encompasses most major scripts of the modern world. The Unicode character set provides support for a character repertoire of approximately 39,000 characters and continues to grow.

How are These Characters Encoded?

Different types of encoding schemes have been created by the computer industry. These schemes have different performance characteristics, and can influence your database schema and application development requirements for handling character data, so you need to be aware of the characteristics of the encoding scheme used by the character set you choose. The character set you choose will typically use one of the following types of encoding schemes.

Single-Byte Encoding Schemes

Single-byte encoding schemes are the most efficient encoding schemes available. They take up the least amount of space to represent characters and are easy to process and program with because one character is represented in one byte.

7-bit Encoding Schemes

Single-byte 7-bit encoding schemes can define up to 128 characters, and normally support just one language. One of the most common single-byte character sets, used since the early days of computing, is ASCII (American Standard Code for Information Interchange).

8-bit Encoding Schemes

Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. One example is ISO 8859-1, which supports many Western European languages.

Figure 3-1 8-Bit Encoding Schemes



Multibyte Encoding Schemes

Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese since these languages use thousands of characters. These schemes use either a fixed number of bytes to represent a character or a variable number of bytes per character.

Fixed-width Encoding Schemes

In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of n bytes, where n is greater than or equal to two.

Variable-width Encoding Schemes

A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that will represent a character. For example, if two bytes is the maximum number of bytes used to represent a character, the most significant bit can be toggled to indicate whether that byte is part of a single-byte character or the first byte of a double-byte character. In other schemes, control codes differentiate single-byte from double-byte characters. Another possibility is that a shift-out code will be used to indicate that the subsequent bytes are double-byte characters until a shift-in code is encountered.
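For example, in a database whose character set is a variable-width multibyte set such as UTF8, the DUMP function shows how many bytes each character occupies (a minimal sketch; the byte values shown in the comments assume a UTF8 database character set):

SELECT DUMP('a', 16) AS single_byte_char,   -- Typ=96 Len=1: 61
       DUMP('ä', 16) AS two_byte_char       -- Typ=96 Len=2: c3,a4 in UTF8
FROM   dual;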

Oracle's Naming Convention for Character Sets

Oracle uses the following naming convention for character set names:

<language_or_region><#_of_bits_representing_a_char><standard_name>[S] [C] 
[FIXED]

Note that UTF8 and UTFE are exceptions to this naming convention.

For instance, US7ASCII specifies United States (US), 7 bits per character, and the ASCII standard; WE8ISO8859P1 specifies Western Europe (WE), 8 bits per character, and ISO 8859 Part 1; JA16SJIS specifies Japanese (JA), 16 bits per character, and Shift-JIS.

The optional "S" or "C" at the end of the character set name is sometimes used to help differentiate character sets that can only be used on the server (S) or client (C).

On Macintosh platforms, the server character set should always be used. The Macintosh client character sets are now obsolete. On EBCDIC platforms, if available, the "S" version should be used on the server and the "C" version on the client.

The optional "FIXED" at the end of the character set name is used to denote a fixed-width multibyte encoding.

Tips on Choosing an Oracle Database Character Set

Oracle uses the database character set for data stored in CHAR, VARCHAR2, CLOB, and LONG columns (including the data dictionary) and for identifiers such as table, column, and other database object names.

Consider the following four points when choosing an Oracle character set for the database:

  1. What languages does the database need to support?

  2. Interoperability with system resources and applications

  3. Performance implications

  4. Restrictions

Several character sets may meet your current language requirements, but you should consider future language requirements as well. If you know that you will need to expand support for different languages in the future, choosing a character set with a wider range now will obviate the need for migration later. The Oracle character sets listed in Appendix A, "Locale Data", are named according to the languages and regions they cover. Some character sets that cover regions, such as the ISO character sets, are also listed explicitly by language. In some cases you may want to see the actual characters that are encoded. The code pages themselves are not listed in this manual, however, because most are based on national, international, or vendor product documentation, or are available in standards documents.

Interoperability with System Resources and Applications

While the database maintains and processes the actual character data, there are other resources that you must depend on from the operating system. For instance, the operating system supplies fonts that correspond to the character set you have chosen. Input methods that support the language(s) desired and application software must also be compatible with a particular character set.

Ideally, the character set you choose should be available on the operating system and handled by your application to ensure seamless integration.

Character Set Conversion

If you choose a character set that is different from what is available on the operating system, Oracle can handle character set conversion from the database character set to the operating system character set. However, there is some character set conversion overhead, and you need to make sure that the operating system character set has an equivalent character repertoire to avoid any possible data loss.

Also note that character set conversion can cause data loss. For example, if you are converting from character set A to character set B, the destination character set B must include the character repertoire of A. Any characters that are not available in character set B are converted to a replacement character, which is most often specified as "?", or to a linguistically related character. For example, ä (a with an umlaut) may be converted to "a". If you have distributed environments, consider using character sets with similar character repertoires to avoid loss of data.

Character set conversion may require copying strings between buffers multiple times before the data reaches the client. Therefore, if possible, using the same character sets for the client and the server can avoid character set conversion, and thus optimize performance.

Database Schema

The character datatypes CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.

This works out well if the database character set uses a single-byte character encoding scheme because the number of characters will be the same as the number of bytes. If the database character set uses a multibyte character encoding scheme, there is no such correspondence. That is, the number of bytes no longer equals the number of characters since a character can consist of one or more bytes. Thus, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters.
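For instance, the following sketch (the table and column names are hypothetical) allocates 20 bytes per value; with a multibyte database character set, those 20 bytes may hold fewer than 20 characters:

CREATE TABLE customers (
  last_name CHAR(20),       -- 20 bytes, not necessarily 20 characters
  notes     VARCHAR2(100)   -- up to 100 bytes of character data
);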

Performance Implications

There can be different performance overheads in handling different encoding schemes, depending on the character set chosen. For best performance, try to choose a character set that avoids character set conversion and uses the most efficient encoding for the languages desired. Single-byte character sets are more efficient for performance than multibyte character sets, and they are also the most efficient in terms of space requirements.

Restrictions

You cannot currently choose an Oracle database character set that is a fixed-width multibyte character set. In particular, the following character sets cannot be used as the database character set:

Table 3-4 Restricted Character Sets

JA16EUCFIXED       ZHS16GBKFIXED       JA16DBCSFIXED
KO16DBCSFIXED      ZHS16DBCSFIXED      JA16SJISFIXED
ZHT32TRISFIXED     KO16KSC5601FIXED    ZHS16CGB231280FIXED
ZHT32EUCFIXED      ZHT16BIG5FIXED      ZHT16DBCSFIXED

Tips on Choosing an Oracle NCHAR Character Set

In some cases, you may wish to choose an alternate character set for the database because the properties of a different character encoding scheme may be more desirable for extensive character processing operations or to facilitate ease of programming. In particular, the following data types can use an alternate character set: NCHAR, NVARCHAR2, and NCLOB.

Specifying an NCHAR character set allows you to use a character set other than the database character set in NCHAR, NVARCHAR2, and NCLOB columns. This can be particularly useful if you use a variable-width multibyte database character set, because NCHAR can support fixed-width multibyte encoding schemes, whereas the database character set cannot. The benefits of using a fixed-width multibyte encoding over a variable-width one are that column widths can be expressed in characters rather than bytes and that some string operations are faster, as described below.

When choosing an NCHAR character set, you must ensure that the NCHAR character repertoire is equivalent to or a subset of the database character set repertoire.

Note: All SQL text uses the database character set, not the NCHAR character set. Therefore, string literals can be specified only in the database character set.
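For example (a sketch; the table is hypothetical), the literal in the following INSERT is parsed in the database character set and converted to the NCHAR character set when it is stored:

CREATE TABLE product_names (
  name NVARCHAR2(50)   -- stored in the national (NCHAR) character set
);

-- The literal is expressed in the database character set and is converted
-- to the NCHAR character set on insert.
INSERT INTO product_names (name) VALUES ('Widget');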

Database Schema

When using the NCHAR, NVARCHAR2, and NCLOB data types, the width specification can be in terms of bytes or characters depending on the encoding scheme used. If the NCHAR character set uses a variable-width multibyte encoding scheme, the width specification refers to bytes. If the NCHAR character set uses a fixed-width multibyte encoding scheme, the width specification will be in characters. For example, NCHAR(20), using the variable-width multibyte character set JA16EUC, will allocate 20 bytes while NCHAR(20) using the fixed-width multibyte character set JA16EUCFIXED will allocate 40 bytes.
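As a sketch of the example above, the same column declaration allocates a different number of bytes depending on the national character set chosen at database creation:

-- With the variable-width national character set JA16EUC,
--   label NCHAR(20) allocates 20 bytes.
-- With the fixed-width national character set JA16EUCFIXED,
--   label NCHAR(20) allocates 20 characters, that is, 40 bytes.
CREATE TABLE kana_labels (
  label NCHAR(20)
);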

Performance Implications

Some string operations are faster when you choose a fixed-width character set for the national character set. For instance, string-intensive operations such as the SQL LIKE operator used on an NCHAR fixed-width column outperform LIKE operations on a multi-byte CHAR column. A possible usage scenario is as follows:

With a Database Character Set of ...    Use an NCHAR Character Set of ...

Recommendations

Because SQL text, such as literals in SQL statements, can be represented only in the database character set and not the NCHAR character set, you should choose an NCHAR character set whose character repertoire is equivalent to, or a subset of, the database character set's repertoire.

Considerations for Different Encoding Schemes

Keep the following points in mind when dealing with encoding schemes.

Be Careful when Mixing Fixed-Width and Varying-Width Character Sets

Because fixed-width multi-byte character sets are measured in characters, and varying-width character sets are measured in bytes, be careful if you use a fixed-width multi-byte character set as your national character set on one platform and a varying-width character set on another platform.

For example, if you use %TYPE or a named type to declare an item on one platform using the declaration information of an item from the other platform, the resulting item may be too small to hold the data. NCHAR(10) on the platform using the fixed-width multi-byte set allocates enough space for 10 characters, but if %TYPE or a named type creates a correspondingly typed item on the other platform, it allocates only 10 bytes, which is usually not enough for 10 characters. To be safe:

Storing Data in Multi-Byte Character Sets

Width specifications of the character datatypes CHAR and VARCHAR2 refer to bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.

If the database character set is single-byte and does not use composite characters, the number of characters and the number of bytes are the same. If the database character set is multi-byte, there is generally no such correspondence: a character can consist of one or more bytes, depending on the specific multi-byte encoding scheme and whether shift-in/shift-out control codes are present. Hence, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters.

A typical situation is when character elements are combined to form a single character. For example, o and an umlaut can be combined to form ö. In the Thai language, up to three separate character elements can be combined to form one character, and one Thai character would require up to 3 bytes when TH8TISASCII or another single-byte Thai character set is used. One Thai character would require up to 9 bytes when the UTF8 character set is used.
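You can compare character and byte counts with the LENGTH and LENGTHB functions (a minimal sketch; the table and column are hypothetical, and the byte counts depend on the database character set as described above):

SELECT LENGTH(description)  AS char_count,   -- number of characters
       LENGTHB(description) AS byte_count    -- number of bytes used to store them
FROM   thai_products;                        -- hypothetical table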

One Thai character consists of up to three separate character elements as shown in Figure 3-2, where two of the characters are comprised of three character elements.

Figure 3-2 Combining Characters



The lower row of Figure 3-2 shows nine Thai characters in their correct display format. Inside the database, these nine characters are stored as shown in the upper row, which also matches how they are represented in computer memory: they look like thirteen characters, but they are actually nine. Note that the upper row only illustrates how the characters are stored; it is not a correct way to display Thai characters.

When using the NCHAR and NVARCHAR2 data types, the width specification refers to characters when the national character set is fixed-width multi-byte. Otherwise, the width specification refers to bytes.

A separate performance issue is space efficiency (and thus speed): smaller-width character sets use less storage. These considerations can trade off against each other when choosing between a varying-width and a fixed-width character set.

Naming Database Objects

The database character set determines which characters you can use to name database objects.

Restrictions on Character Sets Used to Express Names and Text

Table 3-5 lists the restrictions on the character sets that can be used to express names and other text in Oracle.

Table 3-5 Restrictions on Character Sets Used to Express Names and Text

Name                         Single-Byte Fixed-Width   Varying-Width   Multi-Byte Fixed-Width   Comments
Column names                 Yes                       Yes             No
Schema objects               Yes                       Yes             No
Comments                     Yes                       Yes             No
Database link names          Yes                       No              No
Database names               Yes                       No              No
Filenames (datafile, logfile, controlfile, initialization parameter file)
                             Yes                       No              No
Instance names               Yes                       No              No
Directory names              Yes                       No              No
Keywords                     Yes                       No              No                       Can be expressed in English ASCII or EBCDIC characters only
Recovery Manager filenames   Yes                       No              No
Rollback segment names       Yes                       No              No                       The ROLLBACK_SEGMENTS parameter does not support NLS
Stored script names          Yes                       Yes             No
Tablespace names             Yes                       Yes             No

For a list of supported string formats and character sets, including LOB data (LOB, BLOB, CLOB, and NCLOB), see Table 3-7.

The character encoding scheme used by the database is defined at database creation as part of the CREATE DATABASE statement. All data columns of type CHAR, CLOB, VARCHAR2, and LONG, including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. Data columns of type NCHAR, NCLOB, and NVARCHAR2 use the national character set.
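For example, both character sets are specified in the CREATE DATABASE statement (a partial sketch; the database name and the character sets shown are only placeholders, and other clauses take their defaults):

CREATE DATABASE sales
  CHARACTER SET          WE8ISO8859P1    -- database character set
  NATIONAL CHARACTER SET JA16EUCFIXED;   -- national (NCHAR) character set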

After the database is created, the character set choices cannot be changed, with some exceptions, without re-creating the database. Hence, it is important to consider carefully which character set(s) to use. The database character set should always be a superset or equivalent of the client's operating system's native character set. The character sets used by client applications that access the database usually determine which superset is the best choice.

If all client applications use the same character set, then this is the normal choice for the database character set. When client applications use different character sets, the database character set should be a superset (or equivalent) of all the client character sets. This ensures that every character is represented when converting from a client character set to the database character set.

When a client application operates with a terminal that uses a different character set, then the client application's characters must be converted to the database character set, and vice versa. This conversion is performed automatically, and is transparent to the client application, except that the number of bytes for a character string may be different in the client character set and the database character set. The character set used by the client application is defined by the NLS_LANG parameter. Similarly, the character set used for national character set data is defined by the NLS_NCHAR parameter.

Summary of Data Types and Supported Encoding Schemes

Table 3-6 lists the supported encoding schemes associated with different data types.

Table 3-6 Supported Encoding Schemes for Data Types

Data Type   Single-Byte   Multi-Byte Varying-Width   Multi-Byte Fixed-Width
CHAR        Yes           Yes                        No
NCHAR       Yes           Yes                        Yes
BLOB        Yes           Yes                        Yes
CLOB        Yes           Yes                        No
NCLOB       Yes           Yes                        Yes

Table 3-7 lists the supported data types associated with Abstract Data Types (ADT).

Table 3-7 Supported Data Types for Abstract Data Types

Abstract Data Type   CHAR   NCHAR   BLOB   CLOB   NCLOB
Object               Yes    No      Yes    Yes    No
Collection           Yes    No      Yes    Yes    No


Note:

BLOB data is handled as a series of bytes, not characters. It is not subject to any NLS-sensitive operations.


Changing the Character Set After Database Creation

In some cases, you may wish to change the existing database character set. For instance, you may find that the number of languages that need to be supported in your database have increased. In most cases, you will need to do a full export/import to properly convert all data to the new character set. However, if, and only if, the new character set is a strict superset of the current character set, it is possible to use the ALTER DATABASE CHARACTER SET statement to expedite the change in the database character set.

The target character set is a strict superset if and only if each and every codepoint in the source character set is available in the target character set, with the same corresponding codepoint value. For instance, the following migration scenarios can take advantage of the ALTER DATABASE CHARACTER SET statement because US7ASCII is a strict subset of WE8ISO8859P1, ZHS16GBK, and UTF8:

Table 3-8 Sample Migration Scenarios

Current Character Set   New Character Set   New Character Set is Strict Superset?
US7ASCII                WE8ISO8859P1        Yes
US7ASCII                ZHS16GBK            Yes
US7ASCII                UTF8                Yes

Attempting to change the database character set to a character set that is not a strict superset can result in data loss and data corruption. To ensure data integrity, whenever migrating to a new character set that is not a strict superset, you must use export/import. It is essential to do a full backup of the database before using the ALTER DATABASE [NATIONAL] CHARACTER SET statement, since the command cannot be rolled back. The syntax is:

ALTER DATABASE [<db_name>] CHARACTER SET <new_character_set>;
ALTER DATABASE [<db_name>] NATIONAL CHARACTER SET <new_NCHAR_character_set>;

The database name is optional. The character set name should be specified without quotes, for example:

ALTER DATABASE CHARACTER SET WE8ISO8859P1;

To change the database character set, perform the following steps. Not all of them are absolutely necessary, but they are highly recommended:

SQL> SHUTDOWN IMMEDIATE;                         -- or NORMAL
    <do a full backup>

SQL> STARTUP MOUNT;
SQL> ALTER SYSTEM ENABLE RESTRICTED SESSION;     -- keep ordinary users out during the change
SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES=0;     -- stop background job queue processes
SQL> ALTER DATABASE OPEN;
SQL> ALTER DATABASE CHARACTER SET <new_character_set_name>;
SQL> SHUTDOWN IMMEDIATE;                         -- or NORMAL
SQL> STARTUP;

To change the national character set, replace the ALTER DATABASE CHARACTER SET statement with the ALTER DATABASE NATIONAL CHARACTER SET statement. You can issue both statements together if desired.

Customizing Character Sets

In some cases, you may wish to tailor a character set to meet specific user needs. In Oracle8i, users can extend an existing encoded character set definition to suit their needs. User-defined Characters (UDC) are often used to encode special characters representing:

This section describes how Oracle supports UDC. It covers character sets with user-defined characters, Oracle's character set conversion architecture, the Unicode 2.1 Private Use Area, and UDC cross references.

Character Sets with User-Defined Characters

User-defined characters are typically supported within East Asian character sets. These East Asian character sets have at least one range of reserved codepoints for use as user-defined characters. For example, Japanese Shift JIS reserves 1880 codepoints for UDC, as follows:

Table 3-9 Shift JIS Codepoint Example

Japanese Shift JIS UDC Range    Number of Codepoints
0xf040-0xf07e, 0xf080-0xf0fc    188
0xf140-0xf17e, 0xf180-0xf1fc    188
0xf240-0xf27e, 0xf280-0xf2fc    188
0xf340-0xf37e, 0xf380-0xf3fc    188
0xf440-0xf47e, 0xf480-0xf4fc    188
0xf540-0xf57e, 0xf580-0xf5fc    188
0xf640-0xf67e, 0xf680-0xf6fc    188
0xf740-0xf77e, 0xf780-0xf7fc    188
0xf840-0xf87e, 0xf880-0xf8fc    188
0xf940-0xf97e, 0xf980-0xf9fc    188

The Oracle character sets listed in Table 3-10 contain pre-defined ranges that allow you to support User Defined Characters:

Table 3-10 Oracle Character Sets with UDC

Character Set Name   Number of UDC Codepoints Available
JA16DBCS             4370
JA16DBCSFIXED        4370
JA16EBCDIC930        4370
JA16SJIS             1880
JA16SJISFIXED        1880
JA16SJISYEN          1880
KO16DBCS             1880
KO16DBCSFIXED        1880
KO16MSWIN949         1880
ZHS16DBCS            1880
ZHS16DBCSFIXED       1880
ZHS16GBK             2149
ZHS16GBKFIXED        2149
ZHT16DBCS            6204
ZHT16MSWIN950        6217

Oracle's Character Set Conversion Architecture

The codepoint value that represents a particular character may vary among different character sets. For example, the Japanese kanji character:

Figure 3-3 Kanji Example



is encoded as follows in different Japanese character sets:

Table 3-11 Kanji Example with Character Conversion

Character Set                              Unicode   JA16SJIS   JA16EUC   JA16DBCS
Character value of the kanji in Figure 3-3   0x4E9C    0x889F     0xB0A1    0x4867

In Oracle, all character sets are defined in terms of Unicode 2.1 code points; that is, each character is defined as a Unicode 2.1 code value. Character set conversion takes place transparently to users by using Unicode as the intermediate form. For example, when a JA16SJIS client connects to a JA16EUC database, the character shown in Figure 3-3, "Kanji Example", entered from the JA16SJIS client (value 0x889F) is internally converted to Unicode (value 0x4E9C) and then converted to JA16EUC (value 0xB0A1).

Unicode 2.1 Private Use Area

Unicode 2.1 reserves the range 0xE000-0xF8FF for the Private Use Area (PUA). The PUA is intended for private use character definition by end users or vendors.

UDC can be converted between two Oracle character sets by using Unicode 2.1 PUA as the intermediate form, the same as standard characters.

UDC Cross References

UDC cross references between Japanese character sets, Korean character sets, Simplified Chinese character sets and Traditional Chinese character sets are contained in the following distribution sets:

${ORACLE_HOME}/ocommon/nls/demo/udc_ja.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_ko.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_zhs.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_zht.txt

These cross references are useful when registering User Defined Characters across operating systems. For example, when registering a new UDC on both a Japanese Shift-JIS operating system and a Japanese IBM Host operating system, you may want to pick up 0xF040 on Shift-JIS operating system and 0x6941 on IBM Host operating system for the new UDC so that Oracle can convert correctly between JA16SJIS and JA16DBCS. You can find out that both Shift-JIS UDC value 0xF040 and IBM Host UDC value 0x6941 are mapped to the same Unicode PUA value 0xE000 in the UDC cross reference.

For further details on how to customize a character set definition file, see Appendix B, "Customizing Locale Data".

Monolingual Database Example

Same Character Set on the Client and the Server

This section describes the simplest example of an NLS database setup.

Both the client and server in Figure 3-4, "Monolingual Scenario", are running with the same language environment, and are both using the same character encoding. The monolingual scenario has the advantage of fast response because the overhead associated with character set conversion is avoided.

Figure 3-4 Monolingual Scenario



Character Set Conversion

Character set conversion is often necessary in a client/server computing environment where a client application may reside on a different computer platform from that of the server, and both platforms may not use the same character encoding schemes. Character data passed between client and server must be converted between the two encoding schemes. Character conversion occurs automatically and transparently via Net8.

A conversion is possible between any two character sets, as shown in Figure 3-5:

Figure 3-5 Character Set Conversion Example



However, in cases where a target character set does not contain all characters in the source data, replacement characters are used. If, for example, a server uses US7ASCII and a German client WE8ISO8859P1, the German character ß is replaced with ? and the character ä is replaced with a.

Replacement characters may be defined for specific characters as part of a character set definition. Where a specific replacement character is not defined, a default replacement character is used. To avoid the use of replacement characters when converting from client to database character set, the server character set should be a superset (or equivalent) of all the client character sets. In Figure 3-4, "Monolingual Scenario", the server's character set was not chosen wisely. If German data is expected to be stored on the server, a character set which supports German letters is needed, for example, WE8ISO8859P1 for both the server and the client.
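The CONVERT SQL function makes this behavior visible (a minimal sketch; the exact result depends on the replacement mappings defined for the character sets, but the substitutions shown match the example above):

-- Convert a WE8ISO8859P1 string to US7ASCII. ß has no US7ASCII
-- equivalent and becomes the default replacement character '?',
-- while ä falls back to the related character 'a'.
SELECT CONVERT('Gemäß', 'US7ASCII', 'WE8ISO8859P1') AS converted
FROM   dual;
-- Expected result, per the replacement rules described above: Gema?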

In some varying-width multi-byte cases, character set conversion may introduce noticeable overhead. Users need to carefully evaluate their situation and choose character sets to avoid conversion as much as possible. Having the appropriate character set for the database and the client will avoid the overhead of character conversion, as well as any possible data loss.

Multilingual Database Example

Note that some character sets support multiple languages. For example, WE8ISO8859P1 supports the following Western European languages:

Table 3-12 WE8ISO8859P1 Example

Catalan    Finnish     Italian      Swedish
Danish     French      Norwegian
Dutch      German      Portuguese
English    Icelandic   Spanish

WE8ISO8859P1 supports these languages because they are all based on a similar writing script. This situation is often called restricted multilingual support: the character set supports a group of related writing systems or scripts. In Table 3-12, WE8ISO8859P1 supports Latin-based scripts.

Restricted Multilingual Support

In Figure 3-6, both clients have access to the server's data.

Figure 3-6 Restricted Multilingual Support Example



Unrestricted Multilingual Support

Often, unrestricted multilingual support is needed, and a universal character set such as Unicode is necessary as the server database character set. Unicode has two major encoding schemes: UCS2 and UTF8. UCS2 is a two-byte fixed-width format; UTF8 is a multi-byte format with a variable width. Oracle8i provides support for the UTF8 format. This enhancement is transparent to clients who already provide support for multi-byte character sets.

Character set conversion between a UTF8 database and any single-byte character set introduces very little overhead. Conversion between UTF8 and any multi-byte character set has some overhead but there is no conversion loss problem except that some multi-byte character sets do not support user-defined characters during character set conversion to and from UTF8. See Appendix A, "Locale Data", for further information.

Figure 3-7, "Unrestricted Multilingual Support Example", shows how a database can support many different languages. Here, Japanese, French, and German clients are all accessing the same database based on the Unicode character set. Please note that each client accesses only data it can process. If Japanese data were retrieved, modified, and stored back by the German client, all Japanese characters would be lost during the character set conversion.

Figure 3-7 Unrestricted Multilingual Support Example




Copyright © 1996-2000, Oracle Corporation.

All Rights Reserved.
