Oracle9i Globalization Support Guide
Release 1 (9.0.1)

Part Number A90236-02

2
Choosing a Character Set

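This chapter explains how to choose a character set. It covers character set encoding, choosing an Oracle database character set, changing the character set after database creation, and monolingual and multilingual database scenarios.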

Character Set Encoding

When computer systems process characters, they use numeric codes instead of the graphical representation of the character. For example, when the database stores the letter A, it actually stores a numeric code that is interpreted by software as that letter. These numeric codes are important in all databases. They are especially important when working in a global environment because of the need to convert between different character sets.

What is an Encoded Character Set?

An encoded character set is specified when you create a database. The choice of character set determines what languages can be represented in the database. This choice influences how you create the database schema and develop applications that process character data. It also influences interoperability with operating system resources and database performance.

A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as an encoded character set. An encoded character set assigns unique numeric codes to each character in the character repertoire. Table 2-1 shows examples of characters that are assigned a numeric code value.

Table 2-1 Encoded Characters in the ASCII Character Set
Character  Description       Code Value (Hexadecimal)
!          Exclamation Mark  21
#          Number Sign       23
$          Dollar Sign       24
1          Number 1          31
2          Number 2          32
3          Number 3          33
A          Uppercase A       41
B          Uppercase B       42
C          Uppercase C       43
a          Lowercase a       61
b          Lowercase b       62
c          Lowercase c       63
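
The mapping in Table 2-1 can be reproduced with a short Python sketch. Python and its "ascii" codec are used here only as a convenient stand-in; the helper name ascii_code_value is illustrative and is not part of any Oracle interface.

```python
def ascii_code_value(ch):
    """Return the ASCII code value of one character as two hexadecimal digits."""
    encoded = ch.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    return format(encoded[0], "02X")


# Print the characters from Table 2-1 next to their code values.
for ch in "!#$123ABCabc":
    print(ch, ascii_code_value(ch))
```

Software that reads the stored value 41 (hexadecimal) interprets it as the letter A only because it assumes the ASCII encoding.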

There are many different encoded character sets used throughout the computer industry. Oracle supports most national, international, and vendor-specific encoded character set standards. The complete list of character sets supported by Oracle is provided in Appendix A, "Locale Data". Character sets differ in the characters they include, the languages they can represent, the code values they assign to each character, and the encoding schemes they use.

These differences are discussed throughout this chapter.

Which Characters to Encode?

When you choose a character set, first decide what languages you wish to store in the database. The characters that are encoded in a character set depend on the writing systems that are represented.

Writing Systems

A writing system can be used to represent a language or group of languages. For the purposes of this book, writing systems can be classified into two categories: phonetic and ideographic.

Phonetic Writing Systems

Phonetic writing systems consist of symbols that represent different sounds associated with a language. Greek, Latin, Cyrillic, and Devanagari are all examples of phonetic writing systems based on alphabets. Note that alphabets can represent more than one language. For example, the Latin alphabet can represent many Western European languages such as French, German, and English.

Characters associated with a phonetic writing system (alphabet) can typically be encoded in one byte because the character repertoire is usually smaller than 256 characters.

Ideographic Writing Systems

Ideographic writing systems consist of ideographs or pictographs that represent the meaning of a word, not the sounds of a language. Chinese and Japanese are examples of ideographic writing systems that are based on tens of thousands of ideographs. Languages that use ideographic writing systems may use a syllabary as well. Syllabaries provide a mechanism for communicating phonetic information along with the pictographs when necessary. For instance, Japanese has two syllabaries: Hiragana, normally used for grammatical elements, and Katakana, normally used for foreign and onomatopoeic words.

Characters associated with an ideographic writing system typically must be encoded in more than one byte because the character repertoire has tens of thousands of characters.

Punctuation, Control Characters, Numbers, and Symbols

In addition to encoding the script of a language, other special characters need to be encoded, such as punctuation marks (for example, commas, periods, and apostrophes), numbers (for example, Arabic digits 0-9), special symbols (for example, currency symbols and math operators), and control characters for computers (for example, carriage returns, tabs, and NULL).

Writing Direction

Most Western languages are written left to right from the top to the bottom of the page. East Asian languages are usually written top to bottom from the right to the left of the page, though exceptions are frequently made for technical books translated from Western languages. Arabic and Hebrew are written right to left from the top to the bottom.

Another consideration is that numbers reverse direction in Arabic and Hebrew. So even though the text is written right to left, numbers within the sentence are written left to right. For example, "I wrote 32 books" would be written as "skoob 32 etorw I". Regardless of the writing direction, Oracle stores the data in logical order. Logical order means the order used by someone typing a language, not how it looks on the screen.
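
Logical order can be illustrated with a small Python sketch that is independent of Oracle. It assumes a Hebrew word followed by the number 32, held in the order in which the characters would be typed, and uses the standard unicodedata module to show the directional class of each character. The sample text and the helper name bidi_classes are illustrative.

```python
import unicodedata

# A Hebrew word (four letters), a space, and the digits 3 and 2,
# written as escapes in typing (logical) order.
text = "\u05E9\u05DC\u05D5\u05DD 32"


def bidi_classes(s):
    """Return the Unicode bidirectional class of each character in s."""
    return [unicodedata.bidirectional(ch) for ch in s]


# 'R' marks right-to-left letters, 'EN' marks European digits.
print(bidi_classes(text))
```

The stored sequence keeps the digit 3 before the digit 2. A display layer uses the directional classes to render the letters right to left while the number still reads left to right.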

How Many Languages Does a Character Set Support?

Different character sets support different character repertoires. Because character sets are typically based on a particular writing script, they can thus support different languages. When character sets were first developed in the United States, they had a limited character repertoire and even now there can be problems using certain characters across platforms. The following CHAR and VARCHAR characters are represented in all Oracle database character sets and transportable to any platform:

If you are using:

then take care that your data is in well-formed strings.

During conversion from one character set to another, Oracle expects CHAR and VARCHAR items to be well-formed strings encoded in the declared database character set. If you put other values into the string (for example, using the CHR or CONVERT function), the values may be corrupted when they are sent to a database with a different character set.
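
As a rough sketch of what "well-formed" means, the following Python function checks whether a byte string can be decoded in a declared character set. Python codec names stand in for Oracle character set names here, and the function is_well_formed is a hypothetical helper, not an Oracle facility.

```python
def is_well_formed(data, charset):
    """Return True if data is a valid byte sequence in the given character set."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# The single byte E4 is a complete character in ISO 8859-1,
# but it is only the start of a longer sequence in UTF-8.
print(is_well_formed(b"\xe4", "latin-1"))
print(is_well_formed(b"\xe4", "utf-8"))
```

A value that is not well-formed in the declared character set cannot be converted reliably, which is why such values may be corrupted when they are sent to a database with a different character set.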

If you are currently using only two or three well-established character sets, you may not have experienced any problems with character conversion. However, as your enterprise grows and becomes more global, problems may arise with such conversions. Therefore, Oracle Corporation recommends that you use Unicode databases and datatypes.

See Also:

Chapter 5, "Supporting Multilingual Databases with Unicode" 

ASCII Encoding

The ASCII and IBM EBCDIC character sets support a similar character repertoire, but assign different code values to some of the characters. Table 2-2 shows how ASCII is encoded. Row and column headings denote hexadecimal digits. To find the encoded value of a character, read the column number followed by the row number. For example, the value of the character A is 0x41.

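Table 2-2 7-Bit ASCII Coded Character Set
     0    1    2    3    4    5    6    7
0    NUL  DLE  SP   0    @    P    `    p
1    SOH  DC1  !    1    A    Q    a    q
2    STX  DC2  "    2    B    R    b    r
3    ETX  DC3  #    3    C    S    c    s
4    EOT  DC4  $    4    D    T    d    t
5    ENQ  NAK  %    5    E    U    e    u
6    ACK  SYN  &    6    F    V    f    v
7    BEL  ETB  '    7    G    W    g    w
8    BS   CAN  (    8    H    X    h    x
9    TAB  EM   )    9    I    Y    i    y
A    LF   SUB  *    :    J    Z    j    z
B    VT   ESC  +    ;    K    [    k    {
C    FF   FS   ,    <    L    \    l    |
D    CR   GS   -    =    M    ]    m    }
E    SO   RS   .    >    N    ^    n    ~
F    SI   US   /    ?    O    _    o    DEL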
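
The following Python sketch shows both points made above: how a code value splits into the column and row of Table 2-2, and how ASCII and EBCDIC assign different code values to the same character. It assumes Python's "cp037" codec as one representative EBCDIC code page; the helper names are illustrative.

```python
def code_value(ch, charset):
    """Return the code value of one character as two hexadecimal digits."""
    return format(ch.encode(charset)[0], "02X")


def table_position(ch):
    """Return the (column, row) headings of an ASCII character in Table 2-2."""
    value = ch.encode("ascii")[0]
    return format(value >> 4, "X"), format(value & 0x0F, "X")


print(table_position("A"))       # column digit first, then row digit
print(code_value("A", "ascii"))  # ASCII code value
print(code_value("A", "cp037"))  # EBCDIC code value (code page 037 assumed)
```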

Over the years, character sets evolved to support more than just monolingual English in order to meet the growing needs of users around the world. New character sets were quickly created to support other languages. Typically, these new character sets supported a group of related languages, based on the same script. For example, the ISO 8859 character set series was created to support different European languages.

Table 2-3 ISO 8859 Character Sets
Standard     Languages Supported
ISO 8859-1   Western European (Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish)
ISO 8859-2   Eastern European (Albanian, Croatian, Czech, English, German, Hungarian, Latin, Polish, Romanian, Slovak, Slovenian, Serbian)
ISO 8859-3   Southeastern European (Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, Turkish)
ISO 8859-4   Northern European (Danish, English, Estonian, Finnish, German, Greenlandic, Latin, Latvian, Lithuanian, Norwegian, Sámi, Slovenian, Swedish)
ISO 8859-5   Eastern European (Cyrillic-based: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian)
ISO 8859-6   Arabic
ISO 8859-7   Greek
ISO 8859-8   Hebrew
ISO 8859-9   Western European (Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Finnish, French, Frisian, Galician, German, Greenlandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish, Turkish)
ISO 8859-10  Northern European (Danish, English, Estonian, Faeroese, Finnish, German, Greenlandic, Icelandic, Irish Gaelic, Latin, Lithuanian, Norwegian, Sámi, Slovenian, Swedish)
ISO 8859-13  Baltic Rim (English, Estonian, Finnish, Latin, Latvian, Norwegian)
ISO 8859-14  Celtic (Albanian, Basque, Breton, Catalan, Cornish, Danish, English, Galician, German, Greenlandic, Irish Gaelic, Italian, Latin, Luxemburgish, Manx Gaelic, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish, Welsh)
ISO 8859-15  Western European (Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish)
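
Because each part of ISO 8859 reuses the same 8-bit code values for a different script, a byte has no meaning until the character set is known. The Python sketch below decodes one byte under three parts of the standard. The codec names are Python's, and the helper decode_in_parts is illustrative.

```python
def decode_in_parts(raw, parts):
    """Decode the same bytes under several ISO 8859 parts."""
    return {part: raw.decode(part) for part in parts}


# The byte E4 is a Latin letter in part 1, a Cyrillic letter in part 5,
# and a Greek letter in part 7.
result = decode_in_parts(b"\xe4", ("iso8859_1", "iso8859_5", "iso8859_7"))
for part, ch in result.items():
    print(part, ch)
```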

Character sets evolved and provided restricted multilingual support. They were restricted in the sense that they were limited to groups of languages based on similar scripts. More recently, there has been a push to remove boundaries and limitations on the character data that can be represented through the use of an unrestricted or universal character set. Unicode is one such universal character set that encompasses most major scripts of the modern world. The Unicode character set provides support for a character repertoire of approximately 49,000 characters and continues to grow.

How are Characters Encoded?

Different types of encoding schemes have been created by the computer industry. The character set you choose affects what kind of encoding scheme will be used. This is important because different encoding schemes have different performance characteristics, and these characteristics can influence your database schema and application development requirements. The character set you choose will typically use one of the following types of encoding schemes:

Single-Byte Encoding Schemes

Single-byte encoding schemes are the most efficient encoding schemes available. They take up the least amount of space to represent characters and are easy to process and program with because one character can be represented in one byte.

7-Bit Encoding Schemes

Single-byte 7-bit encoding schemes can define up to 128 characters and normally support just one language. One of the most common single-byte character sets, used since the early days of computing, is ASCII (American Standard Code for Information Interchange).

8-Bit Encoding Schemes

Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. One example is ISO 8859-1, which supports many Western European languages. Figure 2-1 illustrates a typical 8-bit encoding scheme.

Figure 2-1 8-Bit Encoding Schemes



Multibyte Encoding Schemes

Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese because these languages use thousands of characters. These schemes use either a fixed number of bytes to represent a character or a variable number of bytes per character.

Fixed-Width Multibyte Encoding Schemes

In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of n bytes, where n is greater than or equal to two.

Variable-Width Multibyte Encoding Schemes

A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that will represent a character. For example, if two bytes is the maximum number of bytes used to represent a character, the most significant bit can be toggled to indicate whether that byte is a single-byte character or the first byte of a double-byte character. In other schemes, control codes differentiate single-byte from double-byte characters. Another possibility is that a shift-out code is used to indicate that the subsequent bytes are double-byte characters until a shift-in code is encountered.
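
The first technique, in which the leading byte signals the width of the character, can be sketched in Python. Shift-JIS is chosen here as an assumed example of such a scheme; it is not named in the text above. The lead-byte ranges are those of Shift-JIS, and the helper names are illustrative.

```python
def is_sjis_lead_byte(value):
    """Return True if value is the first byte of a double-byte Shift-JIS character."""
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC


def split_characters(data):
    """Split Shift-JIS bytes into one byte string per character."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_sjis_lead_byte(data[i]) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


# Latin A, Hiragana A (U+3042), and the digit 1.
encoded = "A\u3042" "1".encode("shift_jis") if False else "A\u30421".encode("shift_jis")
print(split_characters(encoded))
```

The scanner never needs a character count in advance: every lead byte has its most significant bit set, so the width of each character is decided by looking at its first byte.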

Oracle's Naming Convention for Character Sets

Oracle uses the following naming convention for character set names:

<language_or_region><#_of_bits_representing_a_character><standard_name>[S | C]

Note that UTF8 and UTFE are exceptions to this naming convention.

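Some examples are US7ASCII, the U.S. 7-bit ASCII character set, and WE8ISO8859P1, the Western European 8-bit ISO 8859 Part 1 character set.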

The optional "S" or "C" at the end of the character set name is used to differentiate character sets that can be used only on the server (S) or only on the client (C).

On Macintosh platforms, the server character set should always be used. The Macintosh client character sets are obsolete. On EBCDIC platforms, if available, the "S" version should be used on the server and the "C" version on the client.
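
The naming convention can be explored with a small Python sketch that splits a character set name into its region, bit count, and standard name. The regular expression and the function parse_charset_name are illustrative assumptions, not an Oracle interface. The optional trailing S or C is deliberately left inside the standard name, because a final letter cannot be told apart from a suffix by pattern alone.

```python
import re

# <language_or_region><#_of_bits><standard_name>; suffix handling is omitted.
NAME_PATTERN = re.compile(r"^([A-Z]+)(\d+)(.+)$")


def parse_charset_name(name):
    """Split a character set name into its parts, or return None if it does not fit."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    region, bits, standard = match.groups()
    return {"region": region, "bits": int(bits), "standard": standard}


for name in ("US7ASCII", "WE8ISO8859P1", "AL16UTF16", "UTF8", "UTFE"):
    print(name, parse_charset_name(name))
```

UTF8 and UTFE do not fit the pattern, which matches their status as exceptions to the convention.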

Choosing an Oracle Database Character Set

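Oracle uses the database character set for data stored in SQL CHAR datatype columns (CHAR, VARCHAR2, CLOB, and LONG), for the names of database objects, and for SQL and PL/SQL source code.

When you choose an Oracle character set for the database, consider which languages the database must support now and in the future, how well the character set interoperates with operating system resources and applications, what the performance implications are, and which restrictions apply.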

Several character sets may meet your current language requirements, but you should consider future language requirements as well. If you know that you will need to expand support in the future for different languages, picking a character set with a wider range now will prevent the need for migration later. The Oracle character sets listed in Appendix A, "Locale Data" are named according to the languages and regions which are covered by a particular character set. In the case of regions covered, some character sets (for example, the ISO character sets) are also listed explicitly by language. You may want to see the actual characters that are encoded. Most character sets are based on national, international, or vendor product documentation, or are available in standards documents.

Interoperability with System Resources and Applications

While the database maintains and processes the actual character data, there are other resources that you must depend on from the operating system. For example, the operating system supplies fonts that correspond to the character set you have chosen. Input methods that support the desired languages and application software must also be compatible with a particular character set.

Ideally, a character set should be available on the operating system and be handled by your application to ensure seamless integration.

Character Set Conversion

If you choose a character set that is different from what is available on the operating system, the Oracle database can convert the operating system character set to the database character set. However, there is some character set conversion overhead, and you need to make sure that the operating system character set has an equivalent character repertoire to avoid data loss.

Character set conversions can sometimes cause data loss. For example, if you are converting from character set A to character set B, the destination character set B must contain every character in the repertoire of A. Any characters that are not available in character set B will be converted to a replacement character, which is most often a question mark (?) or a linguistically related character. For example, ä (a with an umlaut) will be converted to a. If you have distributed environments, consider using character sets with similar character repertoires to avoid loss of data.

Character set conversion may require copying strings between buffers multiple times before the data reaches the client. Therefore, if possible, use the same character sets for the client and the server to optimize performance.
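
The effect of a missing character in the destination character set can be sketched in Python. This is only an analogy: Python's "replace" error handler always substitutes a question mark and never a linguistically related character, so the substitution of a for ä described above is not reproduced. The helper names are illustrative.

```python
def convert_with_replacement(text, target):
    """Encode text in the target character set, replacing unsupported characters."""
    return text.encode(target, errors="replace")


def unsupported_characters(text, target):
    """List the characters of text that the target character set cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


word = "Stra\u00dfe"  # German word containing the letter sharp s
print(convert_with_replacement(word, "ascii"))
print(unsupported_characters(word, "ascii"))
print(unsupported_characters(word, "latin-1"))
```

Running a check like unsupported_characters over representative data before choosing a destination character set shows whether a conversion would lose information.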

See Also:

Chapter 10, "Character Set Scanner Utility" 

Database Schemas

By default, the character datatypes CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.

This works well if the database character set uses a single-byte character encoding scheme because the number of characters will be the same as the number of bytes. If the database character set uses a multibyte character encoding scheme, there is no such correspondence. That is, the number of bytes no longer equals the number of characters since a character can consist of one or more bytes. Thus, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters. You can overcome this problem by switching to character semantics when defining the column size.
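
The difference between byte length and character length can be seen in a short Python sketch. It assumes a column limited to 20 bytes and compares a single-byte encoding (ISO 8859-1) with a multibyte one (UTF-8); the helper names are illustrative.

```python
def byte_length(text, charset):
    """Return the number of bytes text occupies in the given character set."""
    return len(text.encode(charset))


def fits_in_column(text, byte_limit, charset):
    """Return True if text fits in a column sized in bytes."""
    return byte_length(text, charset) <= byte_limit


value = "\u00e9" * 20  # twenty copies of e with an acute accent

print(len(value), byte_length(value, "latin-1"), byte_length(value, "utf-8"))
print(fits_in_column(value, 20, "latin-1"))
print(fits_in_column(value, 20, "utf-8"))
```

The same 20 characters need twice as many bytes in UTF-8, so a column declared as 20 bytes holds them only when the database character set is single-byte.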

See Also:

Oracle9i Database Concepts for more information about character semantics 

Performance Implications

There can be different performance overheads in handling different encoding schemes, depending on the character set chosen. For best performance, you should try to choose a character set that avoids character set conversion and uses the most efficient encoding for the languages desired. Single-byte character sets offer better performance than multibyte character sets, and they are also the most efficient in terms of space requirements. However, single-byte character sets limit how many languages you can use.

Restriction

ASCII-based character sets are supported only on ASCII-based platforms. Similarly, you can use an EBCDIC-based character set only on EBCDIC-based platforms.

The database character set is used to identify SQL and PL/SQL source code. In order to do this, it must have either EBCDIC or 7-bit ASCII as a subset, whichever is native to the platform. Therefore, it is not possible to use a fixed-width, multibyte character set as the database character set. Currently, this restriction applies only to the AL16UTF16 character set.

Choosing an Oracle NCHAR Character Set

In some cases, you may wish to choose an alternate character set for the database because:

SQL NCHAR datatypes have been redefined to support Unicode data only. You can store the data in either UTF-8 or UTF-16 encodings.

See Also:

Chapter 5, "Supporting Multilingual Databases with Unicode" 

Restrictions on Character Sets Used to Express Names and Text

Table 2-4 lists the restrictions on the character sets that can be used to express names and other text in Oracle.

Table 2-4 Restrictions on Character Sets Used to Express Names and Text
Name                                                                        Single-Byte or Fixed-Width  Variable Width  Comments
column names                                                                Yes                         Yes
schema objects                                                              Yes                         Yes
comments                                                                    Yes                         Yes
database link names                                                         Yes                         No
database names                                                              Yes                         No
filenames (datafile, log file, control file, initialization parameter file)  Yes                         No
instance names                                                              Yes                         No
directory names                                                             Yes                         No
keywords                                                                    Yes                         No              Can be expressed in English ASCII or EBCDIC characters only
recovery manager filenames                                                  Yes                         No
rollback segment names                                                      Yes                         No              The ROLLBACK_SEGMENTS parameter does not support NLS
stored script names                                                         Yes                         Yes
tablespace names                                                            Yes                         No

For a list of supported string formats and character sets, including LOB data (LOB, BLOB, CLOB, and NCLOB), see Table 2-6.

The character encoding scheme used by the database is defined at database creation as part of the CREATE DATABASE statement. All SQL CHAR datatype columns (CHAR, CLOB, VARCHAR2, and LONG), including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. SQL NCHAR datatype columns (NCHAR, NCLOB, and NVARCHAR2) use the national character set.

After the database is created, the character set choices cannot be changed, with some exceptions, without re-creating the database. Hence, it is important to consider carefully which character sets to use. The database character set should always be a superset or equivalent of the client's operating system's native character set. The character sets used by client applications that access the database usually determine which superset is the best choice.

If all client applications use the same character set, then this is the normal choice for the database character set. When client applications use different character sets, the database character set should be a superset of all the client character sets. This ensures that every character is represented when converting from a client character set to the database character set.
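
Whether a candidate database character set covers all client character sets can be checked mechanically. The Python sketch below assumes single-byte client character sets, so that every possible byte value can simply be enumerated; multibyte client character sets would need a different enumeration. Python codec names stand in for Oracle character set names, and the helper names are illustrative.

```python
def repertoire(charset):
    """Return the set of characters defined by a single-byte character set."""
    chars = set()
    for value in range(256):
        try:
            chars.add(bytes([value]).decode(charset))
        except UnicodeDecodeError:
            pass  # this byte value is unassigned in the character set
    return chars


def can_represent(db_charset, client_charsets):
    """Return True if db_charset can encode every character of every client set."""
    for client in client_charsets:
        for ch in repertoire(client):
            try:
                ch.encode(db_charset)
            except UnicodeEncodeError:
                return False
    return True


clients = ["latin-1", "iso8859_7"]  # a Western European and a Greek client
print(can_represent("latin-1", clients))
print(can_represent("utf-8", clients))
```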

When a client application operates with a terminal that uses a different character set, then the client application's characters must be converted to the database character set, and vice versa. This conversion is performed automatically, and is transparent to the client application, except that the number of bytes for a character string may be different in the client character set and the database character set. The character set used by the client application is defined by the NLS_LANG parameter.

Summary of Datatypes and Supported Encoding Schemes

Table 2-5 lists the supported encoding schemes associated with different datatypes.

Table 2-5 Supported Encoding Schemes for Datatypes
Datatype   Single Byte  Multibyte Non-Unicode  Multibyte Unicode
CHAR       Yes          Yes                    Yes
VARCHAR2   Yes          Yes                    Yes
NCHAR      No           No                     Yes
NVARCHAR2  No           No                     Yes
BLOB       Yes          Yes                    Yes
CLOB       Yes          Yes                    Yes
LONG       Yes          Yes                    Yes
NCLOB      No           No                     Yes

Table 2-6 lists the supported datatypes associated with Abstract Data Types (ADT).

Table 2-6 Supported Datatypes for Abstract Datatypes
Abstract Datatype  CHAR  NCHAR  BLOB  CLOB  NCLOB
Object             Yes   No     Yes   Yes   No
Collection         Yes   No     Yes   Yes   No


Note:

BLOBs process characters as a series of byte sequences. The data is not subject to any NLS-sensitive operations. 


Changing the Character Set After Database Creation

In some cases, you may wish to change the existing database character set. For example, you may find that the number of languages that need to be supported in your database has increased. In most cases, you will need to do a full export/import to properly convert all data to the new character set. However, if, and only if, the new character set is a strict superset of the current character set, it is possible to use the ALTER DATABASE CHARACTER SET statement to expedite the change in the database character set.
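
The following Python sketch illustrates one way to test the superset condition. It assumes that "strict superset" means every character of the current character set exists in the new character set with exactly the same code value, so that stored bytes remain valid without conversion; this reading is an assumption and is not spelled out above. The check only covers single-byte current character sets, and the function is_binary_superset is illustrative.

```python
def is_binary_superset(old, new):
    """Return True if every character of old has the same byte value in new."""
    for value in range(256):
        raw = bytes([value])
        try:
            ch = raw.decode(old)
        except UnicodeDecodeError:
            continue  # byte value not assigned in the old character set
        try:
            if ch.encode(new) != raw:
                return False  # same character, different code value
        except UnicodeEncodeError:
            return False  # character missing from the new character set
    return True


print(is_binary_superset("ascii", "utf-8"))
print(is_binary_superset("ascii", "latin-1"))
print(is_binary_superset("latin-1", "utf-8"))
```

Under this assumption, UTF-8 can represent every ISO 8859-1 character, but not with the same byte values, so data stored in ISO 8859-1 would still have to be converted.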

See Also:

Chapter 10, "Character Set Scanner Utility" for more information about character set conversion 

Monolingual Database Scenario

The simplest example of an NLS database setup is when both the client and the server run in the same language environment and use the same character encoding. This monolingual scenario has the advantage of fast response because the overhead associated with character set conversion is avoided. Figure 2-2 illustrates this:

Figure 2-2 Monolingual Database Scenario



You can also use a multitier architecture, as illustrated in Figure 2-3:

Figure 2-3 Multitier Monolingual Database Scenario



Character Set Conversion

You may need to convert character sets in a client/server computing environment because a client application resides on a different computer platform from that of the server, and the two platforms do not use the same character encoding scheme. Character data passed between client and server must be converted between the two encoding schemes. Character conversion occurs automatically and transparently via Oracle Net.

You can convert between any two character sets, as shown in Figure 2-4:

Figure 2-4 Character Set Conversion



However, in cases where a target character set does not contain all characters in the source data, replacement characters are used. If, for example, a server uses US7ASCII and a German client WE8ISO8859P1, the German character ß is replaced with ? and ä is replaced with a.

Replacement characters may be defined for specific characters as part of a character set definition. When a specific replacement character is not defined, a default replacement character is used. To avoid the use of replacement characters when converting from client to database character set, the server character set should be a superset (or equivalent) of all the client character sets. In Figure 2-4, the server's character set was not chosen wisely. If German data is expected to be stored on the server, a character set that supports German letters, such as WE8ISO8859P1, is needed for both the server and the client.

In some variable-width multibyte cases, character set conversion may introduce noticeable overhead. You need to carefully evaluate your situation and choose character sets to avoid conversion as much as possible. Having the appropriate character set for the database and the client will avoid the overhead of character conversion, as well as possible data loss.

Multilingual Database Scenarios

Note that some character sets support multiple languages. This is typical when the languages have related writing systems or scripts. For example, Table 2-7 illustrates that WE8ISO8859P1 supports the following Western European languages:

Table 2-7 WE8ISO8859P1 Example
Catalan  Finnish  Icelandic  Portuguese
Danish   French   Italian    Spanish
Dutch    German   Norwegian  Swedish
English

WE8ISO8859P1 supports the languages above because they all use Latin-based scripts. This situation is called restricted multilingual support.

Restricted Multilingual Support

In Figure 2-5, both clients have access to the server's data, though the German client requires character conversion because it is using the WE8DEC character set.

Figure 2-5 Restricted Multilingual Support



Character conversion is necessary, but both French and German use Latin-based scripts, so you can use WE8ISO8859P1.

Unrestricted Multilingual Support

Often, unrestricted multilingual support is needed, and a universal character set such as Unicode is necessary as the server database character set. Unicode has two major encoding schemes: UTF-16 and UTF-8. UTF-16 is a two-byte fixed-width format; UTF-8 is a multibyte format with a variable width. The Oracle9i database provides support for UTF-8 as a database character set and both UTF-8 and UTF-16 as the national character set. This enhancement is transparent to clients who already provide support for multibyte character sets.
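
The difference in width between the two encoding schemes can be observed with a short Python sketch. The sample deliberately contains only characters that UTF-16 represents in two bytes (a Latin letter, an accented Latin letter, and an ideograph); the helper name widths is illustrative.

```python
def widths(text, charset):
    """Return the number of bytes each character of text occupies in charset."""
    return [len(ch.encode(charset)) for ch in text]


sample = "A\u00e4\u65e5"  # Latin A, a with an umlaut, and an ideograph

print(widths(sample, "utf-8"))      # variable width
print(widths(sample, "utf-16-be"))  # the same width for each sample character
```

UTF-8 is compact for ASCII text and grows for other scripts, whereas UTF-16 spends two bytes on each of these characters regardless of script.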

Character set conversion between a UTF-8 database and any single-byte character set introduces very little overhead. Conversion between UTF-8 and any multibyte character set has some overhead but there is no conversion loss problem except that some multibyte character sets do not support user-defined characters during character set conversion to and from UTF-8.

See Also:

Appendix A, "Locale Data" 

Figure 2-6 shows how a database can support many different languages. Here, Japanese, French, and German clients are all accessing the same database based on the Unicode character set. Note that each client accesses only data that it can process. If Japanese data were retrieved, modified, and stored by the German client, all Japanese characters would be lost during the character set conversion.
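
The loss described above can be imitated in Python by passing text through a client character set and back. ISO 8859-1 stands in for the German client's character set, and the question mark is Python's replacement character; the function round_trip_through_client is illustrative and does not model Oracle Net itself.

```python
def round_trip_through_client(text, client_charset):
    """Convert text to the client character set and back, replacing unsupported characters."""
    sent_to_client = text.encode(client_charset, errors="replace")
    return sent_to_client.decode(client_charset)


japanese = "\u65e5\u672c\u8a9e"  # three ideographs
german = "Stra\u00dfe"

print(round_trip_through_client(japanese, "latin-1"))
print(round_trip_through_client(german, "latin-1"))
```

The German text survives the round trip unchanged, while every Japanese character comes back as a replacement character and would overwrite the original data if stored.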

Figure 2-6 Unrestricted Multilingual Support Scenario



Figure 2-6 illustrates a Unicode solution for a client/server architecture. You can also use a multitier architecture, as illustrated in Figure 2-7.

Figure 2-7 Multitier Unrestricted Multilingual Support Scenario



Figure 2-7 illustrates a multitier Unicode solution. Using this all-UTF8 architecture, you eliminate the need for character conversion.

See Also:

Chapter 5, "Supporting Multilingual Databases with Unicode" 


Copyright © 1996-2001, Oracle Corporation.

All Rights Reserved.