Oracle8i National Language Support Guide Release 8.1.5 A67789-01
This chapter explains NLS topics that you need to know when choosing a character set. These topics are:
A character set is specified when creating a database, and your choice of character set will determine what languages can be represented in the database. This choice will influence how you create the database schema and develop applications that process character data. It will also influence interoperability with operating system resources and database performance.
When processing characters, computer systems handle character data as numeric codes rather than as their graphical representation. For instance, when the database stores the letter "A", it actually stores a numeric code that is interpreted by software as that letter.
A group of characters (e.g., alphabetic characters, ideographs, symbols, punctuation marks, control characters) can be encoded as a coded character set. A coded character set assigns a unique numeric code to each character in the character repertoire. The following table is an example of characters that are assigned a numeric code value.
There are many different coded character sets used throughout the computer industry and supported by Oracle. Oracle supports most national, international, and vendor-specific encoded character set standards. The complete list of character sets supported by Oracle is included in Appendix A, "Locale Data". Character sets differ in:
These differences will be discussed throughout this chapter.
The first choice to make when choosing a character set will be based on what languages you wish to store in the database. The characters that are encoded in a character set depend on the writing systems that will be represented.
A writing system can be used to represent a language or group of languages. For the purposes of this book, writing systems can be classified into two broad categories, phonetic and ideographic.
Phonetic writing systems consist of symbols which represent different sounds associated with a language. Greek, Latin, Cyrillic, and Devanagari are all examples of phonetic writing systems based on alphabets. Note that alphabets can represent more than one language. For example, the Latin alphabet can represent many Western European languages such as French, German, and English.
Characters associated with a phonetic writing system (alphabet) can typically be encoded in one byte since the character repertoire is usually smaller than 256 characters.
Ideographic writing systems, in contrast, consist of ideographs or pictographs that represent the meaning of a word, not the sounds of a language. Chinese and Japanese are examples of ideographic writing systems which are based on tens of thousands of ideographs. Languages that use ideographic writing systems may use a syllabary as well. Syllabaries provide a mechanism for communicating phonetic information along with the pictographs when necessary. For instance, Japanese has two syllabaries: hiragana, normally used for grammatical elements, and katakana, normally used for foreign and onomatopoeic words.
Characters associated with an ideographic writing system must typically be encoded in more than one byte because the character repertoire can be as large as tens of thousands of characters.
In addition to encoding the script of a language, other special characters need to be encoded such as punctuation marks (e.g., commas, periods, apostrophes), numbers (e.g., Arabic digits 0-9), special symbols (e.g., currency symbols, math operators) and control characters for computers (e.g., carriage returns, tabs, NULL).
Most Western languages are written left-to-right from the top to the bottom of the page. East Asian languages are usually written top-to-bottom from the right to the left of the page. Exceptions are frequently made for technical books translated from Western languages.
Another consideration is that numbers reverse direction in Arabic and Hebrew. So, even though the text is written right-to-left, numbers within the sentence are written left-to-right. For example, "I wrote 32 books" would be written as "skoob 32 etorw I". Irrespective of the writing direction, Oracle stores the data in logical order. Logical order means the order used by someone typing a language, not how it looks on the screen.
Different character sets support different character repertoires. Because character sets are typically based on a particular writing script, they can thus support different languages. When character sets were first developed in the United States, they had a limited character repertoire that incorporated:
For example, the ASCII and IBM EBCDIC character sets support the same character repertoire, but assign different code values to some of the characters. Table 3-2, "7-Bit ASCII Coded Character Set" shows how ASCII is encoded. Row and column headings denote hexadecimal digits. To find the encoded value of a character, read the column number followed by the row number. For example, the value of A is 0x41.
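As a small illustration of reading such a table, the sketch below splits a code value into its column (high hexadecimal digit) and row (low hexadecimal digit). It is written in Python purely for illustration; the helper name `ascii_table_position` is hypothetical and is not part of any Oracle tool.

```python
from typing import Tuple


def ascii_table_position(ch: str) -> Tuple[int, int]:
    """Return (column, row) of a 7-bit ASCII character in a 16-row code table."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    # The high hex digit selects the column, the low hex digit selects the row.
    return divmod(code, 16)


column, row = ascii_table_position("A")
print(f"column {column:X}, row {row:X} -> 0x{column:X}{row:X}")  # column 4, row 1 -> 0x41
```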
Over the years, character sets evolved to support more than just monolingual English in order to meet the growing needs of users around the world. New character sets were quickly created to support other languages. Typically, these new character sets supported a group of related languages, based on the same script. For example, the ISO 8859 character set series was created based on many national or regional standards to support different European languages.
Character sets evolved and provided restricted multilingual support, restricted in the sense that they were limited to groups of languages based on similar scripts.
More recently, there has been a push to remove boundaries and limitations on the character data that can be represented through the use of an unrestricted or universal character set. Unicode is one such universal character set that encompasses most major scripts of the modern and ancient world. The Unicode character set provides support for a character repertoire of approximately 39,000 characters and continues to grow.
Different types of encoding schemes have been created by the computer industry. These schemes have different performance characteristics, and can influence your database schema and application development requirements for handling character data, so you need to be aware of the characteristics of the encoding scheme used by the character set you choose. The character set you choose will typically use one of the following types of encoding schemes.
Single byte encoding schemes are the most efficient encoding schemes available. They take up the least amount of space to represent characters and are easy to process and program with because one character can be represented in one byte.
Single-byte 7-bit encoding schemes can define up to 128 characters, and normally support just one language. Two of the most popular single-byte character sets, used since the early days of computing, are ASCII (American Standard Code for Information Interchange) and US EBCDIC.
Single-byte 8-bit encoding schemes can define up to 256 characters, and often support a group of related languages. One example is ISO 8859-1, which supports many Western European languages.
 
   
Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese since these languages use thousands of characters. These schemes use either a fixed number of bytes to represent a character or a variable number of bytes per character.
In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of n bytes, where n>=2.
A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that will represent a character. For example, if two bytes is the maximum number of bytes used to represent a character, the most significant bit can be toggled to indicate whether that byte is part of a single-byte character or the second byte of a double-byte character. In other schemes, control codes differentiate single-byte from double-byte characters. Another possibility is that a shift-out code will be used to indicate that the following bytes are double-byte characters until a shift-in code is encountered.
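The bit-flag approach can be sketched as follows. This is a simplified model, not Oracle's implementation: it assumes a scheme with at most two bytes per character in which a byte with the most significant bit set starts a double-byte character. Python's `euc_jp` codec is used only to produce sample bytes, and the function name `split_characters` is hypothetical.

```python
def split_characters(data: bytes) -> list:
    """Split a byte string into characters for a simplified 1- or 2-byte scheme.

    Assumption: a byte with the most significant bit set starts a
    double-byte character; any other byte is a single-byte character.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1
        if i + width > len(data):
            raise ValueError("truncated double-byte character")
        chars.append(data[i:i + width])
        i += width
    return chars


# "A", one kanji (U+4E9C), "B": three characters stored in four bytes.
encoded = "A\u4e9cB".encode("euc_jp")
print(len(encoded), split_characters(encoded))
```

The byte count and the character count differ, which is why a scan like this is needed before any character-based operation on variable-width data.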
Oracle uses the following naming convention for character set names:
<language_or_region><#_of_bits_representing_a_char><standard_name>[S] [C] [FIXED]
For instance:
The optional "S" or "C" at the end of the character set name is sometimes used to help differentiate character sets that can only be used on the server (S) or client (C).
On Macintosh platforms, the server character set should always be used. The Macintosh client character sets are now obsolete. On EBCDIC platforms, if available, the 'S' version should be used on the server and the 'C' version on the client.
The optional "FIXED" at the end of the character set name is used to denote a fixed-width multibyte encoding.
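As a rough sketch of how the convention can be read mechanically, the following Python snippet splits a name into its parts. The function name and the regular expression are illustrative assumptions, not an Oracle API. The optional S/C suffix is deliberately left inside the standard name, because names such as JA16DBCS end in "S" without being server-only variants.

```python
import re

# <language_or_region><bits><standard_name>[FIXED]
_NAME = re.compile(r"([A-Z]+?)(\d+)([A-Z0-9]+?)(FIXED)?")


def parse_charset_name(name: str):
    """Return the parts of a character set name, or None if it does not fit the pattern."""
    m = _NAME.fullmatch(name)
    if m is None:
        return None
    region, bits, standard, fixed = m.groups()
    return {
        "region": region,
        "bits": int(bits),
        "standard": standard,
        "fixed_width": fixed is not None,
    }


for name in ("US7ASCII", "WE8ISO8859P1", "JA16EUCFIXED", "UTF8"):
    print(name, parse_charset_name(name))
```

Names that do not follow the convention, such as UTF8, simply return `None` here.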
Oracle uses the database character set for:
Four things you should consider when choosing an Oracle character set for the database are:
Several character sets may meet your current language requirements, but you should consider future language requirements as well. If you know that you will need to expand support in the future for different languages, picking a character set with a wider range now will obviate the need for migration later. The Oracle character sets listed in Appendix A, "Locale Data" are named according to the languages and regions which are covered by a particular character set. In the case of regions covered, some character sets, the ISO character sets for instance, are also listed explicitly by language. You may want to see the actual characters that are encoded in some cases. The actual code pages are not listed in this manual, however, since most are based on national, international, or vendor product documentation, or are available in standards documents.
While the database maintains and processes the actual character data, there are other resources that you must depend on from the operating system. For instance, the operating system supplies fonts that correspond to the character set you have chosen. Input methods that support the language(s) desired and application software must also be compatible with a particular character set.
Ideally, a character set should be available on the operating system and be handled by your application to ensure seamless integration.
If you choose a character set that is different from what is available on the operating system, Oracle can handle character set conversion from the database character set to the operating system character set. However, there is some character set conversion overhead, and you need to make sure that the operating system character set has an equivalent character repertoire to avoid any possible data loss.
Also note that character set conversions can sometimes cause data loss. For example, if you are converting from character set A to character set B, the destination character set (B) must have the same character set repertoire as A. Any characters that are not available in character set B will be converted to a replacement character, which is most often specified as "?" or a linguistically related character. For example, ä (a with an umlaut) will be converted to "a". If you have distributed environments, consider using character sets with similar character repertoires to avoid loss of data.
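The effect of a missing character in the destination repertoire can be reproduced outside the database. The sketch below uses Python's `ascii` and `latin-1` codecs as rough stand-ins for a 7-bit and an 8-bit Western European character set; this is an assumption for illustration only. Python always substitutes "?", whereas Oracle may substitute a linguistically related character as described above.

```python
text = "Stra\u00dfe, B\u00e4r"  # German text containing sharp s and a-umlaut

# The 7-bit target repertoire lacks both characters, so each is replaced by "?".
lossy = text.encode("ascii", errors="replace")
print(lossy)  # b'Stra?e, B?r'

# Converting back cannot restore the original characters: the loss is permanent.
print(lossy.decode("ascii") == text)  # False

# A target whose repertoire covers the text round-trips without loss.
print(text.encode("latin-1").decode("latin-1") == text)  # True
```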
Character set conversion may require copying strings between buffers several times within Oracle before the data reaches the client. Therefore, using the same character set for the client and the server avoids character set conversion, and thus optimizes performance.
The character datatypes CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.
This works out well if the database character set uses a single-byte character encoding scheme because the number of characters will be the same as the number of bytes. If the database character set uses a multibyte character encoding scheme, there will be no such correspondence. That is, the number of bytes will no longer equal the number of characters since a character can consist of one or more bytes. Thus, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters.
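A quick way to see the difference is to compare character counts with encoded byte counts. The sketch below is illustrative only: the helper `fits_in_column` is hypothetical, and Python's `euc_jp` and `utf-8` codecs stand in for multibyte database character sets.

```python
def fits_in_column(value: str, width_in_bytes: int, encoding: str) -> bool:
    """Check whether a string fits a column whose width is declared in bytes."""
    return len(value.encode(encoding)) <= width_in_bytes


word = "\u65e5\u672c\u8a9e" * 4  # 12 kanji characters

print(fits_in_column("A" * 20, 20, "ascii"))     # True: 20 characters, 20 bytes
print(fits_in_column(word, 20, "euc_jp"))        # False: 12 characters need 24 bytes
print(fits_in_column(word[:10], 20, "euc_jp"))   # True: 10 characters need 20 bytes
print(fits_in_column(word[:10], 20, "utf-8"))    # False: 10 characters need 30 bytes
```

With a byte-based width of 20, the number of characters that fit depends entirely on the encoding, so column widths should be sized for the worst case.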
There can be different performance overheads in handling different encoding schemes depending on the character set chosen. For best performance, you should try to choose a character set that avoids character conversion and uses the most efficient encoding for the languages desired. Single-byte character sets perform better than multi-byte character sets, and they are also the most efficient in terms of space requirements.
You cannot currently choose an Oracle database character set that is a fixed-width multibyte character set. In particular, the following character sets cannot be used as the database character set:
| JA16EUCFIXED | 
| ZHS16GBKFIXED | 
| JA16DBCSFIXED | 
| KO16DBCSFIXED | 
| ZHS16DBCSFIXED | 
| JA16SJISFIXED | 
| ZHT32TRISFIXED | 
In some cases, you may wish to have the ability to choose an alternate character set for the database because the properties of a different character encoding scheme may be more desirable for extensive character processing operations, or to facilitate ease-of-programming. In particular, the following data types can be used with an alternate character set:
Specifying an NCHAR character set allows you to specify an alternate character set from the database character set for use in NCHAR, NVARCHAR2, and NCLOB columns. This can be particularly useful for customers using a variable-width multibyte database character set because NCHAR has the capability to support fixed-width multibyte encoding schemes, whereas the database character set cannot. The benefits in using a fixed-width multibyte encoding over a variable-width one are:
When choosing an NCHAR character set, you must ensure that the NCHAR character repertoire is equivalent to or a subset of the database character set repertoire.
Note: all SQL commands will use the database character set, not the NCHAR character set. Therefore, literals can only be specified in the database character set.
When using the NCHAR, NVARCHAR2, and NCLOB data types, the width specification can be in terms of bytes or characters depending on the encoding scheme used. If the NCHAR character set uses a variable-width multibyte encoding scheme, the width specification refers to bytes. If the NCHAR character set uses a fixed-width multibyte encoding scheme, the width specification will be in characters. For example, NCHAR(20) using the variable-width multibyte character set JA16EUC will allocate 20 bytes, while NCHAR(20) using the fixed-width multibyte character set JA16EUCFIXED will allocate 40 bytes.
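The arithmetic behind this rule can be written out as a short sketch. The function below is hypothetical and only restates the rule from the text; the value of 2 bytes per character for JA16EUCFIXED is inferred from the 40-byte figure given above.

```python
def nchar_storage_bytes(declared_width: int, fixed_bytes_per_char=None) -> int:
    """Bytes allocated for NCHAR(declared_width) under the rule described in the text.

    fixed_bytes_per_char: bytes per character for a fixed-width set
    (assumed to be 2 for JA16EUCFIXED); None for a variable-width set.
    """
    if fixed_bytes_per_char is None:
        return declared_width  # the width counts bytes
    return declared_width * fixed_bytes_per_char  # the width counts characters


print(nchar_storage_bytes(20))      # variable-width JA16EUC: 20
print(nchar_storage_bytes(20, 2))   # fixed-width JA16EUCFIXED: 40
```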
Some string operations will be faster if you choose a fixed-width character set for the national character set. For instance, string-intensive operations such as the SQL LIKE operator used on a NCHAR fixed-width column will outperform LIKE operations on a multi-byte column. A possible usage scenario is as follows:
| Database Character Set | NCHAR Character Set | 
| JA16EUC | JA16EUCFIXED | 
Since SQL text can only be represented by the database character set, and not the NCHAR character set, you must choose an NCHAR character set whose character repertoire is equivalent to, or a subset of, that of the database character set.
There are several points to keep in mind when dealing with encoding schemes.
Because fixed-width multi-byte character sets are measured in characters but varying-width character sets are measured in bytes, be careful if you use a fixed-width multi-byte character set as your national character set on one platform and a varying-width character set on another platform.
For example, if you use %TYPE or a named type to declare an item on one platform using the declaration information of an item from the other platform, you might receive a constraint limit too small to support the data. For example, "NCHAR (10)" on the platform using the fixed-width multi-byte set will allocate enough space for 10 characters, but if %TYPE or the use of a named type creates a correspondingly typed item on the other platform, it will allocate only 10 bytes. Usually, this is not enough for 10 characters. To be safe, do one of the following:
Width specifications of the character datatypes CHAR and VARCHAR2 refer to bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.
If the database character set is single byte, the number of characters and number of bytes will be the same. If the database character set is multi-byte, there will in general be no such correspondence. A character can consist of one or more bytes, depending on the specific multi-byte encoding scheme and whether shift-in/shift-out control codes are present. Hence, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters.
When using the NCHAR and NVARCHAR2 data types, the width specification refers to characters if the national character set is fixed-width multi-byte. Otherwise, the width specification refers to bytes.
A separate performance issue is space efficiency (and thus speed) when using smaller-width character sets. These issues potentially trade off against each other when the choice is between a varying-width and a fixed-width character set.
Oracle allows you to name database objects.
Table 3-4 lists the restrictions on the character sets that can be used to express names and other text in Oracle.
For a list of supported string formats and character sets, including LOB data (LOB, BLOB, CLOB, and NCLOB), see Table 3-6.
The character encoding scheme used by the database is defined at database creation as part of the CREATE DATABASE statement. All data columns of type CHAR, CLOB, VARCHAR2, and LONG, including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. Data columns of type NCHAR, NCLOB, and NVARCHAR2 use the national character set.
Once the database is created, the character set choices cannot be changed without re-creating the database. Hence, it is important to consider carefully which character set(s) to use. The database character set should always be a superset or equivalent of the client's operating system's native character set. The character sets used by client applications that access the database will usually determine which superset is the best choice.
If all client applications use the same character set, then this is the normal choice for the database character set. When client applications use different character sets, the database character set should be a superset (or equivalent) of all the client character sets. This will ensure that every character is represented when converting from a client character set to the database character set.
When a client application operates with a terminal that uses a different character set, then the client application's characters must be converted to the database character set, and vice versa. This conversion is performed automatically, and is transparent to the client application. The character set used by the client application is defined by the NLS_LANG parameter. Similarly, the character set used for national character set data is defined by the NLS_NCHAR parameter.
Table 3-5 lists the supported encoding schemes associated with different datatypes.
| Data Type | Single-Byte | Multi-byte Varying Width | Multi-byte Fixed Width | 
|---|---|---|---|
| CHAR | Yes | Yes | No | 
| NCHAR | Yes | Yes | Yes | 
| BLOB | Yes | Yes | Yes | 
| CLOB | Yes | Yes | No | 
| NCLOB | Yes | Yes | Yes | 
Table 3-6 lists the supported data types associated with Abstract Data Types (ADT).
| Abstract DataType | CHAR | NCHAR | BLOB | CLOB | NCLOB | 
|---|---|---|---|---|---|
| Object | Yes | No | Yes | Yes | No | 
| Collection | Yes | No | Yes | Yes | No | 
Note: BLOBs process characters as a series of byte sequences. The data is not subject to any NLS-sensitive operations.
In some cases, you may wish to change the existing database character set. For instance, you may find that the number of languages that need to be supported in your database has increased. In most cases, you will need to do a full export/import to properly convert all data to the new character set. However, if and only if the new character set is a strict superset of the current character set, it is possible to use the ALTER DATABASE CHARACTER SET statement to expedite the change in the database character set.
The target character set is a strict superset if and only if each and every codepoint in the source character set is available in the target character set, with the same corresponding codepoint value. For instance the following migration scenarios can take advantage of the ALTER DATABASE CHARACTER SET command since US7ASCII is a strict subset of WE8ISO8859P1, AL24UTFFSS, and UTF8:
| Current Character Set | New Character Set | New Character Set is strict superset? |
|---|---|---|
| US7ASCII | WE8ISO8859P1 | yes |
| US7ASCII | AL24UTFFSS | yes |
| US7ASCII | UTF8 | yes |
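The strict-superset condition (same characters at the same code values) can be checked mechanically for single-byte sources. The sketch below is illustrative only: `is_binary_superset` is a hypothetical helper, and Python's `ascii`, `latin-1`, and `utf-8` codecs are assumed stand-ins for US7ASCII, WE8ISO8859P1, and UTF8.

```python
def is_binary_superset(source: str, target: str, code_values: range) -> bool:
    """True if every single-byte code value of `source` is encoded identically in `target`."""
    for value in code_values:
        raw = bytes([value])
        try:
            if raw.decode(source).encode(target) != raw:
                return False
        except UnicodeError:
            # The character is missing from one of the two repertoires.
            return False
    return True


print(is_binary_superset("ascii", "latin-1", range(128)))   # True
print(is_binary_superset("ascii", "utf-8", range(128)))     # True
print(is_binary_superset("latin-1", "utf-8", range(256)))   # False
```

The last case shows why a superset repertoire alone is not enough: UTF-8 contains every ISO 8859-1 character, but stores the upper half with different byte values, so the stored data would have to be converted rather than relabeled.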
WARNING: Attempting to change the database character set to a character set that is not a strict superset can result in data loss and data corruption. To ensure data integrity, whenever migrating to a new character set that is not a strict superset, you must use export/import. It is essential to do a full backup of the database before using the ALTER DATABASE [NATIONAL] CHARACTER SET statement, since the command cannot be rolled back. The syntax is:
ALTER DATABASE [<db_name>] CHARACTER SET <new_character_set>;
ALTER DATABASE [<db_name>] NATIONAL CHARACTER SET <new_NCHAR_character_set>;
The database name is optional. The character set name should be specified without quotes, for example:
ALTER DATABASE CHARACTER SET WE8ISO8859P1;
To change the database character set, perform the following steps. Not all of them are absolutely necessary, but they are highly recommended:
SQL> SHUTDOWN IMMEDIATE; -- or NORMAL
<do a full backup>
SQL> STARTUP MOUNT;
SQL> ALTER SYSTEM ENABLE RESTRICTED SESSION;
SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES=0;
SQL> ALTER DATABASE OPEN;
SQL> ALTER DATABASE CHARACTER SET <new_character_set_name>;
SQL> SHUTDOWN IMMEDIATE; -- or NORMAL
SQL> STARTUP;
To change the national character set, replace the ALTER DATABASE CHARACTER SET statement with ALTER DATABASE NATIONAL CHARACTER SET. You can issue both commands together if desired.
In some cases, you may wish to tailor a character set to meet specific user needs. In Oracle8i, users can extend an existing encoded character set definition to suit their needs. User-defined Characters (UDC) are often used to encode special characters representing.
This section describes how Oracle supports UDC. It describes:
User-defined characters are typically supported within East Asian character sets. These East Asian character sets have at least one range of reserved codepoints for use as user-defined characters. For example, Japanese Shift JIS reserves 1880 codepoints for UDC as follows:
The Oracle character sets listed below contain pre-defined ranges that allow you to support User Defined Characters:
The codepoint value that represents a particular character may vary among different character sets. For example, the Japanese kanji character 亜 is encoded as follows in different Japanese character sets:
| Character | Unicode | JA16SJIS | JA16EUC | JA16DBCS |
|---|---|---|---|---|
| 亜 | 0x4E9C | 0x889F | 0xB0A1 | 0x4867 |
In Oracle, all character sets are defined in terms of a Unicode 2.0 code point. That is, each character is defined as a Unicode 2.0 code value. Character conversion takes place transparently to users by using Unicode as the intermediate form. For example, when a JA16SJIS client connects to a JA16EUC database, the character 亜 (value 0x889F) entered from the JA16SJIS client is internally converted to Unicode (value 0x4E9C), then converted to JA16EUC (value 0xB0A1).
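The same two-step conversion can be sketched outside the database. The example below is a minimal illustration, not Oracle's conversion code: it assumes Python's `shift_jis` and `euc_jp` codecs as stand-ins for JA16SJIS and JA16EUC, and starts from the Unicode value given in the table above. The helper name `convert` is hypothetical.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert between two encodings using Unicode as the intermediate form."""
    return data.decode(source).encode(target)


kanji = chr(0x4E9C)                      # the Unicode value from the table above
sjis_bytes = kanji.encode("shift_jis")   # what a Shift JIS client would send
euc_bytes = convert(sjis_bytes, "shift_jis", "euc_jp")

print(sjis_bytes.hex(), "->", f"U+{ord(kanji):04X}", "->", euc_bytes.hex())
```

Because every character set is defined against the same intermediate form, only one mapping per character set is needed rather than one per pair of character sets.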
Unicode 2.0 reserves the range 0xE000-0xF8FF for the Private Use Area (PUA). The PUA is intended for private use character definition by end users or vendors.
UDC can be converted between two Oracle character sets by using the Unicode 2.0 PUA as the intermediate form, in the same way as standard characters.
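The size of the Private Use Area follows directly from the range given above. The short sketch below works it out and shows a membership test; the constant and function names are illustrative assumptions.

```python
import unicodedata

PUA_START, PUA_END = 0xE000, 0xF8FF


def is_private_use(ch: str) -> bool:
    """True if the character lies in the Private Use Area range quoted above."""
    return PUA_START <= ord(ch) <= PUA_END


print(PUA_END - PUA_START + 1)           # 6400 code points available
print(is_private_use("\ue000"))          # True
print(is_private_use("A"))               # False
print(unicodedata.category("\ue000"))    # Co (private use)
```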
UDC cross references between Japanese character sets, Korean character sets, Simplified Chinese character sets and Traditional Chinese character sets are contained in the following distribution sets:
${ORACLE_HOME}/ocommon/nls/demo/udc_ja.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_ko.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_zhs.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_zht.txt
These cross references are useful when registering User Defined Characters across operating systems. For example, when registering a new UDC on both a Japanese Shift-JIS operating system and a Japanese IBM Host operating system, you may want to pick 0xF040 on the Shift-JIS operating system and 0x6941 on the IBM Host operating system for the new UDC so that Oracle can convert correctly between JA16SJIS and JA16DBCS. The UDC cross reference shows that both the Shift-JIS UDC value 0xF040 and the IBM Host UDC value 0x6941 are mapped to the same Unicode PUA value, 0xE000.
For further details on how to customize a character set definition file, see Appendix B, "Customizing Locale Data".
The simplest example of an NLS database setup is as follows. Both the client and server are running with the same language environment, and are both using the same character encoding. The monolingual scenario has the advantage of fast response because the overhead associated with character set conversion is avoided.
 
   
Character set conversion is often necessary in a client/server computing environment where a client application may reside on a different computer platform from that of the server, and both platforms may not use the same character encoding schemes. Character data passed between client and server has to be converted between the two encoding schemes. Character conversion occurs automatically and transparently via Net8.
A conversion is possible between any two character sets. For example,
 
   
However, in cases where a target character set does not contain all characters in the source data, replacement characters must be used. If, for example, a server used US7ASCII and a German client WE8ISO8859P1, the German character ß would be replaced with ? and the character ä would be replaced with a.
Replacement characters may be defined for specific characters as part of a character set definition. Where a specific replacement character is not defined, a default replacement character is used. To avoid the use of replacement characters when converting from client to database character set, the server character set should be a superset (or equivalent) of all the client character sets. In the above example, the server's character set was not chosen wisely. If German data is expected to be stored on the server, a character set which supports German letters is needed, for example, WE8ISO8859P1 for both the server and the client.
In some varying-width multi-byte cases, character set conversion may introduce noticeable overhead. Users need to carefully evaluate their situation and choose character sets to avoid conversion as much as possible. Having the appropriate character set for the database and the client will avoid the overhead of character conversion, as well as any possible data loss.
Note that some character sets support multiple languages. For example, WE8ISO8859P1 supports the following Western European languages:
| Danish | Finnish | Italian | Swedish |
| Dutch | French | Norwegian | |
| English | German | Portuguese | |
| Faeroese | Icelandic | Spanish | |
This is because they are all based on a similar writing script. This situation is often called restricted multilingual support: restricted because the character set supports only a group of related languages. In this case, ISO 8859-1 supports Latin-based languages.
In the following graphic, both clients have access to the server's data.
 
   
Often, unrestricted multilingual support is needed, and a universal character set such as Unicode is necessary as the server database character set. Unicode has two major encoding schemes: UCS-2 and UTF-8. UCS-2 is a two-byte fixed-width format; UTF-8 is a multi-byte format with a variable width. Oracle8i provides support for the UTF-8 format. This enhancement is transparent to clients who already provide support for multi-byte character sets.
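The difference between the two formats is easy to see by counting bytes per character. In the sketch below, Python's `utf-16-be` codec is used as a stand-in for UCS-2; this is an assumption that holds only for characters in the Basic Multilingual Plane, which covers all three samples. The helper name `widths` is hypothetical.

```python
def widths(ch: str):
    """Return (UTF-8 byte count, two-byte-format byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


samples = {"Latin A": "\u0041", "a-umlaut": "\u00e4", "kanji": "\u4e9c"}
for label, ch in samples.items():
    utf8_len, fixed_len = widths(ch)
    print(f"{label}: UTF-8 {utf8_len} byte(s), fixed-width {fixed_len} bytes")
```

UTF-8 keeps ASCII text at one byte per character at the cost of three bytes for each kanji, while the fixed-width format spends two bytes on every character.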
Character set conversion between a UTF8 database and any single-byte character set introduces very little overhead. Conversion between UTF8 and any multi-byte character set has some overhead but there is no conversion loss problem.
The following diagram shows how a database can support many different languages. Here, Japanese, French, and German clients are all accessing the same database based on the Unicode character set.
