Oracle9i Globalization Support Guide Release 1 (9.0.1) Part Number A90236-02
This chapter explains how to choose a character set. It includes the following topics:
When computer systems process characters, they use numeric codes instead of the graphical representation of the character. For example, when the database stores the letter A, it actually stores a numeric code that is interpreted by software as that letter. These numeric codes are important in all databases. They are especially important when working in a global environment because of the need to convert between different character sets.
An encoded character set is specified when you create a database. The choice of character set determines what languages can be represented in the database. This choice influences how you create the database schema and develop applications that process character data. It also influences interoperability with operating system resources and database performance.
A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as an encoded character set. An encoded character set assigns unique numeric codes to each character in the character repertoire. Table 2-1 shows examples of characters that are assigned a numeric code value.
There are many different encoded character sets used throughout the computer industry. Oracle supports most national, international, and vendor-specific encoded character set standards. The complete list of character sets supported by Oracle appears in Appendix A, "Locale Data". Character sets differ in the following ways:
These differences are discussed throughout this chapter.
When you choose a character set, first decide what languages you wish to store in the database. The characters that are encoded in a character set depend on the writing systems that are represented.
A writing system can be used to represent a language or group of languages. For the purposes of this book, writing systems can be classified into two categories: phonetic and ideographic.
Phonetic writing systems consist of symbols that represent different sounds associated with a language. Greek, Latin, Cyrillic, and Devanagari are all examples of phonetic writing systems based on alphabets. Note that alphabets can represent more than one language. For example, the Latin alphabet can represent many Western European languages such as French, German, and English.
Characters associated with a phonetic writing system (alphabet) can typically be encoded in one byte because the character repertoire is usually smaller than 256 characters.
Ideographic writing systems consist of ideographs or pictographs that represent the meaning of a word, not the sounds of a language. Chinese and Japanese are examples of ideographic writing systems that are based on tens of thousands of ideographs. Languages that use ideographic writing systems may use a syllabary as well. Syllabaries provide a mechanism for communicating phonetic information along with the pictographs when necessary. For instance, Japanese has two syllabaries: Hiragana, normally used for grammatical elements, and Katakana, normally used for foreign and onomatopoeic words.
Characters associated with an ideographic writing system typically must be encoded in more than one byte because the character repertoire has tens of thousands of characters.
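This size difference can be seen directly with any modern codec library. The following sketch (using Python's built-in codecs, not Oracle itself) shows that a Latin letter fits in one byte in a single-byte character set, while an ideograph needs two or more bytes:

```python
# Phonetic-script characters typically fit in one byte in a single-byte
# encoding; ideographs need multiple bytes in any encoding that holds them.
latin = "A"        # Latin letter: representable in ISO 8859-1
ideograph = "語"   # Japanese ideograph: not representable in ISO 8859-1

print(len(latin.encode("iso8859-1")))      # 1 byte
print(len(ideograph.encode("utf-8")))      # 3 bytes in UTF-8
print(len(ideograph.encode("shift_jis")))  # 2 bytes in Shift-JIS
```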
In addition to the script of a language, other characters need to be encoded: punctuation marks (for example, commas, periods, and apostrophes), numbers (for example, the Arabic digits 0-9), special symbols (for example, currency symbols and math operators), and control characters for computers (for example, carriage returns, tabs, and NULL).
Most Western languages are written left to right from the top to the bottom of the page. East Asian languages are usually written top to bottom from the right to the left of the page, though exceptions are frequently made for technical books translated from Western languages. Arabic and Hebrew are written right to left from the top to the bottom.
Another consideration is that numbers reverse direction in Arabic and Hebrew. So even though the text is written right to left, numbers within the sentence are written left to right. For example, "I wrote 32 books" would be written as "skoob 32 etorw I". Regardless of the writing direction, Oracle stores the data in logical order. Logical order means the order used by someone typing a language, not how it looks on the screen.
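Logical order can be illustrated with the Unicode character database, which records a bidirectional class for every character; display order is derived from these classes, while storage stays in typing order. A small sketch using Python's `unicodedata` module (the Hebrew sample word is illustrative):

```python
import unicodedata

# Bidirectional text is stored in logical (typing) order; the display
# direction comes from each character's bidirectional class.
text = "ספר 32"  # a Hebrew word followed by a number, stored logically
classes = [unicodedata.bidirectional(ch) for ch in text]
print(classes)  # Hebrew letters report 'R' (right-to-left); digits report 'EN'
```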
Different character sets support different character repertoires. Because character sets are typically based on a particular writing script, they support different languages. When character sets were first developed in the United States, they had a limited character repertoire, and even now there can be problems using certain characters across platforms. The following CHAR and VARCHAR characters are represented in all Oracle database character sets and can be transported to any platform:
% ` ' ( ) * + - , . / \ : ; < > = ! _ & ~ { } | ^ ? $ # @ " [ ]
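A quick check (using Python codecs, not Oracle itself) confirms that the characters listed above have the same one-byte encoding across common ASCII-based character sets. Note that EBCDIC platforms assign different code values, which is why conversion is still needed there:

```python
# Verify the portable characters encode to one identical byte in each
# ASCII-based encoding tested here.
portable = '%`\'()*+-,./\\:;<>=!_&~{}|^?$#@"[]'
for enc in ("ascii", "iso8859-1", "utf-8"):
    assert all(len(ch.encode(enc)) == 1 for ch in portable)
assert all(ch.encode("utf-8") == ch.encode("ascii") for ch in portable)
print("all portable characters verified")
```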
If you use characters outside this set, take care that your data is in well-formed strings. During conversion from one character set to another, Oracle expects CHAR and VARCHAR items to be well-formed strings encoded in the declared database character set. If you put other values into the string (for example, using the CHR or CONVERT function), the values may be corrupted when they are sent to a database with a different character set.
If you are currently using only two or three well-established character sets, you may not have experienced any problems with character conversion. However, as your enterprise grows and becomes more global, problems may arise with such conversions. Therefore, Oracle Corporation recommends that you use Unicode databases and datatypes.
The ASCII and IBM EBCDIC character sets support a similar character repertoire, but assign different code values to some of the characters. Table 2-2 shows how ASCII is encoded. Row and column headings denote hexadecimal digits. To find the encoded value of a character, read the column number followed by the row number. For example, the value of the character A is 0x41.
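The table-lookup rule can be confirmed with any language's character functions; here is a sketch in Python:

```python
# Reading Table 2-2: take the column digit, then the row digit, to get the
# hexadecimal code value. 'A' is in column 4, row 1, so its value is 0x41.
assert ord("A") == 0x41
assert ord("a") == 0x61      # column 6, row 1
assert chr(0x20) == " "      # column 2, row 0 is the space character
print(hex(ord("A")))  # 0x41
```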
Over the years, character sets evolved to support more than just monolingual English in order to meet the growing needs of users around the world. New character sets were quickly created to support other languages. Typically, these new character sets supported a group of related languages, based on the same script. For example, the ISO 8859 character set series was created to support different European languages.
Character sets evolved and provided restricted multilingual support. They were restricted in the sense that they were limited to groups of languages based on similar scripts. More recently, there has been a push to remove boundaries and limitations on the character data that can be represented through the use of an unrestricted or universal character set. Unicode is one such universal character set that encompasses most major scripts of the modern world. The Unicode character set provides support for a character repertoire of approximately 49,000 characters and continues to grow.
Different types of encoding schemes have been created by the computer industry. The character set you choose affects what kind of encoding scheme will be used. This is important because different encoding schemes have different performance characteristics, and these characteristics can influence your database schema and application development requirements. The character set you choose will typically use one of the following types of encoding schemes:
Single-byte encoding schemes are the most efficient encoding schemes available. They take the least amount of space to represent characters and are easy to process and program with because one character is represented in one byte.
Single-byte 7-bit encoding schemes can define up to 128 characters and normally support just one language. One of the most common single-byte character sets, used since the early days of computing, is ASCII (American Standard Code for Information Interchange).
Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. One example is ISO 8859-1, which supports many Western European languages. Figure 2-1 illustrates a typical 8-bit encoding scheme.
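The one-byte economy of an 8-bit scheme is easy to demonstrate with Python codecs: ISO 8859-1 places Western European letters in the upper half of the 8-bit range, so each fits in a single byte, whereas the same letter needs two bytes in UTF-8:

```python
# An accented Western European letter in a single-byte vs. multibyte encoding.
ch = "ä"  # a-umlaut; code value 0xE4 in ISO 8859-1
assert ch.encode("iso8859-1") == b"\xe4"      # one byte
assert ch.encode("utf-8") == b"\xc3\xa4"      # two bytes in UTF-8
print(ch.encode("iso8859-1"), ch.encode("utf-8"))
```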
Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese because these languages use thousands of characters. These schemes use either a fixed number of bytes to represent a character or a variable number of bytes per character.
In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of n bytes, where n is greater than or equal to two.
A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that will represent a character. For example, if two bytes is the maximum number of bytes used to represent a character, the most significant bit can be toggled to indicate whether that byte is a single-byte character or the first byte of a double-byte character. In other schemes, control codes differentiate single-byte from double-byte characters. Another possibility is that a shift-out code is used to indicate that the subsequent bytes are double-byte characters until a shift-in code is encountered.
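UTF-8 is one modern example of a variable-width scheme that encodes the sequence length in the lead byte's high bits (a different mechanism from the two-byte MSB toggle described above, but the same principle). A sketch of reading those bits:

```python
def utf8_width(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, read from the lead byte's high bits."""
    if lead < 0x80:              # 0xxxxxxx : single-byte (ASCII) character
        return 1
    if lead >> 5 == 0b110:       # 110xxxxx : first byte of a 2-byte character
        return 2
    if lead >> 4 == 0b1110:      # 1110xxxx : first byte of a 3-byte character
        return 3
    if lead >> 3 == 0b11110:     # 11110xxx : first byte of a 4-byte character
        return 4
    raise ValueError("continuation byte, not a lead byte")

assert utf8_width("A".encode("utf-8")[0]) == 1
assert utf8_width("ä".encode("utf-8")[0]) == 2
assert utf8_width("語".encode("utf-8")[0]) == 3
```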
Oracle uses the following naming convention for character set names:
<language_or_region><#_of_bits_representing_a_character><standard_name>[S | C]
Note that UTF8 and UTFE are exceptions to this naming convention.
Some examples are:
The optional "S" or "C" at the end of the character set name is used to differentiate character sets that can be used only on the server (S) or only on the client (C).
On Macintosh platforms, the server character set should always be used. The Macintosh client character sets are obsolete. On EBCDIC platforms, if available, the "S" version should be used on the server and the "C" version on the client.
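The naming convention can be sketched as a small parser. This is a hypothetical helper (not an Oracle API), and it does not handle the optional trailing S/C variant letter or the UTF8/UTFE exceptions:

```python
import re

# Split an Oracle character set name into the <region><bits><standard>
# parts of the convention shown above.
NAME = re.compile(r"^([A-Z]+?)(\d+)([A-Z0-9]+)$")

for name in ("US7ASCII", "WE8ISO8859P1", "JA16SJIS"):
    region, bits, standard = NAME.match(name).groups()
    print(name, "->", region, bits, standard)
```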
Oracle uses the database character set for data stored in SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG).
Consider the following questions when you choose an Oracle character set for the database:
Several character sets may meet your current language requirements, but you should consider future language requirements as well. If you know that you will need to expand support for different languages in the future, picking a character set with a wider range now prevents the need for migration later. The Oracle character sets listed in Appendix A, "Locale Data" are named according to the languages and regions that they cover. Some character sets that cover a region (for example, the ISO character sets) are also listed explicitly by language. You may want to see the actual characters that are encoded. Most character sets are based on national, international, or vendor product documentation, or are available in standards documents.
While the database maintains and processes the actual character data, there are other resources that you must depend on from the operating system. For example, the operating system supplies fonts that correspond to the character set you have chosen. Input methods that support the desired languages and application software must also be compatible with a particular character set.
Ideally, a character set should be available on the operating system and handled by your application to ensure seamless integration.
If you choose a character set that is different from what is available on the operating system, the Oracle database can convert the operating system character set to the database character set. However, there is some character set conversion overhead, and you need to make sure that the operating system character set has an equivalent character repertoire to avoid data loss.
Character set conversions can sometimes cause data loss. For example, if you are converting from character set A to character set B, the destination character set B must have the same character repertoire as A. Any characters that are not available in character set B are converted to a replacement character, which is most often a question mark (?) or a linguistically related character. For example, ä (a with an umlaut) may be converted to a. If you have distributed environments, consider using character sets with similar character repertoires to avoid loss of data.
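The replacement behavior described above can be reproduced with Python's codec machinery, which likewise substitutes a question mark by default for characters the target set lacks:

```python
# Characters missing from the target character set become the default
# replacement character, a question mark.
converted = "Grüße".encode("ascii", errors="replace")
assert converted == b"Gr??e"  # ü and ß are both replaced with ?
print(converted)
```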
Character set conversion may require copying strings between buffers multiple times before the data reaches the client. Therefore, if possible, use the same character sets for the client and the server to optimize performance.
By default, the character datatypes CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.
This works well if the database character set uses a single-byte character encoding scheme because the number of characters will be the same as the number of bytes. If the database character set uses a multibyte character encoding scheme, there is no such correspondence. That is, the number of bytes no longer equals the number of characters since a character can consist of one or more bytes. Thus, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters. You can overcome this problem by switching to character semantics when defining the column size.
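The divergence between bytes and characters is easy to see with a multibyte sample string (Japanese here, encoded in UTF-8):

```python
# With a multibyte character set, byte length and character length diverge,
# so a column sized in bytes holds fewer characters than its declared width.
s = "データベース"                    # "database" in Japanese: 6 characters
assert len(s) == 6                    # 6 characters
assert len(s.encode("utf-8")) == 18   # 18 bytes: 3 bytes per character in UTF-8
print(len(s), len(s.encode("utf-8")))
```

A CHAR(20) column defined with byte semantics could therefore hold only six of these characters, not twenty.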
There can be different performance overheads in handling different encoding schemes, depending on the character set chosen. For best performance, choose a character set that avoids character set conversion and uses the most efficient encoding for the desired languages. Single-byte character sets offer better performance than multibyte character sets, and they are also the most efficient in terms of space requirements. However, single-byte character sets limit how many languages you can use.
ASCII-based character sets are supported only on ASCII-based platforms. Similarly, you can use an EBCDIC-based character set only on EBCDIC-based platforms.
The database character set is used to identify SQL and PL/SQL source code. In order to do this, it must have either EBCDIC or 7-bit ASCII as a subset, whichever is native to the platform. Therefore, it is not possible to use a fixed-width, multibyte character set as the database character set. Currently, this restriction applies only to the AL16UTF16 character set.
In some cases, you may wish to choose an alternate character set for the database because:
SQL NCHAR datatypes have been redefined to support Unicode data only. You can store the data in either the UTF-8 or UTF-16 encoding.
Table 2-4 lists the restrictions on the character sets that can be used to express names and other text in Oracle.
For a list of supported string formats and character sets, including LOB data (BLOB, CLOB, and NCLOB), see Table 2-6.
The character encoding scheme used by the database is defined at database creation as part of the CREATE DATABASE statement. All SQL CHAR datatype columns (CHAR, VARCHAR2, CLOB, and LONG), including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. SQL NCHAR datatype columns (NCHAR, NCLOB, and NVARCHAR2) use the national character set.
After the database is created, the character set choices cannot be changed, with some exceptions, without re-creating the database. Hence, it is important to consider carefully which character sets to use. The database character set should always be a superset or equivalent of the client's operating system's native character set. The character sets used by client applications that access the database usually determine which superset is the best choice.
If all client applications use the same character set, then this is the normal choice for the database character set. When client applications use different character sets, the database character set should be a superset of all the client character sets. This ensures that every character is represented when converting from a client character set to the database character set.
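The superset requirement can be sketched as a repertoire comparison. The helper below is illustrative only (it works for single-byte encodings and uses Python codecs, not Oracle's conversion tables):

```python
# A candidate database character set is safe for a client character set
# only if it can represent every character the client can.
def repertoire(enc: str) -> set:
    """Set of characters a single-byte encoding can represent."""
    chars = set()
    for b in range(256):
        try:
            chars.add(bytes([b]).decode(enc))
        except UnicodeDecodeError:
            pass  # byte value unassigned in this encoding
    return chars

# ISO 8859-1 covers all of ASCII, so it can serve ASCII clients losslessly.
assert repertoire("ascii") <= repertoire("iso8859-1")
```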
When a client application operates with a terminal that uses a different character set, then the client application's characters must be converted to the database character set, and vice versa. This conversion is performed automatically, and is transparent to the client application, except that the number of bytes for a character string may be different in the client character set and the database character set. The character set used by the client application is defined by the NLS_LANG parameter.
Table 2-5 lists the supported encoding schemes associated with different datatypes.
Table 2-6 lists the supported datatypes associated with Abstract Data Types (ADT).
Abstract Datatype | CHAR | NCHAR | BLOB | CLOB | NCLOB |
---|---|---|---|---|---|
Object | Yes | No | Yes | Yes | No |
Collection | Yes | No | Yes | Yes | No |
In some cases, you may wish to change the existing database character set. For example, you may find that the number of languages that need to be supported in your database has increased. In most cases, you need to do a full export/import to properly convert all data to the new character set. However, if, and only if, the new character set is a strict superset of the current character set, you can use the ALTER DATABASE CHARACTER SET statement to expedite the change in the database character set.
See Also:
Chapter 10, "Character Set Scanner Utility" for more information about character set conversion
The simplest example of an NLS database setup is when both the client and the server run in the same language environment and use the same character encoding. This monolingual scenario has the advantage of fast response because the overhead of character set conversion is avoided. Figure 2-2 illustrates this scenario.
You can also use a multitier architecture, as illustrated in Figure 2-3:
You may need to convert character sets in a client/server computing environment because a client application resides on a different computer platform from the server, and the two platforms do not use the same character encoding scheme. Character data passed between client and server must be converted between the two encoding schemes. Character conversion occurs automatically and transparently through Oracle Net.
You can convert between any two character sets, as shown in Figure 2-4:
However, in cases where the target character set does not contain all of the characters in the source data, replacement characters are used. If, for example, a server uses US7ASCII and a German client uses WE8ISO8859P1, the German character ß is replaced with ? and ä is replaced with a.
Replacement characters may be defined for specific characters as part of a character set definition. When a specific replacement character is not defined, a default replacement character is used. To avoid the use of replacement characters when converting from a client character set to the database character set, the server character set should be a superset (or equivalent) of all the client character sets. In Figure 2-4, the server's character set was not chosen wisely. If German data is expected to be stored on the server, a character set that supports German letters, such as WE8ISO8859P1, is needed for both the server and the client.
In some variable-width multibyte cases, character set conversion may introduce noticeable overhead. You need to carefully evaluate your situation and choose character sets to avoid conversion as much as possible. Having the appropriate character set for the database and the client will avoid the overhead of character conversion, as well as possible data loss.
Note that some character sets support multiple languages. This is typical when the languages have related writing systems or scripts. For example, Table 2-7 illustrates that WE8ISO8859P1 supports the following Western European languages:
Catalan | Finnish | Icelandic | Portuguese |
---|---|---|---|
Danish | French | Italian | Spanish |
Dutch | German | Norwegian | Swedish |
English | | | |
WE8ISO8859P1 supports these languages because they are all based on similar, Latin-based writing scripts. This situation is called restricted multilingual support.
In Figure 2-5, both clients have access to the server's data, though the German client requires character conversion because it is using the WE8DEC character set.
Character conversion is necessary, but both French and German are Latin-based scripts, so you can use WE8ISO8859P1.
Often, unrestricted multilingual support is needed, and a universal character set such as Unicode is necessary as the server database character set. Unicode has two major encoding schemes: UTF-16 and UTF-8. UTF-16 is a two-byte, fixed-width format; UTF-8 is a variable-width, multibyte format. The Oracle9i database supports UTF-8 as a database character set and both UTF-8 and UTF-16 as the national character set. This enhancement is transparent to clients that already support multibyte character sets.
Character set conversion between a UTF-8 database and any single-byte character set introduces very little overhead. Conversion between UTF-8 and any multibyte character set has some overhead but there is no conversion loss problem except that some multibyte character sets do not support user-defined characters during character set conversion to and from UTF-8.
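The lossless nature of the single-byte case can be demonstrated with a round trip: UTF-8 encodes every character of a single-byte set such as ISO 8859-1, so nothing is lost converting into and back out of a UTF-8 database:

```python
# Round-trip every assigned upper-half ISO 8859-1 byte value through UTF-8.
original = bytes(range(0xA0, 0x100)).decode("iso8859-1")  # accented letters etc.
roundtrip = original.encode("utf-8").decode("utf-8")
assert roundtrip == original
print("lossless round trip for", len(original), "characters")
```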
Figure 2-6 shows how a database can support many different languages. Here, Japanese, French, and German clients are all accessing the same database based on the Unicode character set. Note that each client accesses only data that it can process. If Japanese data were retrieved, modified, and stored by the German client, all Japanese characters would be lost during the character set conversion.
Figure 2-6 illustrates a Unicode solution for a client/server architecture. You can also use a multitier architecture, as illustrated in Figure 2-7.
Figure 2-7 illustrates a multitier Unicode solution. Using this all-UTF8 architecture, you eliminate the need for character conversion.
Copyright © 1996-2001, Oracle Corporation. All Rights Reserved.