Oracle8i National Language Support Guide
Release 2 (8.1.6)

Part Number A76966-01


3
Choosing a Character Set

This chapter explains NLS topics that you need to understand when choosing a character set. These topics are:

What is an Encoded Character Set?
Which Characters to Encode?
How Many Languages does a Character Set Support?
How are These Characters Encoded?
Oracle's Naming Convention for Character Sets
Tips on Choosing an Oracle Database Character Set
Tips on Choosing an Oracle NCHAR Character Set
Considerations for Different Encoding Schemes
Summary of Data Types and Supported Encoding Schemes
Changing the Character Set After Database Creation
Customizing Character Sets
Monolingual Database Example
Multilingual Database Example

What is an Encoded Character Set?

An encoded character set is specified when creating a database, and your choice of character set determines what languages can be represented in the database. This choice also influences how you create the database schema and develop applications that process character data. It also influences interoperability with operating system resources and database performance.

When processing characters, computer systems handle character data as numeric codes rather than as their graphical representation. For instance, when the database stores the letter "A", it actually stores a numeric code that is interpreted by software as that letter.
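For example, you can inspect the numeric code stored for a character with the ASCII and CHR SQL functions (a minimal sketch; on an EBCDIC-based database, ASCII returns the EBCDIC code value rather than 0x41):

-- ASCII() returns the numeric code of the first character of its argument
-- in the database character set; CHR() performs the reverse mapping.
SELECT ASCII('A') AS code_value,   -- 65 (0x41) in an ASCII-based database
       CHR(65)    AS stored_char   -- 'A'
FROM   dual;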

A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as a coded character set. A coded character set assigns a unique numeric code to each character in the character repertoire. Table 3-1 shows examples of characters and their assigned numeric code values.

Table 3-1 Encoded Characters in the ASCII Character Set

Character   Description        Code Value
!           Exclamation Mark   0x21
#           Number Sign        0x23
$           Dollar Sign        0x24
1           The Number 1       0x31
2           The Number 2       0x32
3           The Number 3       0x33
A           An Uppercase A     0x41
B           An Uppercase B     0x42
C           An Uppercase C     0x43
a           A Lowercase a      0x61
b           A Lowercase b      0x62
c           A Lowercase c      0x63

There are many different coded character sets used throughout the computer industry and supported by Oracle. Oracle supports most national, international, and vendor-specific encoded character set standards. The complete list of character sets supported by Oracle is included in Appendix A, "Locale Data". Character sets differ in the characters they make available (the character repertoire), the languages and scripts they can represent, and the encoding scheme used to map characters to numeric codes.

These differences are discussed throughout this chapter.

Which Characters to Encode?

The first consideration when choosing a character set is which languages you need to store in the database. The characters that are encoded in a character set depend on the writing systems that are represented.

Writing Systems

A writing system can be used to represent a language or group of languages. For the purposes of this book, writing systems can be classified into two broad categories, phonetic and ideographic.

Phonetic Writing Systems

Phonetic writing systems consist of symbols which represent different sounds associated with a language. Greek, Latin, Cyrillic, and Devanagari are all examples of phonetic writing systems based on alphabets. Note that alphabets can represent more than one language. For example, the Latin alphabet can represent many Western European languages such as French, German, and English.

Characters associated with a phonetic writing system (alphabet) can typically be encoded in one byte since the character repertoire is usually smaller than 256 characters.

Ideographic Writing Systems

Ideographic writing systems, in contrast, consist of ideographs or pictographs that represent the meaning of a word, not the sounds of a language. Chinese and Japanese are examples of ideographic writing systems that are based on tens of thousands of ideographs. Languages that use ideographic writing systems may use a syllabary as well. Syllabaries provide a mechanism for communicating phonetic information along with the pictographs when necessary. For instance, Japanese has two syllabaries, Hiragana, normally used for grammatical elements, and Katakana, normally used for foreign and onomatopoeic words.

Characters associated with an ideographic writing system must typically be encoded in more than one byte because the character repertoire can be as large as tens of thousands of characters.

Punctuation, Control Characters, Numbers, and Symbols

In addition to encoding the script of a language, other characters need to be encoded: punctuation marks (for example, commas, periods, and apostrophes), numbers (for example, the Arabic digits 0-9), special symbols (for example, currency symbols and math operators), and control characters for computers (for example, carriage returns, tabs, and NULL).

Writing Direction

Most Western languages are written left-to-right from the top to the bottom of the page. East Asian languages are usually written top-to-bottom from the right to the left of the page. Exceptions are frequently made for technical books translated from Western languages. Arabic and Hebrew are written right-to-left from the top to the bottom.

Another consideration is that numbers reverse direction in Arabic and Hebrew. So, even though the text is written right-to-left, numbers within the sentence are written left-to-right. For example, "I wrote 32 books" would be written as "skoob 32 etorw I". Irrespective of the writing direction, Oracle stores the data in logical order. Logical order means the order used by someone typing a language, not how it looks on the screen.

How Many Languages does a Character Set Support?

Different character sets support different character repertoires. Because character sets are typically based on a particular writing script, they support different languages. When character sets were first developed in the United States, they had a limited character repertoire; even now there can be problems using certain characters across platforms. The following CHAR and VARCHAR characters are representable in all Oracle database character sets and transportable to any platform:

If you are using

take care that your data is in well-formed strings.

During conversion from one character set to another, Oracle expects CHAR and VARCHAR items to be well-formed strings encoded in the declared database character set. If you put other values into the string (for example, using the CHR or CONVERT function), the values may be corrupted when they are sent to a database with a different character set.

If you are currently using only two or three well-established character sets, you may not have experienced any problems with character conversion. However, as your enterprise grows and becomes more global, problems may arise with such conversions. Therefore, Oracle Corporation recommends that you store any values other than well-formed strings in RAW columns rather than CHAR or VARCHAR columns.
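The following sketch illustrates that recommendation; the table and column names are hypothetical:

-- Keep well-formed character data in character columns and arbitrary
-- byte values in a RAW column, so the bytes are never run through
-- character set conversion.
CREATE TABLE app_messages (
  id        NUMBER,
  body_text VARCHAR2(200),   -- well-formed strings in the database character set
  body_raw  RAW(200)         -- raw byte values, passed through unchanged
);

INSERT INTO app_messages (id, body_text, body_raw)
VALUES (1, 'well-formed text', HEXTORAW('84FF00A1'));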

ASCII Encoding

The ASCII and IBM EBCDIC character sets support a similar character repertoire, but assign different code values to some of the characters. Table 3-2 shows how ASCII is encoded. Row and column headings denote hexadecimal digits. To find the encoded value of a character, read the column number followed by the row number. For example, the value of the character A is 0x41.

Table 3-2 7-Bit ASCII Coded Character Set

     0    1    2    3    4    5    6    7
0    NUL  DLE  SP   0    @    P    `    p
1    SOH  DC1  !    1    A    Q    a    q
2    STX  DC2  "    2    B    R    b    r
3    ETX  DC3  #    3    C    S    c    s
4    EOT  DC4  $    4    D    T    d    t
5    ENQ  NAK  %    5    E    U    e    u
6    ACK  SYN  &    6    F    V    f    v
7    BEL  ETB  '    7    G    W    g    w
8    BS   CAN  (    8    H    X    h    x
9    TAB  EM   )    9    I    Y    i    y
A    LF   SUB  *    :    J    Z    j    z
B    VT   ESC  +    ;    K    [    k    {
C    FF   FS   ,    <    L    \    l    |
D    CR   GS   -    =    M    ]    m    }
E    SO   RS   .    >    N    ^    n    ~
F    SI   US   /    ?    O    _    o    DEL

Over the years, character sets evolved to support more than just monolingual English in order to meet the growing needs of users around the world. New character sets were quickly created to support other languages. Typically, these new character sets supported a group of related languages, based on the same script. For example, the ISO 8859 character set series was created based on many national or regional standards to support different European languages.

Table 3-3 ISO 8859 Character Sets

Standard      Languages Supported
ISO 8859-1    Western European (Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish)
ISO 8859-2    Eastern European (Albanian, Croatian, Czech, English, German, Hungarian, Latin, Polish, Romanian, Slovak, Slovenian, Serbian)
ISO 8859-3    Southeastern European (Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, Turkish)
ISO 8859-4    Northern European (Danish, English, Estonian, Finnish, German, Greenlandic, Latin, Latvian, Lithuanian, Norwegian, Sámi, Slovenian, Swedish)
ISO 8859-5    Eastern European (Cyrillic-based: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian)
ISO 8859-6    Arabic
ISO 8859-7    Greek
ISO 8859-8    Hebrew
ISO 8859-9    Western European (Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Finnish, French, Frisian, Galician, German, Greenlandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish, Turkish)
ISO 8859-10   Northern European (Danish, English, Estonian, Faeroese, Finnish, German, Greenlandic, Icelandic, Irish Gaelic, Latin, Lithuanian, Norwegian, Sámi, Slovenian, Swedish)
ISO 8859-15   Western European (Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish)

Character sets evolved to provide restricted multilingual support: restricted in the sense that each set was limited to a group of languages based on similar scripts.

More recently, there has been a push to remove boundaries and limitations on the character data that can be represented through the use of an unrestricted or universal character set. Unicode is one such universal character set that encompasses most major scripts of the modern world. The Unicode character set provides support for a character repertoire of approximately 39,000 characters and continues to grow.

How are These Characters Encoded?

Different types of encoding schemes have been created by the computer industry. These schemes have different performance characteristics, and can influence your database schema and application development requirements for handling character data, so you need to be aware of the characteristics of the encoding scheme used by the character set you choose. The character set you choose will typically use one of the following types of encoding schemes.

Single-Byte Encoding Schemes

Single-byte encoding schemes are the most efficient encoding schemes available. They take up the least amount of space to represent characters and are easy to process and program with because one character is represented in one byte.

7-bit Encoding Schemes

Single-byte 7-bit encoding schemes can define up to 128 characters, and normally support just one language. One of the most common single-byte character sets, used since the early days of computing, is ASCII (American Standard Code for Information Interchange).

8-bit Encoding Schemes

Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. One example is ISO 8859-1, which supports many Western European languages.

Figure 3-1 8-Bit Encoding Schemes



Multibyte Encoding Schemes

Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese since these languages use thousands of characters. These schemes use either a fixed number of bytes to represent a character or a variable number of bytes per character.

Fixed-width Encoding Schemes

In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of n bytes, where n is greater than or equal to two.

Variable-width Encoding Schemes

A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that will represent a character. For example, if two bytes is the maximum number of bytes used to represent a character, the most significant bit can be toggled to indicate whether that byte is part of a single-byte character or the first byte of a double-byte character. In other schemes, control codes differentiate single-byte from double-byte characters. Another possibility is that a shift-out code will be used to indicate that the subsequent bytes are double-byte characters until a shift-in code is encountered.
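For example, in a database whose character set is a variable-width multibyte set such as UTF8, the DUMP function shows how many bytes each character occupies (a minimal sketch; the byte values shown in the comments assume a UTF8 database character set):

SELECT DUMP('a', 16) AS single_byte_char,   -- Typ=96 Len=1: 61
       DUMP('ä', 16) AS two_byte_char       -- Typ=96 Len=2: c3,a4 in UTF8
FROM   dual;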

Oracle's Naming Convention for Character Sets

Oracle uses the following naming convention for character set names:

<language_or_region><#_of_bits_representing_a_char><standard_name>[S] [C] 
[FIXED]

Note that UTF8 and UTFE are exceptions to this naming convention.

For instance, US7ASCII specifies United States (US), 7 bits per character, and the ASCII standard; WE8ISO8859P1 specifies Western Europe (WE), 8 bits per character, and ISO 8859 Part 1; JA16SJIS specifies Japanese (JA), 16 bits per character, and Shift-JIS.

The optional "S" or "C" at the end of the character set name is sometimes used to help differentiate character sets that can only be used on the server (S) or client (C).

On Macintosh platforms, the server character set should always be used. The Macintosh client character sets are now obsolete. On EBCDIC platforms, if available, the "S" version should be used on the server and the "C" version on the client.

The optional "FIXED" at the end of the character set name is used to denote a fixed-width multibyte encoding.

Tips on Choosing an Oracle Database Character Set

Oracle uses the database character set for data stored in CHAR, VARCHAR2, CLOB, and LONG columns (including the data dictionary) and for identifiers such as table, column, and other database object names.

Consider the following four points when choosing an Oracle character set for the database:

  1. What languages does the database need to support?

  2. Interoperability with system resources and applications

  3. Performance implications

  4. Restrictions

Several character sets may meet your current language requirements, but you should consider future language requirements as well. If you know that you will need to expand support for different languages in the future, choosing a character set with a wider range now will obviate the need for migration later. The Oracle character sets listed in Appendix A, "Locale Data", are named according to the languages and regions they cover. Some character sets that cover regions, such as the ISO character sets, are also listed explicitly by language. In some cases you may want to see the actual characters that are encoded. The code pages themselves are not listed in this manual, however, because most are based on national, international, or vendor product documentation, or are available in standards documents.

Interoperability with System Resources and Applications

While the database maintains and processes the actual character data, there are other resources that you must depend on from the operating system. For instance, the operating system supplies fonts that correspond to the character set you have chosen. Input methods that support the language(s) desired and application software must also be compatible with a particular character set.

Ideally, the character set you choose should be available on the operating system and handled by your application to ensure seamless integration.

Character Set Conversion

If you choose a character set that is different from what is available on the operating system, Oracle can handle character set conversion from the database character set to the operating system character set. However, there is some character set conversion overhead, and you need to make sure that the operating system character set has an equivalent character repertoire to avoid any possible data loss.

Also note that character set conversion can cause data loss. For example, if you are converting from character set A to character set B, the destination character set B must include the character repertoire of A. Any characters that are not available in character set B are converted to a replacement character, which is most often specified as "?", or to a linguistically related character. For example, ä (a with an umlaut) may be converted to "a". If you have distributed environments, consider using character sets with similar character repertoires to avoid loss of data.

Character set conversion may require copying strings between buffers multiple times before the data reaches the client. Therefore, if possible, using the same character sets for the client and the server can avoid character set conversion, and thus optimize performance.

Database Schema

The character datatypes CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.

This works out well if the database character set uses a single-byte character encoding scheme because the number of characters will be the same as the number of bytes. If the database character set uses a multibyte character encoding scheme, there is no such correspondence. That is, the number of bytes no longer equals the number of characters since a character can consist of one or more bytes. Thus, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters.
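For instance, the following sketch (the table and column names are hypothetical) allocates 20 bytes per value; with a multibyte database character set, those 20 bytes may hold fewer than 20 characters:

CREATE TABLE customers (
  last_name CHAR(20),       -- 20 bytes, not necessarily 20 characters
  notes     VARCHAR2(100)   -- up to 100 bytes of character data
);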

Performance Implications

There can be different performance overheads in handling different encoding schemes, depending on the character set chosen. For best performance, try to choose a character set that avoids character set conversion and uses the most efficient encoding for the languages desired. Single-byte character sets are more efficient for performance than multibyte character sets, and they are also the most efficient in terms of space requirements.

Restrictions

You cannot currently choose an Oracle database character set that is a fixed-width multibyte character set. In particular, the following character sets cannot be used as the database character set:

Table 3-4 Restricted Character Sets

JA16EUCFIXED       ZHS16GBKFIXED       JA16DBCSFIXED
KO16DBCSFIXED      ZHS16DBCSFIXED      JA16SJISFIXED
ZHT32TRISFIXED     KO16KSC5601FIXED    ZHS16CGB231280FIXED
ZHT32EUCFIXED      ZHT16BIG5FIXED      ZHT16DBCSFIXED

Tips on Choosing an Oracle NCHAR Character Set

In some cases, you may wish to choose an alternate character set for the database because the properties of a different character encoding scheme may be more desirable for extensive character processing operations or to facilitate ease of programming. In particular, the following data types can use an alternate character set: NCHAR, NVARCHAR2, and NCLOB.

Specifying an NCHAR character set allows you to use a character set other than the database character set in NCHAR, NVARCHAR2, and NCLOB columns. This can be particularly useful if you use a variable-width multibyte database character set, because NCHAR can support fixed-width multibyte encoding schemes, whereas the database character set cannot. The benefits of using a fixed-width multibyte encoding over a variable-width one are that column widths can be expressed in characters rather than bytes and that some string operations are faster, as described below.

When choosing an NCHAR character set, you must ensure that the NCHAR character repertoire is equivalent to or a subset of the database character set repertoire.

Note: All SQL text uses the database character set, not the NCHAR character set. Therefore, string literals can be specified only in the database character set.
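For example (a sketch; the table is hypothetical), the literal in the following INSERT is parsed in the database character set and converted to the NCHAR character set when it is stored:

CREATE TABLE product_names (
  name NVARCHAR2(50)   -- stored in the national (NCHAR) character set
);

-- The literal is expressed in the database character set and is converted
-- to the NCHAR character set on insert.
INSERT INTO product_names (name) VALUES ('Widget');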

Database Schema

When using the NCHAR, NVARCHAR2, and NCLOB data types, the width specification can be in terms of bytes or characters depending on the encoding scheme used. If the NCHAR character set uses a variable-width multibyte encoding scheme, the width specification refers to bytes. If the NCHAR character set uses a fixed-width multibyte encoding scheme, the width specification will be in characters. For example, NCHAR(20), using the variable-width multibyte character set JA16EUC, will allocate 20 bytes while NCHAR(20) using the fixed-width multibyte character set JA16EUCFIXED will allocate 40 bytes.
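As a sketch of the example above, the same column declaration allocates a different number of bytes depending on the national character set chosen at database creation:

-- With the variable-width national character set JA16EUC,
--   label NCHAR(20) allocates 20 bytes.
-- With the fixed-width national character set JA16EUCFIXED,
--   label NCHAR(20) allocates 20 characters, that is, 40 bytes.
CREATE TABLE kana_labels (
  label NCHAR(20)
);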

Performance Implications

Some string operations are faster when you choose a fixed-width character set for the national character set. For instance, string-intensive operations such as the SQL LIKE operator used on an NCHAR fixed-width column outperform LIKE operations on a multi-byte CHAR column. A possible usage scenario is as follows:

With a Database Character Set of ...    Use an NCHAR Character Set of ...

Recommendations

Because SQL text, such as literals in SQL statements, can be represented only in the database character set and not the NCHAR character set, you should choose an NCHAR character set whose character repertoire is equivalent to, or a subset of, the database character set's repertoire.

Considerations for Different Encoding Schemes

Keep the following points in mind when dealing with encoding schemes.

Be Careful when Mixing Fixed-Width and Varying-Width Character Sets

Because fixed-width multi-byte character sets are measured in characters, and varying-width character sets are measured in bytes, be careful if you use a fixed-width multi-byte character set as your national character set on one platform and a varying-width character set on another platform.

For example, if you use %TYPE or a named type to declare an item on one platform using the declaration information of an item from the other platform, the resulting item may be too small to hold the data. NCHAR(10) on the platform using the fixed-width multi-byte set allocates enough space for 10 characters, but if %TYPE or a named type creates a correspondingly typed item on the other platform, it allocates only 10 bytes, which is usually not enough for 10 characters. To be safe:

Storing Data in Multi-Byte Character Sets

Width specifications of the character datatypes CHAR and VARCHAR2 refer to bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.

If the database character set is single-byte and does not use composite characters, the number of characters and the number of bytes are the same. If the database character set is multi-byte, there is generally no such correspondence: a character can consist of one or more bytes, depending on the specific multi-byte encoding scheme and whether shift-in/shift-out control codes are present. Hence, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters.

A typical situation is when character elements are combined to form a single character. For example, o and an umlaut can be combined to form ö. In the Thai language, up to three separate character elements can be combined to form one character, and one Thai character would require up to 3 bytes when TH8TISASCII or another single-byte Thai character set is used. One Thai character would require up to 9 bytes when the UTF8 character set is used.
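You can compare character and byte counts with the LENGTH and LENGTHB functions (a minimal sketch; the table and column are hypothetical, and the byte counts depend on the database character set as described above):

SELECT LENGTH(description)  AS char_count,   -- number of characters
       LENGTHB(description) AS byte_count    -- number of bytes used to store them
FROM   thai_products;                        -- hypothetical table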

One Thai character consists of up to three separate character elements as shown in Figure 3-2, where two of the characters are comprised of three character elements.

Figure 3-2 Combining Characters



The lower row of Figure 3-2 shows nine Thai characters in their correct display format. Inside the database, these nine characters are stored as shown in the upper row, which also matches how they are represented in computer memory: they look like thirteen characters, but they are actually nine. Note that the upper row only illustrates how the characters are stored; it is not a correct way to display Thai characters.

When using the NCHAR and NVARCHAR2 data types, the width specification refers to characters when the national character set is fixed-width multi-byte. Otherwise, the width specification refers to bytes.

A separate performance issue is space efficiency (and thus speed): smaller-width character sets use less storage. These considerations can trade off against each other when choosing between a varying-width and a fixed-width character set.

Naming Database Objects

The database character set determines which characters you can use to name database objects.

Restrictions on Character Sets Used to Express Names and Text

Table 3-5 lists the restrictions on the character sets that can be used to express names and other text in Oracle.

Table 3-5 Restrictions on Character Sets Used to Express Names and Text

Name                         Single-Byte Fixed-Width   Varying-Width   Multi-Byte Fixed-Width   Comments
Column names                 Yes                       Yes             No
Schema objects               Yes                       Yes             No
Comments                     Yes                       Yes             No
Database link names          Yes                       No              No
Database names               Yes                       No              No
Filenames (datafile, logfile, controlfile, initialization parameter file)
                             Yes                       No              No
Instance names               Yes                       No              No
Directory names              Yes                       No              No
Keywords                     Yes                       No              No                       Can be expressed in English ASCII or EBCDIC characters only
Recovery Manager filenames   Yes                       No              No
Rollback segment names       Yes                       No              No                       The ROLLBACK_SEGMENTS parameter does not support NLS
Stored script names          Yes                       Yes             No
Tablespace names             Yes                       Yes             No

For a list of supported string formats and character sets, including LOB data (LOB, BLOB, CLOB, and NCLOB), see Table 3-7.

The character encoding scheme used by the database is defined at database creation as part of the CREATE DATABASE statement. All data columns of type CHAR, CLOB, VARCHAR2, and LONG, including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. Data columns of type NCHAR, NCLOB, and NVARCHAR2 use the national character set.
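For example, both character sets are specified in the CREATE DATABASE statement (a partial sketch; the database name and the character sets shown are only placeholders, and other clauses take their defaults):

CREATE DATABASE sales
  CHARACTER SET          WE8ISO8859P1    -- database character set
  NATIONAL CHARACTER SET JA16EUCFIXED;   -- national (NCHAR) character set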

After the database is created, the character set choices cannot be changed, with some exceptions, without re-creating the database. Hence, it is important to consider carefully which character set(s) to use. The database character set should always be a superset or equivalent of the client's operating system's native character set. The character sets used by client applications that access the database usually determine which superset is the best choice.

If all client applications use the same character set, then this is the normal choice for the database character set. When client applications use different character sets, the database character set should be a superset (or equivalent) of all the client character sets. This ensures that every character is represented when converting from a client character set to the database character set.

When a client application operates with a terminal that uses a different character set, then the client application's characters must be converted to the database character set, and vice versa. This conversion is performed automatically, and is transparent to the client application, except that the number of bytes for a character string may be different in the client character set and the database character set. The character set used by the client application is defined by the NLS_LANG parameter. Similarly, the character set used for national character set data is defined by the NLS_NCHAR parameter.

Summary of Data Types and Supported Encoding Schemes

Table 3-6 lists the supported encoding schemes associated with different data types.

Table 3-6 Supported Encoding Schemes for Data Types

Data Type   Single-Byte   Multi-Byte Varying-Width   Multi-Byte Fixed-Width
CHAR        Yes           Yes                        No
NCHAR       Yes           Yes                        Yes
BLOB        Yes           Yes                        Yes
CLOB        Yes           Yes                        No
NCLOB       Yes           Yes                        Yes

Table 3-7 lists the supported data types associated with Abstract Data Types (ADT).

Table 3-7 Supported Data Types for Abstract Data Types

Abstract Data Type   CHAR   NCHAR   BLOB   CLOB   NCLOB
Object               Yes    No      Yes    Yes    No
Collection           Yes    No      Yes    Yes    No


Note:

BLOB data is handled as a series of bytes, not characters. It is not subject to any NLS-sensitive operations.


Changing the Character Set After Database Creation

In some cases, you may wish to change the existing database character set. For instance, you may find that the number of languages that need to be supported in your database have increased. In most cases, you will need to do a full export/import to properly convert all data to the new character set. However, if, and only if, the new character set is a strict superset of the current character set, it is possible to use the ALTER DATABASE CHARACTER SET statement to expedite the change in the database character set.

The target character set is a strict superset if and only if each and every codepoint in the source character set is available in the target character set, with the same corresponding codepoint value. For instance, the following migration scenarios can take advantage of the ALTER DATABASE CHARACTER SET statement because US7ASCII is a strict subset of WE8ISO8859P1, ZHS16GBK, and UTF8:

Table 3-8 Sample Migration Scenarios

Current Character Set   New Character Set   New Character Set is Strict Superset?
US7ASCII                WE8ISO8859P1        Yes
US7ASCII                ZHS16GBK            Yes
US7ASCII                UTF8                Yes

Attempting to change the database character set to a character set that is not a strict superset can result in data loss and data corruption. To ensure data integrity, whenever migrating to a new character set that is not a strict superset, you must use export/import. It is essential to do a full backup of the database before using the ALTER DATABASE [NATIONAL] CHARACTER SET statement, since the command cannot be rolled back. The syntax is:

ALTER DATABASE [<db_name>] CHARACTER SET <new_character_set>;
ALTER DATABASE [<db_name>] NATIONAL CHARACTER SET <new_NCHAR_character_set>;

The database name is optional. The character set name should be specified without quotes, for example:

ALTER DATABASE CHARACTER SET WE8ISO8859P1;

To change the database character set, perform the following steps. Not all of them are absolutely necessary, but they are highly recommended:

SQL> SHUTDOWN IMMEDIATE;                         -- or NORMAL
    <do a full backup>

SQL> STARTUP MOUNT;
SQL> ALTER SYSTEM ENABLE RESTRICTED SESSION;     -- keep ordinary users out during the change
SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES=0;     -- stop background job queue processes
SQL> ALTER DATABASE OPEN;
SQL> ALTER DATABASE CHARACTER SET <new_character_set_name>;
SQL> SHUTDOWN IMMEDIATE;                         -- or NORMAL
SQL> STARTUP;

To change the national character set, replace the ALTER DATABASE CHARACTER SET statement with the ALTER DATABASE NATIONAL CHARACTER SET statement. You can issue both statements together if desired.

Customizing Character Sets

In some cases, you may wish to tailor a character set to meet specific user needs. In Oracle8i, users can extend an existing encoded character set definition to suit their needs. User-defined Characters (UDC) are often used to encode special characters representing:

This section describes how Oracle supports UDC. It covers character sets with user-defined characters, Oracle's character set conversion architecture, the Unicode 2.1 Private Use Area, and UDC cross references.

Character Sets with User-Defined Characters

User-defined characters are typically supported within East Asian character sets. These East Asian character sets have at least one range of reserved codepoints for use as user-defined characters. For example, Japanese Shift JIS reserves 1880 codepoints for UDC, as follows:

Table 3-9 Shift JIS Codepoint Example

Japanese Shift JIS UDC Range    Number of Codepoints
0xf040-0xf07e, 0xf080-0xf0fc    188
0xf140-0xf17e, 0xf180-0xf1fc    188
0xf240-0xf27e, 0xf280-0xf2fc    188
0xf340-0xf37e, 0xf380-0xf3fc    188
0xf440-0xf47e, 0xf480-0xf4fc    188
0xf540-0xf57e, 0xf580-0xf5fc    188
0xf640-0xf67e, 0xf680-0xf6fc    188
0xf740-0xf77e, 0xf780-0xf7fc    188
0xf840-0xf87e, 0xf880-0xf8fc    188
0xf940-0xf97e, 0xf980-0xf9fc    188

The Oracle character sets listed in Table 3-10 contain pre-defined ranges that allow you to support User Defined Characters:

Table 3-10 Oracle Character Sets with UDC

Character Set Name   Number of UDC Codepoints Available
JA16DBCS             4370
JA16DBCSFIXED        4370
JA16EBCDIC930        4370
JA16SJIS             1880
JA16SJISFIXED        1880
JA16SJISYEN          1880
KO16DBCS             1880
KO16DBCSFIXED        1880
KO16MSWIN949         1880
ZHS16DBCS            1880
ZHS16DBCSFIXED       1880
ZHS16GBK             2149
ZHS16GBKFIXED        2149
ZHT16DBCS            6204
ZHT16MSWIN950        6217

Oracle's Character Set Conversion Architecture

The codepoint value that represents a particular character may vary among different character sets. For example, the Japanese kanji character:

Figure 3-3 Kanji Example



is encoded as follows in different Japanese character sets:

Table 3-11 Kanji Example with Character Conversion

Character Set                              Unicode   JA16SJIS   JA16EUC   JA16DBCS
Character value of the kanji in Figure 3-3   0x4E9C    0x889F     0xB0A1    0x4867

In Oracle, all character sets are defined in terms of Unicode 2.1 code points; that is, each character is defined as a Unicode 2.1 code value. Character set conversion takes place transparently to users by using Unicode as the intermediate form. For example, when a JA16SJIS client connects to a JA16EUC database, the character shown in Figure 3-3, "Kanji Example", entered from the JA16SJIS client (value 0x889F) is internally converted to Unicode (value 0x4E9C) and then converted to JA16EUC (value 0xB0A1).

Unicode 2.1 Private Use Area

Unicode 2.1 reserves the range 0xE000-0xF8FF for the Private Use Area (PUA). The PUA is intended for private use character definition by end users or vendors.

UDC can be converted between two Oracle character sets by using Unicode 2.1 PUA as the intermediate form, the same as standard characters.

UDC Cross References

UDC cross references between Japanese character sets, Korean character sets, Simplified Chinese character sets and Traditional Chinese character sets are contained in the following distribution sets:

${ORACLE_HOME}/ocommon/nls/demo/udc_ja.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_ko.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_zhs.txt
${ORACLE_HOME}/ocommon/nls/demo/udc_zht.txt

These cross references are useful when registering User Defined Characters across operating systems. For example, when registering a new UDC on both a Japanese Shift-JIS operating system and a Japanese IBM Host operating system, you may want to pick up 0xF040 on Shift-JIS operating system and 0x6941 on IBM Host operating system for the new UDC so that Oracle can convert correctly between JA16SJIS and JA16DBCS. You can find out that both Shift-JIS UDC value 0xF040 and IBM Host UDC value 0x6941 are mapped to the same Unicode PUA value 0xE000 in the UDC cross reference.

For further details on how to customize a character set definition file, see Appendix B, "Customizing Locale Data".

Monolingual Database Example

Same Character Set on the Client and the Server

This section describes the simplest example of an NLS database setup.

Both the client and server in Figure 3-4, "Monolingual Scenario", are running with the same language environment, and are both using the same character encoding. The monolingual scenario has the advantage of fast response because the overhead associated with character set conversion is avoided.

Figure 3-4 Monolingual Scenario



Character Set Conversion

Character set conversion is often necessary in a client/server computing environment where a client application may reside on a different computer platform from that of the server, and both platforms may not use the same character encoding schemes. Character data passed between client and server must be converted between the two encoding schemes. Character conversion occurs automatically and transparently via Net8.

A conversion is possible between any two character sets, as shown in Figure 3-5:

Figure 3-5 Character Set Conversion Example



However, in cases where a target character set does not contain all characters in the source data, replacement characters are used. If, for example, a server uses US7ASCII and a German client WE8ISO8859P1, the German character ß is replaced with ? and the character ä is replaced with a.

Replacement characters may be defined for specific characters as part of a character set definition. Where a specific replacement character is not defined, a default replacement character is used. To avoid the use of replacement characters when converting from client to database character set, the server character set should be a superset (or equivalent) of all the client character sets. In Figure 3-4, "Monolingual Scenario", the server's character set was not chosen wisely. If German data is expected to be stored on the server, a character set which supports German letters is needed, for example, WE8ISO8859P1 for both the server and the client.
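The CONVERT SQL function makes this behavior visible (a minimal sketch; the exact result depends on the replacement mappings defined for the character sets, but the substitutions shown match the example above):

-- Convert a WE8ISO8859P1 string to US7ASCII. ß has no US7ASCII
-- equivalent and becomes the default replacement character '?',
-- while ä falls back to the related character 'a'.
SELECT CONVERT('Gemäß', 'US7ASCII', 'WE8ISO8859P1') AS converted
FROM   dual;
-- Expected result, per the replacement rules described above: Gema?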

In some varying-width multi-byte cases, character set conversion may introduce noticeable overhead. Users need to carefully evaluate their situation and choose character sets to avoid conversion as much as possible. Having the appropriate character set for the database and the client will avoid the overhead of character conversion, as well as any possible data loss.

Multilingual Database Example

Note that some character sets support multiple languages. For example, WE8ISO8859P1 supports the following Western European languages:

Table 3-12 WE8ISO8859P1 Example

Catalan    Finnish     Italian      Swedish
Danish     French      Norwegian
Dutch      German      Portuguese
English    Icelandic   Spanish

WE8ISO8859P1 supports these languages because they are all based on a similar writing script. This situation is often called restricted multilingual support: the character set supports a group of related writing systems or scripts. In Table 3-12, WE8ISO8859P1 supports Latin-based scripts.

Restricted Multilingual Support

In Figure 3-6, both clients have access to the server's data.

Figure 3-6 Restricted Multilingual Support Example



Unrestricted Multilingual Support

Often, unrestricted multilingual support is needed, and a universal character set such as Unicode is necessary as the server database character set. Unicode has two major encoding schemes: UCS2 and UTF8. UCS2 is a two-byte fixed-width format; UTF8 is a multi-byte format with a variable width. Oracle8i provides support for the UTF8 format. This enhancement is transparent to clients who already provide support for multi-byte character sets.

Character set conversion between a UTF8 database and any single-byte character set introduces very little overhead. Conversion between UTF8 and any multi-byte character set has some overhead but there is no conversion loss problem except that some multi-byte character sets do not support user-defined characters during character set conversion to and from UTF8. See Appendix A, "Locale Data", for further information.

Figure 3-7, "Unrestricted Multilingual Support Example", shows how a database can support many different languages. Here, Japanese, French, and German clients are all accessing the same database based on the Unicode character set. Please note that each client accesses only data it can process. If Japanese data were retrieved, modified, and stored back by the German client, all Japanese characters would be lost during the character set conversion.

Figure 3-7 Unrestricted Multilingual Support Example




Copyright © 1996-2000, Oracle Corporation.

All Rights Reserved.
