Oracle9i Globalization Support Guide
Release 1 (9.0.1)
Part Number A90236-02
The Unicode UTF-16 national character set.
The Unicode 3.0 UTF-8 database character set with 4-byte surrogate pairs support.
American Standard Code for Information Interchange. A common encoded 7-bit character set for English. ASCII includes the letters A-Z and a-z, as well as digits, punctuation symbols, and control characters. The Oracle character set name for this is US7ASCII.
Sorting of character strings based on their binary coded value representations.
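A minimal sketch of a binary sort, assuming ASCII-encoded strings: the strings are ordered by their encoded byte values, so every uppercase letter sorts before every lowercase letter. The sample words and the case-insensitive comparison are illustrative only; the latter is a simple contrast, not an Oracle linguistic sort.

```python
words = ["apple", "Zebra", "banana", "Apple"]

# Binary sort: order strings by their encoded byte values.
binary_sorted = sorted(words, key=lambda w: w.encode("ascii"))

# For contrast only: a simple case-insensitive ordering.
caseless_sorted = sorted(words, key=str.casefold)

print(binary_sorted)    # ['Apple', 'Zebra', 'apple', 'banana']
print(caseless_sorted)  # ['apple', 'Apple', 'banana', 'Zebra']
```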
Byte semantics means treating strings as a sequence of bytes.
Case conversion refers to changing a character from its uppercase to lowercase form, or vice versa.
A character is an abstract element of a text. A character is different from a glyph (font glyph), which is a specific instance of a character. For example, the first character of the English uppercase alphabet can be printed (or displayed) as A, A, A, etc. All these different forms are different glyphs, but they represent the same character. A character, a character code, and a glyph are related as follows:
character --(encoding)--> character code --(font)--> glyph
For example, the first character of the English uppercase alphabet is represented in computer memory as a number (or a character code). The character code is 0x41 under the ASCII encoding scheme, 0xc1 under the EBCDIC encoding scheme, or some other number under a different encoding scheme. When we print or display this character, we use a font. We have to choose a font for the ASCII encoding scheme (or for a superset of it) if we are using the ASCII encoding scheme, or a font for the EBCDIC encoding scheme if we are using the EBCDIC encoding scheme. The character is then printed (or displayed) as A, A, A, or some other form. All these different forms are different glyphs, but they represent the same character.
A character code is a number which represents a specific character. In order for computers to handle a character, we need a specific number which is assigned to that character. The number (or the character code) depends on what encoding scheme we are using. For example, the first character of the English uppercase alphabet has the character code 0x41 for the ASCII encoding scheme, but the same character has the character code 0xc1 for the EBCDIC encoding scheme.
See also character.
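The dependence of a character code on the encoding scheme can be checked with a short sketch. It assumes Python's `cp037` codec as a representative member of the EBCDIC family; other EBCDIC variants exist and are not covered here.

```python
ch = "A"

# Encode the same character under two encoding schemes and read the byte value.
ascii_code = ch.encode("ascii")[0]
ebcdic_code = ch.encode("cp037")[0]  # cp037: one EBCDIC code page (assumption)

print(hex(ascii_code), hex(ebcdic_code))  # 0x41 0xc1
```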
Character semantics means treating strings as a sequence of characters, as opposed to byte semantics, where strings are counted in bytes.
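As a rough illustration of the difference between byte and character semantics, the sketch below measures the same string both ways. It uses Python's built-in `len` and the UTF-8 encoding as stand-ins; it does not model how Oracle applies length semantics to columns.

```python
text = "na\u00efve"  # "naïve", with a precomposed i-diaeresis

char_length = len(text)                  # length under character semantics
byte_length = len(text.encode("utf-8"))  # length under byte semantics (UTF-8)

print(char_length, byte_length)  # 5 6
```

A limit of five would therefore accept this string when lengths are counted in characters, but reject it when they are counted in UTF-8 bytes.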
A character set is a set of characters for a specific language or group of languages. There can be many different character sets just for one language.
A character set does not always imply any specific character encoding scheme.
In this manual, a character set generally implies a specific character encoding scheme, which is how a character code is assigned to each character of the character set. Therefore, the meaning of the term character set is generally the same as encoded character set in this manual.
A character string is a sequence of zero or more characters. A character string that contains no characters is called a "null string"; the number of characters in such a string is 0 (zero).
Same as encoded character set.
An independent unit used to represent data, such as a letter, a letter with a diacritical mark, a digit, ideograph, punctuation, or symbol.
Character classification information provides details about the type of character associated with each legal character code; that is, whether it is an alphabetic, uppercase, lowercase, punctuation, control, or space character, etc.
A character encoding scheme is a rule that assigns numbers (or character codes) to all characters in a character set. We also use the shortened term encoding scheme (or encoding method, or just encoding).
The encoded character set which the client uses. A client character set can differ from the database server character set, in which case, character set conversion must occur.
Ordering of character strings in a given alphabet in a linguistic sort order or a binary sort order.
A character that graphically combines with a preceding base character. These characters are not used in isolation. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
A single character which can be represented by a composite character sequence. This type of character is found in the scripts of Thai, Lao, Vietnamese, and Korean Hangul, as well as many Latin characters used in European languages.
A character sequence consisting of a base character followed by one or more combining characters. This is also referred to as a combining character sequence.
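The relationship between a composite character and its composite character sequence can be sketched with Python's standard `unicodedata` module. The choice of "ä" as the sample character is an assumption made for illustration.

```python
import unicodedata

precomposed = "\u00e4"  # "ä" as a single composite character

# Canonical decomposition yields a base character plus a combining character.
decomposed = unicodedata.normalize("NFD", precomposed)
names = [unicodedata.name(c) for c in decomposed]

# A nonzero combining class marks the second code point as a combining character.
combining_class = unicodedata.combining(decomposed[1])

# Canonical composition restores the single composite character.
recomposed = unicodedata.normalize("NFC", decomposed)

print(names)  # ['LATIN SMALL LETTER A', 'COMBINING DIAERESIS']
```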
The encoded character set in which text stored in the database is represented. This includes
LONG, and fixed-width
CLOB column values and all SQL and PL/SQL text stored in the database.
A mark added to a letter that usually provides information about pronunciation or stress. The letter "ä" is an example of a diacritical mark added to the letter "a".
Extended Binary Coded Decimal Interchange Code. EBCDIC is a family of encoded character sets used mostly on IBM systems.
An encoded character set is a character set with an associated character encoding scheme. An encoded character set specifies how a number (or a character code) is assigned to each character of the character set based on a character encoding scheme.
Encoding Method or Encoding scheme. See also character encoding scheme.
An ordered collection of character glyphs which provides a graphical representation of characters within a character set.
The process of making software flexible enough to be used in many different linguistic and cultural environments. Globalization should not be confused with localization, which is the process of preparing software for use in one specific locale.
A glyph (font glyph) is a specific instance of a character. A character can have many different glyphs. For example, the first character of the English uppercase alphabet can be printed (or displayed) as A, A, A, etc. All these different forms are different glyphs, but they represent the same character. See also character.
A symbol representing an idea. Chinese is an example of an ideographic writing system.
The process of making software flexible enough to be used in many different linguistic and cultural environments. Internationalization should not be confused with localization, which is the process of preparing software for use in one specific locale.
International Organization for Standardization. A worldwide federation of national standards bodies from 130 countries. The mission of ISO is to promote the development of standardization and related activities in the world with a view to facilitating the international exchange of goods and services.
A multilingual sort designed to handle almost all languages of the world.
A universal character set standard defining the characters of most major scripts used in the modern world. In 1993, ISO adopted Unicode version 1.1 as ISO/IEC 10646-1:1993. ISO/IEC 10646 has two formats: UCS-2 is a 2-byte fixed-width format and UCS-4 is a 4-byte fixed-width format. There are three levels of implementation, all relating to support for composite characters. Level 1 requires no composite character support, level 2 requires support for specific scripts (including most of the Unicode scripts such as Arabic, Thai, etc.), and level 3 requires unrestricted support for composite characters in all languages.
The 3-letter abbreviation used to denote a local currency, which is based on the ISO 4217 standard. For example, "USD" represents the United States Dollar.
A family of 8-bit encoded character sets. The most common one is ISO 8859-1 (also known as Latin-1), and is used for Western European languages.
An International String Ordering standard sort designed to handle almost all languages.
Formally known as the ISO 8859-1 character set standard. An 8-bit extension to ASCII which adds 128 characters covering the most common Latin characters used in Western Europe. The Oracle character set name for this is WE8ISO8859P1. See "ISO 8859".
Length semantics determines how string lengths are measured. Strings can be treated as a sequence of characters or as a sequence of bytes.
An index built on a linguistic collation order.
A sort of strings based on requirements from a locale instead of based on the binary representation of the strings. See also multilingual linguistic sort and monolingual linguistic sort.
A collection of information regarding the linguistic and cultural preferences from a particular region. Typically, a locale consists of language, territory, character set, linguistic, and calendar information defined in NLS data files.
A GUI utility that offers a way to modify, view or define locale-specific data. You can also create your own formats for language, territory, character set, and collation.
The process of providing language-specific or culture-specific information for software systems. Translation of an application's user interface would be an example of localization. Localization should not be confused with internationalization, which is the process of generalizing software so it can handle many different linguistic and cultural conventions.
An Oracle sort that uses two passes when comparing strings. This is fine for most European languages, but is inadequate for Asian languages. See also multilingual linguistic sort.
Support for only one language.
Multibyte means characters represented by two or more bytes.
When character codes are assigned to all characters in a specific language (or a group of languages), one byte (8 bits) can represent 256 different characters. Two bytes (16 bits) can represent up to 65,536 different characters. However, two bytes are still not enough to represent all the characters for many languages. We use 3 bytes or 4 bytes for those characters.
One example is the UTF8 encoding of Unicode. In UTF8, there are many 2-byte and 3-byte characters.
Another example is Traditional Chinese language used in Taiwan. It has more than 80,000 different characters. Some character encoding schemes used in Taiwan encode characters in up to 4 bytes.
A multibyte character is a character whose character code consists of two or more bytes under a certain character encoding scheme. Note that the same character may have a different character code under a different character encoding scheme. Without knowing which character encoding scheme is being used, Oracle cannot tell which character is a multibyte character. For example, Japanese Hankaku-Katakana (half-width Katakana) characters are one byte in the JA16SJIS encoded character set, two bytes in JA16EUC, and three bytes in UTF8. See "single-byte character".
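The half-width Katakana example can be reproduced approximately in Python. The codec names `shift_jis` and `euc_jp` are Python's own and are assumed here to correspond closely enough to the Oracle character sets JA16SJIS and JA16EUC for this purpose; the sample character U+FF71 is likewise an illustrative choice.

```python
ch = "\uff71"  # HALFWIDTH KATAKANA LETTER A

# Count the bytes needed for the same character under three encoding schemes.
sizes = {enc: len(ch.encode(enc)) for enc in ("shift_jis", "euc_jp", "utf-8")}

print(sizes)  # {'shift_jis': 1, 'euc_jp': 2, 'utf-8': 3}
```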
A multibyte character string is a character string which consists of one of the below.
(The character string is called the "null string" in this case.)
An Oracle sort that evaluates strings on three levels when comparing them.
An alternate character set from the database character set that can be specified for
NCLOB columns. National character sets are in Unicode only.
Binary files used by the Locale Builder to define locale-specific data.
National Language Support. NLS allows users to interact with the database in their native languages. It also allows applications to run in different linguistic and cultural environments. The term is somewhat obsolete because Oracle supports global users simultaneously.
A general phrase referring to the contents in many files with .nlb suffixes. These files contain data that the NLSRTL library uses to provide specific NLS support.
National Language Support Run-Time Library. This library is responsible for providing locale-independent algorithms for internationalization. The locale-specific information (that is, NLSDATA) is read by the NLSRTL library during run-time.
Text files used by the Locale Builder to define locale-specific data. Because they are in text, you can view the settings.
A character used during character conversion when the source character is not available in the target character set. For example, ? is often used as Oracle's default replacement character.
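The effect of a replacement character can be sketched with Python's `errors="replace"` encoding option, which also substitutes `?` for characters missing from the target character set. This is Python's behavior, shown as an analogy; it is not a description of Oracle's conversion routines.

```python
source = "caf\u00e9 \u20ac5"  # "café €5"

# Convert to a target character set that lacks both the accented e and the euro sign.
converted = source.encode("ascii", errors="replace")

print(converted)  # b'caf? ?5'
```

The conversion is lossy: nothing in the result records which characters the question marks replaced.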
Multilingual support that is restricted to a group of related languages, but not all languages. Similar language families, such as Western European languages, can be represented with, for example, ISO 8859-1. In this case, however, Thai could not be added.
A collection of related graphic symbols used in a writing system. Some scripts are used to represent multiple languages, and some languages use multiple scripts. Examples of scripts include Latin, Arabic, and Han.
Single-byte (or single byte) means one byte. One byte usually consists of 8 bits. When we assign character codes to all characters for a specific language, one byte (8 bits) can represent 256 different characters.
A single-byte character is a character whose character code consists of one byte under a certain character encoding scheme. Note that the same character may have a different character code under a different character encoding scheme. Without knowing which character encoding scheme we are using, we cannot tell which character is a single-byte character. For example, the euro currency symbol is one byte in the WE8MSWIN1252 encoded character set, two bytes in AL16UTF16, and three bytes in UTF8. See also multibyte character.
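The euro symbol example can be checked in the same way. The Python codecs `cp1252`, `utf-16-be`, and `utf-8` are assumed here as stand-ins for the Oracle character sets WE8MSWIN1252, AL16UTF16, and UTF8.

```python
euro = "\u20ac"  # EURO SIGN

# The same character is single-byte in one encoding and multibyte in the others.
euro_sizes = {enc: len(euro.encode(enc)) for enc in ("cp1252", "utf-16-be", "utf-8")}

print(euro_sizes)  # {'cp1252': 1, 'utf-16-be': 2, 'utf-8': 3}
```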
A single-byte character string is a character string that consists of one of the below.
(The character string is called "null string" in this case.)
You can extend Unicode to encode more than 1 million characters. These extended characters are called surrogate pairs. Surrogate pairs are designed to allow representation of characters in future extensions of the Unicode standard. Surrogate pairs require 4 bytes in UTF-8 and UTF-16.
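A brief sketch of the storage cost of such an extended character, using standard UTF-8 and UTF-16 as implemented by Python. The sample code point U+10400 is an arbitrary choice from outside the 16-bit range.

```python
extended = "\U00010400"  # a character beyond U+FFFF

utf8_bytes = extended.encode("utf-8")
utf16_bytes = extended.encode("utf-16-be")  # big-endian, no byte order mark

print(len(utf8_bytes), len(utf16_bytes))  # 4 4
```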
UCS stands for "Universal Multiple-Octet Coded Character Set". It is a 1993 ISO and IEC standard character set. Fixed-width 16-bit Unicode. Each character occupies 16 bits of storage. The Latin-1 characters are the first 256 code points in this standard, so it can be viewed as a 16-bit extension of Latin-1.
Fixed-width 32-bit Unicode. Each character occupies 32 bits of storage. The UCS-2 characters are the first 65,536 code points in this standard, so it can be viewed as a 32-bit extension of UCS-2. This is also sometimes referred to as ISO-10646. ISO-10646 is a standard that specifies up to 2,147,483,648 characters in 32768 planes, of which the first plane is the UCS-2 set. The ISO standard also specifies transformations between different encodings.
Unicode is a type of universal character set, a collection of 64K characters encoded in a 16-bit space. It encodes nearly every character in just about every existing character set standard, covering most written scripts used in the world. It is owned and defined by Unicode Inc. Unicode is a canonical encoding, which means its values can be passed around in different locales. However, it does not guarantee a round-trip conversion between it and every Oracle character set without information loss.
A 16-bit binary value that can represent a unit of encoded text for processing and interchange. Every point between U+0000 and U+FFFF is a code point. The term Unicode code point is interchangeable with code element, code position, and code value.
NCHAR datatype (
NCLOB). You can store Unicode characters into columns of these datatypes irrespective of the database character set.
Being able to use as many languages as desired. A universal character set, such as Unicode, helps to provide unrestricted multilingual support because it supports a very large character repertoire, encompassing most modern languages of the world.
The Unicode 3.1 UTF-8 database character set with 6-byte surrogate pairs support.
A variable-width encoding of UCS-2 that uses sequences of 1, 2, or 3 bytes per character. Characters from 0-127 (the 7-bit ASCII characters) are encoded with one byte, characters from 128-2047 require two bytes, and characters from 2048-65535 require three bytes. The Oracle character set name for this is UTF8. The standard has left room for expansion to support the UCS-4 characters with sequences of 4, 5, and 6 bytes per character.
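The byte-length ranges described above can be written as a small function and compared with Python's UTF-8 encoder. The function name `utf8_length` is hypothetical, and the sketch covers only the 16-bit range from this entry.

```python
def utf8_length(code_point: int) -> int:
    """Return the UTF-8 byte count for a code point in the range 0-65535."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch covers only 16-bit code points")
    if code_point <= 127:
        return 1  # 7-bit ASCII characters
    if code_point <= 2047:
        return 2
    return 3      # 2048-65535

# Boundary values of each range (the surrogate range is deliberately avoided).
samples = [0, 127, 128, 2047, 2048, 65535]
lengths = [utf8_length(cp) for cp in samples]

print(lengths)  # [1, 1, 2, 2, 3, 3]
```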
An extension to UCS-2 that allows for pairs of UCS-2 code points to represent extended characters from the UCS-4 set. UCS-2 has ranges of code points allocated for high (leading) and low (trailing) surrogates that support UTF-16 encodings.
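The mapping from an extended code point to a high and low surrogate follows the standard UTF-16 arithmetic, sketched below. The function name `to_surrogate_pair` is hypothetical, and the sample code point U+10400 is chosen only for illustration.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # leading surrogate: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # trailing surrogate: bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x10400)

print(hex(high), hex(low))  # 0xd801 0xdc00
```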
A fixed-width character format that is well-suited for extensive text processing because it allows for data to be processed in consistent fixed-width chunks. Wide characters are intended for supporting internal character processing, and are therefore implementation-dependent.