|Oracle9i Globalization Support Guide
Release 1 (9.0.1)
Part Number A90236-02
This chapter explains how characters are sorted in an Oracle environment. Its sections are:
Different languages have different sort orders. What's more, different cultures or countries using the same alphabets may sort words differently. For example, in Danish, the letter Æ is after
Ü are considered to be variants of the same letter. Sort order can be case sensitive or insensitive and can ignore accents or not. It can also be either phonetic or based on the appearance of the character, such as ordering by the number of strokes or by radicals for East Asian ideographs. Another common sorting issue is when letters are combined. For example, in traditional Spanish,
ch is a distinct character, which means that the correct order is: cerveza, colorado, cheremoya, and so on. This means that the letter
c cannot be sorted until checking to see if the next letter is an
Oracle provides several different types of sort, and can achieve a linguistically correct sort as well as the new multilingual ISO standard (14651), which is designed to handle many languages at the same time.
Conventionally, when character data is stored, the sort sequence is based on the numeric values of the characters defined by the character encoding scheme. This is called a binary sort. Binary sorts are the fastest type of sort, and produce reasonable results for the English alphabet because the ASCII and EBCDIC standards define the letters A to Z in ascending numeric value. Note, however, that in the ASCII standard, all uppercase letters appear before any lowercase letters. In the EBCDIC standard, the opposite is true: all lowercase letters appear before any uppercase letters.
When characters used in other languages are present, a binary sort generally does not produce reasonable results. For example, an ascending
BY query would return the character strings
ÄBC, in the sequence, when the
Ä has a higher numeric value than
B in the character encoding scheme. For languages using Chinese characters, a binary sort is not usually linguistically meaningful.
To produce a sort sequence that matches the alphabetic sequence of characters, another sort technique must be used that sorts characters independently of their numeric values in the character encoding scheme. This technique is called a linguistic sort. A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order.
Oracle offers two kinds of linguistic sorts:
Oracle makes two passes when comparing strings in monolingual sorts. The first pass is to compare the major value of entire string from the major table and the second pass is to compare the minor value from the minor table. Each major table entry contains the Unicode codepoint and major value. Usually, letters with the same appearance will have the same major value. Oracle defines letters with diacritic and case differences for the same major value but different minor values. Table 4-1 illustrates sample values for sorting
|Glyph||Major Value||Minor Value|
Oracle9i provides multilingual linguistic sorts so that you can sort more than one language as part of one sort. This is useful for certain regions or languages that have complex sorting rules or global multilingual databases. Additionally, Oracle9i still supports all the sort orders defined by previous releases.
For Asian language data or multilingual data, Oracle provides a sorting mechanism based on an ISO standard (ISO14651) and the Unicode 3.0 standard. Multilingual linguistic sorts also work for Asian language sorts ordered by the number of strokes, PinYin, or radicals. In addition, they can handle canonical equivalence and surrogate codepoint pairs. You can define up to 1.1 million codepoints in one sort.
For example, in Oracle9i, a
French sort is supported, but the new multilingual linguistic sort for French can also be applied by changing the sort order from
_M represents the new ISO standard (ISO 14651) for multilingual sorting. By doing so, the sorting order will be based on the
GENERIC_M sorting order (ISO standard) and will be able to sort at the secondary level from right to left. Oracle Corporation recommends using a multilingual linguistic sort if the tables contain multilingual data. If the tables contain only pure French, for memory usage concern, a
French monolingual sort may get better performance. You have to make a trade-off between extensibility and performance.
To use a multilingual linguistic sort, you can specify the default sort order by setting the environment variable
NLS_SORT or using the
You can create new linguistic indexes based on multilingual linguistic sorts just as you did for monolingual linguistic sorts. However, if you do not want to change every index hint of your SQL statements, but you do need to use those new multilinguistic sorts. Here are the steps:
Oracle evaluates multilingual sorts at three levels of precision:
A primary level sort distinguishes between base characters, such as the difference between characters
b. It is up to individual locales to define if
a is before
b is before
a, or they are equal. The binary representation of the characters is completely irrelevant. If a character is an ignorable character, it is assigned a primary level weight (or order) of zero, which means it is ignored at the primary level. Ignorable characters on all other levels are also defined by the use of weight zero. At the primary level, the following words are not distinguished between uppercase and lowercase words and can appear in different orders.
A secondary level sort distinguishes between base characters (the primary level sort), plus it distinguishes the different diacritical marks on a given base character. For example, the character
Ä differs from the character
A only because it has a diacritical mark. Thus,
A differ on the secondary level but they are the same on the primary level because they are derived from the same base character
SELECT words FROM rdictionary; --with a secondary level sort words ----- resume résumé Résumé Resumes resumes résumés
A tertiary level sort distinguishes between base characters (primary level sort), diacritical marks (secondary level sort), and the different cases of characters. It can also include difference of special characters such as
*. For example, characters
A are different on the tertiary level and equal on the primary and secondary levels because they only have a case difference. Another example is that characters
A are equal on the primary level and different on secondary and tertiary levels. The final example is that the primary order for the dash character
- is 0. That is, it is ignored on the primary and secondary levels. If a dash is compared with another character whose primary level order is nonzero, for example, the character
u, then no result for the primary level is available because
u is not compared with anything. In this case, Oracle finds a difference between
u, but only at the tertiary level. For example:
SELECT words FROM rdictionary; --with a tertiary level sort words ----- resume Resume résumé Résumé resumes résumés Resumes Résumés
You should consider the following issues when sorting:
When sorting, some characters can or should be ignored. For example, a dash in
multi-lingual could be treated the same for sorting purposes as
multilingual. These characters are called ignorable characters. There are two kinds of ignorable characters: accents and punctuation.
Examples of ignorable accent characters:
Examples of ignorable punctuation characters:
Sorting elements usually consist of a single character, but, in some locales, two or more characters in a character string must be considered as a single sorting element during sorting. For example, in Spanish, the string
ch is composed of two characters. These characters are called contracting characters or group characters in multilingual linguistic sorting and special combination letters in monolingual linguistic sorting. The important difference is that a composed character can be displayed as a single character on a terminal (if desired), while a contracting character is only used for sorting, and its component characters must be rendered separately. Note that a contracting character is not a Unicode-composed character.
In some locales, one character must be sorted as if it was a character string. An example is the German character
ß (sharp s). It is sorted exactly the same as the string
SS. Another example is that
ö sorts as if it were
od and before
of. These characters are known as expanding characters in multilingual linguistic sorting and special letters in monolingual linguistic sorting. Just as with contracting characters, the replacement string for an expanding character is only meaningful for sorting.
In Japanese, a prolonged sound mark (resembles an em dash
--) represents a length mark that lengthens the vowel of the preceding character. Depending on the vowel, the result will sort in a different order. This is called context-sensitive sorting. For example, after the character
-- length mark indicates a long
a and is treated the same as
a, while after the character
-- length mark indicates a long
i and is treated the same as
i. Transliterating this to Latin characters, a sort might look like this:
kaa -- kaa and ka-- are the same ka-- kai -- kai follows ka- because i is after a kia kii -- kii and ki-- are the same ki--
One Unicode code point may be equivalent to a sequence of base character code points plus combining characters (accent marks) code points, regardless of locales in use. This is called the Unicode canonical equivalence. For example,
ä equals its base letter
a and a combining character diaeresis. A linguistic flag,
Canonical_equivalence=TRUE, which you can set in the definition file, indicates that all canonical equivalence rules defined in Unicode 3.0 need to be applied. You can change this flag to
FALSE to speed up the comparison and ordering functions if all the data is in its composed form.
You can extend UTF-16 and UTF-8 to encode more than 1 million characters. These extended characters are called surrogate pairs. Multilingual linguistic sorts can support up to one million surrogate pairs. However, surrogate characters cannot be defined as contracting characters, expanding characters, or context-sensitive characters.
In French, the sorting of strings with accented characters compares base characters from left to right, but compares accented characters from right to left. For example, by default, an accented character is placed after its unaccented variant. Then the two strings
Edít are in proper French order. They are equal on the primary level, and the secondary order is determined by examining accented characters from right to left. Individual locales can request that the accented characters and related diacritical marks be sorted with the right-to-left rule. This is specified in a locale specification file by using a linguistic flag.
Chapter 11, "Oracle Locale Builder Utility" for more information about customizing a sort
In Thai and Lao, some characters must first be swapped with their following character before sorting. Normally, these types of character are symbols representing vowel sounds, and the next character will be a consonant. Consonants and vowels must be swapped before sorting. A linguistic flag is used to enable you to specify all the characters to be swapped.
Base letters are defined in a base letter table, which maps each letter to its base letter. For example,
Ä all map to
a, which is the base letter. This concept is particularly relevant for working with Oracle9i Text.
Special letters is the term used in monolingual sorts. They are called expanding characters in multilingual sorts.
Special combination letters is the term used in monolingual sorts. They are called contracting letters in multilingual sorts.
One lowercase letter may map to multiple uppercase letters. For example, in traditional German, the uppercase letters for
Special cases like these are also handled when converting uppercase characters to lowercase, and vice versa. Such case-conversion issues are handled by the
NLS_INITCAP functions, according to the conventions established by the linguistic sort sequence. (The standard functions
INITCAP cannot handle these special cases.)
Oracle supports special lowercase letters, so one letter may map to multiple base letters. An example is the Turkish uppercase
I becoming a small, dotless
Linguistic sorting is language-specific and requires more data processing than binary sorting. Binary sorting ASCII is accurate and fast because it is in order. When data of multiple languages is stored in the database, you may want your applications to collate a result set returned from a
SELECT statement using the
BY clause with different collating sequences based upon the language being used. You can accomplish this without sacrificing performance by using linguistic indexes, a feature introduced in Oracle8i. While a linguistic index for a column slows down inserts and updates, it greatly improves the performance of linguistic sorting with the
You can create a function-based index that uses languages other than English. The index does not change the linguistic sort order determined by
NLS_SORT. The index simply improves the performance. An example is:
returns the result much faster than without an index.
Oracle9i Database Concepts for more information about function-based indexes
There are three ways to build linguistic indexes for data in multiple languages:
CREATE INDEX french_index ON emp (NLSSORT(emp_name, 'NLS_SORT=FRENCH')); CREATE INDEX german_index ON emp (NLSSORT(emp_name, 'NLS_SORT=GERMAN'));
Which index to use is based on the
NLS_SORT session parameters or the arguments of the
NLSSORT function you specified in the
BY clause. For example, if the session variable
NLS_SORT is set to
FRENCH, you can use
french_index and when it is set to
GERMAN, you can use
LANG_COLin the example below) that contains
NLS_LANGUAGEvalues for the corresponding column on which the index is built as a parameter to the
NLSSORTfunction. The following example builds a single linguistic index for multiple languages. With this index, the rows with the same values for
NLS_LANGUAGEare collated together.
Which index to use is based on the argument of the
NLSSORT function you specified in the
FRENCH_M. These indexes collate characters according to the rules defined in ISO 14651.
Chapter 4, "Linguistic Sorting" for more information about the different Unicode sorting sequences
If you want to use a single linguistic index or multiple linguistic indexes, some requirements must be met for the linguistic index. The first requirement is that the
QUERY_REWRITE_ENABLED initialization parameter must be set to
TRUE. This is not a specific requirement for linguistic indexes, but for all function-based indexes. Here is an example of setting
The second requirement, which is specific to linguistic indexes, is that
NLS_COMP needs to be set to
ANSI. There are various ways to set
NLS_COMP. For example:
The third requirement is that
NLS_SORT needs to indicate the linguistic definition you want to use for the linguistic sort. If you want a French linguistic sort order,
NLS_SORT needs to be set to
FRENCH. If you want a German linguistic sort order,
NLS_SORT needs to beset to
There are various ways to set
NLS_SORT. Although the following example uses an
SESSION statement, it is probably better for you to set
NLS_SORT as a client environment variable so that you can use the same SQL statements for all languages and different linguistic indexes can be used, based on
NLS_SORT being set in the client environment. The following is an example of setting
The fourth requirement is that you need to use the cost-based optimizer with the optimizer mode set to
FIRST_ROWS, because linguistic indexes are not recognized by the rule-based optimizer. The following is an example of setting the optimizer mode:
The last thing is that you need to set
NULL when you want to use
column_name is the column with the linguistic index. This is necessary only when you use an
The following example shows how to set up a French linguistic index. For
NLS_SORT, you may want to set it in the client environment variable instead of with the
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE; ALTER SESSION SET NLS_COMP = ANSI; ALTER SESSION SET NLS_SORT='FRENCH'; ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS; CREATE TABLE test(col VARCHAR(20) NOT NULL); CREATE INDEX test_idx ON test(NLSSORT(col, 'NLS_SORT=FRENCH')); SELECT * FROM test WHERE NLSSORT(col) IS NOT NULL ORDER BY col; SELECT * FROM test WHERE col > 'JJJ';
Oracle9i Database Concepts for more information about function-based indexes
You can create a function-based index that improves the performance of case-insensitive searches. For example:
CREATE INDEX case_insensitive_ind ON my_table(NLS_UPPER(empname)); SELECT * FROM my_table WHERE NLS_UPPER(empname) = 'KARL';
You can customize sorting with the Locale Builder Utility.