Oracle9i Globalization Support Guide Release 1 (9.0.1) Part Number A90236-02 |
|
This chapter explains how characters are sorted in an Oracle environment. Its sections are:
Different languages have different sort orders. What's more, different cultures or countries using the same alphabets may sort words differently. For example, in Danish, the letter Æ is after Z
, while Y
and Ü
are considered to be variants of the same letter. Sort order can be case sensitive or insensitive and can ignore accents or not. It can also be either phonetic or based on the appearance of the character, such as ordering by the number of strokes or by radicals for East Asian ideographs. Another common sorting issue is when letters are combined. For example, in traditional Spanish, ch
is a distinct character, which means that the correct order is: cerveza, colorado, cheremoya, and so on. This means that the letter c
cannot be sorted until checking to see if the next letter is an h
.
Oracle provides several different types of sort, and can achieve a linguistically correct sort as well as the new multilingual ISO standard (14651), which is designed to handle many languages at the same time.
Conventionally, when character data is stored, the sort sequence is based on the numeric values of the characters defined by the character encoding scheme. This is called a binary sort. Binary sorts are the fastest type of sort, and produce reasonable results for the English alphabet because the ASCII and EBCDIC standards define the letters A to Z in ascending numeric value. Note, however, that in the ASCII standard, all uppercase letters appear before any lowercase letters. In the EBCDIC standard, the opposite is true: all lowercase letters appear before any uppercase letters.
When characters used in other languages are present, a binary sort generally does not produce reasonable results. For example, an ascending ORDER
BY
query would return the character strings ABC
, ABZ
, BCD
, ÄBC
, in the sequence, when the Ä
has a higher numeric value than B
in the character encoding scheme. For languages using Chinese characters, a binary sort is not usually linguistically meaningful.
To produce a sort sequence that matches the alphabetic sequence of characters, another sort technique must be used that sorts characters independently of their numeric values in the character encoding scheme. This technique is called a linguistic sort. A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order.
Oracle offers two kinds of linguistic sorts:
Oracle makes two passes when comparing strings in monolingual sorts. The first pass is to compare the major value of entire string from the major table and the second pass is to compare the minor value from the minor table. Each major table entry contains the Unicode codepoint and major value. Usually, letters with the same appearance will have the same major value. Oracle defines letters with diacritic and case differences for the same major value but different minor values. Table 4-1 illustrates sample values for sorting a
, A
, ä
, and Ä
.
Glyph | Major Value | Minor Value |
---|---|---|
a |
15 |
5 |
A |
15 |
10 |
ä |
15 |
15 |
Ä |
15 |
20 |
Oracle9i provides multilingual linguistic sorts so that you can sort more than one language as part of one sort. This is useful for certain regions or languages that have complex sorting rules or global multilingual databases. Additionally, Oracle9i still supports all the sort orders defined by previous releases.
For Asian language data or multilingual data, Oracle provides a sorting mechanism based on an ISO standard (ISO14651) and the Unicode 3.0 standard. Multilingual linguistic sorts also work for Asian language sorts ordered by the number of strokes, PinYin, or radicals. In addition, they can handle canonical equivalence and surrogate codepoint pairs. You can define up to 1.1 million codepoints in one sort.
For example, in Oracle9i, a French
sort is supported, but the new multilingual linguistic sort for French can also be applied by changing the sort order from French
to French_M
, where _M
represents the new ISO standard (ISO 14651) for multilingual sorting. By doing so, the sorting order will be based on the GENERIC_M
sorting order (ISO standard) and will be able to sort at the secondary level from right to left. Oracle Corporation recommends using a multilingual linguistic sort if the tables contain multilingual data. If the tables contain only pure French, for memory usage concern, a French
monolingual sort may get better performance. You have to make a trade-off between extensibility and performance.
To use a multilingual linguistic sort, you can specify the default sort order by setting the environment variable NLS_SORT
or using the NLSSORT
function.
% setenv NLS_SORT='French_M'
or
NLSSORT('','NLS_SORT=French_M')
You can create new linguistic indexes based on multilingual linguistic sorts just as you did for monolingual linguistic sorts. However, if you do not want to change every index hint of your SQL statements, but you do need to use those new multilinguistic sorts. Here are the steps:
DROP
INDEX
index_table1
;CREATE
INDEX
index_table1
ON
table1
(col
, 'NLS_SORT=French_M
');COMMIT
;
Oracle evaluates multilingual sorts at three levels of precision:
A primary level sort distinguishes between base characters, such as the difference between characters a
and b
. It is up to individual locales to define if a
is before b
, b
is before a
, or they are equal. The binary representation of the characters is completely irrelevant. If a character is an ignorable character, it is assigned a primary level weight (or order) of zero, which means it is ignored at the primary level. Ignorable characters on all other levels are also defined by the use of weight zero. At the primary level, the following words are not distinguished between uppercase and lowercase words and can appear in different orders.
Bat bat BAT BET Bet bet
A secondary level sort distinguishes between base characters (the primary level sort), plus it distinguishes the different diacritical marks on a given base character. For example, the character Ä
differs from the character A
only because it has a diacritical mark. Thus, Ä
and A
differ on the secondary level but they are the same on the primary level because they are derived from the same base character A
.
SELECT words FROM rdictionary; --with a secondary level sort words ----- resume résumé Résumé Resumes resumes résumés
A tertiary level sort distinguishes between base characters (primary level sort), diacritical marks (secondary level sort), and the different cases of characters. It can also include difference of special characters such as +
, -
, and *
. For example, characters a
and A
are different on the tertiary level and equal on the primary and secondary levels because they only have a case difference. Another example is that characters ä
and A
are equal on the primary level and different on secondary and tertiary levels. The final example is that the primary order for the dash character -
is 0. That is, it is ignored on the primary and secondary levels. If a dash is compared with another character whose primary level order is nonzero, for example, the character u
, then no result for the primary level is available because u
is not compared with anything. In this case, Oracle finds a difference between -
and u
, but only at the tertiary level. For example:
SELECT words FROM rdictionary; --with a tertiary level sort words ----- resume Resume résumé Résumé resumes résumés Resumes Résumés
You should consider the following issues when sorting:
When sorting, some characters can or should be ignored. For example, a dash in multi-lingual
could be treated the same for sorting purposes as multilingual
. These characters are called ignorable characters. There are two kinds of ignorable characters: accents and punctuation.
Examples of ignorable accent characters:
Examples of ignorable punctuation characters:
Sorting elements usually consist of a single character, but, in some locales, two or more characters in a character string must be considered as a single sorting element during sorting. For example, in Spanish, the string ch
is composed of two characters. These characters are called contracting characters or group characters in multilingual linguistic sorting and special combination letters in monolingual linguistic sorting. The important difference is that a composed character can be displayed as a single character on a terminal (if desired), while a contracting character is only used for sorting, and its component characters must be rendered separately. Note that a contracting character is not a Unicode-composed character.
In some locales, one character must be sorted as if it was a character string. An example is the German character ß
(sharp s). It is sorted exactly the same as the string SS
. Another example is that ö
sorts as if it were oe
, after od
and before of
. These characters are known as expanding characters in multilingual linguistic sorting and special letters in monolingual linguistic sorting. Just as with contracting characters, the replacement string for an expanding character is only meaningful for sorting.
In Japanese, a prolonged sound mark (resembles an em dash --
) represents a length mark that lengthens the vowel of the preceding character. Depending on the vowel, the result will sort in a different order. This is called context-sensitive sorting. For example, after the character ka
, the --
length mark indicates a long a
and is treated the same as a
, while after the character ki
, the --
length mark indicates a long i
and is treated the same as i
. Transliterating this to Latin characters, a sort might look like this:
kaa -- kaa and ka-- are the same ka-- kai -- kai follows ka- because i is after a kia kii -- kii and ki-- are the same ki--
One Unicode code point may be equivalent to a sequence of base character code points plus combining characters (accent marks) code points, regardless of locales in use. This is called the Unicode canonical equivalence. For example, ä
equals its base letter a
and a combining character diaeresis. A linguistic flag, Canonical_equivalence=TRUE
, which you can set in the definition file, indicates that all canonical equivalence rules defined in Unicode 3.0 need to be applied. You can change this flag to FALSE
to speed up the comparison and ordering functions if all the data is in its composed form.
You can extend UTF-16 and UTF-8 to encode more than 1 million characters. These extended characters are called surrogate pairs. Multilingual linguistic sorts can support up to one million surrogate pairs. However, surrogate characters cannot be defined as contracting characters, expanding characters, or context-sensitive characters.
In French, the sorting of strings with accented characters compares base characters from left to right, but compares accented characters from right to left. For example, by default, an accented character is placed after its unaccented variant. Then the two strings Èdit
and Edít
are in proper French order. They are equal on the primary level, and the secondary order is determined by examining accented characters from right to left. Individual locales can request that the accented characters and related diacritical marks be sorted with the right-to-left rule. This is specified in a locale specification file by using a linguistic flag.
In Thai and Lao, some characters must first be swapped with their following character before sorting. Normally, these types of character are symbols representing vowel sounds, and the next character will be a consonant. Consonants and vowels must be swapped before sorting. A linguistic flag is used to enable you to specify all the characters to be swapped.
Base letters are defined in a base letter table, which maps each letter to its base letter. For example, a
, A
, ä
, and Ä
all map to a
, which is the base letter. This concept is particularly relevant for working with Oracle9i Text.
Special letters is the term used in monolingual sorts. They are called expanding characters in multilingual sorts.
Special combination letters is the term used in monolingual sorts. They are called contracting letters in multilingual sorts.
One lowercase letter may map to multiple uppercase letters. For example, in traditional German, the uppercase letters for ß
are SS
.
Special cases like these are also handled when converting uppercase characters to lowercase, and vice versa. Such case-conversion issues are handled by the NLS_UPPER
, NLS_LOWER
, and NLS_INITCAP
functions, according to the conventions established by the linguistic sort sequence. (The standard functions UPPER
, LOWER
, and INITCAP
cannot handle these special cases.)
Oracle supports special lowercase letters, so one letter may map to multiple base letters. An example is the Turkish uppercase I
becoming a small, dotless i
.
Linguistic sorting is language-specific and requires more data processing than binary sorting. Binary sorting ASCII is accurate and fast because it is in order. When data of multiple languages is stored in the database, you may want your applications to collate a result set returned from a SELECT
statement using the ORDER
BY
clause with different collating sequences based upon the language being used. You can accomplish this without sacrificing performance by using linguistic indexes, a feature introduced in Oracle8i. While a linguistic index for a column slows down inserts and updates, it greatly improves the performance of linguistic sorting with the ORDER
BY
clause.
You can create a function-based index that uses languages other than English. The index does not change the linguistic sort order determined by NLS_SORT
. The index simply improves the performance. An example is:
CREATE INDEX nls_index ON my_table (NLSSORT(name, 'NLS_SORT = German'));
So
SELECT * FROM my_table WHERE NLSSORT(name) IS NOT NULL ORDER BY name;
returns the result much faster than without an index.
There are three ways to build linguistic indexes for data in multiple languages:
CREATE INDEX french_index ON emp (NLSSORT(emp_name, 'NLS_SORT=FRENCH')); CREATE INDEX german_index ON emp (NLSSORT(emp_name, 'NLS_SORT=GERMAN'));
Which index to use is based on the NLS_SORT
session parameters or the arguments of the NLSSORT
function you specified in the ORDER
BY
clause. For example, if the session variable NLS_SORT
is set to FRENCH
, you can use french_index
and when it is set to GERMAN
, you can use german_index
.
LANG_COL
in the example below) that contains NLS_LANGUAGE
values for the corresponding column on which the index is built as a parameter to the NLSSORT
function. The following example builds a single linguistic index for multiple languages. With this index, the rows with the same values for NLS_LANGUAGE
are collated together.
CREATE INDEX i ON t (NLSSORT(col, 'NLS_SORT=' || LANG_COL));
Which index to use is based on the argument of the NLSSORT
function you specified in the ORDER
BY
clause.
GENERIC_M
or FRENCH_M
. These indexes collate characters according to the rules defined in ISO 14651.
CREATE INDEX i on t (NLSSORT(col, 'NLS_SORT=GENERIC_M');
See Also:
Chapter 4, "Linguistic Sorting" for more information about the different Unicode sorting sequences |
If you want to use a single linguistic index or multiple linguistic indexes, some requirements must be met for the linguistic index. The first requirement is that the QUERY_REWRITE_ENABLED
initialization parameter must be set to TRUE
. This is not a specific requirement for linguistic indexes, but for all function-based indexes. Here is an example of setting QUERY_REWRITE_ENABLED
:
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE;
The second requirement, which is specific to linguistic indexes, is that NLS_COMP
needs to be set to ANSI
. There are various ways to set NLS_COMP
. For example:
ALTER SESSION SET NLS_COMP = ANSI;
The third requirement is that NLS_SORT
needs to indicate the linguistic definition you want to use for the linguistic sort. If you want a French linguistic sort order, NLS_SORT
needs to be set to FRENCH
. If you want a German linguistic sort order, NLS_SORT
needs to beset to GERMAN
.
There are various ways to set NLS_SORT
. Although the following example uses an ALTER
SESSION
statement, it is probably better for you to set NLS_SORT
as a client environment variable so that you can use the same SQL statements for all languages and different linguistic indexes can be used, based on NLS_SORT
being set in the client environment. The following is an example of setting NLS_SORT
:
ALTER SESSION SET NLS_SORT='FRENCH';
The fourth requirement is that you need to use the cost-based optimizer with the optimizer mode set to FIRST_ROWS
, because linguistic indexes are not recognized by the rule-based optimizer. The following is an example of setting the optimizer mode:
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS;
The last thing is that you need to set WHERE
NLSSORT
(column_name
) to IS
NOT
NULL
when you want to use ORDER
BY
column_name
, where column_name
is the column with the linguistic index. This is necessary only when you use an ORDER
BY
clause.
The following example shows how to set up a French linguistic index. For NLS_SORT
, you may want to set it in the client environment variable instead of with the ALTER
SESSION
statement.
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE; ALTER SESSION SET NLS_COMP = ANSI; ALTER SESSION SET NLS_SORT='FRENCH'; ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS; CREATE TABLE test(col VARCHAR(20) NOT NULL); CREATE INDEX test_idx ON test(NLSSORT(col, 'NLS_SORT=FRENCH')); SELECT * FROM test WHERE NLSSORT(col) IS NOT NULL ORDER BY col; SELECT * FROM test WHERE col > 'JJJ';
You can create a function-based index that improves the performance of case-insensitive searches. For example:
CREATE INDEX case_insensitive_ind ON my_table(NLS_UPPER(empname)); SELECT * FROM my_table WHERE NLS_UPPER(empname) = 'KARL';
You can customize sorting with the Locale Builder Utility.
|
Copyright © 1996-2001, Oracle Corporation. All Rights Reserved. |
|