4
Linguistic Sorting

This chapter explains how characters are sorted in an Oracle environment. Its sections are:

Overview of Oracle's Sorting Capabilities

Different languages have different sort orders. What's more, different cultures or countries using the same alphabets may sort words differently. For example, in Danish, the letter Æ is after Z, while Y and Ü are considered to be variants of the same letter. Sort order can be case sensitive or insensitive and can ignore accents or not. It can also be either phonetic or based on the appearance of the character, such as ordering by the number of strokes or by radicals for East Asian ideographs. Another common sorting issue is when letters are combined. For example, in traditional Spanish, ch is a distinct character, which means that the correct order is: cerveza, colorado, cheremoya, and so on. This means that the letter c cannot be sorted until checking to see if the next letter is an h.

Oracle provides several different types of sort, and can achieve a linguistically correct sort as well as the new multilingual ISO standard (14651), which is designed to handle many languages at the same time.

Using Binary Sorts

Conventionally, when character data is stored, the sort sequence is based on the numeric values of the characters defined by the character encoding scheme. This is called a binary sort. Binary sorts are the fastest type of sort, and produce reasonable results for the English alphabet because the ASCII and EBCDIC standards define the letters A to Z in ascending numeric value. Note, however, that in the ASCII standard, all uppercase letters appear before any lowercase letters. In the EBCDIC standard, the opposite is true: all lowercase letters appear before any uppercase letters.

When characters used in other languages are present, a binary sort generally does not produce reasonable results. For example, an ascending ORDER BY query would return the character strings ABC, ABZ, BCD, ÄBC, in the sequence, when the Ä has a higher numeric value than B in the character encoding scheme. For languages using Chinese characters, a binary sort is not usually linguistically meaningful.

Using Linguistic Sorts

To produce a sort sequence that matches the alphabetic sequence of characters, another sort technique must be used that sorts characters independently of their numeric values in the character encoding scheme. This technique is called a linguistic sort. A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order.

Oracle offers two kinds of linguistic sorts:

Monolingual Linguistic Sorts, commonly used for European languages
Multilingual Linguistic Sorts, commonly used for Asian languages

Monolingual Linguistic Sorts

Oracle makes two passes when comparing strings in monolingual sorts. The first pass is to compare the major value of entire string from the major table and the second pass is to compare the minor value from the minor table. Each major table entry contains the Unicode codepoint and major value. Usually, letters with the same appearance will have the same major value. Oracle defines letters with diacritic and case differences for the same major value but different minor values. Table 4-1 illustrates sample values for sorting a, A, ä, and Ä.

Table 4-1 Sample Glyphs and Their Major and Minor Codepoint Values

Glyph	Major Value	Minor Value
a	15	5
A	15	10
ä	15	15
Ä	15	20

Multilingual Linguistic Sorts

Oracle9i provides multilingual linguistic sorts so that you can sort more than one language as part of one sort. This is useful for certain regions or languages that have complex sorting rules or global multilingual databases. Additionally, Oracle9i still supports all the sort orders defined by previous releases.

For Asian language data or multilingual data, Oracle provides a sorting mechanism based on an ISO standard (ISO14651) and the Unicode 3.0 standard. Multilingual linguistic sorts also work for Asian language sorts ordered by the number of strokes, PinYin, or radicals. In addition, they can handle canonical equivalence and surrogate codepoint pairs. You can define up to 1.1 million codepoints in one sort.

For example, in Oracle9i, a French sort is supported, but the new multilingual linguistic sort for French can also be applied by changing the sort order from French to French_M, where _M represents the new ISO standard (ISO 14651) for multilingual sorting. By doing so, the sorting order will be based on the GENERIC_M sorting order (ISO standard) and will be able to sort at the secondary level from right to left. Oracle Corporation recommends using a multilingual linguistic sort if the tables contain multilingual data. If the tables contain only pure French, for memory usage concern, a French monolingual sort may get better performance. You have to make a trade-off between extensibility and performance.

To use a multilingual linguistic sort, you can specify the default sort order by setting the environment variable NLS_SORT or using the NLSSORT function.

% setenv NLS_SORT='French_M'

NLSSORT('','NLS_SORT=French_M')

You can create new linguistic indexes based on multilingual linguistic sorts just as you did for monolingual linguistic sorts. However, if you do not want to change every index hint of your SQL statements, but you do need to use those new multilinguistic sorts. Here are the steps:

DROP INDEX index_table1;
CREATE INDEX index_table1 ON table1(col, 'NLS_SORT=French_M');
COMMIT;

Multilingual Sorting Levels

Oracle evaluates multilingual sorts at three levels of precision:

Primary Level Sorts

A primary level sort distinguishes between base characters, such as the difference between characters a and b. It is up to individual locales to define if a is before b, b is before a, or they are equal. The binary representation of the characters is completely irrelevant. If a character is an ignorable character, it is assigned a primary level weight (or order) of zero, which means it is ignored at the primary level. Ignorable characters on all other levels are also defined by the use of weight zero. At the primary level, the following words are not distinguished between uppercase and lowercase words and can appear in different orders.

Bat
bat
BAT
BET
Bet
bet

Secondary Level Sorts

A secondary level sort distinguishes between base characters (the primary level sort), plus it distinguishes the different diacritical marks on a given base character. For example, the character Ä differs from the character A only because it has a diacritical mark. Thus, Ä and A differ on the secondary level but they are the same on the primary level because they are derived from the same base character A.

SELECT words FROM rdictionary; --with a secondary level sort
words
-----
resume
résumé
Résumé
Resumes
resumes
résumés

Tertiary Level Sorts

A tertiary level sort distinguishes between base characters (primary level sort), diacritical marks (secondary level sort), and the different cases of characters. It can also include difference of special characters such as +, -, and *. For example, characters a and A are different on the tertiary level and equal on the primary and secondary levels because they only have a case difference. Another example is that characters ä and A are equal on the primary level and different on secondary and tertiary levels. The final example is that the primary order for the dash character - is 0. That is, it is ignored on the primary and secondary levels. If a dash is compared with another character whose primary level order is nonzero, for example, the character u, then no result for the primary level is available because u is not compared with anything. In this case, Oracle finds a difference between - and u, but only at the tertiary level. For example:

SELECT words FROM rdictionary; --with a tertiary level sort
words
-----
resume
Resume
résumé
Résumé
resumes
résumés
Resumes
Résumés

General Linguistic Sorting Information

You should consider the following issues when sorting:

Ignorable Characters

When sorting, some characters can or should be ignored. For example, a dash in multi-lingual could be treated the same for sorting purposes as multilingual. These characters are called ignorable characters. There are two kinds of ignorable characters: accents and punctuation.

Examples of ignorable accent characters:

rôle can be treated as role
naïve can be treated as naive

Examples of ignorable punctuation characters:

multi-lingual can be treated as multilingual
e-mail can be treated as email

Contracting Characters

Sorting elements usually consist of a single character, but, in some locales, two or more characters in a character string must be considered as a single sorting element during sorting. For example, in Spanish, the string ch is composed of two characters. These characters are called contracting characters or group characters in multilingual linguistic sorting and special combination letters in monolingual linguistic sorting. The important difference is that a composed character can be displayed as a single character on a terminal (if desired), while a contracting character is only used for sorting, and its component characters must be rendered separately. Note that a contracting character is not a Unicode-composed character.

Expanding Characters

In some locales, one character must be sorted as if it was a character string. An example is the German character ß (sharp s). It is sorted exactly the same as the string SS. Another example is that ö sorts as if it were oe, after od and before of. These characters are known as expanding characters in multilingual linguistic sorting and special letters in monolingual linguistic sorting. Just as with contracting characters, the replacement string for an expanding character is only meaningful for sorting.

Context-Sensitive Characters

In Japanese, a prolonged sound mark (resembles an em dash --) represents a length mark that lengthens the vowel of the preceding character. Depending on the vowel, the result will sort in a different order. This is called context-sensitive sorting. For example, after the character ka, the -- length mark indicates a long a and is treated the same as a, while after the character ki, the -- length mark indicates a long i and is treated the same as i. Transliterating this to Latin characters, a sort might look like this:

kaa   -- kaa and ka-- are the same
ka--
kai   -- kai follows ka- because i is after a
kia
kii   -- kii and ki-- are the same
ki--

Canonical Equivalence

One Unicode code point may be equivalent to a sequence of base character code points plus combining characters (accent marks) code points, regardless of locales in use. This is called the Unicode canonical equivalence. For example, ä equals its base letter a and a combining character diaeresis. A linguistic flag, Canonical_equivalence=TRUE, which you can set in the definition file, indicates that all canonical equivalence rules defined in Unicode 3.0 need to be applied. You can change this flag to FALSE to speed up the comparison and ordering functions if all the data is in its composed form.

Surrogate Characters

You can extend UTF-16 and UTF-8 to encode more than 1 million characters. These extended characters are called surrogate pairs. Multilingual linguistic sorts can support up to one million surrogate pairs. However, surrogate characters cannot be defined as contracting characters, expanding characters, or context-sensitive characters.

Reverse Secondary Sorting

In French, the sorting of strings with accented characters compares base characters from left to right, but compares accented characters from right to left. For example, by default, an accented character is placed after its unaccented variant. Then the two strings Èdit and Edít are in proper French order. They are equal on the primary level, and the secondary order is determined by examining accented characters from right to left. Individual locales can request that the accented characters and related diacritical marks be sorted with the right-to-left rule. This is specified in a locale specification file by using a linguistic flag.

See Also:

Chapter 11, "Oracle Locale Builder Utility" for more information about customizing a sort

Character Rearrangement for Thai/Lao Characters

In Thai and Lao, some characters must first be swapped with their following character before sorting. Normally, these types of character are symbols representing vowel sounds, and the next character will be a consonant. Consonants and vowels must be swapped before sorting. A linguistic flag is used to enable you to specify all the characters to be swapped.

Base Letters

Base letters are defined in a base letter table, which maps each letter to its base letter. For example, a, A, ä, and Ä all map to a, which is the base letter. This concept is particularly relevant for working with Oracle9i Text.

See Also:

Oracle Text Reference

Special Letters

Special letters is the term used in monolingual sorts. They are called expanding characters in multilingual sorts.

See Also:

"Expanding Characters"

Special Combination Letters

Special combination letters is the term used in monolingual sorts. They are called contracting letters in multilingual sorts.

See Also:

"Contracting Characters"

Special Uppercase Letters

One lowercase letter may map to multiple uppercase letters. For example, in traditional German, the uppercase letters for ß are SS.

Special cases like these are also handled when converting uppercase characters to lowercase, and vice versa. Such case-conversion issues are handled by the NLS_UPPER, NLS_LOWER, and NLS_INITCAP functions, according to the conventions established by the linguistic sort sequence. (The standard functions UPPER, LOWER, and INITCAP cannot handle these special cases.)

Special Lowercase Letters

Oracle supports special lowercase letters, so one letter may map to multiple base letters. An example is the Turkish uppercase I becoming a small, dotless i.

Using Linguistic Indexes

Linguistic sorting is language-specific and requires more data processing than binary sorting. Binary sorting ASCII is accurate and fast because it is in order. When data of multiple languages is stored in the database, you may want your applications to collate a result set returned from a SELECT statement using the ORDER BY clause with different collating sequences based upon the language being used. You can accomplish this without sacrificing performance by using linguistic indexes, a feature introduced in Oracle8i. While a linguistic index for a column slows down inserts and updates, it greatly improves the performance of linguistic sorting with the ORDER BY clause.

You can create a function-based index that uses languages other than English. The index does not change the linguistic sort order determined by NLS_SORT. The index simply improves the performance. An example is:

CREATE INDEX nls_index ON my_table (NLSSORT(name, 'NLS_SORT = German'));

SELECT * FROM my_table WHERE NLSSORT(name) IS NOT NULL
  ORDER BY name;

returns the result much faster than without an index.

See Also:

Oracle9i Database Concepts for more information about function-based indexes

Linguistic Indexes for Multiple Languages

There are three ways to build linguistic indexes for data in multiple languages:

Build a linguistic index for each language that the application needs to support. This approach offers simplicity but requires more disk space. For each index, the rows in the language other than the one on which the index is built are collated together at the end of the sequence. The following example builds linguistic indexes for French and German.
```
CREATE INDEX french_index ON emp (NLSSORT(emp_name, 'NLS_SORT=FRENCH'));
CREATE INDEX german_index ON emp (NLSSORT(emp_name, 'NLS_SORT=GERMAN'));
```
Which index to use is based on the NLS_SORT session parameters or the arguments of the NLSSORT function you specified in the ORDER BY clause. For example, if the session variable NLS_SORT is set to FRENCH, you can use french_index and when it is set to GERMAN, you can use german_index.
Build a single linguistic index for all languages. This can be accomplished by including a language column (LANG_COL in the example below) that contains NLS_LANGUAGE values for the corresponding column on which the index is built as a parameter to the NLSSORT function. The following example builds a single linguistic index for multiple languages. With this index, the rows with the same values for NLS_LANGUAGE are collated together.
```
CREATE INDEX i ON t (NLSSORT(col, 'NLS_SORT=' || LANG_COL));
```
Which index to use is based on the argument of the NLSSORT function you specified in the ORDER BY clause.
Build a single linguistic index for all languages using one of the sorting sequences such as GENERIC_M or FRENCH_M. These indexes collate characters according to the rules defined in ISO 14651.
```
CREATE INDEX i on t (NLSSORT(col, 'NLS_SORT=GENERIC_M');
```
See Also:
Chapter 4, "Linguistic Sorting" for more information about the different Unicode sorting sequences

Requirements for Linguistic Indexes

If you want to use a single linguistic index or multiple linguistic indexes, some requirements must be met for the linguistic index. The first requirement is that the QUERY_REWRITE_ENABLED initialization parameter must be set to TRUE. This is not a specific requirement for linguistic indexes, but for all function-based indexes. Here is an example of setting QUERY_REWRITE_ENABLED:

ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE;

The second requirement, which is specific to linguistic indexes, is that NLS_COMP needs to be set to ANSI. There are various ways to set NLS_COMP. For example:

ALTER SESSION SET NLS_COMP = ANSI;

The third requirement is that NLS_SORT needs to indicate the linguistic definition you want to use for the linguistic sort. If you want a French linguistic sort order, NLS_SORT needs to be set to FRENCH. If you want a German linguistic sort order, NLS_SORT needs to beset to GERMAN.

There are various ways to set NLS_SORT. Although the following example uses an ALTER SESSION statement, it is probably better for you to set NLS_SORT as a client environment variable so that you can use the same SQL statements for all languages and different linguistic indexes can be used, based on NLS_SORT being set in the client environment. The following is an example of setting NLS_SORT:

ALTER SESSION SET NLS_SORT='FRENCH';

The fourth requirement is that you need to use the cost-based optimizer with the optimizer mode set to FIRST_ROWS, because linguistic indexes are not recognized by the rule-based optimizer. The following is an example of setting the optimizer mode:

ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS;

The last thing is that you need to set WHERE NLSSORT(column_name) to IS NOT NULL when you want to use ORDER BY column_name, where column_name is the column with the linguistic index. This is necessary only when you use an ORDER BY clause.

The following example shows how to set up a French linguistic index. For NLS_SORT, you may want to set it in the client environment variable instead of with the ALTER SESSION statement.

ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE;
ALTER SESSION SET NLS_COMP = ANSI;
ALTER SESSION SET NLS_SORT='FRENCH';
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS;
CREATE TABLE test(col VARCHAR(20) NOT NULL);
CREATE INDEX test_idx ON test(NLSSORT(col, 'NLS_SORT=FRENCH'));
SELECT * FROM test WHERE NLSSORT(col) IS NOT NULL ORDER BY col;
SELECT * FROM test WHERE col > 'JJJ';

See Also:

Oracle9i Database Concepts for more information about function-based indexes

Case-Insensitive Searching

You can create a function-based index that improves the performance of case-insensitive searches. For example:

CREATE INDEX case_insensitive_ind ON my_table(NLS_UPPER(empname));
SELECT * FROM my_table WHERE NLS_UPPER(empname) = 'KARL';

Customizing Linguistic Sorts

You can customize sorting with the Locale Builder Utility.

4 Linguistic Sorting

Overview of Oracle's Sorting Capabilities

Using Binary Sorts

Using Linguistic Sorts

Monolingual Linguistic Sorts

Table 4-1 Sample Glyphs and Their Major and Minor Codepoint Values

Multilingual Linguistic Sorts

Multilingual Sorting Levels

Primary Level Sorts

Secondary Level Sorts

Tertiary Level Sorts

General Linguistic Sorting Information

Ignorable Characters

Contracting Characters

Expanding Characters

Context-Sensitive Characters

Canonical Equivalence

Surrogate Characters

Reverse Secondary Sorting

Character Rearrangement for Thai/Lao Characters

Base Letters

Special Letters

Special Combination Letters

Special Uppercase Letters

Special Lowercase Letters

Using Linguistic Indexes

Linguistic Indexes for Multiple Languages

Requirements for Linguistic Indexes

Case-Insensitive Searching

Customizing Linguistic Sorts

4
Linguistic Sorting