10.1.13.1 Unicode Character Sets

MySQL 5.0 has two Unicode character sets:

You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set and describes their differentiating properties. For general information about the character sets, see Section 10.1.10, “Unicode Support”.

A similar set of collations is available for each Unicode character set. These are shown in the following list, where xxx represents the character set name. For example, xxx_danish_ci represents the Danish collations, the specific names of which are ucs2_danish_ci and utf8_danish_ci.

The xxx_esperanto_ci collations were added in MySQL 5.0.13. The xxx_hungarian_ci collations were added in MySQL 5.0.19.

MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. Currently, the xxx_unicode_ci collations have only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo. A combined character will be considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by the CHAR_LENGTH() function or in result set metadata).

MySQL implements language-specific Unicode collations only if the ordering with xxx_unicode_ci does not work well for a language. Language-specific collations are UCA-based. They are derived from xxx_unicode_ci with additional language tailoring rules.

For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages ß is equal to ss. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 10.1.7.8, “Examples of the Effect of Collation”):

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):

ß = ss

MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German dictionary order and French, so there is no need to create special utf8 collations.

utf8_general_ci also is satisfactory for both German and French, except that ß is equal to s, and not to ss. If this is acceptable for your application, you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

xxx_swedish_ci includes Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

The xxx_spanish_ci and xxx_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, ñ (n-tilde) is a separate letter between n and o. In addition, for traditional Spanish, ch is a separate letter between c and d, and ll is a separate letter between l and m

The xxx_spanish2_ci collations may also be used for Asturian and Galician.

The xxx_danich_ci collations may also be used for Norwegian.

In the xxx_roman_ci collations, I and J compare as equal, and U and V compare as equal.

For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).