10.3 Adding a Character Set

10.3.1 Character Definition Arrays
10.3.2 String Collating Support for Complex Character Sets
10.3.3 Multi-Byte Character Support for Complex Character Sets

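This section discusses the procedure for adding a character set to MySQL. The proper procedure depends on whether the character set is simple or complex:

  • If the character set does not need special string collating routines for sorting and does not need multibyte character support, it is simple.

  • If the character set needs either of those features, it is complex.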

For example, greek and swe7 are simple character sets, whereas big5 and czech are complex character sets.

To use the following instructions, you must have a MySQL source distribution. In the instructions, MYSET represents the name of the character set that you want to add.

  1. Add a <charset> element for MYSET to the sql/share/charsets/Index.xml file. Use the existing contents in the file as a guide to adding new contents. A partial listing for the latin1 <charset> element follows:

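    <charset name="latin1">
      <description>cp1252 West European</description>
      ...
      <collation name="latin1_swedish_ci" id="8" order="Finnish, Swedish">
        <flag>primary</flag>
        <flag>compiled</flag>
      </collation>
      <collation name="latin1_danish_ci" id="15" order="Danish"/>
      ...
      <collation name="latin1_bin" id="47" order="Binary">
        <flag>binary</flag>
        <flag>compiled</flag>
      </collation>
      ...
    </charset>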

    The <charset> element must list all the collations for the character set. These must include at least a binary collation and a default (primary) collation. The default collation is often named using a suffix of general_ci (general, case insensitive). It is possible for the binary collation to be the default collation, but usually they are different. The default collation should have a primary flag. The binary collation should have a binary flag.

    You must assign a unique ID number to each collation, chosen from the range 1 to 254. To find the maximum of the currently used collation IDs, use this query:

    SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;

  2. This step depends on whether you are adding a simple or complex character set. A simple character set requires only a configuration file, whereas a complex character set requires a C source file that defines collation functions, multibyte functions, or both.

    For a simple character set, create a configuration file, MYSET.xml, that describes the character set properties. Create this file in the sql/share/charsets directory. You can use a copy of latin1.xml as the basis for this file. The syntax for the file is very simple:

    • Comments are written as ordinary XML comments (<!-- text -->).

    • Words within <map> array elements are separated by arbitrary amounts of whitespace.

    • Each word within <map> array elements must be a number in hexadecimal format.

    • The <map> array element for the <ctype> element has 257 words. The other <map> array elements after that have 256 words. See Section 10.3.1, “Character Definition Arrays”.

    • For each collation listed in the <charset> element for the character set in Index.xml, MYSET.xml must contain a <collation> element that defines the character ordering.
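    The syntax rules above for <map> array elements can be checked mechanically before rebuilding. The following is a minimal sketch, not part of the MySQL sources: the helper names count_hex_words and map_is_well_formed are hypothetical, and the caller is assumed to have already extracted the text between <map> and </map>.

```c
#include <ctype.h>

/* Hypothetical helper (not a MySQL function): count the
   whitespace-separated words in the body of a <map> element.
   Returns the word count, or -1 if any word contains a character
   that is not a hexadecimal digit. */
static int count_hex_words(const char *text)
{
    const unsigned char *p = (const unsigned char *)text;
    int count = 0;

    while (*p != '\0') {
        /* Words are separated by arbitrary amounts of whitespace. */
        while (isspace(*p))
            p++;
        if (*p == '\0')
            break;
        /* Each word must be a number in hexadecimal format. */
        while (*p != '\0' && !isspace(*p)) {
            if (!isxdigit(*p))
                return -1;
            p++;
        }
        count++;
    }
    return count;
}

/* Hypothetical helper: the <map> for <ctype> has 257 words,
   every other <map> array has 256 words. */
static int map_is_well_formed(const char *text, int is_ctype)
{
    return count_hex_words(text) == (is_ctype ? 257 : 256);
}
```

    For example, count_hex_words("00 20  41") returns 3, and a body containing a word such as 2G is rejected with -1. Such a check catches a miscounted or mistyped array early, before the reconfigure and recompile step.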

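    For a complex character set, create a C source file that describes the character set properties and defines the support routines necessary to properly perform operations on the character set. The later steps refer to this file as ctype-MYSET.c in the strings directory. If the character set needs string collating functions, see Section 10.3.2, “String Collating Support for Complex Character Sets”. If it needs multibyte character support, see Section 10.3.3, “Multi-Byte Character Support for Complex Character Sets”.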

  3. Modify the configuration information. Use the existing configuration information as a guide to adding information for MYSET. The example here assumes that the character set has default and binary collations; more lines are needed if MYSET has additional collations.

    1. Edit mysys/charset-def.c, and register the collations for the new character set.

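      Add these lines to the declaration section:

      #ifdef HAVE_CHARSET_MYSET
      extern CHARSET_INFO my_charset_MYSET_general_ci;
      extern CHARSET_INFO my_charset_MYSET_bin;
      #endif

      Add these lines to the registration section:

      #ifdef HAVE_CHARSET_MYSET
        add_compiled_collation(&my_charset_MYSET_general_ci);
        add_compiled_collation(&my_charset_MYSET_bin);
      #endif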

    2. If the character set uses ctype-MYSET.c, edit strings/Makefile.am and add ctype-MYSET.c to each definition of the CSRCS variable, and to the EXTRA_DIST variable.

    3. If the character set uses ctype-MYSET.c, edit libmysql/Makefile.shared and add ctype-MYSET.lo to the mystringsobjects definition.

    4. Edit config/ac-macros/character_sets.m4:

      1. Add MYSET to one of the define(CHARSETS_AVAILABLE...) lines in alphabetic order.

      2. Add MYSET to CHARSETS_COMPLEX. This is needed even for simple character sets, or configure will not recognize --with-charset=MYSET.

      3. Add MYSET to the first case control structure. Omit the USE_MB and USE_MB_IDENT lines for 8-bit character sets.

          MYSET)
            AC_DEFINE(HAVE_CHARSET_MYSET, 1, [Define to enable charset MYSET])
            AC_DEFINE([USE_MB], 1, [Use multi-byte character routines])
            AC_DEFINE(USE_MB_IDENT, 1)
            ;;

      4. Add MYSET to the second case control structure:

          MYSET)
            default_charset_collations="MYSET_general_ci MYSET_bin"
            ;;

  4. Reconfigure, recompile, and test.