geniconvtbl-cconv - man pages section 5: File Formats

geniconvtbl-cconv (5)

Name

geniconvtbl-cconv - geniconvtbl cconv input file format

Description

A cconv input file to geniconvtbl is an ASCII text file that contains a definition of a cconv code conversion either from UTF-32 to a codeset or from a codeset to UTF-32.

The geniconvtbl utility used with the -c option accepts the code conversion definition file and writes a code conversion binary table file that can be used in cconv(3C), iconv(1), and iconv(3C) to support user-defined code conversions. See cconv(3C), iconv(1), and iconv(3C) for more details on the code conversions and geniconvtbl(1) for more details on the utility.

The Lexical Conventions

The following lexical conventions are used in the geniconvtbl cconv code conversion definition:

HEXADECIMAL

A hexadecimal number. The following four representations of hexadecimal numbers are accepted:

ISO C hexadecimal constant consisting of one or more hexadecimal digits prefixed with 0x or 0X. Examples: 0x0, 0x1, 0x1a, 0X1A, 0x1B3.

ISO C Universal Character Names (UCN) consisting of four hexadecimal digits prefixed with \u or eight hexadecimal digits prefixed with \U. Examples: \u0041, \U0001030c.

ISO C hexadecimal byte representation consisting of the escape sequence \x followed by two hexadecimal digits. Examples: \x41, \xff\xfd. If the value is a file code that is based on bytes, this representation must be used (unless the codeset is a single byte codeset), because the other representations are not byte based and yield different byte ordering depending on the native byte order of the current system.

Unicode representation consisting of the prefix U+ followed by four to six hexadecimal digits. Examples: U+0020, U+8041, U+1030C.

If any representation other than the ISO C hexadecimal byte representation is used in the cconv input file, the utility assumes that the mapping table containing such a representation is for a fixed width, non-file code codeset and prepares the code conversion definition accordingly. For instance, if \x03\x01 and 0x00A1B2 are used together as target codes in mapping definitions, the target codeset portion of the resulting mapping table in the cconv binary table file will be prepared as 32-bit fixed width entities, since the number of hexadecimal digits in 0x00A1B2 is greater than four, which requires a 32-bit data type. The \x03\x01 will then be saved as 0x00000301 regardless of the underlying system's native byte ordering.

DECIMAL

A decimal number, represented by one or more decimal digits. Examples: 0, 123, 2165.

CHARACTER

A printable ASCII character.

While the supported maximum number will vary as noted in the sections below, the maximum number of digits lexically supported in HEXADECIMAL and DECIMAL is 128.
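The four HEXADECIMAL representations described above can be sketched in Python as follows. This is an illustrative sketch only, not the lexer that geniconvtbl actually uses: the function name parse_hexadecimal and the kind labels are assumptions made for this example. The byte representation is read with the first byte as the most significant, matching the \x03\x01 to 0x00000301 example above.

```python
import re

# Hypothetical helper, not part of geniconvtbl: classify a HEXADECIMAL token
# and return (kind, integer value). The kind labels are made up for this sketch.
_PATTERNS = [
    ("c_hex", re.compile(r"0[xX]([0-9A-Fa-f]+)")),
    ("ucn", re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")),
    ("bytes", re.compile(r"(?:\\x[0-9A-Fa-f]{2})+")),
    ("unicode", re.compile(r"U\+([0-9A-Fa-f]{4,6})")),
]

def parse_hexadecimal(token):
    for kind, pattern in _PATTERNS:
        m = pattern.fullmatch(token)
        if m is None:
            continue
        if kind == "bytes":
            # Collect each \xHH pair and read the bytes in written order.
            pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", token)
            raw = bytes(int(h, 16) for h in pairs)
            return kind, int.from_bytes(raw, "big")
        # Exactly one capture group is set for the other representations.
        digits = next(g for g in m.groups() if g is not None)
        return kind, int(digits, 16)
    raise ValueError("not a HEXADECIMAL token: " + token)
```

Only the "bytes" kind is byte based; a tool following the rule above would treat a table containing any other kind as fixed width.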

A comment starts with the character # or the character defined by the COMMENT_CHAR directive and ends at the end of the line.

The following keywords are recognized:

CHARSET_SHIFT_DESIGNATORS       NIL
COMBINING_SEQ                   REPLACEMENT_CHAR
COMMENT_CHAR                    charset
END                             initial
IL                              locking_shift
MAPPING_TABLE                   range
NI                              single_shift

Additionally, the following symbols are also reserved as tokens:

{  }  (  )  , ...

Basic Syntax

A cconv conversion definition consists of a set of character mapping entries optionally divided into mapping tables and, in case of stateful encoding, preceded by a table defining the charset shift designators. Optionally, a table defining input side combining/conjoining sequences may also follow as the last block.

A character mapping entry is a line with two values separated by white space, optionally followed by a comment:

0x20      0x0020         # SPACE

The above example defines the mapping for the SPACE character from an ASCII compatible codeset to UTF-32.

The left value of a character mapping entry is a HEXADECIMAL value of the character in the ASCII compatible codeset.

The right value of a character mapping entry can be a HEXADECIMAL value of the character in UTF-32, the keyword NI with an optional transliteration definition (see "Non-identicals and transliterations" below), the keyword IL (see "Illegal code points"), a variant definition (see "Variants"), or a combining/conjoining sequence definition (see "Combining/conjoining sequences").
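As a rough illustration of the entry layout described above, the following Python sketch splits one line into its left and right values. It is not how geniconvtbl parses its input: the function name split_mapping_entry is hypothetical, and the sketch assumes that the right value contains no embedded white space and that the comment character does not occur inside a value.

```python
def split_mapping_entry(line, comment_char="#"):
    """Return (left, right) for a mapping entry, or None for a blank or
    comment-only line. Hypothetical helper for illustration only."""
    # Drop everything from the comment character to the end of the line.
    body = line.split(comment_char, 1)[0].strip()
    if not body:
        return None
    fields = body.split()
    if len(fields) != 2:
        raise ValueError("expected two values: " + line)
    return fields[0], fields[1]
```

The right value is returned as raw text; deciding whether it is a HEXADECIMAL value, NI, IL, a variant list, or a sequence is a separate step.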

For more details on the mapping tables, charset shift designators, and input side combining/conjoining sequences, refer to the "Multibyte codeset files", "Stateful encoding codeset files", and "Combining/conjoining sequences" sections below.

Input files

Three types of input files are accepted - single byte codeset files, multibyte codeset files and stateful encoding codeset files. Depending on the type of the input file, different tables with supplementary information are allowed.

Single byte codeset files

The input source file for cconv code conversions between a single byte codeset and UTF-32 consists of up to 256 character mapping entries for bytes 0x00 - 0xFF, such as in the following example:

# ISO 8859-1 to UTF-32 definition:
#
0x00      0x0000    # NULL
0x01      0x0001    # START OF HEADING
:
0xFE      0x00FE    # LATIN SMALL LETTER THORN
0xFF      0x00FF    # LATIN SMALL LETTER Y WITH DIAERESIS

where each line has a 1:1 mapping definition for a character from the ISO 8859-1 codeset to the Unicode UTF-32 codeset.

Any code point value in the single byte coding space range that doesn't have an explicit mapping will be treated as an illegal character.
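The rule that unmapped single byte values are illegal can be pictured with the Python sketch below. It only models the rule; the function name build_single_byte_table and the "IL" marker string are assumptions for this example and say nothing about the layout of the real binary table.

```python
def build_single_byte_table(entries):
    """entries: dict mapping a byte value (0x00..0xFF) to a UTF-32 code point.
    Returns a 256-slot list where unmapped bytes are marked "IL"."""
    table = ["IL"] * 256          # every byte starts out as illegal
    for byte, code_point in entries.items():
        if not 0x00 <= byte <= 0xFF:
            raise ValueError("not a single byte value: %#x" % byte)
        table[byte] = code_point  # an explicit mapping overrides the default
    return table
```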

Multibyte codeset files

The input source file for cconv code conversions between a multibyte codeset and UTF-32 consists of one or more blocks of sub-codesets, delimited by MAPPING_TABLE and END MAPPING_TABLE keywords.

Each sub-codeset mapping table has a unique DECIMAL identifier and contains an optional range definition followed by character mapping entries:

# A multibyte codeset to UTF-32 definition
#
MAPPING_TABLE 0
range     0x00...0xff
0x00      0x0000
:
0xff      0xfffd
END MAPPING_TABLE

MAPPING_TABLE 1
range     \x41\x41...\xdd\xd2
\x41\x41  U+3001
:
\xDD\xD2  \uE72C
END MAPPING_TABLE

The "range" definition specifies the valid coding space for a sub-codeset. With the "range" defined, any code point value in the range that doesn't have an explicit mapping will be treated as a non-identical code conversion.

If "range" isn't explicitly defined, geniconvtbl(1) will scan all the entries in the sub-codeset to find out the range-related values and then any code point value in the range that doesn't have an explicit mapping will be treated as an illegal character.
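The difference between an explicit and an implicit range can be sketched in Python as follows. Several things here are assumptions made for the example: the name classify_code_point, the treatment of code points as plain integers, the use of the smallest and largest mapped values as the scanned range, and the "outside range" label, since the text above does not describe values outside the range.

```python
def classify_code_point(code, mappings, declared_range=None):
    """mappings: dict of source code point -> target value.
    declared_range: (low, high) if the table has a "range" line, else None."""
    if code in mappings:
        return "mapped"
    if declared_range is not None:
        low, high = declared_range
        # Explicit range: unmapped values in it are implicit non-identicals.
        return "NI" if low <= code <= high else "outside range"
    # No explicit range: assume the scanned range is min..max of the entries.
    low, high = min(mappings), max(mappings)
    return "IL" if low <= code <= high else "outside range"
```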

The maximum number of mapping tables and also the maximum number of code conversion character mapping entries that an input file can have is 4,294,967,296, i.e., UINT_MAX + 1.

Stateful encoding codeset files

The input source file for cconv code conversions between a stateful encoding codeset and UTF-32 consists of a block defining a set of designators for charsets and shifts followed by one or more blocks of sub-codeset mapping tables.

The block that defines the designators is delimited by CHARSET_SHIFT_DESIGNATORS and END CHARSET_SHIFT_DESIGNATORS keywords and contains optional "charset" designators and "locking_shift" and "single_shift" shift designators. For example:

          CHARSET_SHIFT_DESIGNATORS
          # keyword  sequence      graphic   mapping   initial?
          #                        set id    table id
          charset    \x1b\x28\x42  0         0         initial
          charset    \x1b\x24\x41  1         1
          charset    \x1b\x24\x40  1         2
          charset    \x1b\x24\x30  2         3
          #
          # keyword      sequence  graphic  initial?
          #                        set id
          locking_shift  \x7e\x7b  0        initial
          locking_shift  \x7e\x7d  1
          single_shift   \x1b\x0e  2
          END CHARSET_SHIFT_DESIGNATORS

          MAPPING_TABLE 0
          range    0x00...0x7f
          \x41           0x0041
          :
          END MAPPING_TABLE

          MAPPING_TABLE 1
          range    0x2121...0x7e72
          \x21\x21       U+7E20
          :
          END MAPPING_TABLE

          MAPPING_TABLE 2
          \x21\x21       \uFE23
          :
          END MAPPING_TABLE

Each of the "charset" designators defines a charset (i.e., a codeset) with a sequence of characters identifying the charset, a graphic set id that identifies which graphic set it is assigned to, and the corresponding mapping table id. The optional last field "initial" indicates that the charset is the default initial charset that the code conversion must start with. When no "initial" keyword is specified, the first charset defined by a "charset" designator, counting from the top, is the default initial charset. Defining more than one "initial" charset is an error. When the initial charset doesn't have an identifying sequence, its sequence entry must indicate this with the keyword "NIL".

Each of the "locking_shift" and "single_shift" lines defines a locking shift designator and a single shift designator, respectively. Each has a sequence of characters identifying the shift operation followed by the graphic set id that should be in effect when the shift sequence is found. The last field is an optional keyword "initial" that indicates that the designated shift operation should be the default initial shift state that the code conversion must start with. Defining more than one "initial" shift sequence is an error.
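The selection of the default initial charset can be sketched in Python as follows. The function name pick_initial_charset and the dictionary keys are hypothetical; the sketch only models the selection rule stated above and assumes the list is non-empty and in file order.

```python
def pick_initial_charset(charsets):
    """charsets: list of dicts in file order, each with the keys
    'sequence', 'graphic_set', 'table', and an optional 'initial' flag."""
    marked = [c for c in charsets if c.get("initial")]
    if len(marked) > 1:
        raise ValueError('more than one "initial" charset defined')
    if marked:
        return marked[0]
    # No explicit "initial": the first charset from the top is the default.
    return charsets[0]
```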

It is possible to have no "charset" designators but only shift designators. In this case, the graphic set id column should have mapping table ids.

The maximum number of mapping tables, the maximum number of designators (including both charset and shift designators), the maximum length of a designator sequence, and the maximum number of graphic sets in a code conversion for a stateful encoding are each 256. The maximum number of code conversion character mapping entries that an input file can have is 4,294,967,296, i.e., UINT_MAX + 1.

See the Notes section for more details on stateful encoding codesets and the terminology used.

Non-identicals and transliterations

To explicitly mark a character in a character mapping as a non-identical (i.e. a character with no matching counterpart in the other codeset), the NI keyword is used.

0xa1  NI            # 0xa1

To define a transliteration for a non-identical character, the syntax is as follows:

0xc1  NI(\x41\x27)  # <A with acute> maps to "A'"

For a code conversion from a multibyte or a stateful encoding codeset to UTF-32, when the "range" for a MAPPING_TABLE is explicitly defined, any code point values belonging to the defined range and having no explicit mapping will be treated as implicit non-identicals.

The combined maximum number of transliterations and conjoining/combining sequences supported in a code conversion is 1,048,576. The maximum length of all transliterations and conjoining/combining sequences combined is 4,294,967,296 bytes, i.e., UINT_MAX + 1.

Replacement character

It is possible to explicitly define the replacement character for non-identical conversions by using the "REPLACEMENT_CHAR" keyword, e.g.:

REPLACEMENT_CHAR    \xa1\xa1

When REPLACEMENT_CHAR is specified, geniconvtbl will store the value in the header of the output binary table file instead of the default values.

Without the REPLACEMENT_CHAR definition, the replacement character defaults to '?' (0x3f in ASCII-compatible codesets) if the target codeset is a non-Unicode codeset, or to the Unicode replacement character (U+FFFD) if the target codeset is a Unicode codeset.
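The default selection described above amounts to the following small Python sketch. The function name and parameters are hypothetical and do not correspond to any geniconvtbl interface.

```python
def default_replacement_char(target_is_unicode, explicit=None):
    """Return the replacement character value for non-identical conversions.
    explicit: value given with REPLACEMENT_CHAR, or None if not specified."""
    if explicit is not None:
        return explicit           # REPLACEMENT_CHAR overrides the defaults
    # U+FFFD for Unicode targets, '?' (0x3f) otherwise.
    return 0xFFFD if target_is_unicode else 0x3F
```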

Illegal code points

To explicitly mark a code point value as an illegal code point value in a code conversion, use the IL keyword in the input, as in the following examples:

0xbb      IL
U+D800    IL

For a code conversion from a single byte codeset to UTF-32, the byte values that don't have any explicit mapping definitions are treated as implicit illegals.

For a code conversion from UTF-32 to any codeset, UTF-32 code point values in the Surrogate area will be treated as illegal code point values. Any other code points without an explicit mapping definition will be treated as non-identicals by default.
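For the UTF-32 source direction, the defaults described above can be sketched in Python as follows. The function name classify_utf32_source and the "IL"/"NI" marker strings are assumptions for this example; the Surrogate area is U+D800 through U+DFFF.

```python
def classify_utf32_source(code_point, mappings):
    """mappings: dict of UTF-32 code point -> target value (or "IL"/"NI").
    Returns the target value, "IL", or "NI"."""
    if 0xD800 <= code_point <= 0xDFFF:
        return "IL"               # Surrogate area is always illegal
    if code_point in mappings:
        return mappings[code_point]
    return "NI"                   # anything else unmapped is non-identical
```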

For a code conversion from a multibyte or a stateful encoding codeset to UTF-32, when the "range" for a MAPPING_TABLE is not explicitly defined, the geniconvtbl(1) will scan all the entries in the sub-codeset to find out the range-related values and then any code point values in the range that do not have an explicit mapping will be treated as implicit illegals.

Variants

Within a character mapping entry it is possible to define multiple variants for the output character. In the geniconvtbl cconv input file, variants are separated by a comma.

0xa1      0xa1a1,0xb1b1,,0xfefe
0xa2      0xa1a2

In the above example, the first mapping line defines four variant levels and the input character 0xa1 has three variants defined: 0xa1a1 for level 0, 0xb1b1 for level 1, and 0xfefe for level 3. There is no variant for level 2; in such cases, the value for level 0 is used as the match for both levels 0 and 2.

The second mapping line of the example above defines no variants other than level 0, so 0xa1a2 will be matched for all four levels.

It is an error when there is no level 0 variant specified.
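The fallback to level 0 can be sketched in Python as follows. This is a simplified model, not the utility's parser: the function name resolve_variants is hypothetical, the number of levels is passed in by the caller, and the sketch assumes each variant is a plain HEXADECIMAL value (it does not handle commas inside NI(...) or {...}).

```python
def resolve_variants(rvalue_text, levels):
    """Split a comma-separated variant list and fill every missing level
    with the level 0 value. Returns one value per level."""
    fields = [f.strip() for f in rvalue_text.split(",")]
    if not fields[0]:
        raise ValueError("no level 0 variant specified")
    # Pad short lists so that every level has a slot.
    fields += [""] * (levels - len(fields))
    return [f if f else fields[0] for f in fields]
```

With the two lines from the example above and four levels, the first line yields 0xa1a1 again at level 2 and the second line yields 0xa1a2 at every level.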

The maximum number of variant levels supported in a code conversion is 256. The maximum number of variants supported in a code conversion is 1,048,576.

Combining/conjoining sequences

There are two kinds of combining/conjoining sequences, distinguished by where they appear: on the input buffer side or on the output buffer side.

Input buffer side combining/conjoining sequences are defined in the input source file using an optional block (and as the last block in the input source file) delimited by the COMBINING_SEQ and END COMBINING_SEQ keywords, as shown in the following example for single and multibyte codeset files:

COMBINING_SEQ
{U+00CA,U+0304}         0x8862
{U+0041,U+0301,U+0302}  {0xabcd,0xef01}
{U+007E,U+000A}         NIL
END COMBINING_SEQ

The NIL keyword indicates that the corresponding input side combining sequence should be consumed but without yielding any output.

For stateful encoding codeset files, as shown below, a mapping table id must be appended:

COMBINING_SEQ
{0x7e,0x7e}              0x7e        0
{0x7e,0x0a}              0x0a        2
END COMBINING_SEQ

The mapping table id 0 in the above example means that the mapping to 0x7e happens if and only if the input buffer contains 0x7e followed by another 0x7e and the current mapping table id is 0. With any other current mapping table id, the mapping does not happen.

Output buffer side combining/conjoining sequences are defined in character mapping entries of the input source file, as in the following example:
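The table-dependent matching for stateful encodings can be sketched in Python as follows. The function name match_combining and the rule tuple layout are assumptions for this example; the real conversion routines are not described here.

```python
def match_combining(buf, pos, current_table, rules):
    """rules: list of (input_sequence, output, mapping_table_id).
    Returns (output, number_of_input_units_consumed) or None if no rule
    applies at position pos under the current mapping table id."""
    for seq, output, table_id in rules:
        window = tuple(buf[pos:pos + len(seq)])
        # A rule fires only when both the sequence and the table id match.
        if table_id == current_table and window == tuple(seq):
            return output, len(seq)
    return None
```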

0xc0  {\x00\x00\x00\x41,\x00\x00\x03\x00} # A with grave
0xc1  {U+0041,U+0301}                     # A with acute

Syntax definition

The syntax of the geniconvtbl cconv input source file for code conversion definition in extended BNF is illustrated below:

     cconv_input_file
          : codeset
          | char_definitions codeset
          ;

     char_definitions
          : comment_char_definition
          | replacement_char_definition
          | comment_char_definition replacement_char_definition
          ;

     comment_char_definition
          : 'COMMENT_CHAR' CHARACTER
          ;

     replacement_char_definition
          : 'REPLACEMENT_CHAR' HEXADECIMAL
          ;

     codeset
          : single_byte_codeset
          | multi_byte_codeset
          | stateful_encoding_codeset
          ;

     single_byte_codeset
          : map_list
          | map_list combining_seq
          ;

     multi_byte_codeset
          : mapping_table_list
          | mapping_table_list combining_seq
          ;

     stateful_encoding_codeset
          : charset_shift_designators mapping_table_list
          | charset_shift_designators mapping_table_list stateful_combining_seq
          ;

     map_list
          : map_pair
          | map_list map_pair
          ;

     mapping_table_list
          : mapping_table
          | mapping_table_list mapping_table
          ;

     mapping_table
          : 'MAPPING_TABLE' mapping_table_id map 'END' 'MAPPING_TABLE'
          ;

     map
          : map_list
          | 'range' range_pair map_list
          ;

     range_pair
          : HEXADECIMAL '...' HEXADECIMAL
          ;

     map_pair
          : HEXADECIMAL rvalue
          | HEXADECIMAL variant_list
          ;

     rvalue
          : HEXADECIMAL
          | 'IL'
          | 'NI'
          | transliteration
          | combining_conjoining
          ;

     variant_list
          : rvalue ',' rvalue_omit_list
          ;

     rvalue_omit_list
          : rvalue ',' rvalue_omit_list
          | ',' rvalue_omit_list
          | rvalue
          |
          ;

     hexadecimal_list
          : HEXADECIMAL
          | hexadecimal_list ',' HEXADECIMAL
          ;

     transliteration
          : 'NI' '(' hexadecimal_list ')'
          ;

     combining_conjoining
          : '{' hexadecimal_list '}'
          ;

     charset_shift_designators
          : 'CHARSET_SHIFT_DESIGNATORS' designator_list 'END' 'CHARSET_SHIFT_DESIGNATORS'
          ;

     designator_list
          : designator
          | designator_list designator
          ;

     designator
          : charset
          | locking_shift
          | single_shift
          ;

     charset
          : 'charset' charset_id graphic_set_id mapping_table_id 'initial'
          | 'charset' 'NIL' graphic_set_id mapping_table_id 'initial'
          | 'charset' charset_id graphic_set_id mapping_table_id
          ;

     locking_shift
          : 'locking_shift' charset_id graphic_set_id 'initial'
          | 'locking_shift' charset_id graphic_set_id
          ;

     single_shift
          : 'single_shift' charset_id graphic_set_id
          ;

     mapping_table_id
          : DECIMAL
          ;

     charset_id
          : HEXADECIMAL
          ;

     graphic_set_id
          : DECIMAL
          ;

     combining_seq
          : 'COMBINING_SEQ' combining_seq_input_list 'END' 'COMBINING_SEQ'
          ;

     combining_seq_input_list
          : combining_seq_input
          | combining_seq_input_list combining_seq_input
          ;

     combining_seq_input
          : '{' hexadecimal_list '}' HEXADECIMAL
          | '{' hexadecimal_list '}' '{' hexadecimal_list '}'
          | '{' hexadecimal_list '}' 'NIL'
          ;

     stateful_combining_seq
          : 'COMBINING_SEQ' stateful_combining_seq_input_list 'END' 'COMBINING_SEQ'
          ;

     stateful_combining_seq_input_list
          : stateful_combining_seq_input
          | stateful_combining_seq_input_list stateful_combining_seq_input
          ;

     stateful_combining_seq_input
          : combining_seq_input mapping_table_id
          ;

Files

/usr/bin/geniconvtbl: the utility geniconvtbl
/usr/lib/iconv/*.bt: conversion binary tables

Notes

A stateful encoding codeset is a codeset that has states, so the same code point value can mean different characters depending on the current state. The state is usually changed by charset designators, single shift designators, locking shift designators, or any combination of them. A charset designator designates a charset into one of the graphic sets the codeset has. A single shift designator indicates that the immediately following character, and only that character, belongs to a specific graphic set. A locking shift designator indicates that all following characters belong to a specific graphic set until another shift designator is encountered.

The most common forms of stateful encoding codesets are based on the ISO/IEC 2022 codeset extension mechanism, such as ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR. There are also stateful encoding codesets, such as HZ-GB-2312 and N-byte Hangul, that are based not on ISO/IEC 2022 but on other mechanisms.
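The difference between single and locking shifts can be pictured with the Python sketch below. It is a toy model under stated assumptions: the function name assign_graphic_sets and the event tuples are hypothetical, and the sketch assumes that the locked graphic set resumes after a single-shifted character, as is usual in ISO/IEC 2022 style encodings.

```python
def assign_graphic_sets(events, initial_set=0):
    """events: list of ("LS", g), ("SS", g), or ("CHAR", value).
    Returns a list of (value, graphic_set) for each character."""
    locked = initial_set   # graphic set selected by the last locking shift
    single = None          # pending single shift, applies to one character
    out = []
    for kind, arg in events:
        if kind == "LS":
            locked = arg
        elif kind == "SS":
            single = arg
        elif kind == "CHAR":
            out.append((arg, locked if single is None else single))
            single = None  # a single shift covers only the next character
        else:
            raise ValueError("unknown event: %r" % (kind,))
    return out
```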
