geniconvtbl-cconv - geniconvtbl cconv input file format
A cconv input file to geniconvtbl is an ASCII text file that contains a definition of a cconv code conversion either from UTF-32 to a codeset or from a codeset to UTF-32.
The geniconvtbl utility used with the -c option accepts the code conversion definition file and writes a code conversion binary table file that can be used in cconv(3C), iconv(1), and iconv(3C) to support user-defined code conversions. See cconv(3C), iconv(1), and iconv(3C) for more details on the code conversions and geniconvtbl(1) for more details on the utility.
The following lexical conventions are used in the geniconvtbl cconv code conversion definition:
HEXADECIMAL
A hexadecimal number. The following four representations of hexadecimal numbers are accepted:
ISO C hexadecimal constant consisting of one or more hexadecimal digits prefixed with 0x or 0X. Examples: 0x0, 0x1, 0x1a, 0X1A, 0x1B3.
ISO C Universal Character Names (UCN) consisting of four hexadecimal digits prefixed with \u or eight hexadecimal digits prefixed with \U. Examples: \u0041, \U0001030c.
ISO C hexadecimal byte representation consisting of one or more \x escape sequences, each followed by two hexadecimal digits. Examples: \x41, \xff\xfd. A byte-based file code must use this representation (unless it belongs to a single byte codeset), because the other representations are not byte based and yield different byte ordering depending on the natural byte order of the current system.
Unicode representation consisting of the prefix U+ followed by four to six hexadecimal digits. Examples: U+0020, U+8041, U+1030C.
If any representation other than the ISO C hexadecimal byte representation is used in the cconv input file, the utility assumes that the mapping table containing that representation is for a fixed width, non-file code codeset and prepares the code conversion definition accordingly. For instance, if \x03\x01 and 0x00A1B2 are used together as target codes in mapping definitions, the target codeset portion of the resulting mapping table in the cconv binary table file is prepared as 32-bit fixed width entities, because 0x00A1B2 has more than four hexadecimal digits and therefore requires a 32-bit data type entity. In that case, \x03\x01 is saved as 0x00000301 regardless of the underlying system's native byte ordering.
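The four HEXADECIMAL representations described above can be illustrated with a minimal Python sketch. The function name `parse_hexadecimal` and the returned kind labels are hypothetical and chosen only for illustration; this is not the lexer used by geniconvtbl. Byte sequences are read most significant byte first, which matches the \x03\x01 to 0x00000301 example above.

```python
import re

# Hypothetical patterns, one per representation described in the text.
C_HEX = re.compile(r"0[xX]([0-9A-Fa-f]+)")
UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")
BYTE_SEQ = re.compile(r"(?:\\x[0-9A-Fa-f]{2})+")
UNICODE = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def parse_hexadecimal(token):
    """Return (kind, value) for one HEXADECIMAL token, or raise ValueError."""
    m = C_HEX.fullmatch(token)
    if m:
        return "c_hex", int(m.group(1), 16)
    m = UCN.fullmatch(token)
    if m:
        # Exactly one of the two groups is set, depending on \u or \U.
        return "ucn", int(m.group(1) or m.group(2), 16)
    if BYTE_SEQ.fullmatch(token):
        # Bytes are listed most significant first, so \x03\x01 gives 0x0301.
        return "bytes", int(token.replace("\\x", ""), 16)
    m = UNICODE.fullmatch(token)
    if m:
        return "unicode", int(m.group(1), 16)
    raise ValueError("not a HEXADECIMAL token: " + token)
```

Only the "bytes" kind is byte based; a caller could use the returned kind to decide whether a table has to be treated as a fixed width, non-file code table.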
DECIMAL
A decimal number, represented by one or more decimal digits. Examples: 0, 123, 2165.
CHARACTER
A printable ASCII character.
While the supported maximum number will vary as noted in the sections below, the maximum number of digits lexically supported in HEXADECIMAL and DECIMAL is 128.
A comment starts with the character # or the character defined by the COMMENT_CHAR directive and ends at the end of the line.
The following keywords are recognized:
CHARSET_SHIFT_DESIGNATORS NIL COMBINING_SEQ REPLACEMENT_CHAR COMMENT_CHAR charset END initial IL locking_shift MAPPING_TABLE range NI single_shift
Additionally, the following symbols are also reserved as tokens:
{ } ( ) , ...
A cconv conversion definition consists of a set of character mapping entries optionally divided into mapping tables and, in case of stateful encoding, preceded by a table defining the charset shift designators. Optionally, a table defining input side combining/conjoining sequences may also follow as the last block.
A character mapping entry is a line with two entries separated by white space and optionally followed by a comment:
0x20 0x0020 # SPACE
The above example defines the mapping for the SPACE character from an ASCII compatible codeset to UTF-32.
The left value of a character mapping entry is a HEXADECIMAL value of the character in the ASCII compatible codeset.
The right value of a character mapping entry can be a HEXADECIMAL value of the character in UTF-32, keyword NI with optional transliteration definition (see "Non-identicals and transliterations" below), keyword IL (see "Illegals"), variant definition (see "Variants") or combining/conjoining sequence definition (see "Combining/conjoining sequences").
For more details on the mapping tables, charset shift designators, and input side combining/conjoining sequences, refer to Multibyte codeset files, Stateful encoding codeset files, and Combining/conjoining sequences sections below.
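As a rough illustration of the character mapping entry layout described above, the following Python sketch splits one line into its left and right values after dropping a trailing comment. The function name `split_mapping_entry` and its `comment_char` parameter are hypothetical; the sketch does not interpret the right value, which may be a HEXADECIMAL, NI, IL, a variant list, or a combining/conjoining sequence.

```python
def split_mapping_entry(line, comment_char="#"):
    """Split one character mapping line into (left, right).

    Returns None for a blank or comment-only line. A line with only one
    field raises ValueError, since an entry needs both values.
    """
    # Everything from the comment character to the end of line is ignored.
    body = line.split(comment_char, 1)[0].strip()
    if not body:
        return None
    # The left value ends at the first run of white space.
    left, right = body.split(None, 1)
    return left, right.strip()
```

For the SPACE example above, `split_mapping_entry("0x20 0x0020 # SPACE")` yields the pair `("0x20", "0x0020")`.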
Three types of input files are accepted: single byte codeset files, multibyte codeset files, and stateful encoding codeset files. Depending on the type of the input file, different tables with supplementary information are allowed.
Single byte codeset files
The input source file for cconv code conversions between a single byte codeset and UTF-32 consists of up to 256 character mapping entries for bytes 0x00 - 0xFF, such as in the following example:
# ISO 8859-1 to UTF-32 definition:
#
0x00 0x0000 # NULL
0x01 0x0001 # START OF HEADING
:
0xFE 0x00FE # LATIN SMALL LETTER THORN
0xFF 0x00FF # LATIN SMALL LETTER Y WITH DIAERESIS
where each line has a 1:1 mapping definition for a character from the ISO 8859-1 codeset to the Unicode UTF-32 codeset.
Any code point value in the single byte coding space range that doesn't have an explicit mapping will be treated as an illegal character.
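The rule that unmapped byte values are illegal can be pictured as a 256-slot table that starts out entirely illegal. This is only a sketch: the `ILLEGAL` marker and the function name `build_single_byte_table` are hypothetical, and the real binary table layout is not described here.

```python
ILLEGAL = None  # hypothetical marker for a byte value with no mapping


def build_single_byte_table(entries):
    """Build a 256-slot lookup table from (byte_value, utf32_value) pairs."""
    # Every slot starts as illegal; explicit entries overwrite their slot.
    table = [ILLEGAL] * 256
    for byte_value, utf32_value in entries:
        if not 0x00 <= byte_value <= 0xFF:
            raise ValueError("byte value out of range: %#x" % byte_value)
        table[byte_value] = utf32_value
    return table
```

With entries for 0x00, 0x01, 0xFE, and 0xFF only, a lookup of 0x80 in the resulting table returns the `ILLEGAL` marker.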
Multibyte codeset files
The input source file for cconv code conversions between a multibyte codeset and UTF-32 consists of one or more blocks of sub-codesets, delimited by MAPPING_TABLE and END MAPPING_TABLE keywords.
Each sub-codeset mapping table has a unique DECIMAL identifier and contains an optional range definition followed by character mapping entries:
# A multibyte codeset to UTF-32 definition
#
MAPPING_TABLE 0
range 0x00...0xff
0x00 0x0000
:
0xff 0xfffd
END MAPPING_TABLE

MAPPING_TABLE 1
range \x41\x41...\xdd\xd2
\x41\x41 U+3001
:
\xDD\xD2 \uE72C
END MAPPING_TABLE
The "range" definition specifies the valid coding space for a sub-codeset. With the "range" defined, any code point value in the range that doesn't have an explicit mapping will be treated as a non-identical code conversion.
If "range" isn't explicitly defined, geniconvtbl(1) will scan all the entries in the sub-codeset to find out the range-related values and then any code point value in the range that doesn't have an explicit mapping will be treated as an illegal character.
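The difference between an explicit and an implicit "range" can be sketched as follows. The function name `classify_unmapped` is hypothetical, and two points are assumptions rather than documented behavior: the range is treated as a simple integer interval, and a code point outside an explicit range is reported as illegal.

```python
def classify_unmapped(code, mapped_codes, explicit_range=None):
    """Return "mapped", "NI", or "IL" for a source code point.

    mapped_codes   -- set of code points that have an explicit entry
    explicit_range -- (low, high) from a "range" line, or None if absent
    """
    if code in mapped_codes:
        return "mapped"
    if explicit_range is None:
        # No "range" line: the range is derived from the entries and
        # unmapped values in it are treated as illegal characters.
        return "IL"
    low, high = explicit_range
    # Inside an explicit range, unmapped values are non-identicals.
    # Outside it, "IL" is an assumption made for this sketch.
    return "NI" if low <= code <= high else "IL"
```

For MAPPING_TABLE 1 in the example above, an unmapped value such as 0x4142 would be classified as "NI" because the range is declared explicitly.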
The maximum number of mapping tables and the maximum number of code conversion character mapping entries that an input file can have are each 4,294,967,296, i.e., UINT_MAX + 1.
Stateful encoding codeset files
The input source file for cconv code conversions between a stateful encoding codeset and UTF-32 consists of a block defining a set of designators for charsets and shifts followed by one or more blocks of sub-codeset mapping tables.
The block that defines the designators is delimited by CHARSET_SHIFT_DESIGNATORS and END CHARSET_SHIFT_DESIGNATORS keywords and contains optional "charset" designators and "locking_shift" and "single_shift" shift designators. For example:
CHARSET_SHIFT_DESIGNATORS
# keyword       sequence        graphic   mapping    initial?
#                               set id    table id
charset         \x1b\x28\x42    0         0          initial
charset         \x1b\x24\x41    1         1
charset         \x1b\x24\x40    1         2
charset         \x1b\x24\x30    2         3
#
# keyword       sequence        graphic   initial?
#                               set id
locking_shift   \x7e\x7b        0         initial
locking_shift   \x7e\x7d        1
single_shift    \x1b\x0e        2
END CHARSET_SHIFT_DESIGNATORS

MAPPING_TABLE 0
range 0x00...0x7f
\x41 0x0041
:
END MAPPING_TABLE

MAPPING_TABLE 1
range 0x2121...0x7e72
\x21\x21 U+7E20
:
END MAPPING_TABLE

MAPPING_TABLE 2
\x21\x21 \uFE23
:
END MAPPING_TABLE
Each of the "charset" designators defines a charset (i.e., a codeset) with a sequence of characters identifying the charset, a graphic set id that identifies which graphic set it is assigned to, and the corresponding mapping table id. The optional last field "initial" indicates that the charset is the default initial charset that the code conversion must start with. When no "initial" keyword is specified, the first charset defined by a "charset" designator from the top is the default initial charset. Defining more than one "initial" charset is an error. When the initial charset doesn't have an identifying sequence, its sequence entry must indicate this with the keyword "NIL".
The "locking_shift" and "single_shift" lines define a locking shift designator and a single shift designator, respectively. Each has a sequence of characters identifying the shift operation followed by the graphic set id that takes effect when the shift sequence is found. For a locking shift, the last field is an optional keyword "initial" that indicates that the designated shift operation is the default initial shift state that the code conversion must start with. Defining more than one "initial" shift sequence is an error.
It is possible to have no "charset" designators but only shift designators. In this case, the graphic set id column should have mapping table ids.
The maximum number of mapping tables, the maximum number of designators (including both charset and shift designators), the maximum length of a designator sequence, and the maximum number of graphic sets in a code conversion for a stateful encoding are each 256. The maximum number of code conversion character mapping entries that an input file can have is 4,294,967,296, i.e., UINT_MAX + 1.
See the NOTES section for more details on stateful encoding codesets and the terminology used.
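The rules for the "initial" keyword in the designator block can be summarized in a short Python sketch. The function name `pick_initial` and the tuple layout are hypothetical; the sketch only covers what the text states, namely that at most one charset and at most one shift may be marked "initial", and that the first charset from the top is the default when none is marked.

```python
def pick_initial(designators):
    """Choose the initial charset and shift from a designator list.

    designators: list of (kind, sequence, is_initial) in file order, where
    kind is "charset", "locking_shift", or "single_shift".
    Returns (charset_sequence, shift_sequence); either may be None.
    """
    charsets = [d for d in designators if d[0] == "charset"]
    shifts = [d for d in designators if d[0] != "charset"]

    initial_charsets = [d for d in charsets if d[2]]
    initial_shifts = [d for d in shifts if d[2]]
    if len(initial_charsets) > 1:
        raise ValueError('more than one "initial" charset')
    if len(initial_shifts) > 1:
        raise ValueError('more than one "initial" shift sequence')

    if initial_charsets:
        charset = initial_charsets[0][1]
    elif charsets:
        charset = charsets[0][1]  # first charset from the top is the default
    else:
        charset = None  # only shift designators are present
    shift = initial_shifts[0][1] if initial_shifts else None
    return charset, shift
```

Applied to the example block above, the result is the sequence \x1b\x28\x42 for the charset and \x7e\x7b for the shift.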
Non-identicals and transliterations
To explicitly mark a character in a character mapping as a non-identical (i.e., a character with no matching counterpart in the other codeset), the NI keyword is used.
0xa1 NI # 0xa1
To define a transliteration for a non-identical character, the syntax is as follows:
0xc1 NI(\x41\x27) # <A with acute> maps to "A'"
For a code conversion from a multibyte or a stateful encoding codeset to UTF-32, when the "range" for a MAPPING_TABLE is explicitly defined, any code point values belonging to the defined range and having no explicit mapping will be treated as implicit non-identicals.
The combined maximum number of transliterations and conjoining/combining sequences supported in a code conversion is 1,048,576. The maximum length of all transliterations and conjoining/combining sequences combined is 4,294,967,296 bytes, i.e., UINT_MAX + 1.
It is possible to explicitly define the replacement character for non-identical conversions by using the "REPLACEMENT_CHAR" keyword, e.g.:
REPLACEMENT_CHAR \xa1\xa1
When REPLACEMENT_CHAR is specified, geniconvtbl will store the value in the header of the output binary table file instead of the default values.
Without the REPLACEMENT_CHAR definition, the replacement character defaults to '?' (0x3f in ASCII-compatible codesets) if the target codeset is a non-Unicode codeset, or to the Unicode replacement character (U+FFFD) if the target codeset is a Unicode codeset.
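The default selection described above amounts to a two-way choice, shown here as a small Python sketch. The function name `replacement_char` and its parameters are hypothetical and do not reflect how geniconvtbl stores the value in the binary table header.

```python
def replacement_char(target_is_unicode, explicit=None):
    """Return the replacement character value for non-identical conversions."""
    if explicit is not None:
        # A REPLACEMENT_CHAR definition overrides the defaults.
        return explicit
    # Defaults: U+FFFD for a Unicode target, '?' (0x3f) otherwise.
    return 0xFFFD if target_is_unicode else 0x3F
```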
Illegals
To explicitly mark a code point value as illegal in a code conversion, use the IL keyword in the input, as in the following examples:
0xbb IL
U+D800 IL
For a code conversion from a single byte codeset to UTF-32, the byte values that don't have any explicit mapping definitions are considered implicit illegals.
For a code conversion from UTF-32 to any codeset, UTF-32 code point values in the Surrogate area will be treated as illegal code point values. Any other code points without an explicit mapping definition will be treated as non-identicals by default.
For a code conversion from a multibyte or a stateful encoding codeset to UTF-32, when the "range" for a MAPPING_TABLE is not explicitly defined, geniconvtbl(1) scans all the entries in the sub-codeset to find the range-related values, and any code point values in that range that do not have an explicit mapping are treated as implicit illegals.
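The default treatment of UTF-32 source code points described above can be sketched as follows. The function name `classify_utf32_source` is hypothetical. The Surrogate area is taken to be U+D800 through U+DFFF, and the sketch checks it before looking at the mapping, which is an assumption about precedence.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def classify_utf32_source(code_point, mapped):
    """Return "IL", "mapped", or "NI" for a UTF-32 source code point.

    mapped -- set (or dict) of code points with an explicit mapping
    """
    if SURROGATE_FIRST <= code_point <= SURROGATE_LAST:
        return "IL"  # surrogate code points are illegal
    if code_point in mapped:
        return "mapped"
    return "NI"  # no explicit mapping: non-identical by default
```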
Variants
Within a character mapping entry it is possible to define multiple variants for the output character. In the geniconvtbl cconv input file, variants are separated by a comma.
0xa1 0xa1a1,0xb1b1,,0xfefe
0xa2 0xa1a2
In the above example, the first mapping line defines four variant levels and the input character 0xa1 has three variants defined: 0xa1a1 for level 0, 0xb1b1 for level 1, and 0xfefe for level 3. There is no variant for level 2; in such cases, the value for level 0 is used as the match for both levels 0 and 2.
In the second mapping line of the example above there are no variants defined except just the level 0 and so 0xa1a2 will be matched for all four levels.
It is an error when there is no level 0 variant specified.
The maximum number of variant levels supported in a code conversion is 256. The maximum number of variants supported in a code conversion is 1,048,576.
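The fallback to level 0 for omitted variants can be shown with a short Python sketch. The names `parse_variants` and `variant_for_level` are hypothetical, and the sketch handles only plain HEXADECIMAL variants; right values that contain their own commas, such as transliterations or combining/conjoining sequences, would need a real tokenizer.

```python
def parse_variants(rvalue):
    """Split a variant list such as "0xa1a1,0xb1b1,,0xfefe" by level.

    Returns a list indexed by level, with None for an omitted level.
    """
    levels = [part.strip() or None for part in rvalue.split(",")]
    if levels[0] is None:
        raise ValueError("a level 0 variant must be specified")
    return levels


def variant_for_level(levels, level):
    """Return the variant for a level, falling back to level 0."""
    if level < len(levels) and levels[level] is not None:
        return levels[level]
    return levels[0]
```

For the first mapping line of the example, level 2 resolves to 0xa1a1 and level 3 to 0xfefe; for the second line, every level resolves to 0xa1a2.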
Combining/conjoining sequences
There are two kinds of combining/conjoining sequences, based on where they appear: on the input buffer side or on the output buffer side.
Input buffer side combining/conjoining sequences are defined in the input source file using an optional block (and as the last block in the input source file) delimited by the COMBINING_SEQ and END COMBINING_SEQ keywords, as shown in the following example for single and multibyte codeset files:
COMBINING_SEQ
{U+00CA,U+0304} 0x8862
{U+0041,U+0301,U+0302} {0xabcd,0xef01}
{U+007E,U+000A} NIL
END COMBINING_SEQ
The NIL keyword indicates that the corresponding input side combining sequence should be consumed but without yielding any output.
For stateful encoding codeset files, as shown below, a mapping table id must be appended:
COMBINING_SEQ
{0x7e,0x7e} 0x7e 0
{0x7e,0x0a} 0x0a 2
END COMBINING_SEQ
The mapping table id 0 in the above example means that the mapping to 0x7e happens if and only if the input buffer contains 0x7e followed by another 0x7e and the current mapping table id is 0. With any other mapping table id, the mapping does not happen.
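The dependence of an input side combining sequence on the current mapping table id can be illustrated with the Python sketch below. The function name `match_combining` and the tuple layout are hypothetical, and taking the first matching entry in list order is an assumption made for this sketch, not documented behavior.

```python
def match_combining(units, pos, current_table_id, sequences):
    """Try to match an input side combining sequence at units[pos:].

    sequences: list of (input_units, output, table_id) entries.
    Returns (output, consumed_count) for the first match, else None.
    """
    for seq, output, table_id in sequences:
        if table_id != current_table_id:
            continue  # entry applies only under its own mapping table id
        if tuple(units[pos:pos + len(seq)]) == tuple(seq):
            return output, len(seq)
    return None
```

With the two entries from the example, the input 0x7e 0x7e matches only while mapping table 0 is current, and 0x7e 0x0a matches only while mapping table 2 is current.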
Output buffer side combining/conjoining sequences are defined in the input source file as in the following example:
0xc0 {\x00\x00\x00\x41,\x00\x00\x03\x00} # A with grave
0xc1 {U+0041,U+0301} # A with acute
The combined maximum number of transliterations and conjoining/combining sequences supported in a code conversion is 1,048,576. The maximum length of all transliterations and conjoining/combining sequences added is 4,294,967,296 bytes, i.e., UINT_MAX + 1.
The syntax of the geniconvtbl cconv input source file for code conversion definition in extended BNF is illustrated below:
cconv_input_file
    : codeset
    | char_definitions codeset
    ;

char_definitions
    : comment_char_definition
    | replacement_char_definition
    | comment_char_definition replacement_char_definition
    ;

comment_char_definition
    : 'COMMENT_CHAR' CHARACTER
    ;

replacement_char_definition
    : 'REPLACEMENT_CHAR' HEXADECIMAL
    ;

codeset
    : single_byte_codeset
    | multi_byte_codeset
    | stateful_encoding_codeset
    ;

single_byte_codeset
    : map_list
    | map_list combining_seq
    ;

multi_byte_codeset
    : mapping_table_list
    | mapping_table_list combining_seq
    ;

stateful_encoding_codeset
    : charset_shift_designators mapping_table_list
    | charset_shift_designators mapping_table_list stateful_combining_seq
    ;

map_list
    : map_pair
    | map_list map_pair
    ;

mapping_table_list
    : mapping_table
    | mapping_table_list mapping_table
    ;

mapping_table
    : 'MAPPING_TABLE' mapping_table_id map 'END' 'MAPPING_TABLE'
    ;

map
    : map_list
    | 'range' range_pair map_list
    ;

range_pair
    : HEXADECIMAL '...' HEXADECIMAL
    ;

map_pair
    : HEXADECIMAL rvalue
    | HEXADECIMAL variant_list
    ;

rvalue
    : HEXADECIMAL
    | 'IL'
    | 'NI'
    | transliteration
    | combining_conjoining
    ;

variant_list
    : rvalue ',' rvalue_omit_list
    ;

rvalue_omit_list
    : rvalue ',' rvalue_omit_list
    | ',' rvalue_omit_list
    | rvalue
    |
    ;

hexadecimal_list
    : HEXADECIMAL
    | hexadecimal_list ',' HEXADECIMAL
    ;

transliteration
    : 'NI' '(' hexadecimal_list ')'
    ;

combining_conjoining
    : '{' hexadecimal_list '}'
    ;

charset_shift_designators
    : 'CHARSET_SHIFT_DESIGNATORS' designator_list 'END' 'CHARSET_SHIFT_DESIGNATORS'
    ;

designator_list
    : designator
    | designator_list designator
    ;

designator
    : charset
    | locking_shift
    | single_shift
    ;

charset
    : 'charset' charset_id graphic_set_id mapping_table_id 'initial'
    | 'charset' NIL graphic_set_id mapping_table_id 'initial'
    | 'charset' charset_id graphic_set_id mapping_table_id
    ;

locking_shift
    : 'locking_shift' charset_id graphic_set_id 'initial'
    | 'locking_shift' charset_id graphic_set_id
    ;

single_shift
    : 'single_shift' charset_id graphic_set_id
    ;

mapping_table_id
    : DECIMAL
    ;

charset_id
    : HEXADECIMAL
    ;

graphic_set_id
    : DECIMAL
    ;

combining_seq
    : 'COMBINING_SEQ' combining_seq_input_list 'END' 'COMBINING_SEQ'
    ;

combining_seq_input_list
    : combining_seq_input
    | combining_seq_input_list combining_seq_input
    ;

combining_seq_input
    : '{' hexadecimal_list '}' HEXADECIMAL
    | '{' hexadecimal_list '}' '{' hexadecimal_list '}'
    | '{' hexadecimal_list '}' 'NIL'
    ;

stateful_combining_seq
    : 'COMBINING_SEQ' stateful_combining_seq_input_list 'END' 'COMBINING_SEQ'
    ;

stateful_combining_seq_input_list
    : stateful_combining_seq_input
    | stateful_combining_seq_input_list stateful_combining_seq_input
    ;

stateful_combining_seq_input
    : combining_seq_input mapping_table_id
    ;
the utility geniconvtbl
conversion binary tables
geniconvtbl(1), cconv(3C), cconv_close(3C), cconv_open(3C), cconvctl(3C), iconv(3C), iconv_close(3C), iconv_open(3C), iconvctl(3C), attributes(7), environ(7)
NOTES
A stateful encoding codeset is a codeset that has states, so that the same code point values can represent different characters depending on the current state. The state is usually changed by charset designators, single shift designators, locking shift designators, or any combination of them. A charset designator designates a charset into one of the graphic sets the codeset has. A single shift designator indicates that the immediately following character, and only that character, belongs to a specific graphic set. A locking shift designator indicates that all following characters belong to a specific graphic set until another shift designator appears.
The most common stateful encoding codesets are based on the ISO/IEC 2022 codeset extension mechanism, such as ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR. There are also stateful encoding codesets, such as HZ-GB-2312 and N-byte Hangul, that are based on other mechanisms rather than strictly on ISO/IEC 2022.
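The difference between a single shift and a locking shift can be illustrated with a small state machine in Python. This is a sketch under simplifying assumptions: the input is already tokenized, the tags "LS" and "SS" and the function name `graphic_sets_for` are hypothetical, and charset designators are left out.

```python
def graphic_sets_for(tokens, initial_set=0):
    """Return a list of (char, graphic_set) pairs for a token stream.

    Each token is ("LS", n) for a locking shift to graphic set n,
    ("SS", n) for a single shift to graphic set n, or a character code.
    """
    locked = initial_set      # set selected by the last locking shift
    pending_single = None     # set selected by a single shift, if any
    result = []
    for tok in tokens:
        if isinstance(tok, tuple):
            kind, n = tok
            if kind == "LS":
                locked = n
            elif kind == "SS":
                pending_single = n
            continue
        if pending_single is not None:
            result.append((tok, pending_single))
            pending_single = None  # a single shift covers one character only
        else:
            result.append((tok, locked))
    return result
```

In the stream 0x41, single shift to set 2, 0x21, 0x42, only 0x21 is read from graphic set 2; after a locking shift to set 1, every following character is read from set 1 until another shift appears.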