iconv_unicode - codeset conversion for Unicode
The table below lists the names and descriptions of the supported Unicode encodings or encoding schemes (byte serializations of Unicode encoding forms) that can be used as fromcode or tocode parameters to iconv(1), iconv_open(3C), and cconv_open(3C). There are also aliases such as FSS-UTF, UTF8, and so on.
To list the iconv and cconv conversions available on the current system, including aliases and optional variant levels, run the iconv -l command as described in the iconv(1) manual page.
For additional information on the mappings between canonical names and supported aliases with optional variant levels, refer to the alias(5) manual page and also the /usr/lib/iconv/alias file.
UCS, or Universal Character Set, refers to the ISO/IEC 10646 family of standards with character set identical to that of Unicode.
Byte Order Mark, also known as BOM (U+FEFF), is a special character at the beginning of a file or character stream that denotes the byte order of the subsequent characters. UCS-2, UTF-16, UTF-32, and UCS-4 files and character streams usually start with a BOM character to indicate the byte ordering used in the file or character stream.
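As a minimal illustration of the BOM, the following Python sketch lists the byte sequences that U+FEFF serializes to in each byte order. It relies on Python's standard codecs module rather than on iconv, so it only illustrates the encoding schemes and says nothing about the behavior of a particular iconv module. The dictionary name BOMS is hypothetical.

```python
import codecs

# U+FEFF serialized in each byte order. The constants come from Python's
# standard codecs module, not from iconv.
BOMS = {
    "UTF-16 big-endian": codecs.BOM_UTF16_BE,
    "UTF-16 little-endian": codecs.BOM_UTF16_LE,
    "UTF-32 big-endian": codecs.BOM_UTF32_BE,
    "UTF-32 little-endian": codecs.BOM_UTF32_LE,
}

for name, bom in BOMS.items():
    # Print each BOM as a hexadecimal byte string.
    print(f"{name}: {bom.hex()}")
```

A reader can compare the first bytes of a file against these sequences to see which byte order, if any, the file declares.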
UTF-8 to UTF-8 conversion simply moves bytes from the input buffer to the output buffer without doing any conversion. During the move, the bytes are checked for illegal characters to screen out any potentially harmful character bytes. Any such illegal character causes the conversion to fail.
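The idea of a validating pass-through can be sketched in Python as follows. This is an analogue, not the iconv implementation: it uses Python's strict UTF-8 decoder as the validity check, and the function name copy_utf8 is hypothetical. The exact set of byte sequences that iconv rejects may differ from what Python rejects.

```python
def copy_utf8(data: bytes) -> bytes:
    """Return data unchanged if it is well-formed UTF-8, else raise ValueError."""
    try:
        # Strict decoding rejects truncated sequences, overlong forms,
        # and encoded surrogates; the decoded text itself is discarded.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"illegal UTF-8 byte sequence at offset {exc.start}"
        ) from exc
    # The bytes are passed through untouched.
    return data
```

For instance, the overlong sequence 0xC0 0xAF is rejected, while any well-formed UTF-8 input is returned byte for byte.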
UTF-7, a legacy 7-bit Unicode Transformation Format, is only supported by iconv conversions to and from UTF-8, UCS-2 and UCS-4.
UTF-EBCDIC, a legacy EBCDIC-compatible variant of UTF-8, is only supported by iconv conversions to and from UTF-8.
iconv also supports conversion between Unicode encodings and many different codesets. The list of such codesets includes, for example, the ISO 8859 character sets, EBCDIC code pages, EUC (Extended UNIX Code) and ISO 2022 encodings for Chinese, Japanese, Korean, and many others (see iconv_extra(7), iconv_ja(7), iconv_ko(7), iconv_zh(7), iconv_zh_HK(7), and iconv_zh_TW(7)).
If a source character code value cannot be mapped to a valid character in the target codeset, it is considered either an illegal or a non-identical character. In the absence of explicit information about the source character code value, iconv code conversions use the following rules to determine whether a character is illegal or non-identical:
If the source character code value is not within a range defined by the source codeset standard, it is considered an illegal character. If the source character code value is within the range(s) defined by the standard, it is considered non-identical, even if the source character code value maps to an undefined or a reserved location within the valid range. A non-identical character maps to ? (0x3f in ASCII-compatible codesets) if the target codeset is a non-Unicode codeset, or to the Unicode replacement character (U+FFFD) if the target codeset is a Unicode codeset.
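The rules above can be sketched as a small Python model. Everything in it is hypothetical: the names classify and convert, the toy valid range, and the toy mapping table are illustrations only, and raising an error for an illegal character is an assumption of this sketch rather than a statement about how a given iconv module reports it.

```python
REPLACEMENT_NON_UNICODE = "?"   # 0x3f in ASCII-compatible codesets
REPLACEMENT_UNICODE = "\ufffd"  # Unicode replacement character


def classify(code, valid_ranges, mapping):
    """Classify one source code value as 'illegal', 'mapped', or 'non-identical'."""
    # Outside every range defined by the source codeset: illegal.
    if not any(lo <= code <= hi for lo, hi in valid_ranges):
        return "illegal"
    # Inside a valid range and present in the table: ordinary mapping.
    if code in mapping:
        return "mapped"
    # Inside a valid range but undefined, reserved, or unmappable.
    return "non-identical"


def convert(codes, valid_ranges, mapping, target_is_unicode):
    """Convert code values, substituting a replacement for non-identical ones."""
    if target_is_unicode:
        replacement = REPLACEMENT_UNICODE
    else:
        replacement = REPLACEMENT_NON_UNICODE
    out = []
    for code in codes:
        kind = classify(code, valid_ranges, mapping)
        if kind == "illegal":
            # Assumption of this sketch: illegal input aborts the conversion.
            raise ValueError(f"illegal character 0x{code:02x}")
        out.append(mapping[code] if kind == "mapped" else replacement)
    return "".join(out)


# Toy source codeset: 7-bit range, with only two code values defined.
TOY_RANGES = [(0x00, 0x7F)]
TOY_MAPPING = {0x41: "A", 0x42: "B"}
```

With these toy tables, the code value 0x10 is in range but unmapped, so it becomes U+FFFD for a Unicode target and ? otherwise, whereas 0x80 is out of range and is treated as illegal.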
When the BOM is present as the first character in an encoding that supports it, it determines how the Unicode character sequences that follow are interpreted. If the BOM appears anywhere other than as the first character in such an encoding, or appears in a Unicode encoding that does not support the BOM, the BOM character (U+FEFF) is interpreted as the Zero Width No-Break Space (ZWNBSP) character and does not affect the byte ordering used to interpret the Unicode characters.
When the target codeset is one of UCS-2, UTF-16, UTF-32, UCS-4, UCS-2-BIG-ENDIAN, UCS-2-LITTLE-ENDIAN, UTF-16-BIG-ENDIAN, UTF-16-LITTLE-ENDIAN, UCS-4-BIG-ENDIAN, UCS-4-LITTLE-ENDIAN, UTF-32-BIG-ENDIAN, or UTF-32-LITTLE-ENDIAN, expect a BOM character at the beginning of the iconv code conversion output buffer.
When the source codeset is UCS-2, UTF-16, UTF-32, or UCS-4 and no BOM is present as the first input character, the byte ordering of the current system is assumed for the input byte stream given to the iconv code conversion.
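The input-side BOM rules can be sketched in Python for the UTF-16 case. This is a model of the described behavior built on Python's own codecs, not a call into iconv; the function names utf16_byte_order and decode_utf16 are hypothetical.

```python
import codecs
import sys


def utf16_byte_order(data: bytes):
    """Return (byte order, payload) for a UTF-16 input buffer."""
    # A leading BOM selects the byte order and is not part of the text.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little", data[2:]
    # No leading BOM: assume the byte order of the current system.
    return sys.byteorder, data


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 input using the byte order chosen above."""
    order, payload = utf16_byte_order(data)
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    # A U+FEFF that is not the first character stays in the text as ZWNBSP.
    return payload.decode(codec)
```

In this sketch, only a leading BOM influences byte order; a later U+FEFF is kept in the decoded text as an ordinary character.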
In the conversion library, /usr/lib/iconv (see iconv(3C)), the library module filename is composed of two symbolic elements separated by the percent sign (%). The first symbol specifies the source codeset, i.e. the codeset that is being converted; the second symbol specifies the target codeset, i.e. the codeset to which the first one is being converted.
For example, the library module filename to convert from the legacy UTF-7 codeset to the UTF-8 codeset is UTF-7%UTF-8.so.
Example 2 The cconv Library Module Filename
For some conversions, iconv(3C) makes a call to the cconv(3C) interfaces to perform the conversion. The cconv conversion modules are binary tables with a .bt suffix, generated by geniconvtbl(1) and placed in the same /usr/lib/iconv library. The cconv library module filename is composed of the symbolic elements for the source and target codesets, separated by the plus sign (+). The cconv conversion is typically performed in two steps, with UTF-32 as the intermediate encoding.
For example, the cconv library module filename to convert from the Japanese EUC codeset to the UTF-32 codeset is eucJP+UTF-32.bt.
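The two naming rules described above can be expressed as a short Python sketch. The helper names iconv_module_filename and cconv_table_filename are hypothetical, and the sketch only builds filename strings; it does not check whether a given module exists under /usr/lib/iconv or resolve aliases.

```python
def iconv_module_filename(fromcode: str, tocode: str) -> str:
    """Build an iconv module filename: source and target joined by '%'."""
    return f"{fromcode}%{tocode}.so"


def cconv_table_filename(fromcode: str, tocode: str) -> str:
    """Build a cconv binary table filename: source and target joined by '+'."""
    return f"{fromcode}+{tocode}.bt"


# The two examples given in the text.
print(iconv_module_filename("UTF-7", "UTF-8"))
print(cconv_table_filename("eucJP", "UTF-32"))
```

These helpers reproduce the two filenames given in the examples, UTF-7%UTF-8.so and eucJP+UTF-32.bt.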
iconv conversion modules
cconv code conversion binary tables for iconv(1), cconv(3C), and iconv(3C)
geniconvtbl conversion binary tables
Alias table file of codeset names
geniconvtbl(1), iconv(1), cconv(3C), cconv_close(3C), cconv_open(3C), cconvctl(3C), iconv(3C), iconv_close(3C), iconv_open(3C), iconvctl(3C), alias(5), geniconvtbl-cconv(5), iconv_extra(7), iconv_ja(7), iconv_ko(7), iconv_zh(7), iconv_zh_HK(7), iconv_zh_TW(7)
The Unicode Consortium. The Unicode Standard, Version 6.2.0 (Mountain View, CA: The Unicode Consortium, 2012), ISBN 978-1-936213-07-8.
Yergeau, F., UTF-8, a transformation format of Unicode and ISO 10646, RFC 2044, Alis Technologies, October 1996.
Ohta, M., Character Sets ISO-10646 and ISO-10646-J-1, RFC 1815, Tokyo Institute of Technology, July 1995.
Simonson, K., Character Mnemonics & Character Sets, RFC 1345, Rationel Almen Planlaegning, June 1992.
Goldsmith, D., and M. Davis, UTF-7 - A Mail-Safe Transformation Format of Unicode, RFC 1642, Taligent, Inc., July 1994.