iconv_unicode - man pages section 7: Standards, Environments, Macros, Character Sets, and Miscellany

iconv_unicode(7)

Name

iconv_unicode - codeset conversion for Unicode

Description

The table below lists the names and descriptions of the supported Unicode encodings or encoding schemes (byte serializations of Unicode encoding forms) that can be used as the fromcode or tocode parameters to iconv(1), iconv_open(3C), and cconv_open(3C). Aliases such as FSS-UTF and UTF8 are also available.

To list the iconv and cconv conversions available on the current system, including aliases and optional variant levels, run the iconv -l command as described in the iconv(1) manual page.

For additional information on the mappings between canonical names and supported aliases with optional variant levels, refer to the alias(5) manual page and also the /usr/lib/iconv/alias file.

Encoding Form	Description
UTF-8	Multibyte sequences of 1-4 character bytes
UTF-16	Represented in one 16-bit entity for `U+0000-U+D7FF` and `U+E000-U+FFFF`, and in two 16-bit entities for `U+10000-U+10FFFF`. Uses the platform's default byte ordering and includes the Byte Order Mark (BOM). See below for a description of the BOM.
UTF-16-INTERNAL	UTF-16, without BOM
UTF-16BE	UTF-16 in the big-endian byte ordering, without BOM
UTF-16-BIG-ENDIAN	UTF-16 in the big-endian byte ordering, including BOM
UTF-16LE	UTF-16 in the little-endian byte ordering, without BOM
UTF-16-LITTLE-ENDIAN	UTF-16 in the little-endian byte ordering, including BOM
UTF-16-SWAPPED	UTF-16 with endianness opposite to that of the local platform, without BOM
UTF-32	Represented in one 32-bit entity in the platform's default byte ordering, including BOM
UTF-32-INTERNAL	UTF-32, without BOM
UTF-32BE	UTF-32 in the big-endian byte ordering, without BOM
UTF-32-BIG-ENDIAN	UTF-32 in the big-endian byte ordering, including BOM
UTF-32-SWAPPED	UTF-32 with endianness opposite to that of the local platform, without BOM
UTF-32LE	UTF-32 in the little-endian byte ordering, without BOM
UTF-32-LITTLE-ENDIAN	UTF-32 in the little-endian byte ordering, including BOM
UCS-2	Represented in one 16-bit entity for `U+0000-U+D7FF` and `U+E000-U+FFFF` in the platform's default byte ordering, including BOM
UCS-2-INTERNAL	UCS-2, without BOM
UCS-2BE	UCS-2 in the big-endian byte ordering, without BOM
UCS-2-BIG-ENDIAN	UCS-2 in the big-endian byte ordering, including BOM
UCS-2LE	UCS-2 in the little-endian byte ordering, without BOM
UCS-2-LITTLE-ENDIAN	UCS-2 in the little-endian byte ordering, including BOM
UCS-2-SWAPPED	UCS-2 with endianness opposite to that of the local platform, without BOM
UCS-4	Represented in one 32-bit entity in the platform's default byte ordering, including BOM
UCS-4-INTERNAL	UCS-4, without BOM
UCS-4BE	UCS-4 in the big-endian byte ordering, without BOM
UCS-4-BIG-ENDIAN	UCS-4 in the big-endian byte ordering, including BOM
UCS-4LE	UCS-4 in the little-endian byte ordering, without BOM
UCS-4-LITTLE-ENDIAN	UCS-4 in the little-endian byte ordering, including BOM
UCS-4-SWAPPED	UCS-4 with endianness opposite to that of the local platform, without BOM
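
As a rough illustration of the one-entity and two-entity ranges listed for UTF-16 in the table above, the following Python sketch computes the 16-bit entities for a single code point. The function name utf16_units is hypothetical and is not part of iconv; the sketch only mirrors the ranges in the table.

```python
def utf16_units(code_point):
    """Return the 16-bit UTF-16 entities (as integers) for one code point."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        # U+0000-U+D7FF and U+E000-U+FFFF: one 16-bit entity
        return [code_point]
    # U+10000-U+10FFFF: two 16-bit entities (a surrogate pair)
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

For instance, utf16_units(0x41) gives one entity, while utf16_units(0x1F600) gives the pair 0xD83D, 0xDE00. How those entities are laid out as bytes then depends on the byte ordering of the chosen encoding.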

UCS, or Universal Character Set, refers to the ISO/IEC 10646 family of standards, whose character set is identical to that of Unicode.

The Byte Order Mark, also known as BOM (U+FEFF), is a special character at the beginning of a file or character stream that denotes the byte order of the subsequent characters. UCS-2, UTF-16, UTF-32, and UCS-4 files and character streams usually start with a BOM character to indicate the byte ordering used in the file or character stream.
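
The following Python sketch shows how U+FEFF serializes in each byte order, which is what lets a reader tell the orders apart. It uses Python's codecs module as a stand-in; the Python codec names and the helper with_bom are assumptions made for illustration and are not the iconv names from the table.

```python
import codecs

# U+FEFF serialized in each byte order; the byte sequence that starts
# the stream tells a reader which ordering follows.
BOMS = {
    "UTF-16 big-endian": codecs.BOM_UTF16_BE,     # FE FF
    "UTF-16 little-endian": codecs.BOM_UTF16_LE,  # FF FE
    "UTF-32 big-endian": codecs.BOM_UTF32_BE,     # 00 00 FE FF
    "UTF-32 little-endian": codecs.BOM_UTF32_LE,  # FF FE 00 00
}


def with_bom(text, big_endian=True):
    """Encode text as UTF-16 with a leading BOM in the chosen byte order."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    return ("\ufeff" + text).encode(codec)
```

Encoding the letter A this way yields FE FF 00 41 in big-endian order and FF FE 41 00 in little-endian order.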

UTF-8 to UTF-8 conversion simply moves bytes from the input buffer to the output buffer without doing any conversion. During the moves, illegal character checking is done to screen out any potentially harmful character bytes. Such illegal characters cause the conversion to fail.
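
The pass-through behavior described above can be sketched in Python as follows. This is only an approximation: the function name utf8_passthrough is hypothetical, and the validity rules are those of Python's strict UTF-8 decoder, which may not match the exact checks the iconv module performs.

```python
def utf8_passthrough(data: bytes) -> bytes:
    """Copy UTF-8 input to output unchanged, failing on ill-formed bytes."""
    # A strict decode raises UnicodeDecodeError on ill-formed input such as
    # overlong forms, stray continuation bytes, and encoded surrogates.
    data.decode("utf-8", errors="strict")
    # Valid input is returned byte for byte; nothing is converted.
    return bytes(data)
```

Well-formed input comes back unchanged, while a sequence such as the overlong form C0 AF raises UnicodeDecodeError, which corresponds to the conversion failing.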

UTF-7, a legacy 7-bit Unicode Transformation Format, is only supported by iconv conversions to and from UTF-8, UCS-2 and UCS-4.

UTF-EBCDIC, a legacy EBCDIC-compatible variant of UTF-8, is only supported by iconv conversions to and from UTF-8.

Notes

iconv also supports conversion between Unicode encodings and many other codesets. These include, for example, the ISO 8859 character sets, EBCDIC code pages, EUC (Extended UNIX Code) and ISO 2022 encodings for Chinese, Japanese, and Korean, and many others (see iconv_extra(7), iconv_ja(7), iconv_ko(7), iconv_zh(7), iconv_zh_HK(7), and iconv_zh_TW(7)).

If a source character code value cannot be mapped to a valid character in the target codeset, it is considered either an illegal or a non-identical character. In the absence of explicit information about the source character code value, iconv code conversions use the following rules to determine whether a character is illegal or non-identical:

If the source character code value is not within a range defined by the source codeset standard, it is considered an illegal character. If the source character code value is within the range(s) defined by the standard, it is considered non-identical, even if the source character code value maps to an undefined or a reserved location within the valid range. A non-identical character maps to ? (0x3f in ASCII-compatible codesets) if the target codeset is a non-Unicode codeset, or to the Unicode replacement character (U+FFFD) if the target codeset is a Unicode codeset.
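
The rules above can be sketched as a small Python function. Everything named here (convert_value, IllegalCharacter, the valid_range and table parameters) is hypothetical and stands in for whatever a real conversion module uses internally; the sketch only follows the decision order stated in this section.

```python
REPLACEMENT_NON_UNICODE = "?"   # 0x3f in ASCII-compatible codesets
REPLACEMENT_UNICODE = "\ufffd"  # Unicode replacement character


class IllegalCharacter(ValueError):
    """Source value lies outside the range defined by the source codeset."""


def convert_value(value, valid_range, table, target_is_unicode):
    """Map one source code value using a hypothetical mapping table."""
    if value not in valid_range:
        # Outside the range defined by the source codeset standard: illegal.
        raise IllegalCharacter(hex(value))
    if value in table:
        return table[value]
    # In range but with no valid target character: non-identical.
    return REPLACEMENT_UNICODE if target_is_unicode else REPLACEMENT_NON_UNICODE
```

With a toy 8-bit source codeset (valid_range=range(0x100)) and a table that maps only 0x41 to "A", the value 0x80 yields U+FFFD for a Unicode target and ? otherwise, while 0x100 raises IllegalCharacter.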

When the BOM is present as the first character in an encoding that supports it, it determines how the byte order of the Unicode character sequences that follow is interpreted. If U+FEFF appears anywhere other than as the first character in such an encoding, or appears in a Unicode encoding that does not support the BOM, it is interpreted as a Zero Width No-Break Space (ZWNBSP) character and does not affect how the Unicode characters are interpreted in terms of byte ordering.

When the target codeset is UCS-2, UTF-16, UTF-32, UCS-4, UCS-2-BIG-ENDIAN, UCS-2-LITTLE-ENDIAN, UTF-16-BIG-ENDIAN, UTF-16-LITTLE-ENDIAN, UCS-4-BIG-ENDIAN, UCS-4-LITTLE-ENDIAN, UTF-32-BIG-ENDIAN, or UTF-32-LITTLE-ENDIAN, expect a BOM character at the beginning of the iconv code conversion output buffer.

When the source codeset is UCS-2, UTF-16, UTF-32, or UCS-4 and no BOM is present as the first input character, the byte ordering of the current system is assumed for the input byte stream given to the iconv code conversion.
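
The input-side BOM handling described in these notes can be sketched for UTF-16 as follows. The function decode_utf16_stream and its system_order parameter are assumptions made for illustration; the sketch relies on Python's utf-16-be and utf-16-le codecs, which do not strip or interpret a BOM themselves.

```python
import sys


def decode_utf16_stream(data: bytes, system_order: str = sys.byteorder) -> str:
    """Decode UTF-16 input, honoring a leading BOM if one is present."""
    if data[:2] == b"\xfe\xff":
        order, payload = "big", data[2:]
    elif data[:2] == b"\xff\xfe":
        order, payload = "little", data[2:]
    else:
        # No BOM as the first character: assume the system byte ordering.
        order, payload = system_order, data
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    # Any later U+FEFF stays in the text as ZWNBSP and does not
    # change the byte ordering.
    return payload.decode(codec)
```

A stream that starts with FE FF is read as big-endian, one that starts with FF FE as little-endian, and one with neither is read in the byte ordering passed as system_order, which defaults to that of the running machine.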

Examples

Example 1 The iconv Library Module Filename

In the conversion library, /usr/lib/iconv (see iconv(3C)), the library module filename is composed of two symbolic elements separated by the percent sign (%). The first symbol specifies the source codeset, i.e. the codeset that is being converted; the second symbol specifies the target codeset, i.e. the codeset to which the first one is being converted.

For example, the library module filename to convert from the legacy UTF-7 codeset to the UTF-8 codeset is UTF-7%UTF-8.so.

Example 2 The cconv Library Module Filename

For some conversions, iconv(3C) calls the cconv(3C) interfaces to perform the conversion. The cconv conversion modules are binary tables with a .bt suffix, generated by geniconvtbl(1) and placed in the same /usr/lib/iconv library. The cconv library module filename is composed of the symbolic elements for the source and target codesets, separated by the plus sign (+). The cconv conversion is typically performed in two steps, with UTF-32 as the intermediate encoding.

For example, the cconv library module filename to convert from the Japanese EUC codeset to the UTF-32 codeset is eucJP+UTF-32.bt.
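
The two naming conventions from Example 1 and Example 2 can be expressed as a pair of small Python helpers. The function names are hypothetical, and the sketch only joins the two symbols; it does not resolve aliases or check that a module actually exists under /usr/lib/iconv.

```python
def iconv_module_name(source: str, target: str) -> str:
    """File name of an iconv module: source and target joined by '%'."""
    return f"{source}%{target}.so"


def cconv_table_name(source: str, target: str) -> str:
    """File name of a cconv binary table: source and target joined by '+'."""
    return f"{source}+{target}.bt"
```

Calling iconv_module_name("UTF-7", "UTF-8") reproduces the name from Example 1, and cconv_table_name("eucJP", "UTF-32") reproduces the name from Example 2.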

Files

/usr/lib/iconv/*.so: iconv conversion modules
/usr/lib/iconv/*.bt: cconv code conversion binary tables for iconv(1), cconv(3C), and iconv(3C)
/usr/lib/iconv/geniconvtbl/binarytables/*.bt: geniconvtbl conversion binary tables
/usr/lib/iconv/alias: Alias table file of codeset names
