man pages section 7: Standards, Environments, Macros, Character Sets, and Miscellany

Updated: Wednesday, August 8, 2018
 
 

iconv_unicode(7)

Name

iconv_unicode - codeset conversion for Unicode

Description

The table below lists the names and descriptions of the supported Unicode encodings or encoding schemes (byte serializations of Unicode encoding forms) that can be used as fromcode or tocode parameters to iconv(1), iconv_open(3C), and cconv_open(3C). There are also aliases such as FSS-UTF, UTF8, and so on.

To list the iconv and cconv conversions available on the current system, including aliases and optional variant levels, run the iconv -l command as described in the iconv(1) manual page.

For additional information on the mappings between canonical names and supported aliases with optional variant levels, refer to the alias(5) manual page and also the /usr/lib/iconv/alias file.

Unicode Encoding Form   Description
UTF-8                   Multibyte sequences of 1 to 4 bytes per character
UTF-16                  One 16-bit unit for U+0000-U+D7FF and U+E000-U+FFFF, and two 16-bit units for U+10000-U+10FFFF. Uses the platform's default byte ordering and includes the Byte Order Mark (BOM). See below for a description of the BOM.
UTF-16-INTERNAL         UTF-16 in the platform's default byte ordering, without BOM
UTF-16BE                UTF-16 in the big-endian byte ordering, without BOM
UTF-16-BIG-ENDIAN       UTF-16 in the big-endian byte ordering, including BOM
UTF-16LE                UTF-16 in the little-endian byte ordering, without BOM
UTF-16-LITTLE-ENDIAN    UTF-16 in the little-endian byte ordering, including BOM
UTF-16-SWAPPED          UTF-16 with endianness opposite to that of the local platform, without BOM
UTF-32                  One 32-bit unit per character, in the platform's default byte ordering, including BOM
UTF-32-INTERNAL         UTF-32 in the platform's default byte ordering, without BOM
UTF-32BE                UTF-32 in the big-endian byte ordering, without BOM
UTF-32-BIG-ENDIAN       UTF-32 in the big-endian byte ordering, including BOM
UTF-32-SWAPPED          UTF-32 with endianness opposite to that of the local platform, without BOM
UTF-32LE                UTF-32 in the little-endian byte ordering, without BOM
UTF-32-LITTLE-ENDIAN    UTF-32 in the little-endian byte ordering, including BOM
UCS-2                   One 16-bit unit for U+0000-U+D7FF and U+E000-U+FFFF, in the platform's default byte ordering, including BOM
UCS-2-INTERNAL          UCS-2 in the platform's default byte ordering, without BOM
UCS-2BE                 UCS-2 in the big-endian byte ordering, without BOM
UCS-2-BIG-ENDIAN        UCS-2 in the big-endian byte ordering, including BOM
UCS-2LE                 UCS-2 in the little-endian byte ordering, without BOM
UCS-2-LITTLE-ENDIAN     UCS-2 in the little-endian byte ordering, including BOM
UCS-2-SWAPPED           UCS-2 with endianness opposite to that of the local platform, without BOM
UCS-4                   One 32-bit unit per character, in the platform's default byte ordering, including BOM
UCS-4-INTERNAL          UCS-4 in the platform's default byte ordering, without BOM
UCS-4BE                 UCS-4 in the big-endian byte ordering, without BOM
UCS-4-BIG-ENDIAN        UCS-4 in the big-endian byte ordering, including BOM
UCS-4LE                 UCS-4 in the little-endian byte ordering, without BOM
UCS-4-LITTLE-ENDIAN     UCS-4 in the little-endian byte ordering, including BOM
UCS-4-SWAPPED           UCS-4 with endianness opposite to that of the local platform, without BOM
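
The split between one and two 16-bit units in UTF-16 can be illustrated with a short Python sketch. It is not part of the iconv implementation, and the function name utf16_code_units is hypothetical; it only shows the standard surrogate-pair arithmetic for code points above U+FFFF.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (as integers) for one Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        return [code_point]               # U+0000-U+D7FF, U+E000-U+FFFF: one unit
    offset = code_point - 0x10000         # 20-bit value, split into two 10-bit halves
    high = 0xD800 + (offset >> 10)        # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)       # low (trailing) surrogate
    return [high, low]


print([hex(u) for u in utf16_code_units(0x20AC)])    # ['0x20ac']
print([hex(u) for u in utf16_code_units(0x1F600)])   # ['0xd83d', '0xde00']
```

UCS-2 has no surrogate mechanism, which is why its row in the table covers only U+0000-U+D7FF and U+E000-U+FFFF.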

UCS, or Universal Character Set, refers to the ISO/IEC 10646 family of standards, whose character set is identical to that of Unicode.

The Byte Order Mark, also known as the BOM (U+FEFF), is a special character at the beginning of a file or character stream that denotes the byte order of the subsequent characters. UCS-2, UTF-16, UTF-32, and UCS-4 files and character streams usually start with a BOM character to indicate the byte ordering used in the file or character stream.
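
As an illustration of how a BOM denotes byte order, the following Python sketch inspects the first bytes of a stream. The function name sniff_bom and the description strings are hypothetical; the byte patterns come from the constants in Python's standard codecs module, not from iconv.

```python
import codecs

# Serialized forms of U+FEFF. The 4-byte patterns are checked first because
# the 32-bit little-endian BOM starts with the same two bytes as the 16-bit one.
BOMS = [
    (codecs.BOM_UTF32_BE, "32-bit, big-endian"),      # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "32-bit, little-endian"),   # FF FE 00 00
    (codecs.BOM_UTF16_BE, "16-bit, big-endian"),      # FE FF
    (codecs.BOM_UTF16_LE, "16-bit, little-endian"),   # FF FE
]


def sniff_bom(data: bytes):
    """Return a description of the leading BOM, or None if there is none."""
    for bom, description in BOMS:
        if data.startswith(bom):
            return description
    return None


print(sniff_bom(b"\xfe\xff\x00A"))   # 16-bit, big-endian
print(sniff_bom(b"A\x00"))           # None
```

A 16-bit little-endian stream whose first character after the BOM is U+0000 would be reported as 32-bit by this sketch. A real converter avoids the ambiguity because the code unit width is already known from the codeset name.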

UTF-8 to UTF-8 conversion simply moves bytes from the input buffer to the output buffer without doing any conversion. During the move, the bytes are checked for illegal characters to screen out any potentially harmful byte sequences. Such illegal characters cause the conversion to fail.
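
The copy-with-validation behavior can be approximated in Python as follows. This is only a sketch: the function name utf8_passthrough is hypothetical, and it assumes that Python's strict UTF-8 decoder rejects roughly the same ill-formed sequences (overlong forms, encoded surrogates, values above U+10FFFF) that the iconv check does.

```python
def utf8_passthrough(data: bytes) -> bytes:
    """Copy UTF-8 input to output unchanged, failing on ill-formed sequences."""
    try:
        # Decoding is used purely as a validity check; the result is discarded.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"illegal UTF-8 sequence at byte offset {exc.start}"
        ) from exc
    return bytes(data)   # the output bytes are identical to the input bytes


print(utf8_passthrough("né".encode("utf-8")))   # b'n\xc3\xa9'
```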

UTF-7, a legacy 7-bit Unicode Transformation Format, is only supported by iconv conversions to and from UTF-8, UCS-2 and UCS-4.

UTF-EBCDIC, a legacy EBCDIC-compatible variant of UTF-8, is only supported by iconv conversions to and from UTF-8.

Notes

iconv also supports conversion between Unicode encodings and many different codesets. The list of such codesets includes for example the ISO 8859 character sets, EBCDIC code pages, EUC (Extended Unix Code) and ISO 2022 encodings for Chinese, Japanese, Korean, and many others (see iconv_extra(7), iconv_ja(7), iconv_ko(7), iconv_zh(7), iconv_zh_HK(7), and iconv_zh_TW(7)).

If a source character code value cannot be mapped to a valid character in the target codeset, it is considered either an illegal or a non-identical character. In the absence of explicit information about the source character code value, iconv code conversions use the following rules to determine whether a character is illegal or non-identical:

If the source character code value is not within a range defined by the source codeset standard, it is considered an illegal character. If the source character code value is within the range(s) defined by the standard, it is considered non-identical, even if the source character code value maps to an undefined or a reserved location within the valid range. A non-identical character maps to ? (0x3f in ASCII-compatible codesets) if the target codeset is a non-Unicode codeset, or to the Unicode replacement character (U+FFFD) if the target codeset is a Unicode codeset.
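
The rules above can be sketched in Python with a toy source codeset. Everything in this sketch is hypothetical: the valid range, the three-entry mapping table, and the function name convert_one are invented for illustration, and the failure on an illegal character is modeled as an exception.

```python
REPLACEMENT_NON_UNICODE = "?"        # 0x3f in ASCII-compatible codesets
REPLACEMENT_UNICODE = "\ufffd"       # Unicode replacement character

# Hypothetical single-byte source codeset: values 0x00-0x7F are within the
# range defined by its standard, but only three of them have a mapping.
VALID_RANGE = range(0x00, 0x80)
MAPPING = {0x41: "A", 0x42: "B", 0x43: "C"}


def convert_one(value: int, target_is_unicode: bool) -> str:
    """Convert one source code value according to the rules in the text."""
    if value not in VALID_RANGE:
        # Outside the range defined by the source codeset: illegal character.
        raise ValueError(f"illegal character 0x{value:02X}")
    if value in MAPPING:
        return MAPPING[value]
    # Within the valid range but unmapped: non-identical character.
    return REPLACEMENT_UNICODE if target_is_unicode else REPLACEMENT_NON_UNICODE


print(convert_one(0x41, target_is_unicode=True))          # A
print(convert_one(0x7E, target_is_unicode=False))         # ?
print(repr(convert_one(0x7E, target_is_unicode=True)))    # '\ufffd'
```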

When the BOM is present as the first character in an encoding that supports it, it determines how the Unicode character sequences that follow are interpreted. If the BOM character (U+FEFF) appears anywhere other than as the first character in such an encoding, or appears in a Unicode encoding that does not support the BOM, it is interpreted as the Zero Width No-Break Space (ZWNBSP) character and does not affect the byte ordering used to interpret the Unicode characters.

When the target codeset is UCS-2, UTF-16, UTF-32, UCS-4, UCS-2-BIG-ENDIAN, UCS-2-LITTLE-ENDIAN, UTF-16-BIG-ENDIAN, UTF-16-LITTLE-ENDIAN, UCS-4-BIG-ENDIAN, UCS-4-LITTLE-ENDIAN, UTF-32-BIG-ENDIAN, or UTF-32-LITTLE-ENDIAN, expect a BOM character at the beginning of the iconv code conversion output buffer.

When the source codeset is UCS-2, UTF-16, UTF-32, or UCS-4 and no BOM is present as the first input character, the byte ordering of the current system is assumed for the input byte stream given to the iconv code conversion.
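
The input-side handling for a UTF-16 source can be sketched in Python as follows. This is an illustration of the rules in these notes, not the iconv implementation; the function name decode_utf16_input is hypothetical.

```python
import codecs
import sys


def decode_utf16_input(data: bytes) -> str:
    """Decode UTF-16 input: honor a leading BOM, else assume native order."""
    if data.startswith(codecs.BOM_UTF16_BE):
        order, body = "big", data[2:]        # BOM consumed, big-endian follows
    elif data.startswith(codecs.BOM_UTF16_LE):
        order, body = "little", data[2:]     # BOM consumed, little-endian follows
    else:
        order, body = sys.byteorder, data    # no BOM: byte order of this system
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    # The explicit-endian codecs do not treat U+FEFF specially, so a U+FEFF
    # later in the stream survives as an ordinary ZWNBSP character.
    return body.decode(codec)


# Big-endian BOM, "A", an interior U+FEFF, then "B".
print(repr(decode_utf16_input(b"\xfe\xff\x00A\xfe\xff\x00B")))   # 'A\ufeffB'
```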

Examples

Example 1 The iconv Library Module Filename

In the conversion library, /usr/lib/iconv (see iconv(3C)), the library module filename is composed of two symbolic elements separated by the percent sign (%). The first symbol specifies the source codeset, i.e. the codeset that is being converted; the second symbol specifies the target codeset, i.e. the codeset to which the first one is being converted.

For example, the library module filename to convert from the legacy UTF-7 codeset to the UTF-8 codeset is UTF-7%UTF-8.so.
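
The naming rule can be expressed as a small Python helper. The function name iconv_module_path is hypothetical, and the sketch assumes that the codeset names passed in are already in the form used for module filenames (it performs no alias lookup).

```python
import posixpath

ICONV_DIR = "/usr/lib/iconv"   # conversion library directory named in the text


def iconv_module_path(fromcode: str, tocode: str) -> str:
    """Build the module path: source codeset, percent sign, target codeset, .so."""
    return posixpath.join(ICONV_DIR, f"{fromcode}%{tocode}.so")


print(iconv_module_path("UTF-7", "UTF-8"))   # /usr/lib/iconv/UTF-7%UTF-8.so
```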

Example 2 The cconv Library Module Filename

For some conversions, iconv(3C) calls the cconv(3C) interfaces to perform the conversion. The cconv conversion modules are binary tables with a .bt suffix, generated by geniconvtbl(1) and placed in the same /usr/lib/iconv library. The cconv library module filename is composed of the symbolic elements for the source and target codesets separated by the plus sign (+). The cconv conversion is typically performed in two steps, with UTF-32 as the intermediate encoding.

For example, the cconv library module filename to convert from the Japanese EUC codeset to the UTF-32 codeset is eucJP+UTF-32.bt.
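
The cconv naming rule and the two-step conversion can be sketched in Python as follows. The function names are hypothetical. The name of the second-step table (from UTF-32 to the final target) is inferred from the naming rule alone, so treat it as an assumption; the sketch does not check whether such a table exists on a given system.

```python
def cconv_table_name(fromcode: str, tocode: str) -> str:
    """Build the table filename: source codeset, plus sign, target codeset, .bt."""
    return f"{fromcode}+{tocode}.bt"


def cconv_two_step(fromcode: str, tocode: str, pivot: str = "UTF-32") -> list:
    """List the tables for a conversion that goes through the pivot encoding."""
    if pivot in (fromcode, tocode):
        # One side already is the intermediate encoding: a single table suffices.
        return [cconv_table_name(fromcode, tocode)]
    return [cconv_table_name(fromcode, pivot), cconv_table_name(pivot, tocode)]


print(cconv_table_name("eucJP", "UTF-32"))   # eucJP+UTF-32.bt
print(cconv_two_step("eucJP", "UTF-8"))      # ['eucJP+UTF-32.bt', 'UTF-32+UTF-8.bt']
```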

Files

/usr/lib/iconv/*.so

iconv conversion modules

/usr/lib/iconv/*.bt

cconv code conversion binary tables for iconv(1), cconv(3C), and iconv(3C)

/usr/lib/iconv/geniconvtbl/binarytables/*.bt

geniconvtbl conversion binary tables

/usr/lib/iconv/alias

Alias table file of codeset names

See Also

geniconvtbl(1), iconv(1), cconv(3C), cconv_close(3C), cconv_open(3C), cconvctl(3C), iconv(3C), iconv_close(3C), iconv_open(3C), iconvctl(3C), alias(5), geniconvtbl-cconv(5), iconv_extra(7), iconv_ja(7), iconv_ko(7), iconv_zh(7), iconv_zh_HK(7), iconv_zh_TW(7)

The Unicode Consortium. The Unicode Standard, Version 6.2.0, (Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-07-8)

Yergeau, F., UTF-8, a transformation format of Unicode and ISO 10646, RFC 2044, Alis Technologies, October 1996.

Ohta, M., Character Sets ISO-10646 and ISO-10646-J-1, RFC 1815, Tokyo Institute of Technology, July 1995.

Simonsen, K., Character Mnemonics & Character Sets, RFC 1345, Rationel Almen Planlaegning, June 1992.

Goldsmith, D., and M. Davis, UTF-7 - A Mail-Safe Transformation Format of Unicode, RFC 1642, Taligent, Inc., July 1994.