Converting Codesets

In computer systems, characters are represented as unique scalar values, which are handled as bytes or byte sequences. A coded character set, also called a codeset, is a character set together with the mappings between its characters and the corresponding unique scalar values. For example, 646 (also called US-ASCII) is the codeset defined by the ISO/IEC 646:1991 standard for the basic Latin alphabet. The following table shows examples of codesets:

Codeset	Character	Representation
`US-ASCII`	A	0x41
`ISO 8859-2`	Č	0xC8
`EUC-KR`	Full-width Latin A	0xA3 0xC1
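
The byte representations in the table can be reproduced with a short script. The following is a minimal sketch that uses Python's built-in codecs rather than the iconv() functions. The codec names `ascii`, `iso8859_2`, and `euc_kr` are Python's names for these codesets, and `hex_bytes` is a hypothetical helper introduced only for this illustration.

```python
# Each entry pairs a Python codec name with one character from the table.
samples = [
    ("ascii", "A"),           # LATIN CAPITAL LETTER A
    ("iso8859_2", "\u010c"),  # LATIN CAPITAL LETTER C WITH CARON
    ("euc_kr", "\uff21"),     # FULLWIDTH LATIN CAPITAL LETTER A
]


def hex_bytes(data: bytes) -> str:
    """Format a byte sequence as space-separated hexadecimal values."""
    return " ".join(f"0x{b:02X}" for b in data)


for codec, ch in samples:
    # str.encode() maps the character to its byte representation
    # in the named codeset.
    print(codec, hex_bytes(ch.encode(codec)))
```

The same character can have a different byte representation, or none at all, in another codeset, which is why the codeset must always be known when bytes are interpreted as text.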

The Unicode standard adds another layer and maps each character to a code point, which is a number between 0 and 1,114,111. This number is represented differently in each of the Unicode encoding forms, such as UTF-8, UTF-16, or UTF-32. For example:

Codeset	Character	Code point	Encoding	Representation
Unicode	FULLWIDTH LATIN CAPITAL LETTER A	65,313 or 0xFF21	`UTF-8`	0xEF 0xBC 0xA1
Unicode	FULLWIDTH LATIN CAPITAL LETTER A	65,313 or 0xFF21	`UTF-16LE`	0x21 0xFF
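
The relationship between a code point and its encoding forms can be checked in the same way. This is a small sketch using Python's standard library, not part of the iconv() interface. The variable names are chosen for this illustration only.

```python
import sys
import unicodedata

ch = "\uff21"

# The code point is the number that Unicode assigns to the character.
code_point = ord(ch)
name = unicodedata.name(ch)

# Each encoding form represents the same code point with different bytes.
utf8 = ch.encode("utf-8")
utf16le = ch.encode("utf-16-le")

print(name, code_point, hex(code_point))
print("UTF-8:   ", utf8.hex(" "))
print("UTF-16LE:", utf16le.hex(" "))

# The largest valid code point is 1,114,111 (0x10FFFF).
print("Maximum code point:", sys.maxunicode)
```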

Note - A codeset is also referred to as an encoding. Although there is a distinction between these terms, codeset and encoding are used interchangeably.

Code conversion or codeset conversion means converting the byte or byte sequence representations from one codeset to another codeset. A common approach to conversion is to use the iconv() family of functions. Some of the terms used in the area of code conversion and iconv() functions are as follows:

Single-byte codeset: Codeset that maps the characters to a set of values ranging from 0 to 255, or 0x00 to 0xFF. Therefore, a character is represented in a single byte.
Multibyte codeset: Codeset that maps some or all of the characters to more than one byte.
Illegal character: Byte or byte sequence that does not represent a valid character in the input codeset.
Shift sequence: Special sequence of bytes in a multibyte codeset that does not map to a character but instead is a means of changing the state of the decoder.
Incomplete character: Sequence of bytes that does not form a valid character in an input codeset. However, it may form a valid character on a subsequent call to the conversion function, such as the iconv() function, when additional bytes are provided from the input. This is common when converting a multibyte stream.
Non-identical character: Character that is valid in the input codeset but for which an identical character does not exist in the output codeset.
Non-identical conversion: Conversion of a non-identical character. Depending on the implementation and conversion options, these characters can be omitted in the output or replaced with one or more characters indicating that a non-identical conversion occurred. The Oracle Solaris iconv() function replaces non-identical characters with a question mark ('?') by default.
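
Several of the terms above can be illustrated with a short sketch. The example uses Python's codec machinery as a stand-in for the iconv() functions, so it shows the concepts rather than the behavior of any particular iconv() implementation. The `convert` function is a hypothetical helper, and the `errors="replace"` option is Python's own mechanism, which happens to substitute a question mark when encoding a character that the output codeset lacks.

```python
import codecs


def convert(data: bytes, from_codeset: str, to_codeset: str,
            errors: str = "strict") -> bytes:
    """Convert bytes from one codeset to another (hypothetical helper)."""
    text = data.decode(from_codeset)
    return text.encode(to_codeset, errors=errors)


# Code conversion: FULLWIDTH LATIN CAPITAL LETTER A from EUC-KR to UTF-8.
print(convert(b"\xa3\xc1", "euc_kr", "utf-8").hex(" "))

# Non-identical conversion: C with caron exists in ISO 8859-2 but not in
# US-ASCII, so Python's "replace" handler emits a question mark.
print(convert(b"\xc8", "iso8859_2", "ascii", errors="replace"))

# Illegal character: the byte 0xFF can never appear in valid UTF-8.
try:
    convert(b"\xff", "utf-8", "ascii")
except UnicodeDecodeError as exc:
    print("illegal input:", exc.reason)

# Incomplete character: the first two bytes of a three-byte UTF-8 sequence
# produce no output until the final byte arrives on a later call.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xef\xbc")
second = decoder.decode(b"\xa1")
print(repr(first), repr(second))
```

An incremental decoder keeps the pending bytes between calls, which mirrors how a conversion loop must carry unconsumed input forward when it reads a multibyte stream in fixed-size chunks.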