In computer systems, characters are represented as unique scalar values, which are handled as bytes or byte sequences. A coded character set, also called a codeset, is a character set together with the mapping between each character and its corresponding unique scalar value. For example, 646 (also called US-ASCII) is a codeset defined by the ISO/IEC 646:1991 standard for the basic Latin alphabet. The following table shows other examples of codesets:
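The relationship between characters and their byte values in a codeset can be sketched in a few lines of Python. This is only an illustration that uses Python's built-in codecs. The helper name byte_values, the codec names, and the sample characters are assumptions chosen for this example.

```python
def byte_values(char: str, codeset: str) -> str:
    """Return the byte(s) that `codeset` assigns to `char`, as hex text."""
    return " ".join(f"{b:02x}" for b in char.encode(codeset))

# US-ASCII (646) covers the basic Latin alphabet.
print(byte_values("A", "ascii"))             # 41

# ISO 8859-1 and ISO 8859-15 add accented letters and symbols.
print(byte_values("\u00e9", "iso-8859-1"))   # e9  (LATIN SMALL LETTER E WITH ACUTE)
print(byte_values("\u20ac", "iso-8859-15"))  # a4  (EURO SIGN)

# The same character can have a different value in another codeset.
print(byte_values("\u20ac", "cp1252"))       # 80

# A character that a codeset does not contain cannot be encoded in it.
try:
    byte_values("\u20ac", "ascii")
except UnicodeEncodeError as err:
    print("EURO SIGN is not in US-ASCII:", err.reason)
```

The euro sign shows why the codeset must be known before bytes can be interpreted: the byte 0xA4 means a euro sign in ISO 8859-15 but a different character in ISO 8859-1.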
The Unicode standard adds another layer and maps each character to a code point, which is a number between 0 and 1,114,111 (0x10FFFF). This number is represented differently in each of the Unicode encoding forms, such as UTF-8, UTF-16, or UTF-32. For example:
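To make the difference between a code point and its encoded bytes concrete, the following Python sketch prints one code point in three encoding forms. The helper name encoding_forms and the sample characters are assumptions for illustration. Big-endian byte order without a byte order mark is chosen only to keep the output easy to read.

```python
MAX_CODE_POINT = 0x10FFFF  # 1,114,111, the largest Unicode code point

def encoding_forms(char: str) -> dict:
    """Show a single code point in three Unicode encoding forms."""
    def as_hex(data: bytes) -> str:
        return " ".join(f"{b:02x}" for b in data)

    return {
        "code point": f"U+{ord(char):04X}",
        "UTF-8": as_hex(char.encode("utf-8")),
        "UTF-16": as_hex(char.encode("utf-16-be")),
        "UTF-32": as_hex(char.encode("utf-32-be")),
    }

# LATIN CAPITAL LETTER A, EURO SIGN, and a code point above U+FFFF.
for ch in ("A", "\u20ac", "\U0001f600"):
    print(encoding_forms(ch))
```

For U+20AC, this prints e2 82 ac for UTF-8, 20 ac for UTF-16, and 00 00 20 ac for UTF-32. The code point above U+FFFF needs four bytes in UTF-8 and a surrogate pair (d8 3d de 00) in UTF-16.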
Code conversion, or codeset conversion, means converting the byte or byte-sequence representation of characters from one codeset to another. A common approach to conversion is to use the iconv() family of functions. Some of the terms used in the context of code conversion and the iconv() functions are as follows:
Single-byte codeset: Codeset that maps each character to a value in the range 0 to 255 (0x00 to 0xFF), so that every character is represented in a single byte.
Multibyte codeset: Codeset that maps some or all of its characters to more than one byte.
Illegal character: Byte or byte sequence that does not represent a valid character in the input codeset.
Shift sequence: Special sequence of bytes in a multibyte codeset that does not map to a character but instead changes the state of the decoder.
Incomplete character: Sequence of bytes that does not form a valid character in the input codeset on its own but might form one on a subsequent call to the conversion function, such as iconv(), when additional bytes are provided from the input. This is common when converting a multibyte stream.
Non-identical character: Character that is valid in the input codeset but for which an identical character does not exist in the output codeset.
Non-identical conversion: Conversion of a non-identical character. Depending on the implementation and conversion options, such characters can be omitted from the output or replaced with one or more characters that indicate that a non-identical conversion occurred. The Oracle Solaris iconv() function replaces non-identical characters with a question mark ('?') by default.
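The situations described above can be reproduced with a short Python sketch. It uses Python's built-in codecs as a stand-in for iconv(), so it illustrates the concepts and does not show the iconv() API. The sample strings and codec names are assumptions. The '?' replacement is requested explicitly through Python's errors="replace" option, which only resembles the Oracle Solaris default described above.

```python
import codecs

# Plain conversion: decode from the input codeset, encode to the output codeset.
latin1_bytes = b"caf\xe9"                     # "cafe" with e-acute, ISO 8859-1
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
print(utf8_bytes)                             # b'caf\xc3\xa9' (two bytes for e-acute)

# Illegal input: the byte 0xFF never occurs in well-formed UTF-8.
try:
    b"abc\xff".decode("utf-8")
except UnicodeDecodeError as err:
    print("illegal byte at offset", err.start)   # offset 3

# Incomplete input: the buffer ends in the middle of a multibyte sequence,
# and the missing byte arrives with the next chunk.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"caf\xc3")            # 0xC3 is held back, not rejected
second = decoder.decode(b"\xa9", final=True)  # the sequence is now complete
print(repr(first), repr(second))

# Shift sequence: ISO-2022-JP switches character sets with escape sequences
# that change the decoder state without producing a character themselves.
shifted = "\u3042".encode("iso2022_jp")       # HIRAGANA LETTER A
print(shifted)

# Non-identical conversion: e-acute has no counterpart in US-ASCII, so it is
# either replaced with a marker character or omitted from the output.
replaced = "caf\u00e9".encode("ascii", errors="replace")
omitted = "caf\u00e9".encode("ascii", errors="ignore")
print(replaced, omitted)                      # b'caf?' b'caf'
```

In POSIX iconv() terms, the illegal case corresponds to the EILSEQ error and the incomplete case to EINVAL. A caller that converts a stream typically treats EINVAL as a signal to read more input and retry with the leftover bytes.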