Internationalizing and Localizing Applications in Oracle Solaris

Updated: July 2014
 
 

Converting Codesets

In computer systems, characters are represented as unique scalar values, which are handled as bytes or byte sequences. A coded character set, or codeset, is a character set together with the mappings between its characters and the corresponding unique scalar values. For example, 646 (also called US-ASCII) is the codeset defined by the ISO/IEC 646:1991 standard for the basic Latin alphabet. The following table shows examples of codesets:

Codeset       Character             Representation
US-ASCII      A                     0x41
ISO 8859-2    Č                     0xC8
EUC-KR        Full-width Latin A    0xA3 0xC1
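
The mappings in the table can be checked with a short script. The following is a minimal sketch in Python that uses Python's built-in codecs rather than the Oracle Solaris iconv() interfaces. The codec names ascii, iso8859_2, and euc_kr are Python's names for these codesets, and the helper representation() is a hypothetical name introduced only for this illustration.

```python
def representation(char: str, codeset: str) -> str:
    """Return the byte representation of one character in the given codeset."""
    encoded = char.encode(codeset)
    return " ".join(f"0x{byte:02X}" for byte in encoded)


# Python codec names (an assumption of this sketch, not Solaris codeset names).
examples = [
    ("ascii", "A"),           # US-ASCII
    ("iso8859_2", "\u010C"),  # LATIN CAPITAL LETTER C WITH CARON
    ("euc_kr", "\uFF21"),     # FULLWIDTH LATIN CAPITAL LETTER A
]

for codeset, char in examples:
    print(codeset, representation(char, codeset))
```

The same character can therefore have a different byte representation, or no representation at all, depending on the codeset that is used.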

The Unicode standard adds another layer and maps each character to a code point, which is a number between 0 and 1,114,111 (0x10FFFF). This number is represented differently in each of the Unicode encoding forms, such as UTF-8, UTF-16, or UTF-32. For example:

Codeset    Character                           Code point          Encoding    Representation
Unicode    FULLWIDTH LATIN CAPITAL LETTER A    65,313 or 0xFF21    UTF-8       0xEF 0xBC 0xA1
                                                                   UTF-16LE    0x21 0xFF
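
The relationship between a code point and its encoded forms can be reproduced with a few lines of Python. This is only an illustrative sketch: the function utf_forms() is a hypothetical helper, and the codec names are those of Python's standard library, not of the Solaris iconv() conversions.

```python
def utf_forms(char: str) -> dict:
    """Map a single character to its code point and encoded byte sequences."""
    return {
        "code point": ord(char),
        "UTF-8": char.encode("utf-8"),
        "UTF-16LE": char.encode("utf-16-le"),
        "UTF-32LE": char.encode("utf-32-le"),
    }


forms = utf_forms("\uFF21")  # FULLWIDTH LATIN CAPITAL LETTER A
for name, value in forms.items():
    if isinstance(value, bytes):
        value = " ".join(f"0x{byte:02X}" for byte in value)
    print(f"{name}: {value}")
```

The code point stays the same in every case; only the byte sequence that carries it changes with the encoding form.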

Note -  A codeset is also referred to as an encoding. Although there is a distinction between the two terms, codeset and encoding are used interchangeably.

Code conversion, or codeset conversion, means converting the byte or byte sequence representation of characters from one codeset to another codeset. A common approach to conversion is to use the iconv() family of functions. The following terms are used in the context of code conversion and the iconv() functions:

Single-byte codeset

Codeset that maps each character to a value in the range 0 to 255 (0x00 to 0xFF), so that every character is represented by a single byte.

Multibyte codeset

Codeset that maps some or all of the characters to more than one byte.

Illegal character

Byte or byte sequence in the input that does not represent a valid character in the input codeset.

Shift sequence

Special sequence of bytes in a multibyte codeset that does not map to a character but instead is a means of changing the state of the decoder.

Incomplete character

Sequence of bytes that does not yet form a complete character in the input codeset but that might form a valid character on a subsequent call to the conversion function, such as the iconv() function, when additional bytes are provided from the input. Incomplete characters are common when a multibyte stream is converted.

Non-identical character

Character that is valid in the input codeset but for which an identical character does not exist in the output codeset.

Non-identical conversion

Conversion of a non-identical character. Depending on the implementation and conversion options, these characters can be omitted in the output or replaced with one or more characters indicating that a non-identical conversion occurred. The Oracle Solaris iconv() function replaces non-identical characters with a question mark ('?') by default.
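
The terms above can be observed in practice with a small experiment. The following sketch uses Python's built-in codecs as a stand-in for the iconv() functions, so it illustrates the concepts rather than the Oracle Solaris implementation. The helper convert() and all variable names are hypothetical, and the question mark produced here comes from Python's errors="replace" handler, which happens to follow the same convention as the default described above.

```python
import codecs


def convert(data: bytes, from_code: str, to_code: str,
            errors: str = "strict") -> bytes:
    """Decode bytes from the input codeset, then encode them in the output
    codeset. The errors handler applies to the encoding step only."""
    return data.decode(from_code).encode(to_code, errors)


# Codeset conversion: the single ISO 8859-2 byte for "Č" becomes two UTF-8 bytes.
converted = convert(b"\xC8", "iso8859_2", "utf-8")  # b"\xC4\x8C"

# Illegal character: the byte 0xFF never occurs in valid UTF-8 input.
try:
    b"\xFF".decode("utf-8")
    illegal_detected = False
except UnicodeDecodeError:
    illegal_detected = True

# Incomplete character: the three UTF-8 bytes of U+FF21 arrive in two chunks.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xEF\xBC")  # "" because the bytes are buffered
second = decoder.decode(b"\xA1")     # "\uFF21" once the last byte arrives

# Non-identical conversion: "Č" has no identical character in US-ASCII.
replaced = convert(b"\xC8", "iso8859_2", "ascii", errors="replace")  # b"?"

# Shift sequence: ISO-2022-JP changes character sets with escape sequences
# that do not map to characters themselves.
shifted = "\u3042".encode("iso2022_jp")  # HIRAGANA LETTER A

print(converted, illegal_detected, repr(first), repr(second), replaced, shifted)
```

An incremental decoder is used for the incomplete character because a one-shot decode of the first chunk would report an error instead of waiting for the remaining byte, which is the same reason iconv() callers keep unconverted input bytes for the next call.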