Converting Codesets

In computer systems, characters are represented as unique scalar values, which are handled as bytes or byte sequences. A coded character set, or codeset, is a character set together with the mapping between its characters and their unique scalar values. For example, 646 (also called US-ASCII) is the codeset defined by the ISO/IEC 646:1991 standard for the basic Latin alphabet. The following table shows examples of codesets:

Codeset	Character	Representation
`US-ASCII`	A	0x41
`ISO 8859-2`	Č	0xC8
`EUC-KR`	Full-width Latin A	0xA3 0xC1

The Unicode standard adds another layer and maps each character to a code point, which is a number between 0 and 1,114,111. This number is represented differently in each of the Unicode encoding forms, such as UTF-8, UTF-16, or UTF-32. For example:

Codeset	Character	Code point	Encoding	Representation
Unicode	FULLWIDTH LATIN CAPITAL LETTER A	65,313 or 0xFF21	`UTF-8`	0xEF 0xBC 0xA1
Unicode	FULLWIDTH LATIN CAPITAL LETTER A	65,313 or 0xFF21	`UTF-16LE`	0x21 0xFF
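
The rows of the table above can be reproduced with a short sketch of the UTF-8 and UTF-16LE encoding forms. The function names encode_utf8() and encode_utf16le() are hypothetical and chosen for illustration only; they are not part of any Oracle Solaris library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if cp is a Unicode scalar value (not a surrogate, <= 0x10FFFF). */
static int
is_scalar(uint32_t cp)
{
	return (cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
}

/*
 * Encode one code point as UTF-8. out must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is not a scalar value.
 */
size_t
encode_utf8(uint32_t cp, unsigned char *out)
{
	assert(out != NULL);
	if (!is_scalar(cp))
		return (0);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return (1);
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return (2);
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return (3);
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return (4);
}

/*
 * Encode one code point as UTF-16LE (low byte first). out must hold at
 * least 4 bytes. Returns 2 or 4, or 0 if cp is not a scalar value.
 */
size_t
encode_utf16le(uint32_t cp, unsigned char *out)
{
	uint32_t hi, lo;

	assert(out != NULL);
	if (!is_scalar(cp))
		return (0);
	if (cp < 0x10000) {
		out[0] = (unsigned char)(cp & 0xFF);
		out[1] = (unsigned char)(cp >> 8);
		return (2);
	}
	/* Code points above 0xFFFF are split into a surrogate pair. */
	cp -= 0x10000;
	hi = 0xD800 | (cp >> 10);
	lo = 0xDC00 | (cp & 0x3FF);
	out[0] = (unsigned char)(hi & 0xFF);
	out[1] = (unsigned char)(hi >> 8);
	out[2] = (unsigned char)(lo & 0xFF);
	out[3] = (unsigned char)(lo >> 8);
	return (4);
}
```

Calling encode_utf8(0xFF21, buf) fills buf with 0xEF 0xBC 0xA1, and encode_utf16le(0xFF21, buf) fills it with 0x21 0xFF, matching the two table rows.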

Note - A codeset is also referred to as an encoding. Even though there is a distinction between these terms, codeset and encoding are used interchangeably.

Code conversion or codeset conversion means converting the byte or byte sequence representations from one codeset to another codeset. A common approach to conversion is to use the iconv() family of functions. Some of the terms used in the area of code conversion and iconv() functions are as follows:

Single-byte codeset: Codeset that maps the characters to a set of values ranging from 0 to 255, or 0x00 to 0xFF. Therefore, a character is represented in a single byte.
Multibyte codeset: Codeset that maps some or all of the characters to more than one byte.
Illegal character: Invalid character in an input codeset.
Shift sequence: Special sequence of bytes in a multibyte codeset that does not map to a character but instead is a means of changing the state of the decoder.
Incomplete character: Sequence of bytes that does not form a valid character in an input codeset. However, it may form a valid character on a subsequent call to the conversion function, such as the iconv() function, when additional bytes are provided from the input. This is common when converting a multibyte stream.
Non-identical character: Character that is valid in the input codeset but for which an identical character does not exist in the output codeset.
Non-identical conversion: Conversion of a non-identical character. Depending on the implementation and conversion options, these characters can be omitted in the output or replaced with one or more characters indicating that a non-identical conversion occurred. The Oracle Solaris iconv() function replaces non-identical characters with a question mark ('?') by default.
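
The difference between an illegal character and an incomplete character can be illustrated with UTF-8 as the input codeset. The following is a minimal sketch, not the logic used by any iconv module: the names seq_status and classify_utf8() are hypothetical, and the check is deliberately simplified (it inspects only lead and continuation bytes and does not reject overlong three-byte or four-byte forms or encoded surrogates).

```c
#include <assert.h>
#include <stddef.h>

enum seq_status { SEQ_COMPLETE, SEQ_INCOMPLETE, SEQ_ILLEGAL };

/*
 * Classify the first character in buf[0..len-1], treated as UTF-8.
 * SEQ_INCOMPLETE means that the bytes seen so far are valid but more
 * input is needed; SEQ_ILLEGAL means that no further input can make
 * the sequence valid.
 */
enum seq_status
classify_utf8(const unsigned char *buf, size_t len)
{
	size_t need, i;

	assert(buf != NULL && len > 0);

	/* The lead byte determines how many bytes the character needs. */
	if (buf[0] < 0x80)
		need = 1;
	else if (buf[0] >= 0xC2 && buf[0] <= 0xDF)
		need = 2;
	else if ((buf[0] & 0xF0) == 0xE0)
		need = 3;
	else if (buf[0] >= 0xF0 && buf[0] <= 0xF4)
		need = 4;
	else
		return (SEQ_ILLEGAL);	/* stray continuation or bad lead byte */

	/* Every available trailing byte must be a continuation byte. */
	for (i = 1; i < need && i < len; i++) {
		if ((buf[i] & 0xC0) != 0x80)
			return (SEQ_ILLEGAL);
	}
	return (len < need ? SEQ_INCOMPLETE : SEQ_COMPLETE);
}
```

For example, the single byte 0xC5 is an incomplete character because a following 0x82 would complete it, whereas a leading 0x82 is illegal. The iconv() function reports these two conditions through errno as EINVAL and EILSEQ respectively.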

Converting Codesets by Using `iconv` Functions

The iconv() functions available in the libc library for code conversion are described as follows:

iconv_open(): Code conversion allocation function
iconv(): Code conversion function
iconv_close(): Code conversion deallocation function
iconvctl(): Control and query the code conversion behavior
iconvstr(): String-based code conversion function

iconv() functions enable the code conversion of characters or a sequence of characters from one codeset to another. The iconv_open() function supports various codesets. You can display information about supported codesets and their aliases currently available on a system by running the following command:

$ iconv -l

You can extend the default list of available conversions by installing additional packages. The default installation includes the system/library/iconv/iconv-core package, which covers the basic set of iconv tables and modules for conversions among UTF-8, Unicode, and selected codesets. The additional package system/library/iconv provides the other conversions.

Codeset conversion with the iconv() functions relies on three types of files:

cconv code conversion binary table files: /usr/lib/iconv/*.bt
iconv code conversion modules: /usr/lib/iconv/*.so
iconv code conversion binary tables: /usr/lib/iconv/geniconvtbl/binarytables/*.bt

Not every pair of codesets listed by the iconv -l command can be converted directly. If all the iconv packages are installed and a required module is still not available, you can do a two-step conversion that uses a Unicode encoding, for example UTF-32, as an intermediary codeset. Alternatively, you can develop a custom conversion table by using the geniconvtbl utility. For information about the input file format for this utility, see the geniconvtbl(5) and geniconvtbl-cconv(5) man pages.
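
A two-step conversion through an intermediary codeset might look like the following sketch. The helper names convert_once() and convert_via_utf32() are hypothetical. The sketch assumes that the codeset name UTF-32BE is available on the system (a fixed byte order is used so that no byte order mark is written into the intermediate buffer) and that the input is short enough for the fixed-size intermediate buffer.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes at in from codeset "from" to codeset "to".
 * Returns the number of bytes written to out, or (size_t)-1 on failure.
 */
static size_t
convert_once(const char *to, const char *from, const char *in,
    size_t inlen, char *out, size_t outsize)
{
	iconv_t cd;
	char *inp = (char *)in;		/* iconv() does not modify the input */
	char *outp = out;
	size_t outleft = outsize;
	size_t ret;

	cd = iconv_open(to, from);
	if (cd == (iconv_t)-1)
		return ((size_t)-1);
	ret = iconv(cd, &inp, &inlen, &outp, &outleft);
	if (ret != (size_t)-1) {
		/* Flush any pending shift sequence for stateful codesets. */
		ret = iconv(cd, NULL, NULL, &outp, &outleft);
	}
	(void) iconv_close(cd);
	return (ret == (size_t)-1 ? (size_t)-1 : outsize - outleft);
}

/*
 * Convert from "from" to "to" in two steps, using UTF-32BE as the
 * intermediary codeset. Fails if the input needs more than sizeof (mid)
 * bytes in UTF-32 (4 bytes per character).
 */
size_t
convert_via_utf32(const char *to, const char *from, const char *in,
    size_t inlen, char *out, size_t outsize)
{
	char mid[1024];
	size_t midlen;

	assert(in != NULL && out != NULL);
	midlen = convert_once("UTF-32BE", from, in, inlen, mid, sizeof (mid));
	if (midlen == (size_t)-1)
		return ((size_t)-1);
	return (convert_once(to, "UTF-32BE", mid, midlen, out, outsize));
}
```

A call such as convert_via_utf32("UTF-8", "ISO8859-2", inbuf, inlen, outbuf, sizeof (outbuf)) returns the number of bytes stored in outbuf. The result is not null-terminated, so the caller must add a terminator if a character string is needed.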

Example 11 Creating Conversion Descriptor Using iconv_open()

For more information, see also Per-character Sequence-based Code.

The following code fragment shows how to use the iconv_open() function for converting the string złoty (currency of Poland) from the single-byte ISO 8859-2 codeset to UTF-8. In order to perform the conversion with iconv, you need to create a conversion descriptor with a call to the iconv_open() function and verify that the call was successful.

#include <iconv.h>
#include <stdio.h>
iconv_t cd;
 :
cd = iconv_open("UTF-8", "ISO8859-2");
if (cd == (iconv_t)-1) {
    (void) fprintf(stderr, "iconv_open() failed\n");
    return (1);
}

The target codeset is the first argument to the iconv_open() function.

Example 12 Conversion Using iconv()

The following code fragment shows how to convert one codeset to another using the iconv() function.

Before the actual conversion, certain variables need to be in place to hold the information returned by the iconv call, such as the output buffer, number of bytes left in the input and output buffers, and so on.

In the ISO 8859-2 codeset, the L WITH STROKE character is represented as 0xB3. Therefore, for illustration purposes, the input buffer (inbuf), which holds the input string, is set to z\xB3oty. In a real application, the contents of inbuf would typically come from reading a stream or a file.

#include <iconv.h>
#include <stdio.h>
#include <errno.h>

    :
size_t     ret;       /* iconv() returns size_t; (size_t)-1 signals an error */
char       *inbuf;
size_t     inbytesleft, outbytesleft;
char       outbuf[BUFSIZ];
char       *outbuf_p;

inbuf = "z\xB3oty";
inbytesleft = 5;      /* the size of the input string */

For the output buffer to hold the converted string, at least 6 bytes are needed. The L WITH STROKE character is converted to Unicode character LATIN SMALL LETTER L WITH STROKE, represented as a two-byte sequence 0xC5 0x82 in UTF-8.

Because the actual size of the resulting string is usually not known before the conversion, allocate the output buffer with enough extra space. The BUFSIZ macro defined in stdio.h is sufficient in this case.

outbytesleft = BUFSIZ;   
outbuf_p = outbuf;

This conversion call uses the conversion descriptor cd from the previous example.

ret = iconv(cd, &inbuf, &inbytesleft, &outbuf_p, &outbytesleft);

After the call to iconv, you need to check whether it succeeded. If it was successful and there is still space in the output buffer, you need to terminate the string with a null character.

  if (ret != (size_t)-1) {
          if (outbytesleft == 0) {
                  /* Cannot terminate outbuf as a character string; error return */
                  return (-1);
          }
          /* success */
          *outbuf_p = '\0';
            :
  }

If the call is successful, outbuf contains the string in the UTF-8 codeset: \x7a\xc5\x82\x6f\x74\x79 in hexadecimal notation, or z\xc5\x82oty. inbuf now points to the end of the input string, and inbytesleft is 0. outbytesleft has been decremented by 6, which is the number of bytes written to the output buffer, and outbuf_p points to the end of the output string in outbuf.

If the call fails, check the errno value to handle the error cases as shown in the following code fragment:

  if (ret == (size_t)-1) {
          if (errno == EILSEQ) {
                  /* Input conversion stopped due to an input byte that
                   * does not belong to the input codeset.
                   */
                    :
          } else if (errno == E2BIG) {
                  /* Input conversion stopped due to lack of space in
                   * the output buffer.
                   */
                    :
          } else if (errno == EINVAL) {
                  /* Input conversion stopped due to an incomplete
                   * character or shift sequence at the end of the
                   * input buffer.
                   */
                    :
          }
  }

Finally, deallocate the conversion descriptor and any memory associated with it.

iconv_close(cd);
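
The fragments in Example 11 and Example 12 can be combined into one function, as in the following sketch. The function name convert_string() and its error messages are hypothetical and not part of the iconv API. The sketch assumes that the whole input is available as a null-terminated string, so an incomplete character at the end of the input is treated as an error rather than carried over to a later call.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/*
 * Convert the null-terminated string "in" from fromcode to tocode and
 * store the null-terminated result in out, which holds outsize bytes.
 * Returns 0 on success and -1 on failure.
 */
int
convert_string(const char *tocode, const char *fromcode, const char *in,
    char *out, size_t outsize)
{
	iconv_t cd;
	char *inbuf = (char *)in;	/* iconv() does not modify the input */
	char *outbuf_p = out;
	size_t inbytesleft, outbytesleft, ret;

	assert(in != NULL && out != NULL && outsize > 0);
	inbytesleft = strlen(in);
	outbytesleft = outsize - 1;	/* reserve room for the terminator */

	cd = iconv_open(tocode, fromcode);	/* target codeset comes first */
	if (cd == (iconv_t)-1) {
		perror("iconv_open");
		return (-1);
	}

	ret = iconv(cd, &inbuf, &inbytesleft, &outbuf_p, &outbytesleft);
	if (ret == (size_t)-1) {
		switch (errno) {
		case EILSEQ:
			(void) fprintf(stderr, "illegal input byte\n");
			break;
		case E2BIG:
			(void) fprintf(stderr, "output buffer too small\n");
			break;
		case EINVAL:
			(void) fprintf(stderr, "incomplete input character\n");
			break;
		default:
			perror("iconv");
			break;
		}
		(void) iconv_close(cd);
		return (-1);
	}

	*outbuf_p = '\0';
	(void) iconv_close(cd);
	return (0);
}
```

With a buffer declared as char outbuf[BUFSIZ], the call convert_string("UTF-8", "ISO8859-2", "z\xB3oty", outbuf, sizeof (outbuf)) returns 0 and leaves the six-byte UTF-8 string followed by a null character in outbuf. Reserving one byte up front means that a full output buffer is reported as E2BIG instead of requiring a separate check before the string is terminated.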

Per-character Sequence-based Code

The cconv() functions convert a character sequence from one codeset into the corresponding character sequence in another codeset. Unlike iconv(), which performs buffer-based code conversion, cconv() performs per-character sequence-based code conversion: each call converts a single character sequence. The cconv() functions available in the libc library are described as follows:

cconv_open(): Character sequence based code conversion allocation function
cconv(): Per character sequence based code conversion function
cconv_close(): Character sequence based code conversion deallocation function
cconvctl(): Control and query cconv code conversion behavior

See the cconv(3C) man page for a sample code fragment.

Just like the iconv() conversion functions, cconv() conversion functions support transliteration. When the target codeset does not have a character corresponding to an input character, the cconv() conversion transliterates the character into one or more characters of the target codeset that best resembles the input character. See the cconv_open(3C) man page for details.

The cconv conversion is established with a combination of two cconv conversion binary table files, fromcode+UTF-32.bt and UTF-32+tocode.bt, both located in the /usr/lib/iconv directory. UTF-32 is used as an intermediary codeset to convert from the fromcode codeset to the tocode codeset. When the required tables are not available, you can create custom cconv conversion tables by using the geniconvtbl utility with the -c option. For information about the input file format for this utility, see the geniconvtbl(5) and geniconvtbl-cconv(5) man pages.

Functions for Converting Between Unicode Codesets

Functions available for converting between any two of the Unicode encoding forms UTF-8, UTF-16, and UTF-32 are described in the following man pages:

uconv_u8tou16(9F): Convert UTF-8 string to UTF-16
uconv_u8tou32(9F): Convert UTF-8 string to UTF-32
uconv_u16tou8(9F): Convert UTF-16 string to UTF-8
uconv_u16tou32(9F): Convert UTF-16 string to UTF-32
uconv_u32tou8(9F): Convert UTF-32 string to UTF-8
uconv_u32tou16(9F): Convert UTF-32 string to UTF-16
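
To show what a conversion between two Unicode encoding forms involves, the following portable sketch converts UTF-16 code units to UTF-32 by combining surrogate pairs. It is not the uconv_u16tou32() implementation and does not use its interface; the function name utf16_to_utf32() and its return convention are hypothetical, and the input is assumed to be in native byte order without a byte order mark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert inlen UTF-16 code units to UTF-32 code points.
 * Returns the number of code points written to out, or (size_t)-1 if the
 * input contains an unpaired surrogate or out (outlen entries) is too small.
 */
size_t
utf16_to_utf32(const uint16_t *in, size_t inlen, uint32_t *out, size_t outlen)
{
	size_t i = 0, n = 0;

	assert(in != NULL && out != NULL);
	while (i < inlen) {
		uint32_t cp = in[i++];

		if (cp >= 0xD800 && cp <= 0xDBFF) {
			/* High surrogate: a low surrogate must follow. */
			if (i == inlen || in[i] < 0xDC00 || in[i] > 0xDFFF)
				return ((size_t)-1);
			cp = 0x10000 + ((cp - 0xD800) << 10) +
			    ((uint32_t)in[i++] - 0xDC00);
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			return ((size_t)-1);	/* unpaired low surrogate */
		}
		if (n == outlen)
			return ((size_t)-1);	/* output buffer is full */
		out[n++] = cp;
	}
	return (n);
}
```

For example, the code units 0xFF21 0xD83D 0xDE00 convert to the two code points 0xFF21 and 0x1F600.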

Processing UTF-8 Strings

Functions available for processing Unicode UTF-8 strings are described in the following man pages:

u8_textprep_str(9F): String-based UTF-8 text preparation
u8_strcmp(9F): UTF-8 string comparison function
u8_validate(9F): Validate UTF-8 characters and calculate the byte length

Note - Use the u8_textprep_str() function to convert a UTF-8 string to uppercase or lowercase as well as to apply one of the Unicode normalization forms. For more information, see Unicode Normalization Forms (http://unicode.org/reports/tr15/).
