Converting Codesets

In computer systems, characters are represented as unique scalar values, which are handled as bytes or byte sequences. A coded character set is a character set together with the mappings between its characters and their corresponding unique scalar values. A coded character set is also called a codeset. For example, 646 (also called US-ASCII) is the codeset defined by the ISO/IEC 646:1991 standard for the basic Latin alphabet. The following table shows examples of codesets:

Codeset	Character	Representation
`US-ASCII`	A	0x41
`ISO 8859-2`	Č	0xC8
`EUC-KR`	Full-width Latin A	0xA3 0xC1

The Unicode standard adds another layer and maps each character to a code point, which is a number between 0 and 1,114,111. This number is represented differently in each of the Unicode encoding forms, such as UTF-8, UTF-16, or UTF-32. For example:

Codeset	Character	Code point	Encoding	Representation
Unicode	FULLWIDTH LATIN CAPITAL LETTER A	65,313 or 0xFF21	`UTF-8`	0xEF 0xBC 0xA1
Unicode	FULLWIDTH LATIN CAPITAL LETTER A	65,313 or 0xFF21	`UTF-16LE`	0x21 0xFF
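
The UTF-8 row in the preceding table can be reproduced by hand. The following is a minimal sketch of the UTF-8 encoding form, assuming a hypothetical helper named utf8_encode() that is not part of any library described here. It is meant to illustrate how one code point becomes a byte sequence, not to replace iconv().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes to out and returns the byte count,
 * or 0 if cp is not a valid Unicode scalar value.
 * utf8_encode() is an illustrative helper, not a library function.
 */
size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return (0);             /* out of range or surrogate */
    if (cp < 0x80) {            /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return (1);
    }
    if (cp < 0x800) {           /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return (2);
    }
    if (cp < 0x10000) {         /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return (3);
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return (4);
}
```

Calling utf8_encode(0xFF21, buf) fills buf with 0xEF 0xBC 0xA1 and returns 3, which matches the representation shown for FULLWIDTH LATIN CAPITAL LETTER A.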

Note - A codeset is also referred to as an encoding. Although there is a distinction between the two terms, codeset and encoding are used interchangeably.

Code conversion or codeset conversion means converting the byte or byte sequence representations from one codeset to another codeset. A common approach to conversion is to use the iconv() family of functions. Some of the terms used in the area of code conversion and iconv() functions are as follows:

Single-byte codeset: Codeset that maps the characters to a set of values ranging from 0 to 255, or 0x00 to 0xFF. Therefore, a character is represented in a single byte.
Multibyte codeset: Codeset that maps some or all of the characters to more than one byte.
Illegal character: Byte or byte sequence that does not represent a valid character in the input codeset.
Shift sequence: Special sequence of bytes in a multibyte codeset that does not map to a character but instead is a means of changing the state of the decoder.
Incomplete character: Sequence of bytes that does not form a valid character in an input codeset. However, it may form a valid character on a subsequent call to the conversion function, such as the iconv() function, when additional bytes are provided from the input. This is common when converting a multibyte stream.
Non-identical character: Character that is valid in the input codeset but for which an identical character does not exist in the output codeset.
Non-identical conversion: Conversion of a non-identical character. Depending on the implementation and conversion options, these characters can be omitted in the output or replaced with one or more characters indicating that a non-identical conversion occurred. The Oracle Solaris iconv() function replaces non-identical characters with a question mark ('?') by default.
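
The difference between an illegal character and an incomplete character can be sketched in code. The following example assumes UTF-8 as the input codeset and uses a hypothetical helper named utf8_classify(). It is deliberately simplified: it checks only the lead byte and continuation byte patterns and does not reject overlong forms or surrogates, so it is not a full validator.

```c
#include <assert.h>
#include <stddef.h>

enum seq_status { SEQ_COMPLETE, SEQ_INCOMPLETE, SEQ_ILLEGAL };

/*
 * Classify the first character in buf, where len bytes are available
 * and the input codeset is assumed to be UTF-8.
 * utf8_classify() is an illustrative helper, not a library function.
 */
enum seq_status
utf8_classify(const unsigned char *buf, size_t len)
{
    size_t need, i;

    assert(buf != NULL && len > 0);
    if (buf[0] < 0x80)
        need = 1;
    else if ((buf[0] & 0xE0) == 0xC0)
        need = 2;
    else if ((buf[0] & 0xF0) == 0xE0)
        need = 3;
    else if ((buf[0] & 0xF8) == 0xF0)
        need = 4;
    else
        return (SEQ_ILLEGAL);   /* stray continuation or invalid lead byte */

    /* Every byte after the lead byte must look like 10xxxxxx. */
    for (i = 1; i < need && i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return (SEQ_ILLEGAL);
    }
    /* Valid so far; more bytes may arrive on a later call. */
    return (len < need ? SEQ_INCOMPLETE : SEQ_COMPLETE);
}
```

With this sketch, the single byte 0xC5 is reported as incomplete because it could become valid once 0x82 arrives, whereas the lone byte 0x82 is illegal no matter what follows.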

Converting Codesets by Using `iconv` Functions

The iconv() functions available in the libc library for code conversion are described as follows:

iconv_open(): Code conversion allocation function
iconv(): Code conversion function
iconv_close(): Code conversion deallocation function
iconvctl(): Control and query the code conversion behavior
iconvstr(): String-based code conversion function

iconv() functions enable the code conversion of characters or a sequence of characters from one codeset to another. The iconv_open() function supports various codesets. You can display information about supported codesets and their aliases currently available on a system by running the following command:

$ iconv -l

Because iconv modules come in multiple packages, you can extend the default list of available conversions by installing additional packages. The default installation includes the system/library/iconv/utf-8 package, which covers the basic set of iconv modules for conversions among UTF-8, Unicode, and other selected codesets.

You can install additional packages by using the Package Manager application or the pkg command. If you are using the Package Manager for installation, the additional packages are available in the System/Internationalization category. If you are using the pkg command, use the system/library/iconv/* name pattern for installation.

The iconv conversion modules are named in the form fromcode%tocode.so and must be present in the /usr/lib/iconv directory for the iconv functions to use them. Because a module must exist for each specific pair of codesets, you cannot necessarily convert directly between any two codesets listed by the iconv -l command. If a required module is not available even when all the iconv packages are installed, you can do a two-step conversion that uses a Unicode encoding, for example, UTF-32, as an intermediary codeset. Alternatively, you can develop a custom conversion module by using the geniconvtbl utility. For information about the input file format for the geniconvtbl utility, see the geniconvtbl(4) man page.
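
The two-step approach can be sketched as follows. The helper names convert_once() and convert_via_utf32() are hypothetical, and the sketch makes several assumptions: the intermediary is named UTF-32BE (a byte-order-specific name avoids a byte order mark on some implementations), the input fits a fixed 1024-byte intermediate buffer, and the codesets are stateless, so no shift state is flushed. Codeset names and the exact type of the second iconv() argument vary between platforms.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes from codeset "from" to codeset "to" in one call.
 * Returns the number of bytes written to out, or (size_t)-1 on failure.
 * convert_once() is an illustrative helper, not part of libc.
 */
static size_t
convert_once(const char *to, const char *from,
    char *in, size_t inlen, char *out, size_t outcap)
{
    iconv_t cd;
    size_t  outleft = outcap;
    size_t  ret;

    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return ((size_t)-1);
    ret = iconv(cd, &in, &inlen, &out, &outleft);
    (void) iconv_close(cd);
    return (ret == (size_t)-1 ? (size_t)-1 : outcap - outleft);
}

/*
 * Convert from "from" to "to" through UTF-32BE as the intermediary.
 * Returns the number of bytes written to out, or (size_t)-1 on failure.
 */
size_t
convert_via_utf32(const char *to, const char *from,
    char *in, size_t inlen, char *out, size_t outcap)
{
    char   mid[1024];   /* UTF-32 uses 4 bytes per character */
    size_t midlen;

    /* Each input byte yields at most one character in the intermediary. */
    assert(inlen <= sizeof (mid) / 4);
    midlen = convert_once("UTF-32BE", from, in, inlen, mid, sizeof (mid));
    if (midlen == (size_t)-1)
        return ((size_t)-1);
    return (convert_once(to, "UTF-32BE", mid, midlen, out, outcap));
}
```

The cost of this design is a second pass over the data and an intermediate buffer four times the character count, which is usually acceptable for short strings.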

Example 9 Creating Conversion Descriptor Using iconv_open()

The following code fragment shows how to use the iconv_open() function for converting the string złoty (currency of Poland) from the single-byte ISO 8859-2 codeset to UTF-8. In order to perform the conversion with iconv, you need to create a conversion descriptor with a call to the iconv_open() function and verify that the call was successful.

#include <iconv.h>
#include <stdio.h>

iconv_t cd;
 :
cd = iconv_open("UTF-8", "ISO8859-2");
if (cd == (iconv_t)-1) {
    (void) fprintf(stderr, "iconv_open() failed\n");
    return (1);
}

The target codeset is the first argument to the iconv_open() function, and the source codeset is the second.

Example 10 Conversion Using iconv()

The following code fragment shows how to convert one codeset to another using the iconv() function.

Before the actual conversion, certain variables need to be in place to hold the information returned by the iconv call, such as the output buffer, number of bytes left in the input and output buffers, and so on.

In the ISO 8859-2 codeset, the L WITH STROKE character is represented as 0xB3. Therefore, for illustrative purposes, the input buffer (inbuf), which holds the input string, is set to z\xB3oty. In a real application, the contents of inbuf would typically come from reading a stream or a file.

#include <iconv.h>
#include <stdio.h>
#include <errno.h>

    :
size_t     ret;       /* iconv() returns size_t; (size_t)-1 means failure */
char       *inbuf;
size_t     inbytesleft, outbytesleft;
char       outbuf[BUFSIZ];
char       *outbuf_p;

inbuf = "z\xB3oty";
inbytesleft = 5;      /* the size of the input string */

For the output buffer to hold the converted string, at least 6 bytes are needed, plus 1 byte for the terminating null character. The L WITH STROKE character is converted to the Unicode character LATIN SMALL LETTER L WITH STROKE, which is represented as the two-byte sequence 0xC5 0x82 in UTF-8.

Because the size of the resulting string is usually not known before the conversion, allocate the output buffer with enough extra space. The BUFSIZ macro defined in stdio.h is sufficient in this case.

outbytesleft = BUFSIZ;   
outbuf_p = outbuf;

This conversion call uses the conversion descriptor cd from the previous example.

ret = iconv(cd, &inbuf, &inbytesleft, &outbuf_p, &outbytesleft);

After the call to iconv, you need to check whether it succeeded. If it was successful and there is still space in the output buffer, you need to terminate the string with a null character.

  if (ret != (size_t)-1) {
          if (outbytesleft == 0) {
                  /* Cannot terminate outbuf as a character string; error return */
                  return (-1);
          }
          /* success */
          *outbuf_p = '\0';
            :
  }

If the call is successful, outbuf contains the string in the UTF-8 codeset, which is \x7a\xc5\x82\x6f\x74\x79 in hexadecimal notation, or z\xc5\x82oty. The inbuf pointer now points to the end of the input string, and inbytesleft is 0. The outbytesleft value is decremented by 6, which is the number of bytes written to the output buffer. The outbuf_p pointer points to the end of the output string in outbuf.

If the call fails, check the errno value to handle the error cases as shown in the following code fragment:

  if (ret == (size_t)-1) {
          if (errno == EILSEQ) {
                  /* Input conversion stopped due to an input byte that
                   * does not belong to the input codeset.
                   */
                    :
          } else if (errno == E2BIG) {
                  /* Input conversion stopped due to lack of space in
                   * the output buffer.
                   */
                    :
          } else if (errno == EINVAL) {
                  /* Input conversion stopped due to an incomplete
                   * character or shift sequence at the end of the
                   * input buffer.
                   */
                    :
          }
  }
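
The E2BIG case is usually recoverable: iconv() has already updated the input and output pointers, so the caller can drain the output buffer and call iconv() again. The following is a sketch of such a loop, assuming a hypothetical helper named convert_chunked() and a deliberately tiny 4-byte scratch buffer so that the retry path is exercised even for short input. It treats EILSEQ and EINVAL as fatal, which a real application might handle differently.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/*
 * Convert inlen bytes through cd, using a small scratch buffer and
 * appending each drained chunk to dst (capacity dstcap).
 * Returns the total number of bytes appended, or (size_t)-1 on error.
 * convert_chunked() is an illustrative helper, not part of libc.
 */
size_t
convert_chunked(iconv_t cd, char *in, size_t inlen, char *dst, size_t dstcap)
{
    char   chunk[4];
    size_t total = 0;

    assert(cd != (iconv_t)-1);
    for (;;) {
        char   *out = chunk;
        size_t outleft = sizeof (chunk);
        size_t ret = iconv(cd, &in, &inlen, &out, &outleft);
        size_t produced = sizeof (chunk) - outleft;

        if (ret == (size_t)-1 && errno != E2BIG)
            return ((size_t)-1);    /* EILSEQ or EINVAL: give up */
        if (total + produced > dstcap)
            return ((size_t)-1);    /* destination is too small */
        (void) memcpy(dst + total, chunk, produced);
        total += produced;
        if (ret != (size_t)-1)
            return (total);         /* all input was consumed */
        if (produced == 0)
            return ((size_t)-1);    /* E2BIG without progress */
    }
}
```

In practice the scratch buffer would be much larger; the small size here only demonstrates that the conversion resumes correctly after each E2BIG return.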

Finally, deallocate the conversion descriptor and any memory associated with it.

iconv_close(cd);
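
For reference, the fragments from Example 9 and Example 10 can be assembled into one function. This is a sketch under stated assumptions: the function name latin2_to_utf8() is hypothetical, the ISO8859-2 codeset name is the one used in this section and may be spelled differently on other platforms, and any conversion error is reported simply as -1.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes of ISO 8859-2 text to a null-terminated UTF-8
 * string in out (capacity outcap). Returns 0 on success, -1 on failure.
 * latin2_to_utf8() is an illustrative name, not part of libc.
 */
int
latin2_to_utf8(char *in, size_t inlen, char *out, size_t outcap)
{
    iconv_t cd;
    size_t  ret;
    size_t  outleft = outcap;

    assert(in != NULL && out != NULL);
    cd = iconv_open("UTF-8", "ISO8859-2");  /* target first, source second */
    if (cd == (iconv_t)-1)
        return (-1);

    ret = iconv(cd, &in, &inlen, &out, &outleft);
    (void) iconv_close(cd);                 /* release the descriptor */

    if (ret == (size_t)-1 || outleft == 0)
        return (-1);    /* conversion failed or no room for '\0' */
    *out = '\0';        /* out now points just past the converted bytes */
    return (0);
}
```

Given the 5-byte input z\xB3oty and a sufficiently large output buffer, the function leaves the 6-byte string z\xc5\x82oty in the buffer, followed by the terminating null character.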

Functions for Converting Between Unicode Codesets

Functions available for converting between any two of the Unicode encoding forms UTF-8, UTF-16, and UTF-32 are described in the following man pages:

uconv_u8tou16(9F): Convert UTF-8 string to UTF-16
uconv_u8tou32(9F): Convert UTF-8 string to UTF-32
uconv_u16tou8(9F): Convert UTF-16 string to UTF-8
uconv_u16tou32(9F): Convert UTF-16 string to UTF-32
uconv_u32tou8(9F): Convert UTF-32 string to UTF-8
uconv_u32tou16(9F): Convert UTF-32 string to UTF-16
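
To show what a conversion between Unicode encoding forms involves, the following sketch encodes a single code point (the UTF-32 value) as UTF-16 code units. It is plain portable C written for illustration and does not use or reproduce the uconv functions; the helper name utf16_encode() is hypothetical. See the man pages listed above for the actual interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-16 code units in native byte order.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid Unicode scalar value.
 * utf16_encode() is an illustrative helper, not a library function.
 */
size_t
utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return (0);             /* out of range or surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;  /* Basic Multilingual Plane: one unit */
        return (1);
    }
    /* Supplementary planes: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return (2);
}
```

The code point 0xFF21 from the earlier table fits in one unit, while a supplementary code point such as 0x1F600 becomes the pair 0xD83D 0xDE00.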

Processing UTF-8 Strings

Functions available for processing Unicode UTF-8 strings are described in the following man pages:

u8_textprep_str(9F): String-based UTF-8 text preparation
u8_strcmp(9F): UTF-8 string comparison function
u8_validate(9F): Validate UTF-8 characters and calculate the byte length

Note - Use the u8_textprep_str() function to convert a UTF-8 string to uppercase or lowercase as well as to apply one of the Unicode normalization forms. For more information, see Unicode Normalization Forms (http://unicode.org/reports/tr15/).
