Internationalizing and Localizing Applications in Oracle Solaris

Updated: July 2014
 
 

Converting Codesets by Using iconv Functions

The following iconv functions in the libc library perform code conversion:

iconv_open()

Code conversion allocation function

iconv()

Code conversion function

iconv_close()

Code conversion deallocation function

iconvctl()

Control and query the code conversion behavior

iconvstr()

String-based code conversion function

iconv functions enable the code conversion of characters or a sequence of characters from one codeset to another. The iconv_open() function supports various codesets. You can display information about supported codesets and their aliases currently available on a system by running the following command:

$ iconv -l

Because iconv modules come in multiple packages, you can extend the default list of available conversions by installing additional packages. The default installation includes the system/library/iconv/utf-8 package, which covers the basic set of iconv modules for conversions among UTF-8, Unicode, and other selected codesets.

You can install additional packages by using the Package Manager application or the pkg command. If you are using the Package Manager for installation, the additional packages are available in the System/Internationalization category. If you are using the pkg command, use the system/library/iconv/* name pattern for installation.

The iconv conversion modules are named in the form fromcode%tocode.so and must be present in the iconv module library under the /usr/lib/iconv directory for the iconv functions to use them. Therefore, not every pair of codesets listed by the iconv -l command can be converted directly. If a required module is still unavailable after all the iconv packages are installed, you can do a two-step conversion that uses a Unicode encoding, for example, UTF-32, as an intermediary codeset. Alternatively, you can develop a custom conversion module by using the geniconvtbl utility. For information about the input file format for the geniconvtbl utility, see the geniconvtbl(4) man page.
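The two-step conversion described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: convert_once() and convert_via_utf32() are hypothetical helper names, the scratch buffer is supplied by the caller, and the code assumes the POSIX prototype of iconv() that takes a char ** input pointer.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * convert_once() and convert_via_utf32() are hypothetical helpers.
 * convert_once() converts inlen bytes from fromcode to tocode into out
 * and returns the number of bytes written, or (size_t)-1 on any failure.
 */
static size_t
convert_once(const char *tocode, const char *fromcode,
    char *in, size_t inlen, char *out, size_t outsize)
{
    size_t outleft = outsize;
    size_t ret;
    iconv_t cd = iconv_open(tocode, fromcode);

    if (cd == (iconv_t)-1)
        return ((size_t)-1);
    ret = iconv(cd, &in, &inlen, &out, &outleft);
    if (ret != (size_t)-1) {
        /* Write any closing shift sequence for stateful codesets. */
        ret = iconv(cd, NULL, NULL, &out, &outleft);
    }
    (void) iconv_close(cd);
    return (ret == (size_t)-1 ? (size_t)-1 : outsize - outleft);
}

/* Convert fromcode -> UTF-32 -> tocode through the scratch buffer tmp. */
static size_t
convert_via_utf32(const char *tocode, const char *fromcode,
    char *in, size_t inlen, char *tmp, size_t tmpsize,
    char *out, size_t outsize)
{
    size_t n = convert_once("UTF-32", fromcode, in, inlen, tmp, tmpsize);

    if (n == (size_t)-1)
        return ((size_t)-1);
    return (convert_once(tocode, "UTF-32", tmp, n, out, outsize));
}
```

UTF-32 uses 4 bytes per character and may be preceded by a byte order mark, so the scratch buffer should hold at least 4 bytes for each input character plus 4 more.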

Example 2-9  Creating Conversion Descriptor Using iconv_open()

The following code fragment shows how to use the iconv_open() function for converting the string złoty (currency of Poland) from the single-byte ISO 8859-2 codeset to UTF-8. In order to perform the conversion with iconv, you need to create a conversion descriptor with a call to the iconv_open() function and verify that the call was successful.

#include <iconv.h>
#include <stdio.h>
iconv_t cd;
 :
cd = iconv_open("UTF-8", "ISO8859-2");
if (cd == (iconv_t)-1) {
    (void) fprintf(stderr, "iconv_open() failed\n");
    return (1);
}

The target codeset is the first argument to the iconv_open() function.
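When iconv_open() fails, the errno value can help distinguish an unsupported pair of codesets from a resource problem. The following sketch shows one possible way to report the difference. open_converter() is a hypothetical wrapper, not a libc function, and the messages are illustrative only.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/*
 * open_converter() is a hypothetical helper, not part of libc.
 * It wraps iconv_open() and reports why the call failed.
 * Returns (iconv_t)-1 on failure, like iconv_open() itself.
 */
static iconv_t
open_converter(const char *tocode, const char *fromcode)
{
    iconv_t cd = iconv_open(tocode, fromcode);   /* target codeset first */

    if (cd == (iconv_t)-1) {
        if (errno == EINVAL) {
            /* No conversion is available for this pair of codesets. */
            (void) fprintf(stderr,
                "conversion from %s to %s is not supported\n",
                fromcode, tocode);
        } else {
            /* For example EMFILE, ENFILE, or ENOMEM. */
            (void) fprintf(stderr, "iconv_open() failed: %s\n",
                strerror(errno));
        }
    }
    return (cd);
}
```

A caller would use it in place of the direct call, for example cd = open_converter("UTF-8", "ISO8859-2"), and still compare the result with (iconv_t)-1.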

Example 2-10  Conversion Using iconv()

The following code fragment shows how to convert one codeset to another using the iconv() function.

Before the actual conversion, certain variables need to be in place to hold the information returned by the iconv call, such as the output buffer, number of bytes left in the input and output buffers, and so on.

In the ISO 8859-2 codeset, the L WITH STROKE character is represented as 0xB3 in hexadecimal. Therefore, for illustration, the input buffer (inbuf), which holds the input string, is set to z\xB3oty. In a real program, the contents of inbuf would typically come from reading a stream or a file.

#include <iconv.h>
#include <stdio.h>
#include <errno.h>

    :
size_t     ret;       /* iconv() returns size_t */
char       *inbuf;
size_t     inbytesleft, outbytesleft;
char       outbuf[BUFSIZ];
char       *outbuf_p;

inbuf = "z\xB3oty";
inbytesleft = 5;      /* the size of the input string */

The converted string occupies 6 bytes, so the output buffer needs at least 6 bytes, plus 1 byte for the terminating null character. The L WITH STROKE character is converted to the Unicode character LATIN SMALL LETTER L WITH STROKE, which is represented in UTF-8 as the two-byte sequence 0xC5 0x82.

Because the size of the converted string is usually not known before the conversion, allocate the output buffer with extra space. The BUFSIZ macro defined in stdio.h is sufficient in this case.

outbytesleft = BUFSIZ;   
outbuf_p = outbuf;

This conversion call uses the conversion descriptor cd from the previous example.

ret = iconv(cd, &inbuf, &inbytesleft, &outbuf_p, &outbytesleft);

After the call to iconv, you need to check whether it succeeded. If it was successful and there is still space in the output buffer, you need to terminate the string with a null character.

  if (ret != (size_t)-1) {
          if (outbytesleft == 0) {
                  /* Cannot terminate outbuf as a character string; error return */
                  return (-1);
          }
          /* success */
          *outbuf_p = '\0';
            :
  }

If the call is successful, outbuf contains the string in the UTF-8 codeset, which in hexadecimal notation is \x7a\xc5\x82\x6f\x74\x79, or z\xc5\x82oty. The inbuf pointer now points to the end of the input string, and inbytesleft is 0. The outbytesleft value is decremented by 6, which is the number of bytes written to the output buffer. The outbuf_p pointer points to the end of the output string in outbuf.
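The pointer and counter updates described above can be checked with a short sketch. The struct progress type and the convert_and_measure() function are hypothetical names introduced for illustration. The sketch assumes the POSIX prototype of iconv() that takes a char ** input pointer.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical record of how far one iconv() call advanced. */
struct progress {
    size_t consumed;    /* input bytes read */
    size_t produced;    /* output bytes written */
};

/*
 * convert_and_measure() is a hypothetical helper. It runs one iconv()
 * call and derives the progress from the updated pointers.
 * Returns 0 on success and -1 if iconv() reported an error.
 */
static int
convert_and_measure(iconv_t cd, char *in, size_t inlen,
    char *out, size_t outsize, struct progress *p)
{
    char *in_p = in;
    char *out_p = out;
    size_t inbytesleft = inlen;
    size_t outbytesleft = outsize;
    size_t ret = iconv(cd, &in_p, &inbytesleft, &out_p, &outbytesleft);

    p->consumed = (size_t)(in_p - in);      /* equals inlen - inbytesleft */
    p->produced = (size_t)(out_p - out);    /* equals outsize - outbytesleft */
    return (ret == (size_t)-1 ? -1 : 0);
}
```

For the 5-byte input z\xB3oty and an ISO 8859-2 to UTF-8 descriptor, consumed is 5 and produced is 6, which matches the values described for inbytesleft and outbytesleft.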

If the call fails, check the errno value to handle the error cases as shown in the following code fragment:

  if (ret == (size_t)-1) {
          if (errno == EILSEQ) {
                  /* Input conversion stopped due to an input byte that
                   * does not belong to the input codeset.
                   */
                    :
          } else if (errno == E2BIG) {
                  /* Input conversion stopped due to lack of space in
                   * the output buffer.
                   */
                    :
          } else if (errno == EINVAL) {
                  /* Input conversion stopped due to an incomplete
                   * character or shift sequence at the end of the
                   * input buffer.
                   */
                    :
          }
  }
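One possible way to handle the E2BIG case is to enlarge the output buffer and call iconv() again. The following sketch assumes a heap-allocated buffer that is doubled on each retry. convert_alloc() is a hypothetical helper name, and the deliberately small initial size is chosen only to exercise the retry path.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/*
 * convert_alloc() is a hypothetical helper. It converts inlen bytes at in
 * and returns a malloc()ed, null-terminated result, or NULL on failure.
 * When iconv() reports E2BIG, the buffer is doubled and the call is
 * repeated. iconv() has already advanced in and inlen past the bytes
 * that were converted, so no input is converted twice.
 */
static char *
convert_alloc(iconv_t cd, char *in, size_t inlen, size_t *outlen)
{
    size_t size = 4;            /* deliberately small to exercise E2BIG */
    size_t used = 0;
    char *buf = malloc(size);

    if (buf == NULL)
        return (NULL);
    while (inlen > 0) {
        char *outp = buf + used;
        size_t outleft = size - used - 1;    /* reserve 1 byte for '\0' */
        size_t ret = iconv(cd, &in, &inlen, &outp, &outleft);

        used = (size_t)(outp - buf);
        if (ret != (size_t)-1)
            break;              /* all input converted */
        if (errno != E2BIG) {
            /* EILSEQ or EINVAL: give up in this sketch. */
            free(buf);
            return (NULL);
        }
        char *bigger = realloc(buf, size * 2);
        if (bigger == NULL) {
            free(buf);
            return (NULL);
        }
        buf = bigger;
        size *= 2;
    }
    buf[used] = '\0';
    *outlen = used;
    return (buf);
}
```

The caller owns the returned buffer and releases it with free(). A production version would also decide how to treat EILSEQ and EINVAL, for example by skipping the offending byte or by reading more input.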

Finally, deallocate the conversion descriptor and any memory associated with it.

iconv_close(cd);