Common Desktop Environment: Internationalization Programmer's Guide

Interchange Concepts

This section describes how 8-bit user names and 8-bit data can be exchanged over a network by communications utilities such as ftp and mail, or through interclient communication between desktop clients.

There are three primary considerations for communicating data:

Sender's code set and the receiver's code set.
Whether the communications protocol allows 8-bit data or is limited to 7-bit coded data (for example, the Japanese JUNET passes Japanese Industrial Standard (JIS) coded data over 7-bit protocols).
Type of interchange encoding available, per protocol rules. The actual conversion needed is dependent on the specific protocol used.

If the remote host uses the same code set as the local host, the following is true:

If the protocol allows 8-bit data, no conversions are needed.
If the protocol allows only 7-bit data, a method is needed to map the 8-bit code points to 7-bit ASCII values. This can be accomplished using the iconv() framework and one of the following 7-bit encoding methods:
- Map the 8-bit data as specified by the uuencode and uudecode algorithms in the POSIX.2 specification.
- Alternatively, map the 8-bit data to a 7-bit interchange encoding defined by the protocol; for example, 7-bit ISO2022 in Xlib or base64 in Multipurpose Internet Mail Extensions (MIME).
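
As a rough illustration of mapping 8-bit data onto a 7-bit alphabet, the following sketch encodes arbitrary bytes with the base64 alphabet used by MIME. The function name base64_encode and its buffer conventions are assumptions made for this example; they are not part of the iconv() framework or of any desktop interface.

```c
#include <stddef.h>

/* Hypothetical helper: encode n bytes from src as base64 into dst.
 * dst must hold at least 4 * ((n + 2) / 3) + 1 bytes.
 * Returns the number of characters written, excluding the final NUL. */
size_t base64_encode(const unsigned char *src, size_t n, char *dst)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t i = 0, o = 0;

    /* Each full group of three 8-bit bytes becomes four 6-bit symbols. */
    while (n - i >= 3) {
        unsigned long v = ((unsigned long)src[i] << 16) |
                          ((unsigned long)src[i + 1] << 8) |
                          (unsigned long)src[i + 2];
        dst[o++] = alphabet[(v >> 18) & 0x3F];
        dst[o++] = alphabet[(v >> 12) & 0x3F];
        dst[o++] = alphabet[(v >> 6) & 0x3F];
        dst[o++] = alphabet[v & 0x3F];
        i += 3;
    }

    /* One or two leftover bytes are padded with '='. */
    if (n - i == 1) {
        unsigned long v = (unsigned long)src[i] << 16;
        dst[o++] = alphabet[(v >> 18) & 0x3F];
        dst[o++] = alphabet[(v >> 12) & 0x3F];
        dst[o++] = '=';
        dst[o++] = '=';
    } else if (n - i == 2) {
        unsigned long v = ((unsigned long)src[i] << 16) |
                          ((unsigned long)src[i + 1] << 8);
        dst[o++] = alphabet[(v >> 18) & 0x3F];
        dst[o++] = alphabet[(v >> 12) & 0x3F];
        dst[o++] = alphabet[(v >> 6) & 0x3F];
        dst[o++] = '=';
    }
    dst[o] = '\0';
    return o;
}
```

Every output character lies in the 7-bit ASCII range, so the result can pass through a 7-bit protocol unchanged. The encoding carries no code set information, so the receiver must still learn the sender's code set by some other means.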

If the remote host's code set is different from that of the local host, the following two cases may apply. The conversion needed is dependent on the specific protocol used.

If the protocol allows 8-bit data, the protocol must specify which side performs the iconv() conversion and which encoding is used on the wire. Some protocols recommend an 8-bit interchange encoding that can encode all possible code sets and identify the character repertoire.
If the protocol allows only 7-bit data, a 7-bit interchange encoding is needed, along with a way to identify the character repertoire.

iconv Interface

In a network environment, the code sets of the communicating systems and the protocols of communication determine the transformation of user-specified data so that it can be sent to the remote system in a meaningful way. The user data (not user names) may need to be transformed from the sender's code set to the receiver's code set, or 8-bit data may need to be transformed into a 7-bit form to conform to protocols. A uniform interface is needed to accomplish this.

The following examples illustrate the iconv() interface by showing how to use iconv_open(), iconv(), and iconv_close(). To perform a conversion, call iconv_open() to obtain a conversion descriptor, pass that descriptor to iconv(), and release it with iconv_close() when the conversion is finished. The terms 7-bit interchange and 8-bit interchange are used to refer to any interchange encoding used for 7-bit and 8-bit data, respectively.
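
The complete call sequence can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the helper name convert_buffer and its buffer conventions are assumptions made for this example. The POSIX prototype is iconv_open(tocode, fromcode), with the target code set first.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of in from the code set named
 * fromcode to the code set named tocode, writing at most outcap - 1 bytes
 * plus a terminating NUL into out.
 * Returns the number of bytes produced, or (size_t)-1 on failure. */
size_t convert_buffer(const char *tocode, const char *fromcode,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft;
    size_t rc;

    if (outcap == 0)
        return (size_t)-1;
    outleft = outcap - 1;     /* reserve room for the NUL */

    /* Target code set first, source code set second. */
    cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;    /* conversion pair not supported */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1) {
        /* Emit any pending shift sequence for stateful target encodings. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    }

    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;    /* errno is EILSEQ, EINVAL, or E2BIG */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

A call such as convert_buffer("UTF-8", "ISO-8859-1", text, len, out, sizeof out) assumes that the local iconv implementation recognizes those two names; code set names are implementation-dependent.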

Sender and Receiver Use the Same Code Set:

If the protocol allows 8-bit data, send the 8-bit data as is. Because both sides use the same code set, no conversion is needed.

If the protocol allows only 7-bit data, use iconv(). Note that iconv_open() takes the target code set as its first argument and the source code set as its second:

Sender

cd = iconv_open("uucode", locale_codeset);

Receiver

cd = iconv_open(locale_codeset, "uucode");

Sender and Receiver Use Different Code Sets:

If the protocol allows 8-bit data, do the following, where interchange_8bit stands for the name of the 8-bit interchange encoding:

Sender

cd = iconv_open(interchange_8bit, locale_codeset);

Receiver

cd = iconv_open(locale_codeset, interchange_8bit);

If the protocol allows only 7-bit data, do the following, where interchange_7bit stands for the name of the 7-bit interchange encoding:

Sender

cd = iconv_open(interchange_7bit, locale_codeset);

Receiver

cd = iconv_open(locale_codeset, interchange_7bit);

Here locale_codeset refers to the name of the code set used locally by the application. Although nl_langinfo(CODESET) can be used to obtain the code set associated with the current locale, it is implementation-dependent whether any conversion names match the string it returns.
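
One way to handle this uncertainty is to try the name returned by nl_langinfo(CODESET) and check whether iconv_open() accepts it. The helper name open_from_locale in the following sketch is an assumption made for this example.

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: open a converter from the current locale's code set
 * to the encoding named by tocode.
 * Returns (iconv_t)-1 if iconv_open() does not accept one of the names. */
iconv_t open_from_locale(const char *tocode)
{
    /* Code set name of the current locale, as set by setlocale(). */
    const char *locale_codeset = nl_langinfo(CODESET);

    /* The name from nl_langinfo() is not guaranteed to be a valid
     * conversion name, so the caller must test the result. */
    return iconv_open(tocode, locale_codeset);
}
```

The application is expected to call setlocale(LC_ALL, "") beforehand so that CODESET reflects the user's environment. If the call returns (iconv_t)-1, the application needs its own mapping from locale code set names to conversion names.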

Table 3-1 outlines how iconv() can be used to perform conversions under various conditions. Specific protocols may dictate other conversions.

Table 3-1 Using iconv to Perform Conversions

	Communication with system using the same code set (for example, XYZ)		Communication with system using different code sets, or receiver's code set is unknown	
Conversion to Use	7-bit Protocol	8-bit Protocol	7-bit Protocol	8-bit Protocol
Code set XYZ	Invalid	Best Choice	Invalid	Invalid if remote code set is unknown
7-bit Interchange (ISO2022)	OK	OK	Best Choice	OK
8-bit Interchange (ISO2022, ISO 10646)	Invalid	OK	Invalid	Best Choice
7-bit Untagged (quoted-printable, uucode)	OK	OK	Requires code set identification	Requires code set identification
8-bit Untagged (base64)	Invalid	OK	Requires code set identification	Requires code set identification

Invalid means the interchange encoding should not be used for that combination of code set and type of protocol.
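
The choices in Table 3-1 can be expressed as a small decision function. The enumeration and function names in the following sketch are illustrative assumptions. For the same code set over a 7-bit protocol the table marks no best choice, so the sketch picks the untagged 7-bit form, following the uucode example earlier in this section.

```c
#include <stdbool.h>

/* Interchange classes from Table 3-1 (names are illustrative). */
enum interchange {
    USE_LOCAL_CODESET,      /* send code set XYZ unchanged */
    USE_7BIT_UNTAGGED,      /* for example, quoted-printable or uucode */
    USE_7BIT_INTERCHANGE,   /* for example, 7-bit ISO2022 */
    USE_8BIT_INTERCHANGE    /* for example, 8-bit ISO2022 or ISO 10646 */
};

/* Pick a conversion class for the given conditions.
 * Pass same_codeset = false when the receiver's code set is unknown. */
enum interchange choose_interchange(bool same_codeset, bool protocol_is_8bit)
{
    if (same_codeset) {
        /* 8-bit: "Best Choice" row; 7-bit: an assumed pick among "OK" rows. */
        return protocol_is_8bit ? USE_LOCAL_CODESET : USE_7BIT_UNTAGGED;
    }
    /* Different or unknown code sets: the "Best Choice" cells. */
    return protocol_is_8bit ? USE_8BIT_INTERCHANGE : USE_7BIT_INTERCHANGE;
}
```

A specific protocol may mandate a different encoding, in which case the protocol rules take precedence over this selection.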

Stateful and Stateless Conversions

Code sets can be classified into two categories: stateful encodings and stateless encodings.

Stateful Encodings

Stateful encoding uses sequences of control codes, such as shift-in/shift-out, to change character sets associated with specific code values.

For instance, under compound text, the control sequence "ESC$(B" can be used to indicate the start of Japanese 16-bit data in a stream of characters, and "ESC(B" can be used to indicate the end of this double-byte character data and the start of single-byte ASCII data. Under this stateful encoding, the byte value 0x43 cannot be interpreted without knowing the shift state. The EBCDIC Asian code sets use shift-out/shift-in controls to switch to double-byte encoding and back to single-byte encoding, respectively.

Converters for stateful encodings tend to be more complex than converters for stateless encodings, because they must track the shift state while processing the data.
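
The extra processing can be seen in a small scanner that tracks the shift state for the two compound text sequences mentioned above. This is a simplified sketch that recognizes only those two sequences; the function name count_by_shift_state is an assumption made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical scanner: counts single-byte and double-byte characters
 * in buf[0..n), switching state on ESC $ ( B and ESC ( B.
 * Returns 0 on success, or -1 if the data ends in the middle of a
 * double-byte character. */
int count_by_shift_state(const unsigned char *buf, size_t n,
                         size_t *single, size_t *dbl)
{
    static const unsigned char to_double[] = { 0x1B, '$', '(', 'B' };
    static const unsigned char to_single[] = { 0x1B, '(', 'B' };
    int in_double = 0;          /* current shift state */
    size_t i = 0;

    *single = 0;
    *dbl = 0;
    while (i < n) {
        if (n - i >= sizeof to_double &&
            memcmp(buf + i, to_double, sizeof to_double) == 0) {
            in_double = 1;      /* start of double-byte data */
            i += sizeof to_double;
        } else if (n - i >= sizeof to_single &&
                   memcmp(buf + i, to_single, sizeof to_single) == 0) {
            in_double = 0;      /* back to single-byte ASCII */
            i += sizeof to_single;
        } else if (in_double) {
            if (n - i < 2)
                return -1;      /* truncated double-byte character */
            (*dbl)++;
            i += 2;
        } else {
            (*single)++;
            i += 1;
        }
    }
    return 0;
}
```

The same byte 0x43 is counted as the character C in the single-byte state and as half of a double-byte character in the double-byte state, which is why the scanner cannot start in the middle of a stream.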

Stateless Encodings

Stateless code sets are those that can be classified as one of two types:

Single-byte code sets, such as the ISO8859 family
Multibyte code sets, such as PC codes for Japanese and Shift-JIS (SJIS)

The term multibyte code set is also used more generally to refer to any code set that needs one or more bytes to encode a character. In this guide, multibyte code sets are considered stateless.
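
In a stateless multibyte code set, the length of each character can be determined from its first byte alone. The following sketch shows this for Shift-JIS, assuming the commonly used lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xFC; vendor variants may differ, and the function names are assumptions made for this example.

```c
#include <stddef.h>

/* Length in bytes of the Shift-JIS character whose first byte is c,
 * assuming lead-byte ranges 0x81-0x9F and 0xE0-0xFC. */
static int sjis_char_len(unsigned char c)
{
    return ((c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC)) ? 2 : 1;
}

/* Count the characters in n bytes of Shift-JIS text.
 * Returns (size_t)-1 if the text ends after a lead byte. */
size_t sjis_count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;

    while (i < n) {
        size_t len = (size_t)sjis_char_len(s[i]);
        if (n - i < len)
            return (size_t)-1;  /* truncated double-byte character */
        i += len;
        count++;
    }
    return count;
}
```

No shift state is carried from one character to the next, in contrast to the stateful encodings described earlier.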

Note -

Conversions are meaningful only if the code sets represent the same character set.