Sun Java System Messaging Server 6.3 Administration Guide

12.4.2.7 Character Set Labeling and Eight-Bit Data

Keywords: charset7, charset8, charsetesc, sevenbit, eightbit, eightnegotiate, eightstrict

Character Set Labeling

The MIME specification provides a mechanism to label the character set used in a plain text message. Specifically, a charset= parameter can be specified as part of the Content-type: header line. Various character set names are defined in MIME, including US-ASCII (the default), ISO-8859-1, ISO-8859-2, and many more that have been subsequently defined.

Some existing systems and user agents do not provide a mechanism for generating these character set labels; as a result, some plain text messages may not be properly labeled. The charset7, charset8, and charsetesc channel keywords provide a per-channel mechanism to specify character set names to be inserted into message headers that lack character set labeling. Each keyword requires a single argument giving the character set name. The names are not checked for validity. Note, however, that character set conversion can only be done on character sets specified in the character set definition file charsets.txt found in the MTA table directory. The names defined in this file should be used if possible.

The charset7 character set name is used if the message contains only seven-bit characters; the charset8 character set name is used if eight-bit data is found in the message; the charsetesc character set name is used if a message containing only seven-bit data also contains escape characters. If the appropriate keyword is not specified, no character set name is inserted into Content-type: header lines.
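The selection rule described above can be sketched in Python. This is an illustrative model only, not the MTA's implementation: the function pick_charset_label and its parameters are hypothetical names, and the sketch assumes that "escape character" means ASCII ESC (0x1B).

```python
from typing import Optional

ESC = 0x1B  # ASCII escape character (assumption: this is the "escape" meant)


def pick_charset_label(
    body: bytes,
    charset7: Optional[str] = None,
    charset8: Optional[str] = None,
    charsetesc: Optional[str] = None,
) -> Optional[str]:
    """Return the label that would be inserted for an unlabeled text message.

    Hypothetical helper: each argument stands for the value given to the
    channel keyword of the same name, or None if the keyword is absent.
    A result of None means no character set name is inserted.
    """
    if any(b > 127 for b in body):
        return charset8      # eight-bit data present
    if ESC in body:
        return charsetesc    # seven-bit data that contains escape characters
    return charset7          # plain seven-bit data


print(pick_charset_label(b"hello", charset7="US-ASCII", charset8="ISO-8859-1"))
print(pick_charset_label(b"caf\xe9", charset7="US-ASCII", charset8="ISO-8859-1"))
print(pick_charset_label(b"\x1b$B", charset7="US-ASCII"))
```

The third call returns None because the sketch follows the rule that no label is inserted when the applicable keyword (here charsetesc) is not specified.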

Note that the charset8 keyword also controls the MIME encoding of 8-bit characters in message headers (where 8-bit data is unconditionally illegal). The MTA normally MIME-encodes any (illegal) 8-bit data encountered in message headers, labeling it as the UNKNOWN charset if no charset8 value has been specified.
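To illustrate what a MIME-encoded header looks like, the following Python sketch wraps a header value containing 8-bit bytes in an encoded-word using the "Q" encoding. It is a simplified illustration under stated assumptions, not the MTA's algorithm: the function encode_header_value is hypothetical, it encodes the whole value as a single word, and it ignores the encoded-word length limit. The MTA may choose a different encoding or split the value differently.

```python
from typing import Optional


def encode_header_value(raw: bytes, charset8: Optional[str] = None) -> str:
    """Wrap a header value containing 8-bit bytes in one MIME encoded-word.

    Hypothetical helper: charset8 stands for the channel keyword value;
    when it is absent the data is labeled with the UNKNOWN charset.
    """
    if all(b < 128 for b in raw):
        return raw.decode("ascii")  # legal seven-bit header, nothing to encode
    label = charset8 or "UNKNOWN"
    parts = []
    for b in raw:
        ch = chr(b)
        if b < 128 and ch.isalnum():
            parts.append(ch)           # ASCII letters and digits pass through
        elif ch == " ":
            parts.append("_")          # "Q" encoding represents space as "_"
        else:
            parts.append(f"={b:02X}")  # everything else becomes =XX
    return f"=?{label}?Q?{''.join(parts)}?="


print(encode_header_value(b"Caf\xe9 menu", "ISO-8859-1"))
print(encode_header_value(b"Caf\xe9 menu"))
```

With charset8 set to ISO-8859-1 the value is labeled with that name; without it the same bytes are labeled UNKNOWN, so a receiving user agent cannot tell how to display them.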

These character set specifications never override existing labels; that is, they have no effect if a message already has a character set label or is of a type other than text. It is usually appropriate to label MTA local channels as follows:


l ... charset7 US-ASCII charset8 ISO-8859-1 ...
hostname

If the message has no Content-type: header line, one is added. The character set keywords also add the MIME-version: header line if it is missing.

The charsetesc keyword tends to be particularly useful on channels that receive unlabeled messages using Japanese or Korean character sets that contain the escape character.
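The reason such messages fall under charsetesc rather than charset8 can be seen with a short Python check. ISO-2022-JP is used here as an example of a Japanese character set of this kind; the sample text is arbitrary.

```python
text = "日本語のメール"
data = text.encode("iso2022_jp")  # Python's codec name for ISO-2022-JP

# Every byte is seven-bit, so the message would not count as eight-bit data.
print(all(b < 0x80 for b in data))

# Escape sequences switch between ASCII and the Japanese character set.
print(b"\x1b" in data)

# The bytes are only readable when decoded with the correct label.
print(data.decode("iso2022_jp") == text)
```

A message like this contains only seven-bit data, yet labeling it US-ASCII would be wrong, which is the case the charsetesc keyword is intended to cover.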

Eight-Bit Data

Some transports restrict the use of characters with ordinal values greater than 127 (decimal). Most notably, some SMTP servers will strip the high bit and thus garble messages that use characters in this eight-bit range.
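The effect of stripping the high bit can be reproduced with a small Python sketch. The function strip_high_bit is a hypothetical stand-in for what such a server does to each byte.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear bit 7 of every byte, as a seven-bit-only transport might."""
    return bytes(b & 0x7F for b in data)


original = "Café".encode("iso-8859-1")  # é is the single byte 0xE9
garbled = strip_high_bit(original)

print(garbled.decode("ascii"))  # prints "Cafi": 0xE9 became 0x69
```

The damage is silent and irreversible: the receiver sees a valid seven-bit character and has no way to tell that it was once a different one.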

Messaging Server provides facilities to automatically encode such messages so that troublesome eight-bit characters do not appear directly in the message. This encoding can be applied to all messages enqueued to a given channel by specifying the sevenbit keyword. A channel should be marked eightbit if no such restriction exists.
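As an illustration of what such an encoding achieves, the following Python sketch uses the standard quopri module to apply quoted-printable, one of the MIME content transfer encodings. Which encoding the MTA actually selects for a given message is not stated here, so treat the choice of quoted-printable as an assumption made for the example.

```python
import quopri

body = "Café au lait\n".encode("iso-8859-1")
encoded = quopri.encodestring(body)

# After encoding, no byte has the high bit set, so a seven-bit
# transport can carry the message without damaging it.
print(all(b < 0x80 for b in encoded))

# The receiving side can recover the original eight-bit data exactly.
print(quopri.decodestring(encoded) == body)
```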

The SMTP protocol disallows eight-bit data unless the remote SMTP server explicitly states that it supports the SMTP extension allowing eight-bit data. Some transports, such as extended SMTP, support a form of negotiation to determine whether eight-bit characters can be transmitted. The eightnegotiate keyword, whose use is strongly recommended, instructs the channel to encode messages when negotiation fails. This is the default for all channels; channels that do not support negotiation simply assume that the transport is capable of handling eight-bit data.
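The negotiation decision can be sketched in Python as follows. The sketch assumes that the extension in question is advertised with the 8BITMIME keyword in the server's EHLO reply; the function names and the sample replies are hypothetical, and the real channel logic may take additional factors into account.

```python
def advertises_8bitmime(ehlo_reply: str) -> bool:
    """Return True if an EHLO reply lists the 8BITMIME extension."""
    for line in ehlo_reply.splitlines():
        # Reply lines look like "250-KEYWORD params" or "250 KEYWORD params".
        if line[:3] == "250" and len(line) > 4:
            fields = line[4:].split()
            if fields and fields[0].upper() == "8BITMIME":
                return True
    return False


def must_encode(body: bytes, ehlo_reply: str) -> bool:
    """Encode only when the message has eight-bit data and negotiation fails."""
    has_eight_bit = any(b > 127 for b in body)
    return has_eight_bit and not advertises_8bitmime(ehlo_reply)


reply_with = "250-mail.example.com\r\n250-8BITMIME\r\n250 SIZE 1000000"
reply_without = "250-mail.example.com\r\n250 SIZE 1000000"

print(must_encode(b"caf\xe9", reply_with))     # False: sent as eight-bit
print(must_encode(b"caf\xe9", reply_without))  # True: encoded before sending
print(must_encode(b"cafe", reply_without))     # False: nothing to encode
```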

The eightstrict keyword tells Messaging Server to reject any incoming message whose headers contain illegal eight-bit data.
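The check that eightstrict implies can be approximated with the following Python sketch. It is a simplified model, not the server's implementation: the function headers_have_eight_bit is a hypothetical name, and the sketch simply treats everything before the first empty line as the header section.

```python
def headers_have_eight_bit(raw_message: bytes) -> bool:
    """Return True if any byte in the header section has the high bit set."""
    normalized = raw_message.replace(b"\r\n", b"\n")
    # The header section ends at the first empty line.
    headers, _, _body = normalized.partition(b"\n\n")
    return any(b > 127 for b in headers)


bad = b"Subject: Caf\xe9\r\nFrom: a@example.com\r\n\r\nbody text"
ok = b"Subject: Cafe\r\nFrom: a@example.com\r\n\r\nCaf\xe9 in the body"

print(headers_have_eight_bit(bad))  # True: such a message would be rejected
print(headers_have_eight_bit(ok))   # False: eight-bit data is only in the body
```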