STREAMS Programming Guide

EUC Handling in `ldterm`

Moving post-processing (the o_flags) off the host processor is not recommended unless the board software is prepared to handle international (EUC) character sets properly, because post-processing must take the EUC information into account. ldterm(7M) knows the screen width of characters (that is, how many columns are taken by characters from each given code set on the current physical display), and it takes this width into account when calculating tab expansions. When multi-byte or multi-column characters are in use, ldterm handles tab expansion itself (when TAB3 is set) and does not leave this handling to a lower module or driver.

By default, multi-byte handling by ldterm is turned off. When ldterm receives an EUC_WSET ioctl(2), it turns multi-byte processing on only if that is essential to handle the indicated code set properly. Thus, if you use single-byte 8-bit codes and have no special multi-column requirements, the special multi-column processing is not used at all. This means that multi-byte processing does not reduce the processing speed or efficiency of ldterm unless it is actually used.

The following describes how the EUC handling in ldterm works:

First, the multi-byte and multi-column character handling is only enabled when the EUC_WSET ioctl indicates that one of the following conditions is met:

The code set consists of characters that are more than one byte long (counting the SS2 or SS3 byte)
The code set requires more than one column per character to display on the current device, as indicated in the EUC_WSET structure
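The two enabling conditions can be sketched as a simple predicate. This is only an illustration, not the code ldterm actually runs: the structure euc_widths and the function needs_euc_handling are hypothetical names that mirror the eucw (byte width) and scrw (screen width) arrays used in the example later in this section. The sketch also assumes that the widths for code sets 2 and 3 do not count the leading SS2 or SS3 byte, so any nonzero width there already means a multi-byte character.

```c
#include <stdbool.h>

/* Hypothetical mirror of the width information carried by EUC_WSET. */
struct euc_widths {
	unsigned char eucw[4];	/* bytes per character, code sets 1..3 */
	unsigned char scrw[4];	/* screen columns per character, code sets 1..3 */
};

/*
 * Return true if any code set needs multi-byte or multi-column handling.
 * Assumption: eucw[2] and eucw[3] exclude the SS2/SS3 byte, so one
 * extra byte is counted for those code sets when they are in use.
 */
bool
needs_euc_handling(const struct euc_widths *w)
{
	for (int cs = 1; cs <= 3; cs++) {
		int bytes = w->eucw[cs];

		if (cs >= 2 && bytes > 0)
			bytes++;		/* count the single-shift byte */
		if (bytes > 1 || w->scrw[cs] > 1)
			return true;
	}
	return false;
}
```

With a plain single-byte, single-column code set 1 and no code sets 2 or 3, the predicate is false, which corresponds to the case where ldterm leaves the special processing off.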

Assuming that one or more of the previous conditions exists, EUC handling is enabled. At this point, a parallel array (see the ldterm_mod structure) used for other information is allocated, and a pointer to it is stored in t_eucp_mp. The parallel array that it holds is pointed to by t_eucp.

The t_codeset field holds the flag that indicates which of the code sets is currently being processed on the read side. When a byte with the high bit set arrives, it is checked to see if it is SS2 or SS3. If it is, the character belongs to code set 2 or 3, respectively. Otherwise, the byte comes from code set 1. Once the extended code set flag has been set, the input processor retrieves the subsequent bytes, as they arrive, to build one multi-byte character. The counter field t_eucleft tells the input processor how many bytes remain to be read for the current character.

The parallel array t_eucp holds the display width of each logical character in the canonical buffer. During erase processing, positions in the parallel array are consulted to determine how many backspaces need to be sent to erase each logical character. (In canonical mode, one backspace of input erases one logical character, no matter how many bytes or columns that character consumes.) This greatly simplifies erase processing for EUC.
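The read-side bookkeeping described above can be sketched in user-level C. This is a simplified model, not ldterm source: the structure euc_state and the functions euc_input and euc_erase are hypothetical, the fields codeset, left, and width only loosely correspond to t_codeset, t_eucleft, and t_eucp, and the width array keeps one entry per logical character. The sketch assumes the standard EUC single-shift values (SS2 is 0x8e, SS3 is 0x8f) and assumes that eucw[2] and eucw[3] do not count the single-shift byte.

```c
#include <stddef.h>

#define	SS2		0x8e	/* single-shift 2: introduces code set 2 */
#define	SS3		0x8f	/* single-shift 3: introduces code set 3 */
#define	MAXCHARS	64	/* arbitrary buffer size for this sketch */

/* Hypothetical per-stream state for the sketch. */
struct euc_state {
	unsigned char eucw[4];	/* bytes per character (SS2/SS3 not counted) */
	unsigned char scrw[4];	/* screen columns per character */
	int codeset;		/* code set of the character being built */
	int left;		/* bytes still expected for that character */
	unsigned char width[MAXCHARS];	/* column width per logical character */
	size_t nchars;		/* logical characters buffered so far */
};

/* Feed one input byte; return 1 when it completes a logical character. */
int
euc_input(struct euc_state *s, unsigned char c)
{
	if (s->left == 0) {		/* first byte of a new character */
		if (!(c & 0x80))
			s->codeset = 0;	/* 7-bit byte: primary code set */
		else if (c == SS2)
			s->codeset = 2;
		else if (c == SS3)
			s->codeset = 3;
		else
			s->codeset = 1;
		s->left = s->eucw[s->codeset];
		if (s->codeset >= 2)
			s->left++;	/* count the single-shift byte itself */
		if (s->left == 0)
			s->left = 1;	/* undefined code set: treat as 1 byte */
	}
	s->left--;
	if (s->left > 0)
		return 0;		/* more bytes still to come */
	if (s->nchars < MAXCHARS)
		s->width[s->nchars++] = s->scrw[s->codeset];
	return 1;
}

/* Erase the last logical character; return the backspaces to send. */
int
euc_erase(struct euc_state *s)
{
	if (s->nchars == 0)
		return 0;
	return s->width[--s->nchars];
}
```

Because the width is recorded once per logical character, erase never has to re-parse the bytes in the buffer; it only pops the last width and emits that many backspaces.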

The t_maxeuc field holds the maximum length, in memory bytes, of the EUC character mapping currently in use. The eucwioc field is a substructure that holds information about each extended code set.

The t_eucign field aids in output post-processing (tab expansion). When characters are output, ldterm(7M) keeps a column counter that indicates where the cursor is supposed to be. When it sends the first byte of an extended character, it adds the number of columns required for that character to the output column. It then subtracts one from the total width in memory bytes of that character and stores the result in t_eucign. This field tells ldterm(7M) how many subsequent bytes to ignore for the purposes of column calculation. (ldterm(7M) calculates the appropriate number of columns when it sees the first byte of the character.)
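The column accounting can be illustrated with a small sketch. It is not taken from ldterm: the structure out_state and the functions out_byte and tab_spaces are hypothetical, eucign plays the role of t_eucign, and nbytes is assumed to hold the total memory width of a character in each code set, including any SS2 or SS3 byte. Tab stops every eight columns and the handling of newline and carriage return are further simplifying assumptions.

```c
#define	SS2	0x8e	/* single-shift 2: introduces code set 2 */
#define	SS3	0x8f	/* single-shift 3: introduces code set 3 */

/* Hypothetical output-side state for the sketch. */
struct out_state {
	unsigned char nbytes[4];	/* total bytes per character in memory */
	unsigned char scrw[4];		/* screen columns per character */
	int col;			/* current output column */
	int eucign;			/* bytes to skip for column counting */
};

/* Spaces needed to expand a tab at the given column (8-column stops). */
int
tab_spaces(int col)
{
	return 8 - (col & 7);
}

/* Account for one output byte in the column counter. */
void
out_byte(struct out_state *o, unsigned char c)
{
	int cs;

	if (o->eucign > 0) {		/* trailing byte: already counted */
		o->eucign--;
		return;
	}
	if (c == '\t') {
		o->col += tab_spaces(o->col);
		return;
	}
	if (c == '\n' || c == '\r') {
		o->col = 0;
		return;
	}
	if (!(c & 0x80)) {		/* 7-bit character: one column */
		o->col++;
		return;
	}
	cs = (c == SS2) ? 2 : (c == SS3) ? 3 : 1;
	o->col += o->scrw[cs];		/* full width added on the first byte */
	o->eucign = o->nbytes[cs] > 0 ? o->nbytes[cs] - 1 : 0;
}
```

Adding the whole screen width on the first byte and then skipping the remaining bytes keeps the column counter correct even when a tab immediately follows a multi-byte character.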

The field t_eucwarn is a counter for occurrences of bad extended characters. It is mostly useful for debugging. After receiving a certain number of illegal EUC characters (perhaps because of some problem on the line or with declared values), a warning is given on the system console.

There are two relevant files for handling multi-byte characters: euc.h and eucioctl.h. eucioctl.h contains the structure that is passed with EUC_WSET and EUC_WGET calls. The normal way to use this structure is to get CSWIDTH from the locale using a mechanism such as getwidth(3C) or setlocale(3C), copy the values into the structure in eucioctl.h, and send the structure using an I_STR ioctl(2). The EUC_WSET call informs the ldterm(7M) module about the number of bytes in extended characters and how many columns the extended characters from each set consume on the screen. This allows ldterm(7M) to treat multi-byte characters as single units for the purpose of erase processing and to correctly calculate tab expansions for multi-byte characters.

Note -

LC_CTYPE (instead of CSWIDTH) should be used in the environment in SunOS 5 systems.
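As an illustration of copying width values into a structure of this shape, the following sketch parses a width string written in the same "bytes:columns" form that a CSWIDTH value uses, with one pair per code set. It is a hypothetical helper, not a library routine: the names euc_widths and parse_cswidth are invented here, only the fully specified form with all three pairs is accepted, and a real program would normally obtain the widths from the locale instead.

```c
#include <stdio.h>

/* Hypothetical mirror of the eucw/scrw arrays sent with EUC_WSET. */
struct euc_widths {
	unsigned char eucw[4];	/* bytes per character, code sets 0..3 */
	unsigned char scrw[4];	/* screen columns per character */
};

/*
 * Parse "b1:c1,b2:c2,b3:c3" into w.
 * Returns 0 on success, -1 if the string is malformed or out of range.
 */
int
parse_cswidth(const char *s, struct euc_widths *w)
{
	int b[3], c[3];

	if (sscanf(s, "%d:%d,%d:%d,%d:%d",
	    &b[0], &c[0], &b[1], &c[1], &b[2], &c[2]) != 6)
		return -1;

	for (int i = 0; i < 3; i++) {
		if (b[i] < 0 || b[i] > 255 || c[i] < 0 || c[i] > 255)
			return -1;
		w->eucw[i + 1] = (unsigned char)b[i];
		w->scrw[i + 1] = (unsigned char)c[i];
	}
	/* Code set 0 (ASCII) is always one byte and one column. */
	w->eucw[0] = w->scrw[0] = 1;
	return 0;
}
```

For example, the string "2:2,1:1,2:2" fills in a two-byte, two-column code set 1, a one-byte, one-column code set 2, and a two-byte, two-column code set 3.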

The file euc.h has the structure with fields for EUC width, screen width, and wide character width. The following functions are used to set and get EUC widths (these functions assume the environment where the eucwidth_t structure is needed and available):

Example 14-1 EUC

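#include <stdio.h>
#include <unistd.h>
#include <stropts.h>			/* struct strioctl, I_STR */
#include <euc.h>			/* eucwidth_t */
#include <sys/eucioctl.h>		/* struct eucioc, EUC_WSET, EUC_WGET */

extern void fail(void);			/* error handler, supplied elsewhere */

struct eucioc eucw;			/* for EUC_WSET/EUC_WGET to line discipline */
eucwidth_t width;			/* return struct from _getwidth() */

/*
 * set_euc	Send EUC code widths to line discipline.
 */
void
set_euc(struct eucioc *e)
{
	struct strioctl sb;

	sb.ic_cmd = EUC_WSET;
	sb.ic_timout = 15;		/* seconds to wait for the acknowledgment */
	sb.ic_len = sizeof (struct eucioc);
	sb.ic_dp = (char *)e;

	if (ioctl(0, I_STR, &sb) < 0)	/* descriptor 0: the terminal on stdin */
		fail();
}

/*
 * euclook	Get current EUC code widths from line discipline.
 */
void
euclook(struct eucioc *e)
{
	struct strioctl sb;

	sb.ic_cmd = EUC_WGET;
	sb.ic_timout = 15;		/* seconds to wait for the acknowledgment */
	sb.ic_len = sizeof (struct eucioc);
	sb.ic_dp = (char *)e;

	if (ioctl(0, I_STR, &sb) < 0)
		fail();

	/* Print byte width and screen width for code sets 1 through 3. */
	printf("CSWIDTH=%d:%d,%d:%d,%d:%d\n",
	    e->eucw[1], e->scrw[1],
	    e->eucw[2], e->scrw[2],
	    e->eucw[3], e->scrw[3]);
}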

For more detailed descriptions, see System Interface Guide.
