Multi-Byte Support

Introduction

This chapter describes how the behavior of commands and libraries changes for internationalization, specifically multi-byte support, in the following areas:

- mio and vuform support for multi-byte characters (EUC support)
- Regular expression support for multi-byte characters (EUC support)

EUC stands for Extended UNIX Code.

Multi-byte support for the Data Entry System

The TUXEDO System provides character-terminal-based user interfaces such as mio and vuform. This section describes the internationalization of the Data Entry System.

UFORM

This section describes the internationalization of statements in the UFORM form-description language. There are five types of UFORM statement: #SERVICE, #FORM, #PAGE, Basic Descriptor, and Extended Descriptor.

Multi-byte characters in UFORM

Table 1 indicates whether or not multi-byte characters are allowed in each field.

Table 1: Fields in UFORM Where Multi-byte Format is Allowed

UFORM Statements                  ASCII Only   Multi-byte Characters

#SERVICE
  NAME                            X

#FORM
  STATUSLINE                      X
  FLAGS                           X
  TRANSMODE                       X
  TRANTIME                        X
  FIRSTVAL                        X
  LASTVAL                         X

#PAGE
  STATUSLINE                      X
  FLAGS                           X

#Basic Descriptor Statement
  ROW                             X
  COL                             X
  LINES                           X
  FLAGS                           X
  MIN                             X
  VALUE                                        X

#Extended Descriptor Statements
  HELP                                         X
  ERR                                          X
  LIT                             X
  STRING                          X
  MENU                            X
  FORMEXIT                        X
  VAL                                          X
  FUNC                            X

In the Basic Descriptor Statement, MIN is the minimum acceptable length of input and WIDTH is the number of character positions on each line of the field. Defining WIDTH as "the number of character positions" is ambiguous for multi-byte characters, because the screen width of a character and the length of its character code can differ. When multi-byte characters are involved, WIDTH is therefore treated as screen width.

The character code length can be larger than the screen width. For example, when the input is Japanese KATAKANA, one KATAKANA character occupies one column on the screen but two bytes in the buffer. An application program that uses multi-byte characters must therefore size its buffers by character code length rather than by screen width, so that every character can be stored and clients and servers can communicate correctly.

Figure 1 is an example of a UFORM statement with multi-byte characters. Here, AA, BB, CC, DD, EE, FF, XX, YY, and ZZ represent multi-byte characters.

Fig. 1: A UFORM statement with multi-byte characters

#SERVICE NAME=KUFORM
#FORM STATUSLINE=20 LASTVAL=crosschk()
*ROW  COL MIN LINES WIDTH FLAGS VALUE
  3   12  -    1     -     L    "AABBCCDDEEFF"
  5   10  -    1     -     L    "XXYYZZ"
  -   29  7    1     7     IUu   ID
VAL=RE:[0-9]{7}
HELP="AABBCCDDEE"
ERR="XXYYZZ"

Multi-byte Input Validation

The VAL statement lets users specify input validation, as indicated in the following table. However, the only type that supports multi-byte characters is RE.

Table 2: Input Validation on VAL

Type      Description

ALPHA     Only input with alphabetic characters is acceptable
ALNUM     Only input with alphanumeric characters is acceptable
LIT       Only input that matches one of the literal strings in argument is valid
RE        Only input matching the regular expression in argument is valid
INTEGER   Any integer is valid
IR        Only integers in a range specified in argument are valid
NUMERIC   Any numeric value is valid
NR        Only numbers in a range specified in argument are valid
FUNC      A user-written function is called to validate the data entered in the associated field

The FUNC type checks input data with application-defined functions, so users can validate input with functions of their own.

Users who need to validate multi-byte input characters can use this capability to supply user-defined functions.

For example, in a Japanese banking application that manages customer information, the pronunciation of the customer's name and address may be entered in KATAKANA, and the real name and address may be entered in KANJI.

Figure 2 gives sample code for a chkKANJI() function that validates whether the input data is KANJI. Here, KANJI means all two-byte characters defined by JIS X0208.

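Fig. 2: Sample code for a user-defined function to check KANJI input

#include <stdio.h>
#include <stdlib.h>
#include <libw.h>
#include <ctype.h>
#include <widec.h>
#include <wctype.h>
#include <limits.h>
#include <nl_types.h>
#include <fml.h>

/* Nonzero if c is a non-ASCII wide character of class _E9.  _ctmp_,
 * _iswctype() and _E9 are assumed to come from the platform headers. */
#define isWCHAR9(c)  ((_ctmp_=(c)) > 127 ? _iswctype(_ctmp_,_E9) : 0)

extern nl_catd valfCATD;    /* message catalog descriptor, opened elsewhere */

int
#ifdef _TMPROTOTYPES
chkKANJI(char *val, char **err, char *usrarg, int flag, FLDID *fldid, int occno)
#else
chkKANJI(val, err, usrarg, flag, fldid, occno)
char    *val;
char    **err;
char    *usrarg;
int     flag;
FLDID   *fldid;
int     occno;
#endif
{
    register char *vp;
    wchar_t wc;
    int     mbl;    /* byte length of the current character */
    int     mbs;    /* byte offset of the current character in val */

    vp = val;
    /* Walk the input one character, not one byte, at a time. */
    for (mbs = 0; *(vp + mbs) != '\0'; mbs += mbl) {
        /* Reject invalid multi-byte sequences and non-KANJI characters. */
        if ((mbl = mbtowc(&wc, vp + mbs, MB_LEN_MAX)) == -1 || !isWCHAR9(wc)) {
            *err = catgets(valfCATD, 1, 100, "ONLY KANJI CHARACTERS ALLOWED");
            return(-1);
        }
    }
    return(1);
}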
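Where the platform-specific classification macro used in Figure 2 is not available, a similar check can be approximated at the byte level. The following is a minimal sketch, assuming the input is encoded in EUC-JP, where JIS X0208 characters (code set 1) are two bytes that each fall in the range 0xA1 to 0xFE. The function name all_codeset1_eucjp() is hypothetical. The sketch tests byte ranges only, so it also accepts unassigned positions within that range.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every character in val is a two-byte EUC-JP code set 1
 * character, and -1 otherwise (same return convention as Figure 2).
 * Assumption: val is EUC-JP; only byte ranges are checked. */
int all_codeset1_eucjp(const char *val)
{
    const unsigned char *p = (const unsigned char *)val;

    assert(val != NULL);
    while (*p != '\0') {
        /* A terminating null in p[1] fails the range test, so a
         * truncated character is rejected without reading past it. */
        if (p[0] < 0xA1 || p[0] > 0xFE || p[1] < 0xA1 || p[1] > 0xFE)
            return -1;
        p += 2;
    }
    return 1;
}
```

Because it does not consult the locale, this version rejects ASCII, half-width KATAKANA (SS2 prefix), and code set 3 characters (SS3 prefix) purely by their lead bytes.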

Figure 3 shows a UFORM statement that uses the above function.

Fig. 3: UFORM statement that names `chkKANJI()` as a VAL function

         VAL=FUNC: chkKANJI()

mio(1) and vuform(1)

Internationalization requires that multi-byte characters can be input and output in the same way as single-byte characters. The commands that handle characters on the screen are mio and vuform, so the field operation commands of mio and the layout, command, and attribute modes of vuform must be internationalized.

The following must be considered: checking the type of a character; cursor movement; insertion, replacement, and deletion of characters; processing at the last position of a field or line when characters are inserted or deleted; checking of word boundaries; and the difference between screen width and character code length.

mio and attribute mode of vuform operation

The attribute mode commands of vuform are the same as those used in mio to interact with a form.

Here, AA, BB, CC, DD, EE, FF represent multi-byte characters. They occupy two columns per character in the following examples.

Character Input
- Non-Multi-Byte Processing: All characters on the keyboard, other than those entered with the CTRL or ESC key, have no special meaning and are used to insert data into unprotected fields. The input character overwrites the character at the cursor byte by byte (column by column). After one byte is overwritten, the cursor moves to the next column position. Multi-byte characters cannot be entered correctly, because the commands process characters one byte at a time.
- Multi-Byte Processing: All characters on the keyboard, other than those entered with the CTRL or ESC key, have no special meaning and are used to insert data into unprotected fields. The input character overwrites the character at the cursor character by character. After one character is overwritten, the cursor moves to the first column position of the next character.

Cursor movement, left or right (CTRL-h, CTRL-l)
- Non-Multi-Byte Processing: move one column to the left or right
- Multi-Byte Processing: move one character to the left or right. The cursor is positioned at the first column of the character.

Cursor movement, up or down (CTRL-k, CTRL-j)
- Non-Multi-Byte Processing: move up or down one line in the same column position
- Multi-Byte Processing: move up or down one line to the first column of the character that occupies the same column position

For example, assume a multi-line field of 2 lines with 3 columns per line. The user has entered the data AAbcDD, which appears on the screen as shown in Figure 8:

Fig. 8: Cursor movement in multi-line field

		AAb
		cDD

If the user enters CTRL-j when the cursor is on 'b', the cursor will move to the first column of 'DD'.
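The rule that the cursor always lands on the first column of a character can be sketched as follows. This is an illustration only, not the mio implementation: it assumes EUC-JP data with the same byte lengths and column widths as in the earlier discussion of WIDTH, and the function name first_col_of_char() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the first column (0-based) of the character that covers
 * column col in one line of EUC-JP text, or -1 if col lies beyond the
 * end of the line.  Assumption: ASCII and half-width KATAKANA take one
 * column, all other characters take two. */
int first_col_of_char(const char *line, int col)
{
    const unsigned char *p = (const unsigned char *)line;
    int start = 0;              /* first column of the current character */

    assert(line != NULL && col >= 0);
    while (*p != '\0') {
        int len = (*p == 0x8F) ? 3 : (*p == 0x8E || *p >= 0xA1) ? 2 : 1;
        int w = (*p < 0x80 || *p == 0x8E) ? 1 : 2;

        if (col < start + w)
            return start;       /* col falls inside this character */
        start += w;
        /* Skip the bytes of this character, stopping at a truncation. */
        while (len-- > 0 && *p != '\0')
            p++;
    }
    return -1;
}
```

In the situation of Figure 8, the cursor on 'b' is in the third column (column 2 when counting from 0). On the second line that column is the second half of DD, so the function returns column 1, the first column of DD.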

Delete one character (CTRL-u)
- Non-Multi-Byte Processing: delete one byte (one column)
- Multi-Byte Processing: delete one character

Insert a character (CTRL-c)
- Non-Multi-Byte Processing: insert one space
- Multi-Byte Processing: no modification (insert one space)

After an insertion, there may not be enough space left to display the last character of the field. In that case the last character cannot be entered. In a multi-line field, a character that no longer fits at the end of a line is displayed at the beginning of the next line.

Figure 10 shows a multi-line field made up of several lines of 9 columns each.

Fig. 10: Inserting a character in a multi-line field

	aAABBCCDD    ->    aAA BBCC    (CTRL-c entered at the 4th column)
	EEFF               DDEEFF
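The wrapping behavior in Figure 10 can be reproduced with a short sketch that lays characters out on fixed-width lines without ever splitting a character across two lines. It assumes EUC-JP data and the column widths used earlier in this chapter. The function name wrap_field() and its interface are hypothetical and do not describe how mio is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Lays text out on lines of `width` columns.  used[i] receives the
 * number of columns occupied on line i.  Returns the number of lines
 * used, or -1 if the text does not fit in max_lines lines.
 * Assumption: EUC-JP input; a character is never split across lines. */
int wrap_field(const char *text, int width, int *used, int max_lines)
{
    const unsigned char *p = (const unsigned char *)text;
    int line = 0;

    assert(text != NULL && used != NULL && width >= 2 && max_lines > 0);
    used[0] = 0;
    while (*p != '\0') {
        int len = (*p == 0x8F) ? 3 : (*p == 0x8E || *p >= 0xA1) ? 2 : 1;
        int w = (*p < 0x80 || *p == 0x8E) ? 1 : 2;

        if (used[line] + w > width) {   /* character does not fit here */
            if (++line == max_lines)
                return -1;              /* no room left in the field */
            used[line] = 0;
        }
        used[line] += w;
        while (len-- > 0 && *p != '\0')
            p++;
    }
    return line + 1;
}
```

With a width of 9, the text before the insertion fills 9 and 4 columns on its two lines. After a space is inserted at the 4th column, DD no longer fits in the single column left on the first line, so the lines occupy 8 and 6 columns, as in the figure.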

The layout and command modes of vuform

Insert and add character (i, a)
- Non-Multi-Byte Processing: Inserting and adding are performed one byte at a time. When a character is input at the end of the line, the last input character can be displayed.
- Multi-Byte Processing: Inserting and adding are performed one character at a time. However, if there is not enough space to put the input character on the screen, the character cannot be entered.

Cursor movement (h, ^h, l, j, k, SPACE, 0, ^, $, fc, Fc, TAB, e, E)
- Non-Multi-Byte Processing: move the cursor to the specified position, or move the cursor forward or backward one space.
- Multi-Byte Processing: If the specified position is not the first column of a character, move the cursor to the first column of that character. When moved left or right, the cursor is always placed in the first column of a character.

Delete a character (x)
- Non-Multi-Byte Processing: delete one byte
- Multi-Byte Processing: delete one character

Word boundary (cw, w, b, e, dw)
- Non-Multi-Byte Processing: A word is defined to be a continuous sequence of alphanumeric characters, or a continuous string of punctuation characters.
- Multi-Byte Processing: A word is defined to be a continuous sequence of alphanumeric characters, or a continuous string of punctuation characters. Whether a character is alphanumeric or punctuation is determined by the current locale.
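The word definition above can be sketched with the standard C wide-character classification functions, which consult the LC_CTYPE category of the current locale. This is an illustrative sketch, not the vuform implementation; the names classify() and word_end() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

enum word_class { WCLS_OTHER, WCLS_ALNUM, WCLS_PUNCT };

/* Classify one wide character according to the current locale. */
static enum word_class classify(wchar_t wc)
{
    if (iswalnum((wint_t)wc))
        return WCLS_ALNUM;
    if (iswpunct((wint_t)wc))
        return WCLS_PUNCT;
    return WCLS_OTHER;
}

/* Returns the index one past the end of the word that begins at
 * s[start]: a run of alphanumeric characters or a run of punctuation
 * characters.  Returns start if s[start] is not part of a word.
 * The caller must ensure that start does not exceed the string length. */
size_t word_end(const wchar_t *s, size_t start)
{
    enum word_class cls;
    size_t i = start;

    assert(s != NULL);
    cls = classify(s[start]);
    if (cls == WCLS_OTHER)
        return start;
    while (s[i] != L'\0' && classify(s[i]) == cls)
        i++;
    return i;
}
```

A program that wants locale-specific classification must first call setlocale(LC_CTYPE, ""); otherwise the default "C" locale applies and typically only ASCII letters, digits, and punctuation are recognized.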

Regular Expressions

The TUXEDO System has its own functions to support regular expressions. Some of the commands and libraries in the TUXEDO System call these functions to perform pattern matching.

The characters to be matched by a regular expression can be single-byte characters from the primary code set, or single-byte and multi-byte characters from the supplementary code sets. The pattern processing of regular expressions is performed on a character, rather than byte, basis to allow matching of characters from multi-byte EUC sets. Character strings that contain single-byte or multi-byte characters assigned to the supplementary code sets are matched or substituted in the same way as single-byte ASCII character strings.

The following are regular expression related commands and libraries.

mc
mc correctly compiles UFORM source files that include regular expressions containing multi-byte characters.
For example, UFORM can define the regular expression:
VAL=RE:[AA-CC]*
Here, AA, BB, and CC represent multi-byte characters. This defines a field whose value is a character string in which each character falls in the range AA through CC (for example AA, BB, or CC).
mio
mio handles masks that include the input validation of an internationalized regular expression for one or more fields.
rex
rex is able to process internationalized regular expressions. These regular expressions may include multi-byte characters, locale-specific collation operations, and internationalized character classification operands.
vuform
vuform is a visual form editor that creates the masks. vuform has the following pattern matching capabilities that accept internationalized regular expressions.
- /pattern search forward for field matching pattern
- ?pattern search backward for field matching pattern
Fboolco
This function accepts internationalized regular expressions.
Fboolev
This function searches the fielded buffer for input matching an internationalized regular expression.
Fboolpr
This function prints Boolean expressions containing internationalized regular expressions.
Ffindocc
This function can successfully search for string field values using internationalized regular expressions.

The following features, added or changed for internationalization, are also supported.

Special characters
The special characters used in a regular expression must be ASCII characters from code set 0. If, for example, a caret (^) or period (.) from a supplementary code set is included in a specification, it does not function as a special character; it is processed as an ordinary character without any special meaning.
Range specifications
Range specifications can be used with both metacharacters and regular expressions to match specific character ranges. This facility is useful when handling only a small subset of the defined characters. A range specification matches all characters between, and including, the specified range limits. The range limits are separated by an ASCII minus sign (-) and enclosed in ASCII square brackets.
A range expression whose limits come from different code sets is flagged as an error. Ordering within a range is controlled by the current collation locale.
The special characters in regular expressions, such as [, ], -, (, ), \, ^, $, *, ., |, +, and ?, must be ASCII characters. Although certain supplementary character sets may define additional character codes for these characters, only the ASCII character codes may be used if the special meaning of the character is to be retained.
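The range rules can be illustrated with a simplified sketch. It assumes EUC-JP data, decodes each character into a code set number and a code value, and rejects a range whose limits come from different code sets. Unlike the behavior described above, it orders characters by raw code value rather than by the current collation locale, and it treats a reversed range as an error; both are simplifying assumptions. The names euc_decode() and in_range() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Decodes the EUC-JP character at p into a code set (0 to 3) and a
 * code value.  Returns its byte length, or 0 if it is empty or malformed. */
static int euc_decode(const unsigned char *p, int *set, unsigned long *code)
{
    if (p[0] < 0x80) {                      /* code set 0: ASCII */
        *set = 0;
        *code = p[0];
        return p[0] != '\0' ? 1 : 0;
    }
    if (p[0] == 0x8E) {                     /* code set 2: SS2 + 1 byte */
        if (p[1] < 0xA1)
            return 0;
        *set = 2;
        *code = p[1];
        return 2;
    }
    if (p[0] == 0x8F) {                     /* code set 3: SS3 + 2 bytes */
        if (p[1] < 0xA1 || p[2] < 0xA1)
            return 0;
        *set = 3;
        *code = ((unsigned long)p[1] << 8) | p[2];
        return 3;
    }
    if (p[0] >= 0xA1 && p[1] >= 0xA1) {     /* code set 1: 2 bytes */
        *set = 1;
        *code = ((unsigned long)p[0] << 8) | p[1];
        return 2;
    }
    return 0;
}

/* Tests the first character of ch against the range [lo-hi].
 * Returns 1 on a match, 0 on no match, and -1 on error (malformed
 * input, limits from different code sets, or a reversed range). */
int in_range(const char *lo, const char *hi, const char *ch)
{
    int slo, shi, sch;
    unsigned long clo, chi, cch;

    assert(lo != NULL && hi != NULL && ch != NULL);
    if (!euc_decode((const unsigned char *)lo, &slo, &clo) ||
        !euc_decode((const unsigned char *)hi, &shi, &chi) ||
        !euc_decode((const unsigned char *)ch, &sch, &cch))
        return -1;
    if (slo != shi || clo > chi)
        return -1;
    return sch == slo && cch >= clo && cch <= chi;
}
```

Comparison is done per character rather than per byte, which is what allows a two-byte range limit to be matched against a two-byte input character as a single unit.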
