Multi-Byte Support

Introduction

This chapter describes the changes in the behavior of commands and libraries for internationalization, specifically multi-byte support, in the following areas.

In this chapter, EUC stands for Extended UNIX Code.

Multi-byte support for the Data Entry System

The TUXEDO System provides character terminal based user interfaces such as mio and vuform. This section describes the internationalization of the Data Entry System.

UFORM

The following sections describe the internationalization of statements in the UFORM form-description language. There are five types of UFORM statement: #SERVICE, #FORM, #PAGE, Basic Descriptor, and Extended Descriptor.

Multi-byte characters in UFORM

Table 1 indicates whether or not multi-byte characters are allowed in each field.

Table 1: Fields in UFORM Where Multi-byte Format is Allowed

UFORM Statements                  ASCII Only   Multi-byte Characters
#SERVICE
    NAME                              X
#FORM
    STATUSLINE                        X
    FLAGS                             X
    TRANSMODE                         X
    TRANTIME                          X
    FIRSTVAL                          X
    LASTVAL                           X
#PAGE
    STATUSLINE                        X
    FLAGS                             X
Basic Descriptor Statement
    ROW                               X
    COL                               X
    LINES                             X
    FLAGS                             X
    MIN                               X
    VALUE                                               X
Extended Descriptor Statements
    HELP                                                X
    ERR                                                 X
    LIT                               X
    STRING                            X
    MENU                              X
    FORMEXIT                          X
    VAL                                                 X
    FUNC                              X

In the Basic Descriptor Statement, MIN is the minimum acceptable length of input, and WIDTH is the number of character positions on each line of the field. With multi-byte characters, "the number of character positions" is ambiguous, because the screen width of a character and its character code length can differ. When multi-byte characters are used, WIDTH is therefore treated as screen width. The character code length can be larger than the screen width. For example, when the input is Japanese KATAKANA, one KATAKANA character occupies one column on the screen but two bytes in the buffer. An application program that uses multi-byte characters must therefore size its buffers by character code length, not by screen width, so that data is passed correctly between clients and servers.
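The difference between screen width and character code length can be illustrated with a small sketch. It assumes the data is encoded in EUC-JP (an assumption; this chapter only says EUC), and euc_measure() and struct euc_size are hypothetical names for illustration, not TUXEDO functions.

```c
#include <stddef.h>

/* Bytes needed in the data buffer versus columns used on the screen. */
struct euc_size {
    size_t bytes;    /* character code length, excluding the null */
    size_t columns;  /* screen width */
};

/* Hypothetical helper, not a TUXEDO function. Assumes EUC-JP:
 *   0x00-0x7F            ASCII                1 byte,  1 column
 *   0x8E + 1 byte        half-width KATAKANA  2 bytes, 1 column
 *   0x8F + 2 bytes       JIS X 0212           3 bytes, 2 columns
 *   0xA1-0xFE + 1 byte   JIS X 0208           2 bytes, 2 columns
 * Returns 0 on success, -1 on a truncated or invalid sequence. */
int euc_measure(const char *s, struct euc_size *out)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t bytes = 0, columns = 0;

    while (p[bytes] != '\0') {
        unsigned char c = p[bytes];
        size_t len, width, i;

        if (c < 0x80)                    { len = 1; width = 1; }
        else if (c == 0x8E)              { len = 2; width = 1; }
        else if (c == 0x8F)              { len = 3; width = 2; }
        else if (c >= 0xA1 && c <= 0xFE) { len = 2; width = 2; }
        else return -1;

        /* Each trailing byte must be present and in 0xA1-0xFE. */
        for (i = 1; i < len; i++) {
            if (p[bytes + i] < 0xA1 || p[bytes + i] > 0xFE)
                return -1;
        }
        bytes += len;
        columns += width;
    }
    out->bytes = bytes;
    out->columns = columns;
    return 0;
}
```

For two half-width KATAKANA characters, this reports two columns but four bytes, so a field with a WIDTH of 2 would need a buffer of at least five bytes including the terminating null.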

Figure 1 shows an example of a UFORM statement with multi-byte characters. Here, AA, BB, CC, DD, EE, FF, XX, YY, and ZZ represent multi-byte characters.

Fig. 1: A UFORM statement with multi-byte characters

#SERVICE NAME=KUFORM
#FORM STATUSLINE=20 LASTVAL=crosschk()
*ROW  COL MIN LINES WIDTH FLAGS VALUE
  3   12  -    1     -     L    "AABBCCDDEEFF"
  5   10  -    1     -     L    "XXYYZZ"
  -   29  7    1     7     IUu   ID
VAL=RE:[0-9]{7}
HELP="AABBCCDDEE"
ERR="XXYYZZ"

Multi-byte Input Validation

The VAL statement lets users specify input validation, as shown in the following table. However, the only type that supports multi-byte characters is RE.

Table 2: Input Validation on VAL

Type      Description
ALPHA     only input with alphabetic characters is acceptable
ALNUM     only input with alphanumeric characters is acceptable
LIT       only input that matches one of the literal strings in argument is valid
RE        only input matching the regular expression in argument is valid
INTEGER   any integer is valid
IR        only integers in a range specified in argument are valid
NUMERIC   any numeric value is valid
NR        only numbers in a range specified in argument are valid
FUNC      a user-written function is called to validate the data entered in the associated field

FUNC lets an application check input data with its own functions. Users who need to validate multi-byte input characters can use this capability to supply a user-defined function.

For example, in a Japanese banking application system that manages the customer information, the pronunciation of the customer's name and address may be entered in KATAKANA, and the real name and address may be entered in KANJI.

Figure 2 gives sample code for a chkKANJI() function that validates whether the input data is KANJI. Here, KANJI means all two-byte characters defined by JIS X0208.

Fig. 2: Sample code for a user-defined function to check KANJI input

#include <stdio.h>
#include <libw.h>
#include <ctype.h>
#include <widec.h>
#include <wctype.h>
#include <limits.h>
#include <nl_types.h>
#include <fml.h>
#define isWCHAR9(c)  ((_ctmp_=(c)) > 127 ? _iswctype(_ctmp_,_E9) : 0)
extern nl_catd valfCATD;
int
#ifdef _TMPROTOTYPES
chkKANJI(char *val, char **err, char *usrarg, int flag, FLDID *fldid, int occno)
#else
chkKANJI(val, err, usrarg, flag, fldid,occno)
char    *val;
char    **err;
char    *usrarg;
int     flag;
FLDID   *fldid;
int     occno;
#endif
{
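    register char *vp;
    wchar_t wc;
    int     mbl;    /* length in bytes of the current character */
    int     mbs;    /* byte offset of the current character in val */
    vp = val;
    /* Convert one multi-byte character at a time until the terminating null. */
    for (mbs=0; *(vp + mbs) != '\0'; mbs += mbl) {
        /* Reject invalid byte sequences and any character that is not KANJI. */
        if ((mbl = mbtowc(&wc, vp+mbs, MB_LEN_MAX)) == -1 || !isWCHAR9(wc)) {
            *err = catgets(valfCATD, 1, 100, "ONLY KANJI CHARACTERS ALLOWED");
            return(-1);
        }
    }
    return(1);
}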

Figure 3 shows an example of a UFORM statement that uses the above function.

Fig. 3: UFORM statement that names chkKANJI() as a VAL function

         VAL=FUNC: chkKANJI()

mio(1) and vuform(1)

Internationalization requires that multi-byte characters can be input and output in the same way as single-byte characters. The commands that handle characters on the screen are mio and vuform. Both the field operation commands of mio and the layout, command, and attribute modes of vuform must be considered.

The following points must be addressed: checking a character's type, cursor movement, insertion, replacement, and deletion of characters, handling of the last position of a field or line when characters are inserted or deleted, checking of word boundaries, and the difference between screen width and character code length.

mio and attribute mode of vuform operation

The attribute mode commands of vuform are the same as those used in mio to interact with a form.

Here, AA, BB, CC, DD, EE, FF represent multi-byte characters. They occupy two columns per character in the following examples.

  • Character Input

    • Non-Multi-Byte Processing: All characters on the keyboard, except those entered with the CTRL or ESC key, have no special meaning and are used to insert data into unprotected fields. The input character overwrites the character at the cursor byte by byte (column by column). Multi-byte characters cannot be entered correctly because input is processed one byte at a time. After one byte is overwritten, the cursor moves to the next column position.

    • Multi-Byte Processing: All characters on the keyboard, except those entered with the CTRL or ESC key, have no special meaning and are used to insert data into unprotected fields. The input character overwrites the character at the cursor, one whole character at a time. After one character is overwritten, the cursor moves to the first column of the next character.

When the screen width of the input character is smaller than that of the character at the current cursor position, the character that followed the overwritten character is displayed immediately after the input character. Figure 4 shows an example where 'a' is entered on BB. Afterward, the cursor moves to the first column of CC.

Fig. 4: Entering a character smaller than the next character position

        AABBCCDD   ->   AAaCCDD    (input 'a' on BB)

When the screen width of the input character is larger than that of the character at the current cursor position, the character that followed the overwritten character is displayed immediately after the input character, and the cursor moves to the first column of that following character. Figure 5 shows an example where AA is entered on 'b'. The cursor moves to 'c', and 'c' and 'd' are shifted to the 4th and 5th column positions respectively so that 'b' can be replaced by AA (a two-column character). If there is not enough space at the end of the field to display the shifted characters, those characters are removed from the field.

Fig. 5: Entering a character larger than the next character position

        abcd       ->   aAAcd      (input 'AA' at the cursor position of 'b')

In a multi-line field that consists of more than one line, if the input character cannot be displayed at the end of one line for lack of space, it will be displayed at the beginning of the next line.

Figure 6 shows an example of a multi-line field defined as 9 columns per line. If "EEFF" is entered at the 9th column, EE is displayed starting at the first column of the second line rather than at the ninth column of the first line. FF is then displayed in the 3rd and 4th columns of the second line. The data buffer of this field nevertheless holds "AABBCCDDEEFF" unchanged.

Fig. 6: Entering a character at the end of a line in a multi-line field

        AABBCCDD   ->   AABBCCDD   (input "EEFF" at the 9th column)
                        EEFF
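The wrapping rule shown in Figure 6 can be sketched as follows. This is only an illustration of the rule, not the mio implementation; layout_field() is a hypothetical helper, columns are counted from 0, and every character is assumed to be no wider than a line.

```c
#include <stddef.h>

/* Hypothetical helper: assigns each character a (line, column) position in
 * a multi-line field. widths[i] is the screen width of character i.
 * A character that does not fit in the space left on the current line
 * starts at column 0 of the next line. Returns the number of lines used. */
size_t layout_field(const int *widths, size_t n, int line_width,
                    size_t *line_out, int *col_out)
{
    size_t line = 0;
    size_t i;
    int col = 0;

    for (i = 0; i < n; i++) {
        if (col + widths[i] > line_width) {  /* no room left: wrap */
            line++;
            col = 0;
        }
        line_out[i] = line;
        col_out[i] = col;
        col += widths[i];
    }
    return n == 0 ? 0 : line + 1;
}
```

With six two-column characters and 9 columns per line, the first four characters fill columns 0 through 7 of the first line, and the fifth character wraps to the second line because only one column remains. Only the display positions change; the data buffer keeps the characters in their original order.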

Characters which do not fit at the end of a field because of lack of space on screen can be put in the next field.

Figure 7 shows an example of two fields which are defined as 9 columns per field. If "EEFF" is entered at the 9th column, "EEFF" characters are displayed in the next field because of lack of space in the current field.

Fig. 7: Entering characters at the end of a field

        AABBCCDD   ->   AABBCCDD   (input "EEFF" at the 9th column)
                        EEFF

  • cursor movement, left or right (CTRL-h, CTRL-l)

    • Non-Multi-Byte Processing: Move one column to the left or right.

    • Multi-Byte Processing: Move one character to the left or right. The cursor is positioned at the first column of the character.

  • cursor movement, up or down (CTRL-k, CTRL-j)

    • Non-Multi-Byte Processing: Move up or down one line, staying in the same column position.

    • Multi-Byte Processing: Move up or down one line to the first column of the character that occupies the same column position. For example, assume a multi-line field of 2 lines with 3 columns per line. The user has entered the data AAbcDD, which appears on the screen as shown in Figure 8:

Fig. 8: Cursor movement in multi-line field

        AAb
        cDD

If the user enters CTRL-j when the cursor is on 'b', the cursor will move to the first column of 'DD'.
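The snapping rule can be sketched as follows. snap_to_char_start() is a hypothetical helper for illustration, not part of mio or vuform; it counts columns from 0 and assumes target_col is not negative.

```c
#include <stddef.h>

/* Hypothetical helper: widths[i] is the screen width of the i-th character
 * on the destination line. Returns the 0-based column at which the
 * character covering target_col starts, or -1 if target_col lies past the
 * last character of the line. */
int snap_to_char_start(const int *widths, size_t n, int target_col)
{
    int start = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (target_col < start + widths[i])
            return start;       /* target_col falls inside character i */
        start += widths[i];
    }
    return -1;
}
```

In Figure 8, 'b' is in column 2 (counting from 0) of the first line. The second line holds 'c' (width 1) and DD (width 2), so column 2 falls in the second half of DD and the cursor snaps back to column 1, the first column of DD.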

  • delete one character (CTRL-u)

    • Non-Multi-Byte Processing: Delete one byte (one column).

    • Multi-Byte Processing: Delete one character.

In a multi-line field, if the deletion of a character results in space at the end of a line, but the first character on the next line is too big to be displayed in that space, that character remains at the beginning of the next line.

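Figure 9 shows a field that consists of several lines of 9 columns each. After 'a' is deleted, one column is free at the end of the first line, but EE needs two columns and therefore stays at the beginning of the second line.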

Fig. 9: Deleting a character in a multi-line field

        AABBCCDDa  ->   AABBCCDD   (delete 'a' at the 9th column)
        EEFF            EEFF

  • insertion of character (CTRL-c)

    • Non-Multi-Byte Processing: Insert one space.

    • Multi-Byte Processing: No modification (insert one space).

    After an insertion, there may not be enough space to display the last character of the field. In that case, the last character cannot be entered. In a multi-line field, a character that cannot be displayed at the end of a line is displayed starting at the beginning of the next line.

    Figure 10 shows a multi-line field which consists of a few lines of 9 columns per line.

Fig. 10: Inserting a character in a multi-line field

        aAABBCCDD  ->   aAA BBCC   (input 'CTRL-c' at the 4th column)
        EEFF            DDEEFF

The layout and command modes of vuform

  • insert and add character (i, a)

    • Non-Multi-Byte Processing: Inserting and adding are performed one byte at a time. When a character is input at the end of the line, the last input character can be displayed.

    • Multi-Byte Processing: Inserting and adding are performed one character at a time. However, if there is not enough space on the screen for the input character, the character cannot be entered.

  • cursor movement (h, ^h, l, j, k, SPACE, 0, ^, $, fc, Fc, TAB, e, E)

    • Non-Multi-Byte Processing: Move the cursor to the specified position, or move the cursor forward or backward one space.

    • Multi-Byte Processing: If the specified position is not the first column of a character, move the cursor to the first column of that character. When moved left or right, the cursor is always placed in the first column of a character.

  • delete a character (x)

    • Non-Multi-Byte Processing: Delete one byte.

    • Multi-Byte Processing: Delete one character.

  • Word boundary (cw, w, b, e, dw)

    • Non-Multi-Byte Processing: A word is defined to be a continuous sequence of alphanumeric characters, or a continuous string of punctuation characters.

    • Multi-Byte Processing: A word is defined to be a continuous sequence of alphanumeric characters, or a continuous string of punctuation characters. Whether a character is alphanumeric or punctuation is determined by the current locale.
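A locale-dependent word scan of this kind can be sketched with the standard C wide-character classification functions. This is an illustrative assumption about how such a check could be written, not the vuform implementation; word_end() and classify() are hypothetical names, and pos is assumed to lie within the string.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Character classes used to split a line into words. */
enum wclass { WC_OTHER, WC_ALNUM, WC_PUNCT };

/* Classification follows the LC_CTYPE category of the current locale. */
static enum wclass classify(wchar_t wc)
{
    if (iswalnum((wint_t)wc)) return WC_ALNUM;
    if (iswpunct((wint_t)wc)) return WC_PUNCT;
    return WC_OTHER;
}

/* Hypothetical helper: returns the index just past the word that starts at
 * s[pos]. A word is a run of alphanumeric characters or a run of
 * punctuation characters. Any other character (for example a space) is
 * treated as a unit of its own. */
size_t word_end(const wchar_t *s, size_t pos)
{
    enum wclass cls;

    if (s[pos] == L'\0')
        return pos;
    cls = classify(s[pos]);
    pos++;
    if (cls == WC_OTHER)
        return pos;
    while (s[pos] != L'\0' && classify(s[pos]) == cls)
        pos++;
    return pos;
}
```

Because the scan works on wchar_t values, a multi-byte character counts as one unit regardless of how many bytes it occupies in the buffer. A caller would normally select the locale with setlocale() and convert the field data with mbstowcs() first.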

Regular Expressions

The TUXEDO System has its own functions to support regular expressions. Some of the commands and libraries in the TUXEDO System call these functions to perform pattern matching.

The characters to be matched by a regular expression can be single-byte characters from the primary code set, or single-byte and multi-byte characters from the supplementary code sets. The pattern processing of regular expressions is performed on a character, rather than byte, basis to allow matching of characters from multi-byte EUC sets. Character strings that contain single-byte or multi-byte characters assigned to the supplementary code sets are matched or substituted in the same way as single-byte ASCII character strings.

The following are regular expression related commands and libraries.

  • mc
    mc can compile UFORM source files that include regular expressions of multi-byte characters correctly.

    For example, UFORM can define the regular expression:

    VAL=RE:[AA-CC]*

    Here, AA, BB, and CC represent multi-byte characters. This defines a field whose value consists of zero or more characters, each of which falls in the range AA through CC, for example AA, BB, or CC.

  • mio
    mio handles masks that include the input validation of an internationalized regular expression for one or more fields.

  • rex
    rex is able to process internationalized regular expressions. These regular expressions may include multi-byte characters, locale-specific collation operations, and internationalized character classification operands.

  • vuform
    vuform is a visual form editor that creates the masks. vuform has the following pattern matching capabilities that accept internationalized regular expressions.
    • /pattern search forward for field matching pattern

    • ?pattern search backward for field matching pattern

  • Fboolco
    This function accepts internationalized regular expressions.

  • Fboolev
    This function searches the fielded buffer for input matching an internationalized regular expression.

  • Fboolpr
    This function prints Boolean expressions containing internationalized regular expressions.

  • Ffindocc
    This function can successfully search for string field values using internationalized regular expressions.

The following features were added or changed for internationalization and are also supported.

  • Special characters
    The special characters used in regular expressions must be ASCII characters from code set 0. If, for example, the caret (^) or period (.) from a supplementary code set appears in a pattern, it does not function as a special character and is processed as an ordinary character without any special meaning.

  • Range specifications
    Range specifications can be used with both metacharacters and regular expressions to match specific character ranges. This facility is useful when handling only a small subset of the defined characters. A range specification matches all characters between, and including, the specified range limits. The range limits are separated by an ASCII minus sign (-) and enclosed in ASCII square brackets.

    A range expression whose limits come from different code sets is flagged as an error. The ordering of characters within a range is controlled by the current collation locale.

    The special characters such as [, ], -, (, ), \, ^, $, *, ., |, +, ? in regular expressions must be ASCII characters. Although certain supplementary character sets may define additional character codes for these characters, only the ASCII character codes may be used if the special meaning of the character is to be retained.
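The behavior of a range specification can be tried out with the POSIX regcomp() and regexec() functions. This is only a stand-in: the TUXEDO System uses its own regular expression functions, whose interface is not shown in this chapter. re_match() is a hypothetical wrapper, and the example uses an ASCII range so that it behaves the same in the default "C" locale.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical wrapper around the POSIX regex functions (not the TUXEDO
 * ones). Returns 1 if text matches pattern, 0 if it does not, and -1 if
 * the pattern cannot be compiled. */
int re_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, re_match("^[a-c]*$", "abcabc") reports a match, while re_match("^[a-c]*$", "abd") does not, because 'd' lies outside the range. The brackets, minus sign, caret, dollar sign, and asterisk in the pattern are all ASCII characters, which is what gives them their special meaning.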