This chapter describes the changes in the behavior of commands and libraries for internationalization, specifically multi-byte support, in the following areas.
mio and vuform support for multi-byte characters (EUC support).
regular expression support for multi-byte characters (EUC support).
Here, EUC stands for Extended UNIX Code.
The TUXEDO System provides character terminal based user interfaces
such as
mio
and
vuform.
This section describes the internationalization of
the Data Entry System.
The internationalization of statements in the
form-description language UFORM is described below.
UFORM statements are of five types:
#SERVICE, #FORM, #PAGE,
Basic Descriptor, and Extended Descriptor.
Table 1 indicates whether or not multi-byte characters are allowed
in each field.
The MIN in the Basic Descriptor Statement is
defined as the minimum acceptable length of input,
and the WIDTH is defined as the number of character
positions for each line of the field.
However, defining WIDTH as "the number of character positions"
is ambiguous for multi-byte characters,
because the screen width of a character and its code length in bytes
may differ.
When dealing with multi-byte characters,
WIDTH is therefore treated as the screen width.
There are cases where the character code length
is larger than the screen width.
For example, a Japanese KATAKANA character
occupies one column on the screen,
but occupies two bytes in the buffer.
An application program that uses multi-byte characters
must therefore allocate buffers sized by character code length
(in bytes) rather than by screen width,
so that all characters can be passed between clients and servers
correctly.
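The sizing rule above can be sketched in C. This is only an illustration and not TUXEDO System code: the function name field_buffer_bytes and its parameters are hypothetical, and the caller is assumed to know the worst-case number of bytes per screen column for the code set in use (2 for the KATAKANA case described above).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the TUXEDO System): returns the number
 * of bytes to allocate for a field that is `width` screen columns wide and
 * `lines` lines long, assuming no character needs more than
 * `max_bytes_per_column` bytes for each column it occupies on screen.
 * One extra byte is reserved for the terminating null character.
 */
size_t field_buffer_bytes(size_t width, size_t lines,
                          size_t max_bytes_per_column)
{
    assert(width > 0 && lines > 0 && max_bytes_per_column > 0);
    return width * lines * max_bytes_per_column + 1;
}
```

For a single-line field of 7 columns that may hold one-column, two-byte KATAKANA, field_buffer_bytes(7, 1, 2) gives 15 bytes, compared with the 8 bytes that would be enough for single-byte data.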
Figure 1 shows an example of a
UFORM statement with multi-byte characters.
Here, AA, BB, CC, DD, EE, XX, YY, ZZ mean multi-byte characters.
The VAL statement lets users specify
input validation, as indicated in Table 2.
However, the only type that supports multi-byte characters is
RE.
The FUNC type checks the input data using application-defined
functions,
which allows users to validate input with their own code.
Users who need to check multi-byte input characters can
use this capability to supply user-defined functions.
For example, in a Japanese banking application system
that manages the customer information,
the pronunciation of the customer's name and address
may be entered in KATAKANA, and the real name and address may be entered
in KANJI.
Figure 2 gives sample code for a
chkKANJI()
function that validates whether the input
data is KANJI or not.
Here KANJI is used to mean all two byte characters as
defined by JIS X0208.
Figure 3 shows an example of
a UFORM statement that uses the above function.
Internationalization requires that multi-byte characters
be input and output in the same way as single-byte
characters.
The commands that handle characters using the screen are
mio
and
vuform.
The internationalization of the field operation commands of
mio
and the layout, command and
attribute modes of
vuform
must be considered.
It is necessary to consider the following: checking of
character type, cursor movement, insertion of characters, replacement of
characters, deletion of characters, the processing of the last position
of field or line when inserting or deleting characters, checking of
word boundaries, and the difference between screen width and
character code length.
The attribute mode commands of
vuform
are the same as those used
in
mio
to interact with a form.
Here, AA, BB, CC, DD, EE, FF
represent multi-byte characters.
They occupy two columns per character in the following examples.
Character Input
Non-Multi-Byte Processing
All keys other than CTRL and ESC
have no special meaning and
are used to enter data
into unprotected fields.
The input character overwrites the character at the cursor byte by byte
(column by column).
Multi-byte characters cannot be entered correctly
because these commands process characters one byte at a time.
After overwriting one byte, the cursor moves to the next column position.
Multi-Byte Processing
All keys other than CTRL and ESC
have no special meaning and
are used to enter data
into unprotected fields.
The input character overwrites the character at the cursor, one whole character at a time.
After overwriting one character, the cursor moves to the first column
of the next character.
When the screen width of the input character is smaller than that of the character
at the current cursor position,
the character that follows the cursor position
is displayed immediately after
the input character.
Figure 4 shows an example where the 'a' character is entered on BB.
After that, the cursor moves to the first column of CC.
When the screen width of the input character is larger than that of the character
at the current cursor position,
the character that follows the cursor position
is displayed immediately after
the input character, and
the cursor moves to the first column of that character.
Figure 5 shows an example where the AA character is
entered on 'b'. The cursor moves to the 'c',
and the 'c' and 'd' are shifted to the 4th and 5th
column position respectively in order for 'b' to be replaced by AA
(a two-column character).
If there is not enough space at the end of the field
to display characters that are shifted,
the character(s) are removed from the field.
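The overwrite behavior described above can be modeled with a short C sketch. It is not the mio implementation: the cell structure, the function names, and the idea of storing a screen width per character are assumptions made for illustration. The sketch replaces one character, lets the following characters shift, and drops characters from the end of the field that no longer fit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one character in a field: a label used only for
 * illustration, and the number of screen columns the character occupies. */
struct cell {
    const char *label;
    int columns;
};

/* Total number of screen columns used by the first n cells. */
static int used_columns(const struct cell *cells, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += cells[i].columns;
    return total;
}

/*
 * Overwrite the character at index `pos` with `input`, then drop characters
 * from the end of the field until the remainder fits in `field_columns`.
 * Returns the new number of characters in the field. The sketch does not
 * treat specially an input character that is itself too wide for the
 * field; in that case the input is dropped along with the overflow.
 */
size_t overwrite_char(struct cell *cells, size_t n, size_t pos,
                      struct cell input, int field_columns)
{
    assert(pos < n);
    cells[pos] = input;
    while (n > 0 && used_columns(cells, n) > field_columns)
        n--;    /* characters shifted past the end are removed */
    return n;
}
```

With the data of Figure 5, overwriting 'b' in "abcd" with the two-column character AA keeps all four characters in a 5-column field, while in a 4-column field the trailing 'd' is removed.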
In a multi-line field,
if the input character cannot be displayed at the end of a line
for lack of space, it is displayed at the beginning of the next
line.
Figure 6 shows an example of a multi-line field
defined as 9 columns per line.
If "EEFF" is entered at the 9th column,
the character 'EE' is displayed starting at the first column
of the second line rather than at the ninth column of the first line.
After that, FF is displayed on the 3rd and 4th columns
on the next line.
However, the data buffer of this field still holds "AABBCCDDEEFF"
unchanged.
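The wrapping rule for multi-line fields can be sketched as follows. The structure and function names are hypothetical, and the sketch only computes screen positions from the screen width of each character; as stated above, the data buffer itself is not changed by the layout.

```c
#include <assert.h>
#include <stddef.h>

/* Screen position of one character: zero-based line and first column. */
struct position {
    int line;
    int column;
};

/*
 * Lay out n characters, whose screen widths are given in `widths`, in a
 * multi-line field with `line_columns` columns per line. A character that
 * does not fit in the space left on a line starts the next line instead.
 * Only the screen positions in `out` are computed.
 */
void layout_field(const int *widths, size_t n, int line_columns,
                  struct position *out)
{
    int line = 0;
    int column = 0;

    for (size_t i = 0; i < n; i++) {
        assert(widths[i] > 0 && widths[i] <= line_columns);
        if (column + widths[i] > line_columns) {
            line++;         /* not enough room: wrap to the next line */
            column = 0;
        }
        out[i].line = line;
        out[i].column = column;
        column += widths[i];
    }
}
```

For the data of Figure 6 (six two-column characters, 9 columns per line), DD ends at the 8th column, EE is placed at the first column of the second line, and FF follows it at the 3rd and 4th columns. The 9th column of the first line stays empty.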
Characters which do not fit at the end of a
field because of lack of space on screen can be put in the next field.
Figure 7 shows an example of two fields
which are defined as 9 columns per field.
If "EEFF" is entered at the 9th column, "EEFF" characters
are displayed
in the next field because of lack of space in the current field.
cursor movement, left or right (CTRL-h, CTRL-l)
Non-Multi-Byte Processing
move one column to the left or right
Multi-Byte Processing
move one character to the left or right. The cursor is
positioned at the first column of the character.
cursor movement, up or down (CTRL-k, CTRL-j)
Non-Multi-Byte Processing
move in the same column position up or down a line
Multi-Byte Processing
move to the first column of the character which is in the same
column position up or down one line.
For example, assume that there is one multi-line field of 2 lines with
3 columns per line. The user has entered data AAbcDD,
which appears on the screen as shown in Figure 8:
If the user enters CTRL-j when the cursor is on 'b', the cursor will move
to the first column of 'DD'.
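The rule that the cursor always lands on the first column of a character can be sketched in C. The function name and the representation of a line as an array of screen widths are assumptions made for illustration, not the actual mio data structures.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the screen widths of the n characters on a line and a zero-based
 * target column, return the first column of the character that covers the
 * target column. If the target lies beyond the last character, it is
 * returned unchanged.
 */
int snap_to_char_start(const int *widths, size_t n, int target_column)
{
    int start = 0;

    assert(target_column >= 0);
    for (size_t i = 0; i < n; i++) {
        if (target_column < start + widths[i])
            return start;   /* target falls inside this character */
        start += widths[i];
    }
    return target_column;
}
```

In the example of Figure 8, 'b' is in the 3rd column (zero-based column 2) of the first line. The second line holds 'c' (one column) and DD (two columns), so moving down with target column 2 yields zero-based column 1, the first column of DD.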
delete one character (CTRL-u)
Non-Multi-Byte Processing
delete one byte (one column)
Multi-Byte Processing
delete one character
In a multi-line field,
if the deletion of a character results in space at the end of a line,
but the first character on the next line is too big to be displayed in that
space, that character remains at the beginning of the next line.
Figure 9 shows an example for a multi-line field
with 9 columns per line.
insertion of character (CTRL-c)
Non-Multi-Byte Processing
insert one space
Multi-Byte Processing
no modification
(insert one space)
When inserting, the last character of the field
may no longer fit on the screen
for lack of space.
In this case the last character cannot be entered.
In a multi-line field, the character that cannot
be displayed is instead displayed at the beginning
of the next line.
Figure 10 shows a multi-line field which consists of a few lines of 9
columns per line.
insert and add character (i, a)
Non-Multi-Byte Processing
Inserting and adding are performed one byte at a time.
When inputting a character at the end of the line,
the last input character can be displayed.
Multi-Byte Processing
Inserting and adding are performed one character at a time.
However, if there is not enough space to display the input character on
the screen, the character cannot be entered.
cursor movement (h, ^h, l, j, k, SPACE, 0, ^, $, fc, Fc, TAB, e, E)
Non-Multi-Byte Processing
move the cursor to the specified position,
move the cursor forward or backward one space.
Multi-Byte Processing
If the specified position is not the first column of a character,
move the cursor to the first column of the character.
The cursor, when moved left or right,
will always be put in the first column.
delete a character (x)
Non-Multi-Byte Processing
delete one byte
Multi-Byte Processing
delete one character
Word boundary (cw, w, b, e, dw)
Non-Multi-Byte Processing
A word is defined to be a continuous sequence of alphanumeric characters,
or a continuous string of punctuation characters.
Multi-Byte Processing
A word is defined to be a continuous sequence of alphanumeric characters,
or a continuous string of punctuation characters.
The determination of what is an alphanumeric or punctuation character
is done in the current locale.
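The word definition above can be illustrated with the standard C wide-character classification functions, which follow the LC_CTYPE category of the current locale. This is a sketch under assumptions: the function names are hypothetical, and the input is assumed to have been converted to wide characters already (for example with mbstowcs() after a call to setlocale()).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Character classes that make up words in this sketch. */
enum char_class { CLASS_OTHER, CLASS_ALNUM, CLASS_PUNCT };

/* Classify one wide character using the current locale (LC_CTYPE). */
static enum char_class classify(wchar_t wc)
{
    if (iswalnum((wint_t)wc))
        return CLASS_ALNUM;
    if (iswpunct((wint_t)wc))
        return CLASS_PUNCT;
    return CLASS_OTHER;
}

/*
 * Count the words in a wide-character string, where a word is a maximal
 * run of alphanumeric characters or a maximal run of punctuation
 * characters. Any other character (such as a space) separates words.
 */
size_t count_words(const wchar_t *s)
{
    size_t words = 0;
    enum char_class prev = CLASS_OTHER;

    assert(s != NULL);
    for (; *s != L'\0'; s++) {
        enum char_class cur = classify(*s);
        if (cur != CLASS_OTHER && cur != prev)
            words++;        /* a new run starts here */
        prev = cur;
    }
    return words;
}
```

Because the classification is delegated to iswalnum() and iswpunct(), the same loop handles single-byte and multi-byte text once the locale has been selected. Which multi-byte characters count as alphanumeric depends on the locale definition.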
The TUXEDO System has its own functions to support regular expressions.
Some of the commands and libraries in TUXEDO System call these functions
to perform pattern matching.
The characters to be matched by a regular expression can be single-byte
characters from the
primary code set, or single-byte and multi-byte characters from the
supplementary code sets.
The pattern processing of regular expressions is performed on a
character, rather than byte, basis to allow matching of characters
from multi-byte EUC sets.
Character strings that contain single-byte or multi-byte characters
assigned to the supplementary code sets are matched or substituted
in the same way as single-byte ASCII character strings.
The following are regular expression related commands and libraries.
mc
For example, UFORM can define the regular expression:
VAL=RE:[AA-CC]*
Here, AA, BB, CC mean multi-byte characters.
This defines a field whose value consists of zero or more characters,
each in the range AA through CC (such as AA, BB, or CC).
/pattern
search forward for field matching pattern
?pattern
search backward for field matching pattern
The following is a specification of features added or changed for
internationalization that are also supported.
Special characters
Range specifications
Characters from different code sets are flagged as an error in range
expressions.
The ordering will be controlled by the current collation locale.
The special characters such as [, ], -, (, ), \\, ^, $,
*, ., |, +, ? in regular expressions must be ASCII characters.
Although certain supplementary character sets may define additional character
codes for these characters, only the ASCII character codes may be used if the
special meaning of the character is to be retained.
Multi-byte support for the Data Entry System
UFORM
Multi-byte characters in UFORM
Table 1: Fields in UFORM Where Multi-byte Format is Allowed

UFORM Statements                    ASCII Only   Multi-byte Characters
#SERVICE
    NAME                            X
#FORM
    STATUSLINE                      X
    FLAGS                           X
    TRANSMODE                       X
    TRANTIME                        X
    FIRSTVAL                        X
    LASTVAL                         X
#PAGE
    STATUSLINE                      X
    FLAGS                           X
#Basic Descriptor Statement
    ROW                             X
    COL                             X
    LINES                           X
    FLAGS                           X
    MIN                             X
    VALUE                                        X
#Extended Descriptor Statements
    HELP                                         X
    ERR                                          X
    LIT                             X
    STRING                          X
    MENU                            X
    FORMEXIT                        X
    VAL                                          X
    FUNC                            X
Fig. 1: A UFORM statement with multi-byte characters
#SERVICE NAME=KUFORM
#FORM STATUSLINE=20 LASTVAL=crosschk()
*ROW COL MIN LINES WIDTH FLAGS VALUE
3 12 - 1 - L "AABBCCDDEEFF"
5 10 - 1 - L "XXYYZZ"
- 29 7 1 7 IUu ID
VAL=RE:[0-9]{7}
HELP="AABBCCDDEE"
ERR="XXYYZZ"
Multi-byte Input Validation
Table 2: Input Validation on VAL

Type       Description
ALPHA      only input with alphabetic characters is acceptable
ALNUM      only input with alphanumeric characters is acceptable
LIT        only input that matches one of the literal strings in argument is valid
RE         only input matching the regular expression in argument is valid
INTEGER    any integer is valid
IR         only integers in a range specified in argument are valid
NUMERIC    any numeric value is valid
NR         only numbers in a range specified in argument are valid
FUNC       a user-written function is called to validate the data entered in the associated field
Fig. 2: Sample code for a user-defined function to check KANJI input
#include <stdio.h>
#include <libw.h>
#include <ctype.h>
#include <widec.h>
#include <wctype.h>
#include <limits.h>
#include <nl_types.h>
#include <fml.h>
#define isWCHAR9(c) ((_ctmp_=(c)) > 127 ? _iswctype(_ctmp_,_E9) : 0)
extern nl_catd valfCATD;

/* Return 1 if every character in val is KANJI; otherwise set *err and return -1. */
int
#ifdef _TMPROTOTYPES
chkKANJI(char *val, char **err, char *usrarg, int flag, FLDID *fldid, int occno)
#else
chkKANJI(val, err, usrarg, flag, fldid, occno)
char *val;
char **err;
char *usrarg;
int flag;
FLDID *fldid;
int occno;
#endif
{
    register char *vp;
    wchar_t wc;
    int mbl;    /* length in bytes of the current character */
    int mbs;    /* byte offset of the current character in val */

    vp = val;
    for (mbs = 0; *(vp + mbs) != '\0'; mbs += mbl) {
        /* convert the next character; reject invalid or non-KANJI input */
        if ((mbl = mbtowc(&wc, vp + mbs, MB_LEN_MAX)) == -1 || !isWCHAR9(wc)) {
            *err = catgets(valfCATD, 1, 100, "ONLY KANJI CHARACTERS ALLOWED");
            return(-1);
        }
    }
    return(1);
}
Fig. 3: UFORM statement that names chkKANJI() as a VAL function
VAL=FUNC: chkKANJI()
mio(1) and vuform(1)
mio and attribute mode of vuform operation
Fig. 4: Entering a character smaller than the next character position
AABBCCDD  ->  AAaCCDD    (input 'a' on BB)
Fig. 5: Entering a character larger than the next character position
abcd  ->  aAAcd    (input 'AA' at b cursor position)
Fig. 6: Entering a character at the end of a line in a multi-line field
AABBCCDD   ->  AABBCCDD     (input "EEFF" on 9th column)
               EEFF
Fig. 7: Entering characters at the end of a field
AABBCCDD   ->  AABBCCDD     (input "EEFF" on 9th column)
               EEFF
Fig. 8: Cursor movement in multi-line field
AAb
cDD
Fig. 9: Deleting a character in a multi-line field
AABBCCDDa  ->  AABBCCDD     (delete 'a' on 9th column)
EEFF           EEFF
Fig. 10: Inserting a character in a multi-line field
aAABBCCDD  ->  aAA BBCC     (input CTRL-c at the 4th column)
EEFF           DDEEFF
The layout and command modes of vuform
Regular Expressions
mc
can compile UFORM source files that include
regular expressions of multi-byte characters correctly.
mio
handles masks that include the input validation
of an internationalized regular expression for one or more fields.
rex
is able to process internationalized regular expressions.
These regular expressions may include multi-byte characters, locale-specific
collation operations, and internationalized character classification operands.
vuform
is a visual form editor that creates the masks.
vuform
has the following pattern matching capabilities
that accept internationalized regular expressions.
This function
accepts internationalized regular expressions.
This function
searches the fielded buffer for input matching an
internationalized regular expression.
This function prints Boolean expressions
containing internationalized regular expressions.
This function can successfully search for string field values using
internationalized regular expressions.
The special characters used in regular expressions must be
ASCII characters from code set 0. If, for example, the caret (^) or
period (.) from a supplementary code set is used in a
specification, it does not function as a special character;
it is processed as an ordinary character without any special
meaning.
Range specifications can be used with both metacharacters and
regular expressions to match specific character ranges. This facility
is useful when handling only a small subset of the defined
characters.
A range specification matches all characters between, and including,
the specified range limits. The range limits are separated by an
ASCII minus sign (-) and enclosed in ASCII square brackets.
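As noted earlier in this chapter, range limits taken from different code sets are flagged as an error. The following C sketch shows one way such a check could be made from the first byte of each limit. It is not the TUXEDO System implementation: the function names are hypothetical, and the byte values assume the usual EUC layout, in which code set 0 is ASCII, code set 1 starts with a byte in the range 0xA1 to 0xFE, and code sets 2 and 3 are introduced by the single-shift bytes SS2 (0x8E) and SS3 (0x8F).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the EUC code set (0-3) that a character belongs to, judged from
 * its first byte, or -1 if the byte cannot start an EUC character.
 *   code set 0: 0x00-0x7F (ASCII)
 *   code set 1: 0xA1-0xFE
 *   code set 2: introduced by SS2 (0x8E)
 *   code set 3: introduced by SS3 (0x8F)
 */
int euc_codeset(unsigned char first_byte)
{
    if (first_byte <= 0x7F)
        return 0;
    if (first_byte == 0x8E)
        return 2;
    if (first_byte == 0x8F)
        return 3;
    if (first_byte >= 0xA1 && first_byte <= 0xFE)
        return 1;
    return -1;
}

/*
 * Hypothetical check for a range such as [lo-hi]: returns 1 if both limits
 * start valid EUC characters from the same code set, and 0 otherwise.
 */
int range_limits_compatible(const unsigned char *lo, const unsigned char *hi)
{
    int lo_set, hi_set;

    assert(lo != NULL && hi != NULL);
    lo_set = euc_codeset(lo[0]);
    hi_set = euc_codeset(hi[0]);
    return lo_set >= 0 && lo_set == hi_set;
}
```

The sketch only tests code set membership. Whether the lower limit actually precedes the upper limit depends on the collation order of the current locale and is not checked here.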