wnn_automaton - automaton for roman character-Kana conversion
Automaton is a function used for roman character-Kana conversion by the IIIMF language Engine Wnn8LE (hereafter called just Wnn8LE) and similar managers. Automaton converts according to the contents set in a table (called a conversion table) to enable versatile conversion. This system provides an Automaton library with the same type of functions (the functions beginning with romkan_ in the Japanese language input library (libwnn)) to enable a wide range of conversion programs.
Automaton performs three conversions in series according to conversion tables (in order: preprocessing, main processing, and postprocessing) and outputs the final results. Each of the three conversions is handled according to its own conversion tables. Automaton also has a mode function. The mode can be changed to dynamically change the combinations of the three processing stages. The modes and the switchover codes are set using conversion tables.
Because the conversion tables are text files, they can be changed easily and you can also easily change to any conversion table. Furthermore, BS (backspace) can be used to return to the previous status after a conversion has been completed until the next conversion is completed.
Although roman character-Kana conversion using Wnn8LE converts only from uppercase English characters to Hiragana, preprocessing and postprocessing can be used to handle various types of inputs and outputs. For example, preprocessing can be used to convert from lowercase English characters to uppercase English characters. Postprocessing can be used to convert from Hiragana to Katakana or from Hiragana to half-width Katakana.
Automaton proceeds with the operation as follows:
Input. English (half-width). Upper/lower case.
Preprocessing. To lower or upper case characters.
Main processing. Convert according to the uppercase English to Hiragana table.
Postprocessing. Convert Hiragana to Katakana or half-width Kana as required.
Output.
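The three stages above can be sketched as three functions applied in series. This is only an illustrative model, not the Automaton library itself: the function names, the tiny HIRAGANA_TABLE, and the katakana flag are assumptions made for the example, and the real tables are read from the correspondence table files.

```python
# Illustrative subset of an uppercase-English-to-Hiragana table (assumed data).
HIRAGANA_TABLE = {"A": "あ", "KA": "か", "NA": "な"}


def preprocess(text):
    """Stage 1: fold input to uppercase, one character in, one character out."""
    return text.upper()


def main_process(text):
    """Stage 2: longest-match lookup; unmatched characters pass through unaltered."""
    out, i = [], 0
    longest = max(map(len, HIRAGANA_TABLE))
    while i < len(text):
        for n in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in HIRAGANA_TABLE:
                out.append(HIRAGANA_TABLE[chunk])
                i += n
                break
        else:  # no table entry starts here
            out.append(text[i])
            i += 1
    return "".join(out)


def postprocess(text, katakana=False):
    """Stage 3: optionally shift Hiragana (U+3041..U+3096) to Katakana."""
    if not katakana:
        return text
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c for c in text
    )


def convert(text, katakana=False):
    """Run preprocessing, main processing, and postprocessing in series."""
    return postprocess(main_process(preprocess(text)), katakana)


print(convert("kana"))                 # かな
print(convert("kana", katakana=True))  # カナ
```

Switching the katakana flag plays the role that a mode change plays in the Automaton: the same main processing is kept while a different postprocessing stage is combined with it.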
The following conversion tables are used by Automaton:
Mode definition table
Declares the mode and the correspondence tables to use. The file name is mode.
Correspondence tables
Preprocessing tables
The correspondence table used for preprocessing. The file name begins with "1".
Main processing tables
The correspondence table used for main processing. The file name begins with "2".
Postprocessing tables
The correspondence table used for postprocessing. The file name begins with "3".
The mode definition table contains the mode declaration, the correspondence tables to be used for each mode, and table usage rules for them.
The correspondence tables contain lists of corresponding input codes and output codes. The correspondence tables are separated into those for preprocessing, main processing and postprocessing and any number of correspondence tables can be used for each of these.
Wnn8LE searches for the mode definition table in the following order.
According to setrkfile entries in the Wnn8LE initialization file uumrc.
/usr/lib/wnn/ja_JP/rk/mode file_name
The following table entries can be used:
. . . Indicates repeating zero or more times.
. . . . . . Indicates repeating one or more times.
[ ] Indicates that the item may be omitted.
The mode definition table contains the mode declaration, the correspondence tables to be used for each mode, the determination standards for them, and the mode display text strings.
The mode definition table consists of the following items (1), (2), (3), and (4). The remainder of a line is treated as a comment if a semicolon (;) appears at the beginning of the line or following spaces (including tabs) unless the semicolon is escaped.
The following are considered special strings.
Indicates the HOME environmental variable.
Indicates the directory containing the mode definition table.
The directory containing the conversion table with the standard (/usr/lib/wnn/).
If user is a user name, then it indicates the user's login directory. If the user name is omitted, then it indicates your own login directory.
The mode is declared as follows:
(defmode mode_name [initial_status])
The mode_name is a text string consisting of alphanumeric characters. The [initial_status] is either on or off. The default is off.
The mode declaration is made before the mode is used.
Search paths for correspondence tables are specified using the following format.
Specify the directory name(s) to be searched when the correspondence tables specified in the mode definition table are not in the same directory as the mode definition table. Multiple directory names can be specified; separate them with spaces. The search directory name must be specified before specifying the correspondence tables.
Any directory names previously stored to search for correspondence tables are deleted and the directory name(s) specified as the argument are stored. Multiple directory names can be specified; separate them with spaces. The search directory name must be specified before specifying the correspondence tables.
There are three ways to make the specifications.
(1) Correspondence_table_file_name or Mode_display_text_string . . . . . .
(2) (if Conditional_expression Correspondence_table_specifications or Mode_display_text_string . . .)
(3) (when Conditional_expression Correspondence_table_specifications or Mode_display_text_string . . .)
File names for correspondence tables must begin with '1','2',or '3'. Path names can also be specified. Mode display text strings are text strings placed in quotation marks used to display the current mode.
"string"
Indicates the mode display text string when conversion is ON.
(on_dispmode "string")
Indicates the mode display text string when conversion is ON.
(off_dispmode "string")
Indicates the mode display text string when conversion is OFF.
(on_unchg)
Indicates that the mode display text string used before the mode was changed is to be used when conversion is ON.
(off_unchg)
Indicates that the mode display text string used before the mode was changed is to be used when conversion is OFF.
This text string is used by Wnn8LE to display the mode.
The Automaton library can read this text string using romkan_dispmode(), but only the last entry is valid if multiple mode display text strings are given for the mode in the mode definition table.
(2), (3) are used to change the correspondence table depending on specified conditions. If the condition in the if statement in (2) is true, then the specification in the if statement is referenced and the specification following the if statement is not referenced. If the condition is false, the if statement is exited and the specification following the if statement is referenced.
If the condition in the when statement in (3) is true, the specification in the statement is referenced. The specification following the when statement, however, is referenced regardless of whether or not the condition is true or false.
(2) or (3) can be used recursively to specify correspondence tables.
Any one of the following can be used for the conditional statement.
For example, when (defmode kana) and (defmode romajikana) are both in the mode definition, (and kana romajikana) is true when both modes are ON.
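The way such a conditional statement is evaluated against the current mode states can be sketched as follows. This is a minimal model that covers only the forms that appear in this page (a bare mode name, not, and and); the tuple encoding and the function name holds are assumptions for illustration and are not part of the Automaton library.

```python
def holds(condition, modes):
    """Return True if the condition is met for the given mode states.

    condition: a mode name (str), ("not", cond), or ("and", cond, cond, ...).
    modes:     dict mapping each declared mode name to True (ON) or False (OFF).
    """
    if isinstance(condition, str):
        if condition not in modes:
            # A mode must be declared with defmode before it is used.
            raise KeyError(f"undeclared mode: {condition}")
        return modes[condition]
    operator, *arguments = condition
    if operator == "not":
        return not holds(arguments[0], modes)
    if operator == "and":
        return all(holds(argument, modes) for argument in arguments)
    raise ValueError(f"unsupported operator in this sketch: {operator}")


modes = {"kana": True, "romajikana": True}
print(holds(("and", "kana", "romajikana"), modes))  # True
modes["romajikana"] = False
print(holds(("and", "kana", "romajikana"), modes))  # False
```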
Also, (here conditional statements are represented by <1>, <2>, and <3>, and conversion table names are represented by A, B, and C) assume the following statement.
(when <1> A (if <2> B ) C ) (if <3> D ) E
Also assume that conditional statements <1>, <2>, and <3> have been met. Examine the statement from the beginning. First comes (when <1> A (if <2> B) C). Because <1> has been met, "A (if <2> B) C" is examined and table A is selected.
Next comes (if <2> B ) and <2> has been met so table B is selected. Because this is an if statement and the conditional statements have been met, the rest of the current series "A(if <2> B)C" need not be examined. Although this ends examination of "A(if <2> B)C," this series is contained in a when statement, so the remainder of "(when <1> A (if <2> B )C) (if <3> D) E" is examined.
The next portion is (if <3> D). Table D is selected because the conditional statement <3> has been met. Because this is an if statement, the rest of "(when <1> A (if <2> B ) C ) (if <3> D ) E" is not examined. The final result is the selection of tables A, B, and D.
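The difference between if and when in this walkthrough can be sketched as a small recursive selector. The nested-tuple encoding of the statement and the name select_tables are assumptions made for the example; conditions are represented simply by labels that are either met or not.

```python
def select_tables(series, met, selected=None):
    """Walk a specification series and return the tables it selects.

    series: list of items; an item is a table name (str) or a tuple
            (kind, condition, *body) where kind is "if" or "when".
    met:    set of condition labels that currently hold.
    """
    if selected is None:
        selected = []
    for item in series:
        if isinstance(item, str):
            selected.append(item)
            continue
        kind, condition, *body = item
        if condition in met:
            select_tables(body, met, selected)
            if kind == "if":
                break  # a true `if` ends the series that contains it
    return selected


# (when <1> A (if <2> B ) C ) (if <3> D ) E
SPEC = [
    ("when", "<1>", "A", ("if", "<2>", "B"), "C"),
    ("if", "<3>", "D"),
    "E",
]
print(select_tables(SPEC, {"<1>", "<2>", "<3>"}))  # ['A', 'B', 'D']
```

With only <1> and <3> met, the same sketch selects A, C, and D, because the inner if no longer cuts the series short.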
Next we will use the mode definition tables used by Wnn8LE as an example.
Three modes are defined in the mode definition table. There are specifications for correspondence table and mode display text string to be used from 2A_CTRL to the end. This table is referenced each time the mode changes and the tables to be used are selected as described above.
(defmode romkan)
(defmode katakana)
(defmode zenkaku)
2A_CTRL
(if romkan 1B_TOUPPER 2B_ROMKANA 2B_JIS
    (if (not katakana) "[Ar]")
    (if zenkaku 3B_KATAKANA "[Ar]")
    3B_HANKATA "[AIr]") ; "A" and "I" are half-width Katakana.
2B_DAKUTEN
(if (not katakana) 1B_ZENHIRA
    (if zenkaku 3B_ZENKAKU "[A ]")
    "[AA]")
(if zenkaku 1B_ZENKATA 3B_ZENKAKU "[A ]")
"[AIA]" ; "A" and "I" are half-width Katakana.
Initially, romkan, katakana, and zenkaku are all OFF. 2A_CTRL is selected as the table at this point. Because romkan is OFF, the following if statement is not referenced, and 2B_DAKUTEN is selected. The conditional statement for the next if statement, (not katakana), is true because katakana is OFF. The inside of the if statement is referenced and 1B_ZENHIRA is selected. Next the if statement inside the if statement is examined. Because zenkaku is OFF, the conditional statement is false. The inner if statement is thus not referenced.
Next the mode display text string "[A[hiragana-A]]" is selected and the rest of the conversion table series is not examined.
The correspondence tables contain the conversion data (input codes and corresponding output codes) for preprocessing, main processing, and postprocessing.
Preprocessing and postprocessing serve supplemental roles to main processing. The following restrictions thus apply to preprocessing and postprocessing correspondence tables.
Preprocessing tables: Item (2), below, is not possible. Also, there can only be one form for each input and output code in item (1) that results in a character when evaluated.
Postprocessing tables: Item (2), below, is not possible. Also, there can only be one form for each input code in item (1) that results in a character when evaluated, and the buffer remainder cannot be entered.
All lines in the correspondence table must contain one of the following items (1) to (3) or must be empty. Lines of this form are repeated to form the correspondence table.
(1) input_code [output_code [buffer_remainder]]
(2) input_code function
(3) Variable declaration
Each entry must occupy no more than one line. The remainder of a line is treated as a comment if a semicolon (;) appears at the beginning of the line or following spaces (including tabs) unless the semicolon is escaped.
The output code or buffer remainder will be treated as a null string if omitted. Input codes, output codes, and buffer remainders must contain strings of the following without intervening spaces: forms that evaluate to characters and forms that evaluate to text strings.
Forms are considered to evaluate to characters or text strings if the form is replaced by the character or text string.
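A reader for one correspondence table line might look like the following sketch. It strips a comment that starts at the beginning of the line or after white space, honors backslash escapes, and splits the rest into at most three fields without breaking forms in parentheses or quotation marks. The function name split_entry and the handling of edge cases (for example, a semicolon inside parentheses is kept as a literal) are assumptions; the actual parser in the Automaton library may differ.

```python
def split_entry(line):
    """Split a table line into [input_code, output_code, buffer_remainder].

    Missing trailing fields are simply absent from the returned list;
    an empty or comment-only line yields an empty list.
    """
    fields, current, depth, quote = [], "", 0, None
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current += line[i:i + 2]  # keep the escaped character verbatim
            i += 2
            continue
        if quote:
            current += ch
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current += ch
        elif ch == "(":
            depth += 1
            current += ch
        elif ch == ")":
            depth -= 1
            current += ch
        elif ch in " \t" and depth == 0:
            if current:
                fields.append(current)
                current = ""
        elif ch == ";" and depth == 0 and current == "":
            break  # comment: semicolon at line start or after white space
        else:
            current += ch
        i += 1
    if current:
        fields.append(current)
    return fields


print(split_entry("(a1) (toupper (a1)) ; fold to uppercase"))
# ['(a1)', '(toupper (a1))']
```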
The following types of forms evaluate to characters.
Character notations are shown below (these differ from character notations treated as forms that evaluate to text strings).
Not including the following: "(", ")", "'", """, "\", ";", space
Not including the following: "'", "\", and "^"
Indicates a control character. The character must be an ASCII code from 32 to 126 (lowercase English characters). "^?" is the delete code.
Indicates the character following "\". Not including numerals and odx. "\n", "\t", "\b", "\r", and "\f" are the same as C language escape strings. "\e" and "\E" indicate the Escape code.
Indicates the character with the specified octal code.
Indicates the character with the specified octal code.
Indicates the character with the specified decimal code.
Indicates the character with the specified hexadecimal code.
Variable names are any English alphanumeric text strings beginning with an English letter that do not correspond to function names, functions, or declarations (defvar). Here, an underscore '_' is considered to be an English letter.
The following types of forms evaluate to text strings.
Character notations are shown below (these differ from character notations treated as forms that evaluate to characters).
Not including the following: """, "^", "\"
Indicates a control character. The character must be an ASCII code from 32 to 126 (lowercase English characters). "^?" is the delete code.
Indicates the character following "\". Not including numerals and odx. "\n", "\t", "\b", "\r", and "\f" are the same as C language escape strings. "\e" and "\E" indicate the Escape code.
Indicates the character with the specified octal code. ";" is required when the code is followed by a number.
Indicates the character with the specified octal code. ";" is required when the code is followed by a number.
Indicates the character with the specified decimal code. ";" is required when the code is followed by a number.
Indicates the character with the specified hexadecimal code. ";" is required when the code is followed by a number.
" " The null string is indicated by omitting the character notation as follows:
The mode name must be defined in the mode definition table.
if and unless can only be entered for input codes. on, off, and switch can only be entered for output codes in the main processing table.
The following function names can only be entered for output codes in the main processing table.
The following functions can be used. These functions can be used independently.
An error will be generated if the corresponding input code is received.
The previous mode definition table is read in again to reset conversion. If there is an error in the new conversion table, an error message is displayed and the settings in the previous (original) conversion table are used.
The characters given as arguments for list specify the range of the variable.
The range of the variable can be set to all characters by specifying "all".
A variable declaration defines a variable that can be used as a form that evaluates to a character, together with the range of the variable. Declarations are made as shown above. Variable notations are given as variable names or as (variable_name . . . . . . ). Character notations are the same as for forms that evaluate to characters.
The following line can be used because it evaluates to a character or text string.
(toupper (tolower Y))
The following line, however, cannot be used because a form that evaluates to a text string is used as the argument to another function:
(toupper (tohankata [hiragana-KA]))
Variables can be used effectively when the same patterns appear many times in conversions, such as in the following example.
(defvar a1 (list K S T H Y R W G Z D B P))
(a1)(a1) [small tsu] (a1)
The above two lines achieve the same conversions as the following lines. Both show methods of handling assimilated sounds (Sokuon) in roman character-Kana conversion.
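KK [small tsu] K
SS [small tsu] S
TT [small tsu] T
HH [small tsu] H
YY [small tsu] Y
RR [small tsu] R
WW [small tsu] W
GG [small tsu] G
ZZ [small tsu] Z
DD [small tsu] D
BB [small tsu] B
PP [small tsu] P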
Variables are equivalent to the characters given as the range of the variable in the variable declaration (defvar).
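The equivalence between a variable entry and its expanded entries can be sketched by substituting each character in the range of the variable into the entry. The names below are assumptions for illustration, and the actual small tsu character (U+3063) is used where this page writes [small tsu].

```python
SMALL_TSU = "\u3063"  # the Hiragana character written as [small tsu] above

# Range of the variable a1, as given to (defvar a1 (list ...)).
A1_RANGE = "K S T H Y R W G Z D B P".split()


def expand_sokuon_entry(values):
    """Expand the entry `(a1)(a1) [small tsu] (a1)` into one entry per value.

    Each result is (input_code, output_code, buffer_remainder). The variable
    takes the same value everywhere within a single entry.
    """
    return [(value + value, SMALL_TSU, value) for value in values]


for input_code, output_code, remainder in expand_sokuon_entry(A1_RANGE)[:3]:
    print(input_code, output_code, remainder)
```

Because the buffer remainder is the consonant itself, the second consonant stays in the buffer and can still start the next conversion.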
WARNINGS
Variables must be defined within the table using defvar.
Variable definitions are valid anywhere within the table. You can define the variable a1 within two different tables as required and the a1 will be treated as two separate variables. You cannot, however, define the same variable twice within any one table using defvar.
A variable always has the same value within a single line in a correspondence table.
(defvar a1 (list A B))
(a1)(tolower (a1)) 3
The text strings "Aa" and "Bb" will be converted to "3" in the above example and not to "Ab" and "Ba".
Input code is matched with input codes in the tables starting from the left. Thus, when examining input codes from the left in the tables, a variable must not be used where it will be treated as the argument of a function before it is matched to specific characters, such as in the following example:
(defvar a1 (list a b))
(toupper (a1))(a1) 3
Here, it appears that the input "Aa" will be converted to "3", but first a match is attempted between "A" and (toupper (a1)). This is not possible because the argument of toupper is the variable a1, which does not yet have a value. A check is made for this type of setting when tables are read into the system.
Any variable appearing in the output codes or buffer remainder section must appear in the input code section, i.e., must have been assigned a value when matched to an input code.
(defvar a1 (list K S))
(defvar a2 (list a))
(a1)(a1) (a2) (a1)
The above entry is not correct because the variable a2 is not matched to an input code, but appears in an output code.
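The two restrictions on variable placement described above lend themselves to a simple check at table load time. The sketch below is one assumed way to express them, not the check performed by the Automaton library: forms are encoded as plain characters (str), ("var", name), or ("call", function_name, argument), and check_entry is a hypothetical helper.

```python
def variables_in(form):
    """Return the set of variable names used in a form."""
    if isinstance(form, str):
        return set()
    if form[0] == "var":
        return {form[1]}
    return variables_in(form[2])  # ("call", function_name, argument)


def check_entry(input_forms, output_forms=(), remainder_forms=()):
    """Return a list of error messages for one correspondence table entry."""
    errors, bound = [], set()
    for form in input_forms:  # input codes are matched from the left
        if not isinstance(form, str) and form[0] == "call":
            for name in sorted(variables_in(form) - bound):
                errors.append(f"{name} is a function argument before it has a value")
        bound |= variables_in(form)
    for form in list(output_forms) + list(remainder_forms):
        for name in sorted(variables_in(form) - bound):
            errors.append(f"{name} does not appear in the input code")
    return errors


a1, a2 = ("var", "a1"), ("var", "a2")
print(check_entry([a1, ("call", "tolower", a1)], ["3"]))  # []
print(check_entry([("call", "toupper", a1), a1], ["3"]))
print(check_entry([a1, a1], [a2], [a1]))
```

The first entry passes, the second is rejected because a1 is used inside toupper before it has been matched, and the third is rejected because a2 appears only in the output code.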
First, the code that is input is grouped into character units (characters with 2-byte codes are also treated as one character). This is called the input code. The input code goes through preprocessing, main processing, and postprocessing before the final output is produced. In preprocessing, each input code corresponds to one output code. The output code from preprocessing becomes the input code for main processing.
Input codes in the preprocessing table that is currently being used are examined in order from the beginning. When a match is found for the input code, the corresponding output code (i.e., the output code written on the same line as the input code) is output. If more than one table is specified in the mode definition table, the tables are examined in the same order as they are listed in the mode definition table. If no matching input code is found in a table (including when no table is specified), the input code is output unaltered. This is the same for main processing and postprocessing as well.
In main processing, input code is added to the buffer as long as a longer match may still be found among the input codes in the table (i.e., when the characters in the buffer so far match the beginning of some input code in the table). Each time more input code is added to the buffer, comparisons are made again in order from the beginning of the input codes listed in the main processing table. While a longer match remains possible, the conversion is not finalized and more input code is awaited.
The code in the buffer is, however, output as nonfinalized characters to enable displaying and other processing. Codes for input errors and mode changes are also output.
These codes are differentiated from normal output codes and do not undergo postprocessing.
When the contents of the buffer matches the longest possible input code in the table (if more than one match is made, then the first one in the table is used), the corresponding output code is output and if no buffer remainder has been specified, the part of the buffer that was matched is deleted from the buffer. If a buffer remainder was specified, it replaces the portion in the buffer that was matched and the above operation is repeated.
If no possibility of a match is found in the table, the first character in the buffer is output unaltered. If the output code for a matched input code is a function that changes the mode (on, off, switch, etc.), the correspondence table is changed according to the specifications in the mode definition table. The functions that change the mode should be placed in the tables where they are required regardless of the status of the modes. If a match is made for the input code corresponding to the function (restart), the mode definition table will be reread. However, the same file as the one for the previous mode definition table will be used. This function can be used to change to an edited version of the conversion tables (including the mode definition table) while the Automaton is running, without having to stop the Automaton.
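The buffering behavior of main processing described above (wait while a longer match is possible, convert on the longest match, put back the buffer remainder, and pass through the first character when nothing can match) can be sketched as follows. This is a simplified model under stated assumptions: the table is a plain dictionary with illustrative entries, mode-changing functions and error codes are left out, and the function name is hypothetical.

```python
# input_code -> (output_code, buffer_remainder); illustrative entries only.
TABLE = {
    "A": ("あ", ""),
    "KA": ("か", ""),
    "KK": ("っ", "K"),  # Sokuon: leave the second consonant in the buffer
}


def main_process(table, text):
    """Return (finalized_output, pending_buffer) after feeding text."""
    output, buffer = [], ""
    for ch in text:
        buffer += ch
        while buffer:
            # A longer input code may still match: wait for more input.
            if any(key != buffer and key.startswith(buffer) for key in table):
                break
            matches = [key for key in table if buffer.startswith(key)]
            if matches:
                key = max(matches, key=len)
                converted, remainder = table[key]
                output.append(converted)
                # The buffer remainder replaces the matched portion.
                buffer = remainder + buffer[len(key):]
            else:
                # No possibility of a match: emit the first character unaltered.
                output.append(buffer[0])
                buffer = buffer[1:]
    return "".join(output), buffer


print(main_process(TABLE, "KKA"))  # ('っか', '')
print(main_process(TABLE, "K"))    # ('', 'K')
```

In the second call the "K" stays in the buffer as a nonfinalized character, which corresponds to the state in which the Automaton is waiting for more input code.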
In postprocessing, more than one output code can be output for one input code as the final output. In all other ways, postprocessing is the same as preprocessing.
In the following example, "ls -la carriage_return" is output when "Ls" or "LS" is input.
Preprocessing table:
(defvar a1 (list s))
(a1) (toupper (a1))
Main processing table:
LS "LS -la\n"
Postprocessing table:
(defvar a1 (all))
(a1) (tolower (a1))
See attributes(5) for descriptions of the following attributes: