International Language Environments Guide

Chapter 4 Supported Asian Locales

This chapter provides localization-related information for the Japanese, Indic, and Thai languages. The sections in this chapter are Japanese Localization, Indic Localization, and Thai Localization.

For Korean and Chinese locale support, see the following documents.

Oracle Solaris 10 User's Guide - Korean
Oracle Solaris 10 User's Guide - Simplified Chinese
Oracle Solaris 10 User's Guide - Traditional Chinese

Japanese Localization

This section describes Japanese locale-specific information. For more information, see the documents in Oracle Solaris 10 User Collection - Japanese (written in Japanese).

Japanese Locales

Four Japanese locales, which support different character encodings, are available in the current Oracle Solaris environment. The ja and ja_JP.eucJP locales are based on Japanese EUC. The ja_JP.eucJP locale conforms to the UI-OSF Japanese Environment Implementation Agreement Version 1.1, and the ja locale conforms to the traditional specification from earlier Oracle Solaris releases. The ja_JP.PCK locale is based on PC-Kanji code (known as Shift_JIS), and the ja_JP.UTF-8 locale is based on UTF-8.

See the eucJP(5) man page for a map showing Japanese EUC and the character set. See the PCK(5) man page for the map showing PC-Kanji code and the character set.
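As a rough illustration of how the encodings behind these locales differ, the following Python sketch encodes the same text three ways. It assumes that Python's built-in `euc_jp`, `shift_jis`, and `utf-8` codecs are acceptable stand-ins for the code sets of the ja/ja_JP.eucJP, ja_JP.PCK, and ja_JP.UTF-8 locales. The codec names are Python's, not Oracle Solaris locale or iconv names, and vendor-specific mappings may differ.

```python
# Assumption: Python codec names stand in for the locale code sets.
CODECS = {
    "ja / ja_JP.eucJP": "euc_jp",
    "ja_JP.PCK": "shift_jis",
    "ja_JP.UTF-8": "utf-8",
}


def encoded_lengths(text):
    """Return {locale label: byte count} for text in each encoding."""
    return {label: len(text.encode(codec)) for label, codec in CODECS.items()}


sample = "\u65e5\u672c\u8a9e"  # three kanji: "Japanese language"
for label, size in encoded_lengths(sample).items():
    print(f"{label}: {size} bytes")
```

The same three characters take a different number of bytes depending on the code set, which is why terminal and conversion settings must match the locale in use.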

Japanese Character Sets

The supported Japanese character sets include:

JIS X 0201–1976
JIS X 0208–1990
JIS X 0212–1990
JIS X 0213:2004 (only characters defined in Unicode 4.0)

JIS X 0212–1990 is not supported in the ja_JP.PCK locale. JIS X 0213:2004 is supported in the ja_JP.UTF-8 locale only. Not all characters defined in JIS X 0213:2004 are available; only those that are also defined in the Unicode 4.0 character set are supported.
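The effect of the JIS X 0212 restriction can be sketched in Python. This is an approximation under stated assumptions: Python's `euc_jp` and `shift_jis` codecs stand in for the ja_JP.eucJP and ja_JP.PCK code sets, and U+4E02 is used as a sample character that belongs to JIS X 0212 but not to JIS X 0208. The helper name `can_encode` is hypothetical.

```python
def can_encode(ch, codec):
    """Return True if ch has a representation in the given Python codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Assumption: U+4E02 is a JIS X 0212 kanji that is absent from JIS X 0208.
jisx0212_char = "\u4e02"

print("EUC-JP:", can_encode(jisx0212_char, "euc_jp"))
print("Shift_JIS:", can_encode(jisx0212_char, "shift_jis"))
# In EUC, JIS X 0212 characters are three bytes, introduced by 0x8F.
print("EUC-JP bytes:", jisx0212_char.encode("euc_jp").hex())
```

A check of this kind can help decide whether text must be kept in an EUC or UTF-8 locale rather than converted to PCK.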

Vendor-defined characters (VDC) and user-defined characters (UDC) are also supported. VDCs occupy unused (reserved) code points of JIS X 0208–1990 or JIS X 0212–1990. UDCs occupy the same unused code point ranges, excluding the code points that are already allocated to VDCs.

Japanese Fonts

Three Japanese font formats are supported: bitmap, TrueType, and Type1. The Japanese Type1 font includes only JIS X 0212 for printing. The Type1 font is also used by UDC.

Japanese bitmap fonts are described in the following table.

Table 4–1 Japanese Bitmap Fonts


Full Family Name	Subfamily	Format	Vendor	Encoding
`sun gothic`	R, B	PCF(12,14,16,20,24)		JIS X 0208–1983, JIS X 0201–1976
`sun minchou`	R	PCF(12,14,16,20,24)		JIS X 0208–1983, JIS X 0201–1976
`ricoh hg gothic b`	R	PCF(10,12,14,16,18,20,24)	RICOH	JIS X 0208–1983, JIS X 0201–1976
`ricoh hg mincho l`	R	PCF(10,12,14,16,18,20,24)	RICOH	JIS X 0208–1983, JIS X 0201–1976
`ricoh gothic`	R	PCF(10,12,14,16,18,20,24)	RICOH	JIS X 0212–1990, JIS X 0213:2004
`ricoh mincho`	R	PCF(10,12,14,16,18,20,24)	RICOH	JIS X 0212–1990, JIS X 0213:2004
`ricoh heiseimin`	R	PCF(12,14,16,18,20,24)	RICOH	JIS X 0212–1990

Japanese TrueType fonts are described in the following table.

Table 4–2 Japanese TrueType Fonts


Full Family Name	Subfamily	Format	Vendor	Encoding
`ricoh hg gothic b`	Fixed	TrueType	RICOH	JIS X 0208–1983, JIS X 0201–1976
`ricoh hg mincho l`	Fixed	TrueType	RICOH	JIS X 0208–1983, JIS X 0201–1976
`ricoh hg gothicb sun`	Fixed, Proportional	TrueType	RICOH	JIS X 0201–1976, JIS X 0208–1983, JIS X 0213:2004
`ricoh hg minchol sun`	Fixed, Proportional	TrueType	RICOH	JIS X 0201–1976, JIS X 0208–1983, JIS X 0213:2004
`ricoh heiseimin`	Fixed	TrueType	RICOH	JIS X 0212–1990

Japanese Input Systems

ATOK for Solaris (equivalent to ATOK17) is the default Japanese input system. Wnn6 is also available. To switch from ATOK for Solaris to Wnn6, see Japanese Environment User's Guide (written in Japanese). The kkcv Japanese input system is available for Japanese Solaris 1.x BCP support.

Terminal Setting for Japanese Terminals

To use Japanese locales on a character-based terminal (TTY), you must configure the terminal settings so that line editing works correctly.

If your terminal is a CDE Terminal emulator (dtterm), use stty(1) with the argument -defeucw in any Japanese locale (ja, ja_JP.PCK, or ja_JP.UTF-8). For example, in the ja locale you would type:
```
% setenv LANG ja
% stty defeucw
```
If your terminal is not a CDE Terminal emulator but the code set of your terminal is the same as that of the current locale, use stty(1) with the argument -defeucw.
If the code set of your terminal does not match that of the current locale, use setterm(1) to enable code conversion. For example, if you are in the ja locale but your terminal requires PCK (Shift_JIS code), you would type:
```
% setenv LANG ja
% setterm -x PCK
```
See the setterm(1) man page for details.

Japanese `iconv` Module

Several Japanese code set conversions are supported with iconv(1) and iconv(3C). See the iconv_ja(5) man page for details.
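To make the idea of code set conversion concrete, here is a minimal Python sketch that transcodes bytes between Japanese encodings. It is only an analogy to what iconv does: the function name `convert` is hypothetical, and the codec names (`euc_jp`, `shift_jis`, `utf-8`) are Python's, not the code set names accepted by the Oracle Solaris iconv module. For those names, see the man page.

```python
def convert(data, from_code, to_code):
    """Transcode bytes from one code set to another via Unicode."""
    return data.decode(from_code).encode(to_code)


# Five hiragana characters, first held as EUC-JP bytes.
euc_bytes = "\u3053\u3093\u306b\u3061\u306f".encode("euc_jp")

sjis_bytes = convert(euc_bytes, "euc_jp", "shift_jis")
utf8_bytes = convert(sjis_bytes, "shift_jis", "utf-8")

print(len(euc_bytes), len(sjis_bytes), len(utf8_bytes))
```

Decoding to Unicode and re-encoding fails with an exception when the target code set lacks a character, which is a useful signal that the conversion would lose data.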

User-Defined Character Support

The user-defined character utility sdtudctool handles both outline (Type1) and bitmap (PCF) fonts. Some utilities are also available to migrate the UDC fonts that were created by old utilities in prior releases, such as fontedit, type3creator, and fontmanager.

Indic Localization

A phonetic lookup-based input method (Shabdalipi) and a continuous phonetic input method are available for all Indic languages that are supported in the UTF-8 locale. The input methods and virtual keyboards allow you to enter Indic text in all of the CDE applications.

The following data flow illustrates the workings of the Indic input process.

Data flow indicating workings of Indic input process

How to Use the Indic Input Methods

Click the input status area to display the input method selection menu.

Select an input method from the menu.

Alternatively, you can press the F6 key to select from among the available input methods.

You can also type the Compose-hi key sequence to select the input method that you used previously.

Press the F5 key to select the Indic script you want to use.
1. For the keyboard-based (Indic INSCRIPT keyboard) input method, use the keyboard images shown in Indic Keyboards.
2. For the phonetic lookup-based input method, type the first English phonetic equivalent character corresponding to the character in the target script.
  
  Select from a list of choices displayed in the lookup window.
3. For the continuous phonetic input method, type in English phonetic equivalents continuously.
  
  The corresponding characters in the target script are displayed in the preedit area. They are committed either when subsequent input makes the preedit text unambiguous or when you commit them explicitly. For illustrations of the mapping from the English tokens to the UTF-8 codepoints of the target script, see the figures in Mapping for the Continuous Phonetic Based Input Method.

Press Control-spacebar to switch back to English/European input mode.

Alternatively, click in the status area to select the English/European input mode from the input mode selection window.

Indic Keyboards

The following figures show the keyboard layouts that are available for the Indic input method.

The following figure shows the layout of the Bengali keyboard.

The following figure shows the layout of the Devanagari keyboard.

The following figure shows the layout of the Gujarati keyboard.

The following figure shows the layout of the Gurmukhi keyboard.

The following figure shows the layout of the Kannada keyboard.

The following figure shows the layout of the Malayalam keyboard.

The following figure shows the layout of the Tamil keyboard.

The following figure shows the layout of the Telugu keyboard.

Understanding the Mappings

The images in Mapping for the Continuous Phonetic Based Input Method show the mappings between English tokens and their equivalent codepoints in each of the supported target scripts. The CONSONANT category maps English tokens to consonants of the script. The VOWEL category maps English tokens to vowels of the script. The OTHER category maps characters that do not exhibit the properties of consonants or vowels, that is, characters whose form does not change depending on the surrounding characters.

The keywords CONSONANT, VOWEL, and OTHER also indicate that the characters are part of the Unicode standard. The sections SPECIAL CONSONANT, SPECIAL VOWEL, and SPECIAL OTHER contain characters that in principle behave like consonants, vowels, or others but are not officially part of the Unicode standard and are font dependent. These characters are assigned codepoint values in the Unicode Private Use Area. They are supported in Oracle Solaris UTF-8 locales, and the mapping might not work on a different platform.
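Because private-use characters are not portable, it can be useful to detect them in text before moving it to another platform. The following Python sketch shows one way to do so, using the Unicode general category `Co` (private use). The function name `is_private_use` is hypothetical, and the code makes no assumption about which private-use codepoints the Oracle Solaris mappings actually assign.

```python
import unicodedata


def is_private_use(ch):
    """Return True if ch is a Unicode private-use codepoint (category Co)."""
    return unicodedata.category(ch) == "Co"


# U+0915 is DEVANAGARI LETTER KA, a standard Unicode character.
print(is_private_use("\u0915"))
# U+E000 is the first codepoint of the Private Use Area in the BMP.
print(is_private_use("\ue000"))
```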

The mapfiles shown here are not identical to the ones on your system. They have been slightly edited to remove keywords that are not needed for this discussion.

In the VOWEL and SPECIAL VOWEL sections, both an independent form and a dependent form are shown for the same English token. Which form is used depends on the context. See How the Continuous Phonetic Input Method Works.

The Malayalam script contains a special ‘CHILLU’ section, which is actually the SPECIAL OTHER category.

Mapping for the Continuous Phonetic Based Input Method

The following figures show the existing mappings from English to the phonetic equivalent characters in the target Indic scripts. Use these illustrations as a reference until you know all the mappings for the script that you use. Mappings given here are intuitive, so you should be able to input most of the characters without looking up the illustration.

Note –

In these mappings, special characters such as ‘.’ and ‘|’ included as part of the mapping are escaped with a ‘\’ character. If not escaped, the ‘|’ character acts as a separator when more than one token represents the same UTF-8 character.
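The escaping rule in this note can be sketched as a small Python function that splits a token field on unescaped ‘|’ characters. This is an illustration only: the function name `split_tokens` and the idea of a standalone "token field" string are assumptions, and the actual mapfile syntax on your system may contain more than this note describes.

```python
def split_tokens(field):
    """Split a token field on unescaped '|' and drop the escaping backslashes."""
    tokens = []
    current = []
    escaped = False
    for ch in field:
        if escaped:
            # The previous character was a backslash: take this one literally.
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "|":
            # An unescaped '|' separates alternative tokens.
            tokens.append("".join(current))
            current = []
        else:
            current.append(ch)
    tokens.append("".join(current))
    return tokens


print(split_tokens(r"a|aa"))    # two alternative tokens
print(split_tokens(r"\.|\|"))   # a literal '.' and a literal '|'
```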

Figure 4–1, Figure 4–2, and Figure 4–3 show the English to Bengali mappings for consonants, vowels, and others.

Figure 4–1 Map for Bengali Consonants

graphical representation of map for Bengali consonants

Figure 4–2 Map for Bengali Vowels

graphical representation of map for Bengali vowels

Figure 4–3 Map for Bengali Others

graphical representation of map for Bengali others

Figure 4–4, Figure 4–5, and Figure 4–6 show the English to Gujarati mappings for consonants, vowels, and others.

Figure 4–4 Map for Gujarati Consonants

graphical representation of map for Gujarati consonants

Figure 4–5 Map for Gujarati Vowels

graphical representation of map for Gujarati vowels

Figure 4–6 Map for Gujarati Others

graphical representation of map for Gujarati others

Figure 4–7, Figure 4–8, and Figure 4–9 show the English to Gurmukhi mappings for consonants, vowels, and others.

Figure 4–7 Map for Gurmukhi Consonants

graphical representation of map for Gurmukhi consonants

Figure 4–8 Map for Gurmukhi Vowels

graphical representation of map for Gurmukhi vowels

Figure 4–9 Map for Gurmukhi Others

graphical representation of map for Gurmukhi others

Figure 4–10, Figure 4–11, and Figure 4–12 show the English to Hindi mappings for consonants, vowels, and others.

Figure 4–10 Map for Hindi Consonants

graphical representation of map for Hindi consonants

Figure 4–11 Map for Hindi Vowels

graphical representation of map for Hindi vowels

Figure 4–12 Map for Hindi Others

graphical representation of map for Hindi others

Figure 4–13, Figure 4–14, and Figure 4–15 show the English to Kannada mappings for consonants, vowels, and others.

Figure 4–13 Map for Kannada Consonants

graphical representation of map for Kannada consonants

Figure 4–14 Map for Kannada Vowels

graphical representation of map for Kannada vowels

Figure 4–15 Map for Kannada Others

graphical representation of map for Kannada others

Figure 4–16, Figure 4–17, and Figure 4–18 show the English to Malayalam mappings for consonants, vowels, and others.

Figure 4–16 Map for Malayalam Consonants

graphical representation of map for Malayalam consonants

Figure 4–17 Map for Malayalam Vowels

graphical representation of map for Malayalam vowels

Figure 4–18 Map for Malayalam Others

graphical representation of map for Malayalam others

Figure 4–19 and Figure 4–20 show the English to Tamil mappings for consonants and vowels.

Figure 4–19 Map for Tamil Consonants

graphical representation of map for Tamil consonants

Figure 4–20 Map for Tamil Vowels

graphical representation of map for Tamil vowels

Figure 4–21, Figure 4–22, and Figure 4–23 show the English to Telugu mappings for consonants, vowels, and others.

Figure 4–21 Map for Telugu Consonants

graphical representation of map for Telugu consonants

Figure 4–22 Map for Telugu Vowels

graphical representation of map for Telugu vowels

Figure 4–23 Map for Telugu Others

graphical representation of map for Telugu others

How the Continuous Phonetic Input Method Works

For each Indic script, a ‘virama’ or equivalent sign combined with a consonant gives the half form (or ready to combine form) of the consonant. Whenever a multiple key combination corresponding to a consonant is typed, the consonant + virama form is output, symbolizing that the characters are ready to combine.

On initial input, a consonant assumes its half form. When the consonant is followed by a vowel, it becomes a full syllable or a variation of one.

Two consecutive consonants remain as ready-to-combine half forms. The layout engine can convert half forms into a single combined character, or they can remain as independent forms, which are also syntactically valid in every language.

Any vowel that begins a word or follows another vowel appears in its independent form. A vowel that immediately follows a consonant assumes its dependent form.

Characters that do not change shapes in any context are called others. These characters are neither consonants nor vowels.

Digits and other punctuation marks that do not form a part of a character are mapped one to one.

Using these principles, a parser divides the input into these categories and outputs the language-specific Unicode codepoints. The continuous phonetic input method engine does not deal with layout or rendering, which are handled by other modules in the system.
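The principles above can be sketched as a small Python parser. This is a toy illustration, not the engine shipped with the system: the token tables cover only a handful of Devanagari characters and are hypothetical (they are not copied from the mapfiles shown in the figures), and the names `next_token` and `to_devanagari` are invented for this example. Tokens are matched greedily, longest first.

```python
VIRAMA = "\u094d"

# Hypothetical, deliberately tiny token tables for Devanagari.
CONSONANTS = {"k": "\u0915", "kh": "\u0916", "g": "\u0917", "t": "\u0924",
              "n": "\u0928", "m": "\u092e", "r": "\u0930", "s": "\u0938"}
# token -> (independent form, dependent form); "a" is the inherent vowel.
VOWELS = {"a": ("\u0905", ""), "aa": ("\u0906", "\u093e"),
          "i": ("\u0907", "\u093f"), "u": ("\u0909", "\u0941"),
          "e": ("\u090f", "\u0947"), "o": ("\u0913", "\u094b")}
OTHERS = {"M": "\u0902"}  # anusvara: its shape does not depend on context

MAX_TOKEN = max(len(t) for t in (*CONSONANTS, *VOWELS, *OTHERS))


def next_token(text, pos):
    """Return the longest mapped token at pos, or the single raw character."""
    for size in range(min(MAX_TOKEN, len(text) - pos), 0, -1):
        candidate = text[pos:pos + size]
        if candidate in CONSONANTS or candidate in VOWELS or candidate in OTHERS:
            return candidate
    return text[pos]  # digits, punctuation, spaces: mapped one to one


def to_devanagari(text):
    out = []
    half_form = False  # True while the last output is consonant + virama
    pos = 0
    while pos < len(text):
        token = next_token(text, pos)
        pos += len(token)
        if token in CONSONANTS:
            # Consonants are emitted in half form, ready to combine.
            out += [CONSONANTS[token], VIRAMA]
            half_form = True
        elif token in VOWELS:
            independent, dependent = VOWELS[token]
            if half_form:
                out[-1] = dependent  # replace the virama with the vowel sign
            else:
                out.append(independent)
            half_form = False
        else:
            out.append(OTHERS.get(token, token))
            half_form = False
    return "".join(out)


print(to_devanagari("namaste"))
print(to_devanagari("aam"))
```

As in the description, the sketch emits only codepoints. Combining the half forms into conjuncts is left to whatever layout engine renders the result.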

Thai Localization

The current Oracle Solaris environment supports three Thai input levels and four Thai keyboard layouts.

Thai Input Methods

The following Thai input methods are supported in this release. These input methods are specified in the Thai IT Standard for character sequence checking.

Passthrough level, no input check
Basic input check level
Strict input check level

The passthrough level, with no sequence check, is the default in this release as it was in previous Oracle Solaris releases.

You can use the F2 function key to switch from one input level to the next.
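The difference between passing input through and checking it can be sketched in Python. The acceptance rules below are a deliberately simplified assumption and do not reproduce the standard's sequence-checking table: an above or below vowel must follow a consonant, and a tone mark must follow a consonant or an above or below vowel. The distinction between the basic and strict levels is not modeled, and all names are hypothetical.

```python
# Simplified character classes (assumption: only these marks are checked).
ABOVE_BELOW_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")


def is_consonant(ch):
    """Thai consonants occupy U+0E01 through U+0E2E."""
    return "\u0e01" <= ch <= "\u0e2e"


def accepts(previous, ch, checking):
    """Decide whether ch may follow previous (None at the start of input)."""
    if not checking:
        return True  # passthrough: no sequence check at all
    if ch in ABOVE_BELOW_VOWELS:
        return previous is not None and is_consonant(previous)
    if ch in TONE_MARKS:
        return previous is not None and (
            is_consonant(previous) or previous in ABOVE_BELOW_VOWELS)
    return True


def filter_keys(keys, checking):
    """Return the characters that would be accepted from a key sequence."""
    accepted = []
    for ch in keys:
        previous = accepted[-1] if accepted else None
        if accepts(previous, ch, checking):
            accepted.append(ch)
    return "".join(accepted)


stray_tone = "\u0e48\u0e01"  # a tone mark typed before any consonant
print(len(filter_keys(stray_tone, checking=False)))
print(len(filter_keys(stray_tone, checking=True)))
```

With checking disabled, the stray tone mark is kept. With checking enabled, it is rejected and only the consonant remains.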

Thai Keyboard Layouts

Four different keyboard layouts are supported for the Thai input method.

Kedmanee (TIS820-2531) keyboard layout. The Kedmanee layout was designed for the typewriter, not the computer keyboard. The limited number of keys on the typewriter keyboard meant that some of the Thai special characters were not available in the layout. TIS820-2531 has adopted the Kedmanee layout for use with a computer keyboard.
TIS820-2538 keyboard layout. This enhanced Kedmanee layout is an updated version of the TIS820-2531 layout that includes some of the Thai special characters that were unavailable in the original Kedmanee layout. Currently, TIS820-2538 is the only Thai keyboard layout standard that is issued by the Thai Industrial Standards Institute.
Pattajoti keyboard layout. The Pattajoti layout was also designed for the typewriter, but with better finger-load distribution.
Configurable keyboard layout. User-defined keyboard layout for the Thai input method.

Thai Input Method Auxiliary Window

The Thai input method auxiliary window supports the following functions and utilities:

Input level switching. You can click the input level button on the auxiliary palette to choose passthrough, basic, or strict as your input level.
Thai virtual keyboards. You can click the keyboard button to display the Thai virtual keyboard, which you can use to enter Thai characters.