International Language Environments Guide

Chapter 4 Overview of `en_US.UTF-8` Locale Support

Unicode Overview

The Unicode Standard is the universal character encoding standard used for representation of text for computer processing. It is fully compatible with the International Standard ISO/IEC 10646-1:1999, and contains all the same characters and code points as ISO/IEC 10646. The Unicode Standard provides additional information about the characters and their use. Any implementation that conforms to Unicode also conforms to ISO/IEC 10646.

Unicode provides a consistent way of encoding multilingual plain text and brings order to a chaotic state of affairs that has made it difficult to exchange text files internationally. Computer users who deal with multilingual text, business people, linguists, researchers, scientists, and others, find that the Unicode Standard greatly simplifies their work. Mathematicians and technicians, who regularly use mathematical symbols and other technical characters, also find the Unicode Standard valuable.

The design of Unicode is based on the simplicity and consistency of ASCII, but goes beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. It uses a 16-bit encoding that provides code points for more than 65,000 characters. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique 16-bit value and does not use complex modes or escape codes. While 65,000 characters are sufficient for encoding most of the many thousands of characters used in the major languages of the world, the Unicode Standard and ISO/IEC 10646 provide an extension mechanism called UTF-16 that allows for encoding as many as a million more characters without the use of escape codes. This is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world.

UTF-16 allows exactly 16 x 65536 (1,048,576) additional code points and still uses two-byte entities to represent characters. However, each of those additional characters requires two two-byte entities, for a total of four bytes per character. For more details on UTF-16, refer to section C.3 of "The Unicode Standard, Version 2.0" from the Unicode Consortium, or Annex C of ISO/IEC 10646-1:1999, Information Technology--Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane.
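The arithmetic behind the UTF-16 extension mechanism can be sketched in a few lines of Python. The function name `to_surrogate_pair` is illustrative only, and the terms "high" and "low" follow common Unicode usage rather than this guide. The sketch splits a code point above U+FFFF into two 16-bit units and checks the result against Python's own UTF-16 encoder.

```python
# Sketch: how UTF-16 represents a code point above U+FFFF as two 16-bit units.
# The function name is hypothetical; it is not part of any Solaris interface.

def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the UTF-16 extension range")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # carries the bottom 10 bits
    return high, low

# 16 additional groups of 65536 code points each.
ADDITIONAL_CODE_POINTS = 16 * 65536

# Python's encoder produces the same two units, four bytes in total.
encoded = chr(0x10000).encode("utf-16-be")
```

Each additional character therefore occupies four bytes, while every 16-bit character still occupies two.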

Unicode Locale: `en_US.UTF-8` Support Overview

The en_US.UTF-8 locale is a significant Unicode locale in the Solaris 8 product. It supports and provides multiscript processing capability by using UTF-8 as its codeset. It can input and output text in multiple scripts. This was the first locale with this capability in the Solaris operating environment.

Note -

UTF-8 is a file-system-safe Universal Character Set Transformation Format of Unicode / ISO/IEC 10646-1, formulated by the X/Open-Uniforum Joint Internationalization Working Group (XoJIG) in 1992 and approved by ISO and IEC as Amendment 2 to ISO/IEC 10646-1:1993 in 1996. This standard has been adopted by the Unicode Consortium, the International Organization for Standardization, and the International Electrotechnical Commission as a part of Unicode 2.0 and ISO/IEC 10646-1.
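One way to read "file system safe" is that the bytes of a multibyte UTF-8 sequence never fall in the ASCII range, so a byte such as NUL or `/` cannot appear inside an encoded non-ASCII character. The following Python sketch checks that property for a few sample characters. The function name and the sample string are assumptions made for illustration.

```python
# Sketch: check that multibyte UTF-8 sequences contain no ASCII-range bytes.
# The helper name is hypothetical.

def is_file_system_safe(text):
    """Return True if no byte of any multibyte sequence is below 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True

# Inverted exclamation mark, Greek beta, Hangul KA, Katakana A, a Han character.
samples = "\u00a1\u03b2\uac00\u30a2\u4e58"
sample_bytes = samples.encode("utf-8")
```

ASCII characters themselves are unchanged by UTF-8: `"A".encode("utf-8")` is the single byte `b"A"`.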

en_US.UTF-8 supports computation for every code point value that is defined in Unicode 3.0 and ISO/IEC 10646-1. In the Solaris 8 environment, language script support is not limited to pan-European locales, but also includes Asian scripts such as Korean, Traditional Chinese, Simplified Chinese, and Japanese. Due to limited font resources, Solaris 8 software includes only character glyphs from the following character sets:

ISO 8859-1 (most Western European languages, such as English, French, Spanish, and German)
ISO 8859-2 (most Central European languages, such as Czech, Polish, and Hungarian)
ISO 8859-4 (Scandinavian and Baltic languages)
ISO 8859-5 (Russian)
ISO 8859-6 (Arabic, including many more presentation form character glyphs)
ISO 8859-7 (Greek)
ISO 8859-8 (Hebrew)
ISO 8859-9 (Turkish)
TIS 620.2533 (Thai, including many more presentation form character glyphs)
ISO 8859-15 (most Western European languages with euro sign)
GB 2312-1980 (Simplified Chinese)
Big5 (Traditional Chinese)
JIS X0201-1976, JIS X0208-1983 (Japanese)
KS C 5601-1992 Annex 3 (Korean)

If a user displays characters for which the en_US.UTF-8 locale does not have corresponding glyphs, the locale displays a 'no-glyph' glyph instead, as in the following example:

Starting with the Solaris 8 environment, the locale is available for all clusters except the Core cluster.

Exactly the same level of en_US.UTF-8 locale support is provided for both 64-bit and 32-bit Solaris systems.

Note -

Motif and CDE desktop applications and libraries support the en_US.UTF-8 locale. However, OpenWindows, XView, and OPENLOOK DeskSet applications and libraries do not support the en_US.UTF-8 locale.

Desktop Input Methods

CDE provides the ability to enter localized input for an internationalized application that is using Xm Toolkit. The XmText[Field] widgets are enabled to interface with input methods from each locale. Input methods are internationalized because some language environments write their text from right-to-left, top-to-bottom, and so forth. Within the same application, you can use several fonts that apply different input methods.

The pre-edit area displays the string that is being pre-edited. This can be done in four modes:

OffTheSpot
OverTheSpot (default)
Root
None

In OffTheSpot mode, the location is just below the MainWindow area at the right of the status area. In OverTheSpot mode, the pre-edit area is at the cursor point. In Root mode, the pre-edit and status areas are separate from the client's window.

Note -

In the Solaris 8 environment, there are native Asian input methods for Simplified/Traditional Chinese, Japanese, and Korean in addition to the current multi-script input methods for Unicode locales. This section includes descriptions of selected input methods, how to use them, and how to switch between them.

Script Selection and Input Modes

The en_US.UTF-8 locale supports multiple scripts. The en_US.UTF-8 locale has a total of twelve input modes:

English/European
Cyrillic
Greek
Arabic
Hebrew
Thai
Unicode Hexadecimal and Octal code input methods
Table lookup input method
Japanese
Korean
Simplified Chinese
Traditional Chinese

To switch into a certain input mode, you can either type in an input mode switch compose key sequence for each input mode, or press the left-most mouse button at the status area of your application to open an input mode selection window and select from the listed input modes as follows:

English/European Input Mode

The English/European input mode includes not only the English alphabet but also characters with diacritical marks (for example, á, è, î, õ, and ü) and special characters (such as ¡, §, ¿) from European scripts.

This input mode is the default mode for any application. The input mode is displayed at the bottom left corner of the GUI application.

To insert characters with diacritical marks or special characters from Latin-1, Latin-2, Latin-4, Latin-5, and Latin-9, you must type a Compose Sequence, as shown in the following examples:

For Ä, press and release Compose, then A, and then "
For ¿, press and release Compose, then ?, and then ?

If your keyboard has no <Compose> key, you can substitute for it by pressing the <Control>, <Shift>, and <t> keys simultaneously.

To input the Euro currency symbol (Unicode value U+20AC) from the locale, you can use any one of the following input sequences:

<AltGraph> and <e> together
<AltGraph> and <4> together
<AltGraph> and <5> together

These input sequences mean that you press both keys simultaneously. If there is no <AltGraph> key available on your keyboard, you can substitute the <Alt> key for the <AltGraph> key.

The following tables show the most commonly used Compose Sequences in Latin-1, Latin-2, Latin-4, Latin-5, and Latin-9 script input for the Solaris operating environment.

Note -

To start these sequences, press the <Compose> key and release it.

The following table lists the Common Latin-1 Compose Sequences.

Table 4-1 Common Latin-1 Compose Sequences


Press and Release	Press and Release	Result
[Spacebar]	[Spacebar]	No-break space
s	1	Superscripted 1
s	2	Superscripted 2
s	3	Superscripted 3
!	!	Inverted exclamation mark
x	o	Currency symbol ¤
p	!	Paragraph symbol ¶
/	u	micro sign µ
'	"	acute accent ´
,	,	cedilla ¸
"	"	diaeresis ¨
-	^	macron ¯
o	o	degree °
x	x	multiplication sign ×
+	-	plus-minus ±
-	-	soft hyphen -
-	:	division sign ÷
-	a	ordinal (feminine) ª
-	o	ordinal (masculine) º
-	,	not sign ¬
.	.	middle dot ·
1	2	vulgar fraction ½
1	4	vulgar fraction ¼
3	4	vulgar fraction ¾
<	<	left double angle quotation mark «
>	>	right double angle quotation mark »
?	?	inverted question mark ¿
A	`	A grave À
A	'	A acute Á
A	*	A ring above Å
A	"	A diaeresis Ä
A	^	A circumflex Â
A	~	A tilde Ã
A	E	AE diphthong Æ
C	,	C cedilla Ç
C	o	copyright sign ©
D	-	Capital eth Ð
E	`	E grave È
E	'	E acute É
E	"	E diaeresis Ë
E	^	E circumflex Ê
I	`	I grave Ì
I	'	I acute Í
I	"	I diaeresis Ï
I	^	I circumflex Î
L	-	pound sign £
N	~	N tilde Ñ
O	`	O grave Ò
O	'	O acute Ó
O	/	O slash Ø
O	"	O diaeresis Ö
O	^	O circumflex Ô
O	~	O tilde Õ
R	O	registered mark ®
T	H	Thorn Þ
U	`	U grave Ù
U	'	U acute Ú
U	"	U diaeresis Ü
U	^	U circumflex Û
Y	'	Y acute Ý
Y	-	yen sign ¥
a	`	a grave à
a	'	a acute á
a	*	a ring above å
a	"	a diaeresis ä
a	~	a tilde ã
a	^	a circumflex â
a	e	ae diphthong æ
c	,	c cedilla ç
c	/	cent sign ¢
c	o	copyright sign ©
d	-	eth ð
e	`	e grave è
e	'	e acute é
e	"	e diaeresis ë
e	^	e circumflex ê
i	`	i grave ì
i	'	i acute í
i	"	i diaeresis ï
i	^	i circumflex î
n	~	n tilde ñ
o	`	o grave ò
o	'	o acute ó
o	/	o slash ø
o	"	o diaeresis ö
o	^	o circumflex ô
o	~	o tilde õ
s	s	German double s ß
t	h	thorn þ
u	`	u grave ù
u	'	u acute ú
u	"	u diaeresis ü
u	^	u circumflex û
y	'	y acute ý
y	"	y diaeresis ÿ
|	|	broken bar ¦
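Conceptually, a Compose sequence is a lookup keyed on the two keys pressed after <Compose>. The Python sketch below models a handful of rows from the table above as a dictionary. The names `COMPOSE_TABLE` and `compose` are hypothetical, and the sketch does not describe how the Solaris input method is actually implemented.

```python
# Sketch: a few Latin-1 Compose sequences modelled as a lookup table.
# Keys are (first key, second key); values are the resulting characters.
COMPOSE_TABLE = {
    ("A", '"'): "\u00c4",   # A diaeresis
    ("?", "?"): "\u00bf",   # inverted question mark
    ("s", "s"): "\u00df",   # German double s
    ("c", ","): "\u00e7",   # c cedilla
    ("1", "2"): "\u00bd",   # vulgar fraction one half
}

def compose(first, second):
    """Return the composed character, or None if the pair is not in the table."""
    return COMPOSE_TABLE.get((first, second))
```

The order of the two keys matters in this model: `("A", '"')` is present, but `('"', "A")` is not.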

The following table lists the Common Latin-2 and Latin-4 Compose Sequences.

Table 4-2 Common Latin-2 Compose Sequences


Press and Release	Press and Release	Result
a	' '	ogonek
u	' '	breve
v	' '	caron
"	' '	double acute
A	a	A ogonek
A	u	A breve
C	'	C acute
C	v	C caron
D	v	D caron
-	D	D stroke
E	v	E caron
E	a	E ogonek
L	'	L acute
L	-	L stroke
L	>	L caron
N	'	N acute
N	v	N caron
O	>	O double acute
S	'	S acute
S	v	S caron
S	,	S cedilla
R	'	R acute
R	v	R caron
T	v	T caron
T	,	T cedilla
U	*	U ring above
U	>	U double acute
Z	'	Z acute
Z	v	Z caron
Z	.	Z dot above
k	k	kra
A	_	A macron
E	_	E macron
E	.	E dot above
G	,	G cedilla
I	_	I macron
I	~	I tilde
I	a	I ogonek
K	,	K cedilla
L	,	L cedilla
N	,	N cedilla
O	_	O macron
R	,	R cedilla
T	|	T stroke
U	~	U tilde
U	a	U ogonek
U	_	U macron
N	N	Eng
a	_	a macron
e	_	e macron
e	.	e dot above
g	,	g cedilla
i	_	i macron
i	~	i tilde
i	a	i ogonek
k	,	k cedilla
l	,	l cedilla
n	,	n cedilla
o	_	o macron
r	,	r cedilla
t	|	t stroke
u	~	u tilde
u	a	u ogonek
u	_	u macron
n	n	eng

The following table lists the Common Latin-5 Compose Sequences.

Table 4-3 Common Latin-5 Compose Sequences


Press and Release	Press and Release	Result
G	u	G breve
I	.	I dot above
g	u	g breve
i	.	i dotless

Any Compose Sequences already described do not re-appear in this table.

The following table lists the Common Latin-9 Compose Sequences.

Table 4-4 Common Latin-9 Compose Sequences


Press and Release	Press and Release	Result
o	e	Diphthong oe
O	E	Diphthong OE
Y	"	Y diaeresis

Cyrillic Input Mode

To switch to Cyrillic input mode, either press <Compose> <c> <c> at your keyboard, or press the left-most mouse button at the status area of your application and select "[Cyrillic]" from the Input Mode Selection Window.

The input mode is displayed at the bottom left corner of your GUI application.

After you switch to Cyrillic input mode, you cannot enter English or European text. To switch back to the English/European input mode, type <Control> + <Space> from your keyboard, or select "[English/European]" input mode from the Input Mode Selection Window by using your mouse. The Russian keyboard layout appears in the following figure.

Figure 4-1 Cyrillic Keyboard

You can also switch into other input modes by typing the corresponding input mode switch key sequence.

Greek Input Mode

To switch to Greek input mode, either press <Compose> <g> <g> on your keyboard, or press the left-most mouse button at the status area of your application and select "[Greek]" from the Input Mode Selection Window.

The input mode is displayed at the left bottom corner of your GUI application.

After you switch to Greek input mode, you cannot enter English or European text. To switch back to the English/European input mode, type <Control> + <Space> from your keyboard, or select "[English/European]" input mode from the Input Mode Selection Window by using your mouse. The Greek keyboard layouts appear in the following two figures.

Figure 4-2 Greek Euro Keyboard

Figure 4-3 Greek UNIX Keyboard

Arabic Input Mode

To switch to Arabic input mode, type <Compose> <a> <r> from your current input mode. The input mode is displayed at the left bottom corner of your GUI application. After you switch to the Arabic input mode, you have to switch back to English/European input mode to enter English/European characters by typing <Control> and <Space> together.

You can also switch into other input modes by either typing the corresponding input mode switch key sequence from your keyboard, or selecting an input mode from the Input Mode Selection Window by using your mouse.

Figure 4-4 Arabic Keyboard

Hebrew Input Mode

To switch into Hebrew input mode, type <Compose> <h> <h> from your current input mode. The input mode is displayed at the bottom left corner of your GUI application. You can also switch into the Hebrew input mode by pressing the left-most mouse button at the status area of your application and then selecting "[Hebrew]" from the Input Mode Selection Window.

After you have switched into the Hebrew input mode, you have to switch back to the English/European input mode to enter English/European characters. To switch your input mode, you can either type the corresponding input mode switch key sequence of your next input mode from your keyboard, or select an input mode from the Input Mode Selection Window by using your mouse. The Hebrew keyboard layout is shown in the following figure:

Figure 4-5 Hebrew Keyboard

Thai Input Mode

To switch into Thai input mode, type <Compose> <t> <t> from your current input mode. The input mode displays at the left bottom corner of your GUI application.

After you have switched into the Thai input mode, you have to switch back to English/European input mode to enter English/European characters. To switch your input mode, either type the corresponding input mode switch key sequence of your next input mode from your keyboard, or select an input mode from the Input Mode Selection Window by using your mouse. The Thai keyboard layout is shown in the following figure:

Figure 4-6 Thai Keyboard

Unicode Hexadecimal and Octal Code Input Method Input Modes

To switch into the Unicode hexadecimal code input method input mode, type <Compose> <u> <h> from your current input mode. You can also select "[Unicode Hex]" from the Input Mode Selection Window by using your mouse. The input mode is displayed at the left bottom corner of your application.

If you prefer the octal number system, you can also switch into the Unicode octal code input method input mode by typing <Compose> <u> <o> from your current input mode or by selecting "[Unicode Octal]" from the Input Mode Selection Window.

To use these input modes, you need to know either the hexadecimal or the octal code point values of the characters. Refer to The Unicode Standard, Version 3.0 for the mapping between code point values and characters. To input a character in the Unicode hexadecimal code input method input mode, type four hexadecimal digits: for instance, 00a1 for Inverted Exclamation Mark, 03b2 for Greek Small Letter Beta, ac00 for the Korean Hangul Syllable KA, 30a2 for Japanese Katakana Letter A, 4e58 for a Unified Han character, and so on. You can use both uppercase and lowercase letters A, B, C, D, E, and F for hexadecimal digits. If you prefer the octal number system instead of hexadecimal numbers, you can input octal digits, 0 to 7. If you mistype a digit or two, you can delete the digits by using the <Delete> key or the <Backspace> key.
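The mapping from typed digits to a character is a plain base conversion. The Python sketch below uses hypothetical function names and is not the input method's own code. It converts four hexadecimal digits, or a run of octal digits, to the corresponding Unicode character.

```python
import string

# Sketch: turn typed code point digits into a character.
# Function names are hypothetical.

def char_from_hex(digits):
    """Convert exactly four hexadecimal digits (either case) to a character."""
    if len(digits) != 4 or any(d not in string.hexdigits for d in digits):
        raise ValueError("expected exactly four hexadecimal digits")
    return chr(int(digits, 16))

def char_from_octal(digits):
    """Convert a string of octal digits (0 to 7) to a character."""
    if not digits or any(d not in string.octdigits for d in digits):
        raise ValueError("expected octal digits only")
    return chr(int(digits, 8))
```

For example, `char_from_hex("00a1")` and `char_from_octal("241")` both yield the Inverted Exclamation Mark, because hexadecimal A1 and octal 241 are the same value.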

Table Lookup Input Method Input Mode

To switch into table lookup input method input mode, type <Compose> <l> <l> from your current input mode. The input mode is displayed at the bottom left corner of your GUI application.

After you turn on the input mode, there is a lookup group window showing multiple groups of characters. You can choose any one of the groups to enter characters from the group. Once you select a group, there will be the second lookup window showing multiple candidates of available Unicode characters belonging to the group of your choice. You can choose any one of the candidates by moving your pointer and clicking the left button on your mouse. You can also select any one of the candidates by choosing a left-hand-side letter associated with each of the candidates.

You can also see the next set of candidates by typing <Control> and <n> keys together. Similarly, to see the previous set of candidates, type the <Control> and <p> keys together. The <n> stands for 'next' and the <p> stands for 'previous'.

After you are finished using the current input mode, you can switch into another input mode by typing a corresponding input mode switch key sequence.

Japanese Input Mode

To switch into the Japanese input mode, either type <Compose> <j> <a> from your keyboard or select "[ Japanese ]" from the Input Mode Selection Window by using your mouse. The input mode is displayed at the left bottom corner of your application. The following figure shows a Japanese input method mode of ATOK12:

To use the native Japanese input system, you need to install one or more Japanese locales on your system. Once you install the Japanese locales, you will be able to use any one of the native Japanese input systems, such as ATOK12, ATOK8, Wnn6, or cs00.

For more details on how to use the Japanese Input System, refer to "ATOK12 User's Guide", "ATOK8 User's Guide", "Wnn6 User's Guide", and "cs00 User's Guide."

Korean Input Mode

To switch into the Korean input mode, either type <Compose> <k> <o> from your keyboard, or select "[ Korean ]" from the Input Mode Selection Window by using your mouse. The input mode is displayed at the left bottom corner of your application. The following figure shows the Phonetic Hangul input method, which is one of many native Korean input methods available.

To use the native Korean input system, you need to install one or more Korean locales on your system. Once you install the Korean locales, you will be able to use the native Korean input system. For more details on how to use the Korean Input System, refer to "Korean Solaris User's Guide".

Simplified Chinese Input Mode

To switch into the Simplified Chinese input mode, either type <Compose> <s> <c> from your keyboard, or select "[ S-Chinese ]" from the Input Mode Selection Window by using your mouse. The input mode is displayed at the left bottom corner of your application. The following figure shows the New Pin Yin input method, which is one of many native Simplified Chinese input methods available.

To use the native Simplified Chinese input system, you need to install one or more Simplified Chinese locales on your system. Once you install the Simplified Chinese locales, you will be able to use the native Simplified Chinese input system. For more details on how to use the Simplified Chinese Input System, refer to "Simplified Chinese Solaris User's Guide."

Traditional Chinese Input Mode

To switch into the Traditional Chinese input mode, either type <Compose> <t> <c> from your keyboard or select "[ T-Chinese ]" from the Input Mode Selection Window by using your mouse. The input mode is displayed at the left bottom corner of your application. The following figure shows the TsangChieh input method, which is one of many native Traditional Chinese input methods available.

To use the native Traditional Chinese input system, you need to install one or more Traditional Chinese locales on your system. Once you install the Traditional Chinese locales, you will be able to use the native Traditional Chinese input system. For more details on how to use the Traditional Chinese Input System, refer to "Traditional Chinese Solaris User's Guide".

Input Mode Switch Key Sequence Summary

Users can switch from one input mode to another without any restrictions. The following table shows the input mode switch key sequences for each input mode.

Table 4-5 Input Mode Switch Key Sequences


Input Mode	Key Sequences
English/European	<Control> + <Space>
Cyrillic	<Compose> <c> <c>
Greek	<Compose> <g> <g>
Arabic	<Compose> <a> <r>
Hebrew	<Compose> <h> <h>
Thai	<Compose> <t> <t>
Unicode hexadecimal code input method	<Compose> <u> <h>
Table lookup input method	<Compose> <l> <l>
Unicode octal code input method	<Compose> <u> <o>
Japanese	<Compose> <j> <a>
Korean	<Compose> <k> <o>
Simplified Chinese	<Compose> <s> <c>
Traditional Chinese	<Compose> <t> <c>

System Environment

Locale Environment Variable

To use the en_US.UTF-8 locale environment, choose the locale first. Be sure you have the en_US.UTF-8 locale installed on your system.

How to Use the `en_US.UTF-8` Locale Environment

In a TTY environment, choose the locale first by setting the LANG environment variable to en_US.UTF-8, as in the following C-shell example:

system% setenv LANG en_US.UTF-8

Make sure that the other locale environment variables are either not set or are set to en_US.UTF-8, because the LANG environment variable has a lower priority than LC_ALL, LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_NUMERIC, LC_MONETARY, and LC_TIME when the locale is set. See the setlocale(3C) man page for more details about the hierarchy of environment variables.

To check current locale settings in various categories, use the locale(1) utility.

system% locale 
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
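The priority of these environment variables can be sketched as a small lookup. The Python function below is a hypothetical illustration, not the logic of setlocale(3C) itself. It assumes the usual POSIX order in which LC_ALL overrides an individual category variable, the category variable overrides LANG, and the "C" locale applies when nothing is set.

```python
# Sketch: resolve the locale for one category from environment variables.
# Assumes the precedence LC_ALL, then the category variable, then LANG,
# and assumes "C" as the fallback when none is set.

def effective_locale(category, env):
    """Return the locale name that applies to a category such as 'LC_CTYPE'."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:                # unset or empty variables are skipped
            return value
    return "C"
```

With only `LANG=en_US.UTF-8` set, every category resolves to en_US.UTF-8, which matches the locale(1) output shown above. Setting any single category variable to another locale overrides LANG for that category only.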

You can also start the en_US.UTF-8 environment from the CDE desktop. At the CDE login screen's Options -> Language menu, choose en_US.UTF-8.

TTY Environment Setup

Depending on the terminal or terminal emulator that you are using, such as dtterm(1), you may need to push certain codeset-specific STREAMS modules onto your Stream.

For more information on STREAMS modules and streams in general, see the STREAMS Programming Guide.

The following table shows STREAMS modules supported by the en_US.UTF-8 locale in the terminal environment:

Table 4-6 32-bit STREAMS Modules Supported by en_US.UTF-8


32-bit STREAMS Module	Description
`/usr/kernel/strmod/u8lat1`	Code conversion STREAMS module between `UTF-8` and ISO 8859-1 (Western European)
`/usr/kernel/strmod/u8lat2`	Code conversion STREAMS module between `UTF-8` and ISO 8859-2 (Eastern European)
`/usr/kernel/strmod/u8koi8`	Code conversion STREAMS module between `UTF-8` and `KOI8-R` (Cyrillic)

The following table lists the 64-bit STREAMS Modules Supported by en_US.UTF-8.

Table 4-7 64-bit STREAMS Modules Supported by en_US.UTF-8


64-bit STREAMS module	Description
`/usr/kernel/strmod/sparcv9/u8lat1`	Code conversion STREAMS module between `UTF-8` and ISO 8859-1 (Western European)
`/usr/kernel/strmod/sparcv9/u8lat2`	Code conversion STREAMS module between `UTF-8` and ISO 8859-2 (Eastern European)
`/usr/kernel/strmod/sparcv9/u8koi8`	Code conversion STREAMS module between `UTF-8` and KOI8-R (Cyrillic)
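A code conversion module on a terminal stream has to cope with multibyte characters that arrive split across separate chunks of data. The following Python sketch is only an analogy: it uses Python's incremental codecs, the function name `convert_stream` is hypothetical, and it says nothing about how the kernel modules listed above are written.

```python
import codecs

# Sketch: convert a byte stream chunk by chunk, keeping incomplete
# multibyte sequences buffered until the remaining bytes arrive.

def convert_stream(chunks, from_code="utf-8", to_code="iso-8859-1"):
    """Convert an iterable of byte chunks from one codeset to another."""
    decoder = codecs.getincrementaldecoder(from_code)()
    out = bytearray()
    for chunk in chunks:
        # decode() returns only complete characters; a trailing partial
        # sequence stays inside the decoder until the next chunk.
        out += decoder.decode(chunk).encode(to_code)
    # Flush the decoder; this raises an error if input ended mid-character.
    out += decoder.decode(b"", final=True).encode(to_code)
    return bytes(out)

# "é" is the two bytes C3 A9 in UTF-8; here they arrive in separate chunks.
result = convert_stream([b"caf\xc3", b"\xa9"])
```

In this sketch the same function covers the other two pairings in the tables by passing `to_code="iso-8859-2"` or `to_code="koi8-r"`.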

Loading a STREAMS Module into the Kernel

To load a STREAMS module into the kernel, first become root:

system% su
Password:
system#

To determine whether you are running a 64-bit Solaris or 32-bit Solaris system, use the isainfo(1) utility as follows:

system# isainfo -v
64-bit sparcv9 applications
32-bit sparc applications
system#

If the command returns this information, you are running the 64-bit Solaris system. If you are running the 32-bit Solaris system, the utility shows the following:

system# isainfo -v
32-bit sparc applications
system#

Use modinfo(1M) to be certain that your system has not already loaded the STREAMS module:

system# modinfo | grep u8lat1
system#

If the STREAMS module, such as u8lat1, is already loaded, the output looks as follows:

system# modinfo | grep u8lat1
89 ff798000  4b13  18   1  u8lat1 (UTF-8 <--> ISO 8859-1 module)
system#

If the module is already loaded, you do not need to load it again. However, if the module has not yet been loaded, use modload(1M) as follows:

system# modload /usr/kernel/strmod/u8lat1

This loads the 32-bit u8lat1 STREAMS module into the kernel so you can push it onto a Stream. If you are running the 64-bit Solaris product, use modload(1M) as follows:

system# modload /usr/kernel/strmod/sparcv9/u8lat1

The STREAMS module is loaded into the kernel and you can now push it onto a Stream.

To unload a module from the kernel, use modunload(1M), as shown below. In this example, the u8lat1 module is being unloaded.

system# modinfo | grep u8lat1
89 ff798000  4b13  18   1  u8lat1 (UTF-8 <--> ISO 8859-1 module)
system# modunload -i 89

`dtterm` and Terminals Capable of Input and Output of UTF-8 Characters

Unlike in previous releases of the Solaris operating environment, dtterm(1) and any other terminals that support input and output of the UTF-8 codeset do not need any additional STREAMS module in their Stream. The ldterm(7M) module is now codeset independent and supports Unicode/UTF-8 as well.

For the proper terminal environment setup for the Unicode locales, use the stty(1) utility as follows:

system% stty defeucw

Note -

Since /usr/ucb/stty is not internationalized, use /bin/stty instead.

Terminal Support for Latin-1, Latin-2, or KOI8-R

For terminals that support only Latin-1 (ISO 8859-1), Latin-2 (ISO 8859-2), or KOI8-R, you should have the following STREAMS configuration:

head <-> ttcompat <->  ldterm <->  u8lat1 <-> TTY

Note -

This configuration is only for terminals that support Latin-1. For Latin-2 terminals, replace the STREAMS module u8lat1 with u8lat2. For KOI8-R terminals, replace the module with u8koi8.

Make sure you already have the STREAMS module loaded into the kernel.

To set up the STREAMS configuration shown above, use strchg(1), as follows:

system% cat > /tmp/mystreams
ttcompat
ldterm
u8lat1
ptem
^D
system% strchg -f /tmp/mystreams

Be sure that you are either root or the owner of the device when you use strchg(1). To see the current configuration, use strconf(1), as follows:

system% strconf
ttcompat
ldterm
u8lat1
ptem
pts
system%

To reset the original configuration, set the STREAMS configuration as follows:

system% cat > /tmp/orgstreams
ttcompat
ldterm
ptem
^D
system% strchg -f  /tmp/orgstreams

Setting Terminal Options

To set up the UTF-8 text edit behavior on TTY, you must first set some terminal options using stty(1), as follows:

system% /bin/stty  defeucw

Note -

Because /usr/ucb/stty is not yet internationalized, you should use /bin/stty instead.

You can also query the current settings using stty(1) with the -a option, as shown below:

system% /bin/stty -a

Saving the Settings in `~/.cshrc`

Assuming the necessary STREAMS modules are already loaded into the kernel, you can save the following lines in your .cshrc file (C shell example) for convenience:

setenv LANG en_US.UTF-8
# Run the terminal setup only for interactive shells.
if ($?USER != 0 && $?prompt != 0) then
    # Write the desired STREAMS configuration to a per-process temporary file.
    cat >! /tmp/mystreams$$ << _EOF
    ttcompat
    u8euc
    ldterm
    eucu8
    ptem
_EOF
    # Apply the configuration, then remove the temporary file.
    /bin/strchg -f /tmp/mystreams$$
    /bin/rm -f /tmp/mystreams$$
    /bin/stty cs8 -istrip defeucw
endif

With these lines in your .cshrc file, you do not have to type all of the commands each time. Note that the second _EOF should be in the first column of the file. You can also create a file called mystreams and save it so that .cshrc refers to mystreams instead of creating it whenever you start a C shell.

Code Conversions

The en_US.UTF-8 locale supports various code conversions among major codesets of several countries through iconv(1) and iconv(3).

Note -

In the Solaris 8 environment, the utility geniconvtbl enables user-defined code conversions. The user-defined code conversions created with the geniconvtbl utility can be used with both iconv(1) and iconv(3). For more detail on this utility, refer to the geniconvtbl(1) and geniconvtbl(4) man pages.

The available fromcode and tocode names that can be applied to iconv(1) and iconv_open(3) are shown in the following table. For more details on iconv code conversion, see the iconv(1), iconv_open(3), iconv(3), and iconv_close(3) man pages. For more information on available code conversions, see iconv_en_US.UTF-8(5).

Also see Appendix A, iconv Code Conversions.

Note -

UCS-2, UCS-4, and UTF-16 are Unicode/ISO/IEC 10646 representation forms that recognize the Byte Order Mark (BOM) character defined in the Unicode 3.0 and ISO/IEC 10646-1:1999 standards. Other forms, such as UCS-2BE, UCS-4BE, and UTF-16BE, are Unicode/ISO/IEC 10646 representation forms that do not recognize the BOM character and that assume Big Endian byte ordering. Representation forms such as UCS-2LE, UCS-4LE, and UTF-16LE, on the other hand, assume Little Endian byte ordering. They also do not recognize the BOM character.
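The difference between the BOM-aware and the explicit byte order forms can be illustrated with Python's built-in UTF-16 codecs. This is an analogy only: the codec names below are Python's, not iconv fromcode or tocode names, and the behavior of the Solaris converters should be checked against their man pages.

```python
import codecs

text = "A"  # U+0041

# Explicit byte order: no BOM is written.
big_endian = text.encode("utf-16-be")        # bytes 00 41
little_endian = text.encode("utf-16-le")     # bytes 41 00

# The generic "utf-16" decoder reads a leading BOM to pick the byte order
# and removes it from the result.
from_be_bom = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
from_le_bom = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")

# The explicit big endian decoder does not treat the BOM specially:
# it comes through as the ordinary character U+FEFF.
kept = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16-be")
```

Both BOM-prefixed inputs decode to the same single character, while the explicit-order decoder leaves U+FEFF in the output.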

Note -

For associated scripts/languages of ISO 8859-* and KOI8-*, see http://czyborra.com/charsets/iso8859.html.

Printing

A new and enhanced mp(1) print filter is available in the Solaris 8 environment that can print various input file formats, including flat text files written in UTF-8. It uses TrueType and Type 1 scalable fonts and X11 bitmap fonts available on the Solaris system.

The output from the utility is standard PostScript, and can be sent to any PostScript printer.

Note -

Starting with the next release of the Solaris environment, xutops(1) will be obsolete.

To use the utility, type the following:

system% mp filename | lp

You can also use the utility as a filter, since the utility accepts stdin stream:

system% cat filename | mp | lp

You can set the utility as a printing filter for a line printer. For example, the following command sequence tells the printer service LP that the printer lp1 accepts only mp format files. This command line also installs the printer lp1 on port /dev/ttya. See the lpadmin(1M) man page for more details.

system# lpadmin -p lp1 -v /dev/ttya -I MP
system# accept lp1
system# enable lp1

Using lpfilter(1M), you can add the utility for a filter as follows:

system# lpfilter -f filtername -F pathname

The command tells LP that a converter (in this case, mp) is available through the filter description file named pathname. The contents of pathname can be as follows:

Input types: simple 
Output types: MP
Command: /usr/bin/mp

The filter converts the default type file input to PostScript output using /usr/bin/mp.

To print a UTF-8 text file, use the following command:

system% lp -T MP UTF-8-file

For more detail on mp(1), refer to the mp(1) man page.

DtMail

As a result of increased coverage in scripts, Solaris 8 DtMail running in the en_US.UTF-8 locale supports various MIME character sets shown below.

US-ASCII (7-bit US ASCII)
UTF-8 (UCS Transformation Format 8 of Unicode)
UTF-7 (UCS Transformation Format 7 of Unicode)
ISO-8859-1 (Latin-1)
ISO-8859-2 (Latin-2)
ISO-8859-3 (Latin-3)
ISO-8859-4 (Latin-4)
ISO-8859-5 (Latin/Cyrillic)
ISO-8859-6 (Latin/Arabic)
ISO-8859-7 (Latin/Greek)
ISO-8859-8 (Latin/Hebrew)
ISO-8859-9 (Latin-5)
ISO-8859-10 (Latin-6)
ISO-8859-15 (Latin-9)
KOI8-R (Cyrillic)
ISO-2022-JP (Japanese)
ISO-2022-KR and EUC-KR (Korean)
ISO-2022-CN (Simplified Chinese)
ISO-2022-TW (Traditional Chinese)
ISO-8859-13 (Latin-7/Baltic)
ISO-8859-14 (Latin-8/Celtic)
KOI8-U (Cyrillic/Ukrainian)
Shift_JIS (Japanese in Shift JIS)
BIG5 (Traditional Chinese in BIG5)
GB2312 (Simplified Chinese in EUC)
TIS-620 (Thai)
UTF-16 (UCS Transformation Format 16 of Unicode)
UTF-16BE (UTF-16 Big-Endian of Unicode)
UTF-16LE (UTF-16 Little-Endian of Unicode)

This support allows users to view virtually any kind of email encoded in various MIME character sets from any region of the world in a single instance of DtMail. The decoding of received email is done by DtMail, which looks at the MIME character set and content transfer encoding provided with the email.
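As a rough illustration of the first step of such decoding, the following C sketch extracts the charset parameter from a Content-Type header value. It is not DtMail's actual code. The function name mime_charset_param and its helper are hypothetical, and the parsing is deliberately simplified: it does not handle semicolons inside other quoted parameters or the comment syntax allowed in MIME headers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test that s begins with prefix. */
static int has_prefix_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Hypothetical helper: copy the value of the charset parameter of a
 * Content-Type header value into out. Returns 1 on success, or 0 when
 * no charset parameter is found or out is too small.
 */
int mime_charset_param(const char *content_type, char *out, size_t outsz)
{
    const char *p = content_type;

    assert(content_type != NULL && out != NULL);
    if (outsz == 0)
        return 0;

    /* Each parameter follows a ';' separator. */
    while ((p = strchr(p, ';')) != NULL) {
        p++;
        while (*p == ' ' || *p == '\t')
            p++;
        if (has_prefix_nocase(p, "charset=")) {
            size_t n = 0;
            int quoted;

            p += 8;                      /* skip "charset=" */
            quoted = (*p == '"');
            if (quoted)
                p++;
            while (*p != '\0' &&
                   (quoted ? *p != '"'
                           : (*p != ';' && !isspace((unsigned char)*p)))) {
                if (n + 1 >= outsz)
                    return 0;            /* value does not fit */
                out[n++] = *p++;
            }
            out[n] = '\0';
            return n > 0;
        }
    }
    return 0;
}
```

A caller could pass a value such as text/plain; charset=ISO-8859-2 and then look up a converter for the returned name, for example through iconv_open(3C).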

When sending email, however, you need to specify a MIME character set that the recipient's mail user agent (mail client) understands, unless you want to use the default MIME character set provided by the en_US.UTF-8 locale. To switch the character set of outgoing email, in the New Message window, either press Control+y or click the Format menu button and then click Change Char Set. The next available character set name is displayed in the lower-left corner, above the Send button.

If your email message header or message body contains characters that cannot be represented in the specified MIME character set, the system automatically switches the MIME character set to UTF-8, which can represent any character.
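The fallback can be pictured with a small C sketch for one character set only. It assumes the message text is held as UTF-8 and that ISO-8859-1 was requested; the function names are hypothetical and this is not how DtMail is known to implement the check. It relies on the fact that UTF-8 encodes U+0080 through U+00FF as two bytes whose lead byte is 0xC2 or 0xC3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if the UTF-8 string contains only code points in the
 * range U+0000..U+00FF (the ISO-8859-1 repertoire), otherwise 0.
 * Malformed input is treated as not representable.
 */
int utf8_fits_latin1(const char *utf8)
{
    size_t len, i;

    assert(utf8 != NULL);
    len = strlen(utf8);
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)utf8[i];

        if (c < 0x80)
            continue;                    /* 7-bit ASCII */
        if ((c == 0xC2 || c == 0xC3) && i + 1 < len &&
            ((unsigned char)utf8[i + 1] & 0xC0) == 0x80) {
            i++;                         /* two-byte form of U+0080..U+00FF */
            continue;
        }
        return 0;                        /* code point above U+00FF */
    }
    return 1;
}

/* Hypothetical policy: keep ISO-8859-1 when possible, else use UTF-8. */
const char *charset_for_latin1_request(const char *utf8_text)
{
    return utf8_fits_latin1(utf8_text) ? "ISO-8859-1" : "UTF-8";
}
```

With this sketch, text containing only accented Western European letters keeps ISO-8859-1, while text containing the Euro sign (U+20AC), which ISO-8859-1 lacks, is switched to UTF-8.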

If your message contains characters from the 7-bit US-ASCII character set only, your email's default MIME character set is US-ASCII. Any mail user agent can interpret such email messages without any loss of characters or information.
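The default selection rule is simple enough to sketch in C. The function name default_mime_charset is hypothetical, and the sketch assumes the message is available as a NUL-terminated UTF-8 string; any byte with the high bit set means the text is not pure 7-bit US-ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns "US-ASCII" when every byte of the text is 7-bit,
 * otherwise "UTF-8".
 */
const char *default_mime_charset(const char *text)
{
    size_t len, i;

    assert(text != NULL);
    len = strlen(text);
    for (i = 0; i < len; i++) {
        if ((unsigned char)text[i] >= 0x80)
            return "UTF-8";              /* at least one non-ASCII byte */
    }
    return "US-ASCII";
}
```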

If your message contains characters from a mixture of scripts, your email's default MIME character set is UTF-8. Any 8-bit characters of UTF-8 are encoded with Quoted-Printable encoding. For more detail on MIME, registered MIME charsets, and Quoted-Printable encoding, refer to RFC 2045, 2046, 2047, 2048, 2049, 2279, 2152, 2237, 1922, 1557, 1555, and 1489.
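To show what Quoted-Printable encoding does to 8-bit UTF-8 bytes, here is a simplified C sketch. The function name qp_encode is hypothetical. The sketch replaces each non-printable or 8-bit byte, and the equals sign itself, with an equals sign followed by two uppercase hexadecimal digits. It omits parts of RFC 2045 that a real encoder needs, notably the 76-character line limit with soft line breaks and the special handling of trailing white space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified Quoted-Printable encoder for a NUL-terminated string.
 * Returns the number of characters written (excluding the NUL),
 * or (size_t)-1 if out is too small.
 */
size_t qp_encode(const char *in, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t len, i, n = 0;

    assert(in != NULL && out != NULL);
    len = strlen(in);
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        /* Printable ASCII except '=', plus space, tab, and newline. */
        int literal = (c >= 33 && c <= 126 && c != '=') ||
                      c == ' ' || c == '\t' || c == '\n';
        size_t need = literal ? 1 : 3;

        if (n + need >= outsz)
            return (size_t)-1;           /* keep room for the NUL */
        if (literal) {
            out[n++] = (char)c;
        } else {
            out[n++] = '=';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0x0F];
        }
    }
    if (n >= outsz)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

For example, the letter e with acute accent (U+00E9) is the two bytes 0xC3 0xA9 in UTF-8, so this sketch encodes it as =C3=A9.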

Programming Environment

Appropriately internationalized applications should work in the en_US.UTF-8 locale automatically, provided that the application's resource file contains proper FontSet/XmFontList definitions.

For information on internationalized applications, see Creating Worldwide Software: Solaris International Developer's Guide, 2nd edition.

FontSet Used with X Applications

The en_US.UTF-8 locale in the Solaris 8 environment supports fonts for the following character sets.

ISO 8859-1
ISO 8859-2
ISO 8859-4
ISO 8859-5
ISO 8859-7
ISO 8859-9
ISO 8859-15
BIG5
GB 2312-1980
JIS X0201.1976
JIS X0208.1983
KS C 5601.1992 Annex 3
ISO 8859-6 and Unicode based one
ISO 8859-8
TIS 620.2533 based one

Because the Solaris 8 environment supports the CDE desktop environment, each character set has a guaranteed set of fonts.

The following is a list of the Latin-1 fonts that are supported in the Solaris 8 product:

-dt-interface system-medium-r-normal-xxs sans utf-10-100-72-72-p-59-iso8859-1
-dt-interface system-medium-r-normal-xs sans utf-12-120-72-72-p-71-iso8859-1
-dt-interface system-medium-r-normal-s sans utf-14-140-72-72-p-82-iso8859-1
-dt-interface system-medium-r-normal-m sans utf-17-170-72-72-p-97-iso8859-1
-dt-interface system-medium-r-normal-l sans utf-18-180-72-72-p-106-iso8859-1
-dt-interface system-medium-r-normal-xl sans utf-20-200-72-72-p-114-iso8859-1
-dt-interface system-medium-r-normal-xxl sans utf-24-240-72-72-p-137-iso8859-1

For information on CDE common font aliases, including -dt-interface user-* and -dt-application-* aliases, see Common Desktop Environment: Internationalization Programmer's Guide.

In the en_US.UTF-8 locale, utf is also supported as a common font alias. A font set for an application should have a collection of fonts that contains each of the character sets, as in the following example:

fs = XCreateFontSet(display,
  /* Adjacent string literals are concatenated into one base font name list. */
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-1,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-2,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-4,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-5,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-6,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-7,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-8,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-9,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-15,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-big5-1,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-jisx0208.1983-0,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-jisx0201.1976-0,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-ksc5601.1992-3,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-gb2312.1980-0,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-tis620.2533-0,"
  "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-unicode-fontspecific",
  &missing_ptr, &missing_count, &def_string);

Or, put more simply:

fs = XCreateFontSet(display,
				"-dt-interface system-medium-r-normal-*s*utf*",
				 &missing_ptr, &missing_count, &def_string);
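When an application prefers the explicit per-character-set list over the wildcard form, the repetitive base font name list can be generated rather than typed. The following C sketch assumes nothing about Xlib; it only builds the comma-separated string. The names build_base_font_names, utf8_locale_charsets, and UTF8_LOCALE_CHARSET_COUNT are hypothetical, and the character set suffixes are copied from the font set example in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Character set suffixes used by the font set example in this section. */
const char *const utf8_locale_charsets[] = {
    "iso8859-1", "iso8859-2", "iso8859-4", "iso8859-5",
    "iso8859-6", "iso8859-7", "iso8859-8", "iso8859-9",
    "iso8859-15", "big5-1", "jisx0208.1983-0", "jisx0201.1976-0",
    "ksc5601.1992-3", "gb2312.1980-0", "tis620.2533-0",
    "unicode-fontspecific"
};
#define UTF8_LOCALE_CHARSET_COUNT \
    (sizeof utf8_locale_charsets / sizeof utf8_locale_charsets[0])

/*
 * Writes "prefix-charset0,prefix-charset1,..." into buf.
 * Returns the length the full list needs (excluding the NUL), in the
 * style of snprintf; the output is truncated if bufsz is too small.
 */
size_t build_base_font_names(const char *prefix,
                             const char *const *charsets, size_t count,
                             char *buf, size_t bufsz)
{
    size_t used = 0, i;

    assert(prefix != NULL && charsets != NULL);
    if (bufsz > 0)
        buf[0] = '\0';
    for (i = 0; i < count; i++) {
        size_t room = used < bufsz ? bufsz - used : 0;
        int n = snprintf(room ? buf + used : NULL, room, "%s%s-%s",
                         i ? "," : "", prefix, charsets[i]);

        if (n < 0)
            return 0;                    /* formatting error */
        used += (size_t)n;
    }
    return used;
}
```

Called with the prefix -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-* and the array above, the function produces a string that could be passed as the base font name list argument of XCreateFontSet().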

XmFontList Definition as CDE/Motif Applications

As with the FontSet definition, the XmFontList resource definition of an application should also include a font for each of the character sets that the locale supports.

Example 4-1 XmNFontList Definition for the `en_US.UTF-8` Locale

*fontList:\
 -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-1;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-2;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-4;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-5;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-6;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-7;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-8;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-9;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-iso8859-15;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-big5-1;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-jisx0208.1983-0;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-jisx0201.1976-0;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-ksc5601.1992-3;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-gb2312.1980-0;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-tis620.2533-0;\
  -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-unicode-fontspecific:

Or, put more simply:

*XmPushButton.fontList:\
			-dt-interface system-medium-r-normal-*s*utf*:

For more details on the XmFontList and the XmNFontList, refer to the XmFontList(3X) man page, OSF/Motif Programmer's Guide, and the resource section of each Motif widget in the OSF/Motif Programmer's Reference Manual.
