This chapter provides an overview of UTF-8 locale support.
Unicode is the universal character encoding standard used for representation of text for computer processing. Unicode is fully compatible with the international standards ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001, and contains all the same characters and code points as ISO/IEC 10646. The Unicode Standard provides additional information about the characters and their use. Any implementation that conforms to Unicode also conforms to ISO/IEC 10646.
Unicode provides a consistent way of encoding multilingual plain text and facilitates exchanging international text files. Computer users who deal with multilingual text, business people, linguists, researchers, scientists, and others find that the Unicode Standard greatly simplifies their work. Mathematicians and technicians who regularly use mathematical symbols and other technical characters also find the Unicode Standard valuable.
The maximum possible number of code points Unicode can support is 1,114,112 through seventeen 16-bit planes. Each plane can support 65,536 different code points.
Among the more than one million code points that Unicode can support, version 4.0 currently defines 96,382 characters in planes 0, 1, 2, and 14. Planes 15 and 16 are for private use characters, also known as user-defined characters. Planes 15 and 16 together can support a total of 131,068 user-defined characters.
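The figures above can be checked with a little arithmetic. The following C sketch is illustrative only: the function names are hypothetical, and the explanation of 131,068 (each plane's last two code points, xxFFFE and xxFFFF, are noncharacters and are therefore excluded) is an assumption about how that figure was derived.

```c
#include <stdint.h>

/* Figures taken from the text: seventeen planes of 65,536 code points. */
enum { UNICODE_PLANES = 17 };
#define CODE_POINTS_PER_PLANE 65536UL

/* Total number of code points across all planes. */
unsigned long unicode_total_code_points(void)
{
    return UNICODE_PLANES * CODE_POINTS_PER_PLANE;
}

/* Plane number (0..16) of a code point, or -1 if it is out of range. */
int unicode_plane_of(uint32_t cp)
{
    return cp <= 0x10FFFFUL ? (int)(cp >> 16) : -1;
}

/*
 * Private-use capacity of planes 15 and 16, assuming the last two
 * code points of each plane (noncharacters) are not usable.
 */
unsigned long unicode_private_plane_capacity(void)
{
    return 2UL * (CODE_POINTS_PER_PLANE - 2UL);
}
```

Under these assumptions, unicode_total_code_points() yields 1,114,112 and unicode_private_plane_capacity() yields 131,068, matching the numbers quoted in the text.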
UTF-8 is a variable-length encoding form of Unicode that preserves ASCII character code values transparently. This form is used as file code in Oracle Solaris Unicode locales.
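To make "variable-length" and "preserves ASCII" concrete, here is a minimal C sketch of a UTF-8 encoder for a single code point. The function name utf8_encode is hypothetical and this is not the Oracle Solaris implementation; it only follows the standard UTF-8 bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written (1..4), or 0 if cp is a
 * surrogate or lies above U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII is stored unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+0041 (A) encodes as the single byte 0x41, while U+00C4 (Ä) encodes as the two bytes 0xC3 0x84.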
UTF-16 is a 16-bit encoding form of Unicode. In UTF-16, characters up to 65,535 are encoded as single 16-bit values. Characters mapped above 65,535 to 1,114,111 are encoded as pairs of 16-bit values (surrogates).
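The surrogate mechanism can be sketched in C as follows. The function name utf16_encode is hypothetical; the code simply applies the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits across a high and a low surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-16 code units into out (room for 2).
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * itself a surrogate value or lies above 1,114,111 (0x10FFFF).
 */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                         /* reserved for surrogates */
    if (cp <= 0xFFFF) {                   /* up to 65,535: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                 /* above 65,535: a surrogate pair */
        uint32_t v = cp - 0x10000;        /* 20 significant bits remain */
        out[0] = (uint16_t)(0xD800 + (v >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 + (v & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

For example, U+1D11E (musical symbol G clef) becomes the pair 0xD834 0xDD1E.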
UTF-32 is a fixed-length, 21-bit encoding form of Unicode usually represented in a 32-bit container or data type. This form is used as the process code (wide-character code) in Oracle Solaris Unicode locales.
For more details on the Unicode Standard and ISO/IEC 10646 and their various representative forms, refer to the following sources:
The Unicode Standard, Version 4.0 from the Unicode Consortium
ISO/IEC 10646-1:2000, Information Technology-Universal Multiple-Octet Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane
ISO/IEC 10646-2: Information Technology-Universal Multiple-Octet Character Set (UCS) - Part 2: Secondary Multilingual Plane for Scripts and Symbols, Supplementary Plane for CJK Ideographs, Special Purpose Plane
The Unicode Consortium web site at http://www.unicode.org/.
The Unicode/UTF-8 locales support Unicode 4.0. The en_US.UTF-8 locale provides multiscript processing support by using UTF-8 as its codeset. This locale handles processing of input and output text in multiple scripts, and was the first locale with this capability in the Oracle Solaris operating system. The capabilities of other UTF-8 locales are similar to those of en_US.UTF-8. The discussion of en_US.UTF-8 that follows applies equally to these locales.
UTF-8 is a file-system safe Universal Character Set Transformation Format of Unicode/ISO/IEC 10646-1, formulated by the X/Open-Uniforum Joint Internationalization Working Group (XoJIG) in 1992 and approved by ISO and IEC as Amendment 2 to ISO/IEC 10646-1:1993 in 1996. This standard has been adopted by the Unicode Consortium, the International Organization for Standardization, and the International Electrotechnical Commission as a part of Unicode 4.0 and ISO/IEC 10646-1.
Unicode locales in the Oracle Solaris environment support the processing of every code point value that is defined in Unicode 4.0 and ISO/IEC 10646-1 and 10646-2. Supported scripts include pan-European and Asian scripts and also complex text layout scripts for the Arabic, Hebrew, Indic, and Thai languages.
Some Unicode locales, notably the Asian locales, include more Kanji or Hanzi glyphs.
ISO 8859-9 (Turkish)
KSC 5601–1992 Annex 3 (Korean)
HKSCS (Traditional Chinese, Hong Kong)
IS 13194:1991, also known as ISCII (Hindi, including many more presentation-form character glyphs)
If you try to view characters for which the en_US.UTF-8 locale does not have corresponding glyphs, the locale displays a no-glyph glyph instead, as shown in the following illustration:
The locale is selectable at installation time and may be designated as the system default locale.
The same level of en_US.UTF-8 locale support is provided for both 64-bit and 32-bit Oracle Solaris systems.
Motif and CDE desktop applications and libraries support the en_US.UTF-8 locale. However, XView™ and OLIT libraries do not support the en_US.UTF-8 locale.
CDE provides the ability to enter localized input for an internationalized application using Xm Toolkit. The XmText[Field] widgets are enabled to interface with input methods from each locale. Input methods are internationalized because some language environments write their text from right-to-left, top-to-bottom, and so forth. Within the same application, you can use different input methods that apply several fonts.
The preedit area displays the string that is being pre-edited. Writing text can be done in four modes:
In OffTheSpot mode, the preedit area is located just below the main window area, to the right of the status area. In OverTheSpot mode, the preedit area is at the cursor point. In Root mode, the preedit and status areas are separate from the client's window.
For more details, refer to the XmNpreeditType resource description in the VendorShell(3X) man page.
Oracle Solaris has adopted the Internet Intranet Input Method Framework (IIIMF) to support input in multiple languages and scripts. The IIIM server starts per user in all the UTF-8 locales and Asian locales. It serves both IIIM and XIM (X input method) clients.
In European UTF-8 locales, Compose key input or dead key input is also available. For more information, see Appendix A, Compose and Dead Key Input.
Various input method engines (IMEs) are available, such as Chinese, Japanese, Korean, Thai, Indic, and Unicode (HEX/OCTAL). IIIMF also supports various EMEA keyboard layout emulations, such as French, Russian, or Arabic. You can find the existing IMEs through the Input Method Preference Editor (iiim-properties).
Asian IMEs, which include Chinese, Japanese, Korean, Thai, and Indic, are available only when the corresponding locale support is installed.
The English/European IME (Latin input) mode enables input of some Latin characters with diacritical marks, for example, á, è, î, õ and ü, without using the Compose key. For example, " + A generates Ä.
The Table Lookup IME, which was used to input characters from Unicode character tables, has been removed. Use the Character Map application (charmap) instead.
To activate and deactivate the input method, press the IM trigger key (for example, Shift_L + Alt_L). The currently selected IME is activated. The default IM trigger key depends on the locale in which you first log in to the desktop. You can confirm the current IM trigger key by checking the Trigger Keys tab of iiim-properties.
The IM status window shows the current input mode and selected IME. By default, the IM status window is located at the bottom left corner of each application in the CDE environment. In the JDS environment, the IM status is shown in the Input Method Switcher application, iiim-panel, which resides in the Notification area of the GNOME panel.
To switch the IME, click the left mouse button on the IM status window or iiim-panel. The language selection menu appears. Select the language that you want to switch to.
When the IM status window is used, the language selection menu is not available for GNOME (GTK-based) applications. You can switch the IME through a non-GTK application if the option, The language is applied to all applications, is enabled in iiim-properties. Otherwise, use the iiim-panel.
For more information, see iiim-properties online help.
The Input Method Preference Editor, iiim-properties, customizes various IIIM behaviors. For example, you can change the IM trigger key, display the IM status and language selection menu, or add and delete IMEs.
You can start iiim-properties from the command line, iiim-panel or the desktop menu (Preferences menu in JDS and workspace menu->Tools in CDE).
For more customization options and detailed information, see iiim-properties online help.
IIIMF has been upgraded from revision 10 to revision 12 as of the Oracle Solaris 10 6/06 release, and each IME has been upgraded correspondingly. This document describes input methods based on IIIMF revision 12.
To upgrade IIIMF to revision 12 on an earlier Oracle Solaris 10 system, apply the following patches.
120410-xx (SPARC) / 120411-xx (x86) - IIIMF rev.12 patch
121675-xx (SPARC) / 121676-xx (x86) - Japanese ATOK17 patch
121677-xx (SPARC) / 121678-xx (x86) - Japanese Wnn8 patch
120412-xx (SPARC) / 120413-xx (x86) - S-Chinese patch
120414-xx (SPARC) / 120415-xx (x86) - T-Chinese, Korean, Thai, Indic patch
IIIMF revision 10 is no longer supported in any Oracle Solaris 10 release.
This section describes locale environment variables, TTY environment setup, 32-bit and 64-bit STREAMS modules, and terminal support.
system% locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
To use the en_US.UTF-8 locale desktop environment, choose the locale first. In a TTY environment, choose the locale first by setting the LANG environment variable to en_US.UTF-8, as in the following C-shell example:
system% setenv LANG en_US.UTF-8
Make sure that the LC_ALL, LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_NUMERIC, LC_MONETARY, and LC_TIME categories are not set, or are set to en_US.UTF-8. If any of these categories is set, it overrides the lower-priority LANG environment variable. See the setlocale(3C) man page for more details about the hierarchy of environment variables.
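The hierarchy can be sketched as a small lookup function. This is an illustrative C sketch, not the C library's actual implementation: the name effective_locale is hypothetical, and falling back to the "C" locale when nothing is set is an assumption based on common POSIX behavior.

```c
#include <stddef.h>

/* An unset or empty variable does not take part in the lookup. */
static int is_set(const char *v)
{
    return v != NULL && v[0] != '\0';
}

/*
 * Return the locale name that applies to one category, given the values
 * of LC_ALL, the category's own variable (for example LC_CTYPE), and LANG.
 * Order of precedence: LC_ALL, then the category variable, then LANG.
 */
const char *effective_locale(const char *lc_all,
                             const char *lc_category,
                             const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_category))
        return lc_category;
    if (is_set(lang))
        return lang;
    return "C";   /* assumed default when nothing is set */
}
```

A caller might pass getenv("LC_ALL"), getenv("LC_CTYPE"), and getenv("LANG") to see which value governs character classification. A stray LC_CTYPE left over from another session would win over LANG=en_US.UTF-8, which is why the categories should be unset or set consistently.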
You can also start the en_US.UTF-8 environment from the CDE desktop. At the CDE login screen's Options -> Language menu, choose en_US.UTF-8.
For more information on STREAMS modules and streams in general, see the STREAMS Programming Guide.
Table 5–1 STREAMS Modules Supported by en_US.UTF-8
32-bit STREAMS module
Code conversion STREAMS module between UTF-8 and ISO8859-1 (Western European)
Code conversion STREAMS module between UTF-8 and ISO8859-2 (Eastern European)
Code conversion STREAMS module between UTF-8 and KOI8-R (Cyrillic)
Starting with the Oracle Solaris 10 release, the 32-bit kernel is no longer supported for the SPARC platform. Table 5–1 applies only to the 32-bit kernel for the x86 platform. For more details, refer to the Release Notes.
The following table lists the 64-bit STREAMS modules supported by en_US.UTF-8.
Table 5–2 64-bit STREAMS Modules Supported by en_US.UTF-8
64-bit STREAMS Module
Code conversion STREAMS module between UTF-8 and ISO8859-1 (Western European)
Code conversion STREAMS module between UTF-8 and ISO8859-2 (Eastern European)
Code conversion STREAMS module between UTF-8 and KOI8-R (Cyrillic)
system# isainfo -v
Determine whether your system has already loaded the STREAMS module.
system# modinfo | grep modulename
If the STREAMS module, such as u8lat1, is already loaded, the output looks as follows:
system# modinfo | grep u8lat1
89 ff798000 4b13 18 1 u8lat1 (UTF-8 <--> ISO 8859-1 module)
If the module has not already been loaded, load it using the modload(1M) command.
As root, verify that the kernel module is loaded.
For example, to verify that the u8lat1 module is loaded, you would type:
system# modinfo | grep u8lat1
89 ff798000 4b13 18 1 u8lat1 (UTF-8 <--> ISO 8859-1 module)
Use the modunload(1M) command to unload the kernel module.
For example, to unload the u8lat1 module, you would type:
system# modunload -i 89
system% cat > /tmp/mystreams
ttcompat
ldterm
u8lat1
ptem
^D
system% strchg -f /tmp/mystreams
Be sure that you are either root or the owner of the device when you use strchg(1).
Run the strconf command to examine the current configuration.
system% strconf
ttcompat
ldterm
u8lat1
ptem
pts
system%
Run the strchg command to reset the original configuration.
system% cat > /tmp/orgstreams
ttcompat
ldterm
ptem
^D
system% strchg -f /tmp/orgstreams
Unlike older releases of the Oracle Solaris operating system, the dtterm and xterm terminal emulators, and any other terminals that support input and output of the UTF-8 codeset, do not need any additional STREAMS modules in their streams. The ldterm module is now codeset independent and supports Unicode/UTF-8 if you set up the terminal environment with the stty(1) utility.
To set up the proper terminal environment for the Unicode locales, use the stty(1) utility.
system% /bin/stty defeucw
To query the current settings, use the -a option of the stty utility, as shown below:
system% /bin/stty -a
Because /usr/ucb/stty is not internationalized, use /bin/stty instead.
head <-> ttcompat <-> ldterm <-> u8lat1 <-> TTY
This configuration is only for terminals that support Latin-1. For Latin-2 terminals, replace the STREAMS module u8lat1 with u8lat2. For KOI8-R terminals, replace the module with u8koi8.
Make sure you already have the STREAMS module loaded into the kernel.
setenv LANG en_US.UTF-8
if ($?USER != 0 && $?prompt != 0) then
    cat >! /tmp/mystreams$$ << _EOF
ttcompat
ldterm
u8lat1
ptem
_EOF
    /bin/strchg -f /tmp/mystreams$$
    /bin/rm -f /tmp/mystreams$$
    /bin/stty cs8 -istrip defeucw
endif
With these lines in your .cshrc file, you do not have to type all of the commands each time you use the STREAMS module. Note that the second _EOF must start in the first column of the file.
In the current Oracle Solaris environment, the utility geniconvtbl enables user-defined code conversions. The user-defined code conversions created with the geniconvtbl utility can be used with both iconv(1) and iconv(3). For more details about this utility, refer to the geniconvtbl(1) and geniconvtbl(4) man pages.
The available fromcode and tocode names that can be applied to iconv, iconv_open, and sdtconvtool are described by the following man pages:
UCS-2, UCS-4, UTF-16, and UTF-32 are all Unicode/ISO/IEC 10646 representation forms that recognize the Byte Order Mark (BOM) character defined in the Unicode 4.0 and ISO/IEC 10646-1:2000 standards if the character appears at the beginning of the character stream. Other forms, like UCS-2BE, UCS-4BE, UTF-16BE, and UTF-32BE, are Unicode/ISO/IEC 10646 representation forms with a fixed byte order: they do not recognize the BOM character and assume big-endian byte ordering. Representation forms like UCS-2LE, UCS-4LE, UTF-16LE, and UTF-32LE, on the other hand, assume little-endian byte ordering. These forms also do not recognize the BOM character.
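As an illustration of what "recognizing the BOM" involves, the following C sketch inspects the first bytes of a stream and reports the byte order they signal. The type and function names are hypothetical and this is not how the Oracle Solaris iconv modules are written. The caller is assumed to know already whether the stream uses 16-bit or 32-bit units, because the bytes FF FE 00 00 are otherwise ambiguous.

```c
#include <stddef.h>

typedef enum {
    ORDER_UNKNOWN,        /* no BOM at the start of the stream */
    ORDER_BIG_ENDIAN,
    ORDER_LITTLE_ENDIAN
} byte_order;

/* Byte order signalled by a BOM (U+FEFF) at the start of a UTF-16 or UCS-2 stream. */
byte_order utf16_bom_order(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return ORDER_BIG_ENDIAN;
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return ORDER_LITTLE_ENDIAN;
    }
    return ORDER_UNKNOWN;
}

/* Byte order signalled by a BOM (U+FEFF) at the start of a UTF-32 or UCS-4 stream. */
byte_order utf32_bom_order(const unsigned char *buf, size_t len)
{
    if (len >= 4) {
        if (buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
            return ORDER_BIG_ENDIAN;
        if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
            return ORDER_LITTLE_ENDIAN;
    }
    return ORDER_UNKNOWN;
}
```

A converter for a name such as UTF-16 could call utf16_bom_order on the start of its input and consume the BOM when one is found, whereas a converter for UTF-16BE or UTF-16LE would skip this step and use its fixed byte order.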
For associated scripts and languages of ISO8859-* and KOI8-*, see http://czyborra.com/charsets/iso8869.html.
The Oracle Solaris desktop environment uses fontconfig for font configuration. For more information about how to configure fonts in Oracle Solaris, refer to Chapter 4, Configuring Fonts, in Java Desktop System Release 3 Administration Guide.
US-ASCII (7-bit US ASCII)
UTF-8 (UCS Transformation Format 8 bit)
UTF-7 (UCS Transformation Format 7 bit)
ISO-2022-JP and EUC-JP (Japanese)
ISO-2022-KR and EUC-KR (Korean)
ISO-2022-CN (Simplified Chinese)
Shift_JIS (Japanese in Shift JIS)
GB2312 (Simplified Chinese in EUC)
UTF-16 (UCS Transformation Format 16 bit)
UTF-16BE (UTF-16 Big-Endian)
UTF-16LE (UTF-16 Little-Endian)
Big5 (Traditional Chinese)
UTF-32 (UCS Transformation Format 32 bit)
UTF-32BE (UTF-32 Big-Endian)
UTF-32LE (UTF-32 Little-Endian)
This support enables users to view virtually any kind of email encoded in various character sets from any region of the world in a single instance of DtMail. DtMail decodes received email by looking at the MIME charset and content transfer encoding provided with the email. Windows-125x MIME charsets are supported.
To send email, you need to specify a MIME charset that is understood by the recipient's mail user agent (mail client), or you can use the default MIME charset provided by the en_US.UTF-8 locale. To switch the character set of an outgoing email, in the New Message window press Control-Y, or click the Format menu button and then click the Change Char Set button. The next available character set name is displayed in the bottom left corner, above the Send button.
If your email message header or message body contains characters that cannot be represented by the specified MIME charset, the system automatically switches the charset to UTF-8, which can represent any character.
If your message contains characters from the 7-bit US-ASCII character set only, the default MIME charset of your email is US-ASCII. Any mail user agent can interpret such email messages without loss of characters or information.
If your message contains characters from a mixture of scripts, the default MIME charset is UTF-8. Any 8-bit characters of UTF-8 are encoded with Quoted-Printable encoding. For more details on MIME, registered MIME charsets, and Quoted-Printable encoding, refer to RFCs 2045, 2046, 2047, 2048, 2049, 2279, 2152, 2237, 1922, 1557, 1555, and 1489.
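To show what Quoted-Printable does to the 8-bit bytes of UTF-8 text, here is a deliberately simplified C sketch. The function name qp_encode is hypothetical and the code is not DtMail's implementation. It only replaces each byte outside printable ASCII, and the = character itself, with =XX in uppercase hexadecimal. It leaves out other rules of the encoding, such as soft line breaks for long lines and the special handling of line endings and trailing white space.

```c
#include <stddef.h>

/*
 * Simplified Quoted-Printable encoder.
 * Writes a NUL-terminated string to out (capacity cap bytes).
 * Returns the encoded length, or -1 if out is too small.
 */
long qp_encode(const unsigned char *in, size_t n, char *out, size_t cap)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        /* Printable ASCII other than '=' is copied as is. */
        int literal = (c >= 32 && c <= 126 && c != '=');
        size_t need = literal ? 1 : 3;

        if (o + need >= cap)
            return -1;                    /* keep room for the NUL */
        if (literal) {
            out[o++] = (char)c;
        } else {
            out[o++] = '=';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        }
    }
    if (o >= cap)
        return -1;
    out[o] = '\0';
    return (long)o;
}
```

For example, the UTF-8 bytes 0xC3 0x84 for Ä are emitted as =C3=84, while plain ASCII letters pass through unchanged.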
For information on internationalized applications, see Creating Worldwide Software: Oracle Solaris International Developer's Guide, 2nd edition.
For information about the FontSet used with X applications, please see Unicode Locale: en_US.UTF-8 Support.
Each character set has an associated set of fonts in the Oracle Solaris desktop environment.
The following is a list of the Latin-1 fonts that are supported in the current Oracle Solaris environment:
-dt-interface system-medium-r-normal-xxs sans utf-10-100-72-72-p-59-iso8859-1
-dt-interface system-medium-r-normal-xs sans utf-12-120-72-72-p-71-iso8859-1
-dt-interface system-medium-r-normal-s sans utf-14-140-72-72-p-82-iso8859-1
-dt-interface system-medium-r-normal-m sans utf-17-170-72-72-p-97-iso8859-1
-dt-interface system-medium-r-normal-l sans utf-18-180-72-72-p-106-iso8859-1
-dt-interface system-medium-r-normal-xl sans utf-20-200-72-72-p-114-iso8859-1
-dt-interface system-medium-r-normal-xxl sans utf-24-240-72-72-p-137-iso8859-1
For information on CDE common font aliases, including -dt-interface user-* and -dt-application-* aliases, see Common Desktop Environment: Internationalization Programmer's Guide.
In the en_US.UTF-8 locale, utf is also included in the locale's common font aliases as an additional attribute in the style field of the X logical font description name. Therefore, to have a proper set of fonts, the additional style has to be included in the font set creation as in the following example:
fs = XCreateFontSet(display, "-dt-interface system-medium-r-normal-s*utf*", &missing_ptr, &missing_count, &def_string);
As with FontSet definition, the XmFontList resource definition of an application should also include the additional style attribute supported by the locale.
*fontList:\
    -dt-interface system-medium-r-normal-s*utf*: