This chapter introduces the new features and the key concepts of Oracle Solaris internationalization and localization. The chapter covers the following topics.
The current Oracle Solaris release includes a number of new features, including Unicode 4.0 support for the UTF-8 locales, enhanced keyboard support, and several improvements to the mp print filter.
The Oracle Solaris internationalization architecture eases the development, the deployment, and the management of applications and language services around the world. A single multilingual product provides support for 55 different languages and 345 locales. In addition, support is available for the complex text layout that is required for Thai and Hindi scripts. Bidirectional text capability is also supported for languages such as Arabic and Hebrew.
Input methods, character sets, codeset conversion, and other language-related features are supported for a number of different Oracle Solaris locales. You can deploy applications in multiple language environments by following standard APIs. You can also customize language attributes, change converter tables, or add a new input method editor in the Oracle Solaris environment.
The Oracle Solaris 10 globalization framework enables you to follow a common reference implementation to enhance the compatibility and the interoperability of global applications. The codeset independent approach to globalization enables you to operate in both native language and Unicode locales. The Oracle Solaris framework provides the power to scale across platforms. A rich set of data converters ensures interoperability between various encodings and different third-party platforms.
The Oracle Solaris platform also enables multinational corporations to scale their server administration worldwide. Unlike competing platforms, the Oracle Solaris platform uses a service-based approach to administration of language services. Server administrators can enable language services remotely across a worldwide network, regardless of the client system. This client-independent approach enables system upgrades without changing client applications. For example, a user does not have to change a local client application in order to read email in Arabic sent from an Internet cafe in Paris.
The following new features are available in the current Oracle Solaris release. More information about each feature can be found in Appendix B, Language Support Features and Enhancements.
Unicode Version 3.2 and 4.0 Support
Unicode Version 4.0 introduces 1226 new characters over Unicode Version 3.2. This version also includes both normative changes and informative changes as described in Unicode Standard 4.0 (ISBN 0-321-18578-1).
Unicode 3.2 defines stricter UTF-8 byte sequences, known as the "UTF-8 Corrigendum".
Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
U+0000..U+007F | 00..7F | | | |
U+0080..U+07FF | C2..DF | 80..BF | | |
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..9F | 80..BF | |
U+D800..U+DFFF | ill-formed | | | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
These sequences exclude the surrogate code points between U+D800 and U+DFFF. The sequences also prohibit any other illegal byte values. To comply with the new definition, Unicode locale methods and the UTF-8 iconv modules are enhanced to detect the newly defined invalid UTF-8 byte sequences. For more information, see Unicode Version 4.0 Support.
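The byte ranges in the table can be checked mechanically. The following is a minimal sketch in Python, not the code used by the Oracle Solaris locale methods or iconv modules. The names `WELL_FORMED` and `is_well_formed_utf8` are illustrative assumptions.

```python
# Each entry: (range of the first byte, allowed range for every following byte),
# transcribed from the table of well-formed UTF-8 byte sequences.
WELL_FORMED = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),  # stops before the surrogates
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of well-formed UTF-8 sequences."""
    i = 0
    while i < len(data):
        for (low, high), tail in WELL_FORMED:
            if low <= data[i] <= high:
                break
        else:
            return False  # 80..C1 and F5..FF can never start a sequence
        following = data[i + 1 : i + 1 + len(tail)]
        if len(following) != len(tail):
            return False  # sequence is truncated
        if any(not (a <= b <= z) for b, (a, z) in zip(following, tail)):
            return False  # a following byte is outside its allowed range
        i += 1 + len(tail)
    return True
```

For example, `is_well_formed_utf8(b"\xed\xa0\x80")` returns `False` because those bytes would encode the surrogate code point U+D800.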
Auto encoding finder
The auto encoding finder is a utility for global character handling. Through a general-purpose interface, the auto encoding finder provides an easy way to detect the encoding of a particular file or string. Encoding detection simplifies access to various language character encodings. For example, the utility simplifies the display of web pages that do not specify encoding information. Search engines, knowledge databases, and machine translation tools might also need to detect the encoding of the language data being accessed. The Auto Encoding Finder tool simplifies this process.
For more information, see the auto_ef(1) or libauto_ef(3LIB) man pages.
Locale administrator
The locale administrator enables you to query and configure the locales for an Oracle Solaris operating system through a command-line interface. Using the localeadm(1M) tool, you can display information about locale packages that are installed on the system or that reside on a particular device or directory. You can add and remove locales on the current system on a per-region basis. For more information, see Software Support for Localization.
Locale Creator
Locale Creator is a command-line and graphical user interface tool that enables users to create and install Oracle Solaris locales. Using Locale Creator, users can create installable Oracle Solaris packages containing customized locale data for a specific locale. After the created package has been installed, the user has a fully working locale on the system. For more information, see the following:
localectr command at /usr/bin/localectr -h
localectr(1M) man page
iconv Code Conversions
Various new iconv code conversions between single-byte PC and Windows code pages and various Unicode forms have been added. For more information, see the iconv_en_US.UTF-8(5) man page.
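As an illustration of what such a conversion does, the following sketch converts bytes from a Windows code page to Unicode forms. It uses Python's built-in codecs rather than the Oracle Solaris iconv modules, and the helper name `convert` is an assumption made for this example.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Decode data from the source encoding, then encode it in the target."""
    return data.decode(source).encode(target)


# 0x80 is the euro sign in Windows code page 1252.
price_cp1252 = b"\x80 100"
price_utf8 = convert(price_cp1252, "cp1252", "utf-8")
price_utf16 = convert(price_cp1252, "cp1252", "utf-16-be")
```

The single-byte euro sign becomes three bytes in UTF-8 and two bytes in UTF-16, which is why a converted string can differ in length from its source.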
Oracle Solaris Unicode Locales
New Unicode locales are added to Oracle Solaris. The new locales are available on system login. In addition, all EMEA, Central American, and South American locales have been migrated to the Common Locale Data Repository (CLDR). For details, see Supported Locales. For information on CLDR, see Common Locale Data Repository (CLDR).
Input Method Support
A new Internet Intranet Input Method Framework (IIIMF), new Language Engines, and EMEA Keyboard Layout Emulation Support have been added. For more information, see IIIMF and Language Engines, Input Method Features, and Appendix B, Language Support Features and Enhancements.
Keyboard Layouts Support
New keyboard layouts have been integrated into the current version of Oracle Solaris. For more information, see Keyboard Support in the Oracle Solaris Environment.
setxkbmap
A new feature for switching keyboard layouts has been integrated into Oracle Solaris and is available for the Xorg server. The setxkbmap command switches the keyboard layout while the Xorg server is running. The command maps the keyboard using the layout determined by the options specified on the command line. For more information, see the setxkbmap man page.
Internationalization and localization are different procedures. Internationalization is the process of making software portable between languages or regions, while localization is the process of adapting software for specific languages or regions. Internationalized software can be developed using interfaces that modify program behavior at runtime in accordance with specific cultural requirements. Localization involves establishing online information to support a language or region, called a locale.
Unlike software that must be completely rewritten before it can work with different native languages and customs, internationalized software does not require rewriting. The internationalized software can be ported from one locale to another without change. The Oracle Solaris system is internationalized, providing the infrastructure and interfaces you need to create internationalized software.
An internationalized application's executable image is portable between languages and regions. To internationalize software:
Use the interfaces described in this book to create software with an environment that can be modified dynamically without recompiling.
Divide software into executable code and all the messages that the user might see. Keep the message strings in a message catalog.
Message strings are translated for a language or region. A locale includes the message strings and methods to specify sorting.
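The separation of code from message strings can be sketched as follows. This is a simplified illustration in Python, not the Oracle Solaris message catalog interface. The `CATALOGS` dictionary, its translations, and the `translate` function are hypothetical; a real application would load catalogs from per-locale files.

```python
# Hypothetical in-memory catalogs keyed by language code. The executable
# code below never changes when a new language is added.
CATALOGS = {
    "de": {"File not found": "Datei nicht gefunden"},
    "fr": {"File not found": "Fichier introuvable"},
}


def translate(message: str, language: str) -> str:
    """Return the translated message, or the original if none is available."""
    return CATALOGS.get(language, {}).get(message, message)


print(translate("File not found", "de"))
print(translate("File not found", "C"))  # no catalog: falls back to the original
```

Adding a language in this model means adding a catalog entry only, which mirrors the idea that localization does not require changes to the program.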
To use a localized version of a product, the user sets certain environment variables. The product then displays messages that are translated into the language of the locale. Date, time, currency, and other information is formatted and displayed according to locale-specific conventions. Message translations and online help contents are provided throughout different layers, as illustrated in the following diagram.
The OS (operating system) locale layer provides the basic locale database and functions that are plugged into the OS system interface at the application's runtime. Applications access these OS locale modules through standard APIs.
The X11 locale layer provides the interface to the X input method and X output method to X11 applications for local text input and display. Fonts enable applications to display characters from various languages.
CDE/Motif is built on top of the X11 window system. Hence, CDE/Motif can utilize the X11 locale capability through X11 APIs. Oracle Solaris localizations have various locale-specific configurations for CDE applications in order to make the desktop functional within the target locale. Message translations and online help contents are provided throughout different layers.
A key concept for application programs is that of a program's locale. The locale is an explicit model and definition of a native-language environment. The notion of a locale is explicitly defined and included in the library definitions of the ANSI C Language standard.
A locale consists of a number of categories for which country-dependent formatting or other specifications exist. A program's locale defines its code sets, date and time formatting conventions, monetary conventions, decimal formatting conventions, and collation (sort) order.
A locale can be composed of a base language, country (territory) of use, and an optional codeset. Codeset is usually assumed. For example, German is de, an abbreviation for Deutsch, while Swiss German is de_CH, CH being an abbreviation for Confoederatio Helvetica. This convention allows for specific differences by country, such as currency unit notation.
More than one locale can be associated with a particular language, which allows for regional differences. For example, an English-speaking user in the United States can select the en_US locale (English for the United States), while an English-speaking user in Great Britain can select en_GB (English for Great Britain).
Generally, the locale name is specified by the LANG environment variable. Locale categories are subordinate to LANG but can be set separately, in which case they override LANG. If the LC_ALL environment variable is set, it overrides LANG and all the separate locale categories.
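The precedence among these variables can be expressed as a short lookup. The following sketch models the rule in Python under the assumption that an unset or empty variable is ignored. The function name `effective_locale` is illustrative, and the authoritative description is in the setlocale(3C) man page.

```python
def effective_locale(category: str, env: dict) -> str:
    """Resolve the locale for one category, for example "LC_TIME"."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # no locale variable is set, so the C locale applies


env = {"LANG": "de_DE.UTF-8", "LC_TIME": "en_GB.UTF-8"}
print(effective_locale("LC_TIME", env))     # the category overrides LANG
print(effective_locale("LC_NUMERIC", env))  # falls back to LANG
```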
The locale naming convention is:
language[_territory][.codeset][@modifier]
where a two-letter language code is from ISO 639, a two-letter territory code is from ISO 3166, codeset is the name of the codeset that is being used in the locale, and modifier is the name of the characteristics that differentiate the locale from the locale without the modifier.
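A locale name that follows this convention can be split into its parts with a regular expression. The following Python sketch is an illustration only. The pattern `LOCALE_NAME` and the function `parse_locale_name` are assumptions, and the pattern is deliberately permissive so that names such as C and POSIX are accepted as well.

```python
import re

# language[_territory][.codeset][@modifier]
LOCALE_NAME = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale_name(name: str) -> dict:
    """Split a locale name into its parts; missing parts are None."""
    match = LOCALE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a locale name: {name!r}")
    return match.groupdict()


print(parse_locale_name("de_DE.ISO8859-15@euro"))
```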
All Oracle Solaris product locales preserve the Portable Character Set characters with US-ASCII code values.
For more information on the portable character set, refer to “X/Open CAE Specification: System Interface Definitions, Issue 5” (ISBN 1-85912-186-1).
A single locale can have more than one locale name. For example, POSIX is the same locale as C.
The C locale, also known as the POSIX locale, is the POSIX system default locale for all POSIX-compliant systems. The Oracle Solaris operating system is a POSIX system. The Single UNIX Specification, Version 3, defines the C locale. Register to read and download the specification at: http://www.unix.org/version3/online.html.
You can specify that your internationalized programs run in the C locale, in one of two ways:
Unset all locale environment variables.
system% unsetenv LC_ALL LANG LC_CTYPE LC_COLLATE LC_NUMERIC \
LC_TIME LC_MONETARY LC_MESSAGES
Unsets all locale environment variables. Runs the application in the C locale.
Explicitly set the locale to C or POSIX.
system% setenv LC_ALL C
system% setenv LANG C
Some applications check the LANG environment variable without actually calling setlocale(3C) to reference the current locale. In this case, setenv explicitly sets the C locale by specifying the LC_ALL and LANG locale environment variables. For the precedence relationship among locale environment variables, see the setlocale(3C) man page.
To check the current locale settings in a terminal environment, run the locale(1) command.
system% locale
A full Oracle Solaris locale has all the listed functions and the localized system messages in the relevant language. Partial locales have no localized messages installed. All locales in the Oracle Solaris environment are capable of displaying localized messages, provided that localized messages for the relevant language are installed. For example, the following locales can be either partial or full locales:
de_DE.ISO8859-1
de_DE.ISO8859-15
de_DE.UTF-8
de_AT.ISO8859-1
de_AT.ISO8859-15
de_CH.ISO8859-1
When the German message translations are installed from the Oracle Solaris DVD, all of the above locales become full locales because they have access to a fully translated desktop. The Oracle Solaris DVD contains message translations for the following languages and locales:
German
French
Spanish
Brazilian Portuguese
Italian
Japanese
Korean
Simplified Chinese locale
Traditional Chinese locale
All partial and full locales as well as message translations are available on the Oracle Solaris DVD.
Different cultures often use different conventions to format numbers, to write the date and time, to delimit words and phrases, or to quote written and spoken material. A locale determines how the following operations, files, formats, and expressions are handled for different regions:
Encoding and processing of text data
Language identification and encoding of resource files
Rendering and layout of text strings
Interchange of text between clients
Input method selection to meet the codeset and text processing requirements of the chosen script
Font and icon files that are culturally specific
Actions and file types
User Interface Definition (UID) files
Date and time formats
Numeric formats
Monetary formats
Collation order
Regular expression handling specific to the locale
Format for informative and diagnostic messages and interactive responses
The Oracle Solaris environment separates language and culture-dependent information from the application and saves the information outside the application. This method eliminates the need to translate, rewrite, or recompile the application for each market. The only requirement to enter a new market is to localize the external information to the local language and customs.
The locale categories are as follows:
LC_CTYPE
Controls the behavior of character handling functions.
LC_TIME
Specifies date and time formats, including month names, days of the week, and common full and abbreviated representations.
LC_MONETARY
Specifies monetary formats, including the currency symbol for the locale, thousands separator, sign position, the number of fractional digits, and so forth.
LC_NUMERIC
Specifies the decimal delimiter (or radix character), the thousands separator, and the grouping.
LC_COLLATE
Specifies a collation order and regular expression definition for the locale.
LC_MESSAGES
Specifies the language in which the localized messages are written, and affirmative and negative responses of the locale (yes and no strings and expressions).
LO_LTYPE
Specifies the layout engine that provides information about language rendering. Language rendering (or text rendering) depends on the shape and direction attributes of a script.
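Programs can query the values that a category supplies. The following sketch uses Python's standard `locale` module, which wraps the C library's setlocale and localeconv functions, to read the numeric conventions of the C locale. Only the C locale is used because the set of other installed locales varies from system to system.

```python
import locale

# Select the C locale for every category. Other locale names would work
# the same way, provided that they are installed on the system.
locale.setlocale(locale.LC_ALL, "C")

conv = locale.localeconv()
print(repr(conv["decimal_point"]))  # radix character supplied by LC_NUMERIC
print(repr(conv["thousands_sep"]))  # empty in the C locale: no grouping
```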
The localization of a product should be done in consultation with native users in that target language or region. Certain information styles and formats might seem perfectly obvious and universal to the developer. However, to the user these formats could look awkward, wrong, or even offensive. The following sections describe the elements in the Oracle Solaris operating system that you can customize to meet the localization requirements for your product.
The following table shows some of the ways in which different locales write 11:59 P.M.
Table 1–1 International Time Formats
Locale | Format |
---|---|
Canadian | 23:59 |
Finnish | 23.59 |
German | 23.59 Uhr |
Norwegian | 23.59 |
Thai | 23:59 |
British English | 23:59 |
Time can be represented by either a 12-hour clock or a 24-hour clock. The hour and minute separator can be a colon ( : ), a period ( . ), or a dash ( - ).
Time zone splits occur between and within countries. Although a time zone can be described in terms of how many hours it is ahead of, or behind, Coordinated Universal Time, UTC (or Greenwich Mean Time, GMT), this number is not always an integer. For example, Newfoundland is in a time zone that is half an hour different from the adjacent time zone.
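The following Python sketch shows a non-integer offset in practice. The fixed offsets of 3 hours 30 minutes and 4 hours behind UTC are assumptions chosen to represent Newfoundland standard time and the adjacent zone, and they ignore daylight saving adjustments.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for illustration only; real zones also change with DST.
newfoundland = timezone(timedelta(hours=-3, minutes=-30))
adjacent = timezone(timedelta(hours=-4))

noon_utc = datetime(2010, 7, 16, 12, 0, tzinfo=timezone.utc)
local = noon_utc.astimezone(newfoundland)
neighbor = noon_utc.astimezone(adjacent)

print(local.strftime("%H:%M"), neighbor.strftime("%H:%M"))
```

Code that stores a time zone offset as a whole number of hours cannot represent the first zone, so offsets are better kept in minutes or seconds.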
Daylight Saving Time (DST) starts and ends on dates that can vary from country to country. Many countries do not implement DST at all. Additionally, Daylight Saving Time can vary within a time zone. In the U.S., for example, the implementation is a state decision.
The following table shows some of the date formats used around the world. Variations can exist even within a country.
Table 1–2 International Date Formats
Locale | Convention | Example |
---|---|---|
Canadian (English) | dd/mm/yy | 16/07/10 |
Danish | dd/mm/yy | 16/07/10 |
Finnish | dd.mm.yyyy | 16.07.2010 |
French | dd/mm/yy | 16/07/10 |
German | dd.mm.yy | 16.07.10 |
Italian | dd/mm/yy | 16/07/10 |
Norwegian | dd.mm.yy | 16.07.10 |
Spanish | dd/mm/yy | 16/07/10 |
Swedish | yyyy-mm-dd | 2010-07-16 |
Great Britain | dd/mm/yyyy | 16/07/2010 |
United States | mm/dd/yy | 07/16/10 |
Thai | mm/dd/yyyy | 07/16/2010 |
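Several of the conventions in Table 1–2 can be written as strftime patterns. The following Python sketch hard-codes the patterns for illustration. The `DATE_PATTERNS` mapping and the `format_date` function are assumptions, and an internationalized program would obtain the pattern from the LC_TIME category of the current locale instead.

```python
from datetime import date

# strftime patterns for a few of the conventions listed in Table 1-2.
DATE_PATTERNS = {
    "Finnish": "%d.%m.%Y",
    "German": "%d.%m.%y",
    "Swedish": "%Y-%m-%d",
    "Great Britain": "%d/%m/%Y",
    "United States": "%m/%d/%y",
}


def format_date(day: date, locale_label: str) -> str:
    """Format a date using the pattern registered for the given label."""
    return day.strftime(DATE_PATTERNS[locale_label])


for label in DATE_PATTERNS:
    print(label, format_date(date(2010, 7, 16), label))
```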
Great Britain and the United States are two of the few places in the world that use a period to indicate the decimal place. Many other countries use a comma instead. The decimal separator is also called the radix character. Likewise, while Great Britain and the United States use a comma to separate groups of thousands, many other countries use a period instead, and some countries separate thousands groups with a thin space.
Data files containing locale-specific formats are frequently misinterpreted when transferred to a system in a different locale. For example, a file containing numbers in a French format is not useful to a British-specific program.
The following table shows some commonly used numeric formats.
Table 1–3 International Numeric Conventions
Locale | Large Number |
---|---|
Canadian (English) | 4,294,967,295.00 |
Danish | 4.294.967.295,00 |
Finnish | 4 294 967 295,00 |
French | 4 294 967 295,00 |
German | 4.294.967.295,00 |
Italian | 4.294.967.295,00 |
Norwegian | 4.294.967.295,00 |
Spanish | 4.294.967.295,00 |
Swedish | 4 294 967 295,00 |
Great Britain | 4,294,967,295.00 |
United States | 4,294,967,295.00 |
Thai | 4,294,967,295.00 |
No particular locale conventions exist that specify how to separate numbers in a list.
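The numeric conventions in Table 1–3 differ only in the grouping separator and the radix character. The following Python sketch makes both explicit. The function `format_number` is an illustrative assumption, and a real program would take the two characters from the LC_NUMERIC category instead of passing them in.

```python
def format_number(value: float, group_sep: str, radix: str) -> str:
    """Format value with two decimals and the given separators."""
    text = f"{value:,.2f}"  # groups of three with "," and "." as the radix
    # Swap through a placeholder so the two separators cannot collide.
    return (
        text.replace(",", "\0")
        .replace(".", radix)
        .replace("\0", group_sep)
    )


print(format_number(4294967295, ",", "."))
print(format_number(4294967295, ".", ","))
print(format_number(4294967295, " ", ","))
```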
Currency units and presentation order vary greatly around the world. Local and international symbols for currency can differ. The following table shows monetary formats in some countries.
Table 1–4 International Monetary Conventions
Locale | Currency | Example |
---|---|---|
Canadian (English) | Dollar ($) | $1,234.56 |
Canadian (French) | Dollar ($) | 1 234,56$ |
Danish | Kroner (kr) | Kr 1.234,56 |
Finnish | Euro (€) | 1 234,56 |
French | Euro (€) | 1,234 |
Japanese | Yen (¥) | ¥ 1,234 |
Norwegian | Krone (kr) | kr 1.234,56 |
Swedish | Krona (Kr) | 1 234,56 Kr |
Great Britain | Pound (£) | £1,234.56 |
United States | Dollar ($) | $1,234.56 |
Thai | Baht | 2539 Baht |
The current release supports the Euro currency. Local currency symbols are still available for backward compatibility.
Table 1–5 User Locales That Support the Euro Currency
Region | Locale Name | ISO Code Set |
---|---|---|
Austria | de_AT.ISO8859-15 | 8859-15 |
Belgium (French) | fr_BE.ISO8859-15 | 8859-15 |
Belgium (Flemish) | nl_BE.ISO8859-15 | 8859-15 |
Denmark | da_DK.ISO8859-15 | 8859-15 |
Estonia | et_EE.ISO8859-15 | 8859-15 |
Finland | fi_FI.ISO8859-15 | 8859-15 |
France | fr_FR.ISO8859-15 | 8859-15 |
Germany | de_DE.ISO8859-15 | 8859-15 |
Great Britain | en_GB.ISO8859-15 | 8859-15 |
Ireland | en_IE.ISO8859-15 | 8859-15 |
Italy | it_IT.ISO8859-15 | 8859-15 |
Netherlands | nl_NL.ISO8859-15 | 8859-15 |
Portugal | pt_PT.ISO8859-15 | 8859-15 |
Catalan Spain | ca_ES.ISO8859-15 | 8859-15 |
Spain | es_ES.ISO8859-15 | 8859-15 |
Sweden | sv_SE.ISO8859-15 | 8859-15 |
U.S.A. | en_US.ISO8859-15 | 8859-15 |
Euro locales are based on the ISO8859-15 code set.
Keep in mind that a converted currency amount can require a different amount of space than the original amount. For example, $1,000 can become 1 000,00 Kr.
The current status of the locale settings for locales within the euro zone is illustrated for the LC_MONETARY operand of the locale utility. The status for Germany, for example, is shown in the following table.
Table 1–6 German Locale and Corresponding LC_MONETARY Operand
Locale | LC_MONETARY |
---|---|
de_DE.ISO8859-1 | DM |
de_DE.ISO8859-15 | Euro |
de_DE.UTF-8 | Euro |
de_DE.ISO8859-15@euro | Euro |
de_DE.UTF-8@euro | Euro |
This section describes important differences between languages.
In English, words are usually separated by a space character. Languages such as Chinese, Japanese, and Thai, however, often have no delimiter between words.
Sorting order for particular characters is not the same in all languages. For example, the character “ö” sorts with the ordinary “o” in Germany, but sorts separately in Sweden, where it is the last letter of the alphabet. In some languages, characters have weight to determine the priority of the character sequences. For example, the Thai dictionary defines sorting through the sequences of characters that have different weights.
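The effect of locale-specific collation on “ö” can be imitated with two sort keys. The following Python sketch is a simplification and not the collation algorithm of any Oracle Solaris locale. The names `SWEDISH_ALPHABET`, `swedish_key`, and `german_key` are assumptions, and the German key ignores other rules, such as the handling of ß.

```python
import unicodedata

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list:
    """Rank each letter by its position in the Swedish alphabet."""
    text = unicodedata.normalize("NFC", word.lower())
    # Characters outside the alphabet sort after it.
    return [RANK.get(ch, len(RANK)) for ch in text]


def german_key(word: str) -> str:
    """Strip combining marks so that an umlauted vowel compares as its base."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["zon", "öra", "ost"]
print(sorted(words, key=german_key))   # "öra" sorts with the words in "o"
print(sorted(words, key=swedish_key))  # "öra" sorts after "zon"
```

In production code, collation should come from the LC_COLLATE category of the locale, for example through strcoll(3C), rather than from hand-written keys.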
Character sets can differ in the number of alphabetic characters and special characters. While the English alphabet contains only 26 characters, some languages contain many more characters. Japanese, for example, can contain over 20,000 characters and Chinese can contain an even higher number of characters.
The alphabets of most western European countries are similar to the standard 26-character alphabet used in English-speaking countries. These alphabets often also include some additional basic characters, some marked or accented characters, and some ligatures.
Japanese text is composed of three different scripts mixed together: Kanji ideographs and two phonetic scripts, Hiragana and Katakana.
Although each character in Hiragana has an equivalent in Katakana, Hiragana is the most common script, with cursive rather than block-like letter forms. Kanji characters are used to write root words. Katakana is mostly used to represent “foreign” words, that is, words imported from languages other than Japanese.
Kanji has tens of thousands of characters, but the number commonly used has declined steadily over the years. Now only about 3500 are frequently used, although the average Japanese writer has a vocabulary of about 2000 Kanji characters. Nonetheless, computer systems must support more than 7000 characters in accordance with the Japanese Industrial Standards (JIS) requirements. In addition, there are about 170 Hiragana and Katakana characters. On average, 55% of Japanese text is Hiragana, 35% Kanji, and 10% Katakana. Arabic numerals and Roman letters are also present in Japanese text.
Although completely avoiding the use of Kanji is possible, most Japanese readers find a text that is composed without any Kanji hard to understand.
Korean text can be written using a phonetic writing system called Hangul. Hangul has more than 11,000 characters, which consist of consonants and vowels known as jamos. About 3000 characters from the entire Hangul vocabulary of characters are usually used in Korean computer systems. Korean also uses ideographs based on the set invented in China, called Hanja. Korean text requires over 6000 Hanja characters. Hanja is used mostly to avoid confusion when Hangul would be ambiguous. Hangul characters are formed by combining consonants and vowels. After these characters are combined, they can compose one syllable, which is a Hangul character. Hangul characters are often arranged in a square, so that the group takes up the same space as a Hanja character. Arabic numerals, Roman letters, and special symbol characters are also present in Korean text.
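The figure of more than 11,000 Hangul characters follows from how syllables are built from jamos. The following Python sketch uses the arithmetic that the Unicode Standard defines for its precomposed Hangul syllables block, which is an assumption about the encoding in use and does not describe other Korean codesets. The function name `compose_syllable` is illustrative.

```python
LEADS, VOWELS, TAILS = 19, 21, 28  # tails include "no final consonant"
HANGUL_BASE = 0xAC00               # first precomposed syllable in Unicode


def compose_syllable(lead: int, vowel: int, tail: int = 0) -> str:
    """Build one precomposed syllable from zero-based jamo indexes."""
    if not (0 <= lead < LEADS and 0 <= vowel < VOWELS and 0 <= tail < TAILS):
        raise ValueError("jamo index out of range")
    return chr(HANGUL_BASE + (lead * VOWELS + vowel) * TAILS + tail)


print(LEADS * VOWELS * TAILS)     # number of possible syllables
print(compose_syllable(18, 0, 4))  # lead 18, vowel 0, tail 4
```

The product of the three jamo counts is 11,172, which matches the size of the Hangul character repertoire described above.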
A Thai character can be defined as a column position on a display screen with four display cells. Each column position can have up to three characters. The composition of a display cell is based on the Thai character's classification. Some Thai characters can be composed with another character's classification. If both characters can be composed together, both characters are in the same cell. Otherwise, they are in separate cells.
Chinese usually consists entirely of characters from the ideographic script called Hanzi.
In the People's Republic of China (PRC) there are about 7000 commonly used Hanzi characters in the GB2312 (zh locale), more than 20,000 characters in the GBK charset (zh.GBK locale), and about 30,000 characters in the GB18030-2000 charset (zh_CN.GB18030 locale), including all CJK extension A characters defined in Unicode 3.0.
In Taiwan, the most frequently used charsets are the CNS11643-1992 (zh_TW locale) and the Big5 (zh_TW.BIG5 locale). They share about 13,000 Hanzi characters.
In Hong Kong, 4702 characters have been added into the Big5 charset to become the Big5-HKSCS charset (zh_HK.BIG5HK).
If a character is not a root character, it usually consists of two or more parts, two being most common. In two-part characters, one part generally represents meaning, and the other represents pronunciation. Occasionally both parts represent meaning. The radical is the most important element, and characters are traditionally arranged by radical, of which there are several hundred. A single sound can be represented by many different characters, which are not interchangeable in usage. A single character can have different sounds.
Some characters are more appropriate than others in a given context. The appropriate character is distinguished phonetically by the use of tones. By contrast, spoken Japanese and Korean lack tones.
Several phonetic systems represent Chinese. In the People's Republic of China the most common is pinyin, which uses Roman characters and is widely employed in the West for place names such as Beijing. The Wade-Giles system is an older phonetic system, formerly used for place names such as Peking. In Taiwan zhuyin (or bopomofo), a phonetic alphabet with unique letter forms, is often used instead.
The Hebrew script is used for writing the Hebrew and Yiddish languages. Hebrew uses a bidirectional script. Hebrew letters are written and read from right to left, while numbers are read from left to right. Any English text that is embedded in Hebrew text is also read from left to right.
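The direction of each character is recorded in the Unicode character database, which layout software consults when ordering bidirectional text. The following Python sketch only reads those properties through the standard `unicodedata` module and does not perform any reordering. The function name `bidi_classes` is an assumption.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


# U+05D0 is the Hebrew letter alef; "A" and "1" are a Latin letter and a digit.
print(bidi_classes("\u05d0A1"))
```

Hebrew letters report the class R (right to left), Latin letters report L (left to right), and European digits report EN, which is why numbers keep their left-to-right order inside Hebrew text.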
Hebrew uses a 27-character alphabet, and takes punctuation marks and numbers from the standard Latin (or English) alphabet. Hebrew text also includes vowel and pronunciation marks. These marks appear either as a dot (dagesh) inside the base character, vowel marks below the character, or accents to the upper left of the character. These marks are generally only used in liturgical text, and are rarely seen in day-to-day use. Hebrew has no uppercase letters.
Hindi text is written in a script called Devanagari, which means the writing of the gods. Hindi is a phonetic language, and is written as a series of syllables. Each syllable is built up of alphabetic pieces (the Devanagari characters) of three types: consonant letters, independent vowels, and dependent vowel signs. The syllable itself consists of a consonant and vowel core, with an optional preceding consonant. Unlike English, which starts from a baseline, Devanagari characters hang from a horizontal line (called the head stroke) written at the top of the characters. These characters can combine or change shape depending on their context. Like Hebrew, Hindi text makes no distinction between uppercase and lowercase letters.
Not all characters on the U.S. keyboard appear on other keyboards. Similarly, other keyboards often contain many characters not visible on the U.S. keyboard.
Any keyboard can be used to input characters from any locale because input is handled by the Oracle Solaris operating system.
On SPARC® and on x86 based platform machines, the Compose key can be used to produce any Latin character with a diacritic in any of the supported ISO8859 character sets. The Compose key can be used with Latin-based locales, but not with Korean, Chinese, or Japanese locales, except the UTF-8 locales.
Within each country, a small number of paper sizes are commonly used. Normally, one of those sizes is much more common than the others. Most countries follow ISO Standard 216: “Writing paper and certain classes of printed matter-Trimmed sizes-A and B series.”
Internationalized applications should not make assumptions about the page sizes available to them. The Oracle Solaris system provides no support for tracking the output page size. This tracking is the responsibility of the application program. The following table shows common international page sizes.
Table 1–7 Common International Page Sizes
Paper Type | Dimensions | Countries |
---|---|---|
ISO A4 | 21.0 cm by 29.7 cm | Everywhere except U.S. |
ISO A5 | 14.8 cm by 21.0 cm | Everywhere except U.S. |
JIS B4 | 25.7 cm by 36.4 cm | Japan |
JIS B5 | 18.2 cm by 25.7 cm | Japan |
U.S. Letter | 8.5 inches by 11 inches | U.S. and Canada |
U.S. Legal | 8.5 inches by 14 inches | U.S. and Canada |