International Language Environments Guide

Chapter 1 Solaris Internationalization Overview

This section discusses some general information about internationalization and localization.

The Solaris 9 product includes full Unicode 3.1 support, as defined in Unicode and ISO/IEC 10646, for selected locales. The Solaris 9 release is a major release for Sun's international markets. It includes a number of new features.

The Solaris 9 operating environment has been designed to speak the languages of the world since its inception. With a pluggable, service-based approach to globalization, the Solaris internationalization architecture eases development, deployment and management of applications and language services around the world. In one convenient, multilingual product, users benefit from extensive support for 39 different languages and 162 locales, including complex text layout environments needed to support Thai and Hindi, and bidirectional layout environments for languages like Arabic and Hebrew.

The Solaris internationalization architecture provides a flexible and pluggable method of handling input methods, character set encodings, codeset conversion and other basic aspects of language services. You can choose between the powerful tools already provided, or customize your environment. You can deploy applications in multiple language environments without knowing how input methods work or which codeset converter needs to be enabled, simply by following standard APIs. You can also customize particular language attributes. The architecture enables you to change converter tables or add a new input method editor.

The source code for the Solaris X globalization framework was released to the open community in the fall of 2000. You now have the ability to enhance compatibility and interoperability of global applications by following a common reference implementation while also participating in the evolution of the code base. The codeset-independent approach to globalization enables you to operate in native encoded environments or join the growing world of Unicode. The Solaris framework gives you the power to scale across platforms with a rich set of data converters designed to ensure interoperability between various encodings and various platforms (Microsoft Windows or Macintosh, for example).

Solaris also helps multinational corporations scale their server administration worldwide. Unlike competing platforms, the Solaris platform uses a service-based approach to administration of language services. Server administrators can enable language services remotely across a worldwide network, regardless of the client system. This client-independent approach enables the easy upgrade of the system without changing client applications. For example, an Arabic-speaking user needing to read an email from an internet cafe in Paris would still be able to read that email in his or her own language without modifying the local client application.

New Internationalization and Localization Features

The following features are new to the Solaris 9 release:

Internationalization and Localization Defined

Internationalization and localization are different procedures. Internationalization is the process of making software portable between languages or regions, while localization is the process of adapting software for specific languages or regions. Internationalized software can be developed using interfaces that modify program behavior at runtime in accordance with specific cultural requirements. Localization involves establishing online information to support a language or region, called a locale.

Unlike software that must be completely rewritten before it can work with different native languages and customs, internationalized software does not require rewriting. The internationalized software can be ported from one locale to another without change. The Solaris system is internationalized, providing the infrastructure and interfaces you need to create internationalized software.

Basic Steps in Internationalization

An internationalized application's executable image is portable between languages and regions. To internationalize software, you should:

Message strings are translated for a language or region. A locale includes the message strings and methods to specify sorting.

To use a localized version of a product, the user sets certain environment variables. The product then displays messages in their translated form. Date, time, currency and other information is formatted and displayed according to locale-specific conventions. Message translations and online help contents are provided throughout different layers, as described in the following diagram.

Figure 1–1 Functions and Structure of Locales in the Solaris Operating Environment


Localization Functions in Solaris Interfaces

The OS locale layer provides the basic locale database and functions that are plugged into the OS system interface at the application's runtime. Applications access these OS locale modules through standard APIs.

The X11 locale layer provides the interface to X input method and X output method so that the X11 applications can allow local text input and display. Fonts are provided to enable applications to display characters from various languages.

CDE/Motif is built on top of the X11 window system. Hence, it can utilize the X11 locale capability through X11 APIs. Solaris localizations have various locale-specific configurations for CDE applications in order to make the desktop functional within the target locale. Message translations and online help contents are provided throughout different layers.

What Is a Locale?

A key concept for application programs is that of a program's locale. The locale is an explicit model and definition of a native-language environment. The notion of a locale is explicitly defined and included in the library definitions of the ANSI C Language standard.

A locale consists of a number of categories for which there is country-dependent formatting or other specifications. A program's locale defines its codesets, date and time formatting conventions, monetary conventions, decimal formatting conventions, and collation (sort) order.

A locale can be composed of a base language, the country (territory) of use, and an optional codeset. The codeset is usually assumed. For example, German is de, an abbreviation for Deutsch, while Swiss German is de_CH, CH being an abbreviation for Confoederatio Helvetica. This allows for specific differences by country, such as currency unit notation.

More than one locale can be associated with a particular language, which allows for regional differences. For example, an English-speaking user in the United States can select the en_US locale (English for the United States), while an English-speaking user in Great Britain can select en_GB (English for Great Britain).

Generally the locale name is specified by the LANG environment variable. Locale categories are subordinate to LANG, but can be set separately, in which case they override LANG. If the LC_ALL environment variable is set, it overrides not only LANG, but all the separate locale categories as well.
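The precedence rule above can be sketched as a small lookup. This is an illustrative sketch only: the function name effective_locale is hypothetical, and falling back to the C locale when nothing is set is an assumption based on common POSIX behavior, not something this guide states.

```python
def effective_locale(category, env):
    """Return the locale name that applies to one category.

    Precedence, highest first: LC_ALL, the category's own variable, LANG.
    An empty value is treated the same as an unset one.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # assumed default when nothing is set


env = {"LANG": "de_DE.ISO8859-15", "LC_TIME": "en_GB.ISO8859-15"}
print(effective_locale("LC_TIME", env))      # en_GB.ISO8859-15 (category overrides LANG)
print(effective_locale("LC_MONETARY", env))  # de_DE.ISO8859-15 (falls back to LANG)

env["LC_ALL"] = "fr_FR.ISO8859-15"
print(effective_locale("LC_TIME", env))      # fr_FR.ISO8859-15 (LC_ALL overrides everything)
```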

The locale naming convention is:

language[_territory][.codeset][@modifier]

where a two-letter language code is from ISO 639, a two-letter territory code is from ISO 3166, codeset is the name of the codeset that is being used in the locale, and modifier is the name of the characteristics that differentiate it from the locale without the modifier.
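As a rough illustration, the naming convention can be split into its parts with a regular expression. The pattern and the function name parse_locale_name are assumptions made for this sketch; the sketch does not validate the codes against ISO 639 or ISO 3166.

```python
import re

# language[_territory][.codeset][@modifier]
LOCALE_NAME = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale_name(name):
    """Split a locale name into its parts; missing parts come back as None."""
    match = LOCALE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a locale name: {name!r}")
    return match.groupdict()


print(parse_locale_name("de_DE.ISO8859-15@euro"))
print(parse_locale_name("en_US"))
```

For de_DE.ISO8859-15@euro the parts are language de, territory DE, codeset ISO8859-15, and modifier euro; for en_US the codeset and modifier are None.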

All Solaris product locales preserve the Portable Character Set characters with US-ASCII code values.

For more information on the Portable Character Set, refer to “X/Open CAE Specification: System Interface Definitions, Issue 5” (ISBN 1-85912-186-1).

A single locale can have more than one locale name. For example, POSIX is the same as C.

Full and Partial Locales

A full Solaris locale has all of the listed functions and the localized system messages in the relevant language. Partial locales have no localized messages installed. All locales in the Solaris environment are capable of displaying localized messages, provided that localized messages for the relevant language are installed. For example, the following locales can be either partial or full locales:

When the German message translations are installed using the Languages CD, all of the above locales become full locales because they have access to a fully translated desktop. The Languages CD contains message translations for the following languages and locales:

All partial locales are available on the Software CD. Message translations are available on the Languages CD.

All English locales are also full locales and are available on the Software CD.

Behavior Affected by Locales

Different cultures often use different conventions for writing the date and time, formatting numbers, delimiting words and phrases, and quoting material. Throughout the system, a locale determines the behavior of the following items:

The Solaris environment separates language and culture-dependent information from the application and saves it outside the application. Doing so eliminates the need to translate, rewrite, or recompile the application for each market. The only requirement to enter a new market is to localize the external information to the local language and customs.
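The idea of keeping translatable text outside the program can be sketched with Python's gettext module, which looks messages up in external catalogs at run time. The domain name demo_app and the helper load_messages are hypothetical. With fallback=True, the original string is returned when no catalog is installed, which is what happens here because the directory is empty.

```python
import gettext
import tempfile


def load_messages(language, localedir):
    """Load the message catalog for one language from outside the program.

    "demo_app" is a hypothetical catalog (domain) name. With fallback=True,
    a missing catalog yields an object that returns messages unchanged.
    """
    return gettext.translation(
        "demo_app", localedir=localedir, languages=[language], fallback=True
    )


with tempfile.TemporaryDirectory() as empty_dir:
    catalog = load_messages("de", empty_dir)
    # No German catalog is installed, so the untranslated text comes back.
    print(catalog.gettext("File not found"))  # File not found
```

Installing a compiled catalog under the locale directory would change the output without any change to the program itself.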

Locale Categories

The locale categories are as follows:

LC_CTYPE

Controls the behavior of character handling functions.

LC_TIME

Specifies date and time formats, including month names, days of the week, and common full and abbreviated representations.

LC_MONETARY

Specifies monetary formats, including currency symbol for the locale, thousands separator, sign position, the number of fractional digits, and so forth.

LC_NUMERIC

Specifies the decimal delimiter (or radix character), the thousands separator, and the grouping.

LC_COLLATE

Specifies the collation order and the regular expression definitions for the locale.

LC_MESSAGES

Specifies the language in which the localized messages are written, and the affirmative and negative responses of the locale (yes and no strings and expressions).

LO_LTYPE

Specifies the layout engine that provides information about language rendering. Language rendering (or text rendering) consists of text shaping and directionality.
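A program can query the current setting of each category separately. The following sketch uses Python's locale module, which wraps the C library setlocale() call. It is an illustration under assumptions: LO_LTYPE is not exposed by that module and is therefore omitted, and LC_MESSAGES is skipped on platforms that do not define it.

```python
import locale

CATEGORIES = ["LC_CTYPE", "LC_TIME", "LC_MONETARY",
              "LC_NUMERIC", "LC_COLLATE", "LC_MESSAGES"]


def category_settings():
    """Return the current locale name for each available category."""
    settings = {}
    for name in CATEGORIES:
        category = getattr(locale, name, None)  # LC_MESSAGES is absent on some platforms
        if category is not None:
            # Calling setlocale() without a locale argument only queries.
            settings[name] = locale.setlocale(category)
    return settings


# Start from the C locale, which every system provides.
locale.setlocale(locale.LC_ALL, "C")
for name, value in category_settings().items():
    print(name, "=", value)
```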

Using Locale Categories for Localization

The localization of a product should be done in consultation with native users in that target language or region. Certain information styles and formats might seem perfectly obvious and universal to the developer but to the user, they could look awkward, wrong, or even offensive. The following sections describe the elements in the Solaris operating environment that you can control and specify so that you can successfully localize your product.

Time Formats

The following table shows some of the ways in which different locales write 11:59 P.M.

Table 1–1 International Time Formats

Locale           Format
Canadian         23:59
Finnish          23.59
German           23.59 Uhr
Norwegian        23.59
Thai             23:59
Great Britain    23:59

Time is represented by both a 12-hour clock and a 24-hour clock. The hour and minute separator can be either a colon ( : ) or a period ( . ).
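The formats in Table 1–1 can be reproduced with explicit patterns, as in the sketch below. The dictionary and the function name format_time are hypothetical; an internationalized program would normally take the pattern from the LC_TIME category rather than hard-code it.

```python
from datetime import time

# strftime patterns that reproduce the rows of Table 1-1.
TIME_PATTERNS = {
    "Canadian": "%H:%M",
    "Finnish": "%H.%M",
    "German": "%H.%M Uhr",
    "Norwegian": "%H.%M",
    "Thai": "%H:%M",
    "Great Britain": "%H:%M",
}


def format_time(locale_label, moment):
    """Format a time of day using the pattern for one table row."""
    return moment.strftime(TIME_PATTERNS[locale_label])


late = time(23, 59)
print(format_time("Finnish", late))  # 23.59
print(format_time("German", late))   # 23.59 Uhr
print(late.strftime("%I:%M %p"))     # 12-hour form; the AM/PM text depends on the current locale
```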

Time zone splits occur between and within countries. Although a time zone can be described in terms of how many hours it is ahead of, or behind, Coordinated Universal Time, UTC (or Greenwich Mean Time, GMT), this number is not always an integer. For example, Newfoundland is in a time zone that is half an hour different from the adjacent time zone.
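The half-hour case can be shown with fixed offsets. This sketch assumes standard (winter) time, UTC-3:30 for Newfoundland and UTC-4 for the neighboring Atlantic zone, and ignores daylight saving rules; the constant names are chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets; daylight saving rules are deliberately ignored.
NEWFOUNDLAND = timezone(timedelta(hours=-3, minutes=-30), "NST")
ATLANTIC = timezone(timedelta(hours=-4), "AST")

noon_utc = datetime(2001, 1, 24, 12, 0, tzinfo=timezone.utc)

print(noon_utc.astimezone(NEWFOUNDLAND).isoformat())  # 2001-01-24T08:30:00-03:30
print(noon_utc.astimezone(ATLANTIC).isoformat())      # 2001-01-24T08:00:00-04:00
```

Storing offsets as whole hours would lose the 30-minute difference, so offsets are better kept in minutes or as a duration type.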

Daylight Saving Time (DST) starts and ends on dates that vary from country to country. Many countries do not implement DST at all. Additionally, Daylight Saving Time can vary within a time zone. In the U.S. it is a state decision.

Date Formats

The following table shows some of the date formats used around the world. Notice that even within a country, there can be variations.

Table 1–2 International Date Formats

Locale                Convention    Example
Canadian (English)    dd/mm/yy      24/08/01
Danish                yyyy-mm-dd    2001-08-24
Finnish               dd.mm.yyyy    24.08.2001
French                dd/mm/yyyy    24/08/2001
German                yyyy-mm-dd    2001-08-24
Italian               dd/mm/yy      24/08/01
Norwegian             dd-mm-yy      24-08-01
Spanish               dd-mm-yy      24-08-01
Swedish               yyyy-mm-dd    2001-08-24
Great Britain         dd/mm/yy      24/08/01
United States         mm-dd-yy      08-24-01
Thai                  dd/mm/yyyy    24/08/2001
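The conventions in Table 1–2 map directly onto strftime directives. The helper names below are hypothetical, and the sketch handles only the numeric tokens used in the table (dd, mm, yy, yyyy); a real application would obtain the pattern from the LC_TIME category.

```python
from datetime import date


def to_strftime(convention):
    """Translate a convention such as 'dd.mm.yyyy' into a strftime pattern."""
    pattern = convention
    # Replace the four-digit year first so that 'yyyy' is not read as 'yy' twice.
    for token, directive in (("yyyy", "%Y"), ("yy", "%y"), ("mm", "%m"), ("dd", "%d")):
        pattern = pattern.replace(token, directive)
    return pattern


def format_date(convention, day):
    return day.strftime(to_strftime(convention))


day = date(2001, 8, 24)
print(format_date("dd.mm.yyyy", day))  # 24.08.2001
print(format_date("mm-dd-yy", day))    # 08-24-01
print(format_date("yyyy-mm-dd", day))  # 2001-08-24
```

The same day written as 08-24-01 and 24-08-01 shows why a date string cannot be interpreted without knowing the convention that produced it.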

Numbers

Great Britain and the United States are two of the few places in the world that use a period to indicate the decimal place. Many other countries use a comma instead. The decimal separator is also called the radix character. Likewise, while Great Britain and the United States use a comma to separate groups of thousands, many other countries use a period instead, and some countries separate thousands groups with a thin space.

Data files containing locale-specific formats are frequently misinterpreted when transferred to a system in a different locale. For example, a file containing numbers in a French format is not useful to a British-specific program.
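The misinterpretation problem can be demonstrated with two small helpers that take the separators as explicit arguments. The function names are hypothetical; a real program would obtain the separators from the LC_NUMERIC category instead of passing them by hand.

```python
def format_number(value, group_sep, decimal_sep):
    """Format with two decimals using the given separators."""
    text = f"{value:,.2f}"  # for example "4,294,967,295.00"
    # Swap through a placeholder so that the two replacements cannot collide.
    return text.replace(",", "\0").replace(".", decimal_sep).replace("\0", group_sep)


def parse_number(text, group_sep, decimal_sep):
    """Parse a number written with the given separators."""
    return float(text.replace(group_sep, "").replace(decimal_sep, "."))


print(format_number(4294967295, " ", ","))  # 4 294 967 295,00
print(format_number(4294967295, ",", "."))  # 4,294,967,295.00

# "1.234" written with a period as the thousands separator means 1234 ...
print(parse_number("1.234", ".", ","))      # 1234.0
# ... but a program that assumes British conventions reads it as 1.234.
print(parse_number("1.234", ",", "."))      # 1.234
```

Data exchanged between systems is therefore safer in a locale-neutral form, with locale-specific formatting applied only for display.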

The following table shows some commonly used numeric formats.

Table 1–3 International Numeric Conventions

Locale                Large Number
Canadian (English)    4,294,967.00
Danish                4.294 967.295,00
Finnish               4 294 967 295,00
French                4 294 967 295,00
German                4,294,967.00
Italian               4.294.967,00
Norwegian             4.294.967.295,00
Spanish               4.294.967.295,00
Swedish               4 294 967 295,00
Great Britain         4,294,967,295.00
United States         4,294,967,295.00
Thai                  4,294,967,295.00


Note –

There are no particular locale conventions that specify how to separate numbers in a list.


Currency

Currency units and presentation order vary greatly around the world. Local and international symbols for currency can differ. The following table shows monetary formats in some countries.

Table 1–4 International Monetary Conventions

Locale                Currency      Example
Canadian (English)    Dollar ($)    $1,234.56
Canadian (French)     Dollar ($)    1 234,56$
Danish                Kroner (kr)   Kr 1.234,56
Finnish               Euro (€)      €1 234,56
French                Euro (€)      €1,234
Japanese              Yen (¥)       ¥ 1,234
Norwegian             Krone (kr)    kr 1.234,56
Swedish               Krona (Kr)    1 234,56 Kr
Great Britain         Pound (£)     £1,234.56
United States         Dollar ($)    $1,234.56
Thai                  Baht          2539 Baht
Euro                  Euro (€)      €5,000
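Some of the rows in Table 1–4 can be reproduced with a small style table that records the symbol, its position, and the separators. The structure and names below are assumptions made for this sketch; in an internationalized program the same information comes from the LC_MONETARY category.

```python
from decimal import Decimal

# (symbol, symbol first?, space between?, group separator, decimal separator)
MONEY_STYLES = {
    "Canadian (English)": ("$", True, False, ",", "."),
    "Canadian (French)": ("$", False, False, " ", ","),
    "Norwegian": ("kr", True, True, ".", ","),
    "Swedish": ("Kr", False, True, " ", ","),
    "Great Britain": ("£", True, False, ",", "."),
}


def format_money(amount, style_name):
    """Format an amount following one of the hypothetical styles above."""
    symbol, symbol_first, spaced, group_sep, decimal_sep = MONEY_STYLES[style_name]
    digits = f"{amount:,.2f}"
    # Swap separators through a placeholder so the replacements cannot collide.
    digits = digits.replace(",", "\0").replace(".", decimal_sep).replace("\0", group_sep)
    gap = " " if spaced else ""
    return f"{symbol}{gap}{digits}" if symbol_first else f"{digits}{gap}{symbol}"


amount = Decimal("1234.56")
for style in MONEY_STYLES:
    print(style, "->", format_money(amount, style))
```

Decimal is used instead of float so that monetary amounts are not subject to binary rounding.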

The Solaris 9 software supports the euro currency. Local currency symbols are still available for backward compatibility.

Table 1–5 User Locales To Support the Euro Currency

Region               Locale Name         ISO Codeset
Austria              de_AT.ISO8859-15    8859-15
Belgium (French)     fr_BE.ISO8859-15    8859-15
Belgium (Flemish)    nl_BE.ISO8859-15    8859-15
Denmark              da_DK.ISO8859-15    8859-15
Finland              fi_FI.ISO8859-15    8859-15
France               fr_FR.ISO8859-15    8859-15
Germany              de_DE.ISO8859-15    8859-15
Ireland              en_IE.ISO8859-15    8859-15
Italy                it_IT.ISO8859-15    8859-15
Netherlands          nl_NL.ISO8859-15    8859-15
Portugal             pt_PT.ISO8859-15    8859-15
Catalan Spain        ca_ES.ISO8859-15    8859-15
Estonia              et_EE.ISO8859-15    8859-15
Spain                es_ES.ISO8859-15    8859-15
Sweden               sv_SE.ISO8859-15    8859-15
Great Britain        en_GB.ISO8859-15    8859-15
U.S.A.               en_US.ISO8859-15    8859-15

Euro locales are based on the ISO8859-15 codeset.
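The reason the euro locales use this codeset can be checked directly: ISO 8859-15 assigns the euro sign to byte 0xA4, while ISO 8859-1 has no euro sign at all. The helper name can_encode is hypothetical.

```python
EURO = "\u20ac"  # the euro sign


def can_encode(text, codeset):
    """Report whether every character of text exists in the codeset."""
    try:
        text.encode(codeset)
        return True
    except UnicodeEncodeError:
        return False


print(EURO.encode("iso8859-15"))      # b'\xa4'
print(can_encode(EURO, "iso8859-1"))  # False
print(len(EURO.encode("utf-8")))      # 3
```

In UTF-8 the same character takes three bytes, so field widths measured in bytes differ between an ISO8859-15 locale and a UTF-8 locale.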

Keep in mind that a converted currency amount can take up more or less space than the original amount. To illustrate: $1,000 can become Graphic1.307.000.

The current status of the locale settings for locales within the euro zone is illustrated for the LC_MONETARY operand of the locale utility. The status for Germany, for example, is shown in the following table.

Table 1–6 German Locale and Corresponding LC_MONETARY

Locale                   LC_MONETARY
de_DE.ISO8859-1          DM
de_DE.ISO8859-15         Euro
de_DE.UTF-8              Euro
de_DE.ISO8859-15@euro    Euro
de_DE.UTF-8@euro         Euro

Language Word and Letter Differences

This section describes important differences between languages.

Word Delimiters

In English, words are usually separated by a space character. In languages such as Chinese, Japanese, and Thai, however, there is often no delimiter between words.
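A quick way to see the consequence is to split text on white space, which is a common but English-centric way to find words. The sample sentences and the function name are chosen for this sketch.

```python
def count_space_delimited_words(text):
    """Count the chunks obtained by splitting on white space."""
    return len(text.split())


english = "The weather is nice today"
japanese = "今日は良い天気です"  # roughly the same sentence, written without spaces

print(count_space_delimited_words(english))   # 5
print(count_space_delimited_words(japanese))  # 1
```

Code that wraps lines, counts words, or selects a word on double-click therefore cannot rely on the space character alone.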

Sort Order

Sorting order for particular characters is not the same in all languages. For example, the character “ö” sorts with the ordinary “o” in Germany, but sorts separately in Sweden, where it is the last letter of the alphabet. In some languages, characters have weight to determine the priority of the character sequences. For example, the Thai dictionary defines sorting through the sequences of characters that have different weights.
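The German and Swedish treatment of “ö” can be imitated with two sort keys. Both keys are deliberately simplified assumptions made for this sketch (the Swedish key accepts only lowercase letters of the Swedish alphabet); an internationalized program would rely on the LC_COLLATE category instead, for example through strcoll().

```python
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def german_key(word):
    """Simplified German rule: umlauted vowels sort with their base vowel."""
    return word.lower().translate(str.maketrans("äöü", "aou"))


def swedish_key(word):
    """Simplified Swedish rule: å, ä, ö are separate letters after z."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


words = ["zon", "öl", "ost"]
print(sorted(words, key=german_key))   # ['öl', 'ost', 'zon']
print(sorted(words, key=swedish_key))  # ['ost', 'zon', 'öl']
```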

Character Sets

Character sets can differ in the number of alphabetic characters and special characters. While the English alphabet contains only 26 characters, some languages contain many more characters. Japanese, for example, can contain over 20,000 characters, and Chinese can contain even more.

Western European Alphabets

The alphabets of most western European countries are similar to the standard 26-character alphabet used in English-speaking countries, but there are often some additional basic characters, some marked (or accented) characters, and some ligatures.

Japanese Text

Japanese text is composed of three different scripts mixed together: Kanji ideographs derived from Chinese, and two phonetic scripts (or syllabaries), hiragana and katakana.

Although each character in hiragana has an equivalent in katakana, hiragana is the most common script, with cursive rather than block-like letter forms. Kanji characters are used to write root words. Katakana is mostly used to represent “foreign” words, that is, words “imported” from languages other than Japanese.

Kanji has tens of thousands of characters, but the number commonly used has been declining steadily over the years. Now only about 3500 are frequently used, although the average Japanese writer has a vocabulary of about 2000 kanji characters. Nonetheless, computer systems must support more than 7000 because that is what the Japanese Industrial Standard (JIS) requires. In addition, there are about 170 hiragana and katakana characters. On average, 55% of Japanese text is hiragana, 35% kanji, and 10% katakana. Arabic numerals and Roman letters are also present in Japanese text.
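The mix of scripts in a piece of Japanese text can be measured by classifying each character. The sketch below classifies by Unicode character name, which is an approach chosen for this example; the function names and the sample sentence are hypothetical.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Classify one character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


def script_mix(text):
    """Count how many characters of each script the text contains."""
    return Counter(script_of(ch) for ch in text)


# "I drink coffee": kanji for the roots, katakana for the imported word.
sample = "私はコーヒーを飲みます"
print(script_mix(sample))
```

For this sample the counts are 5 hiragana, 4 katakana, and 2 kanji characters; a short sentence will not match the long-run averages quoted above.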

Although completely avoiding the use of kanji is possible, most Japanese readers find a text that is composed without any kanji hard to understand.

Korean Text

Korean text can be written using a phonetic writing system called Hangul. Hangul has more than 11,000 characters, which consist of consonants and vowels known as jamos. About 3000 characters from the entire Hangul vocabulary of characters are usually used in Korean computer systems. Korean also uses ideographs based on the set invented in China, called hanja. Korean text requires over 6000 hanja characters. Hanja is used mostly to avoid confusion when Hangul would be ambiguous. Hangul characters are formed by combining consonants and vowels. After combining them, they can compose one syllable, which is a Hangul character. Hangul characters are often arranged in a square, so that the group takes up the same space as a hanja character. Arabic numerals, Roman letters, and special symbol characters are also present in Korean text.
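The way jamos combine into one syllable character can be shown with the composition arithmetic defined by the Unicode Standard, which encodes 19 leading consonants, 21 vowels, and 27 optional trailing consonants. The function name compose_hangul is hypothetical, and the result is cross-checked against the standard library's normalization.

```python
import unicodedata


def compose_hangul(lead, vowel, tail=None):
    """Combine conjoining jamos into one precomposed Hangul syllable."""
    l_index = ord(lead) - 0x1100   # leading consonants start at U+1100
    v_index = ord(vowel) - 0x1161  # vowels start at U+1161
    t_index = 0 if tail is None else ord(tail) - 0x11A7  # 0 means no trailing consonant
    return chr(0xAC00 + (l_index * 21 + v_index) * 28 + t_index)


jamos = "\u1112\u1161\u11ab"  # the three jamos of the syllable "han"
syllable = compose_hangul(*jamos)

print(f"U+{ord(syllable):04X}")                           # U+D55C
print(syllable == unicodedata.normalize("NFC", jamos))    # True
print(19 * 21 * 28)                                       # 11172 possible syllables
```

The product 19 x 21 x 28 = 11,172 accounts for the figure of more than 11,000 Hangul characters.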

Thai Text

A Thai character can be defined as a column position on a display screen with four display cells. Each column position can have up to three characters. The composition of a display cell is based on the Thai character's classification. Some Thai characters can be composed with another character's classification. If they can be composed together, both characters are in the same cell. Otherwise, they are in separate cells.
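The difference between stored characters and display columns can be approximated by counting only the characters that occupy their own cell. The sketch below treats Unicode nonspacing marks (category Mn) as sharing the column of the preceding character. This is a simplification assumed for the example, not the classification scheme used by the Solaris layout engine.

```python
import unicodedata


def display_columns(text):
    """Approximate the number of column positions the text occupies."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")


stacked = "\u0e01\u0e34\u0e48"  # consonant + vowel above + tone mark
side_by_side = "\u0e01\u0e32"   # consonant + following vowel

print(len(stacked), display_columns(stacked))            # 3 1
print(len(side_by_side), display_columns(side_by_side))  # 2 2
```

Cursor movement, deletion, and field-width calculations need the column count, not the character count.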

Chinese Text

Chinese usually consists entirely of characters from the ideographic script called hanzi.

If a character is not a root character, it usually consists of two or more parts, two being most common. In two-part characters, one part generally represents meaning, and the other represents pronunciation. Occasionally both parts represent meaning. The radical is the most important element, and characters are traditionally arranged by radical, of which there are several hundred. A single sound can be represented by many different characters, which are not interchangeable in usage. A single character can have different sounds.

Some characters are more appropriate than others in a given context; the appropriate one is distinguished phonetically by the use of tones. By contrast, spoken Japanese and Korean lack tones.

Several phonetic systems represent Chinese. In the People's Republic of China the most common is pinyin, which uses Roman characters and is widely employed in the West for place names such as Beijing. The Wade-Giles system is an older phonetic system, formerly used for place names such as Peking. In Taiwan zhuyin (or bopomofo), a phonetic alphabet with unique letter forms, is often used instead.

Hebrew Text

The Hebrew script is used for writing the Hebrew and Yiddish languages, and predates the English language by thousands of years. Hebrew is an example of a bidirectional script, in that Hebrew letters are written and read from right to left, while numbers are read from left to right. Any English text that is embedded in Hebrew text is also read from left to right.

Hebrew uses a 27-character alphabet, and takes punctuation marks and numbers from the standard Latin (or English) alphabet. Hebrew text also includes vowel and pronunciation marks. These marks appear either as a dot (Dagesh) inside the base character, vowel marks below the character, or accents to the upper left of the character. These marks are generally only used in liturgical text, and are rarely seen in day-to-day use. There are also no uppercase letters in Hebrew.
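The directionality of each character is recorded in the Unicode character database, which layout software consults when it orders mixed text for display. The sketch below only inspects those properties; it does not perform the reordering itself, and the function name is hypothetical.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


print(bidi_classes("\u05e9\u05dc\u05d5\u05dd"))  # four Hebrew letters: ['R', 'R', 'R', 'R']
print(bidi_classes("42"))                         # European numbers: ['EN', 'EN']
print(bidi_classes("abc"))                        # Latin letters: ['L', 'L', 'L']
```

Class R marks right-to-left characters, L marks left-to-right characters, and EN marks European digits, which keep their left-to-right order inside right-to-left text.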

Hindi Text

Hindi text is written in a script called Devanagari, which means "the writing of the gods". Hindi is a phonetic language, and is written as a series of syllables. Each syllable is built up of alphabetic pieces (the Devanagari characters) of three types: consonant letters, independent vowels, and dependent vowel signs. The syllable itself consists of a consonant and vowel core, with an optional preceding consonant. Unlike English letters, which sit on a baseline, Devanagari characters hang from a horizontal line (called the head stroke) written at the top of the characters. These characters can combine or change shape depending on their context. Like Hebrew, Hindi text makes no distinction between uppercase and lowercase letters.
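The three kinds of pieces can be seen in the Unicode character database. The sketch below lists the code point, name, and general category of each character; the function name describe is hypothetical.

```python
import unicodedata


def describe(text):
    """List (code point, name, general category) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
            for ch in text]


syllable = "\u0915\u093f"  # consonant letter KA followed by dependent vowel sign I
for row in describe(syllable):
    print(row)

print(describe("\u0905"))  # an independent vowel, used when no consonant precedes it
```

The consonant and the independent vowel are letters (category Lo), while the dependent vowel sign is a combining mark that attaches to the consonant, so the two stored characters of the syllable are displayed as a single unit.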

Keyboard Differences

Not all characters on the U.S. keyboard appear on other keyboards. Similarly, other keyboards often contain many characters not visible on the U.S. keyboard.


Note –

On SPARC™ machines, the Compose key can be used to produce any Latin character with a diacritic in any of the supported ISO8859 character sets.

The Compose key can be used with Latin-based locales, but not with Korean, Chinese, or Japanese locales, except the UTF-8 locales.

Any keyboard can be used to input characters from any locale because input is handled by the Solaris operating environment.


Differences in Paper Sizes

Within each country, a small number of paper sizes are commonly used. Normally, one of those sizes is much more common than the others. Most countries follow ISO Standard 216: “Writing paper and certain classes of printed matter - Trimmed sizes - A and B series.”

Internationalized applications should not make assumptions about the page sizes available to them. The Solaris system provides no support for tracking output page size. Tracking this is the responsibility of the application program. The following table shows common international page sizes.

Table 1–7 Common International Page Sizes

Paper Type     Dimensions                 Countries
ISO A4         21.0 cm by 29.7 cm         Everywhere except U.S.
ISO A5         14.8 cm by 21.0 cm         Everywhere except U.S.
JIS B4         25.9 cm by 36.65 cm        Japan
JIS B5         18.36 cm by 25.9 cm        Japan
U.S. Letter    8.5 inches by 11 inches    U.S. and Canada
U.S. Legal     8.5 inches by 14 inches    U.S. and Canada
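Because tracking the page size is left to the application, a program might keep its own table of sizes in a single unit and check layouts against it. The sketch below is one possible arrangement; the names are hypothetical and only the ISO and U.S. rows of Table 1–7 are included.

```python
CM_PER_INCH = 2.54

# Width and height in centimeters.
PAGE_SIZES_CM = {
    "ISO A4": (21.0, 29.7),
    "ISO A5": (14.8, 21.0),
    "U.S. Letter": (8.5 * CM_PER_INCH, 11 * CM_PER_INCH),
    "U.S. Legal": (8.5 * CM_PER_INCH, 14 * CM_PER_INCH),
}


def fits(content_cm, page_name):
    """Report whether a (width, height) layout fits on the named page."""
    width, height = PAGE_SIZES_CM[page_name]
    return content_cm[0] <= width and content_cm[1] <= height


tall_layout = (21.0, 29.0)  # designed for A4
wide_layout = (21.5, 27.0)  # designed for U.S. Letter

print(fits(tall_layout, "ISO A4"), fits(tall_layout, "U.S. Letter"))  # True False
print(fits(wide_layout, "ISO A4"), fits(wide_layout, "U.S. Letter"))  # False True
```

A4 is taller and narrower than U.S. Letter, so a layout sized exactly for one can be clipped on the other.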