International Language Environments Guide

Chapter 1 Solaris Internationalization Overview

The Solaris 8 product includes full Unicode 3.0 support, as defined in ISO-10646, for selected locales. The Solaris 8 release is a major release for Sun's international markets. It includes a number of new features. All partial locales including multibyte locales such as Japanese locales are now available on the Base Solaris 8 product.

New Internationalization and Localization Features in Solaris 8

Simplified Chinese UTF-8 locale. This provides broader support for Unicode with the addition of new UTF-8 locales. Unicode is often used in a mixed script environment, where it is necessary to display text from multiple languages in a single environment.
Traditional Chinese UTF-8 locale
Asian printing enhancements
Support for 90 locales on the base Solaris CD. This is a new packaging approach to universal language coverage.
Enhanced Sdtudctool: support for migration of UDC (User Defined Character) data from Microsoft Windows. Localized for all Asian locales.
Three additional locales have been added for Iceland (ISO8859-1), U.S.A. (ISO8859-15), and Russia (ANSI1251). The new U.S.A. locale adds support for the euro currency glyph. The new Russian locale is in addition to the existing ISO8859-5 and KOI8-R locales. It provides native Microsoft data encoding support. The new ISO8859-1 locale for Iceland marks the introduction of Icelandic support to the Solaris environment.
Customer-extensible codeset conversion. New codeset conversion can be added by using the geniconvtbl utility. Existing codeset conversions can be modified.
European locale repackaging
Euro font
Additional Japanese iconv modules: conversions for IBM mainframe codesets, and conversions between Unicode and the Microsoft Shift-JIS codeset.
Euro currency. All foreign exchange, banking, and finance industries in the European community are converting from using their local currencies to using the Euro. Euro currency support has been enhanced in the Solaris 8 environment with the addition of U.S. and Estonian ISO8859-15 locales.
Multibyte partial locales: the framework for multibyte locale support is included in the Base Solaris product.
Enhanced Unicode iconv modules. The iconv module has been enhanced for various Unicode encoding formats and international and de facto industry standard codesets.
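
The idea behind the codeset conversions in this list can be illustrated outside of iconv itself. The following is a minimal sketch in Python, not the Solaris iconv or geniconvtbl interface. It relies on Python's built-in codecs, the helper name convert_codeset is hypothetical, and it assumes that the ANSI1251 codeset corresponds to Python's cp1251 codec.

```python
def convert_codeset(data: bytes, from_code: str, to_code: str) -> bytes:
    """Re-encode bytes from one codeset to another, as an iconv-style filter would."""
    # Decode to Unicode first, then encode into the target codeset.
    # errors="strict" raises an exception if a character cannot be converted.
    text = data.decode(from_code, errors="strict")
    return text.encode(to_code, errors="strict")


if __name__ == "__main__":
    # Russian text stored in KOI8-R, converted to the cp1251 codeset.
    koi8_data = "Привет".encode("koi8_r")
    print(convert_codeset(koi8_data, "koi8_r", "cp1251").hex())

    # The euro sign has a code point in ISO8859-15 but not in ISO8859-1.
    print("€".encode("iso8859_15").hex())
```

A conversion fails with an exception when the target codeset has no representation for a character, which is the case for the euro sign in ISO8859-1.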

Internationalization and Localization Defined

Internationalization is the process of making software portable between languages or regions, while localization is the process of adapting software for specific languages or regions. International software can be developed using interfaces that modify program behavior at runtime in accordance with specific cultural requirements. Localization involves establishing online information to support a language or region, called a locale.

Unlike software that must be completely rewritten before it can work with different native languages and customs, internationalized software does not require rewriting. It can be ported from one locale to another without change. The Solaris system is internationalized, providing the infrastructure and interfaces you need to create internationalized software.

Internationalization and localization are different procedures.

Internationalization is the process of making software that is independent of any locale. It can then be adapted to specific locales.

Basic Steps in Internationalization

An internationalized application's executable image is portable between languages and regions. To internationalize software, you should:

Use the interfaces described in this book to create software with an environment that can be modified dynamically without the necessity of recompiling the software.
Divide the software into executable code and messages. The messages include all printable and displayable messages that the user sees. Keep the message strings in a message catalog.
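
The separation of executable code and messages can be sketched as follows. This is an illustration that uses Python's gettext module rather than the Solaris C interfaces described in this book. The catalog domain "myapp" and the "locale" directory are hypothetical names chosen for the example.

```python
import gettext

# "myapp" and "locale" are hypothetical names for this sketch.
# With fallback=True, a missing catalog yields the untranslated strings
# instead of raising an error.
catalog = gettext.translation("myapp", localedir="locale", fallback=True)
_ = catalog.gettext


def report_missing(filename: str) -> str:
    # The message text is looked up in the catalog. The file name is data,
    # so it is substituted only after the lookup.
    return _("Cannot open file: %s") % filename


if __name__ == "__main__":
    print(report_missing("report.txt"))
```

Because the program refers to messages only through the catalog lookup, a translator can supply a new catalog without any change to the code.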

Message strings are translated for a language and a region. A locale includes the message strings and methods to specify sorting.

A locale is not the same as a language. A language can be used in several regions. For example, French is spoken in France and Canada, but each country has different ways of displaying monetary and time information.

To use a localized version of a product, the user sets the locale environment variables, such as LANG. The product then displays user messages in their translated form. Date, time, currency, and other information is formatted and displayed according to locale-specific conventions.

What Is a Locale?

A locale name is composed of a base language, the country (territory) of use, and optionally a codeset (which is usually assumed). For example, German is de, an abbreviation for Deutsch, while Swiss German is de_CH, CH being an abbreviation for Confoederatio Helvetica. This convention allows for specific differences by country, such as currency unit notation.

Note -

More than one locale can be associated with a particular language, which allows for regional differences. For example, an English-speaking user in the United States can select the en_US locale (English for the United States), while an English-speaking user in Great Britain can select en_GB (English for Great Britain).
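
Locale names such as de_CH, en_US, and de_DE.ISO8859-15 can be split into their parts mechanically. The following sketch assumes names of the form language[_territory][.codeset] only. The function and type names are hypothetical, and any other name syntax is ignored.

```python
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]


def parse_locale_name(name: str) -> LocaleName:
    """Split a name of the form language[_territory][.codeset]."""
    # Everything after the first "." is the codeset, if present.
    base, _, codeset = name.partition(".")
    # Everything after the first "_" in the remainder is the territory.
    language, _, territory = base.partition("_")
    return LocaleName(language, territory or None, codeset or None)


if __name__ == "__main__":
    for name in ("de", "de_CH", "en_GB", "de_DE.ISO8859-15"):
        print(name, parse_locale_name(name))
```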

The key concept for application programs is that of a program's locale. The locale is an explicit model and definition of a native-language environment. The notion of a locale is explicitly defined and included in the library definitions of the ANSI C Language standard.

The locale consists of a number of categories for which there is country-dependent formatting or other specifications. A program's locale defines its codesets, date and time formatting conventions, monetary conventions, decimal formatting conventions, and collation (sort) order.

Generally the locale name is specified by the LANG environment variable. Locale categories are subordinate to LANG, but can be set separately, in which case they override LANG. If LC_ALL is set, it overrides not only LANG, but all the separate locale categories as well.
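
The order of precedence among LC_ALL, the individual categories, and LANG can be written out as a short lookup. This is a sketch of the rule as stated above, not of the system's own implementation. The function name effective_locale is hypothetical, and the fallback to the "C" locale when nothing is set is an assumption based on standard C library behavior.

```python
import os
from typing import Mapping, Optional


def effective_locale(category: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the locale name that applies to one category, such as "LC_TIME"."""
    env = os.environ if env is None else env
    # Order of precedence: LC_ALL, then the specific category, then LANG.
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # assumed default when no variable is set


if __name__ == "__main__":
    settings = {"LANG": "de_DE.ISO8859-1", "LC_TIME": "fr_FR.ISO8859-1"}
    print(effective_locale("LC_TIME", settings))
    print(effective_locale("LC_NUMERIC", settings))
```

In this usage example, LC_TIME resolves to the French locale because the category is set separately, while LC_NUMERIC falls back to the value of LANG.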

Full and Partial Locales

A full Solaris locale has all of the listed functions and the localized system messages in the relevant language. If no localized messages are installed, then all locales would be classified as "partial locales". Several locales in the Solaris environment are capable of displaying localized messages, provided that the relevant language support is installed. For example, there are several locales which can use German messages:

de_DE.ISO8859-1
de_DE.ISO8859-15
de_DE.UTF-8
de_AT.ISO8859-1
de_AT.ISO8859-15
de_CH.ISO8859-1

When the German message translations are installed using the Language CD, all of the above locales become "full locales", because they have access to a fully translated desktop. The Language CD contains message translations for the following languages:

German
French
Spanish
Swedish
Italian
Japanese
Korean
Simplified Chinese
Traditional Chinese

All partial locales are also available in the base product, but message translations are available only in the multilingual Solaris product.

Cultural Conventions

Different cultures use different conventions for writing the date, the time, numbers, currency, delimiting words and phrases, and quoting material.

A locale defines the behavior of a program at runtime according to a language or cultural region's conventions. Throughout the system, a locale determines the behavior of the following:

Encoding and processing of text data
Identifying the language and encoding of resource files and their text values
Rendering and layout of text strings
Interchanging text that is used for interclient text communication
Encoding and decoding for interclient text communication
Selecting the input method (that is, which codeset is generated) and the processing of text data
Font and icon files that are culturally specific
Actions and file types
User Interface Definition (UID) files
Date and time formats
Numeric formats
Monetary formats
Collation order
Format for informative and diagnostic messages and interactive responses

The Solaris environment separates language and culture-dependent information from the application and saves it outside the application.

By separating the language and culture-dependent information from the application, the developer does not need to translate, rewrite, or recompile the application for each market. The only requirement to enter a new market is to localize the external information to the local language and customs.

Locale Categories

The locale categories are as follows:

LC_CTYPE

Controls the behavior of character handling functions.

LC_TIME

Specifies date and time formats, including month names, days of the week, and common full and abbreviated representations.

LC_MONETARY

Specifies monetary formats. Few SunOS system commands or library routines actually use this category.

LC_NUMERIC

Specifies the decimal separator (or radix character) and the thousands separator.

LC_COLLATE

Specifies the sorting order for a locale and the string conversions required to attain this ordering.

LC_MESSAGES

Specifies the language in which the localized messages are written.

LO_LTYPE

Specifies the layout engine that provides information about language rendering. Language rendering (or text rendering) consists of text shaping and directionality.
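
A program can query which locale is in effect for each category. The following sketch uses Python's locale module, which wraps the C library setlocale() and localeconv() functions. The helper name current_settings is hypothetical. LC_MESSAGES is not available on every platform, and LO_LTYPE has no counterpart in this module, so the sketch skips any category that is missing.

```python
import locale

CATEGORIES = [
    "LC_CTYPE", "LC_TIME", "LC_MONETARY",
    "LC_NUMERIC", "LC_COLLATE", "LC_MESSAGES",
]


def current_settings() -> dict:
    """Map each available category name to the locale currently in effect."""
    settings = {}
    for name in CATEGORIES:
        category = getattr(locale, name, None)
        if category is not None:
            # Calling setlocale() without a locale argument only queries.
            settings[name] = locale.setlocale(category)
    return settings


if __name__ == "__main__":
    # Select the "C" locale explicitly so the output is predictable.
    locale.setlocale(locale.LC_ALL, "C")
    for name, value in current_settings().items():
        print(name, value)
    # LC_NUMERIC determines the radix character reported by localeconv().
    print(locale.localeconv()["decimal_point"])
```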

Using Locale Categories for Localization

The localization of a product should be done in consultation with native users of the target language or region. Certain styles and information formats might seem perfectly obvious and universal to the developer, but to the user they can look awkward, wrong, or even offensive. The following pages describe the elements that the Solaris operating environment allows you to control and specify so that you can successfully internationalize your product.

Time Formats

Table 1-1 shows some of the ways to write 11:59 P.M.

Table 1-1 International Time Formats


Locale	Format
Canadian	23:59
Finnish	23.59
German	23.59 Uhr
Norwegian	Kl 23.59
Thai	11:59 PM
U.K.	11.59 PM

Time is written using either a 12-hour clock or a 24-hour clock. The hour and minute separator can be either a colon ( : ) or a period ( . ).
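
The variations in Table 1-1 can be expressed as format patterns. The patterns below are hypothetical approximations written for this example. An internationalized program would normally obtain the format from the LC_TIME category instead of hard-coding a table like this one.

```python
import datetime

# Hypothetical strftime patterns that approximate the rows of Table 1-1.
TIME_PATTERNS = {
    "Canadian": "%H:%M",
    "Finnish": "%H.%M",
    "German": "%H.%M Uhr",
    "Norwegian": "Kl %H.%M",
    "Thai": "%I:%M %p",   # %p is taken from LC_TIME; "PM" in the C locale
    "U.K.": "%I.%M %p",
}


def format_time(when: datetime.time, convention: str) -> str:
    """Format a time of day according to one of the table's conventions."""
    return when.strftime(TIME_PATTERNS[convention])


if __name__ == "__main__":
    late = datetime.time(23, 59)
    for convention in TIME_PATTERNS:
        print(convention, format_time(late, convention))
```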

Time zone splits occur between and within countries. Although a time zone can be described in terms of how many hours it is ahead of, or behind, Greenwich Mean Time (GMT), this number is not always an integer. For example, Newfoundland is in a time zone that is half an hour different from the adjacent time zone.

Daylight Saving Time (DST) starts and ends on dates that vary from country to country.

Date Formats

Table 1-2 shows some of the date formats used around the world. Notice that even within a country, there can be variations.

Table 1-2 International Date Formats


Locale	Convention	Example
Canadian (English and French)	yyyy-mm-dd	1998-08-13
Danish	yyyy-mm-dd	1999-08-24
Finnish	dd.mm.yyyy	13.08.1998
French	dd/mm/yyyy	13/08/1999
German	yyyy-mm-dd	1999-09-18
Italian	dd.mm.yy	13.08.98
Norwegian	dd.mm.yy	13.08.98
Spanish	dd-mm-yy	13-08-98
Swedish	yyyy-mm-dd	1998-08-13
GB-English	dd/mm/yy	13/08/98
US-English	mm-dd-yy	08-13-98
Thai	dd/mm/yyyy	10/12/2009
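
Because the conventions in Table 1-2 reorder the same fields, a date string cannot be interpreted without knowing which convention produced it. The following sketch parses one string under two of the conventions from the table. The pattern constants and the function name are hypothetical.

```python
import datetime

# Patterns corresponding to two rows of Table 1-2.
US_ENGLISH = "%m-%d-%y"   # mm-dd-yy
SPANISH = "%d-%m-%y"      # dd-mm-yy


def parse_date(text: str, pattern: str) -> datetime.date:
    """Parse a date string according to an explicit convention."""
    return datetime.datetime.strptime(text, pattern).date()


if __name__ == "__main__":
    ambiguous = "03-04-98"
    print(parse_date(ambiguous, US_ENGLISH))  # read as month 3, day 4
    print(parse_date(ambiguous, SPANISH))     # read as day 3, month 4
```

The same characters yield two dates a month apart, which is why exchanged data is safer in a single agreed format.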

Numbers

Decimal and Thousands Separators

Great Britain and the United States are two of the few places in the world that use a period to indicate the decimal place. Many other countries use a comma instead. The decimal separator is also called the radix character. Likewise, while the U.K. and U.S. use a comma to separate groups of thousands, many other countries use a period instead, and some countries separate thousands groups with a thin space. Table 1-3 shows some commonly used numeric formats.

Table 1-3 International Numeric Conventions


Locale	Large Number
Canadian (English and French)	4 294 967 295,000
Danish	4 294 967 295,000
Finnish	4 294 967 295,000
French	4 294 967 295,000
German	4 294 967 295,000
Italian	4.294.967.295,000
Norwegian	4.294.967.295,000
Spanish	4.294.967.295,000
Swedish	4 294 967 295,000
GB-English	4,294,967,295.00
US-English	4,294,967,295.00
Thai	4,294,967,295.00

Data files containing locale-specific formats are frequently misinterpreted when transferred to a system in a different locale. For example, a file containing numbers in a French format is not useful to a U.K.-specific program.
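
The following sketch shows why a number written in one convention must be parsed with that same convention. The separators are passed explicitly for illustration and the function names are hypothetical. In a real program the separators would come from the LC_NUMERIC category.

```python
from decimal import Decimal


def format_number(value: Decimal, thousands: str, radix: str, places: int = 2) -> str:
    """Format a number with the given thousands separator and radix character."""
    # Format with "," and "." first, then substitute the requested separators.
    text = f"{value:,.{places}f}"
    integer_part, _, fraction = text.partition(".")
    grouped = integer_part.replace(",", thousands)
    return grouped + (radix + fraction if fraction else "")


def parse_number(text: str, thousands: str, radix: str) -> Decimal:
    """Parse a number that was written with the given separators."""
    return Decimal(text.replace(thousands, "").replace(radix, "."))


if __name__ == "__main__":
    value = Decimal("4294967295")
    print(format_number(value, ",", "."))   # U.K. and U.S. style
    print(format_number(value, ".", ","))   # Italian style
    print(parse_number("1.234,56", ".", ","))
```

Reading the string 1.234,56 with U.K. separators would either fail or produce a different value, which is the misinterpretation described above.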

List Separators

There are no particular locale conventions that specify how to separate numbers in a list. They are sometimes comma-delimited in Great Britain and the U.S., but often spaces and semicolons are used.

Currency

Currency units and presentation order vary greatly around the world. Table 1-4 shows monetary formats in some countries.

Table 1-4 International Monetary Conventions


Locale	Currency	Example
Canadian (English)	Dollar ($)	$1 234.56
Canadian (French)	Dollar ($)	1 234.56$
Danish	Kroner (kr)	kr 1.234,56
Finnish	Markka (mk)	1.234,56 mk
French	Franc (F)	1 234,56 F
German	Deutsche Mark (DM)	DM 1.234,56
Italian	Lira (L)	L1.234,56
Japanese	Yen (¥)	¥1,234
Norwegian	Krone (kr)	kr 1.234,56
Spanish	Peseta (Pts)	1.234,56Pts
Swedish	Krona (Kr)	1.234,56 kr
GB-English	Pound (£)	£1,234.56
US-English	Dollar ($)	$1,234.56
Thai	Baht	2539 Baht
Euro	EUR	400,00

Note -

Local and international symbols for currency can differ. For example, the designation for the French franc is "F" in France, but this is often written as "FRF" internationally to distinguish it from other francs, such as the Swiss franc or the Polynesian franc.

Note -

Euro locales are based on the ISO8859-15 character set. See "European Localization" for available locales.

Be aware also that a converted currency amount can take up more or less space than the original amount. To illustrate: $1,000 can become L1.307.000.
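
Monetary formatting combines the numeric separators with a currency symbol and its placement. The following sketch encodes four rows of Table 1-4 by hand. The MoneyStyle type and its fields are hypothetical. In a real program these values would come from the LC_MONETARY category.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class MoneyStyle:
    symbol: str
    symbol_first: bool   # True if the symbol precedes the amount
    space: bool          # True if a space separates symbol and amount
    thousands: str
    radix: str


# Hand-written styles that follow four rows of Table 1-4.
STYLES = {
    "German": MoneyStyle("DM", True, True, ".", ","),
    "French": MoneyStyle("F", False, True, " ", ","),
    "Spanish": MoneyStyle("Pts", False, False, ".", ","),
    "US-English": MoneyStyle("$", True, False, ",", "."),
}


def format_money(amount: Decimal, style: MoneyStyle) -> str:
    digits = f"{amount:,.2f}"
    # Swap separators through a placeholder so that "," and "." do not collide.
    digits = digits.replace(",", "\0").replace(".", style.radix).replace("\0", style.thousands)
    gap = " " if style.space else ""
    if style.symbol_first:
        return f"{style.symbol}{gap}{digits}"
    return f"{digits}{gap}{style.symbol}"


if __name__ == "__main__":
    for name, style in STYLES.items():
        text = format_money(Decimal("1234.56"), style)
        # The display width differs by convention, so field sizes must not be fixed.
        print(name, text, len(text))
```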

Language Word and Letter Differences

Word Delimiters

In English, words are separated by a space character. In languages such as Chinese, Japanese and Thai, however, there is often no delimiter between words.

Sort Order

Sorting order for particular characters is not the same in all languages. For example, the character "ö" sorts with the ordinary "o" in Germany, but sorts separately in Sweden, where it is the last letter of the alphabet. In some languages, characters have weight to determine the priority of the character sequences. For example, in Thai, the Thai dictionary defines sorting through the sequences of characters that have different weights.
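
The difference between the German and Swedish treatment of "ö" can be shown with two hand-written sort keys. These keys are simplified assumptions made for this example and are not the collation tables of any Solaris locale. A real program would rely on the LC_COLLATE category, for example through strcoll() or strxfrm().

```python
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

# German dictionary-style folding: umlauts sort with their base vowels.
GERMAN_FOLDING = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def swedish_key(word: str) -> list:
    """Rank each letter by its position in the Swedish alphabet."""
    key = []
    for ch in word.lower():
        position = SWEDISH_ALPHABET.find(ch)
        # Characters outside the alphabet are placed after all letters.
        key.append(position if position >= 0 else len(SWEDISH_ALPHABET) + ord(ch))
    return key


def german_key(word: str) -> tuple:
    """Compare words with umlauts folded; the original word breaks ties."""
    return (word.lower().translate(GERMAN_FOLDING), word)


if __name__ == "__main__":
    words = ["zebra", "öl", "ost"]
    print([ascii(w) for w in sorted(words, key=german_key)])
    print([ascii(w) for w in sorted(words, key=swedish_key)])
```

With the German key the word beginning with "ö" sorts among the words beginning with "o". With the Swedish key it sorts after "zebra".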

Character Sets

Number of Characters

While the English alphabet contains only 26 characters, some languages contain many more characters. Japanese, for example, can contain over 40,000 characters, Chinese even more.

Western European Alphabets

The alphabets of most western European countries are similar to the standard 26-character alphabet used in English-speaking countries, but there are often some additional basic characters, some marked (or accented) characters, and some ligatures.

Japanese Text

Japanese text is composed of three different scripts mixed together: Kanji ideographs derived from Chinese, and two phonetic scripts (or syllabaries), Hiragana and Katakana.

Although each character in Hiragana has an equivalent in Katakana, Hiragana is the most common script, with cursive rather than block-like letter forms. Kanji characters are used to write root words. Katakana is mostly used to represent "foreign" words--words "imported" from languages other than Japanese.

Kanji has tens of thousands of characters, but the number commonly used has been declining steadily over the years. Now only about 3500 are frequently used, although the average Japanese writer has a vocabulary of about 2000 Kanji characters. Nonetheless, computer systems must support more than 7000 because that is what the Japanese Industrial Standard (JIS) requires. In addition, there are about 170 Hiragana and Katakana characters. On average 55% of Japanese text is Hiragana, 35% Kanji, and 10% Katakana. Arabic numerals and Roman letters are also present in Japanese text.

Although it is possible to completely avoid the use of Kanji, most Japanese readers find text containing Kanji easier to understand.
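
The mix of scripts in a piece of Japanese text can be measured by classifying each character. The following sketch assumes Unicode text and uses the Unicode Hiragana, Katakana, and CJK Unified Ideographs block ranges. The function names are hypothetical, and characters outside these three blocks are counted as "Other".

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by its Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "Kanji"
    return "Other"


def script_counts(text: str) -> Counter:
    """Count how many characters of each script a text contains."""
    return Counter(script_of(ch) for ch in text)


if __name__ == "__main__":
    sample = "私はコーヒーを飲みます"
    for script, count in sorted(script_counts(sample).items()):
        print(script, count)
```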

Korean Text

Korean text can be written using a phonetic writing system called Hangul. Hangul has more than 11,000 characters, each formed from one of 19 initial consonants, one of 21 vowels, and optionally one of 27 final consonants. Of these, about 3,000 Hangul characters are commonly used in Korean computer systems. Korean also uses ideographs based on the set invented in China, called Hanja. Korean text requires over 6,000 Hanja characters. Hanja is used mostly to avoid confusion when Hangul would be ambiguous. Hangul characters are formed by combining consonants and vowels into a single syllable, and each syllable is one Hangul character. The letters of a syllable are often arranged in a square, so that the group takes up the same space as a Hanja character. Arabic numerals, Roman letters, and special symbol characters are also present in Korean text.
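
The count of more than 11,000 Hangul characters follows from the combinations of consonants and vowels: 19 x 21 x 28 = 11,172, where 28 is the 27 optional final consonants plus the case of no final consonant. The following sketch uses the arithmetic that Unicode defines for its Hangul Syllables block, which starts at U+AC00. The function names are hypothetical, and the index arguments follow the Unicode ordering of the letters.

```python
HANGUL_BASE = 0xAC00
INITIALS = 19   # initial consonants
VOWELS = 21
FINALS = 28     # 27 final consonants plus "no final consonant"
SYLLABLE_COUNT = INITIALS * VOWELS * FINALS


def compose_hangul(initial: int, vowel: int, final: int = 0) -> str:
    """Build one Hangul syllable from the indexes of its letters."""
    if not (0 <= initial < INITIALS and 0 <= vowel < VOWELS and 0 <= final < FINALS):
        raise ValueError("letter index out of range")
    return chr(HANGUL_BASE + (initial * VOWELS + vowel) * FINALS + final)


def decompose_hangul(syllable: str) -> tuple:
    """Recover the letter indexes from one Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < SYLLABLE_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(offset, VOWELS * FINALS)
    vowel, final = divmod(rest, FINALS)
    return initial, vowel, final


if __name__ == "__main__":
    print(SYLLABLE_COUNT)
    print(hex(ord(compose_hangul(18, 0, 4))))
```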

Thai Text

A Thai character can be defined as a column position on a display screen with four display cells. Each column position can have up to three characters. The composition of a display cell is based on the Thai character's classification. Some Thai characters can be composed with another character's classification. If they can be composed together, both characters are in the same cell. Otherwise, they are in separate cells.

Chinese Text

Chinese usually consists entirely of characters from the ideographic script called Hanzi. In the People's Republic of China (PRC) there are about 7000 commonly used Hanzi characters in GB2312 (zh locale) and more than 20,000 characters in the GBK (zh.GBK) locale. In Taiwan, current standards require more than 13000 characters; 6000 others have been recently standardized but are considered rare.

If a character is not a root character, it usually consists of two or more parts, two being most common. In two-part characters, one part generally represents meaning, and the other represents pronunciation. Occasionally both parts represent meaning. The radical is the most important element, and characters are traditionally arranged by radical, of which there are several hundred. A single sound can be represented by many different characters, which are not interchangeable in usage. A single character can have different sounds.

Some characters are more appropriate than others in a given context--the appropriate one is distinguished phonetically by the use of tones. By contrast, spoken Japanese and Korean lack tones.

Several phonetic systems represent Chinese. In the People's Republic of China the most common is pinyin, which uses roman characters and is widely employed in the West for place names such as Beijing. The Wade-Giles system is an older phonetic system, formerly used for place names such as Peking. In Taiwan zhuyin (or bopomofo), a phonetic alphabet with unique letter forms, is often used instead.

Commercial applications, particularly those that deal with people's names, need to consider the impact of codeset expansion. Many Chinese people have names containing characters that do not exist in any standard codeset. You need to provide space in unassigned codesets to deal with this issue.

Keyboard Differences

Not all characters on the U.S. keyboard appear on other keyboards. Similarly, other keyboards often contain many characters not visible on the U.S. keyboard.

Note -

However, on SPARC machines, the Compose key can be used to produce any character in the ISO Latin-1 codeset on any keyboard that supports it.

Note -

The Compose key can be used with English or European locales, but not with Korean, Chinese, or Japanese locales, except the UTF-8 locales.

Other Differences

Paper Sizes

Within each country a small number of paper sizes are commonly used, normally with one of those sizes being much more common than the others. Most countries follow ISO Standard 216: "Writing paper and certain classes of printed matter--Trimmed sizes--A and B series."

Internationalized applications should not make assumptions about the page sizes available to them. The Solaris system provides no support for tracking output page size; this is the responsibility of the application program. Table 1-5 shows common international page sizes.

Table 1-5 Common International Page Sizes


Paper Type	Dimensions	Countries
ISO A4	21.0 cm by 29.7 cm	Everywhere except U.S.
ISO A5	14.8 cm by 21.0 cm	Everywhere except U.S.
JIS B4	25.7 cm by 36.4 cm	Japan
JIS B5	18.2 cm by 25.7 cm	Japan
U.S. Letter	8.5 inches by 11 inches	U.S. and Canada
U.S. Legal	8.5 inches by 14 inches	U.S. and Canada
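
Because the application is responsible for tracking page size, it helps to compute dimensions instead of assuming them. The following sketch derives the ISO 216 A-series sizes from A0 (841 mm by 1189 mm) by repeated halving, and converts the U.S. sizes to millimeters. The function names are hypothetical.

```python
def iso_a_size(n: int) -> tuple:
    """Return (width, height) in millimeters of ISO 216 size A<n>."""
    if n < 0:
        raise ValueError("size number must not be negative")
    width, height = 841, 1189  # A0
    for _ in range(n):
        # Halve the long side, rounding down. The old short side
        # becomes the new long side.
        width, height = height // 2, width
    return width, height


def inches_to_mm(inches: float) -> float:
    """Convert inches to millimeters (1 inch is exactly 25.4 mm)."""
    return inches * 25.4


if __name__ == "__main__":
    print("A4", iso_a_size(4))
    print("A5", iso_a_size(5))
    print("Letter", round(inches_to_mm(8.5), 1), round(inches_to_mm(11), 1))
```

The results for A4 and A5 agree with the dimensions given in Table 1-5.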

Creating Worldwide Software: The Book

The book Creating Worldwide Software, 2nd edition, by Bill Tuthill and David Smallberg (SunSoft Press, 1997), is a guide to localizing for the Solaris platform. The book is recommended for developers who work with the Solaris system. See "Related Books and Sites" for a full citation.