Asian-Language Support in the Solaris Operating Environment

Chapter 5 Common Development Issues

Writing systems and cultural conventions, such as character sets and numeric, time, date, and monetary formats, vary greatly from one locale to another. Some issues apply particularly to multibyte development.

5.1 Casing

The distinction between uppercase and lowercase does not exist in every language. For example, ideographs have no case. An application that calls a case-conversion API should therefore not treat a character that comes back unchanged as an error.

The following APIs handle case and character class for wide characters:

towupper(): Convert a wide character to uppercase
towlower(): Convert a wide character to lowercase
wctype(): Look up a character class by name, for use with iswctype()
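
The point about unchanged characters can be sketched as follows. This is a minimal illustration, not code from the Solaris documentation: the function name upcase_wide() is hypothetical, and it assumes the standard towupper() interface from <wctype.h>. It converts a wide-character string in place and reports how many characters changed, treating characters without a case mapping as normal.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercase a wide-character string in place.
 * Returns the number of characters whose value actually changed.
 * Characters without a case mapping (ideographs, digits, punctuation)
 * are left untouched; that is expected behavior, not an error. */
size_t upcase_wide(wchar_t *ws)
{
    size_t changed = 0;

    for (; *ws != L'\0'; ws++) {
        wint_t upper = towupper((wint_t)*ws);
        if (upper != (wint_t)*ws) {
            *ws = (wchar_t)upper;
            changed++;
        }
    }
    return changed;
}
```

A caller would normally invoke setlocale(LC_ALL, "") once at startup so that the case mapping follows the user's locale. A return value of zero simply means the string contained no characters with an uppercase form.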

5.2 Sort Order

Sorting conventions vary widely across languages and locales. Some languages even have more than one set of rules for collating the same characters. Sorting ideographs differs from sorting phonetic scripts and is based on either the form or the pronunciation of the characters.

A form-based system sorts first on the character's primary radical and then on the number of strokes needed to write the character (stroke count). A pronunciation-based system sorts first on ideograph pronunciation and then on stroke count.
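
Because the ordering rules belong to the locale, an application generally should not compare strings byte by byte. The following is a minimal sketch, assuming the standard C strcoll() and qsort() interfaces; the helper names compare_by_locale() and sort_by_locale() are hypothetical. Whether the resulting order is form-based or pronunciation-based is decided by the locale's collation data, not by this code.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: orders two multibyte strings by the collation
 * rules of the current LC_COLLATE locale rather than by byte value. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort an array of multibyte strings in the
 * collation order of the current locale. */
void sort_by_locale(const char **strings, size_t count)
{
    qsort(strings, count, sizeof strings[0], compare_by_locale);
}
```

The caller is assumed to have called setlocale(LC_ALL, "") beforehand. Without that call the program runs in the "C" locale, where strcoll() orders strings the same way strcmp() does.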

5.3 Text Manipulation

When supporting multibyte languages, it is important to understand the difference between multibyte, wide and Unicode characters, and the impact of these on software development.

In the Solaris operating environment, a multibyte character (or file code) is a sequence of one or more bytes, and a multibyte string is a sequence of such characters terminated by a null byte. Thus, a string may contain characters of different lengths. A wide character (or process code), on the other hand, has a fixed size. In the Solaris operating environment, a wide character is defined to be four bytes long. The Solaris operating environment also supports the Unicode UTF-8 format, a variable-length encoding similar to other multibyte encodings.

In many cases, there is no need to distinguish double-byte (or three-byte) characters from single-byte characters. It is simpler to convert multibyte strings (file code) to wide-character formats (process code) before manipulating or processing text data.

The following APIs convert multibyte characters:

mbstowcs(): Convert multibyte string to wide-character string
mbtowc(): Convert multibyte character to wide-character code
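
The conversion from file code to process code can be sketched with mbstowcs(). This is an illustrative sketch under stated assumptions: the function name to_process_code() is hypothetical, and the buffer sizing relies on the fact that a multibyte string never contains more characters than bytes.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a null-terminated multibyte string
 * (file code) to a newly allocated wide-character string (process code).
 * Returns NULL on allocation failure or if the input contains a byte
 * sequence that is invalid in the current LC_CTYPE locale.
 * The caller releases the result with free(). */
wchar_t *to_process_code(const char *mbs)
{
    /* A multibyte string never holds more characters than bytes, so
     * strlen() + 1 wide characters is always a large enough buffer. */
    size_t capacity = strlen(mbs) + 1;
    wchar_t *ws = malloc(capacity * sizeof *ws);

    if (ws == NULL)
        return NULL;

    if (mbstowcs(ws, mbs, capacity) == (size_t)-1) {
        free(ws);
        return NULL;
    }
    return ws;
}
```

The interpretation of the input bytes depends on the current locale, so the caller is assumed to have called setlocale(LC_ALL, "") before converting text read from a file or the terminal.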

The following wstring(3c) APIs process wide-character strings:

wcscmp(): Compare wide-character strings
wcscpy(): Copy wide-character strings
wcslen(): Get length of wide-character string
wcschr(): Find character in wide-character string
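
Once text is in wide-character form, each array element is one whole character, so searching is straightforward. The sketch below is illustrative only; count_wide_char() is a hypothetical name, and it uses only the standard wcschr() interface listed above.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: count how many times the wide character
 * `target` occurs in `ws`. Because every element of a wide string is
 * one whole character, wcschr() can never match in the middle of a
 * multibyte sequence. */
size_t count_wide_char(const wchar_t *ws, wchar_t target)
{
    size_t count = 0;
    const wchar_t *p;

    if (target == L'\0')
        return 0;   /* the terminator is not counted as content */

    for (p = wcschr(ws, target); p != NULL; p = wcschr(p + 1, target))
        count++;
    return count;
}
```

The same loop written with strchr() on a multibyte string could give false matches in encodings where a byte of a multi-byte character has the same value as a single-byte character.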

Note -

File code is in multibyte format. Process code is in wide-character format. Do not assume particular character encodings of the process code.

5.4 Fonts

Mixed-codeset strings usually cannot be rendered with a single font. A font set is a collection of fonts suitable for rendering all codesets in a locale's encoding, and includes data about the locale in which it was created. In the X11 Window System, this is known as a FontSet. For example, in the Korean locale, both the ASCII and Korean fonts are loaded. The number of fonts in a FontSet and their character-set registries vary from one locale to another. Because the Solaris operating environment manages the FontSet at run time, applications do not need to know that multiple fonts are being used; they only need to use the FontSet interfaces.

Common font family names, such as Times and Courier, are not usually available in multibyte locales. Locale-sensitive font family names should not be hard-coded in applications. In the Common Desktop Environment, all locales have a common set of font alias names, such as dt-application.