International Language Environments Guide

Chapter 2 Internationalization Framework in the Solaris 8 Environment

This section discusses several internationalization features contained in the Solaris 8 environment.

Support for Codeset Independence
Locale database
Process code format (wide character expression)
libw and libintl
ctype macros
genmsg utility

This section also contains information useful for developing internationalized applications such as:

Dynamically linked applications
Solaris 8 internationalized APIs

Support for Codeset Independence

The Solaris 8 operating environment supports non-EUC encodings such as PC-Kanji in Japan, Big-5 in Taiwan, and GBK in the People's Republic of China.

Because a large part of the computer market demands non-EUC codeset support, Solaris 8 provides a solid framework to enable both EUC and non-EUC codeset support. This support is called Codeset Independence, or CSI.

The goal of CSI is to remove EUC dependencies on specific codesets or encoding methods from Solaris OS libraries and commands. The CSI architecture allows the Solaris operating environment to support any UNIX file system safe encoding. CSI supports a number of new codesets, such as UTF-8, PC-Kanji, and Big-5.

The CSI Approach

Codeset Independence enables application and platform software developers to keep their code independent of encoding, such as UTF-8, and also provides the ability to adopt any new encoding without having to modify the source code. This architecture approach differs from Java internationalization in that Java requires applications to be Unicode-dependent and also requires code conversions throughout the application.

Many existing internationalized applications (for example, Motif) automatically inherit CSI support from the underlying system. These applications work in the new locales without modification. However, OPEN LOOK applications that are based on XView or OLIT do not work in the new locales because XView is codeset-dependent.

CSI is inherently independent from any codesets. However, the following assumptions on file code encodings (codesets) still apply to Solaris 8:

File code is a superset of ASCII.
Unicode (16-bit fixed width) cannot be supported as file code.
NULL (0x00) is not part of multibyte characters for support of null-terminated multibyte character strings.
Slash / (0x2f) is not part of multibyte characters for support of the UNIX path names.
Only stateless file code encodings are supported.
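The practical effect of these assumptions can be shown with a minimal sketch. Because the slash (0x2f) never occurs inside a multibyte character in a supported file code, a byte-wise search for the path separator cannot split a character. The function name last_component is hypothetical and used only for illustration.

```c
#include <string.h>

/*
 * Return a pointer to the last component of a path name.
 * The byte-wise search is safe in any supported file code because
 * 0x2f is never part of a multibyte character.
 * (last_component is a hypothetical helper, not a Solaris API.)
 */
const char *last_component(const char *path)
{
    const char *slash = strrchr(path, '/');

    return (slash != NULL) ? slash + 1 : path;
}
```

For example, last_component("/usr/lib/iconv") yields a pointer to "iconv", and a name without a slash is returned unchanged.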

CSI-enabled Commands

Table 2-1 lists the CSI-enabled commands in Solaris 8. The man page for each of these commands notes its CSI capability.

All commands are in the /usr/bin directory, unless otherwise noted.

Table 2-1 CSI-enabled Commands in Solaris 8


`/usr/lib/diffh`	`acctcom`	`gencat`	`script`
`/usr/sbin/accept`	`apropos`	`getopt`	`sdiff`
`/usr/sbin/reject`	`batch`	`getoptcvt`	`settime`
`/usr/ucb/lpr`	`bdiff`	`head`	`sh`
`/usr/xpg4/bin/awk`	`cancel`	`join`	`split`
`/usr/xpg4/bin/cp`	`cat`	`jsh`	`strconf`
`/usr/xpg4/bin/date`	`catman`	`kill`	`strings`
`/usr/xpg4/bin/du`	`chgrp`	`ksh`	`sum`
`/usr/xpg4/bin/ed`	`chmod`	`lp`	`tabs`
`/usr/xpg4/bin/edit`	`chown`	`man`	`tar`
`/usr/xpg4/bin/egrep`	`cmp`	`mkdir`	`tee`
`/usr/xpg4/bin/env`	`col`	`msgfmt`	`touch`
`/usr/xpg4/bin/ex`	`comm`	`news`	`tty`
`/usr/xpg4/bin/expr`	`compress`	`nroff`	`uncompress`
`/usr/xpg4/bin/fgrep`	`cpio`	`pack`	`unexpand`
`/usr/xpg4/bin/grep`	`csh`	`paste`	`uniq`
`/usr/xpg4/bin/ln`	`csplit`	`pcat`	`unpack`
`/usr/xpg4/bin/ls`	`cut`	`pg`	`wc`
`/usr/xpg4/bin/more`	`diff`	`printf`	`whatis`
`/usr/xpg4/bin/mv`	`diff3`	`priocntl`	`write`
`/usr/xpg4/bin/nice`	`disable`	`ps`	`xargs`
`/usr/xpg4/bin/nohup`	`echo`	`pwd`	`zcat`
`/usr/xpg4/bin/od`	`expand`	`rcp`
`/usr/xpg4/bin/pr`	`file`	`red`
`/usr/xpg4/bin/rm`	`find`	`remsh`
`/usr/xpg4/bin/sed`	`fold`	`rksh`
`/usr/xpg4/bin/sort`	`ftp`	`rmdir`
`/usr/xpg4/bin/tail`		`rsh`
`/usr/xpg4/bin/tr`
`/usr/xpg4/bin/vedit`
`/usr/xpg4/bin/vi`
`/usr/xpg4/bin/view`

Solaris 8 CSI-enabled Libraries

Nearly all functions in Solaris 8 libc (/usr/lib/libc.so) are CSI-enabled. However, the following functions in libc are not CSI-enabled because they are EUC-dependent functions:

csetcol() csetlen() euccol()
euclen() eucscol() getwidth()

The following macros are not CSI-enabled because they are EUC dependent:

csetno() wcsetno()

In the Solaris 8 product, libgen (/usr/ccs/lib/libgen.a) is internationalized, but not CSI-enabled.

In the Solaris 8 product, libcurses (/usr/ccs/lib/libcurses.a) is internationalized, but not CSI-enabled.

Here are the five deliverables:

The utility (32-bit application):

/usr/bin/geniconvtbl

Special iconv shared objects:

/usr/lib/iconv/geniconvtbl.so
/usr/lib/iconv/sparcv9/geniconvtbl.so

Sample geniconvtbl(1) input source files and system-provided binary table files:

/usr/lib/iconv/geniconvtbl/srcs/
ISO8859-1_to_ISO646.txt
ISO646_to_ISO8859-1.txt
ISO8859-1_to_UTF-8.txt
UTF-8_to_ISO8859-1.txt
ShiftJIS_to_eucJP.txt
eucJP_to_ShiftJIS.txt

/usr/lib/iconv/geniconvtbl/binarytables/
ISO8859-1%ISO646.bt
ISO646%ISO8859-1.bt

Modified iconv_open(3) in libc.so.1:

/usr/lib/libc.so.1
/usr/lib/sparcv9/libc.so.1 (sparcv9 example)

Man pages:

/usr/share/man/sman1/geniconvtbl.1
/usr/share/man/sman4/geniconvtbl.4

Note -

The section for geniconvtbl(1) describes how to use the utility and where to place the generated binary table files so that they can be used by the iconv functions and utilities.

See geniconvtbl(4)
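Once a conversion is available to the iconv functions, an application might call it roughly as follows. This is a minimal sketch, not the documented usage for geniconvtbl tables: the function name convert_buffer is hypothetical, the codeset names passed in depend on what the system provides, and the prototype of iconv() (const or non-const input pointer) varies between systems.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes at `in` from codeset `from` to codeset `to`,
 * writing at most outlen bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on failure.
 * (convert_buffer is a hypothetical helper. Some systems declare the
 * second argument of iconv() as const char **; adjust inp if needed.)
 */
size_t convert_buffer(const char *from, const char *to,
                      char *in, size_t inlen,
                      char *out, size_t outlen)
{
    char *inp = in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outlen;
    size_t rc;
    iconv_t cd;

    cd = iconv_open(to, from);      /* note the order: to, then from */
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* conversion not available */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;          /* invalid input or output too small */
    return outlen - outleft;
}
```

A production version would loop on E2BIG to handle output buffers that fill up, which this sketch treats as a plain failure.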

Locale Database

The locale database format and structure are private and subject to change in a future release. Therefore, when developing an internationalized application, do not directly access the locale database. Instead, use the Solaris internationalization APIs.

Note -

When using Solaris 8, use the locale databases that are included with the Solaris 8 product. Do not use locales from previous Solaris versions.

Process Code Format

The process code format in the Solaris 8 product is private and subject to change in a future release. Therefore, when developing an internationalized application, do not assume that the process code format stays the same from one release to the next. Instead, use the Solaris internationalization APIs.

Multibyte Support Environment (MSE)

A multibyte character is a character that cannot be stored in a single byte, such as Chinese, Japanese, or Korean characters. These characters require two or three bytes of storage. A more precise definition can be found in ISO/IEC 9899:1990 subclause 3.13. The programming model enables these multibyte characters to be read in as logical units and stored internally as wide characters. These wide characters can be processed by the program as logical entities in their own right. Finally, these wide characters can be written out (undergoing appropriate translation) as logical units. This is analogous to the way single-byte characters are read in, manipulated, and written out again. The MSE provides a comparable set of interfaces to perform this processing. The MSE allows programs to be written to handle multibyte characters using the same programming model that is used for single-byte characters.
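The read-as-logical-units model can be sketched with mbrtowc(), one of the restartable conversion functions. The example below is an illustration under stated assumptions: the function name count_characters is hypothetical, and the input is interpreted in the encoding of the current LC_CTYPE locale.

```c
#include <string.h>
#include <wchar.h>

/*
 * Count the characters (not bytes) in a multibyte string by decoding
 * one logical unit at a time into a wide character.
 * Returns (size_t)-1 if an invalid or incomplete sequence is found.
 * (count_characters is a hypothetical helper.)
 */
size_t count_characters(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* start in the initial state */
    while (remaining > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, remaining, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated character */
        if (len == 0)
            break;                      /* null character reached */
        s += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

In a locale with a multibyte codeset, the result is smaller than strlen() whenever the string contains multibyte characters.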

Dynamically Linked Applications

Solaris 8 users can choose how to link applications with the system libraries, such as libc, by using dynamic linking or static linking. However, any application that requires internationalization features in the system libraries must be dynamically linked. If the application has been statically linked, a call to the setlocale function that requests any locale other than C or POSIX fails. Statically linked applications can operate only in the C and POSIX locales.
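Because setlocale can fail in this way, it is worth checking its return value at startup. The following is a minimal sketch; the function name init_locale and the wording of the warning are hypothetical.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Try to switch the program to the named locale. Pass "" to use the
 * locale selected by the environment (LANG, LC_*).
 * Returns 1 on success, 0 if the request failed, in which case the
 * program keeps running in its previous locale.
 * (init_locale is a hypothetical helper.)
 */
int init_locale(const char *name)
{
    if (setlocale(LC_ALL, name) == NULL) {
        fprintf(stderr, "warning: cannot set locale \"%s\"\n", name);
        return 0;
    }
    return 1;
}
```

A typical program would call init_locale("") once, before any other locale-sensitive function.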

By default, the linker program tries to link the application dynamically. If the command line options to the linker and the compiler include -Bstatic or -dn specifications, your application might be statically linked. You can check whether an existing application is dynamically linked using the /usr/bin/ldd command.

For example, if you type:

% /usr/bin/ldd /sbin/sh

the command displays the following message:

ldd: /sbin/sh: file is not a dynamic executable or shared object

The message indicates the /sbin/sh command is not a dynamically linked program. Also, if you type:

% /usr/bin/ldd /usr/bin/ls

the command displays the following message:

libc.so.1 =>     /usr/lib/libc.so.1
libdl.so.1 =>    /usr/lib/libdl.so.1

This message indicates the /usr/bin/ls command has been dynamically linked with two libraries, libc.so.1 and libdl.so.1.

To summarize, if the ldd output for an application does not contain a libc.so.1 entry, the application has been statically linked with libc. In that case, change the command-line options to the linker so that dynamic linking is used instead, then re-link the application.

`libw` and `libintl`

The interfaces formerly provided by libw and libintl have moved to libc and are no longer implemented in libw and libintl.

The libw and libintl shared objects ensure runtime compatibility for existing applications and, together with the archives, provide compilation environment compatibility for building applications. However, it is no longer necessary to build applications against libw or libintl.

For more information on filters see the Linker and Libraries Guide.

Table 2-2 shows the stub entry points in libw and libintl.

Table 2-2 Stub Entry Points in libw and libintl


`libw:`	`fgetwc`	`fgetws`	`fputwc`	`fputws`	`getwc`
	`getwchar`	`getws`	`isenglish`	`isideogram`	`isnumber`
	`isphonogram`	`isspecial`	`iswalnum`	`iswalpha`	`iswcntrl`
	`iswctype`	`iswdigit`	`iswgraph`	`iswlower`	`iswprint`
	`iswpunct`	`iswspace`	`iswupper`	`iswxdigit`	`putwc`
	`putwchar`	`putws`	`strtows`	`towlower`	`towupper`
	`ungetwc`	`watoll`	`wcscat`	`wcschr`	`wcscmp`
	`wcscoll`	`wcscpy`	`wcscspn`	`wcsftime`	`wcslen`
	`wcsncat`	`wcsncmp`	`wcsncpy`	`wcspbrk`	`wcsrchr`
	`wcsspn`	`wcstod`	`wcstok`	`wcstol`	`wcstoul`
	`wcswcs`	`wcswidth`	`wcsxfrm`	`wctype`	`wcwidth`
	`wscasecmp`	`wscat`	`wschr`	`wscmp`	`wscol`
	`wscoll`	`wscpy`	`wscspn`	`wsdup`	`wslen`
	`wsncasecmp`	`wsncat`	`wsncmp`	`wsncpy`	`wspbrk`
	`wsprintf`	`wsrchr`	`wsscanf`	`wsspn`	`wstod`
	`wstok`	`wstol`	`wstoll`	`wstostr`	`wsxfrm`
`libintl:`	`bindtextdomain`	`dcgettext`	`dgettext`	`gettext`	`textdomain`

`ctype` Macros

Character classification and character transformation macros are defined in /usr/include/ctype.h. The Solaris 8 environment provides a new set of ctype macros. The new macros support character classification and transformation semantics defined by XPG4. To access the new set of macros, one of the following conditions must be met:

__XPG4_CHAR_CLASS__ is defined.
_XOPEN_SOURCE and _XOPEN_VERSION=4 are defined.
_XOPEN_SOURCE and _XOPEN_SOURCE_EXTENDED=1 are defined.

This means that all XPG4 and XPG4.2 applications automatically have the new macros. Because _XOPEN_SOURCE, _XOPEN_VERSION, and _XOPEN_SOURCE_EXTENDED bring in extra XPG4-related features in addition to the new ctype macros, applications that are neither XPG4 nor XPG4.2 applications should use __XPG4_CHAR_CLASS__.

There are corresponding ctype functions. The Solaris 8 functions also support XPG4 semantics.

Refer to the ctype(3C) man page for details.
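As an illustration, an application that is not an XPG4 application might request the new macros as in the sketch below. It assumes that the macro has to be defined before the first inclusion of ctype.h; the function name count_alpha is hypothetical.

```c
/*
 * Assumption: the request must precede the first #include of
 * <ctype.h>. Systems that do not know the macro simply ignore it.
 */
#define __XPG4_CHAR_CLASS__
#include <ctype.h>
#include <stddef.h>

/*
 * Count the alphabetic single-byte characters in s according to the
 * current LC_CTYPE locale. (count_alpha is a hypothetical helper.)
 */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: negative values are not valid input. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

The ctype macros classify single bytes only; text in a multibyte codeset needs the wide-character classification functions instead.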

Internationalization APIs in `libc`

Solaris 8 offers two sets of APIs:

Multibyte (file code)
Wide characters (process code)

Applications process text internally as wide-character codes.

When a program takes input from a file, convert the file's multibyte data into wide-character process code with the mbtowc and mbstowcs APIs. To convert output data from wide-character format back into multibyte format, use the wcstombs and wctomb APIs.
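The conversion in both directions can be sketched as follows: multibyte input is converted to process code, transformed, and converted back to file code. The function name upcase_multibyte is hypothetical, and the sketch relies on mbstowcs() returning the required length when its first argument is a null pointer, which is an X/Open extension.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Write an uppercase copy of the multibyte string `in` to `out`.
 * Returns 0 on success, -1 on invalid input, lack of memory, or an
 * output buffer that is too small.
 * (upcase_multibyte is a hypothetical helper.)
 */
int upcase_multibyte(const char *in, char *out, size_t outsize)
{
    wchar_t *wbuf;
    size_t n, i, written;

    n = mbstowcs(NULL, in, 0);          /* wide characters needed */
    if (n == (size_t)-1)
        return -1;

    wbuf = malloc((n + 1) * sizeof (wchar_t));
    if (wbuf == NULL)
        return -1;
    mbstowcs(wbuf, in, n + 1);          /* file code -> process code */

    for (i = 0; i < n; i++)
        wbuf[i] = (wchar_t)towupper((wint_t)wbuf[i]);

    written = wcstombs(out, wbuf, outsize);   /* process code -> file code */
    free(wbuf);

    /* A return value equal to outsize means no terminator was stored. */
    if (written == (size_t)-1 || written >= outsize)
        return -1;
    return 0;
}
```

Working on wide characters keeps the loop independent of how many bytes each character occupies in the file code.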

Table 2-3 shows a list of internationalization APIs included in Solaris 8.

Table 2-3 Internationalization APIs in libc


API Type	Library Routine	Description
Messaging functions
	`catclose()`	Close a message catalog.
	`catgets()`	Read a program message.
	`catopen()`	Open a message catalog.
	`dgettext()`	Get a message from a message catalog with domain specified.
	`dcgettext()`	Get a message from a message catalog with domain and category specified.
	`textdomain()`	Set and query the current domain.
	`bindtextdomain()`	Bind the path for a message domain.
Code conversion
	`iconv()`	Convert codes.
	`iconv_close()`	Deallocate the conversion descriptor.
	`iconv_open()`	Allocate the conversion descriptor.
Regular expression
	`regcomp()`	Compile the regular expression.
	`regexec()`	Execute regular expression matching.
	`regerror()`	Provide a mapping from error codes to error message.
	`regfree()`	Free memory allocated by `regcomp()`.

Wide character class
	`wctype()`	Define character class.
	`wctrans`	Define character mapping.
	`towctrans`	Wide-character mapping.
	`setlocale()`	Modify and query a program's locale.
	`nl_langinfo()`	Get language and cultural information of current locale.
	`localeconv()`	Get monetary and numeric formatting information of current locale.
Character classification
	`isalpha()`	Is character alphabetic?
	`isupper()`	Is character uppercase?
	`islower()`	Is character lowercase?
	`isdigit()`	Is character a digit?
	`isxdigit()`	Is character a hex digit?
	`isalnum()`	Is character alphabetic or a digit?
	`isspace()`	Is character a space?
	`ispunct()`	Is character a punctuation mark?
	`isprint()`	Is character printable?
	`iscntrl()`	Is character a control character?
	`isascii()`	Is character an ASCII character?
	`isgraph()`	Is character a visible character?
	`isphonogram()`	Is wide character a phonogram?
	`isideogram()`	Is wide character an ideogram?
	`isenglish()`	Is wide character in English alphabet from a supplementary codeset?
	`isnumber()`	Is wide character a digit from a supplementary codeset?
	`isspecial()`	Is special wide character from a supplementary codeset?
	`iswalpha()`	Is wide character alphabetic?
	`iswupper()`	Is wide character uppercase?
	`iswlower()`	Is wide character lowercase?
	`iswdigit()`	Is wide character a digit?
	`iswxdigit()`	Is wide character a hex digit?
	`iswalnum()`	Is wide character an alphabetic character or digit?
	`iswspace()`	Is wide character a white space?
	`iswpunct()`	Is wide character a punctuation mark?
	`iswprint()`	Is wide character a printable character?
	`iswgraph()`	Is wide character a visible character?
	`iswcntrl()`	Is wide character a control character?
	`iswascii()`	Is wide character an ASCII character?
	`toupper()`	Convert a lowercase character to uppercase.
	`tolower()`	Convert an uppercase character to lowercase.
	`towupper()`	Convert a lowercase wide character to uppercase.
	`towlower()`	Convert an uppercase wide character to lowercase.
Character collation
	`strcoll()`	Collate character strings.
	`strxfrm()`	Transform character strings for comparison.
	`wcscoll()`	Collate wide character strings.
	`wcsxfrm()`	Transform wide character strings for comparison.
Monetary handling
	`strfmon()`	Convert monetary value to string representation.
Date and time handling
	`getdate()`	Convert user format date and time.
	`strftime()`	Convert date and time to string representation. The %u conversion specification conforms to the X/Open CAE Specification, System Interfaces and Headers, Issue 4, Version 2: it represents the weekday as a decimal number [1,7], with 1 now representing Monday.
	`strptime()`	Date and time conversion.
Multibyte handling
	`btowc`	Single-byte to wide-character conversion.
	`mbrlen()`	Get number of bytes in character (restartable).
	`mbsinit()`	Determine conversion object status.
	`mbtowc()`	Convert a multibyte character to a wide-character code.
	`mbstowcs()`	Convert a multibyte character string to a wide-character string.
Wide characters
	`wcsncat()`	Concatenate wide-character strings to length `n`.
	`wsdup()`	Duplicate wide-character string.
	`wcscmp()`	Compare wide-character strings.
	`wcsncmp()`	Compare wide-character strings to length `n`.
	`wcscpy()`	Copy wide-character strings.
	`wcsncpy()`	Copy wide-character strings to length `n`.
	`wcschr()`	Find character in wide-character string.
	`wcsrchr()`	Find character in wide-character string from right.
	`wcslen()`	Get length of wide-character string.
	`wscol()`	Return display width of wide-character string.
	`wcsspn()`	Return span of one wide-character string in another.
	`wcscspn()`	Return span of one wide-character string not in another.
	`wcspbrk()`	Return pointer to one wide-character string in another.
	`wcstok()`	Move token through wide-character string.
	`wcswcs()`	Find string in wide-character string.
	`wcstombs()`	Convert wide-character string to multibyte string.
	`wctomb()`	Convert wide-character to multibyte character.
	`wcwidth()`	Determine number of column positions of a wide character.
	`wcswidth()`	Determine number of column positions of a wide-character string.
	`wctob`	Wide-character to single-byte conversion.
	`wcrtomb`	Convert a wide-character code to a character (restartable).
	`wcsrtombs`	Convert a wide-character string to a character string (restartable).
Wide formatting
	`wsprintf()`	Generate wide-character string according to format.
	`wsscanf()`	Formatted input conversion.
	`fwprintf`	Print formatted wide-character output.
	`fwscanf`	Convert formatted wide-character input.
	`wprintf`	Print formatted wide-character output.
	`wscanf`	Convert formatted wide-character input.
	`swprintf`	Print formatted wide-character output.
	`swscanf`	Convert formatted wide-character input.
	`vfwprintf`	Wide-character formatted output of a `stdarg` argument list.
	`vswprintf`	Wide-character formatted output of a `stdarg` argument list.
Wide numbers
	`wcstol()`	Convert wide-character string to long integer.
	`wcstoul()`	Convert wide-character string to unsigned long integer.
	`wcstod()`	Convert wide-character string to double precision.
Wide strings
	`wscasecmp()`	Compare wide-character strings, ignore case differences.
	`wsncasecmp()`	Compare wide-character strings to length `n`, ignore case differences.
	`wcsstr`	Find a wide-character substring.
	`wmemchr`	Find a wide-character in memory.
	`wmemcmp`	Compare wide-characters in memory.
	`wmemcpy`	Copy wide-characters in memory.
	`wmemmove`	Copy wide-characters in memory with overlapping areas.
	`wmemset`	Set wide-characters in memory.
Wide standard I/O
	`fgetwc()`	Get multibyte character from stream, convert to wide character.
	`getwchar()`	Get multibyte character from `stdin`, convert to wide character.
	`fgetws()`	Get multibyte string from stream, convert to wide-character string.
	`getws()`	Get multibyte string from `stdin`, convert to wide-character string.
	`fputwc()`	Convert wide character to multibyte character, put to stream.
	`fwide`	Set stream orientation.
	`putwchar()`	Convert wide character to multibyte character, put to `stdout`.
	`fputws()`	Convert wide-character string to multibyte string, put to stream.
	`putws()`	Convert wide-character string to multibyte string, put to `stdout`.
	`ungetwc()`	Push a wide character back into input stream.
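As one illustration of the APIs in Table 2-3, the collation function strcoll() can replace strcmp() wherever strings are ordered for presentation to users. The sketch below is not taken from the guide; the names compare_collated and sort_strings are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() callback: compare two strings by the LC_COLLATE rules. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/*
 * Sort an array of strings into the collating order of the current
 * locale. (sort_strings is a hypothetical helper.)
 */
void sort_strings(const char **items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_collated);
}
```

In the C locale the result matches a strcmp() sort; in other locales the order follows that locale's collation rules. For large arrays, transforming each string once with strxfrm() and comparing the results with strcmp() may be faster.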

`genmsg` Utility

The new genmsg utility can be used with the catgets() family of functions to create internationalized source message catalogs. The utility examines a source program file for calls to functions in the catgets() family and builds a source message catalog from the information it finds. For example:

% cat example.c
	...
	/* NOTE: %s is a file name */
	printf(catgets(catd, 5, 1, "%s cannot be opened."));
	/* NOTE: "Read" is a past participle, not a present
			tense verb */
	printf(catgets(catd, 5, 2, "Read"));
	...
% genmsg -c NOTE example.c
The following file(s) have been created.
			new msg file = "example.c.msg"
% cat example.c.msg
$quote "
$set 5
1			"%s cannot be opened."
	/* NOTE: %s is a file name */
2			"Read"
	/* NOTE: "Read" is a past participle, not a present
			tense verb */

In the above example, genmsg is run on the source file example.c, which produces a source message catalog named example.c.msg. The -c option with the argument NOTE causes genmsg to include comments in the catalog. If a comment in the source program contains the string specified, the comment appears in the message catalog after the next string extracted from a call to catgets().

You can use genmsg to number the messages in a message set automatically.

For more information, see the genmsg(1) man page.
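At run time, the program retrieves these messages with catopen(), catgets(), and catclose(). The following is a minimal sketch that assumes the source catalog has been compiled with gencat; the function name get_message and any catalog name passed to it are hypothetical.

```c
#include <nl_types.h>
#include <string.h>

/*
 * Copy message msg_id of set set_id from the named catalog into buf.
 * If the catalog or the message is missing, the fallback text is
 * copied instead. The text is copied before the catalog is closed
 * because the pointer returned by catgets() may not stay valid
 * afterward. (get_message is a hypothetical helper.)
 */
void get_message(const char *catalog, int set_id, int msg_id,
                 const char *fallback, char *buf, size_t bufsize)
{
    const char *text = fallback;
    nl_catd catd;

    if (bufsize == 0)
        return;

    catd = catopen(catalog, 0);
    if (catd != (nl_catd)-1)
        text = catgets(catd, set_id, msg_id, fallback);

    strncpy(buf, text, bufsize - 1);
    buf[bufsize - 1] = '\0';            /* always terminate the copy */

    if (catd != (nl_catd)-1)
        catclose(catd);
}
```

Passing the original English string as the fallback keeps the program usable when no catalog is installed for the current locale.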

Note -

The material in this section is used with permission from Creating Worldwide Software: Solaris International Developer's Guide, 2nd edition by Bill Tuthill and David A. Smallberg, published by Sun Microsystems Press/Prentice Hall. (c)1997 Sun Microsystems, Inc.
