International Language Environments Guide

Chapter 2 General Internationalization Features

This chapter discusses several internationalization features contained in the Oracle Solaris operating system.

Support for Code Set Independence

EUC is an abbreviation for Extended UNIX® Code. The Oracle Solaris operating system supports non-EUC encodings such as PC-Kanji (better known as Shift_JIS) in Japan, Big5 in Taiwan, and GBK in the People's Republic of China. Because a large part of the computer market demands non-EUC codeset support, the current Oracle Solaris environment provides a solid framework to enable both EUC and non-EUC code set support. This support is called Code Set Independence, or CSI.

The goal of CSI is to remove dependencies on specific code sets or encoding methods from Oracle Solaris operating system libraries and commands. The CSI architecture enables the Oracle Solaris operating system to support any UNIX file system safe encoding. CSI supports a number of new code sets, such as UTF-8, PC-Kanji, and Big5.

CSI Approach

Code set independence enables application and platform software developers to keep their code independent of any encoding, such as UTF-8. CSI also provides the ability to adopt any new encoding without having to modify the source code. This architectural approach differs from Java internationalization because applications do not have to be UTF-16–dependent.

Many existing internationalized applications (for example, Motif) automatically inherit CSI support from the underlying system. These applications work in the new locales without modification.

CSI is inherently independent of any particular code set. However, the following assumptions about file code encodings (code sets) still apply to the current Oracle Solaris system:

File code is a superset of ASCII.
The null byte value (0x00) does not appear as part of a multibyte character, so that null-terminated multibyte character strings are supported.
The ASCII slash byte value (0x2f) does not appear as part of a multibyte character, so that UNIX path names are supported.

CSI-enabled Commands

This section lists the CSI-enabled commands in the current Oracle Solaris environment. The man page for each command includes an attribute section that indicates whether the command is CSI-enabled.

All commands are in the /usr/bin directory, unless otherwise noted.

`/usr/lib/diffh`	`cat`	`pack`
`/usr/sbin/accept`	`catman`	`paste`
`/usr/sbin/reject`	`chgrp`	`pcat`
`/usr/ucb/lpr`	`chmod`	`pg`
`/usr/xpg4/bin/awk`	`chown`	`printf`
`/usr/xpg4/bin/cp`	`cmp`	`priocntl`
`/usr/xpg4/bin/date`	`col`	`ps`
`/usr/xpg4/bin/du`	`comm`	`pwd`
`/usr/xpg4/bin/ed`	`compress`	`rcp`
`/usr/xpg4/bin/edit`	`cpio`	`red`
`/usr/xpg4/bin/egrep`	`csh`	`remsh`
`/usr/xpg4/bin/env`	`csplit`	`rksh`
`/usr/xpg4/bin/ex`	`cut`	`rsh`
`/usr/xpg4/bin/expr`	`diff`	`rmdir`
`/usr/xpg4/bin/fgrep`	`diff3`	`script`
`/usr/xpg4/bin/lp`	`disable`	`sdiff`
`/usr/xpg4/bin/ls`	`echo`	`settime`
`/usr/xpg4/bin/more`	`expand`	`sh`
`/usr/xpg4/bin/mv`	`file`	`split`
`/usr/xpg4/bin/nice`	`find`	`strconf`
`/usr/xpg4/bin/nohup`	`fold`	`strings`
`/usr/xpg4/bin/od`	`ftp`	`sum`
`/usr/xpg4/bin/pr`	`gencat`	`tabs`
`/usr/xpg4/bin/rm`	`getopt`	`tar`
`/usr/xpg4/bin/sed`	`getoptcvt`	`tee`
`/usr/xpg4/bin/sort`	`head`	`touch`
`/usr/xpg4/bin/tail`	`join`	`tty`
`/usr/xpg4/bin/tr`	`jsh`	`uncompress`
`/usr/xpg4/bin/vedit`	`kill`	`unexpand`
`/usr/xpg4/bin/vi`	`ksh`	`uniq`
`/usr/xpg4/bin/view`	`lp`	`unpack`
`acctcom`	`man`	`wc`
`apropos`	`mkdir`	`whatis`
`batch`	`msgfmt`	`write`
`bdiff`	`news`	`xargs`
`cancel`	`nroff`	`zcat`

CSI-enabled Libraries

Nearly all functions in libc (/usr/lib/libc.so) are CSI-enabled. However, the following functions in libc are not CSI-enabled and therefore are EUC-dependent functions:

csetcol()
csetlen()
csetno()
euccol()
euclen()
eucscol()
getwidth()
wcsetno()

In the current Oracle Solaris environment, libgen (/usr/ccs/lib/libgen.a) and libcurses (/usr/ccs/lib/libcurses.a) are internationalized but not CSI-enabled.

Locale Database

The locale database format and structure are private and subject to change in a future release. When you develop internationalized applications, use the internationalization APIs in libc, which are described in Internationalization APIs in libc, rather than accessing the locale database directly.

Note –

When you work in the Oracle Solaris environment, use the locale databases that are included with the current Oracle Solaris release. Do not use locales from previous Oracle Solaris versions.

Process Code Format

The process code format, which is also known as wide-character code format in the Oracle Solaris operating system, is private and subject to change in a future release. Therefore, when you develop an international application, do not make any assumptions about the process code format. Instead, use the internationalization APIs in libc described in Internationalization APIs in libc.

Note –

The process code for all Unicode locales is in UTF-32 representation. For more detail on UTF-32, refer to the Unicode Standard Annex #19: UTF-32 and Unicode Standard Annex #27: Unicode 3.1 from the Unicode Consortium or http://www.unicode.org/.

Multibyte Support Environment

A multibyte character is a character that cannot be stored in a single byte, such as Chinese, Japanese, or Korean characters. These characters require 2, 3, or 4 bytes of storage. A more precise definition can be found in ISO/IEC 9899:1990 subclause 3.13.

Amendment 1 to ISO/IEC 9899:1990 (the ISO edition of ANSI C) added new internationalization features, collectively known as the Multibyte Support Environment (MSE). Amendment 1 defines additional internationalization APIs for multibyte code sets with state and also for better wide-character handling support.

The programming model enables these multibyte characters to be read in as logical units and stored internally as wide characters. These wide characters can be processed by the program as logical entities. Finally, these wide characters can be written out, undergoing appropriate translation, as logical units.

This procedure is analogous to the way single-byte characters are read in, manipulated, and written out again. The MSE enables programs to handle multibyte characters using the same programming model that is used for single-byte characters.

Dynamically Linked Applications

You can link applications with the system libraries, such as libc, by using dynamic linking or static linking. Any application that requires internationalization features in the system libraries must be dynamically linked. If the application has been statically linked, an attempt to set the locale to anything other than C or POSIX by using the setlocale function fails. Statically linked applications can operate only in the C and POSIX locales.

By default, the linker program tries to link the application dynamically. If the command-line options to the linker and the compiler include -Bstatic or -dn specifications, your application might be statically linked. You can check whether an existing application is dynamically linked using the /usr/bin/ldd command.

For example, the response to the following command indicates that the /sbin/sh command is not a dynamically linked program:

% /usr/bin/ldd /sbin/sh
ldd: /sbin/sh: file is not a dynamic executable or shared object

The response to the following command indicates that the /usr/bin/ls command has been dynamically linked with two libraries, libc.so.1 and libdl.so.1.

% /usr/bin/ldd /usr/bin/ls
libc.so.1 => 	/usr/lib/libc.so.1
libdl.so.1 => /usr/lib/libdl.so.1

Changed Interfaces

The functions that were previously in libw and libintl have moved to libc.

The libw and libintl shared objects ensure runtime compatibility for existing applications and, together with the archives, provide compilation environment compatibility for building applications. However, you no longer need to build applications against libw or libintl.

The following list shows the stub entry points in libw:

`fgetwc`	`iswdigit`	`wcscat`	`wcstol`	`wslen`
`fgetws`	`iswgraph`	`wcschr`	`wcstoul`	`wsncasecmp`
`fputwc`	`iswlower`	`wcslen`	`wcswcs`	`wsncat`
`fputws`	`iswprint`	`wcscmp`	`wcswidth`	`wsncmp`
`getwc`	`iswpunct`	`wcscoll`	`wcsxfrm`	`wsncpy`
`getwchar`	`iswspace`	`wcscpy`	`wctype`	`wspbrk`
`getws`	`iswupper`	`wcscspn`	`wcwidth`	`wsprintf`
`isenglish`	`iswxdigit`	`wcsftime`	`wscasecmp`	`wsrchr`
`isideogram`	`putwc`	`wcsncat`	`wscat`	`wsscanf`
`isnumber`	`putwchar`	`wcsncmp`	`wschr`	`wsspn`
`isphonogram`	`putws`	`wcsncpy`	`wscmp`	`wstod`
`isspecial`	`strtows`	`wcspbrk`	`wscol`	`wstok`
`iswalnum`	`towlower`	`wcsrchr`	`wscoll`	`wstol`
`iswalpha`	`towupper`	`wcsspn`	`wscpy`	`wstoll`
`iswcntrl`	`ungetwc`	`wcstod`	`wscspn`	`wstostr`
`iswctype`	`watoll`	`wcstok`	`wsdup`	`wsxfrm`

The following list shows the stub entry points in libintl:

bindtextdomain
dcgettext
dgettext
gettext
textdomain

`ctype` Macros

Character classification and character transformation macros are defined in /usr/include/ctype.h. The current Oracle Solaris environment provides a set of ctype macros that support character classification and transformation semantics defined by XPG4. For all XPG4 and XPG4.2 applications to automatically access new macros, one of the following conditions must be met:

__XPG4_CHAR_CLASS__ is defined.
_XOPEN_SOURCE and _XOPEN_VERSION=4 are defined.
_XOPEN_SOURCE and _XOPEN_SOURCE_EXTENDED=1 are defined.

Because _XOPEN_SOURCE, _XOPEN_VERSION, and _XOPEN_SOURCE_EXTENDED bring in extra XPG4-related features in addition to the new ctype macros, applications that are not XPG4 or XPG4.2 applications should use __XPG4_CHAR_CLASS__.

Corresponding ctype functions also exist. In the current Oracle Solaris environment, these functions also support XPG4 semantics.

Internationalization APIs in `libc`

The current Oracle Solaris environment offers two sets of APIs:

Multibyte (file codes)
Wide characters (process code)

Wide-character codes are fixed-width units of logical entities. Therefore, when you use wide characters, you do not have to keep track of character boundaries as you do when you use multibyte characters.

When a program takes input from a file, you can convert your file's multibyte data into wide-character process code directly with input functions like fscanf and fwscanf or by using conversion functions like mbtowc and mbsrtowcs after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf and fprintf or apply conversion functions like wctomb and wcsrtombs before the output.

The tables in the remainder of this chapter describe the internationalization APIs included in the current Oracle Solaris system.

The following table describes the messaging function APIs in libc.

Table 2–1 Messaging Functions in libc


Library Routine	Description
`bindtextdomain()`	Bind the path for a message domain
`catclose()`	Close a message catalog
`catgets()`	Read a program message
`catopen()`	Open a message catalog
`dcgettext()`	Get a message from a message catalog with domain and category specified
`dgettext()`	Get a message from a message catalog with domain specified
`gettext()`	Retrieve a text string from the message database
`textdomain()`	Set and query the current domain

The following table describes the code conversion function APIs in libc.

Table 2–2 Code Conversion in libc


Library Routine	Description
`iconv()`	Convert codes
`iconv_close()`	Deallocate the conversion descriptor
`iconv_open()`	Allocate the conversion descriptor

The following table describes the regular expression APIs in libc.

Table 2–3 Regular Expressions in libc


Library Routine	Description
`fnmatch()`	Match file name or path name
`regcomp()`	Compile the regular expression
`regerror()`	Provide a mapping from error codes to error messages
`regexec()`	Execute regular expression matching
`regfree()`	Free memory allocated by `regcomp()`

The following table describes the wide character function APIs in libc.

Table 2–4 Wide Character Class in libc


Library Routine	Description
`wctrans()`	Define character mapping
`wctype()`	Define character class

The following table lists the function for modifying and querying the locale in libc.

Table 2–5 Modify and Query Locale in libc


Library Routine	Description
`setlocale()`	Modify and query a program's locale

The following table lists the functions for querying locale data in libc.

Table 2–6 Query Locale Data in libc


Library Routine	Description
`localeconv()`	Get monetary and numeric formatting information of current locale
`nl_langinfo()`	Get language and cultural information of current locale

The following table describes the character classification function APIs in libc.

Table 2–7 Character Classification and Transliteration in libc


Library Routine	Description
`isalnum()`	Is character alphabetic or a digit?
`isalpha()`	Is character alphabetic?
`isascii()`	Is character an ASCII character?
`iscntrl()`	Is character a control character?
`isdigit()`	Is character a digit?
`isenglish()`	Is wide character in English alphabet from a supplementary code set?
`isgraph()`	Is character a visible character?
`isideogram()`	Is wide character an ideogram?
`islower()`	Is character lowercase?
`isnumber()`	Is wide character a digit from a supplementary code set?
`isphonogram()`	Is wide character a phonogram?
`isprint()`	Is character printable?
`ispunct()`	Is character a punctuation mark?
`isspace()`	Is character a space?
`isspecial()`	Is special wide character from a supplementary code set?
`isupper()`	Is character uppercase?
`iswalnum()`	Is wide character an alphabetic character or digit?
`iswalpha()`	Is wide character alphabetic?
`iswascii()`	Is wide character an ASCII character?
`iswcntrl()`	Is wide character a control character?
`iswdigit()`	Is wide character a digit?
`iswgraph()`	Is wide character a visible character?
`iswlower()`	Is wide character lowercase?
`iswprint()`	Is wide character a printable character?
`iswpunct()`	Is wide character a punctuation mark?
`iswspace()`	Is wide character a white space?
`iswupper()`	Is wide character uppercase?
`iswxdigit()`	Is wide character a hex digit?
`isxdigit()`	Is character a hex digit?
`tolower()`	Convert an uppercase character to lowercase.
`toupper()`	Convert a lowercase character to uppercase.
`towctrans()`	Wide character mapping.
`towlower()`	Convert an uppercase wide character to lowercase.
`towupper()`	Convert a lowercase wide character to uppercase.

The following table describes the character collation function APIs in libc.

Table 2–8 Character Collation in libc


Library Routine	Description
`strcoll()`	Collate character strings
`strxfrm()`	Transform character strings for comparison
`wcscoll()`	Collate wide-character strings
`wcsxfrm()`	Transform wide-character strings for comparison

The following table describes the monetary handling function APIs in libc.

Table 2–9 Monetary Formatting in libc


Library Routine	Description
`localeconv()`	Get monetary formatting information for the current locale
`strfmon()`	Convert monetary value to string representation

The following table describes the date and time formatting functions in libc.

Table 2–10 Date and Time Formatting in libc


Library Routine	Description
`getdate()`	Convert user format date and time.
`strftime()`	Convert date and time to string representation. The `%u` conversion specification conforms to the X/Open CAE Specification, System Interfaces and Headers, Issue 4, Version 2. It represents the weekday as a decimal number [1,7], with 1 now representing Monday.
`strptime()`	Date and time conversion.

The following table describes the multibyte handling function APIs in libc.

Table 2–11 Multibyte Handling in libc


Library Routine	Description
`btowc()`	Single-byte to wide-character conversion
`mblen()`	Get number of bytes in a character
`mbrlen()`	Get number of bytes in character (restartable)
`mbrtowc()`	Convert a character to a wide-character code (restartable)
`mbsinit()`	Determine conversion object status
`mbsrtowcs()`	Convert a character string to a wide-character string (restartable)
`mbstowcs()`	Convert a character string to a wide-character string
`mbtowc()`	Convert a character to a wide-character code

The following table describes the wide-character and string handling functions in libc.

Table 2–12 Wide Character and String Handling in libc


Library Routine	Description
`wcrtomb()`	Convert a wide-character code to a character (restartable)
`wcscat()`	Concatenate wide-character strings
`wcschr()`	Find character in wide-character string
`wcscmp()`	Compare wide-character strings
`wcscpy()`	Copy wide-character strings
`wcscspn()`	Return span of one wide-character string not in another
`wcslen()`	Get length of wide-character string
`wcsncat()`	Concatenate wide-character strings to length `n`
`wcsncmp()`	Compare wide-character strings to length `n`
`wcsncpy()`	Copy wide-character strings to length `n`
`wcspbrk()`	Return pointer to one wide-character string in another
`wcsrchr()`	Find character in wide-character string from right
`wcsrtombs()`	Convert a wide-character string to a character string (restartable)
`wcsspn()`	Return span of one wide-character string in another
`wcstod()`	Convert wide-character string to double precision
`wcstok()`	Move token through wide-character string
`wcstol()`	Convert wide-character string to long integer
`wcstombs()`	Convert wide-character string to multibyte string
`wcstoul()`	Convert wide-character string to unsigned long integer
`wcswcs()`	Find string in wide-character string
`wcswidth()`	Determine number of column positions of a wide-character string
`wctob()`	Wide character to single byte conversion
`wctomb()`	Convert wide-character to multibyte character
`wcwidth()`	Determine number of column positions of a wide character
`wscol()`	Return display width of wide-character string
`wsdup()`	Duplicate wide-character string

The following table describes the formatted wide-character input and output functions in libc.

Table 2–13 Formatted Wide-character Input and Output in libc


Library Routine	Description
`fwprintf()`	Print formatted wide-character output
`fwscanf()`	Convert formatted wide-character input
`swprintf()`	Print formatted wide-character output
`swscanf()`	Convert formatted wide-character input
`vfwprintf()`	Wide-character formatted output of a `stdarg` argument list
`vswprintf()`	Wide-character formatted output of a `stdarg` argument list
`wprintf()`	Print formatted wide-character output
`wscanf()`	Convert formatted wide-character input
`wsprintf()`	Generate wide-character string according to format
`wsscanf()`	Formatted input conversion

This table describes the wide strings function APIs in libc.

Table 2–14 Wide Strings in libc


Library Routine	Description
`wcsstr()`	Find a wide-character substring
`wmemchr()`	Find a wide character in memory
`wmemcmp()`	Compare wide characters in memory
`wmemcpy()`	Copy wide characters in memory
`wmemmove()`	Copy wide characters in memory with overlapping areas
`wmemset()`	Set wide characters in memory
`wscasecmp()`	Compare wide-character strings, ignore case differences
`wsncasecmp()`	Compare wide-character strings to length `n`, ignore case differences

The following table describes the wide-character input and output functions in libc.

Table 2–15 Wide-Character Input and Output in libc


Library Routine	Description
`fgetwc()`	Get multibyte character from stream, convert to wide character
`fgetws()`	Get multibyte string from stream, convert to wide-character string
`fputwc()`	Convert wide character to multibyte character, put to stream
`fputws()`	Convert wide-character string to multibyte string, put to stream
`fwide()`	Set stream orientation
`getwchar()`	Get multibyte character from `stdin`, convert to wide character
`getws()`	Get multibyte string from `stdin`, convert to wide-character string
`putwchar()`	Convert wide character to multibyte character, put to `stdout`
`putws()`	Convert wide-character string to multibyte string, put to `stdout`
`ungetwc()`	Push a wide character back into input stream

`genmsg` Utility

The new genmsg utility can be used with the catgets() family of functions to create internationalized source message catalogs. The utility examines a source program file for calls to catgets() and builds a source message catalog from the information it finds. For example:

% cat example.c
	...
	/* NOTE: %s is a file name */
	printf(catgets(catd, 5, 1, "%s cannot be opened."));
	/* NOTE: "Read" is a past participle, not a present
			tense verb */
	printf(catgets(catd, 5, 2, "Read"));
	...
% genmsg -c NOTE example.c
The following file(s) have been created.
			new msg file = "example.c.msg"
% cat example.c.msg
$quote "
$set 5
1			"%s cannot be opened."
	/* NOTE: %s is a file name */
2			"Read"
	/* NOTE: "Read" is a past participle, not a present
			tense verb */

In the above example, genmsg is run on the source file example.c, which produces a source message catalog named example.c.msg. The -c option with the argument NOTE causes genmsg to include comments in the catalog. If a comment in the source program contains the string specified, the comment appears in the message catalog after the next string extracted from a call to catgets.

You can use genmsg to number the messages in a message set automatically.

For more information, see the genmsg(1) man page.

To generate a formatted message catalog file, use the gencat(1) utility.

For information on the message extraction utility for portable message files (.po files) and also on how to generate message object files (.mo files) from the .po files, see the xgettext(1) and msgfmt(1) man pages.

User-Defined and User-Extensible Code Conversions

You can create user-defined codeset converters using the geniconvtbl utility.

This utility enables user-defined and user-customizable codeset conversions with a standard system utility and interface like iconv(1) and iconv(3C). This feature enhances the ability of an application to deal with incompatible data types, particularly data generated from proprietary or legacy applications. Modification to existing Oracle Solaris codeset conversions is also supported.

Sample input source files for the utility are available in the /usr/lib/iconv/geniconvtbl/srcs/ directory.

Once the user-defined code conversions are prepared and placed properly, users can use the code conversions from the iconv(1) utility and the iconv(3C) functions on both the 32-bit and 64-bit Oracle Solaris operating system.

Internationalized Domain Name (IDN) Support

Internationalized Domain Name (IDN) enables the use of non-English native language names as host and domain names. To use non-English host and domain names, convert these names into ASCII Compatible Encoding (ACE) encoded names before sending the names to resolver routines as specified in RFC 3490. System administrators are also required to use ACE names in system files and applications where the system administration applications do not support the IDNs.

See RFC 3490 Internationalizing Domain Names in Applications (IDNA).

The APIs for the Internationalized Domain Name in libidnkit(3EXT) provide convenient conversions between UTF-8 or the application locale's codeset and ACE. If idn_decodename2(3EXT) is used, you can also specify an arbitrary codeset name as the codeset of the input argument.

Figure 2–1 IDN to ACE Conversion

Graphic shows non-English name conversion to ASCII compatible encoded string.

Figure 2–2 ACE to IDN Conversion

Graphic shows ASCII compatible encoded string conversion to non-English name.

The following table shows bilateral iconv code conversions that you can use.

Table 2–16 iconv Code Conversions


From Code	To Code
ACE	UTF-8
ACE-ALLOW-UNASSIGNED	UTF-8
UTF-8	ACE
UTF-8	ACE-ALLOW-UNASSIGNED

The ACE and the ACE-ALLOW-UNASSIGNED iconv code conversion names have the following meanings:

ACE

ACE is a fromcode or tocode name that can be used in iconv code conversions to refer to the ASCII Compatible Encoding defined in RFC 3490. This conversion uses STD3 ASCII rules. Unassigned characters are not allowed. ACE is typically used for storing or giving host or domain names to machines.

ACE-ALLOW-UNASSIGNED

ACE-ALLOW-UNASSIGNED performs the same operations as ACE except that ACE-ALLOW-UNASSIGNED allows unassigned characters. ACE-ALLOW-UNASSIGNED is typically used for query purposes.

The following example shows a conversion from ACE to UTF-8 with input from the hostnames.txt file. Output goes to standard output.

system% iconv -f ACE -t UTF-8 hostnames.txt

The dedicated IDN conversion utility idnconv(1) provides IDN conversions with various options. The options control the conversion details.

For information about IDN, the conversion routines, and iconv code conversions, see the libidnkit(3LIB), idn_decodename(3EXT), idn_decodename2(3EXT), idn_encodename(3EXT), and iconv_en_US.UTF-8(5) man pages.