International Language Environments Guide

Chapter 2 General Internationalization Features

This section discusses several internationalization features contained in the Solaris 9 environment.

Support for Codeset Independence

EUC is an abbreviation for Extended UNIX Code. The Solaris 9 operating environment supports non-EUC encodings such as PC-Kanji (better known as Shift_JIS) in Japan, Big5 in Taiwan, and GBK in the People's Republic of China. Because a large part of the computer market demands non-EUC codeset support, the Solaris 9 environment provides a solid framework to enable both EUC and non-EUC codeset support. This support is called Codeset Independence, or CSI.

The goal of CSI is to remove dependencies on specific codesets or encoding methods from Solaris operating environment libraries and commands. The CSI architecture allows the Solaris operating environment to support any UNIX file system safe encoding. CSI supports a number of new codesets, such as UTF-8, PC-Kanji, and Big5.

CSI Approach

Codeset independence enables application and platform software developers to keep their code independent of any encoding, such as UTF-8, and also provides the ability to adopt any new encoding without having to modify the source code. This architecture approach differs from Java™ internationalization in that Java requires applications to be UTF-16–dependent.

Many existing internationalized applications (for example, Motif) automatically inherit CSI support from the underlying system. These applications work in the new locales without modification.

CSI is inherently independent from any codesets. However, the following assumptions about file code encodings (codesets) still apply to the Solaris 9 environment:

File code is a superset of ASCII.
NULL byte value (0x00) does not appear as part of multibyte character bytes for support of null-terminated multibyte character strings.
ASCII Slash character byte value (0x2f) does not appear as part of multibyte character bytes for support of the UNIX path names.
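
The third assumption is what lets byte-oriented code split path names without decoding them. The following is a minimal sketch, not Solaris source code; the function name count_path_components is hypothetical. It scans raw bytes for 0x2f and relies on the guarantee that this byte never occurs inside a multibyte character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the non-empty components of a path by scanning raw bytes.
 * This is safe for any file code that meets the assumptions above,
 * because the '/' byte (0x2f) never appears inside a multibyte character.
 */
size_t count_path_components(const char *path)
{
    size_t count = 0;
    int in_component = 0;

    assert(path != NULL);
    for (; *path != '\0'; path++) {
        if (*path == '/') {
            in_component = 0;       /* separator ends the component */
        } else if (!in_component) {
            in_component = 1;       /* first byte of a new component */
            count++;
        }
    }
    return count;
}
```

For example, count_path_components("/usr/xpg4/bin/ls") returns 4, and the result is the same whether the component names are ASCII or multibyte.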

CSI-enabled Commands

This section lists the CSI-enabled commands in the Solaris 9 environment. The man page for each command has an attribute section that indicates whether the command is CSI-enabled.

All commands are in the /usr/bin directory, unless otherwise noted.

/usr/lib/diffh
/usr/sbin/accept
/usr/sbin/reject
/usr/ucb/lpr
/usr/xpg4/bin/awk
/usr/xpg4/bin/cp
/usr/xpg4/bin/date
/usr/xpg4/bin/du
/usr/xpg4/bin/ed
/usr/xpg4/bin/edit
/usr/xpg4/bin/egrep
/usr/xpg4/bin/env
/usr/xpg4/bin/ex
/usr/xpg4/bin/expr
/usr/xpg4/bin/fgrep
/usr/xpg4/bin/lp
/usr/xpg4/bin/ls
/usr/xpg4/bin/more
/usr/xpg4/bin/mv
/usr/xpg4/bin/nice
/usr/xpg4/bin/nohup
/usr/xpg4/bin/od
/usr/xpg4/bin/pr
/usr/xpg4/bin/rm
/usr/xpg4/bin/sed
/usr/xpg4/bin/sort
/usr/xpg4/bin/tail
/usr/xpg4/bin/tr
/usr/xpg4/bin/vedit
/usr/xpg4/bin/vi
/usr/xpg4/bin/view
acctcom
apropos
batch
bdiff
cancel
cat
catman
chgrp
chmod
chown
cmp
col
comm
compress
cpio
csh
csplit
cut
diff
diff3
disable
echo
expand
file
find
fold
ftp
gencat
getopt
getoptcvt
head
join
jsh
kill
ksh
lp
man
mkdir
msgfmt
news
nroff
pack
paste
pcat
pg
printf
priocntl
ps
pwd
rcp
red
remsh
rksh
rmdir
rsh
script
sdiff
settime
sh
split
strconf
strings
sum
tabs
tar
tee
touch
tty
uncompress
unexpand
uniq
unpack
wc
whatis
write
xargs
zcat

Solaris 9 CSI-enabled Libraries

Nearly all functions in libc (/usr/lib/libc.so) are CSI-enabled. However, the following functions in libc are not CSI-enabled because they are EUC-dependent functions:

csetcol()
csetlen()
euccol()
euclen()
eucscol()
getwidth()
csetno()
wcsetno()

In the Solaris 9 product, libgen (/usr/ccs/lib/libgen.a) and libcurses (/usr/ccs/lib/libcurses.a) are internationalized but not CSI-enabled.

Locale Database

The format and structure of the locale database are private and subject to change in a future release. Therefore, when developing an internationalized application, do not directly access the locale database. Instead, use the internationalization APIs in libc, described in Internationalization APIs in libc.

Note –

When working with the Solaris 9 environment, use the locale databases that are included with the Solaris 9 product. Do not use locales from previous Solaris versions.

Process Code Format

The process code format, which is also known as wide-character code format in the Solaris 9 product, is private and subject to change in a future release. Therefore, when developing an internationalized application, do not assume that the process code format stays the same from release to release. Instead, use the internationalization APIs in libc described in Internationalization APIs in libc.

Note –

The process code for all Unicode locales is in UTF-32 representation. For more detail on UTF-32, refer to the “Unicode Standard Annex #19: UTF-32” and “Unicode Standard Annex #27: Unicode 3.1” from The Unicode Consortium or http://www.unicode.org/.

Multibyte Support Environment

A multibyte character is a character that cannot be stored in a single byte, such as Chinese, Japanese, or Korean characters. These characters require 2, 3, or 4 bytes of storage. A more precise definition can be found in ISO/IEC 9899:1990 subclause 3.13.

Amendment 1 to ANSI C, which is also known as ISO/IEC 9899:1990, added new internationalization features, collectively known as the Multibyte Support Environment (MSE). Amendment 1 defines additional internationalization APIs for multibyte codesets with state and also for better wide-character handling support.

The programming model enables these multibyte characters to be read in as logical units and stored internally as wide characters. These wide characters can be processed by the program as logical entities in their own right. Finally, these wide characters can be written out, undergoing appropriate translation, as logical units.

This procedure is analogous to the way single-byte characters are read in, manipulated, and written out again. The MSE enables programs to be written to handle multibyte characters using the same programming model that is used for single-byte characters.
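
The read, process, and write model can be sketched as follows. This is an illustrative example rather than code from the Solaris documentation; the function name upcase_multibyte is hypothetical. It converts a multibyte string to wide characters with mbstowcs(), processes each wide character as a logical unit with towupper(), and converts the result back with wcstombs(). The conversions follow whatever locale the program has selected with setlocale().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Convert the multibyte string "in" to uppercase and store the result
 * in "out" (capacity "outsz" bytes). Returns 0 on success, -1 on error.
 */
int upcase_multibyte(const char *in, char *out, size_t outsz)
{
    size_t max_wide, n, i, written;
    wchar_t *wide;

    assert(in != NULL && out != NULL);

    /* Each character needs at least one byte, so this bound is enough. */
    max_wide = strlen(in) + 1;
    wide = malloc(max_wide * sizeof *wide);
    if (wide == NULL)
        return -1;

    /* Step 1: multibyte file code -> wide-character process code. */
    n = mbstowcs(wide, in, max_wide);
    if (n == (size_t)-1) {
        free(wide);
        return -1;                  /* invalid multibyte sequence */
    }

    /* Step 2: process each wide character as one logical unit. */
    for (i = 0; i < n; i++)
        wide[i] = (wchar_t)towupper((wint_t)wide[i]);

    /* Step 3: wide-character process code -> multibyte file code. */
    written = wcstombs(out, wide, outsz);
    free(wide);
    if (written == (size_t)-1 || written >= outsz)
        return -1;                  /* unconvertible or output truncated */
    return 0;
}
```

The loop in step 2 never needs to know how many bytes a character occupied in the input, which is the point of the model.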

Dynamically Linked Applications

Solaris 9 product users can choose how to link applications with the system libraries, such as libc, by using dynamic linking or static linking. Any application that requires internationalization features in the system libraries must be dynamically linked. If the application has been statically linked, the operation to set the locale to anything other than C and POSIX using the setlocale function will fail. Statically linked applications can be operated only in C and POSIX locales.
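
Because setlocale() reports this failure through its return value, an application can detect the problem at startup. The following sketch assumes a hypothetical helper named try_locale; it is one possible way to check the result, not a prescribed pattern.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/*
 * Try to switch every locale category to "name".
 * Returns 1 if setlocale() accepted the locale, 0 if it returned NULL
 * (for example, a non-C locale requested by a statically linked program).
 */
int try_locale(const char *name)
{
    assert(name != NULL);
    return setlocale(LC_ALL, name) != NULL;
}
```

A typical caller passes the empty string, as in try_locale(""), to pick up the locale from the user's environment, and falls back to the C locale or prints a diagnostic when the call returns 0.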

By default, the linker program tries to link the application dynamically. If the command line options to the linker and the compiler include -Bstatic or -dn specifications, your application might be statically linked. You can check whether an existing application is dynamically linked using the /usr/bin/ldd command.

For example, if you type:

% /usr/bin/ldd /sbin/sh

the command indicates that the /sbin/sh command is not a dynamically linked program, as shown by the following response:

ldd: /sbin/sh: file is not a dynamic executable or shared object

If you type:

% /usr/bin/ldd /usr/bin/ls

the command displays the following message:

libc.so.1 => 	/usr/lib/libc.so.1
libdl.so.1 => /usr/lib/libdl.so.1

This message indicates that the /usr/bin/ls command has been dynamically linked with two libraries, libc.so.1 and libdl.so.1.

Changed Interfaces

The functions that were formerly in libw and libintl have moved to libc.

The libw and libintl shared objects ensure runtime compatibility for existing applications and, together with the archives, provide compilation environment compatibility for building applications. However, you no longer need to build applications against libw or libintl.

For more information on filters, see the Linker and Libraries Guide.

The following list shows the stub entry points in libw.

fgetwc
fgetws
fputwc
fputws
getwc
getwchar
getws
isenglish
isideogram
isnumber
isphonogram
isspecial
iswalnum
iswalpha
iswcntrl
iswctype
iswdigit
iswgraph
iswlower
iswprint
iswpunct
iswspace
iswupper
iswxdigit
putwc
putwchar
putws
strtows
towlower
towupper
ungetwc
watoll
wcscat
wcschr
wcscmp
wcscoll
wcscpy
wcscspn
wcsftime
wcslen
wcsncat
wcsncmp
wcsncpy
wcspbrk
wcsrchr
wcsspn
wcstod
wcstok
wcstol
wcstoul
wcswcs
wcswidth
wcsxfrm
wctype
wcwidth
wscasecmp
wscat
wschr
wscmp
wscol
wscoll
wscpy
wscspn
wsdup
wslen
wsncasecmp
wsncat
wsncmp
wsncpy
wspbrk
wsprintf
wsrchr
wsscanf
wsspn
wstod
wstok
wstol
wstoll
wstostr
wsxfrm

This shorter list contains stub entry points in libintl:

bindtextdomain
dcgettext
dgettext
gettext
textdomain

`ctype` Macros

Character classification and character transformation macros are defined in /usr/include/ctype.h. The Solaris 9 environment provides a set of ctype macros that support character classification and transformation semantics defined by XPG4. For all XPG4 and XPG4.2 applications to automatically access new macros, one of the following conditions must be met:

__XPG4_CHAR_CLASS__ is defined.
_XOPEN_SOURCE and _XOPEN_VERSION=4 are defined.
_XOPEN_SOURCE and _XOPEN_SOURCE_EXTENDED=1 are defined.

Because _XOPEN_SOURCE, _XOPEN_VERSION, and _XOPEN_SOURCE_EXTENDED bring in extra XPG4-related features in addition to the new ctype macros, applications that are not XPG4 or XPG4.2 applications should define __XPG4_CHAR_CLASS__ instead.

Corresponding ctype functions also exist. The Solaris 9 environment functions also support XPG4 semantics. Refer to the ctype(3C) man page for details.

Internationalization APIs in `libc`

The Solaris 9 environment offers two sets of APIs:

Multibyte (file codes)
Wide characters (process code)

Wide-character codes are fixed-width units of logical entities. Therefore, when you use wide characters, you do not have to keep track of character boundaries as you do when you use multibyte characters.

When a program takes input from a file, you can convert your file's multibyte data into wide-character process code directly with input functions like fscanf(3S) and fwscanf(3S) or by using conversion functions like mbtowc(3C) and mbsrtowcs(3C) after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf(3S) and fprintf(3S), or apply conversion functions like wctomb(3C) and wcsrtombs(3C) before the output.
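
When a program does work on multibyte data directly, it has to find the character boundaries itself. The following sketch shows one way to do that with the restartable conversion function mbrtowc(); the function name count_chars is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters (not bytes) in the multibyte string "s",
 * according to the current locale. Returns -1 if the string contains
 * an invalid or incomplete multibyte sequence.
 */
long count_chars(const char *s)
{
    mbstate_t state;
    size_t left, n;
    long count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);    /* initial conversion state */
    left = strlen(s);

    while (left > 0) {
        /* n is the number of bytes that form the next character. */
        n = mbrtowc(NULL, s, left, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;
        if (n == 0)
            break;                      /* null character reached */
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

In a multibyte locale the result can be smaller than strlen(s); in the C locale with ASCII input the two values are equal.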

The tables in the remainder of this chapter describe the internationalization APIs included in the Solaris 9 product.

The following table describes the messaging function APIs in libc.

Table 2–1 Messaging Functions in libc


Library Routine	Description
`catclose()`	Close a message catalog
`catgets()`	Read a program message
`catopen()`	Open a message catalog
`dgettext()`	Get a message from a message catalog with domain specified
`dcgettext()`	Get a message from a message catalog with domain and category specified
`textdomain()`	Set and query the current domain
`bindtextdomain()`	Bind the path for a message domain
`gettext()`	Retrieve text string from message database
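
As an illustration of the catalog functions in this table, the following sketch opens a catalog, reads one message, and closes the catalog again. The function name fetch_message and its parameters are hypothetical. The message is copied into a caller-supplied buffer because the pointer returned by catgets() should not be used after catclose().

```c
#include <assert.h>
#include <nl_types.h>
#include <string.h>

/*
 * Copy message "msg_id" of set "set_id" from "catalog" into "buf".
 * If the catalog or the message is unavailable, "fallback" is copied.
 * Returns 1 if the text came from the catalog, 0 if the fallback was used.
 */
int fetch_message(const char *catalog, int set_id, int msg_id,
                  const char *fallback, char *buf, size_t bufsz)
{
    nl_catd catd;
    const char *msg = fallback;
    int found = 0;

    assert(catalog != NULL && fallback != NULL);
    assert(buf != NULL && bufsz > 0);

    catd = catopen(catalog, NL_CAT_LOCALE);
    if (catd != (nl_catd)-1) {
        /* catgets() returns its last argument when the lookup fails. */
        msg = catgets(catd, set_id, msg_id, fallback);
        found = (msg != fallback);
    }

    strncpy(buf, msg, bufsz - 1);
    buf[bufsz - 1] = '\0';

    if (catd != (nl_catd)-1)
        catclose(catd);
    return found;
}
```

Passing the untranslated text as the fallback keeps the program usable when no catalog is installed for the current locale.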

The following table describes the code conversion function APIs in libc.

Table 2–2 Code Conversion in libc


Library Routine	Description
`iconv()`	Convert codes
`iconv_close()`	Deallocate the conversion descriptor
`iconv_open()`	Allocate the conversion descriptor
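
The three calls are normally used together, as in the following sketch. The function name convert_codeset is hypothetical, and the codeset names that iconv_open() accepts vary by platform, so any names passed in (for example "UTF-8" or "UCS-4BE") are assumptions to check against the installed conversions. Some systems declare the second argument of iconv() as const char **, in which case the cast below is unnecessary.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert "inlen" bytes at "in" from codeset "fromcode" to "tocode".
 * Returns the number of bytes written to "out", or (size_t)-1 on error.
 */
size_t convert_codeset(const char *tocode, const char *fromcode,
                       const char *in, size_t inlen,
                       char *out, size_t outsz)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() does not modify the input */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outsz;
    size_t rc;

    assert(tocode != NULL && fromcode != NULL);
    assert(in != NULL && out != NULL);

    cd = iconv_open(tocode, fromcode);      /* allocate descriptor */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* Flush any shift sequence for stateful codesets. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);                        /* deallocate descriptor */
    if (rc == (size_t)-1)
        return (size_t)-1;
    return outsz - outleft;
}
```

On failure, errno distinguishes an invalid input sequence (EILSEQ), a truncated input sequence (EINVAL), and a full output buffer (E2BIG).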

The following table describes the regular expression APIs in libc.

Table 2–3 Regular Expressions in libc


Library Routine	Description
`regcomp()`	Compile the regular expression
`regexec()`	Execute regular expression matching
`regerror()`	Provide a mapping from error codes to error messages
`regfree()`	Free memory allocated by `regcomp()`
`fnmatch()`	Match file name or path name
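
A typical sequence is regcomp(), then regexec(), then regfree(). The sketch below wraps that sequence in a hypothetical helper named regex_matches and uses extended regular expression syntax.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Test whether "text" matches the extended regular expression "pattern".
 * Returns 1 on a match, 0 on no match, and -1 if the pattern is invalid
 * or matching fails for another reason.
 */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);                   /* release memory from regcomp() */

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

When a pattern fails to compile, regerror() can turn the error code from regcomp() into a message suitable for display.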

The following table describes the wide character function APIs in libc.

Table 2–4 Wide Character Class in libc


Library Routine	Description
`wctype()`	Define character class
`wctrans()`	Define character mapping
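
These two functions look up a class or a mapping by name, and the result is then applied with iswctype() or towctrans(). The following sketch uses the standard names "digit" and "toupper"; the helper names count_in_class and map_char are hypothetical.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count the wide characters in "ws" that belong to the named class.
 * Returns -1 if the class name is not valid in the current locale.
 */
long count_in_class(const wchar_t *ws, const char *class_name)
{
    wctype_t cls;
    long count = 0;

    assert(ws != NULL && class_name != NULL);
    cls = wctype(class_name);
    if (cls == (wctype_t)0)
        return -1;

    for (; *ws != L'\0'; ws++)
        if (iswctype((wint_t)*ws, cls))
            count++;
    return count;
}

/*
 * Apply the named mapping to "wc". Returns "wc" unchanged if the
 * mapping name is not valid in the current locale.
 */
wint_t map_char(wint_t wc, const char *mapping_name)
{
    wctrans_t mapping;

    assert(mapping_name != NULL);
    mapping = wctrans(mapping_name);
    if (mapping == (wctrans_t)0)
        return wc;
    return towctrans(wc, mapping);
}
```

Looking classes up by name lets a program use locale-specific classes without a dedicated isw*() function for each one.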

The following table describes the function in libc that modifies and queries the locale.

Table 2–5 Modify and Query Locale in libc


Library Routine	Description
`setlocale()`	Modify and query a program's locale

The following table describes the functions in libc that query locale data.

Table 2–6 Query Locale Data in libc


Library Routine	Description
`nl_langinfo()`	Get language and cultural information of current locale
`localeconv()`	Get monetary and numeric formatting information of current locale
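
Both functions read from the locale that is currently in effect, so the values change after a call to setlocale(). The following sketch wraps each in a small accessor; the names codeset_name and decimal_point are hypothetical. The exact codeset string reported for a given locale is platform-specific.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Return the name of the codeset used by the current locale. */
const char *codeset_name(void)
{
    const char *name = nl_langinfo(CODESET);

    assert(name != NULL);
    return name;
}

/* Return the decimal-point string used for non-monetary numbers. */
const char *decimal_point(void)
{
    const struct lconv *lc = localeconv();

    assert(lc != NULL);
    return lc->decimal_point;
}
```

Both accessors return pointers to storage owned by the library, so a caller that needs the value after a later setlocale() call should copy it first.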

The following table describes the character classification function APIs in libc.

Table 2–7 Character Classification and Transliteration in libc


Library Routine	Description
`isalpha()`	Is character alphabetic?
`isupper()`	Is character uppercase?
`islower()`	Is character lowercase?
`isdigit()`	Is character a digit?
`isxdigit()`	Is character a hex digit?
`isalnum()`	Is character alphabetic or a digit?
`isspace()`	Is character a space?
`ispunct()`	Is character a punctuation mark?
`isprint()`	Is character printable?
`iscntrl()`	Is character a control character?
`isascii()`	Is character an ASCII character?
`isgraph()`	Is character a visible character?
`isphonogram()`	Is wide character a phonogram?
`isideogram()`	Is wide character an ideogram?
`isenglish()`	Is wide character in English alphabet from a supplementary codeset?
`isnumber()`	Is wide character a digit from a supplementary codeset?
`isspecial()`	Is special wide character from a supplementary codeset?
`iswalpha()`	Is wide character alphabetic?
`iswupper()`	Is wide character uppercase?
`iswlower()`	Is wide character lowercase?
`iswdigit()`	Is wide character a digit?
`iswxdigit()`	Is wide character a hex digit?
`iswalnum()`	Is wide character an alphabetic character or digit?
`iswspace()`	Is wide character a white space?
`iswpunct()`	Is wide character a punctuation mark?
`iswprint()`	Is wide character a printable character?
`iswgraph()`	Is wide character a visible character?
`iswcntrl()`	Is wide character a control character?
`iswascii()`	Is wide character an ASCII character?
`toupper()`	Convert a lowercase character to uppercase.
`tolower()`	Convert an uppercase character to lowercase.
`towupper()`	Convert a lowercase wide character to uppercase.
`towlower()`	Convert an uppercase wide character to lowercase.
`towctrans()`	Wide character mapping.

The following table describes the character collation function APIs in libc.

Table 2–8 Character Collation in libc


Library Routine	Description
`strcoll()`	Collate character strings
`strxfrm()`	Transform character strings for comparison
`wcscoll()`	Collate wide-character strings
`wcsxfrm()`	Transform wide-character strings for comparison
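
A common use of strcoll() is as the comparison step of a sort, so that the order follows the collation rules of the current locale rather than raw byte values. The sketch below is one way to do this with qsort(); the names compare_collated and sort_strings are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() callback: compare two strings using locale collation order. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Sort an array of "n" strings in the current locale's collation order. */
void sort_strings(const char **items, size_t n)
{
    assert(items != NULL || n == 0);
    qsort(items, n, sizeof *items, compare_collated);
}
```

When the same strings are compared many times, transforming each one once with strxfrm() and comparing the results with strcmp() can be cheaper than repeated strcoll() calls.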

The following table describes the monetary handling function APIs in libc.

Table 2–9 Monetary Formatting in libc


Library Routine	Description
`localeconv()`	Get monetary formatting information for the current locale
`strfmon()`	Convert monetary value to string representation

The following table describes the date and time formatting in libc.

Table 2–10 Date and Time Formatting in libc


Library Routine	Description
`getdate()`	Convert user format date and time.
`strftime()`	Convert date and time to string representation. The `%u` conversion specification conforms to the X/Open CAE Specification, System Interfaces and Headers, Issue 4, Version 2. It represents a weekday as a decimal number [1,7], with 1 representing Monday.
`strptime()`	Date and time conversion.
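
The `%u` behavior described for strftime() can be observed with a short sketch such as the following. The function name iso_weekday is hypothetical. Note that the tm_wday field counts from Sunday as 0, whereas `%u` counts from Monday as 1.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Return the weekday of "tm" as formatted by the %u conversion:
 * 1 for Monday through 7 for Sunday. Returns -1 if formatting fails.
 */
int iso_weekday(const struct tm *tm)
{
    char buf[8];

    assert(tm != NULL);
    if (strftime(buf, sizeof buf, "%u", tm) == 0)
        return -1;
    return atoi(buf);
}
```

A struct tm whose tm_wday is 0 (Sunday) therefore yields 7, not 0.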

The following table describes the multibyte handling function APIs in libc.

Table 2–11 Multibyte Handling in libc


Library Routine	Description
`btowc()`	Single-byte to wide-character conversion
`mbrlen()`	Get number of bytes in character (restartable)
`mbsinit()`	Determine conversion object status
`mbrtowc()`	Convert a character to a wide-character code (restartable)
`mbsrtowcs()`	Convert a character string to a wide-character string (restartable)
`mblen()`	Get number of bytes in a character
`mbtowc()`	Convert a character to a wide-character code
`mbstowcs()`	Convert a character string to a wide-character string

The following table describes the wide character and string handling in libc.

Table 2–12 Wide Character and String Handling in libc


Library Routine	Description
`wcsncat()`	Concatenate wide-character strings to length `n`
`wsdup()`	Duplicate wide-character string
`wcscmp()`	Compare wide-character strings
`wcsncmp()`	Compare wide-character strings to length `n`
`wcscpy()`	Copy wide-character strings
`wcsncpy()`	Copy wide-character strings to length `n`
`wcschr()`	Find character in wide-character string
`wcsrchr()`	Find character in wide-character string from right
`wcslen()`	Get length of wide-character string
`wscol()`	Return display width of wide-character string
`wcsspn()`	Return span of one wide-character string in another
`wcscspn()`	Return span of one wide-character string not in another
`wcspbrk()`	Return pointer to one wide-character string in another
`wcstok()`	Move token through wide-character string
`wcswcs()`	Find string in wide-character string
`wcstombs()`	Convert wide-character string to multibyte string
`wctomb()`	Convert wide-character to multibyte character
`wcwidth()`	Determine number of column positions of a wide character
`wcswidth()`	Determine number of column positions of a wide-character string
`wctob()`	Wide character to single byte conversion
`wcrtomb()`	Convert a wide-character code to a character (restartable)
`wcstol()`	Convert wide-character string to long integer
`wcstoul()`	Convert wide-character string to unsigned long integer
`wcstod()`	Convert wide-character string to double precision
`wcsrtombs()`	Convert a wide-character string to a character string (restartable)
`wcscat()`	Concatenate wide-character strings
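
These functions mirror the byte-string functions in string.h, but every index refers to a whole character. As a small illustration, the following sketch finds the last component of a wide-character path with wcsrchr(); the function name wide_basename is hypothetical.

```c
#include <assert.h>
#include <wchar.h>

/*
 * Return a pointer to the part of "path" that follows the last '/'.
 * If "path" contains no '/', the whole string is returned.
 */
const wchar_t *wide_basename(const wchar_t *path)
{
    const wchar_t *slash;

    assert(path != NULL);
    slash = wcsrchr(path, L'/');    /* search from the right */
    return slash != NULL ? slash + 1 : path;
}
```

For the input L"/usr/xpg4/bin/ls" the function returns a pointer to L"ls".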

The following table describes the formatted wide-character input and output in libc.

Table 2–13 Formatted Wide-character Input and Output in libc


Library Routine	Description
`wsprintf()`	Generate wide-character string according to format
`wsscanf()`	Formatted input conversion
`fwprintf()`	Print formatted wide-character output
`fwscanf()`	Convert formatted wide-character input
`wprintf()`	Print formatted wide-character output
`wscanf()`	Convert formatted wide-character input
`swprintf()`	Print formatted wide-character output
`swscanf()`	Convert formatted wide-character input
`vfwprintf()`	Wide-character formatted output of a `stdarg` argument list
`vswprintf()`	Wide-character formatted output of a `stdarg` argument list
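
The following sketch shows swprintf() writing into a wide-character buffer. The function name format_open_error is hypothetical, and the sketch assumes the standard swprintf() form that takes the buffer size, counted in wide characters, as its second argument.

```c
#include <assert.h>
#include <wchar.h>

/*
 * Format "<file_name> cannot be opened." into "buf", which can hold
 * "n" wide characters including the terminating null.
 * Returns the number of wide characters written, or a negative value
 * if the result does not fit.
 */
int format_open_error(wchar_t *buf, size_t n, const wchar_t *file_name)
{
    assert(buf != NULL && file_name != NULL && n > 0);
    /* %ls takes a wide-character string argument. */
    return swprintf(buf, n, L"%ls cannot be opened.", file_name);
}
```

Unlike snprintf(), swprintf() reports a result that does not fit as an error instead of returning the length that would have been needed.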

The following table describes the wide-string function APIs in libc.

Table 2–14 Wide Strings in libc


Library Routine	Description
`wscasecmp()`	Compare wide-character strings, ignore case differences
`wsncasecmp()`	Process code-string operations
`wcsstr()`	Find a wide-character substring
`wmemchr()`	Find a wide character in memory
`wmemcmp()`	Compare wide characters in memory
`wmemcpy()`	Copy wide characters in memory
`wmemmove()`	Copy wide characters in memory with overlapping areas
`wmemset()`	Set wide characters in memory

The following table describes the wide-character input and output in libc.

Table 2–15 Wide-character Input and Output in libc


Library Routine	Description
`fgetwc()`	Get multibyte character from stream, convert to wide character
`getwchar()`	Get multibyte character from `stdin`, convert to wide character
`fgetws()`	Get multibyte string from stream, convert to wide character
`getws()`	Get multibyte string from `stdin`, convert to wide character
`fputwc()`	Convert wide character to multibyte character, puts to stream
`fwide()`	Set stream orientation
`putwchar()`	Convert wide character to multibyte character, puts to `stdout`
`fputws()`	Convert wide-character string to multibyte string, puts to stream
`putws()`	Convert wide-character string to multibyte string, puts to `stdout`
`ungetwc()`	Push a wide character back into input stream.

`genmsg` Utility

The new genmsg utility can be used with the catgets() family of functions to create internationalized source message catalogs. The utility examines a source program file for calls to functions in the catgets() family and builds a source message catalog from the information it finds. For example:

% cat example.c
	...
	/* NOTE: %s is a file name */
	printf(catgets(catd, 5, 1, "%s cannot be opened."));
	/* NOTE: "Read" is a past participle, not a present
			tense verb */
	printf(catgets(catd, 5, 2, "Read"));
	...
% genmsg -c NOTE example.c
The following file(s) have been created.
			new msg file = "example.c.msg"
% cat example.c.msg
$quote "
$set 5
1			"%s cannot be opened."
	/* NOTE: %s is a file name */
2			"Read"
	/* NOTE: "Read" is a past participle, not a present
			tense verb */

In the above example, genmsg is run on the source file example.c, which produces a source message catalog named example.c.msg. The -c option with the argument NOTE causes genmsg to include comments in the catalog. If a comment in the source program contains the string specified, the comment appears in the message catalog after the next string extracted from a call to catgets.

You can use genmsg to number the messages in a message set automatically.

For more information, see the genmsg(1) man page.

To generate a formatted message catalog file, use the gencat(1) utility.

For information on the message extraction utility for Portable Message files (.po files) and on how to generate message object files (.mo files) from the .po files, see the xgettext(1) and msgfmt(1) man pages, respectively.

User Defined and User Extensible Code Conversions

Solaris users can create user-defined codeset converters by using the geniconvtbl utility.

This utility enables user-defined and user-customizable codeset conversions with a standard system utility and interface like iconv(1) and iconv(3C). This feature enhances the ability of an application to deal with incompatible data types, particularly data generated from proprietary or legacy applications. Modification to existing Solaris codeset conversions is also supported.

More details and also examples can be found in the geniconvtbl(1) and geniconvtbl(4) man pages. Sample input source files for the utility are also available for reference from the /usr/lib/iconv/geniconvtbl/srcs/ directory.

Once the user-defined code conversions are prepared and placed as specified in the geniconvtbl(1) man page, users can use the code conversions from the iconv(1) utility and the iconv(3C) functions of both 32-bit and 64-bit Solaris operating environments.