International Language Environments Guide

Chapter 2 General Internationalization Features

This chapter discusses several internationalization features contained in the Oracle Solaris operating system. The chapter covers the following topics.

Support for Code Set Independence

EUC is an abbreviation for Extended UNIX® Code. The Oracle Solaris operating system supports non-EUC encodings such as PC-Kanji (better known as Shift_JIS) in Japan, Big5 in Taiwan, and GBK in the People's Republic of China. Because a large part of the computer market demands non-EUC codeset support, the current Oracle Solaris environment provides a solid framework to enable both EUC and non-EUC code set support. This support is called Code Set Independence, or CSI.

The goal of CSI is to remove dependencies on specific code sets or encoding methods from Oracle Solaris operating system libraries and commands. The CSI architecture enables the Oracle Solaris operating system to support any UNIX file system safe encoding. CSI supports a number of new code sets, such as UTF-8, PC-Kanji, and Big5.

CSI Approach

Code set independence enables application and platform software developers to keep their code independent of any encoding, such as UTF-8. CSI also provides the ability to adopt any new encoding without having to modify the source code. This architectural approach differs from Java internationalization because applications do not have to be UTF-16–dependent.

Many existing internationalized applications (for example, Motif) automatically inherit CSI support from the underlying system. These applications work in the new locales without modification.

CSI is inherently independent from any code sets. However, the following assumptions about file code encodings (code sets) still apply to the current Oracle Solaris system:

CSI-enabled Commands

This section lists the CSI-enabled commands in the current Oracle Solaris environment. The man page for each command includes an attribute section that indicates whether the command is CSI-enabled.

All commands are in the /usr/bin directory, unless otherwise noted.

/usr/lib/diffh

cat

pack

/usr/sbin/accept

catman

paste

/usr/sbin/reject

chgrp

pcat

/usr/ucb/lpr

chmod

pg

/usr/xpg4/bin/awk

chown

printf

/usr/xpg4/bin/cp

cmp

priocntl

/usr/xpg4/bin/date

col

ps

/usr/xpg4/bin/du

comm

pwd

/usr/xpg4/bin/ed

compress

rcp

/usr/xpg4/bin/edit

cpio

red

/usr/xpg4/bin/egrep

csh

remsh

/usr/xpg4/bin/env

csplit

rksh

/usr/xpg4/bin/ex

cut

rsh

/usr/xpg4/bin/expr

diff

rmdir

/usr/xpg4/bin/fgrep

diff3

script

/usr/xpg4/bin/lp

disable

sdiff

/usr/xpg4/bin/ls

echo

settime

/usr/xpg4/bin/more

expand

sh

/usr/xpg4/bin/mv

file

split

/usr/xpg4/bin/nice

find

strconf

/usr/xpg4/bin/nohup

fold

strings

/usr/xpg4/bin/od

ftp

sum

/usr/xpg4/bin/pr

gencat

tabs

/usr/xpg4/bin/rm

getopt

tar

/usr/xpg4/bin/sed

getoptcvt

tee

/usr/xpg4/bin/sort

head

touch

/usr/xpg4/bin/tail

join

tty

/usr/xpg4/bin/tr

jsh

uncompress

/usr/xpg4/bin/vedit

kill

unexpand

/usr/xpg4/bin/vi

ksh

uniq

/usr/xpg4/bin/view

lp

unpack

acctcom

man

wc

apropos

mkdir

whatis

batch

msgfmt

write

bdiff

news

xargs

cancel

nroff

zcat

CSI-enabled Libraries

Nearly all functions in libc (/usr/lib/libc.so) are CSI-enabled. However, the following functions in libc are not CSI-enabled and therefore are EUC-dependent functions:

In the current Oracle Solaris environment, libgen (/usr/ccs/lib/libgen.a) and libcurses (/usr/ccs/lib/libcurses.a) are internationalized but not CSI-enabled.

Locale Database

The locale database format and structure are private and subject to change in a future release. When you develop internationalized applications, use the internationalization APIs in libc, which are described in Internationalization APIs in libc, rather than accessing the locale database directly.


Note –

When you work in the Oracle Solaris environment, use the locale databases that are included with the current Oracle Solaris release. Do not use locales from previous Oracle Solaris versions.


Process Code Format

The process code format, which is also known as wide-character code format in the Oracle Solaris operating system, is private and subject to change in a future release. Therefore, when you develop an internationalized application, do not rely on any particular process code format. Instead, use the internationalization APIs in libc described in Internationalization APIs in libc.


Note –

The process code for all Unicode locales is in UTF-32 representation. For more detail on UTF-32, refer to the Unicode Standard Annex #19: UTF-32 and Unicode Standard Annex #27: Unicode 3.1 from the Unicode Consortium or http://www.unicode.org/.


Multibyte Support Environment

A multibyte character is a character that cannot be stored in a single byte, such as Chinese, Japanese, or Korean characters. These characters require 2, 3, or 4 bytes of storage. A more precise definition can be found in ISO/IEC 9899:1990 subclause 3.13.

Amendment 1 to ANSI C (ISO/IEC 9899:1990) added new internationalization features, collectively known as the Multibyte Support Environment (MSE). Amendment 1 defines additional internationalization APIs for stateful multibyte code sets and for better wide-character handling.

The programming model enables these multibyte characters to be read in as logical units and stored internally as wide characters. These wide characters can be processed by the program as logical entities. Finally, these wide characters can be written out, undergoing appropriate translation, as logical units.

This procedure is analogous to the way single-byte characters are read in, manipulated, and written out again. The MSE enables programs to handle multibyte characters using the same programming model that is used for single-byte characters.
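The read, process, and write steps of this model can be sketched with the standard conversion functions. The following is a minimal illustration, not a definitive implementation: the helper name upper_mbs and the fixed 256-element work buffer are assumptions made for the example, and the result depends on the LC_CTYPE locale in effect when the function is called.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

#define WORK_CHARS 256  /* arbitrary work buffer size chosen for this sketch */

/*
 * Uppercase a multibyte string by way of wide characters.
 * Returns 0 on success, or -1 if the input is not valid in the current
 * locale's encoding or if a buffer is too small.
 * upper_mbs is a hypothetical helper name, not a libc function.
 */
int upper_mbs(const char *in, char *out, size_t outsize)
{
    wchar_t wbuf[WORK_CHARS];
    size_t n, m, i;

    /* Step 1: read the multibyte string in as wide characters. */
    n = mbstowcs(wbuf, in, WORK_CHARS);
    if (n == (size_t)-1 || n >= WORK_CHARS)
        return -1;  /* invalid sequence, or no room for the terminator */

    /* Step 2: process each character as one logical unit. */
    for (i = 0; i < n; i++)
        wbuf[i] = (wchar_t)towupper((wint_t)wbuf[i]);

    /* Step 3: write the result back out as multibyte characters. */
    m = wcstombs(out, wbuf, outsize);
    if (m == (size_t)-1 || m >= outsize)
        return -1;  /* unconvertible character, or output truncated */
    return 0;
}
```

A caller would normally run setlocale(LC_ALL, "") once at startup so that the conversions follow the user's code set rather than the C locale.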

Dynamically Linked Applications

You can link applications with the system libraries, such as libc, by using dynamic linking or static linking. Any application that requires internationalization features in the system libraries must be dynamically linked. If the application has been statically linked, any attempt to use the setlocale function to set the locale to anything other than C or POSIX fails. Statically linked applications can operate only in the C and POSIX locales.

By default, the linker program tries to link the application dynamically. If the command-line options to the linker and the compiler include -Bstatic or -dn specifications, your application might be statically linked. You can check whether an existing application is dynamically linked using the /usr/bin/ldd command.

For example, the response to the following command indicates that the /sbin/sh command is not a dynamically linked program:


% /usr/bin/ldd /sbin/sh
ldd: /sbin/sh: file is not a dynamic executable or shared object

The response to the following command indicates that the /usr/bin/ls command has been dynamically linked with two libraries, libc.so.1 and libdl.so.1.


% /usr/bin/ldd /usr/bin/ls
libc.so.1 => 	/usr/lib/libc.so.1
libdl.so.1 => /usr/lib/libdl.so.1
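Because setlocale can fail, for example in a statically linked program, an application can check its return value instead of assuming that the requested locale is active. The following sketch shows one way to do so; the helper name use_locale is an assumption made for the example.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Try to switch to the requested locale. An empty string means
 * "take the locale from the environment".
 * Returns 1 on success, 0 if setlocale() rejected the request.
 * use_locale is a hypothetical helper name.
 */
int use_locale(const char *name)
{
    if (setlocale(LC_ALL, name) == NULL) {
        /* On failure the program's locale is left unchanged. */
        fprintf(stderr, "setlocale(LC_ALL, \"%s\") failed\n", name);
        return 0;
    }
    return 1;
}
```

A typical call at program startup is use_locale(""), followed by a fallback or a diagnostic if the function returns 0.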

Changed Interfaces

The functions that were formerly in libw and libintl have moved to libc. The libw and libintl libraries now provide only stub entry points for these functions.

The libw and libintl shared objects ensure runtime compatibility for existing applications and, together with the archives, provide compilation environment compatibility for building applications. However, you no longer need to link applications against libw or libintl.

The following list shows the stub entry points in libw:

fgetwc

iswdigit

wcscat

wcstol

wslen

fgetws

iswgraph

wcschr

wcstoul

wsncasecmp

fputwc

iswlower

wcslen

wcswcs

wsncat

fputws

iswprint

wcscmp

wcswidth

wsncmp

getwc

iswpunct

wcscoll

wcsxfrm

wsncpy

getwchar

iswspace

wcscpy

wctype

wspbrk

getws

iswupper

wcscspn

wcwidth

wsprintf

isenglish

iswxdigit

wcsftime

wscasecmp

wsrchr

isideogram

putwc

wcsncat

wscat

wsscanf

isnumber

putwchar

wcsncmp

wschr

wsspn

isphonogram

putws

wcsncpy

wscmp

wstod

isspecial

strtows

wcspbrk

wscol

wstok

iswalnum

towlower

wcsrchr

wscoll

wstol

iswalpha

towupper

wcsspn

wscpy

wstoll

iswcntrl

ungetwc

wcstod

wscspn

wstostr

iswctype

watoll

wcstok

wsdup

wsxfrm

The following list shows the stub entry points in libintl:

bindtextdomain

dcgettext

dgettext

gettext

textdomain

ctype Macros

Character classification and character transformation macros are defined in /usr/include/ctype.h. The current Oracle Solaris environment provides a set of ctype macros that support character classification and transformation semantics defined by XPG4. For all XPG4 and XPG4.2 applications to automatically access new macros, one of the following conditions must be met:

Because _XOPEN_SOURCE, _XOPEN_VERSION, and _XOPEN_SOURCE_EXTENDED bring in extra XPG4-related features in addition to the new ctype macros, applications that are not XPG4 or XPG4.2 applications should define __XPG4_CHAR_CLASS__ instead.

Corresponding ctype functions also exist. In the current Oracle Solaris environment, these functions also support XPG4 semantics.
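As a rough illustration of the __XPG4_CHAR_CLASS__ approach, the following sketch defines the macro before including ctype.h and then uses a classification macro. The helper name count_alpha is hypothetical, and the assumption that other platforms simply ignore the definition has not been verified for every system.

```c
/*
 * Request the XPG4 character classification macros on Oracle Solaris.
 * The definition must come before the first #include of <ctype.h>.
 * Assumption: platforms that do not know this macro ignore it.
 */
#define __XPG4_CHAR_CLASS__
#include <ctype.h>
#include <stddef.h>

/*
 * Count the alphabetic characters in a single-byte string, as
 * classified by the current LC_CTYPE locale.
 * count_alpha is a hypothetical helper name.
 */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: a negative char value is not a valid argument. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```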

Internationalization APIs in libc

The current Oracle Solaris environment offers two sets of APIs:

Wide-character codes are fixed-width units, each of which represents one logical character. Therefore, when you work with wide characters, you do not have to keep track of character boundaries as you must when you process multibyte characters directly.

When a program takes input from a file, you can convert your file's multibyte data into wide-character process code directly with input functions like fscanf and fwscanf or by using conversion functions like mbtowc and mbsrtowcs after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf and fprintf or apply conversion functions like wctomb and wcsrtombs before the output.

The tables in the remainder of this chapter describe the internationalization APIs included in the current Oracle Solaris system.

The following table describes the messaging function APIs in libc.

Table 2–1 Messaging Functions in libc

Library Routine     Description
bindtextdomain()    Bind the path for a message domain
catclose()          Close a message catalog
catgets()           Read a program message
catopen()           Open a message catalog
dcgettext()         Get a message from a message catalog with domain and category specified
dgettext()          Get a message from a message catalog with domain specified
gettext()           Retrieve a text string from the message database
textdomain()        Set and query the current domain
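The catopen(), catgets(), and catclose() routines are usually used together. The following sketch looks up one message and falls back to a default string when the catalog cannot be opened. The helper name fetch_message and its copy-to-buffer design are assumptions made for the example.

```c
#include <nl_types.h>
#include <string.h>

/*
 * Copy message msg_id of set set_id from the named catalog into buf.
 * If the catalog cannot be opened or the message is missing, the
 * fallback text is copied instead.
 * fetch_message is a hypothetical helper name.
 */
void fetch_message(const char *catalog, int set_id, int msg_id,
                   const char *fallback, char *buf, size_t bufsize)
{
    const char *text = fallback;
    nl_catd catd = catopen(catalog, NL_CAT_LOCALE);

    /* catgets() returns its last argument when the lookup fails. */
    if (catd != (nl_catd)-1)
        text = catgets(catd, set_id, msg_id, fallback);

    if (bufsize > 0) {
        strncpy(buf, text, bufsize - 1);
        buf[bufsize - 1] = '\0';
    }

    /* The returned pointer may refer to catalog storage: copy, then close. */
    if (catd != (nl_catd)-1)
        catclose(catd);
}
```

Copying the text before calling catclose() avoids keeping a pointer into a catalog that is no longer open. A long-running program would more likely open the catalog once and keep it open.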

The following table describes the code conversion function APIs in libc.

Table 2–2 Code Conversion in libc

Library Routine     Description
iconv()             Convert codes
iconv_close()       Deallocate the conversion descriptor
iconv_open()        Allocate the conversion descriptor
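The three routines are used in sequence: allocate a descriptor, convert, and deallocate. The following is a minimal sketch that converts one buffer in a single call. The helper name convert_buffer is hypothetical, and the available code set names depend on the conversions installed on the system.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes of in from code set fromcode to code set tocode.
 * Returns the number of bytes written to out, or (size_t)-1 on failure
 * (unsupported conversion, invalid input, or output buffer too small).
 * convert_buffer is a hypothetical helper name.
 */
size_t convert_buffer(const char *tocode, const char *fromcode,
                      const char *in, size_t inlen,
                      char *out, size_t outsize)
{
    /* iconv() takes char **; the input bytes themselves are not modified.
     * Some systems may declare the parameter as const char ** instead. */
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outsize;
    size_t rc;
    iconv_t cd = iconv_open(tocode, fromcode);

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1) {
        /* Flush any pending shift sequence for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    }
    iconv_close(cd);

    return (rc == (size_t)-1) ? (size_t)-1 : outsize - outleft;
}
```

Real applications that convert large inputs usually call iconv() in a loop and handle the E2BIG and EINVAL error cases separately. This sketch treats every error as a failure.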

The following table describes the regular expression APIs in libc.

Table 2–3 Regular Expressions in libc

Library Routine     Description
fnmatch()           Match file name or path name
regcomp()           Compile the regular expression
regerror()          Provide a mapping from error codes to error messages
regexec()           Execute regular expression matching
regfree()           Free memory allocated by regcomp()
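A typical call sequence is regcomp(), regexec(), and regfree(), with regerror() used to report a pattern that does not compile. The following sketch wraps that sequence; the helper name regex_matches is an assumption made for the example.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if text matches the extended regular expression pattern,
 * 0 if it does not, and -1 if the pattern fails to compile.
 * regex_matches is a hypothetical helper name.
 */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    char msg[128];
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    if (rc != 0) {
        /* regerror() maps the error code to a readable message. */
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Character class expressions such as [[:alpha:]] follow the current locale, which is usually preferable to explicit ranges such as [a-z] in internationalized code.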

The following table describes the wide character function APIs in libc.

Table 2–4 Wide Character Class in libc

Library Routine     Description
wctrans()           Define character mapping
wctype()            Define character class
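The descriptors that wctype() and wctrans() return are passed to iswctype() and towctrans(), which allows class and mapping names to be chosen at run time. The following sketch shows both lookups; the helper names in_class and map_char are hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Return 1 if wc belongs to the named character class of the current
 * locale (for example "alpha" or "upper"). Return 0 otherwise, or if
 * the class name is unknown.
 * in_class is a hypothetical helper name.
 */
int in_class(wint_t wc, const char *class_name)
{
    wctype_t desc = wctype(class_name);

    if (desc == (wctype_t)0)
        return 0;
    return iswctype(wc, desc) != 0;
}

/*
 * Apply the named mapping ("tolower" or "toupper") to wc.
 * Returns wc unchanged if the mapping name is unknown.
 * map_char is a hypothetical helper name.
 */
wint_t map_char(wint_t wc, const char *mapping_name)
{
    wctrans_t desc = wctrans(mapping_name);

    if (desc == (wctrans_t)0)
        return wc;
    return towctrans(wc, desc);
}
```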

The following table describes the function in libc that modifies and queries the locale.

Table 2–5 Modify and Query Locale in libc

Library Routine     Description
setlocale()         Modify and query a program's locale

The following table describes the functions in libc that query locale data.

Table 2–6 Query Locale Data in libc

Library Routine     Description
localeconv()        Get monetary and numeric formatting information of current locale
nl_langinfo()       Get language and cultural information of current locale
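Both routines return information about whatever locale is currently in effect. The following sketch reports the code set name and the decimal point character; the helper name describe_locale and the output layout are assumptions made for the example.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/*
 * Write a one-line summary of the current locale into buf:
 * the code set name, then the decimal point character.
 * Returns the value that snprintf() reports.
 * describe_locale is a hypothetical helper name.
 */
int describe_locale(char *buf, size_t bufsize)
{
    const char *codeset = nl_langinfo(CODESET);
    const struct lconv *lc = localeconv();

    return snprintf(buf, bufsize, "codeset=%s decimal_point=%s",
                    codeset, lc->decimal_point);
}
```

The strings that nl_langinfo() and localeconv() return can be overwritten by later calls, so this sketch formats them immediately instead of keeping the pointers.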

The following table describes the character classification function APIs in libc.

Table 2–7 Character Classification and Transliteration in libc

Library Routine     Description
isalnum()           Is character alphabetic or a digit?
isalpha()           Is character alphabetic?
isascii()           Is character an ASCII character?
iscntrl()           Is character a control character?
isdigit()           Is character a digit?
isenglish()         Is wide character in English alphabet from a supplementary code set?
isgraph()           Is character a visible character?
isideogram()        Is wide character an ideogram?
islower()           Is character lowercase?
isnumber()          Is wide character a digit from a supplementary code set?
isphonogram()       Is wide character a phonogram?
isprint()           Is character printable?
ispunct()           Is character a punctuation mark?
isspace()           Is character a white-space character?
isspecial()         Is special wide character from a supplementary code set?
isupper()           Is character uppercase?
iswalnum()          Is wide character an alphabetic character or digit?
iswalpha()          Is wide character alphabetic?
iswascii()          Is wide character an ASCII character?
iswcntrl()          Is wide character a control character?
iswdigit()          Is wide character a digit?
iswgraph()          Is wide character a visible character?
iswlower()          Is wide character lowercase?
iswprint()          Is wide character a printable character?
iswpunct()          Is wide character a punctuation mark?
iswspace()          Is wide character a white-space character?
iswupper()          Is wide character uppercase?
iswxdigit()         Is wide character a hex digit?
isxdigit()          Is character a hex digit?
tolower()           Convert an uppercase character to lowercase
toupper()           Convert a lowercase character to uppercase
towctrans()         Wide-character mapping
towlower()          Convert an uppercase wide character to lowercase
towupper()          Convert a lowercase wide character to uppercase

The following table describes the character collation function APIs in libc.

Table 2–8 Character Collation in libc

Library Routine     Description
strcoll()           Collate character strings
strxfrm()           Transform character strings for comparison
wcscoll()           Collate wide-character strings
wcsxfrm()           Transform wide-character strings for comparison
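A common use of strcoll() is as the comparison step of a sort, so that the resulting order follows the LC_COLLATE rules of the current locale instead of raw byte values. The following sketch combines it with qsort(); the helper names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * qsort() callback that orders two strings by the collation rules of
 * the current LC_COLLATE locale.
 */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/*
 * Sort an array of count strings in place.
 * sort_collated is a hypothetical helper name.
 */
void sort_collated(const char **items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_collated);
}
```

When the same strings are compared many times, transforming each string once with strxfrm() and comparing the results with strcmp() can be cheaper than repeated strcoll() calls.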

The following table describes the monetary handling function APIs in libc.

Table 2–9 Monetary Formatting in libc

Library Routine     Description
localeconv()        Get monetary formatting information for the current locale
strfmon()           Convert monetary value to string representation
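As a brief illustration, the following sketch formats an amount with the national currency format of the current locale. The helper name format_amount is an assumption made for the example, and the exact output text depends entirely on the LC_MONETARY settings in effect.

```c
#include <monetary.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Format amount into buf by using the national currency format of the
 * current LC_MONETARY locale.
 * Returns the number of bytes written, or -1 if buf is too small.
 * format_amount is a hypothetical helper name.
 */
ssize_t format_amount(double amount, char *buf, size_t bufsize)
{
    /* %n selects the national format; %i would select the international one. */
    return strfmon(buf, bufsize, "%n", amount);
}
```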

The following table describes the date and time formatting function APIs in libc.

Table 2–10 Date and Time Formatting in libc

Library Routine     Description
getdate()           Convert user format date and time
strftime()          Convert date and time to string representation. The %u conversion specification conforms to the X/Open CAE Specification, System Interfaces and Headers, Issue 4, Version 2. It represents a weekday as a decimal number [1,7], with 1 now representing Monday.
strptime()          Date and time conversion
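The %u behavior described for strftime() can be observed with a small helper. The following sketch is illustrative only: the name iso_weekday is hypothetical, and it relies on mktime() to compute the day of the week for a calendar date in the local time zone.

```c
#include <time.h>

/*
 * Return the weekday of a calendar date as strftime("%u") reports it:
 * 1 for Monday through 7 for Sunday, or -1 on error.
 * iso_weekday is a hypothetical helper name.
 */
int iso_weekday(int year, int month, int day)
{
    struct tm tm = {0};
    char buf[4];

    tm.tm_year = year - 1900;
    tm.tm_mon = month - 1;
    tm.tm_mday = day;
    tm.tm_hour = 12;   /* midday avoids daylight saving time edge cases */
    tm.tm_isdst = -1;  /* let mktime() decide whether DST applies */

    if (mktime(&tm) == (time_t)-1)  /* also fills in tm_wday */
        return -1;
    if (strftime(buf, sizeof buf, "%u", &tm) == 0)
        return -1;
    return buf[0] - '0';
}
```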

The following table describes the multibyte handling function APIs in libc.

Table 2–11 Multibyte Handling in libc

Library Routine     Description
btowc()             Single-byte to wide-character conversion
mblen()             Get number of bytes in a character
mbrlen()            Get number of bytes in character (restartable)
mbrtowc()           Convert a character to a wide-character code (restartable)
mbsinit()           Determine conversion object status
mbsrtowcs()         Convert a character string to a wide-character string (restartable)
mbstowcs()          Convert a character string to a wide-character string
mbtowc()            Convert a character to a wide-character code
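The restartable routines keep their conversion state in an mbstate_t object that the caller supplies. The following sketch uses mbrtowc() to step through a string one character at a time, which is how a program can find character boundaries in multibyte data. The helper name count_chars is an assumption made for the example.

```c
#include <string.h>
#include <wchar.h>

/*
 * Count the characters (not bytes) in a multibyte string, using the
 * encoding of the current LC_CTYPE locale.
 * Returns (size_t)-1 if an invalid or truncated sequence is found.
 * count_chars is a hypothetical helper name.
 */
size_t count_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);  /* start in the initial conversion state */

    while (remaining > 0) {
        /* len is the number of bytes that form the next character. */
        size_t len = mbrtowc(NULL, s, remaining, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;  /* invalid or incomplete sequence */
        if (len == 0)
            break;              /* null character: end of string */
        s += len;
        remaining -= len;
        count++;
    }
    return count;
}
```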

The following table describes the wide-character and string handling function APIs in libc.

Table 2–12 Wide Character and String Handling in libc

Library Routine     Description
wcrtomb()           Convert a wide-character code to a character (restartable)
wcscat()            Concatenate wide-character strings
wcschr()            Find character in wide-character string
wcscmp()            Compare wide-character strings
wcscpy()            Copy wide-character strings
wcscspn()           Return span of one wide-character string not in another
wcslen()            Get length of wide-character string
wcsncat()           Concatenate wide-character strings to length n
wcsncmp()           Compare wide-character strings to length n
wcsncpy()           Copy wide-character strings to length n
wcspbrk()           Return pointer to one wide-character string in another
wcsrchr()           Find character in wide-character string from right
wcsrtombs()         Convert a wide-character string to a character string (restartable)
wcsspn()            Return span of one wide-character string in another
wcstod()            Convert wide-character string to double precision
wcstok()            Move token through wide-character string
wcstol()            Convert wide-character string to long integer
wcstombs()          Convert wide-character string to multibyte string
wcstoul()           Convert wide-character string to unsigned long integer
wcswcs()            Find string in wide-character string
wcswidth()          Determine number of column positions of a wide-character string
wctob()             Wide character to single byte conversion
wctomb()            Convert wide character to multibyte character
wcwidth()           Determine number of column positions of a wide character
wscol()             Return display width of wide-character string
wsdup()             Duplicate wide-character string
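Most of these routines mirror their single-byte counterparts. As one example, the following sketch uses wcstol() to parse a number from a wide-character string with the same error checks that strtol() would need. The helper name parse_long is hypothetical.

```c
#include <errno.h>
#include <wchar.h>

/*
 * Parse a decimal integer from a wide-character string.
 * Returns 0 and stores the value in *out on success. Returns -1 if the
 * string contains no digits, has trailing characters, or is out of range.
 * parse_long is a hypothetical helper name.
 */
int parse_long(const wchar_t *ws, long *out)
{
    wchar_t *end;
    long value;

    errno = 0;
    value = wcstol(ws, &end, 10);

    if (end == ws || *end != L'\0' || errno == ERANGE)
        return -1;
    *out = value;
    return 0;
}
```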

The following table describes the formatted wide-character input and output function APIs in libc.

Table 2–13 Formatted Wide-character Input and Output in libc

Library Routine     Description
fwprintf()          Print formatted wide-character output
fwscanf()           Convert formatted wide-character input
swprintf()          Print formatted wide-character output
swscanf()           Convert formatted wide-character input
vfwprintf()         Wide-character formatted output of a stdarg argument list
vswprintf()         Wide-character formatted output of a stdarg argument list
wprintf()           Print formatted wide-character output
wscanf()            Convert formatted wide-character input
wsprintf()          Generate wide-character string according to format
wsscanf()           Formatted input conversion
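The following sketch shows swprintf() writing into a wide-character buffer. The helper name format_count and the message layout are assumptions made for the example.

```c
#include <wchar.h>

/*
 * Format "<label>: <count>" into a wide-character buffer that holds
 * bufcount elements.
 * Returns the number of wide characters written, not counting the
 * terminator, or a negative value if the buffer is too small.
 * format_count is a hypothetical helper name.
 */
int format_count(wchar_t *buf, size_t bufcount,
                 const wchar_t *label, int count)
{
    /* %ls takes a wide-character string argument. */
    return swprintf(buf, bufcount, L"%ls: %d", label, count);
}
```

Unlike snprintf(), swprintf() reports an error when the output does not fit instead of returning the length that would have been needed, so callers have to check for a negative result.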

The following table describes the wide string function APIs in libc.

Table 2–14 Wide Strings in libc

Library Routine     Description
wcsstr()            Find a wide-character substring
wmemchr()           Find a wide character in memory
wmemcmp()           Compare wide characters in memory
wmemcpy()           Copy wide characters in memory
wmemmove()          Copy wide characters in memory with overlapping areas
wmemset()           Set wide characters in memory
wscasecmp()         Compare wide-character strings, ignore case differences
wsncasecmp()        Compare wide-character strings to length n, ignore case differences

The following table describes the wide-character input and output function APIs in libc.

Table 2–15 Wide-Character Input and Output in libc

Library Routine     Description
fgetwc()            Get multibyte character from stream, convert to wide character
fgetws()            Get multibyte string from stream, convert to wide-character string
fputwc()            Convert wide character to multibyte character, put to stream
fputws()            Convert wide-character string to multibyte string, put to stream
fwide()             Set stream orientation
getwchar()          Get multibyte character from stdin, convert to wide character
getws()             Get multibyte string from stdin, convert to wide-character string
putwchar()          Convert wide character to multibyte character, put to stdout
putws()             Convert wide-character string to multibyte string, put to stdout
ungetwc()           Push a wide character back into input stream
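With these routines a program can read a stream one character at a time, regardless of how many bytes each character occupies in the file. The following sketch counts the characters on a stream; the helper name count_stream_chars is an assumption made for the example.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Count the characters that remain on a stream. fgetwc() decodes each
 * multibyte sequence according to the LC_CTYPE locale and returns it
 * as one wide character.
 * Returns -1 on a read or encoding error.
 * count_stream_chars is a hypothetical helper name.
 */
long count_stream_chars(FILE *fp)
{
    long count = 0;

    while (fgetwc(fp) != WEOF)
        count++;

    /* WEOF signals both end of file and errors, so check which it was. */
    return ferror(fp) ? -1 : count;
}
```

A stream becomes wide-oriented on its first wide-character operation, so byte-oriented calls such as fgetc() and wide-character calls such as fgetwc() should not be mixed on the same stream.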

genmsg Utility

The new genmsg utility can be used with the catgets() family of functions to create internationalized source message catalogs. The utility examines a source program file for calls to catgets() and builds a source message catalog from the information it finds. For example:

% cat example.c
	...
	/* NOTE: %s is a file name */
	printf(catgets(catd, 5, 1, "%s cannot be opened."));
	/* NOTE: "Read" is a past participle, not a present
 			tense verb */
	printf(catgets(catd, 5, 2, "Read"));
	...
% genmsg -c NOTE example.c
The following file(s) have been created.
			new msg file = "example.c.msg"
% cat example.c.msg
$quote "
$set 5
1			"%s cannot be opened."
	/* NOTE: %s is a file name */
2			"Read"
	/* NOTE: "Read" is a past participle, not a present
			tense verb */

In the above example, genmsg is run on the source file example.c, which produces a source message catalog named example.c.msg. The -c option with the argument NOTE causes genmsg to include comments in the catalog. If a comment in the source program contains the string specified, the comment appears in the message catalog after the next string extracted from a call to catgets.

You can use genmsg to number the messages in a message set automatically.

For more information, see the genmsg(1) man page.

To generate a formatted message catalog file, use the gencat(1) utility.

For information on the message extraction utility for portable message files (.po files) and also on how to generate message object files (.mo files) from the .po files, see the xgettext(1) and msgfmt(1) man pages.

User-Defined and User-Extensible Code Conversions

You can create user-defined codeset converters using the geniconvtbl utility.

This utility enables user-defined and user-customizable codeset conversions with a standard system utility and interface like iconv(1) and iconv(3C). This feature enhances the ability of an application to deal with incompatible data types, particularly data generated from proprietary or legacy applications. Modification to existing Oracle Solaris codeset conversions is also supported.

Sample input source files for the utility are available in the /usr/lib/iconv/geniconvtbl/srcs/ directory.

Once the user-defined code conversions are prepared and placed properly, users can use them through the iconv(1) utility and the iconv(3C) functions in both the 32-bit and 64-bit Oracle Solaris environments.

Internationalized Domain Name (IDN) Support

Internationalized Domain Name (IDN) enables the use of non-English native language names as host and domain names. To use non-English host and domain names, convert these names into ASCII Compatible Encoding (ACE) names before sending the names to resolver routines, as specified in RFC 3490. System administrators are also required to use ACE names in system files and in system administration applications that do not support IDNs.

See RFC 3490, Internationalizing Domain Names in Applications (IDNA).

The APIs for the Internationalized Domain Name in libidnkit(3EXT) provide convenient conversions between UTF-8 or the application locale's codeset and ACE. If idn_decodename2(3EXT) is used, you can also specify an arbitrary codeset name as the codeset of the input argument.

Figure 2–1 IDN to ACE Conversion

[Graphic: conversion of a non-English name to an ASCII compatible encoded string]

Figure 2–2 ACE to IDN Conversion

[Graphic: conversion of an ASCII compatible encoded string to a non-English name]

The following table shows bilateral iconv code conversions that you can use.

Table 2–16 iconv Code Conversions

From Code               To Code
ACE                     UTF-8
ACE-ALLOW-UNASSIGNED    UTF-8
UTF-8                   ACE
UTF-8                   ACE-ALLOW-UNASSIGNED

The ACE and the ACE-ALLOW-UNASSIGNED iconv code conversion names have the following meanings:

The following example shows a conversion from ACE to UTF-8 with input from the hostnames.txt file. Output goes to standard output.

system% iconv -f ACE -t UTF-8 hostnames.txt

The dedicated IDN conversion utility idnconv(1) provides IDN conversions with various options. The options control the conversion details.

For information about IDN, the conversion routines, and iconv code conversions, see libidnkit(3LIB), idn_decodename(3EXT), idn_decodename2(3EXT), idn_encodename(3EXT), and iconv_en_US.UTF-8(5) man pages.