This section discusses several internationalization features contained in the Solaris 9 environment.
EUC is an abbreviation for Extended UNIX Code. The Solaris 9 operating environment supports non-EUC encodings such as PC-Kanji (better known as Shift_JIS) in Japan, Big5 in Taiwan, and GBK in the People's Republic of China. Because a large part of the computer market demands non-EUC codeset support, the Solaris 9 environment provides a solid framework to enable both EUC and non-EUC codeset support. This support is called Codeset Independence, or CSI.
The goal of CSI is to remove dependencies on specific codesets or encoding methods from Solaris operating environment libraries and commands. The CSI architecture allows the Solaris operating environment to support any UNIX file system safe encoding. CSI supports a number of new codesets, such as UTF-8, PC-Kanji, and Big5.
Codeset independence enables application and platform software developers to keep their code independent of any encoding, such as UTF-8, and also provides the ability to adopt any new encoding without having to modify the source code. This architectural approach differs from Java™ internationalization in that Java requires applications to be UTF-16–dependent.
Many existing internationalized applications (for example, Motif) automatically inherit CSI support from the underlying system. These applications work in the new locales without modification.
CSI is inherently independent from any codesets. However, the following assumptions about file code encodings (codesets) still apply to the Solaris 9 environment:
NULL byte value (0x00) does not appear as part of multibyte character bytes for support of null-terminated multibyte character strings.
ASCII Slash character byte value (0x2f) does not appear as part of multibyte character bytes for support of the UNIX path names.
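The practical effect of these two assumptions is that byte-oriented code can look for the null terminator and the slash separator without decoding characters. The following is a minimal sketch of that idea; the function name count_path_components is hypothetical and is not part of any Solaris library.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count the components of a UNIX path by scanning for the '/' byte (0x2f).
 * No character decoding is done: under the assumptions above, 0x2f never
 * occurs inside a multibyte character, so every match is a real separator,
 * and 0x00 never occurs inside one, so the string ends where strchr() stops.
 */
size_t count_path_components(const char *path)
{
    size_t count = 0;
    const char *p = path;

    while (*p != '\0') {
        const char *next;

        /* Skip one or more separators. */
        while (*p == '/')
            p++;
        if (*p == '\0')
            break;
        count++;

        /* Jump to the next separator, or stop at the end of the string. */
        next = strchr(p, '/');
        if (next == NULL)
            break;
        p = next;
    }
    return count;
}
```

For example, the function reports three components for /usr/lib/iconv, and gives the same answer when a component is written in UTF-8 or another file system safe encoding.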
This section lists the CSI-enabled commands in the Solaris 9 environment. The man page for each command has an attribute section that indicates whether the command is CSI-enabled.
All commands are in the /usr/bin directory, unless otherwise noted.
Nearly all functions in libc (/usr/lib/libc.so) are CSI-enabled. However, the following functions in libc are not CSI-enabled because they are EUC-dependent functions:
csetcol()
csetlen()
euccol()
euclen()
eucscol()
getwidth()
csetno()
wcsetno()
In the Solaris 9 product, libgen (/usr/ccs/lib/libgen.a) and libcurses (/usr/ccs/lib/libcurses.a) are internationalized but not CSI-enabled.
The locale database format and structure is private and subject to change in a future release. Therefore, when developing an internationalized application, do not directly access the locale database. Instead, use the internationalization APIs in libc, described in Internationalization APIs in libc.
When working with the Solaris 9 environment, use the locale databases that are included with the Solaris 9 product. Do not use locales from previous Solaris versions.
The process code format, which is also known as wide-character code format in the Solaris 9 product, is private and subject to change in a future release. Therefore, when developing an internationalized application, do not assume that the process code format stays the same from one release to the next. Instead, use the internationalization APIs in libc described in Internationalization APIs in libc.
The process code for all Unicode locales is in UTF-32 representation. For more detail on UTF-32, refer to “Unicode Standard Annex #19: UTF-32” and “Unicode Standard Annex #27: Unicode 3.1” from The Unicode Consortium, or see http://www.unicode.org/.
A multibyte character is a character that cannot be stored in a single byte, such as Chinese, Japanese, or Korean characters. These characters require 2, 3, or 4 bytes of storage. A more precise definition can be found in ISO/IEC 9899:1990 subclause 3.13.
Amendment 1 to ANSI C, which is also known as ISO/IEC 9899:1990, added new internationalization features, collectively known as the Multibyte Support Environment (MSE). Amendment 1 defines additional internationalization APIs for multibyte codesets with state and also for better wide-character handling support.
The programming model enables these multibyte characters to be read in as logical units and stored internally as wide characters. These wide characters can be processed by the program as logical entities in their own right. Finally, these wide characters can be written out, undergoing appropriate translation, as logical units.
This procedure is analogous to the way single-byte characters are read in, manipulated, and written out again. The MSE enables programs to be written to handle multibyte characters using the same programming model that is used for single-byte characters.
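The read, process, and write steps of this model can be sketched with standard conversion functions. The example below is illustrative only: the function name upcase_multibyte and the MAX_WIDE limit are assumptions made for the sketch, and the result depends on the locale selected with setlocale().

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

#define MAX_WIDE 256   /* illustrative buffer limit, chosen for this sketch */

/*
 * Uppercase a multibyte string using the MSE model:
 * multibyte in -> wide characters -> process -> multibyte out.
 * Returns 0 on success, -1 on invalid input or insufficient space.
 */
int upcase_multibyte(const char *in, char *out, size_t outsz)
{
    wchar_t wide[MAX_WIDE];
    size_t n, i, written;

    /* Step 1: read the multibyte data in as wide characters. */
    n = mbstowcs(wide, in, MAX_WIDE);
    if (n == (size_t)-1 || n >= MAX_WIDE)
        return -1;   /* invalid sequence, or no room for the terminator */

    /* Step 2: process each character as one logical unit. */
    for (i = 0; i < n; i++)
        wide[i] = (wchar_t)towupper((wint_t)wide[i]);

    /* Step 3: write the result back out as multibyte data. */
    written = wcstombs(out, wide, outsz);
    if (written == (size_t)-1 || written >= outsz)
        return -1;   /* unconvertible character, or output not terminated */
    return 0;
}
```

The loop in step 2 never has to ask how many bytes a character occupies, which is the point of the model.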
Solaris 9 product users can choose how to link applications with the system libraries, such as libc, by using dynamic linking or static linking. Any application that requires internationalization features in the system libraries must be dynamically linked. If the application has been statically linked, the operation to set the locale to anything other than C and POSIX using the setlocale function will fail. Statically linked applications can be operated only in C and POSIX locales.
By default, the linker program tries to link the application dynamically. If the command line options to the linker and the compiler include -Bstatic or -dn specifications, your application might be statically linked. You can check whether an existing application is dynamically linked using the /usr/bin/ldd command.
For example, if you type:
% /usr/bin/ldd /sbin/sh
the command indicates that the /sbin/sh command is not a dynamically linked program, as shown by the following response:
ldd: /sbin/sh: file is not a dynamic executable or shared object
If you type:
% /usr/bin/ldd /usr/bin/ls
the command displays the following message:
libc.so.1 => /usr/lib/libc.so.1
libdl.so.1 => /usr/lib/libdl.so.1
This message indicates that the /usr/bin/ls command has been dynamically linked with two libraries, libc.so.1 and libdl.so.1.
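Because setlocale() fails in a statically linked program for any locale other than C or POSIX, an application can check the return value and fall back explicitly. The following is a minimal sketch; the function name select_locale is hypothetical.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Try to select the requested locale. If setlocale() fails (as it does for
 * any locale other than C or POSIX in a statically linked program), report
 * the problem and continue in the C locale.
 * Returns the name of the locale actually in effect.
 */
const char *select_locale(const char *requested)
{
    const char *actual = setlocale(LC_ALL, requested);

    if (actual == NULL) {
        fprintf(stderr, "cannot set locale \"%s\"; using C\n", requested);
        actual = setlocale(LC_ALL, "C");
    }
    return actual;
}
```

Passing an empty string as the requested locale selects the locale named by the environment, which is the usual choice at program startup.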
All functions that were formerly in libw and libintl have moved to libc.
The shared objects ensure runtime compatibility for existing applications and, together with the archives, provide compilation environment compatibility for building applications. However, you no longer need to build applications against libw or libintl.
For more information on filters, see the Linker and Libraries Guide.
The following list shows the stub entry points in libw.
This shorter list contains stub entry points in libintl:
Character classification and character transformation macros are defined in /usr/include/ctype.h. The Solaris 9 environment provides a set of ctype macros that support character classification and transformation semantics defined by XPG4. For all XPG4 and XPG4.2 applications to automatically access new macros, one of the following conditions must be met:
__XPG4_CHAR_CLASS__ is defined.
_XOPEN_SOURCE and _XOPEN_VERSION=4 are defined.
_XOPEN_SOURCE and _XOPEN_SOURCE_EXTENDED=1 are defined.
Because _XOPEN_SOURCE, _XOPEN_VERSION, and _XOPEN_SOURCE_EXTENDED bring in extra XPG4 related features in addition to new ctype macros, non-XPG4 or XPG4.2 applications should use __XPG4_CHAR_CLASS__.
Corresponding ctype functions also exist. The Solaris 9 environment functions also support XPG4 semantics. Refer to the ctype(3C) man page for details.
The Solaris 9 environment offers two sets of APIs:
Multibyte (file codes)
Wide characters (process code)
Wide-character codes are fixed-width units, each of which represents one logical character. Therefore, when you process wide characters, you do not have to keep track of character boundaries as you do when you process multibyte characters.
When a program takes input from a file, you can convert your file's multibyte data into wide-character process code directly with input functions like fscanf(3S) and fwscanf(3S) or by using conversion functions like mbtowc(3C) and mbsrtowcs(3C) after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf(3S) and fprintf(3S), or apply conversion functions like wctomb(3C) and wcsrtombs(3C) before the output.
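As a small illustration of converting after input, the sketch below walks a multibyte string one character at a time with the restartable conversion function mbrtowc(). The function name mb_char_count is hypothetical; the count it returns depends on the current locale.

```c
#include <string.h>
#include <wchar.h>

/*
 * Count the characters (not bytes) in a multibyte string by converting
 * one character at a time with mbrtowc().
 * Returns (size_t)-1 if the string contains an invalid or truncated sequence.
 */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);   /* start in the initial state */

    while (remaining > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, remaining, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;         /* invalid or incomplete character */
        if (len == 0)
            break;                     /* defensive: null character found */
        s += len;                      /* advance by one whole character */
        remaining -= len;
        count++;
    }
    return count;
}
```

In a multibyte locale the result can be smaller than strlen(s), because mbrtowc() consumes every byte of a character in one step.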
The tables in the remainder of this chapter describe the internationalization APIs included in the Solaris 9 product.
The following table describes the messaging function APIs in libc.
Table 2–1 Messaging Functions in libc
Library Routine | Description |
---|---|
catclose() | Close a message catalog |
catgets() | Read a program message |
catopen() | Open a message catalog |
dgettext() | Get a message from a message catalog with domain specified |
dcgettext() | Get a message from a message catalog with domain and category specified |
textdomain() | Set and query the current domain |
bindtextdomain() | Bind the path for a message domain |
gettext() | Retrieve text string from message database |
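A typical calling sequence for the gettext() family is to set the locale, bind the message domain to a directory, select the domain, and then look up strings. The sketch below assumes nothing about a real application: the function names init_messages and translate are hypothetical, as are any domain and directory names passed to them.

```c
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/*
 * Bind a message domain and make it current.
 * "domain" and "localedir" are supplied by the caller.
 * Returns 0 on success, -1 on failure.
 */
int init_messages(const char *domain, const char *localedir)
{
    /* Take the locale from the environment (LANG, LC_ALL, LC_MESSAGES). */
    setlocale(LC_ALL, "");

    if (bindtextdomain(domain, localedir) == NULL)
        return -1;
    if (textdomain(domain) == NULL)
        return -1;
    return 0;
}

/* Look up a message in the current domain. */
const char *translate(const char *msgid)
{
    /* gettext() returns msgid itself when no translation is available. */
    return gettext(msgid);
}
```

Because gettext() falls back to the original string, the program still produces readable output when no message object file is installed for the current locale.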
The following table describes the code conversion function APIs in libc.
Table 2–2 Code Conversion in libc
Library Routine | Description |
---|---|
iconv() | Convert codes |
iconv_close() | Deallocate the conversion descriptor |
iconv_open() | Allocate the conversion descriptor |
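The three functions are used together: allocate a descriptor, convert, then release the descriptor. The following sketch wraps that sequence in one hypothetical helper, convert_codeset. Codeset names are platform specific, so any names passed in are assumptions, and the second parameter of iconv() is declared as const char ** on some systems, in which case the cast shown here needs adjusting.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes of "in" from codeset "from" to codeset "to".
 * Returns the number of bytes written to "out", or (size_t)-1 on failure
 * (unsupported conversion, invalid input, or output buffer too small).
 */
size_t convert_codeset(const char *from, const char *to,
                       const char *in, size_t inlen,
                       char *out, size_t outsz)
{
    iconv_t cd = iconv_open(to, from);   /* note the order: to, then from */
    char *inp = (char *)in;              /* iconv() advances this pointer */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outsz;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);

    return (rc == (size_t)-1) ? (size_t)-1 : outsz - outleft;
}
```

The flush call matters only for stateful encodings, but issuing it unconditionally keeps the helper correct for both kinds.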
The following table describes the regular expression APIs in libc.
Table 2–3 Regular Expressions in libc
Library Routine | Description |
---|---|
regcomp() | Compile the regular expression |
regexec() | Execute regular expression matching |
regerror() | Provide a mapping from error codes to error messages |
regfree() | Free memory allocated by regcomp() |
fnmatch() | Match file name or path name |
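The compile, execute, and free calls normally appear together. The sketch below shows one way to combine them; the function name regex_matches is hypothetical. Bracket expressions such as [[:alpha:]] are evaluated according to the current locale.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Test whether "text" matches the extended regular expression "pattern".
 * Returns 1 on a match, 0 on no match, and -1 if the pattern fails to compile.
 */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);   /* release memory allocated by regcomp() */

    return (rc == 0) ? 1 : 0;
}
```

A caller that needs a readable diagnostic for a failed compilation can pass the regcomp() return value to regerror() before giving up.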
The following table describes the wide character function APIs in libc.
Table 2–4 Wide Character Class in libc
Library Routine | Description |
---|---|
wctype() | Define character class |
wctrans() | Define character mapping |
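A class handle returned by wctype() is meant to be passed to iswctype(), a standard companion function that is not listed in the table above. The sketch below uses the pair to count characters of a named class; the function name count_in_class is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count the characters of "ws" that belong to the named character class
 * (for example "alpha", "digit", or "space") in the current locale.
 * Returns (size_t)-1 if the class name is not valid in this locale.
 */
size_t count_in_class(const wchar_t *ws, const char *class_name)
{
    wctype_t cls = wctype(class_name);
    size_t count = 0;

    if (cls == (wctype_t)0)
        return (size_t)-1;   /* wctype() returns 0 for an unknown class */

    for (; *ws != L'\0'; ws++) {
        if (iswctype((wint_t)*ws, cls))
            count++;
    }
    return count;
}
```

Looking the class up by name lets a program use classes that a locale defines beyond the fixed set covered by the isw*() functions.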
The following table lists the function in libc that modifies and queries the locale.
Table 2–5 Modify and Query Locale in libc
Library Routine | Description |
---|---|
setlocale() | Modify and query a program's locale |
The following table lists the functions in libc that query locale data.
Table 2–6 Query Locale Data in libc
Library Routine | Description |
---|---|
nl_langinfo() | Get language and cultural information of current locale |
localeconv() | Get monetary and numeric formatting information of current locale |
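Both functions read from the locale selected by setlocale(), so applications never need to open the locale database directly. The following sketch prints two items; the function name describe_locale is hypothetical, and the codeset name that nl_langinfo() reports varies between platforms.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/*
 * Print a short summary of the current locale: its codeset name and the
 * radix character used for non-monetary numbers.
 * Returns 0 on success, -1 if the locale data cannot be read.
 */
int describe_locale(FILE *out)
{
    const char *codeset = nl_langinfo(CODESET);
    const struct lconv *lc = localeconv();

    if (codeset == NULL || lc == NULL)
        return -1;

    fprintf(out, "codeset: %s\n", codeset);
    fprintf(out, "decimal point: %s\n", lc->decimal_point);
    return 0;
}
```

Call setlocale(LC_ALL, "") first if the summary should reflect the user's environment instead of the default C locale.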
The following table describes the character classification function APIs in libc.
Table 2–7 Character Classification and Transliteration in libc
Library Routine | Description |
---|---|
isalpha() | Is character alphabetic? |
isupper() | Is character uppercase? |
islower() | Is character lowercase? |
isdigit() | Is character a digit? |
isxdigit() | Is character a hex digit? |
isalnum() | Is character alphabetic or a digit? |
isspace() | Is character a space? |
ispunct() | Is character a punctuation mark? |
isprint() | Is character printable? |
iscntrl() | Is character a control character? |
isascii() | Is character an ASCII character? |
isgraph() | Is character a visible character? |
isphonogram() | Is wide character a phonogram? |
isideogram() | Is wide character an ideogram? |
isenglish() | Is wide character in English alphabet from a supplementary codeset? |
isnumber() | Is wide character a digit from a supplementary codeset? |
isspecial() | Is special wide character from a supplementary codeset? |
iswalpha() | Is wide character alphabetic? |
iswupper() | Is wide character uppercase? |
iswlower() | Is wide character lowercase? |
iswdigit() | Is wide character a digit? |
iswxdigit() | Is wide character a hex digit? |
iswalnum() | Is wide character an alphabetic character or digit? |
iswspace() | Is wide character a white space? |
iswpunct() | Is wide character a punctuation mark? |
iswprint() | Is wide character a printable character? |
iswgraph() | Is wide character a visible character? |
iswcntrl() | Is wide character a control character? |
iswascii() | Is wide character an ASCII character? |
toupper() | Convert a lowercase character to uppercase. |
tolower() | Convert an uppercase character to lowercase. |
towupper() | Convert a lowercase wide character to uppercase. |
towlower() | Convert an uppercase wide character to lowercase. |
towctrans() | Wide character mapping. |
The following table describes the character collation function APIs in libc.
Table 2–8 Character Collation in libc
Library Routine | Description |
---|---|
strcoll() | Collate character strings |
strxfrm() | Transform character strings for comparison |
wcscoll() | Collate wide-character strings |
wcsxfrm() | Transform wide-character strings for comparison |
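A common use of strcoll() is as the comparison step of a sort, so that the result follows the collation rules of the current locale instead of raw byte values. The sketch below shows this with qsort(); the names compare_by_locale and sort_by_locale are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparison callback: order two strings by the LC_COLLATE rules. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Sort an array of n strings in the collation order of the current locale. */
void sort_by_locale(const char **strings, size_t n)
{
    qsort(strings, n, sizeof strings[0], compare_by_locale);
}
```

When the same strings are compared many times, transforming each one once with strxfrm() and comparing the results with strcmp() can be cheaper than repeated strcoll() calls.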
The following table describes the monetary handling function APIs in libc.
Table 2–9 Monetary Formatting in libc
Library Routine | Description |
---|---|
localeconv() | Get monetary formatting information for the current locale |
strfmon() | Convert monetary value to string representation |
The following table describes the date and time formatting in libc.
Table 2–10 Date and Time Formatting in libc
Library Routine | Description |
---|---|
getdate() | Convert user format date and time. |
strftime() | Convert date and time to string representation. The %u conversion specification conforms to the X/Open CAE Specification, System Interfaces and Headers, Issue 4, Version 2. This specification represents a weekday as a decimal number [1,7], with 1 now representing Monday. |
strptime() | Date and time conversion. |
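The %u numbering differs from the tm_wday field of struct tm, which counts from 0 for Sunday. The following sketch makes the mapping visible; the function name weekday_number is hypothetical.

```c
#include <string.h>
#include <time.h>

/*
 * Format a weekday with the %u conversion specification.
 * "wday" uses the struct tm convention: 0 = Sunday ... 6 = Saturday.
 * Returns the %u value (1 = Monday ... 7 = Sunday), or -1 on failure.
 */
int weekday_number(int wday)
{
    struct tm tm;
    char buf[8];

    if (wday < 0 || wday > 6)
        return -1;

    memset(&tm, 0, sizeof tm);
    tm.tm_mday = 1;          /* keep the structure a plausible date */
    tm.tm_wday = wday;       /* %u is derived from this field */

    if (strftime(buf, sizeof buf, "%u", &tm) == 0)
        return -1;
    return buf[0] - '0';     /* %u always yields a single digit */
}
```

Monday (tm_wday 1) maps to 1, and Sunday (tm_wday 0) maps to 7.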
The following table describes the multibyte handling function APIs in libc.
Table 2–11 Multibyte Handling in libc
Library Routine | Description |
---|---|
btowc() | Single-byte to wide-character conversion |
mbrlen() | Get number of bytes in character (restartable) |
mbsinit() | Determine conversion object status |
mbrtowc() | Convert a character to a wide-character code (restartable) |
mbsrtowcs() | Convert a character string to a wide-character string (restartable) |
mblen() | Get number of bytes in a character |
mbtowc() | Convert a character to a wide-character code |
mbstowcs() | Convert a character string to a wide-character string |
The following table describes the wide character and string handling in libc.
Table 2–12 Wide Character and String Handling in libc
Library Routine | Description |
---|---|
wcsncat() | Concatenate wide-character strings to length n |
wsdup() | Duplicate wide-character string |
wcscmp() | Compare wide-character strings |
wcsncmp() | Compare wide-character strings to length n |
wcscpy() | Copy wide-character strings |
wcsncpy() | Copy wide-character strings to length n |
wcschr() | Find character in wide-character string |
wcsrchr() | Find character in wide-character string from right |
wcslen() | Get length of wide-character string |
wscol() | Return display width of wide-character string |
wcsspn() | Return span of one wide-character string in another |
wcscspn() | Return span of one wide-character string not in another |
wcspbrk() | Return pointer to one wide-character string in another |
wcstok() | Move token through wide-character string |
wcswcs() | Find string in wide-character string |
wcstombs() | Convert wide-character string to multibyte string |
wctomb() | Convert wide-character to multibyte character |
wcwidth() | Determine number of column positions of a wide character |
wcswidth() | Determine number of column positions of a wide-character string |
wctob() | Wide character to single byte conversion |
wcrtomb() | Convert a wide-character code to a character (restartable) |
wcstol() | Convert wide-character string to long integer |
wcstoul() | Convert wide-character string to unsigned long integer |
wcstod() | Convert wide-character string to double precision |
wcsrtombs() | Convert a wide-character string to a character string (restartable) |
wcscat() | Concatenate wide-character strings |
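The numeric conversion functions in this table follow the same conventions as their single-byte counterparts. As an illustration, the sketch below wraps wcstol() with the usual checks for empty input, trailing characters, and overflow; the function name parse_wide_long is hypothetical.

```c
#include <errno.h>
#include <wchar.h>

/*
 * Parse a decimal integer from a wide-character string.
 * Returns 0 and stores the value in *out on success; returns -1 if the
 * string has no digits, has trailing characters, or is out of range.
 */
int parse_wide_long(const wchar_t *ws, long *out)
{
    wchar_t *end;
    long value;

    errno = 0;
    value = wcstol(ws, &end, 10);

    /* end == ws means that no characters were converted at all. */
    if (end == ws || *end != L'\0' || errno == ERANGE)
        return -1;
    *out = value;
    return 0;
}
```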
The following table describes the formatted wide-character input and output in libc.
Table 2–13 Formatted Wide-character Input and Output in libc
Library Routine | Description |
---|---|
wsprintf() | Generate wide-character string according to format |
wsscanf() | Formatted input conversion |
fwprintf() | Print formatted wide-character output |
fwscanf() | Convert formatted wide-character input |
wprintf() | Print formatted wide-character output |
wscanf() | Convert formatted wide-character input |
swprintf() | Print formatted wide-character output |
swscanf() | Convert formatted wide-character input |
vfwprintf() | Wide-character formatted output of a stdarg argument list |
vswprintf() | Wide-character formatted output of a stdarg argument list |
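Unlike sprintf(), swprintf() takes the size of the destination buffer, counted in wide characters, as its second argument. The sketch below assumes the ISO C signature of swprintf(); the function name format_count is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Format "<label>: <count>" into a wide-character buffer of n elements.
 * Returns the number of wide characters written, not counting the
 * terminator, or a negative value if the result does not fit.
 */
int format_count(wchar_t *buf, size_t n, const wchar_t *label, int count)
{
    /* %ls takes a wide-character string argument. */
    return swprintf(buf, n, L"%ls: %d", label, count);
}
```

A negative return value is the only indication of truncation, so callers should test it before using the buffer.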
This table describes the wide strings function APIs in libc.
Table 2–14 Wide Strings in libc
Library Routine | Description |
---|---|
wscasecmp() | Compare wide-character strings, ignore case differences |
wsncasecmp() | Compare wide-character strings to length n, ignore case differences |
wcsstr() | Find a wide-character substring |
wmemchr() | Find a wide character in memory |
wmemcmp() | Compare wide characters in memory |
wmemcpy() | Copy wide characters in memory |
wmemmove() | Copy wide characters in memory with overlapping areas |
wmemset() | Set wide characters in memory |
The following table describes the wide-character input and output in libc.
Table 2–15 Wide-character Input and Output in libc
Library Routine | Description |
---|---|
fgetwc() | Get multibyte character from stream, convert to wide character |
getwchar() | Get multibyte character from stdin, convert to wide character |
fgetws() | Get multibyte string from stream, convert to wide-character string |
getws() | Get multibyte string from stdin, convert to wide-character string |
fputwc() | Convert wide character to multibyte character, put to stream |
fwide() | Set stream orientation |
putwchar() | Convert wide character to multibyte character, put to stdout |
fputws() | Convert wide-character string to multibyte string, put to stream |
putws() | Convert wide-character string to multibyte string, put to stdout |
ungetwc() | Push a wide character back into input stream. |
The new genmsg utility can be used with the catgets() family of functions to create internationalized source message catalogs. The utility examines a source program file for calls to functions in catgets and builds a source message catalog from the information it finds. For example:
% cat example.c
...
/* NOTE: %s is a file name */
printf(catgets(catd, 5, 1, "%s cannot be opened."));
/* NOTE: "Read" is a past participle, not a present tense verb */
printf(catgets(catd, 5, 1, "Read"));
...
% genmsg -c NOTE example.c
The following file(s) have been created.
        new msg file = "example.c.msg"
% cat example.c.msg
$quote "
$set 5
1       "%s cannot be opened"
        /* NOTE: %s is a file name */
2       "Read"
        /* NOTE: "Read" is a past participle, not a present tense verb */
In the above example, genmsg is run on the source file example.c, which produces a source message catalog named example.c.msg. The -c option with the argument NOTE causes genmsg to include comments in the catalog. If a comment in the source program contains the string specified, the comment appears in the message catalog after the next string extracted from a call to catgets.
You can use genmsg to number the messages in a message set automatically.
For more information, see the genmsg(1) man page.
To generate a formatted message catalog file, use the gencat(1) utility.
For information on the message extraction utility for Portable Message files (.po files), and on how to generate message object files (.mo files) from .po files, see the xgettext(1) and msgfmt(1) man pages, respectively.
Solaris users can create user-defined codeset converters by using the geniconvtbl utility.
This utility enables user-defined and user-customizable codeset conversions with a standard system utility and interface like iconv(1) and iconv(3C). This feature enhances the ability of an application to deal with incompatible data types, particularly data generated from proprietary or legacy applications. Modification to existing Solaris codeset conversions is also supported.
More details and also examples can be found in the geniconvtbl(1) and geniconvtbl(4) man pages. Sample input source files for the utility are also available for reference from the /usr/lib/iconv/geniconvtbl/srcs/ directory.
Once the user-defined code conversions are prepared and placed as specified in the geniconvtbl(1) man page, users can use the code conversions from the iconv(1) utility and the iconv(3C) functions of both 32-bit and 64-bit Solaris operating environments.