Internationalizing and Localizing Applications in Oracle Solaris

Updated: November 2020
 
 

Handling Characters and Character Strings

Character codes used for handling characters and character strings can be categorized into two groups:

Multibyte (file code)

File code is used for text data exchange and for storing text in files. It has a fixed byte ordering, Big Endian, regardless of the underlying system. Codesets such as UTF-8, EUC, single-byte codesets, BIG5, Shift-JIS, PCK, GBK, and GB18030 fall under this category. In the context of the functions described in this section, the term multibyte character is a general term that refers to a character in the codeset of the current locale, even though that codeset might in some cases be a single-byte codeset.

Wide characters (process code)

Process code is a fixed-width representation of a character used for internal processing. It is in the native byte ordering of the platform, which can be either Big Endian or Little Endian. Encodings like UTF-32, UCS-2, and UCS-4 can be wide-character encodings.

Conversion between multibyte data and wide-character data is often necessary. When a program takes input from a file, the multibyte data in the file is converted into wide-character process code by using input functions like fscanf() and fwscanf() or by using conversion functions like mbtowc() and mbsrtowcs() after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf() and fprintf() or apply conversion functions like wctomb() and wcsrtombs() before the output.

Functions for handling characters, wide characters, and corresponding data types are described in the following sections.

Character Types and Definitions

    The ISO/IEC 9899 standard defines the term "wide character" and the wchar_t and wint_t data types.

  • A wide character is a representation of a single character that fits into an object of type wchar_t.

  • The wchar_t is an integer type capable of representing all characters for all supported locales.

  • The wint_t is an integer type capable of storing any valid value of wchar_t or WEOF.

  • A wide-character string (also wide string or process code string) is a sequence of wide characters terminated by a null wide character code.


Note -  The ISO/IEC 9899 standard does not specify the form or the encoding of the contents for the wchar_t data type. Because it is an implementation-specific data type, it is not portable. Although many implementations use some Unicode encoding forms for the contents of the wchar_t data type, do not assume that the contents of wchar_t are Unicode. Some platforms use UCS-4 or UCS-2 for their wide-character encoding.

In Oracle Solaris, the internal form of wchar_t is specific to a locale. In the Oracle Solaris Unicode locales, wchar_t has the UTF-32 Unicode encoding form, and other locales have different representations.

For more information, see the stddef.h(3HEAD) and wchar.h(3HEAD) man pages.

Integer Coded Character Classification Functions

The following functions classify characters and return a non-zero value for true and 0 for false. With the exception of isascii(), all of these functions are locale sensitive: they depend on the LC_CTYPE category of the locale of the current thread or process, or of the locale specified as an argument to the functions whose names end with _l.

isalpha()
isalpha_l()

Test for an alphabetic character

isalnum()
isalnum_l()

Test for an alphanumeric character

isascii()

Test for a 7-bit US-ASCII character

isblank()
isblank_l()

Test for a blank character

iscntrl()
iscntrl_l()

Test for a control character

isdigit()
isdigit_l()

Test for a decimal digit

isgraph()
isgraph_l()

Test for a visible character

islower()
islower_l()

Test for a lowercase letter

isprint()
isprint_l()

Test for a printable character

ispunct()
ispunct_l()

Test for a punctuation character

isspace()
isspace_l()

Test for a white-space character

isupper()
isupper_l()

Test for an uppercase letter

isxdigit()
isxdigit_l()

Test for a hexadecimal digit

These functions should not be used in a locale with a multibyte codeset, such as UTF-8. Use the wide-character classification functions described in the following section for multibyte codesets.
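As an illustration of this advice, the following sketch counts the alphabetic characters in a multibyte string by converting each character with mbrtowc() and classifying it with iswalpha(). The helper name count_alpha() is hypothetical and is not part of any library. The sketch assumes that the caller has already selected the locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count the alphabetic characters in a multibyte string that is encoded
 * in the codeset of the current locale. count_alpha() is a hypothetical
 * helper. Returns (size_t)-1 on an invalid or truncated sequence.
 */
size_t
count_alpha(const char *s)
{
        mbstate_t st;
        size_t left = strlen(s);
        size_t count = 0;

        memset(&st, 0, sizeof (st));    /* initial conversion state */
        while (left > 0) {
                wchar_t wc;
                size_t n = mbrtowc(&wc, s, left, &st);

                if (n == (size_t)-1 || n == (size_t)-2)
                        return ((size_t)-1);
                if (n == 0)
                        break;          /* null character reached */
                if (iswalpha((wint_t)wc))
                        count++;
                s += n;                 /* advance by the bytes consumed */
                left -= n;
        }
        return (count);
}
```

A program would typically call setlocale(LC_ALL, "") once at startup so that the string is interpreted in the user's codeset. Calling isalpha() on each byte instead would misclassify the individual bytes of multibyte characters.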

The behavior of some of these functions also depends on the compiler options used at compile time. The ctype(3C) man page describes the "Default" and "Standard conforming" behaviors of the isalpha(), isgraph(), isprint(), and isxdigit() functions. For example, the isalpha() function is defined as follows:

Default isalpha()

Tests for any character for which isupper() or islower() is true.

Standard Conforming isalpha()

Tests for any character for which isupper() or islower() is true, or any character that is one of the current locale-defined set of characters for which none of iscntrl(), isdigit(), ispunct(), or isspace() is true. In the C locale, isalpha() returns true only for the characters for which isupper() or islower() is true.

This has consequences for languages or alphabets whose letters have no case (also called unicase), such as Arabic, Hebrew, or Thai. For alphabetic characters such as aleph (0xE0) in the Hebrew legacy locale he_IL.ISO8859-8, the isupper() and islower() functions always return false. Therefore, with the default behavior, the isalpha() function also returns false for these characters. If compiler options are enabled for the standard conforming behavior, the isalpha() function returns true for such characters. For more information, see the isalpha(3C) and standards(7) man pages.

See also the Oracle Developer Studio 12.6: C User's Guide and the ctype(3C) and SUSv3(7) man pages.

Wide-Character Classification Functions

The following functions classify wide characters and return a non-zero value for true and 0 for false. These functions check the given wide character against named character classes, such as alpha, lower, or jkana, which are defined in the LC_CTYPE category of the locale of the current thread or process, or of the locale specified as an argument to the functions whose names end with _l. Therefore, these functions are locale sensitive.

iswalpha()
iswalpha_l()

Test for an alphabetic wide character

iswalnum()
iswalnum_l()

Test for an alphanumeric wide character

iswascii()
iswascii_l()

Test whether a wide character represents a 7-bit US-ASCII character

iswblank()
iswblank_l()

Test for a blank wide character

iswcntrl()
iswcntrl_l()

Test for a control wide character

iswdigit()
iswdigit_l()

Test for a decimal digit wide character

iswgraph()
iswgraph_l()

Test for a visible wide character

iswlower()
iswlower_l()

Test for a lowercase letter wide character

iswprint()
iswprint_l()

Test for a printable wide character

iswpunct()
iswpunct_l()

Test for a punctuation wide character

iswspace()
iswspace_l()

Test for a white-space wide character

iswupper()
iswupper_l()

Test for an uppercase letter wide character

iswxdigit()
iswxdigit_l()

Test for a hexadecimal digit wide character

isenglish()

Test for a wide character representing an English language character, excluding US-ASCII characters

isideogram()

Test for a wide character representing an ideographic language character, excluding US-ASCII characters

isnumber()

Test for a wide character representing a digit, excluding US-ASCII characters

isphonogram()

Test for a wide character representing a phonetic language character, excluding US-ASCII characters

isspecial()

Test for a wide character representing a special language character, excluding US-ASCII characters

    The following character classes are defined in all the locales:

  • alnum

  • alpha

  • blank

  • cntrl

  • digit

  • graph

  • lower

  • print

  • punct

  • space

  • upper

  • xdigit

The isenglish(), isideogram(), isnumber(), isphonogram(), and isspecial() functions are legacy wide-character classification functions that are specific to Oracle Solaris. The character classes for these functions are defined only in the following Asian locales and their variants: ko_KR.EUC, zh_CN.EUC, zh_CN.GBK, zh_CN.GB18030, zh_HK.BIG5HK, zh_TW.BIG5, and zh_TW.EUC. In all other locales, including Unicode locales, these functions always return false.

You can query for a specific character class in a generic way by using the following functions:

wctype()

Define character class

iswctype()
iswctype_l()

Test character for specified class


Note -  mbtowc_l() is not currently available. A wide character that is passed to iswalpha_l(), for example, needs to be generated with mbtowc(), with the current thread's locale set by uselocale() either to the same locale or to another locale that uses the same codeset.

Example 13  Querying Character Class of a Wide Character

In the following example, calls to the iswctype() and wctype() functions are used to check whether the given Unicode character belongs to the jhira character class. The jhira character class is from Japanese Hiragana script.

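  wchar_t wc;     /* mbtowc() stores its result in a wchar_t */
  int     ret;

  setlocale(LC_ALL, "ja_JP.UTF-8");

  /* "\xe3\x81\xba" is UTF-8 for HIRAGANA LETTER PE */
  ret = mbtowc(&wc, "\xe3\x81\xba", 3);
  if (ret == -1) {
          /* Invalid character sequence. */
          :
  }

  /* wctype() returns 0 if the locale does not define the class. */
  if (iswctype(wc, wctype("jhira"))) {
          /* %lc prints a wide character; plain %c expects a single byte. */
          wprintf(L"'%lc' is a hiragana character.\n", (wint_t)wc);
  }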

The example will produce the following output:

'ぺ' is a hiragana character.
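Because classes such as jhira are defined only in some locales, it can be useful to distinguish "not a member" from "class not defined". The following sketch wraps wctype() and iswctype() for that purpose. The helper name wc_in_class() and its return convention are assumptions made for illustration.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Returns 1 if wc belongs to the named character class, 0 if it does
 * not, and -1 if the current locale does not define the class.
 * wc_in_class() is a hypothetical helper.
 */
int
wc_in_class(wint_t wc, const char *class_name)
{
        wctype_t desc = wctype(class_name);

        if (desc == (wctype_t)0)
                return (-1);    /* class unknown in this locale */
        return (iswctype(wc, desc) != 0);
}
```

With this helper, a call such as wc_in_class(wc, "jhira") would be expected to return -1 in a locale that lacks the jhira class, rather than silently reporting false.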

Character Transliteration Functions

The following functions map characters between character classes (character transliteration). If a mapping for a character is defined in the locale of the current thread or process, or in the locale specified as an argument to the functions whose names end with _l, the functions return the transliterated character. These functions are locale sensitive.

tolower()
tolower_l()

Convert an uppercase character to lowercase

toupper()
toupper_l()

Convert a lowercase character to uppercase

towlower()
towlower_l()

Convert an uppercase wide character to lowercase

towupper()
towupper_l()

Convert a lowercase wide character to uppercase

The following functions provide a generic way to perform character transliteration:

wctrans()
wctrans_l()

Define character mapping

towctrans()
towctrans_l()

Wide-character mapping

For more information about related functions for Unicode strings, see Processing UTF-8 Strings.
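The generic wctrans() and towctrans() pair can be sketched as follows. The mapping names "toupper" and "tolower" are defined by the ISO C standard for all locales. The helper name map_wc() and its fallback of returning the input unchanged are assumptions made for illustration.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Apply the named mapping to wc. If the current locale does not define
 * the mapping, wc is returned unchanged. map_wc() is a hypothetical
 * helper.
 */
wint_t
map_wc(wint_t wc, const char *mapping_name)
{
        wctrans_t desc = wctrans(mapping_name);

        if (desc == (wctrans_t)0)
                return (wc);    /* mapping unknown in this locale */
        return (towctrans(wc, desc));
}
```

Calling map_wc(wc, "toupper") should behave like towupper(wc). The generic form is mainly useful when the mapping name is chosen at run time.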


Note -  mbtowc_l() is not currently available. A wide character that is passed to towlower_l(), for example, needs to be generated with mbtowc(), with the current thread's locale set by uselocale() either to the same locale or to another locale that uses the same codeset.

Example 14  Transliteration of a Wide Character

The following code fragment shows how to use the towupper() function for transliterating a Unicode wide character to uppercase.

  wchar_t wc;     /* mbtowc() stores its result in a wchar_t */
  int     ret;

  setlocale(LC_ALL, "cs_CZ.UTF-8");

  /* "\xc5\x99" is UTF-8 for LATIN SMALL LETTER R WITH CARON */
  ret = mbtowc(&wc, "\xc5\x99", 2);
  if (ret == -1) {
          /* Invalid character sequence. */
          :
  }

  /* %lc prints a wide character; plain %c expects a single byte. */
  wprintf(L"'%lc' is uppercase of '%lc'.\n", towupper(wc), (wint_t)wc);

The example will produce the following output:

'Ř' is uppercase of 'ř'.

String Collation

The following functions compare strings based on the collation data of the locale of the current thread or process, or of the locale specified as an argument to the functions whose names end with _l:

strcoll()
strcoll_l()

String comparison using collating information

strxfrm()
strxfrm_l()

String transformation

wcscoll(), wscoll()
wcscoll_l()

Wide-character string comparison using collating information

wcsxfrm(), wsxfrm()
wcsxfrm_l()

Wide-character string transformation

For better performance when sorting large lists of strings, use the strxfrm() and strcmp() functions instead of the strcoll() function, and the wcsxfrm() and wcscmp() functions instead of the wcscoll() function.

When using the strxfrm() and wcsxfrm() functions, note that the transformed string is not in a human-readable form. The transformed strings are intended only as input to the strcmp() and wcscmp() functions respectively.
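The following is a minimal sketch of the strxfrm() and strcmp() approach: each string is transformed once, and the sort then compares only the transformed keys. The names sort_item, cmp_items(), and sort_by_collation() are hypothetical. The sketch relies on strxfrm() returning the required key length when it is called with a size of 0, and it omits errno-based error checking of strxfrm() for brevity.

```c
#include <stdlib.h>
#include <string.h>

/* Pairs an original string with its transformed collation key. */
struct sort_item {
        const char *text;       /* original string, not owned */
        char *key;              /* strxfrm() output, owned */
};

/* qsort() callback: compare two items by their transformed keys. */
static int
cmp_items(const void *a, const void *b)
{
        const struct sort_item *x = a;
        const struct sort_item *y = b;

        return (strcmp(x->key, y->key));
}

/*
 * Sort an array of strings in place according to the collation order of
 * the current locale. Returns 0 on success, -1 on allocation failure.
 */
int
sort_by_collation(const char **strings, size_t count)
{
        struct sort_item *items;
        size_t i, made = 0;
        int rc = -1;

        if (count == 0)
                return (0);
        items = calloc(count, sizeof (*items));
        if (items == NULL)
                return (-1);

        for (i = 0; i < count; i++) {
                /* With size 0, strxfrm() only reports the key length. */
                size_t len = strxfrm(NULL, strings[i], 0);

                items[i].text = strings[i];
                items[i].key = malloc(len + 1);
                if (items[i].key == NULL)
                        goto out;
                made++;
                (void) strxfrm(items[i].key, strings[i], len + 1);
        }

        qsort(items, count, sizeof (*items), cmp_items);
        for (i = 0; i < count; i++)
                strings[i] = items[i].text;
        rc = 0;
out:
        for (i = 0; i < made; i++)
                free(items[i].key);
        free(items);
        return (rc);
}
```

The benefit comes from transforming each string once instead of on every comparison, which matters because a sort performs many more comparisons than there are strings.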


Note -  mbstowcs_l() is not currently available. A wide-character string that is passed to wcscoll_l() or wcsxfrm_l(), for example, needs to be generated with mbstowcs(), with the current thread's locale set by uselocale() either to the same locale or to another locale that uses the same codeset.

For more information, see the strcmp(3C) and wcscmp(3C) man pages.

Conversion Between Multibyte and Wide Characters

The following functions are used for conversion between the codeset of the current locale (multibyte) and the process code (wide-character representation).

These functions are locale sensitive and depend on the LC_CTYPE category of the current locale. They return the same error on incomplete characters and illegal characters. For more information about illegal characters and incomplete characters, see Converting Codesets.

mblen()

Get the number of bytes in a character

mbtowc()

Convert a character to a wide-character code

mbstowcs()

Convert a character string to a wide-character string

wctomb()

Convert a wide-character code to a character

wcstombs()

Convert a wide-character string to a character string
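As a sketch of how these functions fit together, the following helper converts a multibyte string into a newly allocated wide-character string. The name mbs_to_wcs_dup() is hypothetical. The sketch assumes the POSIX behavior in which mbstowcs() called with a null destination returns the number of wide characters required without storing anything.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert a multibyte string in the codeset of the current locale to a
 * newly allocated wide-character string. Returns NULL on an invalid
 * sequence or on allocation failure. The caller frees the result.
 * mbs_to_wcs_dup() is a hypothetical helper.
 */
wchar_t *
mbs_to_wcs_dup(const char *s)
{
        size_t n = mbstowcs(NULL, s, 0);        /* count only */
        wchar_t *w;

        if (n == (size_t)-1)
                return (NULL);                  /* invalid sequence */
        w = malloc((n + 1) * sizeof (*w));
        if (w == NULL)
                return (NULL);
        if (mbstowcs(w, s, n + 1) == (size_t)-1) {
                free(w);
                return (NULL);
        }
        return (w);
}
```

Sizing the buffer from the first call avoids guessing at a worst-case length, because the number of wide characters is at most, but often less than, the number of bytes.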

The following functions are restartable and can be used to handle incomplete characters. These cases occur when the bytes of an incomplete character reported by a previous call, combined with the additional bytes supplied in the current call, form a valid character. To store the state information that this kind of processing requires, the functions use either a user-provided or an internal state structure of type mbstate_t. The mbsinit() function is used to detect whether an mbstate_t structure is in an initial state.

mbsinit()

Determine the conversion object status

mbrlen()

Get the number of bytes in a character (restartable)

mbrtowc()

Convert a character to a wide-character code (restartable)

mbsrtowcs(), mbsnrtowcs()

Convert a character string to a wide-character string (restartable)

wcrtomb()

Convert a wide-character code to a character (restartable)

wcsrtombs(), wcsnrtombs()

Convert a wide-character string to a character string (restartable)

c16rtomb()

Convert a 16-bit character code to a character (restartable)

c32rtomb()

Convert a 32-bit character code to a character (restartable)

mbrtoc16()

Convert a character to a 16-bit character code (restartable)

mbrtoc32()

Convert a character to a 32-bit character code (restartable)
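The restartable behavior can be sketched with a helper that processes input one chunk at a time, for example as data arrives from a network connection. The name count_chunk() is hypothetical. The sketch assumes that a null character occupies a single byte in the codeset in use.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters that are completed by one chunk of multibyte
 * input. The caller keeps *st between calls so that a character split
 * across two chunks is finished by the next call. Returns (size_t)-1 on
 * an illegal sequence. count_chunk() is a hypothetical helper.
 */
size_t
count_chunk(const char *buf, size_t len, mbstate_t *st)
{
        size_t count = 0;

        while (len > 0) {
                wchar_t wc;
                size_t n = mbrtowc(&wc, buf, len, st);

                if (n == (size_t)-1)
                        return ((size_t)-1);    /* illegal sequence */
                if (n == (size_t)-2)
                        break;  /* incomplete: the bytes are kept in *st */
                if (n == 0)
                        n = 1;  /* null character, assumed to be one byte */
                count++;
                buf += n;
                len -= n;
        }
        return (count);
}
```

The caller zero-initializes the mbstate_t object before the first chunk. In a UTF-8 locale, a three-byte character delivered as one byte followed by two bytes would be expected to add nothing to the count of the first call and one to the count of the second, and mbsinit() would report a non-initial state between the two calls.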

The following functions convert between a single-byte character in the codeset of the current locale and its wide-character process code. They first determine whether the character can be represented as a single byte. If it cannot, wctob() returns EOF and btowc() returns WEOF.

wctob()

Convert a wide character to a single-byte character, if possible

btowc()

Convert a single-byte character to a wide character, if possible
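A brief sketch of both functions follows. The wrapper names fits_in_one_byte() and widen_byte() are hypothetical and exist only to make the return conventions explicit.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Returns 1 if wc is represented by a single byte in the codeset of the
 * current locale, 0 otherwise. fits_in_one_byte() is a hypothetical
 * helper.
 */
int
fits_in_one_byte(wint_t wc)
{
        return (wctob(wc) != EOF);
}

/*
 * Widen one byte to a wide character. Returns WEOF if the byte is not a
 * complete character on its own. widen_byte() is a hypothetical helper.
 */
wint_t
widen_byte(unsigned char byte)
{
        return (btowc(byte));
}
```

The results are stored in int and wint_t rather than char and wchar_t so that the EOF and WEOF indicators remain distinguishable from valid characters.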

Wide-Character Strings

The following functions are used to handle wide-character strings:

wcslen(), wslen(), wcsnlen()

Get the length of a wide-character string (wcsnlen() examines at most a fixed number of wide characters)

wcschr(), wschr()

Find the first occurrence of a wide character in a wide-character string

wcsrchr(), wsrchr()

Find the last occurrence of a wide character in a wide-character string

wcspbrk()

Scan a wide-character string for a wide-character code

wcscat(), wscat(), wcsncat()

Concatenate two wide-character strings

wcscmp(), wscmp(), wcsncmp()

Compare two wide-character strings

wcscpy(), wscpy()

Copy a wide-character string

wcsncpy(), wsncpy()

Copy part of a wide-character string

wcpcpy(), wcpncpy()

Copy a wide-character string, returning a pointer to its end

wcsspn(), wsspn()

Get the length of a wide-character substring

wcscspn(), wscspn()

Get the length of a complementary wide-character substring

wcstok(), wstok()

Split a wide-character string into tokens

wcsstr(), wcswcs()

Find a wide-character substring

wcwidth(), wcswidth(), wscol()

Get the number of column positions of a wide-character or wide-character string

wscasecmp(), wsncasecmp(), wcscasecmp_l(), wcsncasecmp_l()

Case-insensitive wide-character string comparison

wcsdup(), wsdup()

Duplicate a wide-character string

The wcswcs() function was marked legacy and may be removed from the ISO/IEC 9899 standard in the future. Use the wcsstr() function instead.
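As an illustration of wcsstr() together with wcslen(), the following sketch counts the non-overlapping occurrences of one wide-character string in another. The helper name count_occurrences() is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Count the non-overlapping occurrences of needle in haystack. An empty
 * needle yields 0. count_occurrences() is a hypothetical helper.
 */
size_t
count_occurrences(const wchar_t *haystack, const wchar_t *needle)
{
        size_t step = wcslen(needle);
        size_t count = 0;
        const wchar_t *p = haystack;

        if (step == 0)
                return (0);
        while ((p = wcsstr(p, needle)) != NULL) {
                count++;
                p += step;      /* continue after the current match */
        }
        return (count);
}
```

Because each wchar_t holds one whole character, the search cannot match in the middle of a character, which is a risk when strstr() is applied to multibyte data in some legacy codesets.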

The functions for converting wide-character strings to numbers are as follows:

wcstol(), wstol(), wcstoll(), watol(), watoll(), watoi()

Convert a wide-character string to a long integer

wcstoul(), wcstoull()

Convert a wide-character string to an unsigned long integer

wcstod(), wstod(), wcstof(), wcstold(), watof()

Convert a wide-character string to a floating-point number
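The following sketch shows one way to call wcstol() with full error checking. The helper name parse_long() and its policy of rejecting trailing characters are assumptions made for illustration.

```c
#include <errno.h>
#include <wchar.h>

/*
 * Parse a decimal integer from a wide-character string. Returns 0 and
 * stores the value in *out on success. Returns -1 if the string contains
 * no digits, has trailing characters, or is out of range for long.
 * parse_long() is a hypothetical helper.
 */
int
parse_long(const wchar_t *text, long *out)
{
        wchar_t *end;
        long value;

        errno = 0;
        value = wcstol(text, &end, 10);
        if (end == text)
                return (-1);    /* no conversion was performed */
        if (errno == ERANGE)
                return (-1);    /* value does not fit in a long */
        if (*end != L'\0')
                return (-1);    /* trailing characters after the number */
        *out = value;
        return (0);
}
```

Clearing errno before the call is necessary because wcstol() returns a valid long value even when the input is out of range.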


Note -  mbstowcs_l() is not currently available. A wide-character string that is passed to wcscasecmp_l() or wcsncasecmp_l(), for example, needs to be generated with mbstowcs(), with the current thread's locale set by uselocale() either to the same locale or to another locale that uses the same codeset.

The following man pages describe functions that perform in-memory operations on wide characters. They are the wide-character equivalents of functions such as memset() and memcpy(). These functions are not affected by the locale, and all wchar_t values are treated identically.

wmemset(3C)

Set wide characters in memory

wmemcpy(3C)

Copy wide characters in memory

wmemmove(3C)

Copy wide characters in memory with overlapping areas

wmemcmp(3C)

Compare wide characters in memory

wmemchr(3C)

Find a wide character in memory
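As a small illustration, the following sketch uses wmemset() and wmemcpy() to left-justify a string in a fixed-width field. The helper name pad_field() is hypothetical. Note that the width here is counted in wide characters, not in display columns.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Left-justify src in a field of width wide characters, padded with pad.
 * dst must have room for width + 1 elements. A longer src is truncated.
 * pad_field() is a hypothetical helper.
 */
void
pad_field(wchar_t *dst, size_t width, const wchar_t *src, wchar_t pad)
{
        size_t len = wcslen(src);

        if (len > width)
                len = width;            /* truncate to the field width */
        wmemset(dst, pad, width);       /* fill the whole field */
        wmemcpy(dst, src, len);         /* overwrite the leading part */
        dst[width] = L'\0';
}
```

The counts passed to these functions are numbers of wchar_t elements, not bytes, which is the main difference from memset() and memcpy().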

Wide-Character Input and Output

The following functions are used for wide-character input and output. These functions perform implicit conversion between file code (multibyte data) and internal process code (wide-character data).

fgetwc(), getwc()

Get a wide-character code from a stream

getwchar()

Get a wide character from the standard input stream

fgetws()

Get a wide-character string from a stream

getws() (*)

Get a wide-character string from the standard input stream

fputwc(), putwc()

Put a wide-character code on a stream

putwchar()

Put a wide-character code on the standard output stream

fputws()

Put a wide-character string on a stream

putws() (*)

Put a wide-character string on the standard output stream

fwide()

Set the stream orientation to byte or wide-character

ungetwc()

Push wide-character code back into the input stream
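The following sketch reads a stream one wide character at a time with fgetwc() and counts the newline characters. The helper name count_lines() is hypothetical. The stream is assumed to be unoriented or already wide oriented when the function is called.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Count the newline characters in a stream by reading it one wide
 * character at a time. Returns -1 on a read or conversion error.
 * count_lines() is a hypothetical helper.
 */
long
count_lines(FILE *fp)
{
        long lines = 0;
        wint_t wc;      /* wint_t, so that WEOF can be represented */

        while ((wc = fgetwc(fp)) != WEOF) {
                if (wc == L'\n')
                        lines++;
        }
        if (ferror(fp))
                return (-1);    /* includes invalid multibyte input */
        return (lines);
}
```

The conversion from file code to process code happens inside fgetwc(), so the loop never sees partial multibyte characters.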

The following functions are used for formatting wide-character input and output:

fwprintf(), wprintf(), swprintf(), wsprintf() (*)

Print formatted wide-character output

vfwprintf(), vwprintf(), vswprintf()

Wide-character formatted output of a stdarg argument list

fwscanf(), wscanf(), swscanf(), wsscanf() (*)

Convert formatted wide-character input

vfwscanf(), vwscanf(), vswscanf()

Convert formatted wide-character input using a stdarg argument list

The functions marked with (*) were added to Oracle Solaris before the UNIX 98 standard that introduced the Multibyte Support Extension (MSE). They require inclusion of the widec.h header instead of the default wchar.h.
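As a sketch of formatted wide-character output into a buffer, the following helper uses the standard swprintf() function from wchar.h. The helper name format_count() is hypothetical. Unlike snprintf(), swprintf() returns a negative value when the buffer is too small, which the sketch reports as an error.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Format "label: count" into buf, which holds size wide characters.
 * Returns 0 on success, or -1 if the result does not fit or another
 * error occurs. format_count() is a hypothetical helper.
 */
int
format_count(wchar_t *buf, size_t size, const wchar_t *label, int count)
{
        /* %ls formats a wide-character string argument. */
        int n = swprintf(buf, size, L"%ls: %d", label, count);

        return (n < 0 ? -1 : 0);
}
```

The size argument is a number of wide characters, including the terminating null wide character, not a number of bytes.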