Handling Characters and Character Strings

Character codes used for handling characters and character strings can be categorized into two groups:

Multibyte (file code): File code is used for text data exchange and for storing in a file. It has fixed byte ordering regardless of the underlying system, which is Big Endian byte ordering. Codesets like UTF-8, EUC, single-byte codesets, BIG5, Shift-JIS, PCK, GBK, GB18030, and so on come under this category. The term multibyte character in the context of the functions described in this section is a general term that refers to the codeset of the current locale, even though it might in some cases be a single-byte codeset.
Wide characters (process code): Process code is a fixed-width representation of a character used for internal processing. It is in the native byte ordering of the platform, which can be either Big Endian or Little Endian. Encodings like UTF-32, UCS-2, and UCS-4 can be wide-character encodings.

Conversion between multibyte data and wide-character data is often necessary. When a program takes input from a file, the multibyte data in the file is converted into wide-character process code by using input functions like fscanf() and fwscanf() or by using conversion functions like mbtowc() and mbsrtowcs() after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf() and fprintf() or apply conversion functions like wctomb() and wcsrtombs() before the output.

Functions for handling characters, wide characters, and corresponding data types are described in the following sections.

Character Types and Definitions

The ISO/IEC 9899 standard defines the term "wide character" and the wchar_t and wint_t data types.

A wide character is a representation of a single character that fits into an object of type wchar_t.
The wchar_t is an integer type capable of representing all characters for all supported locales.
The wint_t is an integer type capable of storing any valid value of wchar_t or WEOF.
A wide-character string (also wide string or process code string) is a sequence of wide characters terminated by a null wide character code.

Note - The ISO/IEC 9899 standard does not specify the form or the encoding of the contents for the wchar_t data type. Because it is an implementation-specific data type, it is not portable. Although many implementations use some Unicode encoding forms for the contents of the wchar_t data type, do not assume that the contents of wchar_t are Unicode. Some platforms use UCS-4 or UCS-2 for their wide-character encoding.

In Oracle Solaris, the internal form of wchar_t is specific to a locale. In the Oracle Solaris Unicode locales, wchar_t has the UTF-32 Unicode encoding form, and other locales have different representations.

For more information, see the stddef.h(3HEAD) and wchar.h(3HEAD) man pages.

Integer Coded Character Classification Functions

The following functions are used for character classification and return a non-zero value for true, and 0 for false. With the exception of the isascii() function, all of these functions are locale sensitive, specifically to the LC_CTYPE category of the locale of the current thread or process, or of the locale specified as an argument to functions whose names end with _l.

isalpha()
isalpha_l(): Test for an alphabetic character
isalnum()
isalnum_l(): Test for an alphanumeric character
isascii(): Test for a 7-bit US-ASCII character
isblank()
isblank_l(): Test for a blank character
iscntrl()
iscntrl_l(): Test for a control character
isdigit()
isdigit_l(): Test for a decimal digit
isgraph()
isgraph_l(): Test for a visible character
islower()
islower_l(): Test for a lowercase letter
isprint()
isprint_l(): Test for a printable character
ispunct()
ispunct_l(): Test for a punctuation character
isspace()
isspace_l(): Test for a white-space character
isupper()
isupper_l(): Test for an uppercase letter
isxdigit()
isxdigit_l(): Test for a hexadecimal digit

These functions should not be used in a locale with a multibyte codeset, such as UTF-8. Use the wide-character classification functions described in the following section for multibyte codesets.

The behavior of some of these functions also depends on the compiler options used at compile time. The ctype(3C) man page describes the "Default" and "Standard conforming" behaviors for the isalpha(), isgraph(), isprint(), and isxdigit() functions. For example, the isalpha() function is defined as follows:

Default isalpha(): Tests for any character for which isupper() or islower() is true.
Standard Conforming isalpha(): Tests for any character for which isupper() or islower() is true, or any character that is one of the current locale-defined set of characters for which none of iscntrl(), isdigit(), ispunct(), or isspace() is true. In the C locale, isalpha() returns true only for the characters for which isupper() or islower() is true.

This has consequences for languages or alphabets whose letters have no case (also called unicase), such as Arabic, Hebrew, or Thai. For alphabetic characters such as aleph (0xE0) in the Hebrew legacy locale he_IL.ISO8859-8, the isupper() and islower() functions always return false. Therefore, with the default behavior, the isalpha() function also returns false for such characters. If compiler options are enabled for the standard conforming behavior, the isalpha() function returns true for such characters. For more information, see the isalpha(3C) and standards(7) man pages.

See also the Oracle Developer Studio 12.6: C User's Guide and the ctype(3C) and SUSv3(7) man pages.

Wide-Character Classification Functions

The following man pages describe functions that classify wide characters and return a non-zero value for TRUE, and 0 for FALSE. These functions check the given wide character against named character classes, such as alpha, lower, or jkana, which are defined in the LC_CTYPE category of the locale of the current thread or process, or of the locale specified as an argument to functions whose names end with _l. Therefore, these functions are locale sensitive.

iswalpha()
iswalpha_l(): Test for an alphabetic wide character
iswalnum()
iswalnum_l(): Test for an alphanumeric wide character
iswascii()
iswascii_l(): Test whether a wide character represents a 7-bit US-ASCII character
iswblank()
iswblank_l(): Test for a blank wide character
iswcntrl()
iswcntrl_l(): Test for a control wide character
iswdigit()
iswdigit_l(): Test for a decimal digit wide character
iswgraph()
iswgraph_l(): Test for a visible wide character
iswlower()
iswlower_l(): Test for a lowercase letter wide character
iswprint()
iswprint_l(): Test for a printable wide character
iswpunct()
iswpunct_l(): Test for a punctuation wide character
iswspace()
iswspace_l(): Test for a white-space wide character
iswupper()
iswupper_l(): Test for an uppercase letter wide character
iswxdigit()
iswxdigit_l(): Test for a hexadecimal digit wide character
isenglish(): Test for a wide character representing an English language character, excluding US-ASCII characters
isideogram(): Test for a wide character representing an ideographic language character, excluding US-ASCII characters
isnumber(): Test for a wide character representing a digit, excluding US-ASCII characters
isphonogram(): Test for a wide character representing a phonetic language character, excluding US-ASCII characters
isspecial(): Test for a wide character representing a special language character, excluding US-ASCII characters

The following character classes are defined in all the locales:

alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit

The isenglish(), isideogram(), isnumber(), isphonogram(), and isspecial() functions are legacy wide-character classification functions that are specific to Oracle Solaris. The character classes for these functions are defined only in the following Asian locales and their variants: ko_KR.EUC, zh_CN.EUC, zh_CN.GBK, zh_CN.GB18030, zh_HK.BIG5HK, zh_TW.BIG5, and zh_TW.EUC. In all other locales, including Unicode locales, these functions always return false.

You can query for a specific character class in a generic way by using the following functions:

wctype(): Define character class

iswctype()
iswctype_l(): Test character for specified class

Note - mbtowc_l() is not currently available. A wide character that is passed to iswalpha_l(), for example, needs to be generated with mbtowc() while the current thread's locale is set by uselocale() either to the same locale or to another locale that uses the same codeset.

Example 13 Querying Character Class of a Wide Character

In the following example, the iswctype() and wctype() functions are used to check whether the given Unicode character belongs to the jhira character class. The jhira character class represents the Japanese Hiragana script.

  wchar_t wc;     /* mbtowc() stores its result in a wchar_t */
  int     ret;

  setlocale(LC_ALL, "ja_JP.UTF-8");

  /* "\xe3\x81\xba" is UTF-8 for HIRAGANA LETTER PE */
  ret = mbtowc(&wc, "\xe3\x81\xba", 3);
  if (ret == -1) {
          /* Invalid character sequence. */
          :
  }

  if (iswctype((wint_t)wc, wctype("jhira"))) {
          /* %lc prints a wide character passed as wint_t */
          wprintf(L"'%lc' is a hiragana character.\n", (wint_t)wc);
  }

The example will produce the following output:

'ぺ' is a hiragana character.

Character Transliteration Functions

The following functions map characters between character classes (character transliteration). If a mapping for a character is defined in the locale of the current thread or process, or in the locale specified as an argument to functions whose names end with _l, the functions return the transliterated character. These functions are locale sensitive.

tolower()
tolower_l(): Convert an uppercase character to lowercase
toupper()
toupper_l(): Convert a lowercase character to uppercase
towlower()
towlower_l(): Convert an uppercase wide character to lowercase
towupper()
towupper_l(): Convert a lowercase wide character to uppercase

The following functions provide a generic way to perform character transliteration:

wctrans()
wctrans_l(): Define character mapping
towctrans()
towctrans_l(): Wide-character mapping

For more information about related functions for Unicode strings, see Processing UTF-8 Strings.

Note - mbtowc_l() is not currently available. A wide character that is passed to towlower_l(), for example, needs to be generated with mbtowc() while the current thread's locale is set by uselocale() either to the same locale or to another locale that uses the same codeset.

Example 14 Transliteration of a Wide Character

The following code fragment shows how to use the towupper() function for transliterating a Unicode wide character to uppercase.

  wchar_t wc;     /* mbtowc() stores its result in a wchar_t */
  int     ret;

  setlocale(LC_ALL, "cs_CZ.UTF-8");

  /* "\xc5\x99" is UTF-8 for LATIN SMALL LETTER R WITH CARON */
  ret = mbtowc(&wc, "\xc5\x99", 2);
  if (ret == -1) {
          /* Invalid character sequence. */
          :
  }

  /* %lc prints a wide character passed as wint_t */
  wprintf(L"'%lc' is uppercase of '%lc'.\n", towupper((wint_t)wc), (wint_t)wc);

The example will produce the following output:

'Ř' is uppercase of 'ř'.

String Collation

The following functions are used for string comparison based on the collation data of the locale of the current thread or process, or of the locale specified as an argument to functions whose names end with _l:

strcoll()
strcoll_l(): String comparison using collating information
strxfrm()
strxfrm_l(): String transformation
wcscoll(), wscoll()
wcscoll_l(): Wide-character string comparison using collating information
wcsxfrm(), wsxfrm()
wcsxfrm_l(): Wide-character string transformation

For better performance when sorting large lists of strings, use the strxfrm() and strcmp() functions instead of the strcoll() function, and the wcsxfrm() and wcscmp() functions instead of the wcscoll() function.

When using the strxfrm() and wcsxfrm() functions, note that the transformed string is not in a human-readable form. The transformed strings are intended to be used as input to the strcmp() and wcscmp() functions, respectively.

Note - mbstowcs_l() is not currently available. A wide-character string that is passed to wcscoll_l() or wcsxfrm_l(), for example, needs to be generated with mbstowcs() while the current thread's locale is set by uselocale() either to the same locale or to another locale that uses the same codeset.

For more information, see the strcmp(3C) and wcscmp(3C) man pages.

Conversion Between Multibyte and Wide Characters

The following functions are used for conversion between the codeset of the current locale (multibyte) and the process code (wide-character representation).

These functions are locale sensitive and depend on the LC_CTYPE category of the current locale. They return the same error on incomplete characters and illegal characters. For more information about illegal characters and incomplete characters, see Converting Codesets.

mblen(): Get the number of bytes in a character
mbtowc(): Convert a character to a wide-character code
mbstowcs(): Convert a character string to a wide-character string
wctomb(): Convert a wide-character code to a character
wcstombs(): Convert a wide-character string to a character string

The following functions are restartable, and can be used to handle incomplete character cases. These cases occur when an incomplete character reported from the previous call along with the additional bytes of the current call is a valid character. In order to store the state information required for this kind of processing, the functions either use a user-provided or an internal state structure of type mbstate_t. The mbsinit() function is used to detect whether an mbstate_t structure is in an initial state.

mbsinit(): Determine the conversion object status
mbrlen(): Get the number of bytes in a character (restartable)
mbrtowc(): Convert a character to a wide-character code (restartable)
mbsrtowcs(), mbsnrtowcs(): Convert a character string to a wide-character string (restartable)
wcrtomb(): Convert a wide-character code to a character (restartable)
wcsrtombs(), wcsnrtombs(): Convert a wide-character string to a character string (restartable)
c16rtomb(): Convert a 16-bit character code to a character (restartable)
c32rtomb(): Convert a 32-bit character code to a character (restartable)
mbrtoc16(): Convert a character to a 16-bit character code (restartable)
mbrtoc32(): Convert a character to a 32-bit character code (restartable)

The following functions are used for conversion between a single-byte character in the codeset of the current locale and the process code. They determine whether the integer-coded character can be represented as a single byte. If it cannot, wctob() returns EOF and btowc() returns WEOF.

wctob(): Convert a wide-character to a single-byte character, if possible
btowc(): Convert a single-byte character to a wide character, if possible

Wide-Character Strings

The following functions are used to handle wide-character strings:

wcslen(), wslen(), wcsnlen(): Get length of a fixed-sized wide-character string
wcschr(), wschr(): Find the first occurrence of a wide character in a wide-character string
wcsrchr(), wsrchr(): Find the last occurrence of a wide character in a wide-character string
wcspbrk(): Scan a wide-character string for a wide-character code
wcscat(), wscat(), wcsncat(): Concatenate two wide-character strings
wcscmp(), wscmp(), wcsncmp(): Compare two wide-character strings
wcscpy(), wscpy(): Copy a wide-character string
wcsncpy(), wsncpy(): Copy part of a wide-character string
wcpcpy(), wcpncpy(): Copy a wide-character string, returning a pointer to its end
wcsspn(), wsspn(): Get the length of a wide-character substring
wcscspn(), wscspn(): Get the length of a complementary wide-character substring
wcstok(), wstok(): Split a wide-character string into tokens
wcsstr(), wcswcs(): Find a wide-character substring
wcwidth(), wcswidth(), wscol(): Get the number of column positions of a wide-character or wide-character string
wscasecmp(), wsncasecmp(), wcscasecmp_l(), wcsncasecmp_l(): Case-insensitive wide-character string comparison
wcsdup(), wsdup(): Duplicate a wide-character string

The wcswcs() function was marked legacy and may be removed from the ISO/IEC 9899 standard in the future. Use the wcsstr() function instead.

The functions for converting wide characters to numbers are as follows:

wcstol(), wstol(), wcstoll(), watol(), watoll(), watoi(): Convert a wide-character string to a long integer
wcstoul(), wcstoull(): Convert a wide-character string to an unsigned long integer
wcstod(), wstod(), wcstof(), wcstold(), watof(): Convert a wide-character string to a floating-point number

Note - mbstowcs_l() is not currently available. A wide-character string that is passed to wcscasecmp_l() or wcsncasecmp_l(), for example, needs to be generated with mbstowcs() while the current thread's locale is set by uselocale() either to the same locale or to another locale that uses the same codeset.

The following man pages describe functions that perform in-memory operations on wide characters. They are wide-character equivalents of functions like memset(), memcpy(), and so on. These functions are not affected by the locale and all wchar_t values are treated identically.

wmemset(3C): Set wide characters in memory
wmemcpy(3C): Copy wide characters in memory
wmemmove(3C): Copy wide characters in memory with overlapping areas
wmemcmp(3C): Compare wide characters in memory
wmemchr(3C): Find a wide character in memory

Wide-Character Input and Output

The following functions are used for wide-character input and output. These functions perform implicit conversion between file code (multibyte data) and internal process code (wide-character data).

fgetwc(), getwc(): Get a wide-character code from a stream
getwchar(): Get a wide character from a standard input stream
fgetws(): Get a wide-character string from a stream
getws() (*): Get a wide-character string from a standard input stream
fputwc(), putwc(): Put a wide-character code on a stream
putwchar(): Put a wide-character code on the standard output stream
fputws(): Put a wide-character string on a stream
putws() (*): Put a wide-character string on the standard output stream
fwide(): Set the stream orientation to byte or wide-character
ungetwc(): Push wide-character code back into the input stream

The following functions are used for formatting wide-character input and output:

fwprintf(), wprintf(), swprintf(), wsprintf() (*): Print formatted wide-character output
vfwprintf(), vwprintf(), vswprintf(): Wide-character formatted output of a stdarg argument list
fwscanf(), wscanf(), swscanf(), wsscanf() (*): Convert formatted wide-character input
vfwscanf(), vwscanf(), vswscanf(): Convert formatted wide-character input using a stdarg argument list

The functions marked with (*) were added to Oracle Solaris before the UNIX 98 standard that introduced the Multibyte Support Extension (MSE). They require inclusion of the widec.h header instead of the default wchar.h.