Asian-Language Support in the Solaris Operating Environment

5.3 Text Manipulation

When supporting multibyte languages, it is important to understand the difference between multibyte, wide and Unicode characters, and the impact of these on software development.

In the Solaris operating environment, a multibyte character (or file code) is a sequence of one or more bytes, and a multibyte string is terminated by a null byte. Thus, a string may contain characters of different lengths. A wide character (or process code), on the other hand, has a fixed size: in the Solaris operating environment, a wide character is four bytes long. The Solaris operating environment also supports the Unicode UTF-8 format, which is itself a variable-length multibyte encoding.
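To make the difference between bytes and characters concrete, the following is a minimal sketch that counts the characters in a multibyte string with the standard mblen() function. The helper name mb_char_count() is hypothetical and not part of any Solaris API, and the sketch assumes the application has already selected its locale, for example with setlocale(LC_ALL, "").

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: count the characters (not the bytes) in a
 * multibyte string, using the encoding of the current locale.
 * Returns (size_t)-1 if an invalid multibyte sequence is found.
 */
size_t mb_char_count(const char *mbs)
{
    assert(mbs != NULL);

    size_t count = 0;
    size_t remaining = strlen(mbs);   /* length in bytes */

    (void)mblen(NULL, 0);             /* reset any shift state */

    while (remaining > 0) {
        /* mblen() reports how many bytes the next character occupies. */
        int len = mblen(mbs, remaining);
        if (len < 0)
            return (size_t)-1;        /* invalid or truncated sequence */
        if (len == 0)
            break;                    /* null character: end of string */
        mbs += len;
        remaining -= (size_t)len;
        count++;
    }
    return count;
}
```

For a string that contains only single-byte characters, the result equals strlen(). For a string with double-byte or three-byte characters, the character count is smaller than the byte count.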

In many cases, application code does not need to distinguish double-byte (or three-byte) characters from single-byte characters. It is usually simpler to convert multibyte strings (file code) to wide-character format (process code) before manipulating or processing text data, so that every character has the same size.

The following APIs convert multibyte characters:

mbstowcs(): Convert multibyte string to wide-character string
mbtowc(): Convert a multibyte character to a wide-character code
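The conversion from file code to process code might be sketched as follows. The helper name mb_to_wide() is hypothetical. The sketch assumes that the locale has already been set by the application, and that mbstowcs() accepts a null destination pointer to count the required wide characters, which POSIX (XSI) specifies.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a multibyte string (file code) to a
 * newly allocated wide-character string (process code).
 * Returns NULL on an invalid sequence or allocation failure.
 * The caller frees the result with free().
 */
wchar_t *mb_to_wide(const char *mbs)
{
    assert(mbs != NULL);

    /* With a null destination, mbstowcs() only counts the characters. */
    size_t n = mbstowcs(NULL, mbs, 0);
    if (n == (size_t)-1)
        return NULL;                  /* invalid multibyte sequence */

    wchar_t *wcs = malloc((n + 1) * sizeof(wchar_t));
    if (wcs == NULL)
        return NULL;

    /* Convert n characters plus the terminating null wide character. */
    if (mbstowcs(wcs, mbs, n + 1) == (size_t)-1) {
        free(wcs);
        return NULL;
    }
    return wcs;
}
```

Sizing the buffer from the counted length, rather than from the byte length of the input, avoids assuming how many bytes each character occupies in the file code.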

The following wstring(3c) APIs process wide-character strings:

wcscmp(): Compare wide-character strings
wcscpy(): Copy wide-character strings
wcslen(): Get length of wide-character string
wcschr(): Find character in wide-character string
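As an illustration of the functions listed above, here is a minimal sketch that combines them. The helper names wide_equal() and wide_copy_and_find() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: return 1 if two wide-character strings are equal. */
int wide_equal(const wchar_t *a, const wchar_t *b)
{
    assert(a != NULL && b != NULL);
    return wcscmp(a, b) == 0;
}

/*
 * Hypothetical helper: copy src into dst, whose capacity is given in
 * wide characters, then return the index of the first occurrence of wc.
 * Returns -1 if src does not fit in dst or if wc is not found.
 */
long wide_copy_and_find(wchar_t *dst, size_t cap,
                        const wchar_t *src, wchar_t wc)
{
    assert(dst != NULL && src != NULL);

    /* wcslen() counts wide characters, not bytes. */
    if (wcslen(src) + 1 > cap)
        return -1;

    wcscpy(dst, src);

    /* wcschr() returns a pointer to the match, or NULL if none. */
    const wchar_t *hit = wcschr(dst, wc);
    return hit != NULL ? (long)(hit - dst) : -1;
}
```

Because every wide character has the same size, the index returned here is a character position, which is not the case for a byte offset into a multibyte string.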

Note -

File code is in multibyte format. Process code is in wide-character format. Do not assume a particular character encoding for the process code.