Handling Characters and Character Strings

Character codes used for handling characters and character strings can be categorized into two groups:

Multibyte (file code): File code is used for exchanging text data and for storing it in files. Its byte ordering is fixed (Big Endian) regardless of the byte ordering of the underlying system. Codesets such as UTF-8, EUC, single-byte codesets, BIG5, Shift-JIS, PCK, GBK, and GB18030 fall into this category. In the context of the functions described in this section, the term multibyte character is a general term for a character in the codeset of the current locale, even though that codeset might in some cases be a single-byte codeset.
Wide characters (process code): Process code is a fixed-width representation of a character used for internal processing. It is in the native byte ordering of the platform, which can be either Big Endian or Little Endian. Encodings like UTF-32, UCS-2, and UCS-4 can be wide-character encodings.
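
One practical consequence of the distinction above is that the number of bytes in a multibyte string need not equal the number of characters it holds. The following is a minimal sketch, not a definitive implementation, that counts characters with the standard mbrlen() function. The helper name mb_char_count is hypothetical and chosen only for illustration, and the result depends on the codeset of the current locale.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the characters (not the bytes) in a
 * multibyte string, interpreted in the codeset of the current locale.
 * Returns (size_t)-1 if the string holds an invalid or incomplete
 * multibyte sequence.
 */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);   /* bytes left to examine */
    size_t count = 0;               /* characters seen so far */

    memset(&state, 0, sizeof state);    /* start in the initial shift state */

    while (remaining > 0) {
        /* mbrlen() reports how many bytes the next character occupies. */
        size_t n = mbrlen(s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* defensive: reached a null character */

        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

A program would typically call setlocale(LC_ALL, "") before using such a helper so that the user's codeset is in effect. In a UTF-8 locale, a string that contains non-ASCII characters then reports fewer characters than strlen() reports bytes, whereas in a single-byte codeset the two values are equal.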

Conversion between multibyte data and wide-character data is often necessary. When a program takes input from a file, the multibyte data in the file is converted into wide-character process code by using input functions like fscanf() and fwscanf() or by using conversion functions like mbtowc() and mbsrtowcs() after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf() and fprintf() or apply conversion functions like wctomb() and wcsrtombs() before the output.
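
The conversion in both directions can be sketched with mbsrtowcs() and wcsrtombs(). The helper names mb_to_wide and wide_to_mb below are hypothetical and not part of any library. The sketch assumes the input is a complete, null-terminated string in the codeset of the current locale, and it uses the common two-pass idiom: a first call with a null destination to measure the result, then a second call to convert.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a multibyte string (file code) to a newly
 * allocated wide-character string (process code). Returns NULL on an
 * invalid sequence or allocation failure. The caller frees the result.
 */
wchar_t *mb_to_wide(const char *mbs)
{
    mbstate_t state;
    const char *src = mbs;
    size_t len;
    wchar_t *wcs;

    /* Pass 1: a NULL destination only counts the wide characters needed. */
    memset(&state, 0, sizeof state);
    len = mbsrtowcs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    wcs = malloc((len + 1) * sizeof *wcs);
    if (wcs == NULL)
        return NULL;

    /* Pass 2: convert, including the terminating null wide character. */
    src = mbs;
    memset(&state, 0, sizeof state);
    if (mbsrtowcs(wcs, &src, len + 1, &state) == (size_t)-1) {
        free(wcs);
        return NULL;
    }
    return wcs;
}

/*
 * Hypothetical helper: convert a wide-character string back to a newly
 * allocated multibyte string. Returns NULL if a wide character has no
 * representation in the current codeset or if allocation fails.
 */
char *wide_to_mb(const wchar_t *wcs)
{
    mbstate_t state;
    const wchar_t *src = wcs;
    size_t len;
    char *mbs;

    /* Pass 1: count the bytes needed, excluding the terminating null. */
    memset(&state, 0, sizeof state);
    len = wcsrtombs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    mbs = malloc(len + 1);
    if (mbs == NULL)
        return NULL;

    /* Pass 2: convert, including the terminating null byte. */
    src = wcs;
    memset(&state, 0, sizeof state);
    if (wcsrtombs(mbs, &src, len + 1, &state) == (size_t)-1) {
        free(mbs);
        return NULL;
    }
    return mbs;
}
```

Both helpers interpret data according to the LC_CTYPE category of the current locale, so a program would normally call setlocale(LC_ALL, "") once at startup before using them. The restartable functions are used here instead of mbstowcs() and wcstombs() because they keep the conversion state in a caller-supplied mbstate_t object instead of in hidden internal state.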

Functions for handling characters, wide characters, and corresponding data types are described in the following sections.