Asian-Language Support in the Solaris Operating Environment

2.4 Codeset Independence

The Solaris operating environment architecture supports codeset independence (CSI). CSI extends the set of supported codesets beyond Extended UNIX® Codeset (EUC) encodings to include non-EUC encodings such as PC-Kanji (also known as ShiftJIS) and GBK.

Text-handling routines should not make assumptions about the size of a character in the codeset. Nor should other locale-specific components, such as the window system, input method, and online help, depend on a particular codeset. Figure 2-1 shows the locale-specific components that should be codeset independent.

Figure 2-1 Design model for international software

Support for Unicode, a universal codeset encompassing most written characters, is often confused with codeset independence. Unicode is often referred to as ISO 10646, the corresponding standard published by the International Organization for Standardization (ISO). Codeset independence must also apply to Unicode: although Unicode supports many languages and writing systems, to an application Unicode is just another codeset. The Solaris operating environment supports the Unicode UTF-8 (File System Safe UCS Transformation Format) format, which is compatible with ISO 10646. For more information, see Unicode Support in the Solaris Operating Environment.

Note -

Codeset independence is often wrongly taken for granted because programming languages tend to treat a character (in ISO C terms) and a char (or byte) as having a one-to-one relationship. In written languages, however, a character can occupy one byte or several. An alphabetic character from most European languages can be represented in one byte. An Asian-language character often requires more than one byte because the character set contains more characters than one byte can represent.
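The difference between bytes and characters can be illustrated with a minimal sketch in ISO C. The function name count_chars is hypothetical and is not part of any Solaris API; the sketch assumes the application has already called setlocale() so that mbrtowc() interprets the string in the locale's codeset.

```c
#include <assert.h>
#include <locale.h>   /* setlocale(), called by the application at startup */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count characters (not bytes) in a multibyte string
 * encoded in the current locale's codeset.
 * Returns (size_t)-1 if the string contains an invalid sequence. */
size_t count_chars(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    size_t remaining = strlen(s);       /* length in bytes */
    size_t count = 0;                   /* length in characters */

    while (remaining > 0) {
        /* mbrtowc() reports how many bytes the next character occupies. */
        size_t n = mbrtowc(NULL, s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete character */
        if (n == 0)
            break;                      /* null character reached */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

For an ASCII string the result equals strlen(). For a string of two Kanji characters encoded in UTF-8, strlen() reports 6 bytes while count_chars() reports 2 characters, provided a UTF-8 locale is in effect.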

Furthermore, applications often assume the representation of a given character. A codeset-independent application, by contrast, does not assume that 'a' = \x61 or that char = byte. Instead, during text manipulation, such as truncating a stream of characters, the application relies on the APIs to determine the number of bytes in each character and the character's definition or type. By not assuming the size of a character or the codeset, the application remains codeset independent.
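Truncation is a common place where the char = byte assumption causes damage, because cutting at an arbitrary byte offset can split a multibyte character. The following is a minimal sketch, not a definitive implementation: the name truncate_at_char_boundary is hypothetical, and the code assumes the locale has been set with setlocale() before the call.

```c
#include <assert.h>
#include <locale.h>   /* setlocale(), called by the application at startup */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return the largest byte length that is no greater
 * than max_bytes and that ends on a character boundary in the current
 * locale's codeset. Scanning stops early at an invalid sequence. */
size_t truncate_at_char_boundary(const char *s, size_t max_bytes)
{
    assert(s != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    size_t len = strlen(s);
    size_t limit = len < max_bytes ? len : max_bytes;
    size_t used = 0;

    while (used < limit) {
        /* Ask the library how many bytes the next character occupies. */
        size_t n = mbrtowc(NULL, s + used, len - used, &state);
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            break;                      /* invalid, incomplete, or null */
        if (used + n > limit)
            break;                      /* next character would not fit whole */
        used += n;
    }
    return used;
}
```

A caller would copy only the returned number of bytes and then append a terminating null. In a UTF-8 locale, a limit of 4 bytes applied to two 3-byte characters yields 3, so the second character is dropped whole rather than cut in half.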

Solaris maintains a codeset independence framework. Applications can use Solaris APIs to determine the number of bytes a character occupies and the character's definition or type. By making no assumptions about the underlying codeset, an application is codeset independent in Solaris.
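Determining a character's type, as opposed to its size, can be sketched with the standard wide-character interfaces. This example is illustrative only: count_alpha is a hypothetical name, the choice of mbstowcs() and iswalpha() is an assumption about which interfaces an application might use, and the locale is assumed to have been set with setlocale().

```c
#include <assert.h>
#include <locale.h>   /* setlocale(), called by the application at startup */
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the alphabetic characters in a multibyte
 * string, using the current locale's definition of "alphabetic".
 * Returns -1 on an invalid sequence or an allocation failure. */
long count_alpha(const char *s)
{
    assert(s != NULL);

    /* With a null destination, mbstowcs() returns the number of wide
     * characters needed, excluding the terminating null. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return -1;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return -1;
    mbstowcs(w, s, n + 1);              /* convert to one wchar_t per character */

    long alpha = 0;
    for (size_t i = 0; i < n; i++) {
        if (iswalpha((wint_t)w[i]))     /* classification comes from the locale */
            alpha++;
    }
    free(w);
    return alpha;
}
```

Because each wchar_t holds one whole character, the loop never inspects individual bytes, and the same code works whether the locale's codeset is EUC, PC-Kanji, GBK, or UTF-8.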