Korean Solaris User's Guide

Glossary

ANSI

American National Standards Institute. ANSI proposes standard definitions for different programming languages. The most recent standard for the C language, prepared by the ANSI C X3J11 Committee, includes library functions for computing with multibyte characters for international usage, as well as a new data type, wchar_t, for dealing with four-byte characters. This standard is not completed, so it is referred to as the “proposed ANSI C standard,” or ANSI C-X3J11.

ASCII

American Standard Code for Information Interchange. A seven-bit code containing English uppercase and lowercase letters, punctuation, numbers, and control codes. Different applications use the eighth bit in each byte for parity checking, communication and message-passing protocols, compacting data, or other purposes. Internationalized applications and utilities that handle multiple code sets or multibyte characters cannot use this bit for such purposes, because the bit is needed to represent the characters themselves.
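
The role of the eighth bit can be illustrated with a short Python sketch. The helper names is_seven_bit and strip_eighth_bit are hypothetical, and Python's euc_kr codec is used here only as a convenient stand-in for the Korean EUC code set described in this glossary.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if no byte has its eighth (most significant) bit set."""
    return all((byte & 0x80) == 0 for byte in data)


def strip_eighth_bit(data: bytes) -> bytes:
    """Clear the eighth bit of every byte, as a parity-stripping utility might."""
    return bytes(byte & 0x7F for byte in data)


ascii_text = b"Solaris"
# Assumption: Python's euc_kr codec stands in for the Korean EUC code set.
korean_text = "한글".encode("euc_kr")

print(is_seven_bit(ascii_text))   # True
print(is_seven_bit(korean_text))  # False

# Stripping the eighth bit leaves ASCII intact but corrupts multibyte text.
print(strip_eighth_bit(ascii_text) == ascii_text)    # True
print(strip_eighth_bit(korean_text) == korean_text)  # False
```

A utility that reuses the eighth bit in this way is therefore safe for pure ASCII data but not for Korean text.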

Category

In the Korean Solaris documentation set, category is related to localization. A category is a portion of a country's language representation and cultural conventions. For instance, the date is often represented in the U.S. as Month, Day, Year; while in another country it might be Day, Month, Year. The date and time can be thought of as one category of a local language. Categories also refer to the program categories, the environment variables that are related to categories, and the ANSI localization tables for each category.
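
As a rough illustration of categories, the sketch below lists some of the locale categories exposed by Python's standard locale module and sets only the date-and-time category. It uses the "C" locale because that locale is always available; which other locale names exist depends on the system, so none are assumed here.

```python
import locale
import time

# Some of the locale categories available through Python's locale module.
categories = {
    "LC_CTYPE": locale.LC_CTYPE,        # character classification
    "LC_COLLATE": locale.LC_COLLATE,    # string sorting order
    "LC_TIME": locale.LC_TIME,          # date and time formats
    "LC_NUMERIC": locale.LC_NUMERIC,    # number formats
    "LC_MONETARY": locale.LC_MONETARY,  # currency formats
}

# Change only the date-and-time category; the other categories are untouched.
current = locale.setlocale(locale.LC_TIME, "C")
print(current)

# Format a fixed date (31 January 2024, a Wednesday) using the LC_TIME
# conventions of the locale that was just selected.
fixed_date = (2024, 1, 31, 0, 0, 0, 2, 31, 0)
print(time.strftime("%x", fixed_date))
```

Setting a single category leaves the rest of the program's locale unchanged, which is the point of dividing a locale into categories.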

Character Set

A character set is defined as a set of elements used for the organization, control, or representation of data. Character sets may be composed of alphabets, ideograms, or other units. This definition is deliberately open-ended: character sets may contain other character sets, which makes the boundaries unclear. For example, the KS C 5601 character set contains English, Greek, Russian, and Japanese character sets, in addition to Hangul syllables (consonant and vowel combinations), Hanja ideograms (Chinese characters), and many other characters.

code set

Also called a coded character set, this is a set of unambiguous rules that establishes a character set and the one-to-one relationship between each character in the character set and its bit representation. For example, the English character set, including punctuation and numbers, can be mapped to the ASCII code set in such a way that each character corresponds to only one bit code, and no bit code corresponds to more than one character.
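
The one-to-one relationship can be checked with a minimal Python sketch. The variable names are hypothetical; the code values come from Python's built-in ord function, which returns the ASCII code for these characters.

```python
import string

# English letters, digits, and punctuation, as in the example above.
characters = string.ascii_letters + string.digits + string.punctuation

# Forward mapping (character -> code) and reverse mapping (code -> character).
to_code = {ch: ord(ch) for ch in characters}
to_char = {code: ch for ch, code in to_code.items()}

# The mapping is one-to-one if neither direction loses an entry.
one_to_one = len(to_code) == len(to_char) == len(characters)
print(one_to_one)                   # True

# Each code fits in seven bits.
print(format(to_code["A"], "07b"))  # 1000001
```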

Combination code

Another name for Packed code or Johap code described below.

Completion code

Also called Wansung. Completion code is a predefined set of Korean character codes that maps preselected Hangul, Hanja, special symbols, alphabets of other languages, and so on into a two-byte coding space. This representation is defined in KS C 5601 and is used as EUC code set 1 by the Korean Solaris Operating System.

EUC

Extended UNIX Code. EUC describes four code sets modelled on ISO 2022. Each code set can contain one or more different character sets, like the Hangul and Hanja character sets in KS C 5601. The four code sets are referred to as code sets 0, 1, 2, and 3, and in this text they are sometimes abbreviated as cs0, cs1, cs2, and cs3. Other internationalization efforts sometimes call these g0, g1, g2, and g3. Code set 0 is also called the primary code set, and code sets 1, 2, and 3 are called the supplementary code sets. In the Korean and Chinese implementations of the EUC codes, the primary code set (cs0) contains ASCII, and each of its bytes begins with a zero in the most significant bit.
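
The following Python sketch shows how the most significant bit separates cs0 from cs1 in Korean EUC text. The function split_euc_kr is hypothetical, and the sketch assumes that only cs0 (one byte) and cs1 (two bytes) occur in the input; it does not handle code sets 2 and 3.

```python
def split_euc_kr(data: bytes) -> list:
    """Split Korean EUC bytes into (code set label, character bytes) pairs.

    Assumption: only cs0 (one byte, high bit clear) and cs1 (two bytes,
    high bit set) appear in the input.
    """
    result = []
    i = 0
    while i < len(data):
        if (data[i] & 0x80) == 0:
            # High bit clear: a one-byte ASCII character from cs0.
            result.append(("cs0", data[i:i + 1]))
            i += 1
        else:
            # High bit set: a two-byte character from cs1.
            result.append(("cs1", data[i:i + 2]))
            i += 2
    return result


encoded = "A한".encode("euc_kr")
print(split_euc_kr(encoded))
# [('cs0', b'A'), ('cs1', b'\xc7\xd1')]
```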

Hangul

Hangul is the phonetic alphabet commonly used in Korea. Each character corresponds to a spoken syllable, usually a consonant-vowel pair or a consonant-vowel-consonant triad. KS C 5601 defines 2350 Hangul characters used in standard computing.

Hanja

Hanja characters are Korean ideograms, which came originally from ancient China (the word itself means Chinese character). They were adopted many centuries ago and have evolved somewhat different meanings in China and Korea. But because they are not phonetically based, Chinese and Korean Hanja have remained closer in meaning than have Italian, French, and Spanish, which evolved into separate languages over the same time span. The Korean Industry Standard defines the 4888 most frequently used Hanja characters in the KS C 5601 standard.

ISO

International Organization for Standardization. Composed of national standards bodies, this organization studies and makes recommendations on internationalization issues. ISO 2022 describes the code extension techniques on which the Extended UNIX Codes are modelled. Other ISO proposals include the European 8-bit code and communication protocols for internationalization.

Johap code

Johap code is a Packed code (also called Combination code), which is defined in the KS C 5601-1992 document. Unlike the Packed code defined in KS C 5601-1987 or before, Johap code has a set of Hanja characters and special symbol characters.

KSC

Korean Industry Standard Code Set. This is the Korean analogue to ASCII. The KSC describes standards for computing in the Korean environment. KS C 5601 contains code assignments in Completion code for Hangul and Hanja characters, graphics and punctuation characters, two Japanese phonetic alphabets (Hiragana and Katakana), control codes, and several western alphabets (Roman, Russian, and Greek characters). This standard defines 2350 Hangul characters, 4888 Hanja characters, and 986 additional characters (for punctuation, foreign alphabets, numbers, graphics, and others). Each character is two bytes long, and does not utilize the highest or most significant bit of each byte. In other words, it uses the lower seven bits of each byte for character assignments.
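
The counts and the seven-bit layout above can be checked with a short Python sketch. The function names are hypothetical. The sketch assumes that the EUC form of a character is its seven-bit KS C 5601 code with the most significant bit of each byte set, and it uses Python's euc_kr codec as a stand-in for that EUC form.

```python
# Character counts quoted in the definition above.
HANGUL_COUNT = 2350
HANJA_COUNT = 4888
OTHER_COUNT = 986
total = HANGUL_COUNT + HANJA_COUNT + OTHER_COUNT
print(total)  # 8224


def ksc_to_euc(code: int) -> int:
    """Set the most significant bit of both bytes of a two-byte code."""
    return code | 0x8080


def euc_to_ksc(code: int) -> int:
    """Clear the most significant bit of both bytes of a two-byte code."""
    return code & 0x7F7F


# The first Hangul syllable in its EUC form, read as a 16-bit number.
euc = int.from_bytes("가".encode("euc_kr"), "big")
ksc = euc_to_ksc(euc)
print(hex(euc))  # 0xb0a1
print(hex(ksc))  # 0x3021
print(ksc_to_euc(ksc) == euc)  # True
```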

Locale

A locale describes a language or cultural environment. Its setting affects the display or manipulation of language-dependent features. Korean Solaris software provides the locales C for U.S.A., ko for Korean Extended UNIX Code, and ko.UTF-8 for Korean Universal Multiple Octet Coded Character Set Transmission Format.

N-byte code

This coding system assigns each Korean alphabetic consonant or vowel a one-byte code. These are built up into Hangul syllabic characters with the Hangul automata.

Packed code

Packed code (also called Combination code) is a systematic method for coding Hangul syllabic characters in a two-byte code. Each 16-bit (two-byte) character contains a high or most-significant bit (1) and three 5-bit fields. These fields contain the codes for the beginning consonant (x), a middle vowel (y), and an optional ending consonant (z), as follows: 1xxxxxyyyyyzzzzz. Hanja characters cannot be represented in Packed code, because the code is built from phonetic components and many Hanja characters share a single pronunciation. Packed code is defined in KS C 5601-1987 and earlier as a supplementary code set.
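
The bit layout 1xxxxxyyyyyzzzzz can be sketched in Python as follows. The function names are hypothetical, and the field values 2, 3, and 1 are arbitrary numbers chosen to show the arithmetic; the sketch does not reproduce the standard's table that assigns field values to particular consonants and vowels.

```python
def pack_syllable(initial: int, medial: int, final: int) -> int:
    """Pack three 5-bit fields into a 16-bit code of the form 1xxxxxyyyyyzzzzz."""
    for field in (initial, medial, final):
        if not 0 <= field < 32:
            raise ValueError("each field must fit in 5 bits")
    return 0x8000 | (initial << 10) | (medial << 5) | final


def unpack_syllable(code: int) -> tuple:
    """Recover the (initial, medial, final) fields from a 16-bit packed code."""
    if not code & 0x8000:
        raise ValueError("most significant bit must be 1")
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


code = pack_syllable(2, 3, 1)
print(format(code, "016b"))   # 1000100001100001
print(hex(code))              # 0x8861
print(unpack_syllable(code))  # (2, 3, 1)
```

Because each field occupies a fixed position, the components of a syllable can be extracted with shifts and masks instead of a table lookup.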

POSIX

Portable Operating System Interface. An IEEE standards group comprising seven committees that create documents for standardizing and internationalizing UNIX. POSIX document 1003.1 deals with the kernel and system calls. 1003.2 concerns the shell and utilities. The other five deal with real-time computing, communications and networking, and other issues.

UTF-8

Universal Multiple Octet Coded Character Set (UCS) Transmission Format. UTF-8 is a representation of Unicode. The ko.UTF-8 locale provides the Korean-related characters in this standard.
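
As a small illustration, the Python sketch below encodes one Hangul character in UTF-8 and, for comparison, in Korean EUC. It relies on Python's built-in utf-8 and euc_kr codecs; the latter is an assumed stand-in for the EUC code set used by the ko locale.

```python
text = "한"

utf8_bytes = text.encode("utf-8")
euc_bytes = text.encode("euc_kr")  # assumed stand-in for the ko locale's code set

print(utf8_bytes.hex())  # ed959c
print(euc_bytes.hex())   # c7d1

# The same character takes three bytes in UTF-8 and two bytes in EUC.
print(len(utf8_bytes), len(euc_bytes))  # 3 2

# Both byte sequences decode back to the same character.
print(utf8_bytes.decode("utf-8") == euc_bytes.decode("euc_kr"))  # True
```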

Unicode

The international character set and encoding developed by the Unicode Consortium.

Wide Character Code (WC)

A constant-width four-byte code, called WC in Asian Solaris documentation, for the internal representation of EUC codes using the new ANSI C data type wchar_t. Although EUC does not specify limits on the size of characters in the supplementary code sets (code set 0 is always one byte), WC specifies every character as four bytes. Standardizing on four bytes uses more memory than necessary if the environment is primarily ASCII, but it reduces processing time for strings of mixed characters: counting from zero, character 1000 always begins at byte 4000 (and character 0 begins at byte 0). This is useful for any type of indexing in applications.

X/Open

X/Open started as a consortium of international UNIX vendors from Europe, the USA, and Asia. It is now one of the major standards organizations, like POSIX and ANSI, and is the source of the X/Open System Interface Portability Guide.