The following code sets are based on definitions set by the International Organization for Standardization (ISO).
ISO646-IRV
ISO8859-1
ISO8859-x
eucJP
eucTW
eucKR
The ISO646-IRV code set is a 7-bit encoding used for information processing. Its character set is derived from the ASCII characters.
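As a rough illustration of the 7-bit property, the sketch below checks whether every byte of an encoded string fits in the range 0x00 through 0x7F. It assumes Python's built-in `ascii` codec as a stand-in for ISO646-IRV; the helper name `is_seven_bit` is hypothetical and not part of any library.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (0x00-0x7F)."""
    return all(b < 0x80 for b in data)


# Assumption: the "ascii" codec stands in for ISO646-IRV here.
plain = "Hello, world".encode("ascii")

# A Latin-1 string with an accented letter needs the eighth bit.
accented = "café".encode("iso8859-1")

print(is_seven_bit(plain))     # True
print(is_seven_bit(accented))  # False
```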
ISO8859-1 is a single-byte encoding that is based on, and compatible with, other ISO, American National Standards Institute (ANSI), and European Computer Manufacturers Association (ECMA) code extension techniques. ISO8859 defines a family of code sets, each member containing its own unique character set. The 7-bit ASCII code set is a proper subset of each of the code sets in the ISO8859 family.
The ISO8859-1 code set is called the ISO Latin-1 code set and consists of two character sets:
ISO646-IRV Graphic Left, 7-bit ASCII character set
ISO8859-1 Graphic Right (Latin) character set
These character sets combined include the characters necessary for Western European languages such as Danish, Dutch, English, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish.
While the ASCII code set defines an order for the English alphabet, the Graphic Right (GR) characters are not ordered according to any specific language. The language-specific ordering is defined by the locale.
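The two halves of ISO8859-1, and the effect of sorting by raw code value, can be sketched as follows. This is a minimal example that assumes Python's `iso8859-1` codec; `latin1_halves` is a hypothetical helper, and the word list is made up for illustration.

```python
def latin1_halves(text: str):
    """Split Latin-1 bytes into Graphic Left (ASCII) and Graphic Right."""
    data = text.encode("iso8859-1")
    gl = bytes(b for b in data if b < 0x80)    # 7-bit ASCII half
    gr = bytes(b for b in data if b >= 0xA0)   # Graphic Right half
    return gl, gr


gl, gr = latin1_halves("Zoë")

# Sorting by code value places every GR character after all ASCII
# letters, which is not the order any particular language expects.
words = ["zebra", "ähnlich", "apple"]
by_code = sorted(words, key=lambda w: w.encode("iso8859-1"))
print(by_code)  # ['apple', 'zebra', 'ähnlich']
```

A language-aware order would come from the locale's collation rules rather than from the byte values used here.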
This section lists the other significant ISO8859 code sets. Each code set includes the ASCII character set plus its own unique characters.
ISO8859-2, Latin alphabet No. 2 (Eastern Europe): Albanian, Czech, English, German, Hungarian, Polish, Romanian, Serbo-Croatian, Slovak, Slovene
ISO8859-5, Latin/Cyrillic alphabet: Bulgarian, Belarusian, English, Macedonian, Russian, Ukrainian
ISO8859-6, Latin/Arabic alphabet: English, Arabic
ISO8859-7, Latin/Greek alphabet: English, Greek
ISO8859-8, Latin/Hebrew alphabet: English, Hebrew
ISO8859-9, Latin/Turkish alphabet: Danish, Dutch, English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, Turkish
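To make the "ASCII plus its own unique characters" idea concrete, the sketch below decodes the same byte value under several members of the family. It assumes the Python codecs named `iso8859-1`, `iso8859-2`, `iso8859-5`, `iso8859-7`, and `iso8859-9`; the helper `decode_in_each` and the chosen byte values are illustrative only.

```python
CODE_SETS = ["iso8859-1", "iso8859-2", "iso8859-5", "iso8859-7", "iso8859-9"]


def decode_in_each(byte_value: int) -> dict:
    """Decode one byte under each code set and return the results by name."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in CODE_SETS}


# A byte in the ASCII range decodes to the same character everywhere.
ascii_view = decode_in_each(0x41)

# A byte in the upper half depends on which family member is in use.
gr_view = decode_in_each(0xE6)

for name in CODE_SETS:
    print(name, ascii_view[name], gr_view[name])
```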
The EUC for Japanese consists of single-byte and multibyte (2-byte and 3-byte) characters. The encoding conforms to ISO2022 and is based on JIS and EUC definitions; see Table 3-2.
Table 3-2 Encoding for eucJP
CS | Encoding | | Character Set |
---|---|---|---|
cs0 | 0xxxxxxx | | ASCII |
cs1 | 1xxxxxxx | 1xxxxxxx | JIS X0208-1990 |
cs2 | 0x8E | 1xxxxxxx | JIS X0201-1976 |
cs3 | 0x8F | 1xxxxxxx 1xxxxxxx | JIS X0212-1990 |
JIS X0208-1990: A code of the Japanese graphic character set for information interchange (1990 version) that contains 147 special characters, 10 numeric digits, 83 Hiragana characters, 86 Katakana characters, 52 Latin characters, 48 Greek characters, 66 Cyrillic characters, 32 line-drawing elements, and 6355 Kanji characters.
JIS X0201-1976: A code for information interchange that contains 63 Katakana characters.
JIS X0212-1990: A code of the supplementary Japanese graphic character set for information interchange (1990 version) that contains 21 additional special characters, 21 additional Greek characters, 26 additional Cyrillic characters, 27 additional Latin characters, 171 Latin characters with diacritical marks, and 5801 additional Kanji characters.
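The lead-byte rules in Table 3-2 can be sketched as a small splitter that walks a byte string and labels each character with its code set. This is an illustrative sketch, not a validating decoder: `eucjp_split` is a hypothetical name, trailing bytes are not range-checked, and Python's `euc_jp` codec is assumed only to produce the sample bytes.

```python
def eucjp_split(data: bytes):
    """Split EUC-JP bytes into (code set, character bytes) pairs."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cs, size = "cs0", 1      # ASCII, one byte
        elif lead == 0x8E:
            cs, size = "cs2", 2      # 0x8E + one byte (JIS X0201 Katakana)
        elif lead == 0x8F:
            cs, size = "cs3", 3      # 0x8F + two bytes (JIS X0212)
        else:
            cs, size = "cs1", 2      # two bytes (JIS X0208)
        if i + size > len(data):
            raise ValueError("truncated character at offset %d" % i)
        out.append((cs, data[i:i + size]))
        i += size
    return out


# One ASCII letter, one Kanji, and one half-width Katakana character.
sample = "A日ｱ".encode("euc_jp")
for cs, raw in eucjp_split(sample):
    print(cs, raw.hex())
```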
The EUC for Traditional Chinese is an encoding that consists of single-byte and multibyte (2-byte and 4-byte) characters. The encoding conforms to ISO2022 and is based on the Chinese National Standard (CNS), as defined by the Republic of China, and on the EUC definition; see Table 3-3.
Table 3-3 Encoding for eucTW
CS | Encoding | | | Character Set |
---|---|---|---|---|
cs0 | 0xxxxxxx | | | ASCII |
cs1 | 1xxxxxxx | 1xxxxxxx | | CNS 11643-1992 - plane 1 |
cs2 | 0x8EA2 | 1xxxxxxx | 1xxxxxxx | CNS 11643-1992 - plane 2 |
cs3 | 0x8EA3 | 1xxxxxxx | 1xxxxxxx | CNS 11643-1992 - plane 3 |
 | 0x8EB0 | 1xxxxxxx | 1xxxxxxx | CNS 11643-1992 - plane 16 |
CNS 11643-1992 defines 16 planes for the Chinese Standard Interchange Code. Each plane can hold up to 8836 characters (94 x 94). Currently, only planes 1 through 7 have characters assigned. Table 3-4 shows the 16 planes of the CNS 11643-1992 standard.
Table 3-4 16 Planes of the CNS 11643-1992 Standard
Plane | Definition | Number of Characters | EUC Encoding |
---|---|---|---|
1 | Most frequently used | 6085 | A1A1 - FDCB |
2 | Secondary frequently used | 7650 | 8EA2 A1A1 - 8EA2 F2C4 |
3 | Exec. Yuen EDP [1] center | 6148 | 8EA3 A1A1 - 8EA3 E2C6 |
4 | RIS [2], vendor defined | 7298 | 8EA4 A1A1 - 8EA4 EEDC |
5 | Rarely used by MOE [3] | 8603 | 8EA5 A1A1 - 8EA5 FCD1 |
6 | Variation char set 1 by MOE | 6388 | 8EA6 A1A1 - 8EA6 E4FA |
7 | Variation char set 2 by MOE | 6539 | 8EA7 A1A1 - 8EA7 E6D5 |
8 | Undefined | 0 | 8EA8 A1A1 - 8EA8 FEFE |
9 | Undefined | 0 | 8EA9 A1A1 - 8EA9 FEFE |
10 | Undefined | 0 | 8EAA A1A1 - 8EAA FEFE |
11 | Undefined | 0 | 8EAB A1A1 - 8EAB FEFE |
12 | User Defined Character (UDC) | 0 | 8EAC A1A1 - 8EAC FEFE |
13 | UDC | 0 | 8EAD A1A1 - 8EAD FEFE |
14 | UDC | 0 | 8EAE A1A1 - 8EAE FEFE |
15 | UDC | 0 | 8EAF A1A1 - 8EAF FEFE |
16 | UDC | 0 | 8EB0 A1A1 - 8EB0 FEFE |
1. EDP: Center of Directorate-General of Budget, Accounting, and Statistics
2. RIS: Residence Information System
3. MOE: Ministry of Education
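The relationship between plane numbers and EUC byte sequences in Tables 3-3 and 3-4 can be sketched as follows. The helpers `euctw_plane` and `euctw_range` are hypothetical names, and the sketch follows only the layout shown in the tables: two bytes for plane 1, and the prefix 0x8E plus 0xA0 + plane for planes 2 through 16.

```python
# Each plane is a 94 x 94 grid of code positions.
PLANE_CAPACITY = 94 * 94


def euctw_plane(char: bytes) -> int:
    """Return the CNS 11643 plane of one encoded multibyte character."""
    if len(char) == 2 and char[0] >= 0xA1:
        return 1
    if len(char) == 4 and char[0] == 0x8E and 0xA2 <= char[1] <= 0xB0:
        return char[1] - 0xA0
    raise ValueError("not a multibyte eucTW character: %r" % char)


def euctw_range(plane: int):
    """Return the (first, last) encodable byte sequences of a plane."""
    if not 1 <= plane <= 16:
        raise ValueError("plane must be in 1..16")
    prefix = b"" if plane == 1 else bytes([0x8E, 0xA0 + plane])
    return prefix + b"\xa1\xa1", prefix + b"\xfe\xfe"


print(PLANE_CAPACITY)                       # 8836
first, last = euctw_range(13)
print(first.hex(), last.hex())              # 8eada1a1 8eadfefe
print(euctw_plane(bytes.fromhex("8EA2A1A1")))  # 2
```

Note that `euctw_range` returns the full 94 x 94 span of a plane; the upper bounds listed for planes 1 through 7 in Table 3-4 are lower because they reflect the last assigned character.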
The EUC for Korean is an encoding that consists of single-byte and multibyte characters, as shown in Table 3-5. The encoding conforms to ISO2022 and is based on the Korean Standard Code (KSC) set and EUC definitions.
Table 3-5 Encoding for eucKR
CS | Encoding | | Character Set |
---|---|---|---|
cs0 | 0xxxxxxx | | ASCII |
cs1 | 1xxxxxxx | 1xxxxxxx | KS C 5601-1992 |
cs2 | | | Not used |
cs3 | | | Not used |
KSC 5601-1992 (code of the Korean character set for information interchange, 1992 version) contains 432 special characters, 30 Arabic and Roman numeral characters, 94 Hangul alphabet characters, 52 Roman characters, 48 Greek characters, 27 Latin characters, 169 Japanese characters, 66 Russian characters, 68 line-drawing elements, 2344 precomposed Hangul characters, and 4888 Hanja characters.
One Hangul character can be composed of several consonants and vowels. Most Hangul words can also be written in Hanja. Hanja is a set of Traditional Chinese characters that is still used in Korean. Each Hanja character has its own meaning and is therefore usually more specific than its Hangul equivalent.
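The byte widths in Table 3-5 can be observed with a short sketch. It assumes Python's built-in `euc_kr` codec as an implementation of this encoding; `euckr_widths` is a hypothetical helper, and the sample characters are chosen only for illustration.

```python
def euckr_widths(text: str):
    """Return the encoded length in bytes of each character in text."""
    return [len(ch.encode("euc_kr")) for ch in text]


# One ASCII letter, two precomposed Hangul characters, one Hanja character.
widths = euckr_widths("A한글韓")
print(widths)  # [1, 2, 2, 2]

# Both bytes of a cs1 character have the high bit set.
encoded = "한".encode("euc_kr")
print(all(b & 0x80 for b in encoded))  # True
```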