Understanding Character Sets

This section discusses character sets.

Before you install your PeopleSoft system, Oracle recommends that you choose an appropriate character set for PeopleSoft client workstations, web servers, application servers, and database servers, as well as for file attachment storage locations (that is, FTP sites, HTTP repositories, and database tables).

A character set, also known as a code page, is an ordered set of characters in which each character is mapped to a numeric index, called a codepoint. Computer systems store character data as these codepoints. Many hundreds of character sets exist. Some are international standards, maintained by the International Organization for Standardization (ISO), some are country-specific standards, and others are specific to a particular computer system vendor. Given the number of separate computers that are involved in a typical PeopleSoft installation, it is likely that your system uses several different character sets.
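The mapping between characters and codepoints can be illustrated with a short Python sketch. Python strings use Unicode codepoints, so the numbers shown are Unicode values rather than those of any particular legacy code page. The helper name codepoint_label is an assumption made for this illustration.

```python
def codepoint_label(ch):
    """Return the Unicode codepoint of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ("a", "\u00f1", "\u20ac"):  # a, n with tilde, euro symbol
    # ord() maps a character to its numeric codepoint; chr() maps it back.
    assert chr(ord(ch)) == ch
    print(ord(ch), codepoint_label(ch))
```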

Although there is general agreement on the content and arrangement of most character sets, especially those that are maintained by the ISO, many different names are used by vendors and software packages for similar or identical character sets. US-ASCII encodes the basic characters and symbols that are needed to write the English language. However, US-ASCII is limited to 128 characters and cannot represent many characters that are needed by Western European languages, such as French and German, let alone ideographic languages, such as Japanese and Chinese, in which each character represents a word or concept. Many character sets, however, include all US-ASCII characters in addition to their other characters.

The following table illustrates just a few common character sets that you are likely to encounter and some of the names that are used by different vendors to refer to them:

Character Set	Description and Comments	Type	PeopleSoft and SQR Name	Oracle DBMS Name	Microsoft Windows Name
ISO 8859-1	Western European Latin-1. Contains characters that are required to represent Western European languages. However, does not include the euro symbol, the trademark (TM) symbol, or the oe ligature.	ISO	LATIN1 or ISO_8859-1	WE8ISO8859P1	CP28591
Microsoft Code Page 1252	Microsoft Code Page 1252 - Western European. Very similar to ISO 8859-1, except for the inclusion of additional characters. Includes the euro symbol, trademark (TM) symbol, and oe ligature, but using a different codepoint than ISO 8859-15.	Vendor (Microsoft)	CP1252	WE8MSWIN1252	CP1252
ISO 8859-2	Central/Eastern European Latin-2. Contains characters that are required for Central European languages, including Czech, Hungarian, and Polish. Does not include the euro symbol.	ISO	LATIN2 or ISO_8859-2	EE8ISO8859P2	CP28592
ISO 8859-15	Western European extended Latin-9. Similar to ISO 8859-1, but contains the euro symbol, oe ligature, and several characters that are required for French and Finnish.	ISO	LATIN9 or ISO_8859-15	WE8ISO8859P15	CP28605
Shift-JIS	Most common Japanese character set. Defines thousands of characters for writing Japanese.	Country (Japan)	SJIS	JA16SJIS or JA16SJISTILDE	CP932
IBM CCSID 37	IBM Coded Character Set ID 37. Western European Multilingual EBCDIC-based character set.	Vendor (IBM)	EBCDIC	WE8EBCDIC37	CP1140
GB18030	Chinese national character set	Country (China)	GB18030	GB18030	GB18030
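The differences among the Western European character sets in the table can be checked with a small Python sketch. It uses Python's own codec names (iso8859_1, cp1252, iso8859_15), which differ from the PeopleSoft, Oracle, and Windows names listed above. The helper encode_or_none is an assumption made for this illustration.

```python
EURO = "\u20ac"


def encode_or_none(text, charset):
    """Return the encoded bytes, or None if the charset cannot represent the text."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


for charset in ("iso8859_1", "cp1252", "iso8859_15"):
    encoded = encode_or_none(EURO, charset)
    # The euro symbol is absent from ISO 8859-1 and sits at different
    # codepoints in CP1252 and ISO 8859-15.
    print(charset, encoded.hex() if encoded is not None else "not representable")
```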

Some of these character sets, such as ISO 8859-1 and IBM CCSID 37, require only one byte to represent each character. For example, in ISO 8859-1, the hexadecimal number 61 represents the lowercase Latin letter a. However, larger character sets, such as Shift-JIS, may require more than one byte to represent each character.
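As a rough illustration of single-byte versus multi-byte storage, the following Python sketch measures how many bytes a character occupies in a few character sets. The codec names are Python's (cp037 is Python's codec for EBCDIC code page 37), and byte_length is a helper name assumed for this example.

```python
def byte_length(text, charset):
    """Return the number of bytes needed to store text in the given charset."""
    return len(text.encode(charset))


# The lowercase Latin letter a is hexadecimal 61 in ISO 8859-1.
print("a".encode("iso8859_1").hex())

# One byte per character in the single-byte character sets.
print(byte_length("a", "iso8859_1"), byte_length("a", "cp037"))

# In Shift-JIS, a Latin letter takes one byte but a kanji takes two.
print(byte_length("a", "shift_jis"), byte_length("\u65e5", "shift_jis"))
```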

The most important consideration when dealing with character sets across a system is ensuring that all characters that you plan to represent within the PeopleSoft system exist in the character set that is used by each component of the system.

For example, if you plan to maintain Japanese characters in employee names, you must ensure that:

The character set that is used by the database system includes Japanese characters.
Each external system feeding into or out of the PeopleSoft system expects data in a character set that includes Japanese characters.
Workstations and printers are installed with fonts that include those characters.

For example, the Japanese Shift-JIS character set contains Japanese and many US-ASCII characters; it is sufficient for encoding both English data and the primary characters that are required in Japanese. However, it does not include the accented Latin characters that are needed for French, German, and other languages, so it is not a suitable character set for implementations that encompass Western European countries.
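One way to test whether a candidate character set covers the data you plan to store is to try encoding sample values and collect the characters that fail. The following Python sketch does this with Python's codecs. The function name and the sample names are assumptions made for this illustration and are not part of any PeopleSoft tool.

```python
def unsupported_characters(text, charset):
    """Return the characters of text that the charset cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


name_ja = "\u5c71\u7530\u592a\u90ce"  # a Japanese name written in kanji
name_fr = "H\u00e9l\u00e8ne"          # a French name with accented letters

for charset in ("shift_jis", "cp1252", "utf_8"):
    # Print only counts so that the output stays plain ASCII.
    print(charset,
          len(unsupported_characters(name_ja, charset)),
          len(unsupported_characters(name_fr, charset)))
```

In this sketch, Shift-JIS covers the Japanese name but not the accented Latin letters, CP1252 shows the reverse, and UTF-8 covers both.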

The Unicode Standard

Given the sample list of common character sets in the previous table and the number of languages that are required by a typical global PeopleSoft implementation, selecting a character set can be daunting, especially when you are planning to support a large list of languages.

To simplify this situation, an industry consortium of vendors devised a universal character standard: the Unicode standard. This internationally recognized character standard represents every character that is required to write virtually every written language. The Unicode standard was developed and is maintained by the Unicode Consortium in conjunction with ISO. This standard shares the character repertoire with ISO/IEC standard 10646: the Universal Multiple-Octet Coded Character Set (UCS), also known as the Universal Character Set for short.

The PeopleSoft system uses Unicode throughout PeopleTools to simplify character handling. The PeopleSoft system allows the use of Unicode within PeopleSoft databases to enable you to maintain a single database with characters from virtually any language.

The Unicode standard and the ISO 10646 standard are available from their respective organizations.

See The Unicode Consortium: www.unicode.org.

See International Organization for Standardization (ISO): www.iso.org.

Unicode Encodings

Unicode defines a code space of more than one million code points (or character positions). Unicode code points are referred to by writing “U+” followed by the hexadecimal number, for example, U+0000, U+0061, U+FFFF, and U+27741. To manage such a large repertoire of characters, Unicode defines multiple planes, each comprising 65,536 code points. Plane 0, covering the range of U+0000 to U+FFFF, is known as the Basic Multilingual Plane (BMP) and is generally sufficient for almost all modern languages. The other planes are intended to encode extended ideographic characters, archaic scripts, and other rarely used characters (such as advanced mathematical symbols). All characters from planes outside the BMP are known as supplementary characters.
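The plane of a code point can be derived arithmetically, as in the following Python sketch. Because plane 0 covers U+0000 to U+FFFF, the plane number is the code point shifted right by 16 bits. The function names are assumptions made for this illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode code space


def plane_of(code_point):
    """Return the Unicode plane number (0 is the BMP) for a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16


def is_supplementary(code_point):
    """Return True if the code point lies outside the BMP."""
    return plane_of(code_point) > 0


for cp in (0x0000, 0x0061, 0xFFFF, 0x27741):
    print(f"U+{cp:04X}", plane_of(cp), is_supplementary(cp))
```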

PeopleTools fully supports the use of characters from the BMP only; supplementary character support is limited to display in the browser, storage in the database, and reporting output in BI Publisher. A tool that can be used to view Unicode character properties is http://www.unicode.org/charts/unihan.html. If a character is at codepoint U+10000 or higher, it is a supplementary character.

Several different Unicode encoding forms have been standardized, based on two encoding methods: the Unicode Transformation Format (UTF) and UCS. An encoding maps the Unicode code points (or perhaps a subset of code points) to sequences of code values of some fixed size. In UTF encodings, the number in the encoding name indicates the number of bits per code value; in UCS encodings, the number indicates the number of bytes per code value.

Four common encodings of Unicode are widely used:

UTF-32 — a 32-bit, fixed-width encoding; equivalent to UCS-4.
UTF-16 — a 16-bit, variable-width encoding.
UTF-8 — an 8-bit, variable-width encoding, which maximizes compatibility with ASCII.
UCS-2 — a 2-byte, fixed-width encoding; a subset of UTF-16 supporting characters in the BMP only.

Other Unicode encodings, such as CESU-8, Java’s Modified UTF-8, and UTF-1, have specific, and sometimes internal, applications and are not widely used for the interchange of information.

While the PeopleSoft system currently supports only the UTF-8 and UCS-2 encodings, the following table presents a brief comparison of all four common encodings:

Encoding	Description	Min. Bytes per Char.	Max. Bytes per Char.	PeopleSoft System Usage
UTF-32	The full, 32-bit (four-byte) encoding of Unicode. Each Unicode character is represented by a four-byte number. For example, the Latin small letter a character is represented in UTF-32 hexadecimal as 0x00000061. UTF-32 was formerly called UCS-4.	4	4	None
UTF-16	An extension of UCS-2 in which the application references characters on planes other than the BMP by combining two 16-bit code units to designate a single, non-BMP character. UCS-2 is upward compatible with UTF-16 in that each UCS-2 character is also a valid character in UTF-16. However, UTF-16 allows characters outside the BMP to be referenced. These additional characters, known as supplementary characters, require two UTF-16 code units: a high surrogate followed by a low surrogate, together called a surrogate pair. When no supplementary characters are present, UTF-16 is identical to UCS-2.	2	4	None
UTF-8	A transformation of Unicode that encodes each character as one to four bytes, depending on which character is being encoded. All US-ASCII characters are encoded in UTF-8 as one byte, and this byte is identical to the encoding in US-ASCII. UTF-8 data is therefore backward compatible with US-ASCII data. All characters in the BMP are encoded as one, two, or three bytes in UTF-8. Characters in other planes are encoded as four bytes in UTF-8. UTF-8 has three main advantages: it is fully US-ASCII compatible, US-ASCII data can be considered as UTF-8 data, and it does not require that all characters use two or more bytes of storage.	1	4	The PeopleSoft system uses UTF-8 for serving HTML pages in the PeopleSoft Pure Internet Architecture and for inbound and outbound XML.
UCS-2	A 2-byte (16-bit) representation of each Unicode character. As such, it can reference only 65,536 code points and is limited to characters in the BMP.	2	2	The PeopleSoft system uses UCS-2 in memory for the Microsoft Windows development tools and for the application server.
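The surrogate pair mentioned for UTF-16 can be computed with simple arithmetic. The following Python sketch applies the standard UTF-16 calculation: subtract 0x10000 from the code point, then place the upper 10 bits in the high surrogate and the lower 10 bits in the low surrogate. The function name to_surrogate_pair is an assumption made for this illustration.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x27741)
print(f"{high:04X} {low:04X}")

# Python's own UTF-16 encoder produces the same two code units.
print(chr(0x27741).encode("utf-16-be").hex().upper())
```

A system that stores text as fixed two-byte units sees such a character as two code units, which is worth keeping in mind when counting string lengths.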

Unicode Encoding Examples

This section includes Unicode encoding examples for the following characters:

Character	Unicode Code Point	Description
a	U+0061	Latin small letter a.
ñ	U+00F1	Latin small letter ñ.
€	U+20AC	Euro symbol

The following table shows the hexadecimal representation of these characters in each of the four Unicode encodings:

Unicode Encoding	Latin Small Letter a (U+0061)	Latin Small Letter ñ (U+00F1)	€ Symbol (U+20AC)
UTF-32	0x00000061	0x000000F1	0x000020AC
UTF-16	0x0061	0x00F1	0x20AC
UTF-8	0x61	0xC3B1	0xE282AC
UCS-2	0x0061	0x00F1	0x20AC
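The values in this table can be reproduced with Python's built-in codecs, as in the following sketch. The big-endian codec variants are used so that the bytes appear in the same order as the table and no byte-order mark is added. Python has no separate UCS-2 codec; for BMP characters such as these three, UCS-2 is identical to UTF-16. The names CODECS and hex_forms are assumptions made for this illustration.

```python
CODECS = {"UTF-32": "utf-32-be", "UTF-16": "utf-16-be", "UTF-8": "utf-8"}


def hex_forms(ch):
    """Return the hexadecimal form of one character in each encoding."""
    return {
        label: "0x" + ch.encode(codec).hex().upper()
        for label, codec in CODECS.items()
    }


for ch in ("a", "\u00f1", "\u20ac"):  # a, n with tilde, euro symbol
    print(f"U+{ord(ch):04X}", hex_forms(ch))
```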

Non-Unicode Character Sets

Although much of the PeopleSoft system runs by using Unicode, you can configure several components with a non-Unicode character set. When making these choices, you should understand the types of character sets other than Unicode that exist.

This section discusses:

Single-byte character sets (SBCSs).
Double-byte character sets (DBCSs).

Note: For the sake of terminology, some systems, such as Microsoft Windows, refer to two types of character sets: Unicode and ANSI. ANSI, in this context, refers to the American National Standards Institute, which maintains equivalent standards for many national and international standard character sets. Informally, ANSI character sets refer to non-Unicode character sets, which can be any international, national, or vendor standard character set, such as those that are discussed at the beginning of this topic.

Single-Byte Character Sets

Most character sets use one byte to represent each character and are therefore known as SBCSs. These character sets are relatively simple and can represent up to 256 unique characters. Examples of SBCSs are ISO 8859-1 (Latin1), ISO 8859-2 (Latin2), Microsoft CP1252 (similar to Latin1, but vendor specific), and IBM CCSID 37.

Double-Byte Character Sets

DBCSs use one or two bytes to represent each character and are typically used for writing ideographic scripts, such as Japanese, Chinese, and Korean. Most DBCSs allow a mix of one-byte and two-byte characters, so you cannot assume that a string has an even byte length. Encoding with a mix of one- and two-byte characters is also known as variable-width encoding, and such a character set is sometimes referred to as a multi-byte character set (MBCS).

The PeopleSoft system supports two types of DBCSs:

Nonshifting
Shifting

The difference between these types of DBCSs is in the way in which the system determines whether a particular byte represents one character or is part of a two-byte character.

DBCS Type	Description
Nonshifting DBCSs	Nonshifting DBCSs use ranges of codepoints, specified by the character set definition, to determine whether a particular byte represents one character or is part of a two-byte character. In nonshifting DBCSs, the two bytes that are used to form a character are called lead bytes and trail bytes. The lead byte is the first in a two-byte character, and the trail byte is the last. Nonshifting DBCSs differentiate single-byte characters from double-byte characters by the numerical value of the lead byte. For example, in the Japanese Shift-JIS encoding, if a byte is in the range 0x81-0x9F or 0xE0-0xFC, then it is a lead byte and must be paired with the following byte to form a complete character. The most popular client-side Japanese code page, Shift-JIS, uses this lead byte/trail byte encoding scheme, as do most Microsoft Windows and Unix/Linux ASCII-based double-byte character sets that represent Chinese, Japanese, and Korean characters. Contrary to its name, Shift-JIS is a nonshifting double-byte character set.
Shifting DBCSs	A shifting DBCS is another double-byte encoding scheme in use that does not use the lead byte and trail byte concept. The IBM DB2 UDB for OS/390 and z/OS EBCDIC-based Japanese, Chinese, and Korean character sets use this shifting encoding scheme. Instead of reserving a range of bytes as lead bytes, shifting DBCSs always keep track of whether they are in double-byte or single-byte mode. In double-byte mode, every two bytes form a character. In single-byte mode, every byte is a character in itself. To track what mode the character set is in, the system uses shifting characters. By default, the data is expected to be single-byte. As soon as a double-byte character needs to be represented, a shift-in byte is added to the string. From this point on, all characters are expected to be two bytes. This continues until a shift-out byte is detected, which indicates that the character set should go back to one byte per character. This scheme, while more complex than the lead byte and trail byte scheme, provides greater performance, because the system always knows how many bytes should be in any particular character. Unfortunately, it also increases the length of the string. For example, a character string that comprises a mixture of single-byte and double-byte characters could require more space to store in a shifting character set because you need to include the shift-in and shift-out bytes, as well as the data itself. Important! In the PeopleSoft system, shifting DBCSs have limited usage, such as for file I/O, and are not supported for use as a database character set.
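The two schemes can be contrasted with a short Python sketch. The first part walks a Shift-JIS byte string using the lead-byte ranges given above. The second part compares the stored length of the same text in Shift-JIS and in ISO-2022-JP. ISO-2022-JP is used here only because Python ships a codec for it; it switches modes with multi-byte escape sequences rather than with the single shift bytes of the EBCDIC-based sets, so it illustrates the general mode-switching overhead rather than the exact IBM byte layout. The function names are assumptions made for this illustration.

```python
def is_sjis_lead_byte(byte):
    """Return True if the byte value starts a two-byte Shift-JIS character."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def split_shift_jis(data):
    """Split Shift-JIS bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_sjis_lead_byte(data[i]) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


text = "a\u3042b"  # Latin a, hiragana a, Latin b

# Nonshifting: the value of each byte tells the reader how wide the character is.
sjis = text.encode("shift_jis")
print([chunk.hex() for chunk in split_shift_jis(sjis)])

# Mode switching: extra bytes mark every change between one-byte and two-byte mode.
iso2022 = text.encode("iso2022_jp")
print(len(sjis), len(iso2022))
```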

Character Sets Across the Tiers of the PeopleSoft Architecture

PeopleSoft installations include multiple components, each of which must be able to handle differing character sets.

PeopleSoft application servers and clients (for example, the PeopleTools development environment and PeopleSoft Pure Internet Architecture pages) use Unicode exclusively and do not rely on other character sets to represent and process data. However, depending on your environment, not all system components support Unicode-encoded data. Therefore, you might not be able to run all parts of your system in Unicode. For example, some database platforms and third-party products do not support Unicode. The following table illustrates support for Unicode in the PeopleSoft system.

Tier	Component	Unicode Support
Client	PeopleTools development environment	Yes
Client	PeopleSoft Pure Internet Architecture pages	Yes
Web server	Web server	Yes
Application server	Application server	Yes
Database	Non-Unicode DB (Western European or Japanese)	No
Database	Unicode DB	Yes
File attachment storage	FTP server	Yes
File attachment storage	HTTP repository	Yes

Examples of how to configure these tiers are provided in this topic.

See Understanding Character Set Selection.

In addition to the tiers listed in the previous table, PeopleTools enables you to configure these system components to use other character sets:

COBOL.
The character set that is used for PeopleSoft COBOL processing must match the character set of the database. If you created a Unicode database for the PeopleSoft implementation, you must also run COBOL in Unicode.

Direct file I/O operations.
All direct file I/O operations in PeopleTools, including file layout objects, trace and log files, and file operations from Structured Query Report (SQR) programs, can be performed in Unicode or any supported non-Unicode character set. This is useful in situations in which you must interface with an external system that does not support Unicode.
Some applications refer to an ANSI encoding, which is inherited from the operating system. For example, an application running on Windows inherits the active code page, and an application running on Unix inherits the encoding portion of the shell locale.

Third-party products that are not Unicode compliant.
Some third-party products that are supported by PeopleTools do not yet support Unicode. In this case, PeopleTools converts application data to a specific, non-Unicode character set before communicating with these tools. Check the product documentation for your third-party application regarding Unicode compliance before determining how the application and the PeopleSoft system will interoperate.
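To see which encoding a process inherits from its operating system, a Python sketch such as the following can be run on the host in question. It reports what the Python runtime observes and is not a PeopleTools utility, and the function name os_default_encoding is an assumption made for this illustration. On many current Unix and Linux systems the reported value is UTF-8 rather than a non-Unicode character set.

```python
import codecs
import locale


def os_default_encoding():
    """Return the encoding that this process inherits from the OS settings."""
    # On Windows this reflects the active code page; on Unix it reflects the
    # encoding portion of the locale (for example, the part after the dot
    # in a locale name such as en_US.ISO8859-1).
    name = locale.getpreferredencoding(False)
    try:
        # Normalize to Python's canonical codec name when it is recognized.
        return codecs.lookup(name).name
    except LookupError:
        return name


print(os_default_encoding())
```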

When Unicode is not used for any of these types of operations or data storage, the PeopleSoft system transparently handles the conversion from Unicode to a non-Unicode character set. The non-Unicode character set that is used depends on several settings, which are discussed in detail later in this topic.
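The kind of conversion described here, and the way it can lose data, can be sketched in Python. This is only an illustration of Unicode to non-Unicode conversion in general; it does not show how the PeopleSoft system performs the conversion internally. The function name to_legacy_charset and its lossy parameter are assumptions made for this example.

```python
def to_legacy_charset(text, charset, lossy=False):
    """Convert Unicode text to bytes in a non-Unicode character set.

    With lossy=False, a character missing from the target charset raises
    UnicodeEncodeError. With lossy=True, it is replaced by a question mark.
    """
    errors = "replace" if lossy else "strict"
    return text.encode(charset, errors=errors)


label = "Caf\u00e9 \u20ac5"  # contains an accented e and the euro symbol

# CP1252 contains both characters, so the data survives a round trip.
as_cp1252 = to_legacy_charset(label, "cp1252")
print(as_cp1252.decode("cp1252") == label)

# ISO 8859-1 has no euro symbol, so a lossy conversion changes the data.
as_latin1 = to_legacy_charset(label, "iso8859_1", lossy=True)
print(as_latin1.decode("iso8859_1") == label)
```

Choosing a target character set that contains every character in the data avoids this kind of silent substitution.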

PSCHARSETS Table

The character sets that the PeopleSoft system supports are defined in the PSCHARSETS table. The following table lists these character sets and the names by which they may be referred to in PeopleSoft applications. You may need to know the correct character set name to use in several situations including:

In PeopleCode programs for manipulating file layout objects.
In the Unix/Linux application server configuration to determine the default, non-Unicode character set for log files, trace files, and operating system interfaces.
When creating your database.

Refer to your hardware and software requirements guide for details about the character sets that are supported for your database platform.

PSCHARSETS Character Set Name	Description and Comments	Character Set Type
ANSI	Current ANSI-based code page. Not really a character set, but causes the system to use the default non-Unicode character set of the host operating system.	SBCS or DBCS, depending on the host operating system.
ASCII	7-bit US-ASCII	SBCS
Big5	Big5 (Traditional Chinese)	Nonshifting DBCS
CCSID1027	IBM EBCDIC 1027 (Japanese-Latin)	SBCS
CCSID1047	IBM EBCDIC 1047 (Latin1)	SBCS
CCSID290¹	IBM EBCDIC 290 (Katakana)	SBCS
CCSID300¹	IBM EBCDIC 300 (Kanji)	Nonshifting DBCS
CCSID930²	IBM EBCDIC 930 (Kana-Kanji)	Shifting DBCS
CCSID935²	IBM EBCDIC 935 (Simplified Chinese)	Shifting DBCS
CCSID937²	IBM EBCDIC 937 (Traditional Chinese)	Shifting DBCS
CCSID939²	IBM EBCDIC 939 (Latin-Kanji)	Shifting DBCS
CCSID942	IBM EBCDIC 942 (Japanese PC)	Nonshifting DBCS
CP1026	Windows 1026 (EBCDIC)	SBCS
CP1250	Windows 1250 (Eastern Europe)	SBCS
CP1251	Windows 1251 (Cyrillic)	SBCS
CP1252	Windows 1252 (Western Europe)	SBCS
CP1253	Windows 1253 (Greek)	SBCS
CP1254	Windows 1254 (Turkish)	SBCS
CP1255	Windows 1255 (Hebrew)	SBCS
CP1256	Windows 1256 (Arabic)	SBCS
CP1257	Windows 1257 (Baltic)	SBCS
CP1258	Windows 1258 (Vietnamese)	SBCS
CP1361	Windows 1361 (Korean Johab)	SBCS
CP437	MS-DOS 437 (U.S.)	SBCS
CP500	Windows 500 (EBCDIC 500V1)	SBCS
CP708	Windows 708 (Arabic - ASMO708)	SBCS
CP720	Windows 720 (Arabic - ASMO)	SBCS
CP737	Windows 737 (Greek - 437G)	SBCS
CP775	Windows 775 (Baltic)	SBCS
CP850	MS-DOS 850 (Western Europe)	SBCS
CP852	MS-DOS 852 (Eastern Europe)	SBCS
CP855	MS-DOS 855 (IBM Cyrillic)	SBCS
CP857	MS-DOS 857 (IBM Turkish)	SBCS
CP860	MS-DOS 860 (IBM Portuguese)	SBCS
CP861	MS-DOS 861 (Icelandic)	SBCS
CP862	MS-DOS 862 (Hebrew)	SBCS
CP863	MS-DOS 863 (Canadian French)	SBCS
CP864	MS-DOS 864 (Arabic)	SBCS
CP865	MS-DOS 865 (Nordic)	SBCS
CP866	MS-DOS 866 (Russian)	SBCS
CP869	MS-DOS 869 (Modern Greek)	SBCS
CP870	Windows 870	SBCS
CP874	Windows 874 (Thai)	SBCS
CP875	Windows 875 (EBCDIC)	SBCS
CP932	Windows 932 (Japanese)	Nonshifting DBCS
CP936	Windows 936 (Simplified Chinese)	Nonshifting DBCS
CP949	Windows 949 (Korean)	Nonshifting DBCS
CP950	Windows 950 (Traditional Chinese)	Nonshifting DBCS
EBCDIC	IBM EBCDIC CCSID37 (USA)	SBCS
EUC-JP	Extended UNIX code (Japanese)	Nonshifting DBCS
EUC-KR	Extended UNIX code (Korean)	Nonshifting DBCS
EUC-TW	Extended UNIX code (Taiwan)	Nonshifting DBCS
EUC-TW-1986	Extended UNIX code (TW-1986)	Nonshifting DBCS
GB12345	GB 12345 (Traditional Chinese)	Nonshifting DBCS
GB18030	GB18030 (Simplified Chinese)	Nonshifting DBCS
GB2312	GB 2312 (Simplified Chinese)	Nonshifting DBCS
HKSCS	Hong Kong Supplementary Character Set	Nonshifting DBCS
ISO-2022-JP² ³	ISO-2022-JP Japanese	Shifting DBCS
ISO-2022-KR²	ISO-2022-KR Korean	Shifting DBCS
ISO_8859-1	ISO 8859-1 (Latin1)	SBCS
ISO_8859-10	ISO 8859-10 (Latin6)	SBCS
ISO_8859-11	ISO 8859-11 (Thai)	SBCS
ISO_8859-14	ISO 8859-14 (Latin8)	SBCS
ISO_8859-15	ISO 8859-15 (Latin9/Latin0)	SBCS
ISO_8859-2	ISO 8859-2 (Latin2)	SBCS
ISO_8859-3	ISO 8859-3 (Latin3)	SBCS
ISO_8859-4	ISO 8859-4 (Latin4)	SBCS
ISO_8859-5	ISO 8859-5 (Cyrillic)	SBCS
ISO_8859-6	ISO 8859-6 (Arabic)	SBCS
ISO_8859-7	ISO 8859-7 (Greek)	SBCS
ISO_8859-8	ISO 8859-8 (Hebrew)	SBCS
ISO_8859-9	ISO 8859-9 (Latin5)	SBCS
JIS_X0201¹	Japanese Half-width Katakana	Nonshifting DBCS
JIS_X_0208	Japanese Kanji	Nonshifting DBCS
Java	Java (Unicode encoding)	Unicode
Johab	Johab (Korean)	Nonshifting DBCS
Shift_JIS	Shift-JIS (Japanese)	Nonshifting DBCS
UCS2	Unicode UCS-2	Unicode
UTF-8	Unicode UTF-8	Unicode
UTF7¹	Unicode UTF-7. (An outdated Unicode 7-bit clean transformation that is sometimes used for email that must pass through gateways that do not support 8-bit characters.)	Unicode
UTF8	Unicode UTF-8	Unicode
UTF8BOM	Unicode UTF-8 with BOM (byte-order mark)	Unicode

¹ Not commonly used.

² In the PeopleSoft system, shifting DBCSs have limited usage, such as for file I/O, and are not supported for use as a database character set.

³ To use certain Windows-31J (also known as Microsoft CP932) characters in incoming or outgoing email messages, you must complete additional configuration of your web server (incoming email) and application server or PeopleSoft Process Scheduler (outgoing email).

See Selecting Email Character Sets.
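When data that was written with one of these character set names must be read by an external program, the name usually has to be translated into whatever the other tool calls the same character set. The following Python sketch shows one hedged approach. Python recognizes some of the names directly; the EXTRA_ALIASES entries are assumptions made for this illustration (based on the descriptions in the table) and are not an official mapping.

```python
import codecs

# Hypothetical aliases for names that Python does not recognize on its own.
EXTRA_ALIASES = {
    "UTF8BOM": "utf_8_sig",  # UTF-8 with a byte-order mark
    "EBCDIC": "cp037",       # described in the table as IBM EBCDIC CCSID37
}


def python_codec_for(ps_name):
    """Return Python's codec name for a character set name, or None if unknown."""
    candidate = EXTRA_ALIASES.get(ps_name, ps_name)
    try:
        return codecs.lookup(candidate).name
    except LookupError:
        return None


for name in ("Shift_JIS", "CP1252", "ISO_8859-1", "UTF8BOM", "CCSID930"):
    print(name, python_codec_for(name))

# The UTF-8 byte-order mark is the three bytes EF BB BF at the start of the data.
print("x".encode("utf_8_sig").hex())
```

Names for which no codec is found return None, which signals that the conversion needs another tool.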

This PeopleBook also contains information about supported character set encodings for globalization when using SQR for PeopleSoft.

See SQR for PeopleSoft-Supported Character Set Encodings.

See the Certifications tab on My Oracle Support.
