Selecting and Configuring Character Sets

This chapter provides an overview of character sets and discusses how to:

Understanding Character Sets

This section discusses:

Character Sets

Before you install your PeopleSoft system, Oracle recommends that you choose an appropriate character set for PeopleSoft client workstations, web servers, application servers, and database servers, as well as for file attachment storage locations (that is, FTP sites, HTTP repositories, and database tables).

A character set, also known as a code page, is an ordered set of characters in which each character is mapped to a numeric index, called a codepoint. Computer systems store character data as sequences of these codepoints. Many hundreds of character sets exist. Some are international standards, maintained by the International Organization for Standardization (ISO), some are country-specific standards, and others are specific to a particular computer system vendor. Given the number of separate computers that are involved in a typical PeopleSoft installation, it is likely that your system uses several different character sets.

Common Character Sets

Although there is general agreement on the content and arrangement of most character sets, especially those that are maintained by the ISO, many different names are used by vendors and software packages for similar or identical character sets. US-ASCII encodes the basic characters and symbols that are needed to write the English language. However, US-ASCII is limited to 128 characters and cannot represent many characters that are needed by Western European languages, such as French and German, let alone ideographic languages, such as Japanese and Chinese, in which each character represents a word or concept. Many character sets, however, include all US-ASCII characters in addition to their other characters.

The following table illustrates just a few common character sets that you are likely to encounter and some of the names that are used by different vendors to refer to them:

| Character Set | Description and Comments | Type | PeopleSoft and SQR Name | Oracle DBMS Name | Microsoft Windows Name |
|---|---|---|---|---|---|
| ISO 8859-1 | Western European Latin-1. Contains characters that are required to represent Western European languages. However, does not include the euro symbol, the trademark (TM) symbol, or the oe ligature. | ISO | LATIN1 or ISO_8859-1 | WE8ISO8859P1 | CP28591 |
| Microsoft Code Page 1252 | Microsoft Code Page 1252 - Western European. Very similar to ISO 8859-1, except for the inclusion of additional characters. Includes the euro symbol, trademark (TM) symbol, and oe ligature, but using a different codepoint than ISO 8859-15. | Vendor (Microsoft) | CP1252 | WE8MSWIN1252 | CP1252 |
| ISO 8859-2 | Central/Eastern European Latin-2. Contains characters that are required for Central European languages, including Czech, Hungarian, and Polish. Does not include the euro symbol. | ISO | LATIN2 or ISO_8859-2 | EE8ISO8859P2 | CP28592 |
| ISO 8859-15 | Western European extended Latin-9. Similar to ISO 8859-1, but contains the euro symbol, oe ligature, and several characters that are required for Finnish. | ISO | LATIN9 or ISO_8859-15 | WE8ISO8859P15 | CP28605 |
| Shift-JIS | Most common Japanese character set. Defines thousands of characters for writing Japanese. | Country (Japan) | SJIS | JA16SJIS or JA16SJISTILDE | CP932 |
| IBM CCSID 37 | IBM Coded Character Set ID 37. Western European Multilingual EBCDIC-based character set. | Vendor (IBM) | EBCDIC | WE8EBCDIC37 | CP1140 |
| GB18030 | Chinese national character set | Country (China) | GB18030 | GB18030 | GB18030 |

Some of these character sets, such as ISO 8859-1 and IBM CCSID 37, require only one byte to represent each character. For example, in ISO 8859-1, the hexadecimal number 61 represents the lowercase Latin letter a. However, larger character sets, such as Shift-JIS, may require more than one byte to represent each character.
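The byte counts described above can be checked with a short Python sketch. This is an illustration only, not part of PeopleTools: the codec names (`iso8859_1`, `iso8859_15`, `cp1252`, `shift_jis`) are Python's own names for these character sets, and `encoded_hex` is a helper invented for this example.

```python
# Illustration only: Python's built-in codecs stand in for the character sets
# listed in the table. The helper name is hypothetical.

def encoded_hex(text, charset):
    """Return the bytes of `text` in `charset` as an uppercase hex string."""
    return text.encode(charset).hex().upper()

# Single-byte character set: the Latin letter a is the single byte 0x61.
latin1_a = encoded_hex("a", "iso8859_1")

# Shift-JIS: the hiragana letter a (U+3042) needs two bytes.
sjis_hiragana_a = encoded_hex("\u3042", "shift_jis")

# The euro symbol exists in both CP1252 and ISO 8859-15, at different codepoints.
euro_cp1252 = encoded_hex("\u20ac", "cp1252")
euro_latin9 = encoded_hex("\u20ac", "iso8859_15")

print(latin1_a, sjis_hiragana_a, euro_cp1252, euro_latin9)
```

The last two values show why identifying a character set by its exact name matters: the same symbol is stored as a different byte depending on which set is in use.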

The Unicode Standard

The most important consideration when dealing with character sets across a system is ensuring that all characters that you plan to represent within the PeopleSoft system exist in the character set that is used by each component of the system.

For example, if you plan to maintain Japanese characters in employee names, you must ensure that:

For example, the Japanese Shift-JIS character set contains Japanese and many US-ASCII characters; it is sufficient for encoding both English data and the primary characters that are required in Japanese. However, it does not include the accented Latin characters that are needed for French, German, and other languages, so it is not a suitable character set for implementations that encompass Western European countries.
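One way to test whether a candidate character set covers the data you plan to store is to try encoding sample text and collect the characters that fail. The following is a minimal Python sketch under that assumption; the function name and the sample name are invented for illustration, and Python's `shift_jis` codec is used as a stand-in for the Shift-JIS character set discussed above.

```python
def unsupported_characters(text, charset):
    """Return the distinct characters in `text` that `charset` cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

# Hypothetical sample: an accented Latin given name with a Japanese family name.
sample = "Ren\u00e9e \u5c71\u7530"

print(unsupported_characters(sample, "shift_jis"))   # accented Latin letter is missing
print(unsupported_characters(sample, "iso8859_1"))   # Japanese characters are missing
print(unsupported_characters(sample, "utf-8"))       # nothing is missing
```

Running a check like this against representative data for every planned language gives an early indication of whether a single non-Unicode character set is sufficient.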

Given the sample list of common character sets in the previous table and the number of languages that are required by a typical global PeopleSoft implementation, selecting a character set can be daunting, especially when you are planning to support a large list of languages.

To simplify this situation, an industry consortium of vendors devised a universal character standard: the Unicode standard. This internationally recognized character standard represents every character that is required to write virtually every written language. The Unicode standard was developed and is maintained by the Unicode Consortium in conjunction with ISO. This standard shares the character repertoire with ISO/IEC standard 10646: the Universal Multiple-Octet Coded Character Set (UCS), also known as the Universal Character Set for short.

The PeopleSoft system uses Unicode throughout PeopleTools to simplify character handling. The PeopleSoft system allows the use of Unicode within PeopleSoft databases to enable you to maintain a single database with characters from virtually any language.

The Unicode standard and the ISO 10646 standard are available from their respective organizations.

See The Unicode Consortium: http://www.unicode.org

See International Organization for Standardization (ISO): http://www.iso.org

Unicode Encodings

Unicode defines a code space of more than one million code points (or characters). Unicode code points are referred to by writing “U+” followed by the hexadecimal number, for example, U+0000, U+0061, U+FFFF, U+27741, and so on. To manage such a large repertoire of characters, Unicode defines multiple planes, each comprising 65,536 code points, or character positions. Plane 0, covering the range of U+0000 to U+FFFF, is known as the Basic Multilingual Plane (BMP) and is generally sufficient for almost all modern languages. The other planes are intended to encode extended ideographic characters, archaic scripts, and other rarely used characters (such as advanced mathematical symbols). All characters from planes outside the BMP are known as supplementary characters.

PeopleTools fully supports the use of characters from the BMP only; supplementary character support is limited to display in the browser, storage in the database, and reporting output in BI Publisher. A tool that can be used to view Unicode character properties is http://www.unicode.org/charts/unihan.html. If a character is at codepoint U+10000 or higher, it is a supplementary character.
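The relationship between a code point, its plane, and the BMP can be expressed in a few lines of Python. This is a sketch for illustration; the function names are invented here and are not PeopleTools functions.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of the Unicode code space

def plane_of(code_point):
    """Return the plane number; each plane spans 0x10000 code points."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16

def is_supplementary(code_point):
    """A character is supplementary when it lies outside plane 0 (the BMP)."""
    return plane_of(code_point) > 0

for cp in (0x0061, 0xFFFF, 0x27741):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, supplementary={is_supplementary(cp)}")
```

A check like `is_supplementary` could be used to screen incoming data for characters that fall outside the range PeopleTools fully supports.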

Several different Unicode encoding forms have been standardized based on two encoding methodologies. Unicode defines these two encoding methods: the Unicode Transformation Format (UTF), and UCS. An encoding maps the Unicode code points (or perhaps a subset of code points) to sequences of code values of some fixed size. In UTF encodings, the number in the encoding name indicates the number of bits per code value; in UCS encodings, the number indicates the number of bytes per code value.

Four common encodings of Unicode are widely used:

Other Unicode encodings (such as CESU-8, Java’s Modified UTF-8, UTF-1, and others) have specific, and sometimes internal, applications and are not widely used for the interchange of information.

While the PeopleSoft system currently supports only the UTF-8 and UCS-2 encodings, the following table presents a brief comparison of all four common encodings:

| Encoding | Description | Min. Bytes per Char. | Max. Bytes per Char. | PeopleSoft System Usage |
|---|---|---|---|---|
| UTF-32 | The full, 32-bit (four-byte) encoding of Unicode. Each Unicode character is represented by a four-byte number. For example, the Latin small letter a character is represented in UTF-32 hexadecimal as 0x00000061. UTF-32 was formerly called UCS-4. | 4 | 4 | None |
| UTF-16 | An extension of UCS-2 in which the application references characters on planes other than the BMP by combining two UCS-2 code units to designate a single, non-BMP character. UCS-2 is upward compatible with UTF-16 in that each UCS-2 character is also a valid character in UTF-16. However, UTF-16 allows characters outside the BMP to be referenced. These additional characters, known as supplementary characters, require two UTF-16 code units: a high surrogate followed by a low surrogate, together called a surrogate pair. When no supplementary characters are present, UTF-16 is identical to UCS-2. | 2 | 4 | None |
| UTF-8 | A transformation of Unicode that encodes each character as one to four bytes, depending on which character is being encoded. All US-ASCII characters are encoded in UTF-8 as one byte, and this byte is identical to the encoding in US-ASCII. UTF-8 data is therefore backward compatible with US-ASCII data. All characters in the BMP are encoded as one, two, or three bytes in UTF-8. Characters in other planes are encoded as four bytes in UTF-8. UTF-8 has three main advantages: it is fully US-ASCII compatible, US-ASCII data can be considered as UTF-8 data, and it does not require that all characters use two or more bytes of storage. | 1 | 4 | The PeopleSoft system uses UTF-8 for serving HTML pages in the PeopleSoft Pure Internet Architecture and for inbound and outbound XML. |
| UCS-2 | A 2-byte (16-bit) representation of each Unicode character. As such, it can reference only 65,536 code points and is limited to characters in the BMP. | 2 | 2 | The PeopleSoft system uses UCS-2 in memory for the Microsoft Windows development tools and for the application server. |
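To make the UTF-16 surrogate pair mechanism concrete, the following Python sketch computes the two code units for a supplementary character using the arithmetic defined by the Unicode standard. The function name is invented for this example; the result is compared against Python's own `utf-16-be` codec as a sanity check.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary characters need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x27741)
print(f"U+27741 -> 0x{high:04X} 0x{low:04X}")

# Cross-check against the standard library's UTF-16 encoder.
expected = chr(0x27741).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Because UCS-2 has no surrogate mechanism, a component that treats these two code units as UCS-2 sees two separate 16-bit values rather than one character.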

Unicode Encoding Examples

This section includes Unicode encoding examples for the following characters:

| Character | Unicode Code Point | Description |
|---|---|---|
| a | U+0061 | Latin small letter a. |
| ñ | U+00F1 | Latin small letter ñ. |
| € | U+20AC | Euro symbol |

The following table shows the hexadecimal representation of these characters in each of the four Unicode encodings:

| Unicode Encoding | Latin Small Letter a (U+0061) | Latin Small Letter ñ (U+00F1) | Euro Symbol (U+20AC) |
|---|---|---|---|
| UTF-32 | 0x00000061 | 0x000000F1 | 0x000020AC |
| UTF-16 | 0x0061 | 0x00F1 | 0x20AC |
| UTF-8 | 0x61 | 0xC3B1 | 0xE282AC |
| UCS-2 | 0x0061 | 0x00F1 | 0x20AC |
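The hexadecimal values for these three characters can be reproduced with Python's standard codecs, as in the sketch below. Two assumptions apply: the big-endian codec variants (`utf-32-be`, `utf-16-be`) are used so that no byte-order mark is emitted, and UCS-2 is omitted because Python has no separate UCS-2 codec (for BMP characters its output matches UTF-16). The names `ENCODINGS` and `hex_forms` are invented for this example.

```python
# Map the encoding names used in this chapter to Python codec names.
ENCODINGS = {
    "UTF-32": "utf-32-be",
    "UTF-16": "utf-16-be",
    "UTF-8": "utf-8",
}

def hex_forms(ch):
    """Return the hexadecimal form of one character in each encoding."""
    return {
        name: "0x" + ch.encode(codec).hex().upper()
        for name, codec in ENCODINGS.items()
    }

for ch in ("a", "\u00f1", "\u20ac"):
    print(f"U+{ord(ch):04X}", hex_forms(ch))
```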

See Also

Selecting Email Character Sets

Non-Unicode Character Sets

Although much of the PeopleSoft system runs by using Unicode, you can configure several components with a non-Unicode character set. When making these choices, you should understand the types of character sets other than Unicode that exist.

This section discusses:

Note. For the sake of terminology, some systems, such as Microsoft Windows, refer to two types of character sets: Unicode and ANSI. ANSI, in this context, refers to the American National Standards Institute, which maintains equivalent standards for many national and international standard character sets. Informally, ANSI character sets refer to non-Unicode character sets, which can be any international, national, or vendor standard character set, such as those that are discussed at the beginning of this chapter.

Single-Byte Character Sets

Most character sets use one byte to represent each character and are therefore known as single-byte character sets (SBCSs). These character sets are relatively simple and can represent up to 256 unique characters. Examples of SBCSs are ISO 8859-1 (Latin1), ISO 8859-2 (Latin2), Microsoft CP1252 (similar to Latin1, but vendor specific), and IBM CCSID 37.

Double-Byte Character Sets

Double-byte character sets (DBCSs) use one or two bytes to represent each character and are typically used for writing ideographic scripts, such as Japanese, Chinese, and Korean. Most DBCSs allow a mix of one-byte and two-byte characters, so you cannot assume that a string has an even byte length. Encoding with a mix of one- and two-byte characters is also known as variable-width encoding, and such a character set is sometimes referred to as a multi-byte character set (MBCS).

The PeopleSoft system supports two types of DBCSs: nonshifting and shifting.

The difference between these types of DBCSs is in the way in which the system determines whether a particular byte represents one character or is part of a two-byte character.

Nonshifting DBCSs

Nonshifting DBCSs use ranges of codepoints, specified by the character set definition, to determine whether a particular byte represents one character or is part of a two-byte character.

In nonshifting DBCSs, the two bytes that are used to form a character are called lead bytes and trail bytes. The lead byte is the first in a two-byte character, and the trail byte is the last. Nonshifting DBCSs differentiate single-byte characters from double-byte characters by the numerical value of the lead byte. For example, in the Japanese Shift-JIS encoding, if a byte is in the range 0x81-0x9F or 0xE0-0xFC, then it is a lead byte and must be paired with the following byte to form a complete character.
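The lead byte test described above can be sketched in Python as follows. The function names are invented for illustration, and the sketch assumes well-formed Shift-JIS input; it uses only the two lead byte ranges given in this section and does not validate trail bytes.

```python
def is_sjis_lead_byte(value):
    """True if the byte starts a two-byte Shift-JIS character."""
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC

def count_sjis_characters(data):
    """Count characters in Shift-JIS bytes by stepping over lead/trail pairs."""
    count = 0
    index = 0
    while index < len(data):
        if is_sjis_lead_byte(data[index]):
            if index + 1 >= len(data):
                raise ValueError("string ends in the middle of a two-byte character")
            index += 2   # lead byte plus trail byte form one character
        else:
            index += 1   # single-byte character
        count += 1
    return count

sample = "PeopleSoft \u65e5\u672c\u8a9e".encode("shift_jis")
print(len(sample), "bytes,", count_sjis_characters(sample), "characters")
```

The difference between the byte count and the character count is the reason field lengths and string truncation need care when a nonshifting DBCS is in use.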

The most popular client-side Japanese code page, Shift-JIS, uses this lead byte/trail byte encoding scheme, as do most Microsoft Windows and Unix/Linux ASCII-based double-byte character sets that represent Chinese, Japanese, and Korean characters. Contrary to its name, Shift-JIS is a nonshifting double-byte character set.

Shifting DBCSs

A shifting DBCS is another double-byte encoding scheme in use that doesn’t use the lead byte and trail byte concept. The IBM DB2 UDB for OS/390 and z/OS EBCDIC-based Japanese, Chinese, and Korean character sets use this shifting encoding scheme.

Instead of reserving a range of bytes as lead bytes, shifting DBCSs always keep track of whether they are in double-byte or single-byte mode. In double-byte mode, every two bytes form a character. In single-byte mode, every byte is a character in itself. To track what mode the character set is in, the system uses shifting characters. By default, data is expected to be single-byte. As soon as a double-byte character needs to be represented, a shift-in byte is added to the string. From this point on, all characters are expected to be two bytes. This continues until a shift-out byte is detected, which indicates that the character set should go back to one byte per character.

This scheme, while more complex than the lead byte and trail byte scheme, provides greater performance, because the system always knows how many bytes should be in any particular character. Unfortunately, it also increases the length of the string. For example, a character string that comprises a mixture of single-byte and double-byte characters could require more space to store in a shifting character set because you need to include the shift-in and shift-out bytes, as well as the data itself.
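The storage overhead of a shifting encoding can be observed with Python's standard codecs. Note the assumptions in this sketch: Python's standard library has no codec for the IBM EBCDIC shifting character sets, so ISO-2022-JP stands in as the shifting encoding, and it switches modes with three-byte escape sequences rather than single shift bytes, so the overhead shown is larger than it would be for an EBCDIC character set.

```python
# Mixed text: Latin a, hiragana a (U+3042), Latin b.
text = "a\u3042b"

nonshifting = text.encode("shift_jis")    # lead byte / trail byte scheme
shifting = text.encode("iso2022_jp")      # mode-switching scheme (stand-in)

print("nonshifting:", len(nonshifting), "bytes", nonshifting.hex())
print("shifting:   ", len(shifting), "bytes", shifting.hex())

# Both byte strings decode back to the same text; only the storage differs.
assert nonshifting.decode("shift_jis") == shifting.decode("iso2022_jp") == text
```

Text that switches frequently between single-byte and double-byte characters pays this overhead at every switch, which is worth keeping in mind when sizing files that use a shifting character set.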

Important! In the PeopleSoft system, shifting DBCSs have limited usage, such as for file I/O, and are not supported for use as a database character set.

Character Sets Across the Tiers of the PeopleSoft Architecture

PeopleSoft installations include multiple components, each of which must be able to handle differing character sets.

PeopleSoft application servers and clients (for example, the PeopleTools development environment and PeopleSoft Pure Internet Architecture pages) use Unicode exclusively and do not rely on other character sets to represent and process data. However, depending on your environment, not all system components support Unicode-encoded data. Therefore, you might not be able to run all parts of your system in Unicode. For example, some database platforms and third-party products do not support Unicode. The following table illustrates support for Unicode in the PeopleSoft system.

| Tier | Component | Unicode Support |
|---|---|---|
| Client | PeopleTools development environment | Yes |
| | PeopleSoft Pure Internet Architecture pages | Yes |
| Web server | Web server | Yes |
| Application server | Application server | Yes |
| Database server | Non-Unicode DB (Western European or Japanese) | No |
| | Unicode DB | Yes |
| File attachment storage location | FTP server | Yes |
| | HTTP repository | Yes |
| | Database table | See the previous entry for the database server tier |

Examples of how to configure these tiers are provided in this chapter.

See Understanding Character Set Selection.

In addition to the tiers listed in the previous table, PeopleTools enables you to configure these system components to use other character sets:

When Unicode is not used for any of these types of operations or data storage, the PeopleSoft system transparently handles the conversion from Unicode to a non-Unicode character set. The non-Unicode character set that is used depends on several settings, which are discussed in detail later in this chapter.

PSCHARSETS Table

The character sets that the PeopleSoft system supports are defined in the PSCHARSETS table. The following table lists these character sets and the names by which they may be referred to in PeopleSoft applications. You may need to know the correct character set name to use in several situations including:

| PSCHARSETS Character Set Name | Description and Comments | Character Set Type |
|---|---|---|
| ANSI | Current ANSI-based code page. Not really a character set, but causes the system to use the default non-Unicode character set of the host operating system. | SBCS or DBCS, depending on the host operating system. |
| ASCII | 7-bit US-ASCII | SBCS |
| Big5 | Big5 (Traditional Chinese) | Nonshifting DBCS |
| CCSID1027 | IBM EBCDIC 1027 (Japanese-Latin) | SBCS |
| CCSID1047 | IBM EBCDIC 1047 (Latin1) | SBCS |
| CCSID290 (1) | IBM EBCDIC 290 (Katakana) | SBCS |
| CCSID300 (1) | IBM EBCDIC 300 (Kanji) | Nonshifting DBCS |
| CCSID930 (2) | IBM EBCDIC 930 (Kana-Kanji) | Shifting DBCS |
| CCSID935 (2) | IBM EBCDIC 935 (Simplified Chinese) | Shifting DBCS |
| CCSID937 (2) | IBM EBCDIC 937 (Traditional Chinese) | Shifting DBCS |
| CCSID939 (2) | IBM EBCDIC 939 (Latin-Kanji) | Shifting DBCS |
| CCSID942 | IBM EBCDIC 942 (Japanese PC) | Nonshifting DBCS |
| CP1026 | Windows 1026 (EBCDIC) | SBCS |
| CP1250 | Windows 1250 (Eastern Europe) | SBCS |
| CP1251 | Windows 1251 (Cyrillic) | SBCS |
| CP1252 | Windows 1252 (Western Europe) | SBCS |
| CP1253 | Windows 1253 (Greek) | SBCS |
| CP1254 | Windows 1254 (Turkish) | SBCS |
| CP1255 | Windows 1255 (Hebrew) | SBCS |
| CP1256 | Windows 1256 (Arabic) | SBCS |
| CP1257 | Windows 1257 (Baltic) | SBCS |
| CP1258 | Windows 1258 (Vietnamese) | SBCS |
| CP1361 | Windows 1361 (Korean Johab) | SBCS |
| CP437 | MS-DOS 437 (U.S.) | SBCS |
| CP500 | Windows 500 (EBCDIC 500V1) | SBCS |
| CP708 | Windows 708 (Arabic - ASMO708) | SBCS |
| CP720 | Windows 720 (Arabic - ASMO) | SBCS |
| CP737 | Windows 737 (Greek - 437G) | SBCS |
| CP775 | Windows 775 (Baltic) | SBCS |
| CP850 | MS-DOS 850 (Western Europe) | SBCS |
| CP852 | MS-DOS 852 (Eastern Europe) | SBCS |
| CP855 | MS-DOS 855 (IBM Cyrillic) | SBCS |
| CP857 | MS-DOS 857 (IBM Turkish) | SBCS |
| CP860 | MS-DOS 860 (IBM Portuguese) | SBCS |
| CP861 | MS-DOS 861 (Icelandic) | SBCS |
| CP862 | MS-DOS 862 (Hebrew) | SBCS |
| CP863 | MS-DOS 863 (Canadian French) | SBCS |
| CP864 | MS-DOS 864 (Arabic) | SBCS |
| CP865 | MS-DOS 865 (Nordic) | SBCS |
| CP866 | MS-DOS 866 (Russian) | SBCS |
| CP869 | MS-DOS 869 (Modern Greek) | SBCS |
| CP870 | Windows 870 | SBCS |
| CP874 | Windows 874 (Thai) | SBCS |
| CP875 | Windows 875 (EBCDIC) | SBCS |
| CP932 | Windows 932 (Japanese) | Nonshifting DBCS |
| CP936 | Windows 936 (Simplified Chinese) | Nonshifting DBCS |
| CP949 | Windows 949 (Korean) | Nonshifting DBCS |
| CP950 | Windows 950 (Traditional Chinese) | Nonshifting DBCS |
| EBCDIC | IBM EBCDIC CCSID37 (USA) | SBCS |
| EUC-JP | Extended UNIX code (Japanese) | Nonshifting DBCS |
| EUC-KR | Extended UNIX code (Korean) | Nonshifting DBCS |
| EUC-TW | Extended UNIX code (Taiwan) | Nonshifting DBCS |
| EUC-TW-1986 | Extended UNIX code (TW-1986) | Nonshifting DBCS |
| GB12345 | GB 12345 (Traditional Chinese) | Nonshifting DBCS |
| GB18030 | GB18030 (Simplified Chinese) | Nonshifting DBCS |
| GB2312 | GB 2312 (Simplified Chinese) | Nonshifting DBCS |
| HKSCS | Hong Kong Supplementary Character Set | Nonshifting DBCS |
| ISO-2022-JP (2, 3) | ISO-2022-JP Japanese | Shifting DBCS |
| ISO-2022-KR (2) | ISO-2022-KR Korean | Shifting DBCS |
| ISO_8859-1 | ISO 8859-1 (Latin1) | SBCS |
| ISO_8859-10 | ISO 8859-10 (Latin6) | SBCS |
| ISO_8859-11 | ISO 8859-11 (Thai) | SBCS |
| ISO_8859-14 | ISO 8859-14 (Latin8) | SBCS |
| ISO_8859-15 | ISO 8859-15 (Latin9/Latin0) | SBCS |
| ISO_8859-2 | ISO 8859-2 (Latin2) | SBCS |
| ISO_8859-3 | ISO 8859-3 (Latin3) | SBCS |
| ISO_8859-4 | ISO 8859-4 (Latin4) | SBCS |
| ISO_8859-5 | ISO 8859-5 (Cyrillic) | SBCS |
| ISO_8859-6 | ISO 8859-6 (Arabic) | SBCS |
| ISO_8859-7 | ISO 8859-7 (Greek) | SBCS |
| ISO_8859-8 | ISO 8859-8 (Hebrew) | SBCS |
| ISO_8859-9 | ISO 8859-9 (Latin5) | SBCS |
| JIS_X0201 (1) | Japanese Half-width Katakana | Nonshifting DBCS |
| JIS_X_0208 | Japanese Kanji | Nonshifting DBCS |
| Java | Java (Unicode encoding) | Unicode |
| Johab | Johab (Korean) | Nonshifting DBCS |
| Shift_JIS | Shift-JIS (Japanese) | Nonshifting DBCS |
| UCS2 | Unicode UCS-2 | Unicode |
| UTF-8 | Unicode UTF-8 | Unicode |
| UTF7 (1) | Unicode UTF-7. (An outdated Unicode 7-bit clean transformation that is sometimes used for email that must pass through gateways that do not support 8-bit characters.) | Unicode |
| UTF8 | Unicode UTF-8 | Unicode |
| UTF8BOM | Unicode UTF-8 with BOM (byte-order mark) | Unicode |

1 Not commonly used.

2 In the PeopleSoft system, shifting DBCSs have limited usage, such as for file I/O, and are not supported for use as a database character set.

3 To use certain Windows-31J (also known as Microsoft CP932) characters in incoming or outgoing email messages, you must complete additional configuration of your web server (incoming email) and application server or PeopleSoft Process Scheduler (outgoing email).

See Selecting Email Character Sets.

This PeopleBook also contains information about supported character set encodings for globalization when using SQR for PeopleSoft.

See SQR for PeopleSoft-Supported Character Set Encodings.

See Also

PeopleTools 8.52 Hardware and Software Requirements Guide

For more information and code charts for Microsoft code pages, visit http://msdn.microsoft.com/en-us/goglobal/bb964654.aspx

For more information and code charts for Unicode, visit http://www.unicode.org

Selecting Character Sets

This section provides an overview of selecting character sets and discusses how to:

See Also

Understanding Character Sets

Understanding Character Set Selection

When configuring your PeopleSoft system, you need to consider the character set (or sets) that will be in use on the following tiers:

Some operations of your PeopleSoft system require the interaction of multiple tiers. For example, the uploading of a file attachment involves the browser on the client, the web server, the application server, the database server, and ultimately the file attachment storage location. To ensure the correct transfer of data and files between these tiers, Oracle recommends configuring each server tier (web server, application server, database server, and file storage location) to use the same character set as follows:

Clients can always be configured to use the native language of the user of that workstation or browser.

The following table depicts example character set settings across all tiers for three typical configurations—a multi-language environment, a single language environment (Western), and a single language environment (non-Western).

Note. This table shows examples for a particular combination of languages and platforms; your specific configuration could differ.

| Tier (Platform) | Multi-Language | Single Language (Western: French) | Single Language (Non-Western: Japanese) | Where to Check |
|---|---|---|---|---|
| Client (Windows) | Any (for example, English uses CP1252). | French (uses CP1252). | Japanese (uses CP932). | Start, Settings, Control Panel, Regional Options |
| Web server (Linux) – Shell processes | en_US.utf8 | fr_FR.iso88915 | ja_JP.sjis | locale command |
| Application server (Linux) – PSAPPSRV processes | utf-8 | latin15 | sjis | psappsrv.cfg, [PSTOOLS], Character Set |
| Application server (Linux) – Email processes | utf-8 | utf-8 | utf-8 | psappsrv.cfg, [SMTP Settings], SMTP Character Set |
| Application server (Linux) – Shell processes | en_US.utf8 | fr_FR.iso88915 | ja_JP.sjis | locale command |
| Database server (Oracle) | AL32UTF8 | WE8ISO8859P15 | JA16SJISTILDE | NLS_DATABASE_PARAMETERS |
| File attachments: FTP site (4) (Linux) – Shell processes | en_US.utf8 | fr_FR.iso88915 | ja_JP.sjis | locale command |

4 For file attachments, if the storage location is a database table or an HTTP repository, then the configuration of one of the other server tiers will also configure the character set in use for a file attachment storage location on that tier. Specifically, a database table as a storage location depends on the settings for the database server; an HTTP file repository as a storage location depends on the web server settings if the HTTP repository is deployed on the web server. In the preceding table, information is provided for an FTP site as a storage location only because an FTP site can be deployed independently from the other server tiers.

Failure to configure character sets correctly across server tiers can result in garbled file names.
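The following Python sketch shows how such garbling arises when one tier writes a file name in UTF-8 and another tier reads the same bytes as ISO 8859-15. The file name is a made-up example, and the two codecs simply model two mismatched tiers; no PeopleSoft component is involved.

```python
# Hypothetical attachment name containing accented Latin characters.
file_name = "r\u00e9sum\u00e9.txt"

# One tier stores the name as UTF-8 bytes ...
stored_bytes = file_name.encode("utf-8")

# ... and a tier configured for ISO 8859-15 interprets each byte as one character.
garbled = stored_bytes.decode("iso8859_15")

# A tier configured for the same character set recovers the original name.
recovered = stored_bytes.decode("utf-8")

print(repr(file_name), "->", repr(garbled))
print("recovered correctly:", recovered == file_name)
```

Each accented letter occupies two bytes in UTF-8, so the mismatched tier displays two unrelated characters in its place. The bytes themselves are intact; only their interpretation differs.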

See Also

Attachments with non-ASCII File Names

Selecting Database Character Sets

The primary character set decision that you must make when installing a PeopleSoft implementation is which character set to use for the database system. Ideally, all databases are encoded in Unicode; however, in some cases Unicode requires several bytes to represent each character when only one byte may be required in a non-Unicode character set. Therefore, the PeopleSoft system enables you to use certain non-Unicode character sets for the database.

By using a Unicode encoded database, you can maintain a single database with data in any combination of languages. A single PeopleSoft application server can serve multiple users connecting to the mixed-language database, regardless of the language or character set of those users’ client machines. The only restriction on a user’s ability to access mixed-language data is the capability of the user’s client workstation to interpret, display, and accept keyboard entry of the characters from the various languages.

Most language or region-specific non-Unicode character sets provide sufficient characters for only a few languages. If you create a non-Unicode database, you must ensure that all of the characters for all of the languages that you plan on using can be represented in the character set that you choose.

The following table lists whether a PeopleSoft language is supported in a Unicode or non-Unicode database character set:

| Language Code | Language | Database Character Set |
|---|---|---|
| ARA | Arabic | Unicode |
| BUL | Bulgarian | Unicode |
| CFR | Canadian French | Unicode or non-Unicode |
| CRO | Croatian | Unicode |
| CZE | Czech | Unicode |
| DAN | Danish | Unicode or non-Unicode |
| DUT | Dutch | Unicode or non-Unicode |
| ENG | US English | Unicode or non-Unicode |
| FIN | Finnish | Unicode or non-Unicode |
| ESP | Spanish | Unicode or non-Unicode |
| FRA | French | Unicode or non-Unicode |
| GER | German | Unicode or non-Unicode |
| HUN | Hungarian | Unicode |
| ITA | Italian | Unicode or non-Unicode |
| JPN | Japanese | Unicode or non-Unicode |
| KOR | Korean | Unicode |
| NOR | Norwegian | Unicode or non-Unicode |
| POL | Polish | Unicode |
| POR | Portuguese | Unicode or non-Unicode |
| ROM | Romanian | Unicode |
| RUS | Russian | Unicode |
| SER | Serbian | Unicode |
| SLK | Slovak | Unicode |
| SLV | Slovenian | Unicode |
| SVE | Swedish | Unicode or non-Unicode |
| THA | Thai | Unicode |
| UKE | UK English | Unicode or non-Unicode |
| ZHS | Simplified Chinese | Unicode |
| ZHT | Traditional Chinese | Unicode |

Depending on the data that you store and how the database stores Unicode characters, a Unicode database can be significantly larger than a non-Unicode database. However, only the storage of character data is affected; the space that is required for non-character data, such as numbers and dates (which are stored by the database system as numbers), is not affected.

Depending on the database platform, you can use one of the four character set types (SBCS, nonshifting DBCS, shifting DBCS, or Unicode) when creating the database. However, the number of characters that you can store in each column is affected greatly by the type of character set that you choose for the database encoding.
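To get a feel for how the character set affects storage, the sketch below compares the byte length of the same text in several encodings and checks it against a column limit. It assumes a column whose length is measured in bytes; whether your database measures column length in bytes or characters depends on the platform and its configuration, so treat the limit here as hypothetical. The function names and sample values are invented for this example.

```python
def byte_length(text, charset):
    """Number of bytes needed to store `text` in `charset`."""
    return len(text.encode(charset))

def fits_in_column(text, charset, column_bytes):
    """True if the encoded text fits a column limited to `column_bytes` bytes."""
    return byte_length(text, charset) <= column_bytes

western = "M\u00fcller"                 # 6 characters, one of them accented
japanese = "\u5c71\u7530\u592a\u90ce"   # 4 kanji characters

print(byte_length(western, "iso8859_1"), byte_length(western, "utf-8"))
print(byte_length(japanese, "shift_jis"), byte_length(japanese, "utf-8"))

# Hypothetical 10-byte column: the Japanese name fits in Shift-JIS but not in UTF-8.
print(fits_in_column(japanese, "shift_jis", 10), fits_in_column(japanese, "utf-8", 10))
```

Running representative data through a comparison like this can help estimate how much additional space a Unicode database needs for your mix of languages.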

See Also

The Unicode Standard

Non-Unicode Character Sets

PeopleTools 8.52 Hardware and Software Requirements Guide

PeopleTools 8.52 installation guide for your database platform

Your operating system and database guides

Selecting Application Server Character Sets

All data that is stored in memory and processed by the PeopleTools application server is held in Unicode. However, files that the application server creates on the server (through PeopleCode file layout objects), as well as its log and trace files, can be written in Unicode or in a non-Unicode character set.

Each PeopleSoft application server is configured with a default non-Unicode character set. If a file operation must create a non-Unicode file, this character set is used, unless another character set is explicitly specified in the file operation. For example, if you create a file layout object to write a non-Unicode file, but you don’t specify in which character set the file should be created, the default non-Unicode character set of the application server is used.
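The fallback rule described above (use the explicitly requested character set, otherwise the server default) can be modeled in a few lines of Python. This is an analogy only: it is not PeopleCode and does not use any PeopleTools API. The constant `DEFAULT_NON_UNICODE_CHARSET`, its value, and both function names are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in for the server's configured default non-Unicode character set.
DEFAULT_NON_UNICODE_CHARSET = "cp1252"

def write_non_unicode_file(path, text, charset=None):
    """Write text in the explicit charset, or fall back to the configured default."""
    chosen = charset if charset else DEFAULT_NON_UNICODE_CHARSET
    with open(path, "w", encoding=chosen, newline="") as handle:
        handle.write(text)
    return chosen

def demo():
    """Write the euro symbol twice and return the raw bytes of each file."""
    with tempfile.TemporaryDirectory() as folder:
        default_path = os.path.join(folder, "default.txt")
        explicit_path = os.path.join(folder, "explicit.txt")
        write_non_unicode_file(default_path, "\u20ac")                 # default applies
        write_non_unicode_file(explicit_path, "\u20ac", "iso8859_15")  # explicit charset wins
        with open(default_path, "rb") as first, open(explicit_path, "rb") as second:
            return first.read(), second.read()

print(demo())
```

The two files contain different bytes for the same character, which is why a program that later reads such a file needs to know which character set was in effect when it was written.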

Microsoft Windows enables you to change the default character set of the system, although as installed, the default character set matches the default locale of the Microsoft Windows installation. To change the system default locale (and therefore the character set), on Microsoft Windows servers, use the Control Panel’s Regional Options menu. In the Language settings for the system section, click the Set Default button.

When running on Unix/Linux, the PeopleSoft application server enables you to specify the default non-Unicode character set in the application server’s configuration file, which you select by using the PSADMIN tool. Any valid PeopleSoft character set with a character set type of SBCS or nonshifting DBCS is a valid default non-Unicode character set for PeopleSoft application servers that run on Unix/Linux.

See Also

Character Sets Across the Tiers of the PeopleSoft Architecture

Selecting and Managing Client Workstation Character Sets

You must consider the client components of PeopleTools when you are planning your language strategy. The requirements for language support on client workstations are different, depending on whether you are using the PeopleSoft Pure Internet Architecture or the PeopleTools development tools for Microsoft Windows.

This section discusses:

Character Sets and Fonts in the PeopleSoft Pure Internet Architecture

The PeopleSoft Pure Internet Architecture serves all HTML pages in the UTF-8 encoding of Unicode. This encoding is recognized automatically by the web browser, because the encoding of the page is announced in the HTTP header when the browser communicates with the web server. All browsers supported by PeopleTools can support UTF-8 encoded HTML pages.
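As an illustration of how a client can read the encoding announced in the HTTP header, the following Python sketch extracts the charset parameter from a Content-Type value using the standard library's header parser. The function name is invented for this example, and the fallback value is an assumption made here for the case where the header names no charset.

```python
from email.message import Message

def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset named in a Content-Type header value, lowercased."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return message.get_content_charset() or default

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When diagnosing display problems, checking the Content-Type header that the web server actually returns is a quick way to confirm that pages are being announced as UTF-8.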

However, the browser needs other components to correctly display and enter the vast array of characters that are available in Unicode. Specifically, you need appropriate fonts to display the various scripts in which you expect data to be maintained. In addition, you might need alternate keyboard layouts or, in the case of ideographic scripts such as Chinese, Japanese, and Korean, input method editors (IMEs) to convert sequences of keystrokes into ideographs. The requirement for alternate keyboard layouts and IMEs is the same for both the PeopleSoft Pure Internet Architecture and the PeopleTools development environment.

Not all fonts contain a full repertoire of Unicode characters, because many fonts are tailored to address a specific list of languages and contain only the glyphs that are required by those languages. If you try to view Unicode data with a font that does not contain the appropriate characters for the displayed language, you will most likely see square boxes in place of the appropriate characters. The data has not been corrupted; there is just no glyph available in the current font for the character that the system is trying to display. For this reason, you may need to license or configure several fonts for a global PeopleSoft system.

The PeopleSoft Pure Internet Architecture includes a set of style sheets, defined with Application Designer, that determine the font that is used to display HTML pages. In some cases, the application data may contain characters that are not present in this font and that require a different font.

The Albany TrueType fonts shipped in the PS_HOME\fonts\truetype directory support all of the languages supported by the PeopleSoft system. Alternatively, you may need to obtain and configure fonts that contain the characters for the languages that you are planning to use, if your workstations are not already configured with these fonts. Obtain fonts from the following sources:

Depending on your browser, you can also download fonts from your browser’s manufacturer.

To enable the display of GB18030 characters, you can use either the SimSun-18030 font from Microsoft or the Albany fonts shipped in the PS_HOME\fonts directory. Both of these fonts have glyphs for the supported ranges of the GB18030 character set.

Fonts and the PeopleTools Development Environment

PeopleTools enables you to specify the font that is used for all graphical components for all PeopleTools modules that run on Windows, such as Application Designer. Use these methods to specify fonts:

Input Methods

If users will enter translated data by using PeopleSoft Pure Internet Architecture or the PeopleTools development environment, you must ensure that an appropriate keyboard layout or input method editor is installed on the workstation.

Most alphabetic languages can be typed by using a relatively simple keyboard layout. Several specialized keyboard layouts exist for most languages; configure these keyboard layouts through your operating system. For example, a Spanish keyboard layout contains keys for the n-tilde character (ñ) and several other accented characters.

However, certain PeopleSoft hot keys do not work as expected on alternate, non-U.S. keyboard layouts. For example, Alt+', Alt+\, and Alt+/ do not produce the expected results on the AZERTY keyboard. This occurs because some keys on non-U.S. keyboards produce different key codes than the same key on a U.S. keyboard (also known as a QWERTY keyboard).

A solution to this problem can be found in the appendix.

See PeopleSoft Hot Keys Do Not Function As Expected on a non-U.S. Keyboard.

There are several ways of entering accented characters by using a nonlocalized keyboard. Your operating system manual can help you use specialized keyboard layouts, such as the English international layout, which enables you to enter accented characters by using two keystrokes. The Microsoft web site contains information about keyboards that are supported by Microsoft Windows and instructions for installing and configuring Windows keyboard layouts.

Ideographic languages, such as Chinese, Japanese, and Korean, require the use of a front-end processor to intercept multiple keyboard strokes and transform them into an ideographic character. These are known as IMEs, and they must be installed on each workstation where you plan to enter the ideographic languages.

Most localized versions of operating systems for these languages come preconfigured with IMEs that are appropriate for the language that is supported by the operating system. But on systems where the default locale is not Chinese, Japanese, or Korean, you may need to configure or license an IME from a third-party vendor. The PeopleSoft Pure Internet Architecture supports any IME that is supported by your browser. The designer tools in Microsoft Windows support all standard Microsoft IMEs.

Selecting Email Character Sets

The PeopleSoft system supports UTF-8 for outgoing Simple Mail Transfer Protocol (SMTP) email messages from PeopleTools application servers. In addition, the PeopleSoft system supports several additional encodings for outgoing email.

PeopleSoft application servers support the following for outgoing email:

Specifying Email Character Sets

You specify an email character set in the SMTPCharacterSet parameter in the application server configuration file, psappsrv.cfg. By default, the SMTPCharacterSet parameter is set to UTF-8.

Note. You should specify a value for the SMTPCharacterSet parameter. If you do not specify a value, email is sent as-is, with no encoding. Leave the parameter set to the default value of UTF-8 if you are not certain about which value to use.

For example, to use ISO-2022-JP encoding for outgoing SMTP mail, in the psappsrv.cfg file, set the SMTPCharacterSet parameter to ISO-2022-JP, as shown in the following example:

[SMTP Setting]
...
SMTPCharacterSet=ISO-2022-JP
SMTPEncodingDLL=blank

You can also write your own SMTPEncodingDLL modules, if necessary.
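To show what choosing ISO-2022-JP means for the bytes of a message, the following Python sketch encodes a short Japanese string with Python's iso2022_jp codec. This illustrates the encoding only; it is not how the application server encodes mail, and the helper name encode_for_mail is hypothetical.

```python
def encode_for_mail(text, codec="iso2022_jp"):
    """Encode text with a Python codec; iso2022_jp is Python's name for ISO-2022-JP."""
    return text.encode(codec)


body = encode_for_mail("日本語")

# ISO-2022-JP switches between character sets with escape sequences,
# so every byte of the result stays in the 7-bit range.
print(body)
print(all(byte < 0x80 for byte in body))

# The same text in UTF-8 needs 8-bit bytes.
print(any(byte >= 0x80 for byte in "日本語".encode("utf-8")))
```

Because ISO-2022-JP output is 7-bit clean, it has been a common choice for Japanese mail, whereas UTF-8 covers every language with a single setting.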

Using Extended Japanese Characters

To use certain Windows-31J (also known as Microsoft CP932) characters—specifically, NEC special characters, NEC-selected IBM extended characters, IBM extension characters, and user-defined characters—in incoming or outgoing email messages with the ISO-2022-JP Japanese character set, you must complete additional configuration of your web server (for incoming email) and application server or PeopleSoft Process Scheduler (for outgoing email).

For incoming email on the web server, the following JVM setting must be added to the JAVA_OPTIONS_WIN32 parameter in the setenv.cmd file:

SET JAVA_OPTIONS_WIN32= "-Dsun.nio.cs.map=x-windows-iso2022jp/ISO-2022-JP"

For outgoing email on the application server or PeopleSoft Process Scheduler, the following JVM option must be added to either the psappsrv.cfg file or the psprcs.cfg file, depending on whether the application server or an Application Engine (AE) program, respectively, handles outgoing email messages. JVM options are set in the PSTOOLS section of the file:

[PSTOOLS]
...
JavaVM Options=-Dsun.nio.cs.map=x-windows-iso2022jp/ISO-2022-JP

In addition, your web server, application server, and PeopleSoft Process Scheduler must use a Java Runtime Environment (JRE) or Java Development Kit (JDK) that is supported for extended Japanese characters. See the release notes on the My Oracle Support website.

See My Oracle Support, Knowledge, Tools and Technology, Documentation, Release Notes.
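The reason for the extra configuration can be seen outside the JVM as well. In the following Python sketch, an NEC special character is encodable in Windows-31J (Python codec cp932) but is rejected by the strict ISO-2022-JP codec. The Python codecs stand in for the Java charsets here as an analogy; the JVM setting above is what actually maps ISO-2022-JP to the Windows-compatible variant in PeopleTools.

```python
NEC_SPECIAL = "\u2460"  # CIRCLED DIGIT ONE, one of the NEC special characters


def can_encode(text, codec):
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Windows-31J (cp932) includes the NEC special characters.
print(can_encode(NEC_SPECIAL, "cp932"))

# Strict ISO-2022-JP is limited to JIS X 0208 and has no slot for them.
print(can_encode(NEC_SPECIAL, "iso2022_jp"))
```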

Converting Between Character Sets

You can use PeopleCode file functions to convert files or text strings from one supported PeopleSoft character set to another supported PeopleSoft character set.

PeopleCode operations such as GetFile, GetTempFile, Open, ReadLine, and WriteLine automatically account for the file encoding. Therefore, you can:

When you use a character set such as UCS2 or UTF8BOM, the WriteLine PeopleCode function adds the byte order mark (BOM) at the beginning of the file contents. The ReadLine PeopleCode function skips the BOM and does not interpret it as a text character. Because the BOM is recognized as metadata and not as part of the file's text, the WriteLine function does not add the BOM to the file contents when writing in a non-Unicode character set.
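The same BOM behavior can be observed with Python's utf-8-sig codec, which is used here as a rough analogue of the UTF8BOM character set. That mapping is an assumption for illustration, not a statement about how PeopleTools implements it.

```python
text = "PeopleSoft"

with_bom = text.encode("utf-8-sig")  # writer prepends the UTF-8 BOM (EF BB BF)
plain = text.encode("utf-8")         # same text without a BOM

print(with_bom[:3].hex())            # efbbbf
print(len(with_bom) - len(plain))    # 3

# A BOM-aware reader skips the mark, so it never appears in the text.
print(with_bom.decode("utf-8-sig") == text)

# A reader that is not BOM-aware sees it as the character U+FEFF.
print(with_bom.decode("utf-8") == "\ufeff" + text)

# Writing to a non-Unicode character set produces no BOM.
print(text.encode("latin-1"))
```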

In the following example PeopleCode program, the FileEncodingConversion function converts files from one supported character set to another. In the body of the program, the function is called to convert from the UTF8BOM character set to the UCS2 character set:

/* FileEncodingConversion() converts a file from one character encoding to another. */
/* An example of how to call the function is at the end of this program. */

Local File &InputFile, &OutputFile, &DummyFile;
Local string &InputDirFile, &OutputDirFile, &InputFilename, &OutputFilename;
Local string &sDirSep, &LogLine;
Local array of string &FileEncoding;
Local boolean &ret;

Function FileEncodingConversion(&InputEncoding, &InputDirectoryFile, &OutputEncoding, &OutputDirectoryFile) Returns boolean
   &InputFile = GetFile(&InputDirectoryFile, "R", &InputEncoding, %FilePath_Absolute);
   &OutputFile = GetFile(&OutputDirectoryFile, "W", &OutputEncoding, %FilePath_Absolute);
   If &InputFile.IsOpen And &OutputFile.IsOpen Then
      /* ReadLine decodes from the input encoding; WriteLine encodes to the output encoding. */
      While &InputFile.ReadLine(&LogLine)
         &OutputFile.WriteLine(&LogLine);
      End-While;
      &InputFile.Close();
      &OutputFile.Close();
      Return True;
   Else
      If Not &InputFile.IsOpen Then
         WinMessage("Error: PeopleCode: File Encoding I/O: " | "Failed to open: " | &InputDirectoryFile);
      Else
         /* The input file opened but the output file did not; release the input file. */
         &InputFile.Close();
         WinMessage("Error: PeopleCode: File Encoding I/O: " | "Failed to open: " | &OutputDirectoryFile);
      End-If;
      If &OutputFile.IsOpen Then
         &OutputFile.Close();
      End-If;
      Return False;
   End-If;
End-Function;

/*-----------------------------------------------------------------------*/
/* Function IsUnix                                                       */
/* Returns True when /bin/sh exists, which indicates a Unix system.      */
/*-----------------------------------------------------------------------*/
Function IsUnix Returns boolean
   &DummyFile = GetFile("/bin/sh", "E", %FilePath_Absolute);
   If &DummyFile.IsOpen Then
      &DummyFile.Close();
      Return True;
   Else
      Return False;
   End-If;
End-Function;

REM Test the function above;
&FileEncoding = CreateArray("UTF8BOM", "UCS2", "SJIS", "GB18030", "UTF8", "a", "u");
REM WinMessage("ret: " | &ret);
If IsUnix() Then
   /* for Unix */
   &ret = FileEncodingConversion(&FileEncoding [1], "/home/FS_" | &FileEncoding [1] | ".txt", &FileEncoding [2], "/home/BEFORE/FS_PCode_" | &FileEncoding [1] | "_to_" | &FileEncoding [2] | ".txt");
Else
   /* for Windows */
   &ret = FileEncodingConversion(&FileEncoding [1], "D:\TMP\FS_" | &FileEncoding [1] | ".TXT", &FileEncoding [2], "D:\TMP\AFTER\FS_PCode_" | &FileEncoding [1] | "_to_" | &FileEncoding [2] | ".TXT");
End-If;
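For readers who want to try the same line-by-line conversion outside PeopleTools, here is a minimal Python sketch of the idea. The codec names utf-8-sig and utf-16 are used as approximate stand-ins for UTF8BOM and UCS2, and the file names are hypothetical files created in a temporary directory.

```python
import os
import tempfile


def convert_file_encoding(src_path, src_encoding, dst_path, dst_encoding):
    """Copy a text file line by line, decoding and re-encoding each line."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Hypothetical input and output files in a scratch directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in_utf8bom.txt")
dst = os.path.join(workdir, "out_utf16.txt")

sample = "line one\n日本語\n"
with open(src, "w", encoding="utf-8-sig", newline="") as f:
    f.write(sample)

convert_file_encoding(src, "utf-8-sig", dst, "utf-16")

with open(dst, "rb") as f:
    raw = f.read()

# The output starts with a UTF-16 BOM in the platform's byte order.
print(raw[:2].hex())
```

The input BOM is consumed by the reader and a new BOM is written for the output encoding, so the text itself never contains a stray U+FEFF character.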

Setting Data Field Length Checking

This section provides overviews of Application Designer field length semantics and field length checking for non-Unicode databases and discusses how to enable or disable data field length checking.

Understanding Application Designer Field Length Semantics

The database character set determines the way that PeopleTools interprets the column length that is defined in Application Designer.

If you create a Unicode database, the field length, as shown in Application Designer, indicates the maximum number of Unicode BMP characters that are permitted in the field, regardless of the Unicode encoding that is used by the database. Some database platforms, such as Oracle with byte semantics, use byte lengths to measure column sizes when operating in a Unicode database, while others use character lengths.

When the database uses byte-sized column lengths, the PeopleSoft system sizes the database columns based on the worst-case ratio between bytes and characters in the Unicode encoding that is used by your database. For example, if the AL32UTF8 character set is used by Oracle with byte semantics, the worst-case character-to-byte ratio when running against an Oracle Unicode database is 1:3. So, column size is tripled when creating a Unicode database on Oracle. A field that is defined in Application Designer as a CHAR(10) is created on an Oracle Unicode database with a type of VARCHAR2(30). This tripling of the maximum column size does not affect the actual size of the database, because variable length character fields do not reserve space in the database.
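The worst-case arithmetic above can be checked with a short Python sketch. It assumes that Python's utf-8 codec produces the same byte lengths as AL32UTF8 for characters in the Basic Multilingual Plane (BMP); the function names are hypothetical.

```python
def worst_case_utf8_bytes_per_bmp_char():
    """Largest UTF-8 length of any BMP code point (surrogates are not characters)."""
    return max(
        len(chr(cp).encode("utf-8"))
        for cp in range(0x10000)
        if not 0xD800 <= cp <= 0xDFFF
    )


def byte_semantics_column_size(field_length):
    """Column size in bytes needed to hold field_length BMP characters."""
    return field_length * worst_case_utf8_bytes_per_bmp_char()


print(worst_case_utf8_bytes_per_bmp_char())  # 3
print(byte_semantics_column_size(10))        # 30
```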

Other database platforms use character-based column lengths whose sizes represent the maximum number of Unicode characters instead of bytes that may be stored. Examples of this implementation are the NCHAR data type in Microsoft SQL Server and the GRAPHIC data type in DB2 UDB for Linux, Unix, and Microsoft Windows.

If you create a non-Unicode database, the field length in Application Designer represents the number of bytes that are permitted in the field, based on the character set that you used to create the database. Therefore, a PeopleSoft Unicode database gives you significantly more space for character data within the database when dealing with ideographic languages, such as Japanese, that require more than one byte of storage per character.

The following tables show some of the possible database encodings for database platforms that the PeopleSoft system supports in Unicode and DBCS and their effects on database column sizes. For a character field that is defined in Application Designer with a length of 10, each table shows the database representation and the worst-case number of characters that the field can hold.

This table shows the information for an Oracle database with byte semantics (used by PeopleSoft 8.9 applications and earlier):

| Database Character Set | Database Representation | Number of Characters |
|---|---|---|
| Unicode (AL32UTF8) | VARCHAR2(30) | 10 |
| Any SBCS | VARCHAR2(10) | 10 |
| Shift-JIS (JA16SJIS or JA16SJISTILDE) | VARCHAR2(10) | 5 |

This table shows the information for an Oracle database with character semantics (used by PeopleSoft 9.0 applications and later):

| Database Character Set | Database Representation | Number of Characters |
|---|---|---|
| Unicode (AL32UTF8) | VARCHAR2(10) | 10 |
| Any SBCS | VARCHAR2(10) | 10 |
| Shift-JIS (JA16SJIS or JA16SJISTILDE) | VARCHAR2(10) | 10 |

This table shows the information for a Microsoft SQL Server database with VARCHAR semantics:

| Database Character Set | Database Representation | Number of Characters |
|---|---|---|
| Unicode (UCS-2) | NVARCHAR(10) | 10 |
| Any SBCS | VARCHAR(10) | 10 |
| Shift-JIS (CP932) | VARCHAR(10) | 5 |

This table shows the information for a Microsoft SQL Server database with CHAR semantics:

| Database Character Set | Database Representation | Number of Characters |
|---|---|---|
| Unicode (UCS-2) | NCHAR(10) | 10 |
| Any SBCS | CHAR(10) | 10 |
| Shift-JIS (CP932) | CHAR(10) | 5 |

This table shows the information for a DB2 UDB for OS/390 and z/OS database:

| Database Character Set | Database Representation | Number of Characters |
|---|---|---|
| Any SBCS | CHAR(10) | 10 |
| Shifting DBCS (CCSID 930/939) | CHAR(10) | 4 (4 x 2-byte characters, plus shift-in and shift-out bytes) |

This table shows the information for all other databases:

| Database Character Set | Database Representation | Number of Characters |
|---|---|---|
| Any SBCS | CHAR(10) | 10 |

Understanding Field Length Checking for Non-Unicode Databases

The maximum number of characters that are permitted in a PeopleSoft field varies, depending on the character set of the database. Because all components of PeopleTools use Unicode for internal storage, by default, field length checking occurs in terms of Unicode character counts. This calculation is appropriate for Unicode databases and for any SBCS databases.

However, if you are using a non-Unicode DBCS, special length checking must occur each time you move off a field to ensure that the string that you entered fits in the database column when the string is converted to the database’s character set.

For graphically sizing page fields, PeopleTools uses the Unicode length of the field as defined in Application Designer. For example, if a field is defined in Application Designer as a 10-character field, page fields in both the PeopleSoft Pure Internet Architecture and the PeopleTools clients for Microsoft Windows allow 10 characters to be displayed unless manually resized by the developer.

However, if the database is encoded in a non-Unicode DBCS character set, such as Japanese Shift-JIS, special length validation must occur because the database column size is created relative to a byte count, not to a character count as is used by the simple field length validation.

For example, if a user enters 10 Japanese characters into a field that is defined as CHAR(10) in Application Designer, this string needs 20 bytes of storage in a nonshifting DBCS character set and 22 bytes of storage in a shifting character set. This 10-character input would fail insertion in both of these databases.
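The byte counts in this example can be reproduced with a small Python sketch. The nonshifting case uses Python's shift_jis codec directly. Python has no codec for the shifting EBCDIC character sets, so the shifting case is a simplified model that assumes one shift-out byte before and one shift-in byte after each run of double-byte characters. Both function names are hypothetical.

```python
def nonshifting_dbcs_bytes(text, codec="shift_jis"):
    """Bytes needed in a nonshifting DBCS such as Shift-JIS."""
    return len(text.encode(codec))


def shifting_dbcs_bytes(text, codec="shift_jis"):
    """Simplified model of a shifting DBCS.

    Assumption: each run of double-byte characters is wrapped in one
    shift-out byte and one shift-in byte. Shift-JIS is used only to
    decide which characters are double-byte.
    """
    total = 0
    in_double = False
    for ch in text:
        is_double = len(ch.encode(codec)) == 2
        if is_double and not in_double:
            total += 1  # shift-out byte opens a double-byte run
        elif not is_double and in_double:
            total += 1  # shift-in byte closes the run
        total += 2 if is_double else 1
        in_double = is_double
    if in_double:
        total += 1  # shift-in byte closes a run at the end of the string
    return total


name = "あいうえおかきくけこ"  # 10 Hiragana characters
column_bytes = 10             # CHAR(10) measured in bytes

print(nonshifting_dbcs_bytes(name))                  # 20
print(shifting_dbcs_bytes(name))                     # 22
print(nonshifting_dbcs_bytes(name) <= column_bytes)  # False
print(shifting_dbcs_bytes("あいうえ") <= column_bytes)  # True
```

Under this model, four double-byte characters need exactly 10 bytes in a shifting character set, which agrees with the 4-character limit shown for CCSID 930/939 in the earlier table.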

To address this issue, the page processor checks the Data Field Length Checking option on the PeopleTools Options page and performs character-set-specific length validation against the contents of each field when the field is validated. Typically, length validation occurs when the field’s FieldChange PeopleCode event fires, so the actual time of validation may differ, depending on whether your page uses deferred mode processing.

Enabling or Disabling Data Field Length Checking

To enable or disable data field length checking:

  1. Select PeopleTools, Utilities, Administration, PeopleTools Options.

    The PeopleTools Options page appears.

  2. From the Data Field Length Checking drop-down list box, select a value based on the character set that you are using for the database:

    Others

    Select if you are using a Unicode encoded database or a non-Unicode SBCS database. This option prevents special field length checking, which is not required by these types of databases.

    DB2 MBCS

    Select DB2 MBCS if you are running a Japanese database on the DB2 UDB for Linux, Unix, and Microsoft Windows platform. This option enables field length checking based on a shifting DBCS.

    MBCS

    Select if you are running a non-Unicode Japanese database on any other platform. This option enables field length checking based on a nonshifting DBCS.

    Note. The non-Unicode DBCS settings are specifically oriented to Japanese language installations, because Japanese is the only language that the PeopleSoft system supports in a non-Unicode DBCS encoding. All languages other than Western European languages and Japanese are supported by the PeopleSoft system only when using Unicode encoded databases.

  3. Click the Save button.

Using CJK Ideographic Characters in Name Character Fields

This section discusses PeopleSoft standard name conventions, including name conventions for Chinese, Japanese, and Korean (CJK) ideographic characters. PeopleSoft standard name conventions apply when data is entered or displayed in character fields that use Name as the format type. These conventions should be used when a complete name is constructed from multiple name character fields or when all name data is entered into a single name character field.

The PeopleSoft standard name convention is:

[lastname] [suffix],[prefix] [firstname] [middle name/initial]

Examples of typical suffixes include degrees, affiliations, and titles such as MD, PhD, Jr., and III. Examples of typical prefixes include titles and honorifics such as Ms., Mr., Dr., Rev., and Hon.

Valid examples of these conventions include:

| Name as Displayed by PeopleSoft Convention | Name Elements Used | Actual Name |
|---|---|---|
| O’Brien,Michael | [lastname],[firstname] | Michael O’Brien |
| Jones IV,James | [lastname] [suffix],[firstname] | James Jones IV |
| Phillips MD,Deanna Lynn | [lastname] [suffix],[firstname] [middle name] | Deanna Lynn Phillips, MD |
| Reynolds Jr.,Dr. John Q. | [lastname] [suffix],[prefix] [firstname] [middle initial] | Dr. John Q. Reynolds Jr. |
| Phipps-Scott,Ms. Adrienne | [lastname],[prefix] [firstname] | Ms. Adrienne Phipps-Scott |
| Knauft,Günter | [lastname],[firstname] | Günter Knauft |
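As a minimal sketch of how the standard convention assembles a name from separate fields, the following Python function joins the elements in the documented order. The function and parameter names are hypothetical; PeopleSoft applications build names with their own logic.

```python
def format_standard_name(last, first, middle="", prefix="", suffix=""):
    """Build '[lastname] [suffix],[prefix] [firstname] [middle name/initial]'.

    Empty elements are skipped, and there is no space after the comma.
    """
    before_comma = " ".join(part for part in (last, suffix) if part)
    after_comma = " ".join(part for part in (prefix, first, middle) if part)
    return f"{before_comma},{after_comma}"


print(format_standard_name("O’Brien", "Michael"))
print(format_standard_name("Phillips", "Deanna", middle="Lynn", suffix="MD"))
print(format_standard_name("Reynolds", "John", middle="Q.", prefix="Dr.", suffix="Jr."))
```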

However, if the name contains any CJK ideographic characters, different standard name conventions apply.

If the name contains any Japanese or Korean ideographic characters, the first and last names are separated by a space instead of a comma. In Japanese, a prefix or suffix is optional; in Korean, only an optional prefix can be used. These modified PeopleSoft standard name conventions can be used when a name includes any of the following types of characters:

The PeopleSoft standard name convention for Japanese names including these ideographic characters is:

[lastname] [firstname][{suffix|prefix}]

The PeopleSoft standard name convention for Korean names including these ideographic characters is:

[lastname] [firstname][prefix]

Valid examples of these conventions include:

| Name as Displayed by PeopleSoft Convention | Name Elements Used | English Equivalent |
|---|---|---|
| 塩次 伸二 | [lastname] [firstname] | Shinji Shiotsugu |
| 塩次 伸二様 | [lastname] [firstname][prefix] | Mr. Shinji Shiotsugu |
| 홍 길동 | [lastname] [firstname] | Hong Gildong |
| 홍 길동씨 | [lastname] [firstname][prefix] | Mr. (or Ms.) Hong Gildong |

If the name contains Chinese Hanzi, there is no space or comma between the first name, last name, suffix, and prefix. The PeopleSoft standard name convention for names including Chinese Hanzi characters is:

[lastname][firstname][{suffix|prefix}]

Valid examples of this convention include:

| Name as Displayed by PeopleSoft Convention | Name Elements Used | English Equivalent |
|---|---|---|
| 陳嘉明 | [lastname][firstname] | Chan Ka Ming |
| 陳嘉明先生 | [lastname][firstname][prefix] | Mr. Chan Ka Ming |
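The CJK conventions differ only in the separator between the last name and the first name, which the following Python sketch makes explicit. Because Japanese Kanji and Chinese Hanzi share the same Unicode code points, the sketch takes the language as an explicit argument instead of trying to detect it from the characters. The function name, the title parameter, and the language codes are hypothetical.

```python
def format_cjk_name(last, first, title="", language="ja"):
    """Format a CJK name according to the conventions described above.

    language: 'ja' or 'ko' puts a space between last and first name;
              'zh' uses no separator. The optional title (prefix or
              suffix) is appended directly after the first name.
    """
    if language in ("ja", "ko"):
        return f"{last} {first}{title}"
    if language == "zh":
        return f"{last}{first}{title}"
    raise ValueError(f"unsupported language code: {language}")


print(format_cjk_name("塩次", "伸二", title="様", language="ja"))
print(format_cjk_name("홍", "길동", title="씨", language="ko"))
print(format_cjk_name("陳", "嘉明", title="先生", language="zh"))
```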

See Also

Specifying Character Field Attributes

Character Processing

Detecting and Converting Between Character Types

PeopleTools also provides PeopleCode string functions that recognize and convert between different characters within the Japanese character set. This enables you to detect, convert, and enforce the types of characters that you can enter in any PeopleSoft field. For example, the PeopleSoft system uses these functions in the development of the Alternate Character Architecture in some PeopleSoft applications. The Alternate Character Architecture is used in several PeopleSoft applications to provide a feature that enables the entry of, and enforces the characters contained in, Japanese phonetic spellings (Furigana) by using the Hiragana or Katakana scripts.
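To give a feel for what detecting and converting Japanese character types involves, the following Python sketch classifies characters by their Unicode names and converts Hiragana to Katakana. It is an illustration built on Python's unicodedata module, not a reimplementation of the PeopleCode functions, and the names script_of and hiragana_to_katakana are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Classify a single character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA LETTER"):
        return "hiragana"
    if name.startswith("KATAKANA LETTER"):
        return "katakana"
    if name.startswith("HALFWIDTH KATAKANA"):
        return "halfwidth katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


def hiragana_to_katakana(text):
    """Convert Hiragana U+3041..U+3096 to Katakana, which sits 0x60 higher."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


print(script_of("ふ"), script_of("フ"), script_of("漢"))
print(hiragana_to_katakana("ふりがな"))  # フリガナ
```

A Furigana field that accepts only Katakana could, under these assumptions, reject any input for which script_of returns something other than "katakana", or convert Hiragana input before saving it.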

The following PeopleCode string functions can be used to recognize and convert between different characters within the Japanese character set:

See Also

CharType