Understanding COBOL in a Unicode Environment

This section discusses COBOL in a Unicode environment.

The character set that is used for PeopleSoft COBOL processing must match the character set for your database. If you created a Unicode database for the PeopleSoft system, you must also run COBOL in Unicode.

Note: In this document, the word character refers to a single character in any language, regardless of how many bytes are required to store the character.

The Unicode standard provides several methods of encoding Unicode characters into a byte stream. Each encoding has specific properties that make it suitable for use in different environments. The two main encodings that are important to understanding how PeopleSoft COBOL operates when running in Unicode are:

UCS-2 (2-byte Universal Character Set), which is the Unicode encoding that PeopleTools uses internally for data that is held in memory on the application server.

UCS-2 encodes all characters into a fixed storage space of two bytes.

UTF-8 (8-bit Unicode Transformation Format), which is the encoding that the PeopleSoft system uses in COBOL.

UTF-8 uses a format that varies from one to four bytes per character. Currently, the PeopleSoft system supports Unicode’s Basic Multilingual Plane (BMP), which requires one to three bytes per character. Four-byte UTF-8 characters are required to represent supplementary characters that are outside Unicode’s BMP.

In UTF-8, the actual number of bytes to encode a character can be determined by the first three bits of the first, and sometimes only, byte of a character. The following table shows how the bit pattern of the first byte is related to the number of bytes needed to encode the UTF-8 character.

Unicode Code Point Range	UTF-8 Bit Pattern	UTF-8 Character Length
U+0000 – U+007F	0`xxxxxxx`	One byte
U+0080 – U+07FF	110`xxxxx` 10`xxxxxx`	Two bytes
U+0800 – U+FFFF	1110`xxxx` 10`xxxxxx` 10`xxxxxx`	Three bytes

The x bit positions are filled with the bits of the character code number in binary representation. The rightmost x is the least-significant bit (a big-endian representation). In multi-byte sequences (for Unicode code points greater than U+007F), the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence. In addition, each byte in a multi-byte sequence has the most significant bit set.
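The lead-byte rule described above can be sketched in a few lines. This is an illustrative Python sketch, not PeopleSoft code: the function name `utf8_sequence_length` is hypothetical, and the sketch assumes BMP-only data (one to three bytes per character), so it rejects continuation bytes and four-byte lead bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte (BMP only)."""
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: U+0000 to U+007F
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: U+0080 to U+07FF
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: U+0800 to U+FFFF
        return 3
    # 10xxxxxx is a continuation byte; 11110xxx starts a four-byte sequence,
    # which this sketch treats as unsupported.
    raise ValueError(f"not a supported UTF-8 lead byte: 0x{first_byte:02X}")


# Lead bytes of the UTF-8 encodings of U+0061, U+00F1, and U+20AC.
for lead in (0x61, 0xC3, 0xE2):
    print(f"0x{lead:02X} -> {utf8_sequence_length(lead)} byte(s)")
```

The masks test the leading bits only, so the same check works on any byte position that is known to start a character.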

This section includes Unicode encoding examples for the following characters:

Character	Unicode Code Point	Description
a	U+0061	Latin small letter a.
ñ	U+00F1	Latin small letter ñ.
€	U+20AC	Euro symbol

The following table shows the difference in how UCS-2 and UTF-8 encode several characters:

Character	Unicode Code Point	UCS-2 Byte Values	UTF-8 Byte Values
a	U+0061	0x00 0x61	0x61
ñ	U+00F1	0x00 0xF1	0xC3 0xB1
€	U+20AC	0x20 0xAC	0xE2 0x82 0xAC
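The byte values in the table can be reproduced with a short script. The following Python sketch is for illustration only; the helper names `hex_bytes` and `encode_both` are hypothetical. It relies on the fact that, for BMP characters, big-endian UTF-16 produces the same two bytes as UCS-2.

```python
def hex_bytes(data: bytes) -> str:
    """Format a byte string as space-separated 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in data)


def encode_both(ch: str) -> tuple:
    """Return (UCS-2 bytes, UTF-8 bytes) for one BMP character, as hex text."""
    # For characters in the BMP, UTF-16 big-endian is byte-identical to UCS-2.
    ucs2 = ch.encode("utf-16-be")
    utf8 = ch.encode("utf-8")
    return hex_bytes(ucs2), hex_bytes(utf8)


# U+0061 (a), U+00F1 (n with tilde), U+20AC (euro sign)
for ch in ("\u0061", "\u00f1", "\u20ac"):
    ucs2, utf8 = encode_both(ch)
    print(f"U+{ord(ch):04X}  UCS-2: {ucs2}  UTF-8: {utf8}")
```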

The PeopleSoft system transparently handles the conversion between UCS-2 and UTF-8 when data is passed into the COBOL program from the database. If you are reading or writing files directly from a COBOL program, the input and output files are UTF-8 encoded when running PeopleSoft COBOL programs in Unicode.

Moving to a COBOL Unicode environment means that character data can potentially require three times the storage space that is required in a single-byte character environment, because a character that is encoded in UTF-8 takes from one to three bytes. To allow for this, all internal data definitions for character-type data in COBOL programs must be expanded to allow for three times as many bytes. This expansion is critical because in a Unicode PeopleSoft database, column sizes are calculated based on character length, not byte length. So, a CHAR(10) column on a Unicode database allows the storage of 10 characters, regardless of how many bytes each character takes to store. Given the three-bytes-per-character maximum of UTF-8 in the PeopleSoft system (four-byte UTF-8 characters are not yet supported), the maximum byte size of this CHAR(10) column is 30 bytes. Therefore, a COBOL type of PIC X(30) may be required to store the contents of a CHAR(10) field on a Unicode database.
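The sizing arithmetic for the CHAR(10) example can be checked with a small Python sketch. The names `expanded_pic_x_length` and `utf8_byte_length` are hypothetical and are not part of any PeopleSoft utility; the factor of three assumes BMP-only data, as stated above.

```python
MAX_UTF8_BYTES_PER_BMP_CHAR = 3  # assumption: no four-byte (supplementary) characters


def expanded_pic_x_length(char_length: int) -> int:
    """Worst-case byte length needed for a column declared as CHAR(char_length)."""
    return char_length * MAX_UTF8_BYTES_PER_BMP_CHAR


def utf8_byte_length(value: str) -> int:
    """Actual number of bytes that a string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


print(expanded_pic_x_length(10))         # worst case for CHAR(10)
print(utf8_byte_length("a" * 10))        # ten one-byte characters
print(utf8_byte_length("\u20ac" * 10))   # ten euro signs, three bytes each
```

Ten euro signs fill the full 30 bytes, while ten unaccented Latin letters still need only 10, which is why the expanded field is a maximum rather than a fixed requirement.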

The PeopleSoft system provides a COBOL conversion utility to automatically expand the character-data fields in the working storage to accommodate the number of bytes in the UTF-8 encoding scheme.

In a non-Unicode COBOL installation, parsing through a string is easy because you can assume that all characters coming in are one byte in length. But in UTF-8, a character can vary between one byte and three bytes in length. Therefore, you must incorporate special logic to handle string parsing when you are dealing with characters in UTF-8 format.
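As a rough illustration of the kind of logic involved, the following Python sketch walks a UTF-8 byte buffer one character at a time by reading each lead byte. It is not the logic that PeopleSoft COBOL uses; the function name `split_utf8_characters` is hypothetical, and the sketch assumes BMP-only data and does not validate continuation bytes.

```python
def split_utf8_characters(buffer: bytes) -> list:
    """Split a UTF-8 byte buffer into one bytes object per character (BMP only)."""
    chars = []
    i = 0
    while i < len(buffer):
        lead = buffer[i]
        if lead < 0x80:              # 0xxxxxxx: one-byte character
            length = 1
        elif lead & 0xE0 == 0xC0:    # 110xxxxx: two-byte character
            length = 2
        elif lead & 0xF0 == 0xE0:    # 1110xxxx: three-byte character
            length = 3
        else:
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")
        chunk = buffer[i:i + length]
        if len(chunk) != length:
            raise ValueError(f"truncated character at offset {i}")
        chars.append(chunk)
        i += length
    return chars


data = "a\u00f1\u20ac".encode("utf-8")   # 6 bytes, 3 characters
parts = split_utf8_characters(data)
print(len(data), len(parts))
```

The point of the sketch is that the byte offset advances by a variable amount per character, so a byte count and a character count can no longer be used interchangeably.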

Any in-memory sorting that is performed by using COBOL functions is performed as a binary sort in the current character encoding that is used for COBOL processing and may not necessarily match the sort order that is returned by the database in response to an ORDER BY clause. If you require the database to return data that is sorted by using a binary sort of its encoding rather than the default linguistically correct sort, you must use the %BINARYSORT meta-Structured Query Language (meta-SQL) function around each column in the WHERE or ORDER BY clause where binary ordering is important.

However, for DB2 UDB for OS/390 and z/OS implementations, this binary sorting is equivalent only when the COBOL program is run on a DB2 UDB for OS/390 and z/OS server. When the COBOL program runs on a client instead, the binary sort that is produced in COBOL differs from the binary sort that is produced by the database, because the database is encoded in EBCDIC and the client is in an ASCII-based encoding. Therefore, use the %BINARYSORT meta-SQL function only in COBOL programs that are not run through RemoteCall (the DB2 UDB for OS/390 and z/OS platform is not supported as a RemoteCall server).
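The difference between an EBCDIC binary sort and an ASCII-based binary sort can be seen with three simple values. This Python sketch is illustrative only: the helper `binary_sort` is hypothetical, and `cp037` is used as one example of an EBCDIC code page, which is an assumption because the code page of an actual database depends on its configuration.

```python
def binary_sort(values, encoding):
    """Sort strings by the raw byte values of their encoded form."""
    return sorted(values, key=lambda s: s.encode(encoding))


values = ["a", "A", "1"]

# ASCII-based encoding (UTF-8 here): digits sort before uppercase, then lowercase.
ascii_order = binary_sort(values, "utf-8")

# EBCDIC (cp037, assumed for illustration): lowercase, then uppercase, then digits.
ebcdic_order = binary_sort(values, "cp037")

print(ascii_order)
print(ebcdic_order)
```

Both results are valid binary sorts of the same data, yet they disagree, so a program that compares database-sorted rows with rows sorted in memory can see mismatches when the two sides use different encodings.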

When running against non-z/OS systems, the %BINARYSORT function can be used in both RemoteCall and non-RemoteCall programs.

For example:

SELECT RECNAME FROM PSRECDEFN WHERE %BINARYSORT(RECNAME) < %BINARYSORT('xxx')
SELECT RECNAME FROM PSRECDEFN ORDER BY %BINARYSORT(RECNAME)

Note: Using the %BINARYSORT meta-SQL token in WHERE and ORDER BY clauses often negates the use of any indexes, because most databases can't use indexes for functional comparisons (for example, WHERE %BINARYSORT(column) > 'X'). Use this syntax only when it is absolutely required that the sort order of SQL statement results match the COBOL in-memory sort order.

These error messages can occur when you are running a COBOL program against a Unicode database:

Fetch failed: unsuccessful UCS-2 to UTF-8 conversion on column column_number.
Bind of parameter failed: unsuccessful UTF-8 to UCS-2 conversion on column column_number.
Attempting to use a non-Unicode API to access a Unicode database.
Attempting to use a non-Unicode COBOL with a Unicode database.
Attempting to use a Unicode API to access a non-Unicode database.
Fetch failed: the converted Unicode length of length exceeds the allocated buffer length length for column column_number.

These messages appear in the COBOL output log file.
