12 EBCDIC Emulation for Mainframe Migration

This chapter provides guidance for migrating applications and databases from mainframe environments—where EBCDIC character sets are prevalent—to Oracle databases on modern platforms such as Linux. It covers migration strategies, recommendations for character set selection, the use of accurate source-to-target character mappings during data migration, and Oracle features for emulating EBCDIC binary ordering to support legacy application logic without modification.

This chapter includes the following topics:

  • Overview of Application Migration from a Mainframe

  • Choosing a Character Set for the Target Database

  • Choosing a Source Character Set for Data Migration

  • Emulating EBCDIC Binary Ordering

12.1 Overview of Application Migration from a Mainframe

Organizations may choose to migrate from mainframe platforms to modern environments to improve scalability, flexibility, and integration with emerging technologies, to optimize costs, and to support evolving business and regulatory requirements.

This section covers migration challenges and strategies, including:

  • Character Set Differences

  • Migration Strategies

  • Avoiding Dependency on Character Encoding

12.1.1 Character Set Differences

A major challenge in migrating from mainframe platforms is the fundamental difference in character encodings between the source and target environments. Character sets used on mainframe platforms are predominantly based on EBCDIC (Extended Binary Coded Decimal Interchange Code), developed by IBM for its mainframe and midrange computer systems. In contrast, character sets on Linux and other modern operating systems are based on ASCII (American Standard Code for Information Interchange), developed by the American National Standards Institute (ANSI). ASCII is a subset of the Unicode standard, and its characters are encoded identically in UTF-8.

Although EBCDIC and ASCII encode a similar set of basic characters, they assign very different binary codes. The most significant differences are:

  • ASCII assigns codes in the range 0-127, while EBCDIC uses the range 0-255. ASCII supersets, such as ISO 8859-1, use codes 128-255 for additional characters, while EBCDIC supersets add codes in unused positions among the basic characters.

  • In EBCDIC, digits have numerically higher codes than basic Latin letters and most accented letters. In ASCII, digits have lower codes than any letter.

  • In contrast to ASCII, the code range for the basic uppercase Latin letters (A-Z) in EBCDIC is not contiguous, but is divided into three ranges: A-I, J-R, and S-Z. Similarly, the lowercase letters (a-z) are divided into the ranges a-i, j-r, and s-z.

  • Basic Latin lowercase letters have lower codes than their uppercase counterparts in EBCDIC, but higher codes in ASCII.

  • The sets of control characters defined in EBCDIC and ASCII differ significantly.
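The differences listed above can be checked mechanically. The following sketch uses Python's standard cp037 codec as an illustrative stand-in for EBCDIC code page 37 (an assumption for demonstration purposes, not part of any Oracle tooling):

```python
# Compare EBCDIC (code page 37) and ASCII byte values for basic characters,
# using Python's built-in cp037 codec as a reference EBCDIC mapping.

def ebcdic(ch: str) -> int:
    """Return the EBCDIC code page 37 byte value of a single character."""
    return ch.encode("cp037")[0]

# Digits sort above letters in EBCDIC, below them in ASCII.
assert ebcdic("0") > ebcdic("Z")      # EBCDIC: '0' = 0xF0, 'Z' = 0xE9
assert ord("0") < ord("A")            # ASCII:  '0' = 0x30, 'A' = 0x41

# Uppercase A-Z is split into non-contiguous EBCDIC ranges.
assert ebcdic("J") - ebcdic("I") > 1  # gap between the A-I and J-R ranges
assert ebcdic("S") - ebcdic("R") > 1  # gap between the J-R and S-Z ranges

# Lowercase letters have lower codes than uppercase in EBCDIC,
# but higher codes in ASCII.
assert ebcdic("a") < ebcdic("A")      # 0x81 < 0xC1
assert ord("a") > ord("A")            # 0x61 > 0x41
```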

Table 12-1 and Table 12-2 illustrate the basic structure of EBCDIC and ASCII code pages. Empty cells indicate code points reserved in EBCDIC- and ASCII-based character sets for Latin letters with diacritics, non-Latin letters (such as Cyrillic or Greek), or additional punctuation. ASCII-based character sets from the ISO 8859 family also encode additional control codes in the range 0x80-0x9F.

Table 12-1 The Structure of an EBCDIC-Based Character Set (Code Page)

      -0    -1    -2    -3    -4    -5    -6    -7    -8    -9    -A    -B    -C    -D    -E    -F
0-    NUL¹  SOH   STX   ETX   SEL   HT    RNL   DEL   GE    SPS   RPT   VT    FF    CR    SO    SI
1-    DLE   DC1   DC2   DC3   RES   NL    BS    POC   CAN   EM    UBS   CU1   IFS   IGS   IRS   IUS
2-    DS    SOS   FS    WUS   BYP   LF    ETB   ESC   SA    SFE   SM/SW CSP   MFA   ENQ   ACK   BEL
3-                SYN   IR    PP    TRN   NBS   EOT   SBS   IT    RFF   CU3   DC4   NAK         SUB
4-    SP                                                          .     <     (     +     |
5-    &                                                     !²    $²    *     )     ;
6-    -     /                                               ,     %     _     >     ?
7-                                                    `²    :     #²    @²    '     =     "
8-          a     b     c     d     e     f     g     h     i
9-          j     k     l     m     n     o     p     q     r
A-          ~²    s     t     u     v     w     x     y     z
B-    ^²                                                    [²    ]²
C-    {²    A     B     C     D     E     F     G     H     I
D-    }²    J     K     L     M     N     O     P     Q     R
E-    \²          S     T     U     V     W     X     Y     Z
F-    0     1     2     3     4     5     6     7     8     9                                   EO

Footnote 1 Names of control codes are italicized.

Footnote 2 These codes are not present, or not present at the same positions, in all EBCDIC code pages. They are shown here as defined in EBCDIC code page 37 because they belong to the ASCII repertoire.

Table 12-2 The Structure of an ASCII-Based Character Set

      -0    -1    -2    -3    -4    -5    -6    -7    -8    -9    -A    -B    -C    -D    -E    -F
0-    NUL³  SOH   STX   ETX   EOT   ENQ   ACK   BEL   BS    HT    LF    VT    FF    CR    SO    SI
1-    DLE   DC1   DC2   DC3   DC4   NAK   SYN   ETB   CAN   EM    SUB   ESC   FS    GS    RS    US
2-    SP    !     "     #     $     %     &     '     (     )     *     +     ,     -     .     /
3-    0     1     2     3     4     5     6     7     8     9     :     ;     <     =     >     ?
4-    @     A     B     C     D     E     F     G     H     I     J     K     L     M     N     O
5-    P     Q     R     S     T     U     V     W     X     Y     Z     [     \     ]     ^     _
6-    `     a     b     c     d     e     f     g     h     i     j     k     l     m     n     o
7-    p     q     r     s     t     u     v     w     x     y     z     {     |     }     ~     DEL
8-
9-
A-
B-
C-
D-
E-
F-

Footnote 3 Names of control codes are italicized.

12.1.2 Migration Strategies

There are two main strategies for migrating an application system between platforms:

  • Data-Only Migration: Only the data is migrated to an ASCII-based system, while the mainframe applications continue to run in EBCDIC mode (often via emulation). When migrating to an Oracle Database, this approach requires that the source data be stored in a relational database and accessed with SQL, allowing applications to continue accessing data in the same way.

  • Application and Data Migration: Both applications and data are migrated. Applications are completely rewritten—for example, moving COBOL logic to Java or another language—thereby removing any EBCDIC dependencies and enabling the use of modern character sets and collations.

For migration to Oracle Database, regardless of whether applications process source data in a relational database or directly in files, the main globalization-related tasks are selecting the database character set for the target database and configuring the data loading process to ensure that character codes are properly converted between source and target character sets. See "Choosing a Character Set for the Target Database" and "Choosing a Source Character Set for Data Migration" for more information.

Minimal or no modification to applications may require additional configuration of the target database to ensure that, for a given, unmodified SQL query, the database returns the same rows in the same order as the source mainframe database. Refer to "Emulating EBCDIC Binary Ordering" for more information.

Certain SQL functions cannot be used in queries if those queries are intended to remain unchanged during the migration process. See "Avoiding Dependency on Character Encoding" for more details.

Because Oracle no longer supports Oracle Database Client software on IBM EBCDIC-based platforms, such as IBM z/OS, unchanged applications require a third-party translation layer to communicate through Oracle Database Client software running on an ASCII-based platform. This layer must convert between the application’s EBCDIC code page and the Oracle character set configured for the Oracle Database Client. The conversion should be compatible with that used for the initial data migration. See "Choosing a Source Character Set for Data Migration".

If applications are being translated to another programming language and framework, the translation process should include logic to adapt all programming constructs that depend on a specific character encoding to encoding-independent mechanisms, such as a character classification API. A detailed discussion of this requirement is beyond the scope of this guide.

12.1.3 Avoiding Dependency on Character Encoding

The SQL and PL/SQL statements in applications targeted for migration should not assume the use of the original EBCDIC character set. For example, the CHR function should not be called with arguments corresponding to binary codes from an EBCDIC code page. While CHR(67) returns the character 'ä' (a with diaeresis) in the EBCDIC code page 500, it returns the uppercase letter 'C' in all ASCII-based character sets. Such CHR calls must be replaced with character literals—which the translation layer (mentioned in the previous section) can automatically convert to the appropriate database character set—or updated to use the target character set's codes (for example, CHR(228) if the target database character set is WE8ISO8859P1, or CHR(50084) if using AL32UTF8).
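The CHR example above can be reproduced outside the database. This sketch uses Python's cp500 codec as an illustrative stand-in for EBCDIC code page 500 (the codec choice is an assumption for demonstration, not part of the Oracle toolchain):

```python
# Why CHR arguments tied to an EBCDIC code page break after migration,
# modeled with Python codecs standing in for the database conversion.

# Byte 67 (0x43) is 'ä' in EBCDIC code page 500 ...
assert b"\x43".decode("cp500") == "\u00e4"   # ä

# ... but 'C' in ASCII-based character sets.
assert chr(67) == "C"

# The replacement codes named in the text:
assert "\u00e4".encode("latin-1")[0] == 228  # WE8ISO8859P1: CHR(228)
utf8 = "\u00e4".encode("utf-8")              # AL32UTF8: CHR(50084)
assert int.from_bytes(utf8, "big") == 50084  # bytes 0xC3 0xA4
```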

At a minimum, the following SQL and PL/SQL function calls (or their equivalents in the source mainframe SQL dialect) must be reviewed and typically modified to support running an EBCDIC application against a database with an ASCII-based database character set: CHR, ASCII, UTL_RAW.CAST_TO_VARCHAR2, UTL_RAW.CAST_TO_RAW, and DUMP.

If an application uses hashing and encryption of character values, these operations should be performed entirely on the application side or entirely on the database server side. Hashed or encrypted values do not undergo client-server character set conversion. As a result, a value encrypted on one side and decrypted on the other may be in the wrong encoding, and hashes generated on different sides for the same string may not match.

12.2 Choosing a Character Set for the Target Database

Selecting the appropriate character set for the target Oracle Database is crucial to avoid data loss during migration. This section discusses the requirements for the target character set and outlines specific considerations for selecting the Unicode character set AL32UTF8 or one of the single-byte Oracle character sets.

This section includes the following topics:

  • Requirements for Target Character Sets

  • Migrating to AL32UTF8

  • Migrating to Legacy ASCII-Based Character Sets

12.2.1 Requirements for Target Character Sets

When planning migration, it is essential that the target Oracle Database character set supports every character present in the source mainframe data. The target set does not need to cover all possible characters in the source set—only those that are actually used. For example, control codes other than those corresponding to new line, carriage return, horizontal tab, vertical tab, form feed (new page), and escape are usually not included in business data.

A special case arises when a source character column is used to store binary data, under the data-only migration strategy. In such cases, the data migration process and the target environment must preserve all possible source byte values through round-trip conversion—any byte sequence stored in the original database must be storable in and retrievable from the target database without any loss of information.
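The round-trip requirement can be stated as a simple check: decoding any byte value with the source mapping and re-encoding it must return the original byte. The sketch below assumes Python's cp037 codec, which, like a full EBCDIC code page, assigns a character to all 256 byte values (an illustrative mapping, not Oracle's conversion code):

```python
# A byte-preserving (round-trip) mapping: every possible byte value survives
# conversion to Unicode and back without loss.

all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("cp037").encode("cp037")
assert round_tripped == all_bytes  # no byte value is lost or altered
```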

12.2.2 Migrating to AL32UTF8

The Unicode UTF-8 character encoding form, as implemented in the Oracle character set AL32UTF8, is recommended for migrating the full character repertoire of any EBCDIC code page. With its extensive set of defined characters—including a range of codes reserved for private use by applications—AL32UTF8 is the only Oracle Database character set capable of representing all characters from any EBCDIC character set. AL32UTF8 can serve as the target character set for any specific EBCDIC source system. It is also suitable for databases that consolidate application systems migrated from multiple different EBCDIC code pages and language groups.

Conversion to and from AL32UTF8 is a round-trip process that preserves all characters, except in cases where an EBCDIC code page maps more than one character code to the same Unicode character or when undefined codes are used to store binary data. For more information on these exceptional cases, see "Undefined or Duplicated Codes in a Source Character Set".

Because AL32UTF8 is a multibyte character set, single-byte applications running unchanged after migration must meet the requirements outlined in section "Running Single-Byte Client Applications with a Unicode Database" in order to work correctly with an AL32UTF8 database. To fully benefit from Unicode, applications should be rewritten using a Unicode-capable framework, such as Java or .NET.

12.2.3 Migrating to Legacy ASCII-Based Character Sets

If Unicode is not feasible—due to application limitations or certification requirements—the Oracle Database character set that most closely matches the source EBCDIC code page should be selected to minimize potential data loss during data migration and character set conversions.

Table A-15 lists the closest ASCII-based character sets for several single-byte EBCDIC code pages.

For Western European code pages that do not include the euro sign, WE8ISO8859P1 provides a complete round-trip mapping. For code pages where the euro sign (U+20AC) replaces the general currency symbol (U+00A4), no single-byte ASCII character set offers complete mapping. The available options are:

  • WE8ISO8859P1: Does not encode the euro sign.

  • WE8ISO8859P15: Encodes the euro sign, but omits several punctuation characters.

  • WE8MSWIN1252: Encodes both the euro sign and all needed printable characters, but does not encode many control codes. Most of these control codes are typically not present in character data; however, the character U+0085 (Next Line) may be used.
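The trade-offs above can be observed with Python codecs standing in for the named character sets (latin-1 for WE8ISO8859P1, iso8859-15 for WE8ISO8859P15, cp1252 for WE8MSWIN1252, and cp037/cp1140 for the EBCDIC source pages; an illustrative mapping, not the Oracle implementation):

```python
# The euro sign problem in miniature.

euro = "\u20ac"

# EBCDIC 1140 replaces the general currency sign of EBCDIC 37 (at 0x9F)
# with the euro sign.
assert b"\x9f".decode("cp037") == "\u00a4"   # ¤ general currency sign
assert b"\x9f".decode("cp1140") == euro

# ISO 8859-1 (WE8ISO8859P1) cannot encode the euro ...
try:
    euro.encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
assert not encodable

# ... while ISO 8859-15 (WE8ISO8859P15) and Windows-1252 (WE8MSWIN1252) can.
assert euro.encode("iso8859-15") == b"\xa4"
assert euro.encode("cp1252") == b"\x80"
```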

The target database character set should be chosen based on a thorough analysis of the source data. Characters not supported by the selected character set must be absent from the source data, or you must be able to replace them with other encoded characters without compromising data integrity.

Mapping the Arabic code page IBM-420 to an ASCII-based single-byte character set is particularly challenging. IBM-420 encodes the presentation forms (shapes) of Arabic letters, such as medial or final forms, while Oracle Database character sets support only the characters as they are stored in memory (logical order). Arabic text stored in Oracle Database is expected to be shaped into the correct glyphs by application GUI subsystems.

  • For data-only migration, AL32UTF8 is the recommended target character set for Arabic text in presentation forms, as it encodes these presentation forms as Unicode compatibility characters.

  • For application and data migration, applications should be re-engineered to process Arabic text using modern approaches, delegating letter shaping to the GUI layer. The data migration process should unify multiple presentation forms of a single character into its logical (memory representation) form.

12.3 Choosing a Source Character Set for Data Migration

This section describes how to select the appropriate character set for loading EBCDIC source data into the target database. It also outlines common methods for loading data and discusses potential issues that may arise during the process.

This section includes the following topics:

  • CDRA-Compatible Character Sets

  • SQL*Loader

  • External Tables

  • DBMS_LOB

  • Character Set Conversion Functions

  • Undefined or Duplicated Codes in a Source Character Set

12.3.1 CDRA-Compatible Character Sets

For accurate data migration, use character sets that map EBCDIC code pages to Unicode according to IBM's Character Data Representation Architecture (CDRA) mappings. Table A-10 lists Oracle character sets that provide these accurate mappings. Earlier Oracle implementations of EBCDIC code pages, such as WE8EBCDIC37, may map certain characters differently. For example, they map both control codes 0x15 (NL) and 0x25 (LF) to the Unicode Line Feed character U+000A, whereas character sets with CDRA-accurate mappings map 0x15 to the Next Line character U+0085.

The character sets listed in Table A-10 are client-only character sets. They can be used, as described in the following topics, to load source data from operating system files into a target Oracle Database.
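The NL-versus-LF distinction described above can be seen in Python's cp037 codec, which follows the CDRA-style mapping (used here purely to illustrate the mapping, not Oracle's conversion code):

```python
# CDRA-accurate mapping of the EBCDIC control codes NL and LF.

assert b"\x15".decode("cp037") == "\u0085"  # NL -> Next Line (U+0085)
assert b"\x25".decode("cp037") == "\n"      # LF -> Line Feed (U+000A)

# A legacy mapping, by contrast, would fold both 0x15 and 0x25 into U+000A,
# making the two control codes indistinguishable after conversion.
```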

12.3.2 SQL*Loader

SQL*Loader is the primary Oracle utility for loading data from text files. You specify the source character set for data loading in the control file, and SQL*Loader automatically converts characters from the source character set to the target database character set.

Although you can also specify the source character set for SQL*Loader input files using the NLS_LANG environment variable, do not specify EBCDIC character sets in this variable when running on ASCII-based platforms.

See Also:

Oracle AI Database Utilities for more information about SQL*Loader and the syntax of control files

Example 12-1 SQL*Loader Control File

The following control file declares the source character set of text in the file customer_data.dat as IBM1140. The remainder of the control file should specify the file structure, such as field names, data types, and separators used as column delimiters.

LOAD DATA
CHARACTERSET IBM1140
INFILE 'customer_data.dat'
...

12.3.3 External Tables

External tables that use the ORACLE_LOADER access driver allow data in text files to be queried using SQL as if the data were rows in a relational table. With this approach, the data remains in the external file and is not physically loaded into the database. This method enables complex transformations to be performed on the data before it is made available to database applications.

See Also:

Oracle AI Database Utilities for more information about external tables that use the ORACLE_LOADER access driver

Example 12-2 External Table Definition

The following CREATE TABLE statement creates an external table as a proxy for the contents of the file customer_data.dat, which is located in the data_dir database directory. The character data in the file is encoded using the IBM273 character set. Database directories are created with the SQL statement CREATE DIRECTORY, making operating system directories accessible to SQL statements. Column and field definitions are omitted for brevity.

CREATE TABLE ext_customer_data (
  ...
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    CHARACTERSET IBM273
    ...
  )
  LOCATION ('customer_data.dat')
)

12.3.4 DBMS_LOB

The Oracle-supplied PL/SQL package DBMS_LOB provides APIs for loading text files into a CLOB column, either from an operating system file mapped to a BFILE or from a BLOB column.

The function LOADCLOBFROMFILE transfers data from a text file represented by a BFILE to a CLOB, performing necessary character set conversion.

DBMS_LOB.LOADCLOBFROMFILE (
   dest_lob       IN OUT NOCOPY   CLOB CHARACTER SET ANY_CS, 
   src_bfile      IN              BFILE, 
   amount         IN              INTEGER, 
   dest_offset    IN OUT          INTEGER, 
   src_offset     IN OUT          INTEGER,
   bfile_csid     IN              NUMBER,
   lang_context   IN OUT          INTEGER,
   warning        OUT             INTEGER);

The bfile_csid parameter lets you specify the numeric ID of the character set used in the file being loaded. The file’s contents are converted from the specified character set to the database character set. You can map character set names to their corresponding numeric IDs using the SQL function NLS_CHARSET_ID.

The function CONVERTTOCLOB copies data from a binary LOB (BLOB) to a character LOB (CLOB) with the necessary character set conversion. This function enables data originally loaded as binary, without interpretation, to be converted to text in the database character set.

DBMS_LOB.CONVERTTOCLOB(
   dest_lob       IN OUT NOCOPY  CLOB CHARACTER SET ANY_CS,
   src_blob       IN             BLOB,
   amount         IN             INTEGER,
   dest_offset    IN OUT         INTEGER,
   src_offset     IN OUT         INTEGER, 
   blob_csid      IN             NUMBER,
   lang_context   IN OUT         INTEGER,
   warning        OUT            INTEGER);

The blob_csid parameter lets you specify the numeric ID of the character set used for the text data stored in the BLOB value. The contents of the BLOB are then converted from the specified character set to the database character set.

See Also:

Oracle AI Database SecureFiles and Large Objects Developer's Guide for more information about using DBMS_LOB

Example 12-3 Loading a File into a CLOB Column

This example demonstrates the procedure load_product_description, which accepts a product ID and the name of a file located in the database directory description_dir, and loads the file into the product row in the products table. The character set of the file is IBM1140.

CREATE TABLE products
( product_id NUMBER
, product_name VARCHAR2(100 CHAR)
, product_description CLOB
)
/
CREATE OR REPLACE DIRECTORY description_dir
    AS '/scratch/import/products/descriptions'
/
CREATE OR REPLACE PROCEDURE load_product_description
( p_id NUMBER, p_description_file VARCHAR2 )
AS
  column_clob  CLOB;
  descr_file   BFILE := BFILENAME('DESCRIPTION_DIR',p_description_file);
  dst_off      INTEGER := 1;
  src_off      INTEGER := 1;
  lang_ctx     INTEGER := 0;
  warn         INTEGER := 0;
BEGIN
  UPDATE products
     SET product_description = EMPTY_CLOB()
   WHERE product_id = p_id
   RETURNING product_description INTO column_clob;

  DBMS_LOB.FILEOPEN(descr_file);
  DBMS_LOB.LOADCLOBFROMFILE (
     dest_lob     => column_clob, 
     src_bfile    => descr_file, 
     amount       => DBMS_LOB.LOBMAXSIZE, -- whole file
     dest_offset  => dst_off,
     src_offset   => src_off,
     bfile_csid   => NLS_CHARSET_ID('IBM1140'),
     lang_context => lang_ctx, -- unused in this example
     warning      => warn);
  DBMS_LOB.FILECLOSE(descr_file);
  COMMIT;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    NULL;
  WHEN OTHERS THEN
    IF DBMS_LOB.FILEISOPEN(descr_file) = 1 THEN
      DBMS_LOB.FILECLOSE(descr_file);
    END IF;
    RAISE;
END;
/

12.3.5 Character Set Conversion Functions

The Oracle-supplied PL/SQL packages UTL_I18N and UTL_RAW provide functions that allow a value of the data type RAW to be interpreted as a string in a specified character set and then:

  • converted to a string in another specified character set and returned as a RAW value, or

  • converted to a VARCHAR2 value in the database character set.

These functions can be used with the client-only EBCDIC character sets listed in Table A-10 to implement custom character set conversion processing as needed by applications. For example, EBCDIC files can be loaded as binary (RAW) values and then processed and transformed as necessary into character values in the database character set.

UTL_RAW.CONVERT(
   r            IN RAW,
   to_charset   IN VARCHAR2,
   from_charset IN VARCHAR2) 
  RETURN RAW;

UTL_I18N.RAW_TO_CHAR(
   data          IN RAW,
   src_charset   IN VARCHAR2 DEFAULT NULL)
 RETURN VARCHAR2;

The following function can convert a VARCHAR2 value from the database character set to a specified target character set and return it as a RAW value. For example, the returned value can be written to a file using the Oracle-supplied procedure UTL_FILE.PUT_RAW.

UTL_I18N.STRING_TO_RAW( 
   data          IN VARCHAR2 CHARACTER SET ANY_CS,
   dst_charset   IN VARCHAR2 DEFAULT NULL)
RETURN RAW;

The parameters from_charset and src_charset specify the name of the source character set, while to_charset and dst_charset specify the name of the target character set in these conversion functions.
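Conceptually, these functions reinterpret raw bytes in one character set and re-encode them in another. The following Python sketch models UTL_RAW.CONVERT for single-byte character sets (the convert function is a hypothetical analogue for illustration, not an Oracle API; cp037 and latin-1 stand in for an EBCDIC source and an ASCII-based target):

```python
# A rough analogue of UTL_RAW.CONVERT: interpret raw bytes in a source
# character set and re-encode them in a target character set.

def convert(r: bytes, to_charset: str, from_charset: str) -> bytes:
    """Hypothetical model of UTL_RAW.CONVERT for single-byte charsets."""
    return r.decode(from_charset).encode(to_charset)

# 'Hello' encoded in EBCDIC code page 37 ...
ebcdic_raw = "Hello".encode("cp037")
assert ebcdic_raw == b"\xc8\x85\x93\x93\x96"

# ... converted to an ASCII-based character set (here, ISO 8859-1).
assert convert(ebcdic_raw, "latin-1", "cp037") == b"Hello"
```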

See Also:

Oracle AI Database PL/SQL Packages and Types Reference for more information about the UTL_I18N and UTL_RAW packages

12.3.6 Undefined or Duplicated Codes in a Source Character Set

There are two potential issues with source character encoding that must be addressed before migrating data to a target database.

Undefined Codes

The character set (EBCDIC code page) used as the encoding for source data may include undefined codes—bytes that do not correspond to any character. For example, EBCDIC code page 424 (used for Hebrew) does not assign characters to thirty-eight byte values (0x70, 0x72–0x73, 0x75–0x77, 0x80, 0x8C–0x8E, 0x9A–0x9C, 0x9E, 0xAA–0xAE, 0xCB–0xCF, 0xDB–0xDF, 0xEB–0xEF, and 0xFB–0xFE). Applications may still store data containing these bytes, for example, by inappropriately storing binary data in character fields. Character values containing such unassigned bytes cannot be converted to another character set using this code page as the source.
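Python's cp424 codec leaves the same positions unassigned and can be used to demonstrate the failure (an illustrative stand-in for the code page 424 mapping, assuming byte 0x70 is unassigned as in the IBM mapping tables):

```python
# Undefined EBCDIC codes break conversion: bytes with no assigned character
# cannot be mapped to any target character set.

try:
    b"\x70".decode("cp424")  # 0x70 is unassigned in EBCDIC code page 424
    decoded = True
except UnicodeDecodeError:
    decoded = False
assert not decoded  # conversion of the unassigned byte fails
```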

The solution depends on the migration strategy:

  • Data-only migration: A customized source character set must be defined to allow round-trip conversion of all byte values between the source and target character sets. Typically, the third-party translation layer between the unmodified application and the Oracle Database Client is the best place to apply such a customized character set. With this approach, you must first use a third-party utility to convert EBCDIC files to the target ASCII-based character set using the customized character set. The resulting files can then be loaded into Oracle using standard utilities, without further character set conversion.

  • Application rewriting: If applications are rewritten as part of migration, any binary data currently stored in character fields should be moved to binary data types (such as RAW or BLOB) and processed accordingly. A customized source character set may still be required if the unassigned bytes are used for user-specific characters rather than binary data.

Duplicated Codes

A second problem occurs when the EBCDIC source character set maps more than one code to the same Unicode character. For example, EBCDIC code page 918 (used for Arabic) maps both 0x47 and 0x48 to Unicode character U+FE8E. In a conversion cycle to any target database character set and back, only one of these codes will be preserved; the distinction between them is lost. While this may be insignificant for standard character data, it is a critical issue for binary data incorrectly stored in character fields. For applications running unchanged in EBCDIC emulation mode, bytes of binary data must remain unaltered in a round-trip conversion to and from the target database.

The solutions to this duplicated codes issue are essentially the same as those described for undefined codes above.

12.4 Emulating EBCDIC Binary Ordering

This section describes how to use emulated binary collations to ensure that applications observe the same ordering of character values in both the target and source databases.

This section covers the following topics:

  • Dependency on Binary Ordering

  • Emulated Binary Collations

  • Deploying Emulated Binary Collations

12.4.1 Dependency on Binary Ordering

Applications on mainframe systems may use SQL queries with predicates that rely on specific binary ordering of character values—ordering that depends on the underlying character set. For example, the following query is intended to select products with numeric product codes, filtering out those with alphanumeric codes (assuming alphanumeric codes start with a letter):

SELECT product_id, product_name
  FROM products
 WHERE product_code >= '0'

This query takes advantage of the fact that, in EBCDIC-based character sets, all basic Latin letters have character codes numerically lower than digits, and thus appear before digits in binary order. As a result, the query returns only products with numeric codes. However, in ASCII-based character sets, letters have character codes higher than those of digits, so the same query would return products with both numeric and alphanumeric codes.
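The effect of the two binary orders on this predicate can be simulated by comparing encoded byte strings. The sketch below uses Python's cp037 codec as an illustrative EBCDIC code page 37 mapping (an assumption for demonstration, not the database's comparison logic):

```python
# Filtering product codes with >= '0' under ASCII and EBCDIC byte order.

codes = ["1001", "A100", "2002", "B200"]

# ASCII binary order: digits sort below letters, so every code passes.
assert [c for c in codes if c >= "0"] == ["1001", "A100", "2002", "B200"]

# EBCDIC binary order: letters sort below digits, so only numeric
# codes remain -- the behavior the legacy query relies on.
ebcdic_result = [c for c in codes
                 if c.encode("cp037") >= "0".encode("cp037")]
assert ebcdic_result == ["1001", "2002"]
```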

12.4.2 Emulated Binary Collations

To ensure that comparison predicates return the same results in the target Oracle Database as in the source database, you can configure these predicates to use an emulated binary collation corresponding to the source EBCDIC character set.

Emulated binary collations were introduced in Oracle AI Database 26ai, Release Update 23.26.2. For more information, see "Using Emulated Binary Collations". Table A-20 lists all supported emulated binary collations.

Collation keys generated by emulated binary collations do not exceed the length of the original character values. As a result, linguistic indexes based on these collations have sizes comparable to standard binary indexes. Linguistic indexes—functional indexes created on the expression NLSSORT(column, 'NLS_SORT=collation_name')—may be required to support efficient execution plans for predicates that use emulated binary collations. Standard binary indexes cannot be used for this purpose.

An emulated binary collation should be used for data migrated from its base character set. If characters not supported by this character set are added to the database, they should be processed with modernized applications that apply linguistic collations appropriate for user expectations.

12.4.3 Deploying Emulated Binary Collations

There are two primary methods to ensure that query predicates use a given emulated binary collation.

Legacy Session Parameters

Since Oracle Database 10g Release 2 (10.2), you can set the NLS_COMP session parameter to LINGUISTIC and the NLS_SORT session parameter to the name of the desired collation (for example, BIN_IBM37). However, this approach applies the selected collation to all character processing in every query and DML statement within the session. Emulated binary collations require more resources than the standard BINARY collation. For data where EBCDIC ordering is not required (such as data compared only for equality), the standard BINARY collation should be used.

An example of this method is shown in Example 5-16.

Column-Level (Data-Bound) Collations

Available since Oracle Database 12c Release 2 (12.2), column-level collation declarations allow you to apply a specific collation only to character operations on selected columns. This feature, called data-bound collation, enables more granular control. Refer to "Column-Level Collation and Case Sensitivity" for prerequisites and restrictions regarding data-bound collations.

Example 12-4 demonstrates the use of column-level collation declarations with queries from Example 5-16.

Example 12-4 Declaring Column Collation BIN_IBM37

This example illustrates the difference between binary ordering on ASCII- and EBCDIC-based platforms using the emulated binary collation for the EBCDIC code page IBM37. It demonstrates how to control collation using column-level data-bound collation declarations. Confirmation messages from SQL*Plus are omitted for brevity.

SQL> ALTER SESSION SET NLS_COMP = LINGUISTIC;
SQL> ALTER SESSION SET NLS_SORT = FRENCH;
SQL> CREATE TABLE emp4
  2  (id NUMBER, ename VARCHAR2(100) COLLATE BINARY);
SQL> INSERT INTO emp4 VALUES (1, 'Clara');
SQL> INSERT INTO emp4 VALUES (2, 'Eva');
SQL> INSERT INTO emp4 VALUES (3, 'Cindy');
SQL> INSERT INTO emp4 VALUES (4, 'de Vil');
SQL> INSERT INTO emp4 VALUES (5, 'Barbara');
SQL> REM Data sorted using binary sort order, not FRENCH
SQL> SELECT ename FROM emp4 WHERE ename > '0' ORDER BY ename;

ENAME
--------------
Barbara
Cindy
Clara
Eva
de Vil

SQL> DROP TABLE emp4 PURGE;
SQL> CREATE TABLE emp4
  2  (id NUMBER, ename VARCHAR2(100) COLLATE BIN_IBM37);
SQL> INSERT INTO emp4 VALUES (1, 'Clara');
SQL> INSERT INTO emp4 VALUES (2, 'Eva');
SQL> INSERT INTO emp4 VALUES (3, 'Cindy');
SQL> INSERT INTO emp4 VALUES (4, 'de Vil');
SQL> INSERT INTO emp4 VALUES (5, 'Barbara');
SQL> REM Data sorted using BIN_IBM37 sort order, not FRENCH
SQL> SELECT ename FROM emp4 WHERE ename > '0' ORDER BY ename;

no rows selected

SQL> SELECT ename FROM emp4 WHERE ename > 'a' ORDER BY ename;

ENAME
--------------
de Vil
Barbara
Cindy
Clara
Eva