I Globalization Support in the Directory

Oracle Internet Directory uses Globalization Support to store, process and retrieve data in native languages. It ensures that Oracle Internet Directory utilities and error messages automatically adapt to the native language and locale.

This chapter discusses Globalization Support as used by Oracle Internet Directory and describes the NLS_LANG environment variable settings required for the various components and tools in an Oracle Internet Directory environment.

See Also:

Section 3.8, "Globalization Support" before configuring Globalization Support.

This chapter contains these topics:

Section I.1, "About Character Sets and the Directory"
Section I.2, "The NLS_LANG Environment Variable"
Section I.3, "Using Non-AL32UTF8 Databases"
Section I.4, "Using Globalization Support with LDIF Files"
Section I.5, "Using Globalization Support with Command-Line LDAP Tools"
Section I.6, "Setting NLS_LANG in the Client Environment"
Section I.7, "Using Globalization Support with Bulk Tools"

I.1 About Character Sets and the Directory

When computer systems process characters, they use numeric codes instead of the graphical representation of the character. For example, when the database stores the letter A, it actually stores a numeric code that is interpreted by software as the letter.

A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as a character set. Each encoded character set assigns a unique code to each character in the set. For example, in the ASCII encoding scheme, the character code of the first character of the English uppercase alphabet is 0x41; in the EBCDIC encoding scheme, it is 0xC1.

The computer industry uses many encoded character sets. These character sets can differ in the number and types of characters available and in many other ways as well.

When you create a database, you specify an encoded character set. Choosing a character set determines, among other things, what languages can be represented in the database.

Oracle supports most national, international, and vendor-specific encoded character set standards.

This section contains the following topics:

Section I.1.1, "About Unicode"
Section I.1.2, "About Oracle and UTF-8"
Section I.1.3, "Migration from UTF8 to AL32UTF8 when Upgrading Oracle Internet Directory"

I.1.1 About Unicode

No single character set contains enough characters to meet day-to-day e-business requirements. For example, no one national character set can represent all the languages in the European Union. Moreover, there are potential conflicts between character sets because the same character can be represented by different codes in different character sets.

To overcome these obstacles, a global character set, called Unicode, was developed. It is a universal encoded character set that can store information from any language including punctuation marks, diacritics, mathematical symbols, technical symbols, musical symbols, and so forth. As of version 3.2, the Unicode Standard supports over 95,000 characters from the world's alphabets, ideograph sets, and symbol collections. It includes 45,000 supplementary characters, most of which are Chinese, Japanese, and Korean characters that are rarely used but nevertheless need representation in electronic documentation.

Unicode has more than one implementation standard, and these are described in Table I-1.

Table I-1 Unicode Implementations

Implementation	Description
UTF-8	A variable-width 8-bit encoding of Unicode. One Unicode character can be one, two, three, or four bytes. Characters from European scripts are represented in one or two bytes. Those from Asian scripts are represented in three bytes, and supplementary characters are represented in four.
UCS-2	A fixed-width 16-bit encoding of Unicode in which each character, regardless of the script, is two bytes.
UTF-16	The 16-bit encoding of Unicode. It is an extension of UCS-2 that supports the supplementary characters added in Unicode 3.1. One character can be two or four bytes. Characters from European and Asian scripts are represented in two bytes, and supplementary characters are represented in four.
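The byte widths listed in Table I-1 can be checked on most UNIX-like systems. The following is a minimal sketch, assuming a script saved in UTF-8 and the standard iconv and wc utilities. The byte_len function is a hypothetical helper, not an Oracle tool, and UTF-16BE is used so that no byte-order mark is counted.

```shell
# byte_len CHAR ENCODING: print how many bytes CHAR occupies in ENCODING.
# Hypothetical helper for illustration; this script must be saved as UTF-8.
byte_len() {
  printf '%s' "$1" | iconv -f UTF-8 -t "$2" | wc -c | tr -d ' '
}

# A (U+0041), é (U+00E9), 中 (U+4E2D), and the supplementary character 𠀀 (U+20000)
for ch in 'A' 'é' '中' '𠀀'; do
  echo "$ch: UTF-8=$(byte_len "$ch" UTF-8) bytes, UTF-16=$(byte_len "$ch" UTF-16BE) bytes"
done
```

Under these assumptions, the four characters occupy 1, 2, 3, and 4 bytes in UTF-8, and 2, 2, 2, and 4 bytes in UTF-16.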


I.1.2 About Oracle and UTF-8

Oracle began supporting Unicode as a database character set in Oracle database version 7. With Oracle9i, Oracle added a new UTF-8 character set called AL32UTF8. This database character set supports the latest version of Unicode (3.2), including the latest supplementary characters. Oracle intends to enhance AL32UTF8 as necessary to support future versions of the Unicode standard.

I.1.3 Migration from UTF8 to AL32UTF8 when Upgrading Oracle Internet Directory

Oracle Internet Directory now supports AL32UTF8. If you have upgraded Oracle Internet Directory from a version before 10g (10.1.4.0.1), then, for better performance, Oracle recommends that you change the character set for the directory database from UTF8 to AL32UTF8. To do this:

Run the character set scanner (CSSCAN) to ensure that there are no invalid UTF8 characters inside your current database.
Run the CSALTER script to update the database to AL32UTF8.

See Also:

The chapter on character set migration in the Oracle Database Globalization Support Guide in the Oracle Database Documentation Library

I.2 The NLS_LANG Environment Variable

The NLS_LANG parameter has three components—language, territory, and charset—in the form:

NLS_LANG = language_territory.charset

Each component controls the operation of a subset of Globalization Support features.

Components of the NLS_LANG parameter are shown in Table I-2.

Table I-2 Components of the NLS_LANG Parameter

Component	Description
`language`	Specifies conventions such as the language used for Oracle messages, day names, and month names. Each supported language has a unique name—for example, American English, French, or German. If language is not specified, the value defaults to American English. See Also: Oracle Database Globalization Support Guide in the Oracle Database Documentation Library for a complete list of languages
`territory`	Specifies conventions such as the default calendar, collation, date, monetary, and numeric formats. Each supported territory has a unique name; for example, America, France, or Canada. If territory is not specified, the value defaults to America. See Also: Oracle Database Globalization Support Guide in the Oracle Database Documentation Library for a complete list of territories
`charset`	Specifies the character set used by the client application (normally that of the user's terminal). Each supported character set has a unique acronym, for example, WE8MSWIN1252, JA16SJIS, or AL32UTF8. See Also: Oracle Database Globalization Support Guide in the Oracle Database Documentation Library for a complete list of character sets

You can set NLS_LANG as an environment variable at the command line. The following are examples of legal values for NLS_LANG:

AMERICAN_AMERICA.AL32UTF8
JAPANESE_JAPAN.AL32UTF8
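As a minimal sketch, assuming a POSIX shell, the variable can be exported and then split into its three components with parameter expansion. The variable names language, territory, and charset are illustrative only, and the splitting assumes that all three components are present.

```shell
# Set NLS_LANG for the current shell session and for any tools started from it.
export NLS_LANG="JAPANESE_JAPAN.AL32UTF8"

# Split language_territory.charset into its parts (assumes all three are present).
charset=${NLS_LANG#*.}        # text after the first period
lang_terr=${NLS_LANG%%.*}     # text before the first period
language=${lang_terr%%_*}     # text before the first underscore
territory=${lang_terr#*_}     # text after the first underscore

echo "language=$language territory=$territory charset=$charset"
```

With the value shown, this prints language=JAPANESE territory=JAPAN charset=AL32UTF8.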

I.3 Using Non-AL32UTF8 Databases

You can run the Oracle directory server and database tools on a non-AL32UTF8 database, but be sure that all characters in the client character set are included in the database character set (with the same or different codes). Otherwise, you can lose data during ldapadd, ldapdelete, ldapmodify, or ldapmoddn operations. For example, suppose that you perform an ldapadd operation using a multibyte character set on an underlying database that uses only single-byte characters. You will lose data because not all of the bytes you enter are accepted by the database.
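One way to anticipate this kind of loss is to test whether sample client text survives a round trip through the target character set. The following sketch uses the standard iconv utility as a stand-in for the database conversion; it is not how Oracle Internet Directory performs the conversion. The survives_roundtrip function is a hypothetical helper, and the encoding names are iconv names rather than Oracle character set names.

```shell
# survives_roundtrip TEXT CHARSET: succeed only if TEXT (given in UTF-8) can be
# converted to CHARSET and back again without change. Hypothetical helper.
survives_roundtrip() {
  back=$(printf '%s' "$1" | iconv -f UTF-8 -t "$2" 2>/dev/null \
         | iconv -f "$2" -t UTF-8 2>/dev/null)
  [ "$back" = "$1" ]
}

# Check a Western European string and a Chinese string against a single-byte set.
for sample in 'café' '中文'; do
  if survives_roundtrip "$sample" ISO-8859-1; then
    echo "$sample: representable in ISO-8859-1"
  else
    echo "$sample: would be lost or altered in ISO-8859-1"
  fi
done
```

A string that fails this check contains characters that the single-byte character set cannot hold, which is the situation described above.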

I.4 Using Globalization Support with LDIF Files

See Also:

"LDIF file formatting rules and examples" in Oracle Fusion Middleware Reference for Oracle Identity Management

Attribute type names are always ASCII strings that cannot contain multibyte characters. Oracle Internet Directory does not support multibyte characters in attribute type names. However, Oracle Internet Directory does support attribute values containing multibyte characters such as those in the simplified Chinese (ZHS16GBK) character set.

Attribute values can be encoded in different ways to allow Oracle Internet Directory tools to interpret them properly. There are two scenarios, described in the following sections:

Section I.4.1, "An LDIF file Containing Only ASCII Strings"
Section I.4.2, "An LDIF file Containing UTF-8 Encoded Strings"

I.4.1 An LDIF file Containing Only ASCII Strings

In this scenario, character strings for attribute values are also in ASCII.

Because all tools use the UTF-8 character set by default, and ASCII is a proper subset of UTF-8, all tools can interpret these files. The same is true of keyboard input of values that are simply ASCII strings.

I.4.2 An LDIF file Containing UTF-8 Encoded Strings

In this scenario, character strings for attribute values are also in UTF-8.

Because, by default, all tools use the UTF-8 character set, all tools can interpret these files. The same is true of keyboard input of values that are UTF-8 strings.

In such a file, some characters may be multibyte. Multibyte character strings can be present in the LDIF files as attribute values or given as keyboard input. They can be encoded in their native character set or in UTF-8. They can also be BASE64 encoded representations of either the native or the UTF-8 string.

Consider the following cases:

CASE 1: Native Strings (Non-UTF-8)
CASE 2: UTF-8 Strings
CASE 3: BASE64 Encoded UTF-8 Strings
CASE 4: BASE64 Encoded Native Strings

Because the directory server understands and expects only UTF-8 encoded strings, cases 1, 3, and 4 need to undergo conversion to UTF-8 strings before they can be sent to the LDAP server.

I.4.2.1 CASE 1: Native Strings (Non-UTF-8)

If NLS_LANG is not set, use the -E character_set argument with the command-line LDAP tools ldapadd, ldapaddmt, ldapbind, ldapcompare, ldapmoddn, ldapmodify, ldapdelete, and ldapsearch. You do not need to use the -E character_set argument if NLS_LANG is set.

This example converts simplified Chinese native strings to UTF-8. The baseDN can be a simplified Chinese string:

ldapsearch -h my_host -D "cn=orcladmin" -q -p 3060 -E ".ZHS16GBK" -b base_DN \
   -s base  "objectclass=*"

Use the encode="character_set" argument with the command-line bulk tools bulkload, bulkmodify, bulkdelete, and ldifwrite, where character_set is the character set used in the LDIF file. Set NLS_LANG to the character set used by the directory's database.

I.4.2.2 CASE 2: UTF-8 Strings

No conversion is required.

I.4.2.3 CASE 3: BASE64 Encoded UTF-8 Strings

You do not need to use the -E character_set argument with the LDAP tools, even if NLS_LANG is not set.

You do not need to use the encode="character_set" argument with the command-line bulk tools. Oracle Internet Directory tools automatically decode BASE64 encoded UTF-8 strings to UTF-8 strings.

I.4.2.4 CASE 4: BASE64 Encoded Native Strings

Oracle Internet Directory tools automatically decode BASE64 encoded native strings to simple native strings. The native strings are then converted to the equivalent UTF-8 strings.

Note:

In any given input file, only one character set may be used.
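The decoding described for cases 3 and 4 can be reproduced by hand to inspect a single value. This is a minimal sketch using the standard base64 and iconv utilities, not the Oracle tools themselves. The sample values are BASE64 encodings of the two-character Chinese string 中文, and GBK is the iconv name assumed here to correspond to ZHS16GBK.

```shell
# CASE 3: BASE64 of a UTF-8 string. Decoding alone yields the UTF-8 string.
utf8_value=$(printf '%s' '5Lit5paH' | base64 -d)

# CASE 4: BASE64 of a native (GBK) string. Decode first, then convert to UTF-8.
native_value=$(printf '%s' '1tDOxA==' | base64 -d | iconv -f GBK -t UTF-8)

echo "case 3: $utf8_value"
echo "case 4: $native_value"
```

Both lines print the same string, because the two inputs differ only in the character set that was BASE64 encoded.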

I.5 Using Globalization Support with Command-Line LDAP Tools

The Oracle Internet Directory command-line tools read keyboard input or LDIF file input in the following ways:

ASCII characters only
Non-ASCII input (native language character set)
BASE64 encoded values of UTF-8 or native strings (from LDIF file only)

If the character set being given as input from an LDIF file or keyboard is not UTF-8, then the command-line tools need to convert the input into UTF-8 format before sending it to the LDAP server.

You enable the command-line tools to convert the input into UTF-8 by specifying the -E character_set argument with any of the command-line LDAP tools. Use the encode="character_set" argument with bulkload, bulkmodify, bulkdelete, and ldifwrite.

This section contains these topics:

Section I.5.1, "Specifying the -E Argument When Using Each Tool"
Section I.5.2, "Examples: Using the -E Argument with Command-Line LDAP Tools"

I.5.1 Specifying the -E Argument When Using Each Tool

The client tools always assume UTF-8 (the Oracle character set name is AL32UTF8) to be the character set unless otherwise specified by the -E argument. The BASE64-encoded values are decoded, and then the decoded buffer is converted to UTF-8 if the -E argument is specified. For example, if you specify -E ".ZHS16GBK", then the decoded buffer is converted from simplified Chinese GBK to Unicode UTF-8 before being sent to the directory server.

Specifying the -E argument ensures that proper character set conversion can occur from the character set you specify for the -E argument (-E ".character_set") to the AL32UTF8 character set.

The command-line tools use the -E argument to process the input in the character set specified for the -E argument. They display their output in the character set specified in the NLS_LANG environment variable.

For example, to add entries from an LDIF file encoded in the simplified Chinese character set (ZHS16GBK) by using ldapadd, type:

ldapadd -h myhost -p 3060 -E ".ZHS16GBK" -D cn=orcladmin -q -f my_ldif_file

In this example, the ldapadd tool converts the characters from ".ZHS16GBK" (simplified Chinese character set) to ".AL32UTF8" before they are sent across the wire to the directory server.
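Whether the -E argument is needed depends on whether the input is already UTF-8. A quick pre-check, sketched here with the standard iconv utility rather than any Oracle tool, is to ask iconv to read the file as UTF-8 and see whether it reports an invalid sequence. The is_utf8 function and the sample file are hypothetical.

```shell
# is_utf8 FILE: succeed if FILE contains only valid UTF-8 byte sequences.
# Hypothetical helper; plain ASCII files also pass, since ASCII is a subset of UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Build a small sample whose attribute value is GBK-encoded (bytes D6 D0 CE C4).
sample=$(mktemp)
printf 'cn: \326\320\316\304\n' > "$sample"

if is_utf8 "$sample"; then
  echo "already UTF-8: the -E argument is not needed"
else
  echo "not UTF-8: pass the file's character set with -E"
fi
rm -f "$sample"
```

The check only detects input that is not valid UTF-8. It cannot tell which native character set the file uses, so that value must still be known when you supply -E.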

I.5.2 Examples: Using the -E Argument with Command-Line LDAP Tools

Table I-3 provides additional examples of how to use the -E argument correctly for each command-line tool. In each example, the command converts data from simplified Chinese, as specified by the value ".ZHS16GBK", to AL32UTF8. For example, in each command, the value for the -D option and the password typed at the prompt when -q is specified are in GBK. Specifying the -E argument converts them to UTF-8.

Note that, in the examples in Table I-3, we do not show any actual characters belonging to the .ZHS16GBK character set. These examples would, therefore, work without the -E argument. However, if the argument values contained actual characters in the .ZHS16GBK character set, then we would need to use the -E argument.

See Also:

"Oracle Internet Directory Administration Tools" in Oracle Fusion Middleware Reference for Oracle Identity Management for syntax and usage notes for each of the command-line tools

Table I-3 Examples: Using the -E Argument with Command-Line Tools

Tool	Example
`ldapbind`	ldapbind -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -q
`ldapsearch`	ldapsearch -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -q
`ldapadd`	ldapadd -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -q
`ldapaddmt`	ldapaddmt -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -q
`ldapmodify`	ldapmodify -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -q
`ldapmodifymt`	ldapmodifymt -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -q
`ldapdelete`	ldapdelete -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -q
`ldapcompare`	ldapcompare -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -b "ou=Construction,ou=Manufacturing,o=example,c=us" -a title -v manager -q
`ldapmoddn`	ldapmoddn -h my_host -p 3060 -E ".ZHS16GBK" -D "o=example,c=us" -b "cn=Frank Smith,ou=Construction,ou=Manufacturing,c=us, o=example" -N "ou=Contracting,ou=Manufacturing,o=example,c=us" -r -q

I.6 Setting NLS_LANG in the Client Environment

If the output required by the client is UTF-8, then you do not need to set the NLS_LANG environment variable. In this case, the character set component of the NLS_LANG environment variable defaults to AL32UTF8, and both the input path from client to server, and the output path from server to client, do not require any character set conversion.

If the output required by the client is not UTF-8, then you must set the NLS_LANG environment variable. This ensures that proper character set conversion can occur from the AL32UTF8 character set to the character set required by the client.

For example, if the NLS_LANG environment variable is set to the simplified Chinese character set, then the command-line tool displays output in that character set. Otherwise the output defaults to the AL32UTF8 character set.

Note:

If you are using Microsoft Windows, then, to use the command-line tools after server startup, you must reset NLS_LANG in an MS-DOS window. Set it to the character set that matches the code page of your MS-DOS session. AL32UTF8 cannot be used. See the Oracle Database Installation Guide for Microsoft Windows for more information on which character set to use for command-line tools in an MS-DOS session.

If you are using a pre-installed Oracle Database with Oracle Internet Directory, then you must also set the database character set to AL32UTF8.

See Also: Oracle Database Globalization Support Guide in the Oracle Database Documentation Library and Oracle Database Installation Guide for Microsoft Windows

Be careful not to change the NLS_LANG parameter value in the registry.

I.7 Using Globalization Support with Bulk Tools

Oracle Internet Directory ensures that the reading and writing of text data from and to LDIF files are done in UTF-8 encoding as specified by the LDAP standard.

This section provides examples of the argument you use for each of the bulk tools. It contains the following sections:

Section I.7.1, "Using Globalization Support with bulkload"
Section I.7.2, "Using Globalization Support with ldifwrite"
Section I.7.3, "Using Globalization Support with bulkdelete"
Section I.7.4, "Using Globalization Support with bulkmodify"

See Also:

"Oracle Internet Directory Administration Tools" in Oracle Fusion Middleware Reference for Oracle Identity Management for a list of arguments for each bulk tool

I.7.1 Using Globalization Support with bulkload

Add the argument encode="character_set" to the command, where character_set is the character set in which the input LDIF file is encoded.

For example, ensure that ORACLE_INSTANCE is set, then type:

bulkload connect="connect_string" \
   encode=".ZHS16GBK" check="TRUE" generate="TRUE" file="my_ldif_file"

I.7.2 Using Globalization Support with ldifwrite

The ldifwrite utility always writes BASE64 encoded values for multibyte strings.

The BASE64 encoding could be of the UTF-8 strings as they are stored in the directory server, or of native strings as specified by the NLS_LANG environment variable setting when running ldifwrite.

For example:

ldifwrite connect="connect_string" baseDN="baseDN" ldiffile="output_file"

In this example, if the NLS_LANG environment variable is not set, or is set to language_territory.AL32UTF8, then the output LDIF file will contain BASE64-encoded UTF-8 strings for any multibyte characters.

For information about loading this LDIF file into a directory, see "CASE 3: BASE64 Encoded UTF-8 Strings".

If the NLS_LANG environment variable is set to a character set other than AL32UTF8—for example, ".ZHS16GBK"—then the output LDIF file will contain a BASE64 encoded value of simplified Chinese GBK strings.

For information about loading this LDIF file into the directory, see "CASE 4: BASE64 Encoded Native Strings".
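To read such a file, the BASE64 values can be decoded outside the directory tools. The following is a rough sketch, assuming the standard base64 and iconv utilities. The decode_ldif function is a hypothetical helper, its argument is an iconv encoding name (for example GBK, assumed here to correspond to ZHS16GBK), and values folded across several lines are not handled.

```shell
# decode_ldif CHARSET: read LDIF on standard input and print it with every
# BASE64 value ("attr:: value") decoded and converted from CHARSET to UTF-8.
# Hypothetical helper; does not handle values folded across several lines.
decode_ldif() {
  while IFS= read -r line; do
    case $line in
      *":: "*)
        attr=${line%%::*}
        value=${line#*":: "}
        printf '%s: %s\n' "$attr" \
          "$(printf '%s' "$value" | base64 -d | iconv -f "$1" -t UTF-8)"
        ;;
      *) printf '%s\n' "$line" ;;
    esac
  done
}

# The same value as it might be written with UTF-8 output and with GBK output.
printf 'cn:: 5Lit5paH\n' | decode_ldif UTF-8
printf 'cn:: 1tDOxA==\n' | decode_ldif GBK
```

The character set passed to the helper has to match the NLS_LANG setting that was in effect when ldifwrite produced the file.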

I.7.3 Using Globalization Support with bulkdelete

Add encode="character_set" to the command.

For example:

bulkdelete connect="connect_string"  encode=".ZHS16GBK" \
    baseDN="ou=manufacturing,o=example,c=us"

In this case, the value for the baseDN argument could be in the ZHS16GBK native character set, that is, simplified Chinese.

I.7.4 Using Globalization Support with bulkmodify

Add the argument encode="character_set" to the command.

For example:

bulkmodify connect="my_service_name" \
            encode=".ZHS16GBK" baseDN="ou=manufacturing,o=example,c=us" \
            replace="TRUE" value=Foreman filter="objectclass=*"

In this example, values for the baseDN, value, and filter arguments can be specified using the simplified Chinese GBK character set.