This chapter illustrates how to use Unicode in an Oracle database environment.
Dealing with many different languages in the same application or database has been complicated and difficult for a long time. To overcome the limitations of existing character encodings, several organizations began working on the creation of a global character set in the late 1980s. The need for this became even greater with the development of the World Wide Web in the mid-1990s. The Internet has changed how we do business today, with an emphasis on the global market that has made a universal character set a major requirement. This global character set needs to contain all major living scripts, support legacy data and implementations, and be simple enough that a single implementation of a product is sufficient for worldwide use. This character set should also support multilingual users and organizations, conform to international standards, and enable world-wide interchange of data. This global character set exists, is in wide use, and is called Unicode.
Unicode is a universal encoded character set that allows you to store information from any language using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.
The Unicode standard has been adopted by many software and hardware vendors. Many operating systems and browsers now support Unicode. Unicode is required by modern standards such as XML, Java, JavaScript, LDAP, CORBA 3.0, and WML. It is also synchronized with the ISO/IEC 10646 standard.
Oracle started supporting Unicode as a database character set in Oracle7. In Oracle9i, Unicode support has been expanded so that you can find the right solution for your globalization needs. Oracle9i supports Unicode 3.0, the third version of the Unicode standard.
There are two different ways to encode Unicode 3.0 characters:
UTF-16 is the 16-bit encoding of Unicode, in which the character codes 0x0000 through 0x007F represent the ASCII characters. One Unicode character can be 2 bytes or 4 bytes in this encoding. Characters from European and most Asian scripts are represented in 2 bytes. Surrogate pairs (described below) are represented in 4 bytes.
UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding in which the character codes 0x00 through 0x7F have the same meaning as ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes or 4 bytes in this encoding. Characters from the European scripts are represented in either 1 or 2 bytes, while characters from most Asian scripts are represented in 3 bytes. Surrogate pairs (described below) are represented in 4 bytes.
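One way to see these byte counts from SQL is the DUMP function combined with CONVERT. The following minimal sketch assumes an ASCII-based database and shows the two UTF-8 bytes of the character ä:

-- Expect output like: Typ=1 Len=2: c3,a4
SELECT DUMP(CONVERT('ä', 'UTF8'), 16) FROM dual;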
Unicode can be extended to encode more than one million characters. Characters outside the original 16-bit range are called supplementary characters; in UTF-16 they are represented by surrogate pairs, two 16-bit code units that together encode one character. Surrogate pairs allow the representation of characters defined in Unicode 3.0 and in future extensions of the Unicode standard, and they require 4 bytes in both the UTF-8 and UTF-16 encodings.
Figure 5-1 shows some characters and their character codes in UTF-16 and UTF-8 encoding. The last character is a treble clef (a music symbol), a surrogate pair that has been added to the Unicode 3.0 standard.
You can store Unicode characters in an Oracle9i database in two ways. The following sections explain how to use the two Unicode solutions and how to choose between them:
You can create a Unicode database that allows you to store UTF-8 encoded characters as SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG). If you prefer to implement Unicode support incrementally, you can store Unicode data in either the UTF-16 or UTF-8 encoding form in SQL NCHAR datatypes (NCHAR, NVARCHAR2, and NCLOB). The SQL NCHAR datatypes are called Unicode datatypes because they are used for storing Unicode data only.
The Oracle9i database has the concept of a database character set, which specifies the encoding used in the SQL CHAR datatypes as well as in metadata such as table names, column names, and SQL statements. A Unicode database is a database with UTF-8 as the database character set. There are three Oracle character sets that implement the UTF-8 encoding. The first two are designed for ASCII-based platforms, while the third should be used on EBCDIC platforms.
The AL32UTF8 character set encodes characters in one to three bytes. Surrogate pairs require four bytes. It is for ASCII-based platforms.
The UTF8 character set encodes characters in one to three bytes. Surrogate pairs require six bytes. It is for ASCII-based platforms.
The UTFE character set should be used as the database character set on EBCDIC platforms to support the UTF-8 encoding.
To create a database with the UTF8 character set, use the CREATE DATABASE statement and include the CHARACTER SET UTF8 clause. For example:

CREATE DATABASE myunicodedatabase
  CONTROLFILE REUSE
  LOGFILE '/u01/oracle/utfdb/redo01.log' SIZE 1M REUSE,
          '/u01/oracle/utfdb/redo02.log' SIZE 1M REUSE
  DATAFILE '/u01/oracle/utfdb/system01.dbf' SIZE 10M REUSE
    AUTOEXTEND ON NEXT 10M MAXSIZE 200M
  CHARACTER SET UTF8
An alternative way to store Unicode data is to use the SQL NCHAR datatypes. You can store Unicode characters in columns of these datatypes regardless of how the database character set has been defined. The NCHAR datatype has been redefined in Oracle9i to be a Unicode datatype exclusively; in other words, it stores data in the Unicode encoding only. You can use the SQL NCHAR datatypes in the same way you use the SQL CHAR datatypes. For example, you can create a table using the NVARCHAR2 and NCHAR datatypes as follows:

CREATE TABLE product (
  id          NUMBER,
  ename       NCHAR(100),
  description NVARCHAR2(1000))
The encoding used in the SQL NCHAR datatypes is specified as the national character set of the database. You can specify one of the following two Oracle character sets as the national character set:

AL16UTF16: This is the default character set for the SQL NCHAR datatypes. It encodes Unicode data in the UTF-16 encoding.

UTF8: When UTF8 is specified for the SQL NCHAR datatypes, the data stored in these datatypes is in UTF-8 encoding.
By default, data in the SQL NCHAR datatypes is stored in the UTF-16 encoding, and the length specified for NCHAR and NVARCHAR2 columns is always a number of characters rather than a number of bytes.
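As an illustration of the character-based length semantics, the following minimal sketch (the table name is hypothetical, and AL16UTF16 is assumed as the national character set) contrasts the LENGTH and LENGTHB functions:

CREATE TABLE nchar_demo (val NVARCHAR2(10));

-- The column holds up to 10 characters, not 10 bytes.
INSERT INTO nchar_demo VALUES (N'0123456789');

-- LENGTH counts characters (10); LENGTHB counts bytes
-- (20 under AL16UTF16, where each of these characters is 2 bytes).
SELECT LENGTH(val), LENGTHB(val) FROM nchar_demo;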
You can specify the national character set for the SQL NCHAR datatypes when you create a database using the CREATE DATABASE statement. The following statement creates a database with WE8ISO8859P1 as the database character set and AL16UTF16 as the national character set:

CREATE DATABASE myunicodedatabase
  CONTROLFILE REUSE
  LOGFILE '/u01/oracle/utfdb/redo01.log' SIZE 1M REUSE,
          '/u01/oracle/utfdb/redo02.log' SIZE 1M REUSE
  DATAFILE '/u01/oracle/utfdb/system01.dbf' SIZE 10M REUSE
    AUTOEXTEND ON NEXT 10M MAXSIZE 200M
  CHARACTER SET WE8ISO8859P1
  NATIONAL CHARACTER SET AL16UTF16
This example also illustrates that you can store Unicode data in a non-Unicode database by using the SQL NCHAR datatypes.

See Also:
"Migrating to Use the NCHAR Datatypes" for more information about migrating data to the SQL NCHAR datatypes
In order to choose the right Unicode solution for your database, consider the following questions:

What are the main programming languages used in your applications? How do they support Unicode?
How easily can your data and applications be migrated to take advantage of the Unicode solution?
How much performance overhead are you willing to accept in order to use Unicode in the database?
Is your data mostly Asian or European? Do you need to store multilingual documents in LOB columns?
What type of application are you implementing: a packaged application or a customized end-user application?
This section describes some general guidelines for choosing a Unicode database or a Unicode datatype solution. The final decision largely depends on your exact environment and requirements.
You should use a Unicode database in the following cases:

If your existing application is mainly written in Java and PL/SQL and your main concern is to minimize the code changes required to support multiple languages, you may want to use a Unicode database solution. As long as the datatypes used to store your data remain the SQL CHAR datatypes, the Java and PL/SQL code that accesses these columns does not need to change.

If the database character set of your existing database is US7ASCII, you may want to choose a Unicode database solution, because you can migrate the database using a single ALTER DATABASE statement. No data conversion is required because ASCII is a subset of UTF-8.

If the multilingual data will be evenly distributed across your existing schema tables and you are not sure which tables will contain multilingual data, you should use the Unicode database solution, because it does not require you to identify which columns store which data.

If you need to write SQL statements and PL/SQL code in Unicode, you must use the Unicode database solution. SQL statements and PL/SQL code are converted into the database character set before being processed, so any characters that cannot be converted to your database character set are lost. A common place to use Unicode data in a SQL statement is in a string literal, as the sketch below shows.
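A minimal sketch of the string-literal case; the table and column names are hypothetical. The literal survives only if every character in it can be represented in the database character set:

-- In a UTF8 database this literal is stored intact; in a US7ASCII
-- database, the accented characters would be corrupted when the
-- statement text is converted to the database character set.
INSERT INTO menu_items (id, label)
  VALUES (1, 'Gemüse und Süßwaren');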
If you want to store documents in BLOB columns and use Oracle Text for content searching, you must use a Unicode database. The BLOB data is converted to the database character set before being indexed by Oracle Text. If your database character set is not UTF8, data will be lost when the documents contain characters that cannot be converted to the database character set.
After you decide to use a Unicode database solution, you need to decide which UTF-8 character set is best for your database. See Table 5-1 for advantages and disadvantages of Unicode database solutions.
You should use the Unicode datatypes in the following cases:

If you want to add Unicode support to your existing database without migrating the database character set, consider using the Unicode datatypes to store Unicode data. You can add columns of the SQL NCHAR datatypes to existing tables or new tables to support multiple languages incrementally.

If you are building a packaged application that will be sold to customers, you may want to build the application using the SQL NCHAR datatypes, because the SQL NCHAR datatype is a reliable Unicode datatype in which the data is always stored in Unicode and the length of the data is always specified in UTF-16 code units. As a result, you need only test the application once, and the application will run on customer databases with any database character set.

If performance is your biggest concern, consider using a single-byte database character set and storing Unicode data in the SQL NCHAR datatypes. Databases that use a multibyte database character set such as UTF8 incur a performance overhead.

If your applications are written in Visual C/C++ or Visual Basic running on Windows, you may want to use the SQL NCHAR datatypes, because you can store UTF-16 data in them in the same way that you store it in a wchar_t buffer in Visual C/C++ and in a string buffer in Visual Basic. You can avoid buffer overflow in your client applications because the lengths of the wchar_t and string datatypes match the lengths of the SQL NCHAR datatypes in the database.
Once you decide to use a Unicode datatype solution, you need to decide which character set to choose as your national character set. See Table 5-2 for advantages and disadvantages of Unicode datatype solutions.
Oracle9i provides two solutions for storing Unicode characters in the database: a Unicode database solution and a Unicode datatype solution. After you select the Unicode database solution, the Unicode datatype solution, or a combination of both, you need to determine the character set to be used in the Unicode database and/or the Unicode datatypes. Different Unicode character sets have different advantages and disadvantages; Table 5-1 and Table 5-2 summarize them for the different Unicode solutions.
This section describes a few typical scenarios for storing Unicode characters in an Oracle9i database.
An American company running a Java application wants to add German and French support in the next release of the application, and Japanese support at a later time. The character set of the company's existing database is US7ASCII.

In this case, a typical solution is to choose UTF8 for the database character set, for the following reason:

The database can be migrated from US7ASCII to UTF8 with a single ALTER DATABASE statement. No data conversion is required because US7ASCII is a subset of UTF8.
A European company that runs its applications mainly on Windows wants to add new Windows applications written in Visual C/C++. The new applications will use the existing database to support Japanese and Chinese customer names. The character set of the company's existing database is WE8ISO8859P1.

In this case, a typical solution is to use the NCHAR and NVARCHAR2 datatypes to store Unicode characters, keeping WE8ISO8859P1 as the database character set and using AL16UTF16 as the national character set. The reasons for this are:

Storing the Japanese and Chinese customer names in NCHAR columns meets the requirement without migrating the entire database.

Character lengths in the SQL NCHAR datatypes are measured in UTF-16 code units, the same way they are measured for wchar_t strings in Windows C/C++ programs. This reduces programming complexity.
A Japanese company wants to develop a new Java application on Oracle9i. The company projects that the application will support as many languages as possible in the long run, and the application will use the BLOB datatype to store documents in multiple languages.

In this case, the typical solution is to create a Unicode database using AL32UTF8 as the database character set and to use the SQL NCHAR datatypes to store multilingual data. The national character set should be set to AL16UTF16. The reasons for this solution are:

To support content searching on documents stored in BLOBs, Oracle Text requires the database character set to be one of the UTF-8 character sets. Because the application may retrieve relational data in UTF-8 XML format (where supplementary characters are stored as four bytes), AL32UTF8 should be used as the database character set to avoid redundant data conversion when UTF-8 data is retrieved or inserted.

The application can use the SQL NCHAR datatypes for its relational data, because both Java and Windows support the UTF-16 character datatype and the length of a character string is always measured in the number of characters.

AL16UTF16 should be used as the national character set for the SQL NCHAR datatypes because it offers better performance and storage efficiency.
It is important to separate the task of character set migration from the task of database version migration. For example, if you have an Oracle8i non-Unicode database and you want to migrate it to an Oracle9i Unicode database, you must first migrate the database to Oracle9i and then migrate the data to Unicode.
This section describes how to migrate your data to Unicode in Oracle9i. "Migrating to Use the NCHAR Datatypes" describes how to migrate non-Unicode SQL CHAR datatypes to SQL NCHAR datatypes. It also describes how to migrate pre-Oracle9i SQL NCHAR datatypes to Oracle9i SQL NCHAR datatypes.
Before you actually migrate your data to Unicode, identify areas where character set conversion or truncation of data could occur. Oracle strongly recommends that you analyze your database with the Character Set Scanner Utility for possible problems before actually migrating the database.

There are three general approaches to migrating data from a non-Unicode character set to Unicode:
In most cases, a full export and import is recommended to properly convert all data to a Unicode character set. Be aware of data truncation issues: character datatype columns might need to be extended before importing to handle the increase in data byte size.

The general steps to migrate to Unicode using a full export and import are:

1. Export the database.
2. Create a new database with a Unicode database character set.
3. Create the tables in the new database, extending character column sizes where necessary to handle the increase in byte size.
4. Import the export file into the new database.
If, and only if, the current database character set is US7ASCII and all the data is in the 7-bit range, you can use the ALTER DATABASE CHARACTER SET statement to expedite migration to a Unicode database character set. Note that this approach cannot be used on EBCDIC platforms, because UTFE is not a strict superset of any EBCDIC character set.
Use the following steps to migrate to Unicode using the ALTER DATABASE CHARACTER SET statement:

1. Scan the database with the Character Set Scanner Utility to verify that all data is in the 7-bit range.
2. Issue the ALTER DATABASE CHARACTER SET statement.
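A minimal sketch of the second step, assuming the database has already been prepared as described in "Changing the Character Set After Database Creation":

SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER SYSTEM ENABLE RESTRICTED SESSION;
ALTER DATABASE OPEN;
-- Works only because US7ASCII is a strict subset of UTF8:
ALTER DATABASE CHARACTER SET UTF8;
SHUTDOWN IMMEDIATE;
STARTUP;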
"Changing the Character Set After Database Creation" for the steps to change the database character set
See Also:
Another approach is to issue an ALTER DATABASE CHARACTER SET statement followed by selective imports. This method can be used when the distribution of convertible data is known and the data is stored within a small number of tables.
The steps to migrate to Unicode using selective imports are:

1. Export the tables that contain convertible data.
2. Delete the data from those tables.
3. Issue the ALTER DATABASE CHARACTER SET statement.
4. Import the tables back into the database.

A sketch of this sequence follows.
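A minimal sketch using the Export and Import utilities; the scott schema and the convtab table are hypothetical:

-- 1. Export the table holding convertible data:
exp scott/tiger TABLES=(convtab) FILE=convtab.dmp

-- 2. In SQL*Plus, delete the data so that no convertible bytes remain:
TRUNCATE TABLE convtab;

-- 3. Change the database character set (see the earlier sketch):
ALTER DATABASE CHARACTER SET UTF8;

-- 4. Import the table; the data is converted to UTF8 on the way in:
imp scott/tiger TABLES=(convtab) FILE=convtab.dmp IGNORE=y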
Oracle Server release 8.0 introduced a national character (NCHAR) datatype that allows for a second, alternate character set in addition to the original database character set. The NCHAR datatype supports a number of special, fixed-width Asian character sets that were introduced to provide higher-performance processing of Asian character data.

In Oracle9i, the SQL NCHAR datatypes are limited to the Unicode character set encodings (UTF8 and AL16UTF16) only. Any other Oracle8 Server character sets that were available for the NCHAR datatype, including Asian character sets such as JA16SJISFIXED, are no longer supported.
The migration steps for existing NCHAR, NVARCHAR2, and NCLOB columns through export and import are as follows:

1. Export the SQL NCHAR columns from the Oracle8 or Oracle8i database.
2. Drop the SQL NCHAR columns.
3. Upgrade the database to Oracle9i.
4. Import the SQL NCHAR columns into Oracle9i.
The Oracle9i migration utility can also convert Oracle8 and Oracle8i NCHAR columns to Oracle9i NCHAR columns. A SQL NCHAR upgrade script called utlchar.sql is supplied with the migration utility. Run it at the end of the database migration to convert Oracle8 and Oracle8i NCHAR columns to the new Oracle9i NCHAR columns. Once the script has been executed, the data cannot be downgraded, because there is no downgrade SQL NCHAR script. The only way to move back to Oracle8 or Oracle8i is to drop all NCHAR columns, downgrade the database, and import the old NCHAR data from a previous Oracle8 or Oracle8i export file. Make sure you have a backup (export file) of your Oracle8 or Oracle8i NCHAR data in case you need to downgrade your database in the future.
To take advantage of the new Unicode NCHAR datatypes, you can also use the Export and Import utilities to migrate SQL CHAR columns to SQL NCHAR columns:

1. Export the SQL CHAR columns that you want to convert to SQL NCHAR.
2. Drop those SQL CHAR columns.
3. Import the columns as SQL NCHAR columns into the Oracle9i database.

A sketch of an alternative in-place conversion follows.
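For a small table, the conversion can also be sketched in place with SQL; the product table and its ename column are hypothetical, and TO_NCHAR converts data to the national character set:

-- Add a Unicode column alongside the existing one.
ALTER TABLE product ADD (ename_n NVARCHAR2(100));

-- Copy the data, converting it to the national character set.
UPDATE product SET ename_n = TO_NCHAR(ename);

-- After applications switch to ename_n, the old column can be dropped.
ALTER TABLE product DROP COLUMN ename;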
See Also:
The Oracle9i migration documentation for more NCHAR migration information
In addition to choosing a Unicode solution, the following should also be taken into consideration when the database schema is designed to support multiple languages:
When you use the NCHAR and NVARCHAR2 datatypes for storing multilingual data, the size specified for a column is always in character semantics, that is, in the number of Unicode code units. The following table shows the maximum sizes of the NCHAR and NVARCHAR2 datatypes for the AL16UTF16 and UTF8 national character sets.

National Character Set | Maximum Size of NCHAR Datatype | Maximum Size of NVARCHAR2 Datatype
---|---|---
AL16UTF16 | 1000 characters | 2000 characters
UTF8 | 2000 bytes | 4000 bytes
When you use the CHAR and VARCHAR2 datatypes for storing multilingual data, the size specified for each column is, by default, in number of bytes. If the database needs to support Thai, Arabic, or multibyte languages such as Chinese and Japanese, the size limits for the CHAR, VARCHAR, and VARCHAR2 columns may need to be extended. This is because the number of bytes required to encode these languages in UTF8 or AL32UTF8 may be significantly larger than for English and Western European languages. For example, one Thai character requires 3 bytes in UTF8 or AL32UTF8. In addition, the maximum sizes for the CHAR, VARCHAR, and VARCHAR2 datatypes are 2000 bytes, 4000 bytes, and 4000 bytes respectively. If applications need to store more than 4000 bytes, use the CLOB datatype.
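Oracle9i also lets you declare CHAR and VARCHAR2 column sizes in character rather than byte semantics, which keeps the declared limits language-independent. A minimal sketch with a hypothetical table:

-- Each column stores up to the stated number of characters,
-- however many bytes each character needs in UTF8 or AL32UTF8.
CREATE TABLE translated_item (
  item_id     NUMBER,
  name        VARCHAR2(100 CHAR),
  description VARCHAR2(1000 CHAR));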
The Unicode character set includes characters of most written languages around the world, but it does not identify the language to which a given character belongs. In other words, a character such as ä does not carry information about whether it is a French or German character. In order to present information in the language a user desires, data stored in a Unicode database should be accompanied by information identifying the language to which the data belongs.
There are many ways for a database schema to relate data to a language. Here is one example.
For data such as product descriptions or product names, you can add a language column (language_id) of CHAR or VARCHAR2 datatype to the product table to identify the language of the corresponding product information. This enables applications to retrieve the information in the desired language. The possible values for this language column are the 3-letter abbreviations of the valid NLS_LANGUAGE values of the database.
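A minimal sketch of such a table follows; the products_table and language_id names match the view example below, while the remaining columns are hypothetical:

CREATE TABLE products_table (
  product_id   NUMBER,
  language_id  VARCHAR2(3),     -- abbreviation of a valid NLS_LANGUAGE value
  product_name NVARCHAR2(100));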
You can also create a view to select the data of the current language. For example:

CREATE OR REPLACE VIEW product AS
  SELECT product_id, product_name
  FROM products_table
  WHERE language_id = SYS_CONTEXT('USERENV', 'LANG');
Fine-grained access control allows you to limit the degree to which a user can view information in a table or view. Typically, this is done by appending a WHERE clause. Once you add a WHERE clause as a fine-grained access policy to a table or view, Oracle9i automatically appends the WHERE clause to any SQL statement on the table at run time, so that only those rows satisfying the WHERE clause can be accessed.
You can use this feature to avoid specifying the desired language of a user in the WHERE clause of each and every SELECT statement in your applications. The following WHERE clause limits the view of a table to the rows corresponding to the desired language of a user:

WHERE language_id = SYS_CONTEXT('USERENV', 'LANG')

You can specify this WHERE clause as a fine-grained access policy on product_table as follows:

DBMS_RLS.ADD_POLICY('scott', 'product_table', 'lang_policy', 'scott',
                    'language_id = sys_context(''USERENV'', ''LANG'')', 'select');
Any SELECT statement on the product_table table then automatically has the WHERE clause appended at run time.
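Note that the fifth argument of DBMS_RLS.ADD_POLICY names a policy function that returns the predicate text, rather than the predicate itself. A minimal sketch under that reading, with hypothetical names:

CREATE OR REPLACE FUNCTION lang_predicate (
  object_schema IN VARCHAR2,
  object_name   IN VARCHAR2) RETURN VARCHAR2
IS
BEGIN
  -- The returned string is appended as the run-time WHERE clause.
  RETURN 'language_id = sys_context(''USERENV'', ''LANG'')';
END;
/

BEGIN
  DBMS_RLS.ADD_POLICY('scott', 'product_table', 'lang_policy',
                      'scott', 'lang_predicate', 'select');
END;
/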
You can store documents in multiple languages in CLOB, NCLOB, or BLOB columns and set up Oracle Text to enable content searching on the documents.
Data in CLOB columns is always stored as UTF-16 internally when the database character set is of varying width, such as UTF8 or AL32UTF8. Document contents are converted to UTF-16 when they are inserted into a CLOB column. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document, such as Japanese, in a CLOB column requires less storage space than the same document in a LONG column using UTF8 (typically around 30% less, depending on the contents of the document).
Documents in NCLOB columns are also stored as UTF-16 regardless of the database character set or national character set. The storage space requirement is the same as for CLOB columns. Document contents are converted to UTF-16 when they are inserted into an NCLOB column. If you want to store multilingual documents in a non-Unicode database, you should choose NCLOB. However, content search on NCLOB is not yet supported.
Documents in BLOB format are stored as they are. No data conversion occurs during insert and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) are not applicable to the BLOB datatype.
The following table summarizes the advantages and disadvantages of these datatypes for storing documents:

Datatype | Advantages | Disadvantages
---|---|---
CLOB | Content search is supported; SQL string functions apply | Contents are converted to UTF-16, which doubles the storage required for English documents
NCLOB | Can store multilingual documents even in a non-Unicode database | Content search is not yet supported
BLOB | No data conversion on insert and retrieval | SQL string manipulation and collation functions are not applicable
Oracle Text allows you to build indexes for content search on multilingual documents stored in CLOB and BLOB columns. It uses a language-specific lexer to parse the CLOB or BLOB data and produce a list of searchable keywords.

For multilingual document searching to work, you need to create a multi-lexer. The multi-lexer chooses a language-specific lexer for each row, based on a language column. This section describes the high-level steps to create indexes for documents in multiple languages.
The first step in creating the multi-lexer is the creation of language-specific lexer preferences for each language supported. The following example creates English, German, and Japanese lexers with PL/SQL procedures:

ctx_ddl.create_preference('english_lexer', 'basic_lexer');
ctx_ddl.set_attribute('english_lexer', 'index_themes', 'yes');
ctx_ddl.create_preference('german_lexer', 'basic_lexer');
ctx_ddl.set_attribute('german_lexer', 'composite', 'german');
ctx_ddl.set_attribute('german_lexer', 'alternate_spelling', 'german');
ctx_ddl.set_attribute('german_lexer', 'mixed_case', 'yes');
ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');
Once the language-specific lexer preferences are created, they need to be gathered together under a single multi-lexer preference. First, create the multi-lexer preference using the MULTI_LEXER object:

ctx_ddl.create_preference('global_lexer', 'multi_lexer');
Now add the language-specific lexers to the multi-lexer preference using the add_sub_lexer call:

ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'default', 'english_lexer');
This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.
The multi-lexer decides which lexer to use for each row based on a language column. This is a character column in the table that stores the language of the document in the text column. Use the Oracle language name to identify the language of a document in this column. For instance, if you use CLOBs to store your documents, add the language column to the table where the documents are stored:

CREATE TABLE globaldoc (
  doc_id   NUMBER PRIMARY KEY,
  language VARCHAR2(30),
  text     CLOB);
To create an index for this table, use the multi-lexer preference and specify the name of the language column:

CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer global_lexer language column language');
In addition to the language column, character set and format columns must be added to the table where your documents are stored. The character set column stores the character set of each document, using the Oracle character set names. The format column specifies whether a document is a text or binary document. For instance, the table would look like:

CREATE TABLE globaldoc (
  doc_id       NUMBER PRIMARY KEY,
  language     VARCHAR2(30),
  characterset VARCHAR2(30),
  format       VARCHAR2(10),
  text         BLOB);
With the format column, you can put word-processing or spreadsheet documents into the table and specify binary in the format column. For text documents such as HTML, XML, and plain text, specify text in the format column. With the character set column, you can store text documents in different character sets.
When you create the index, specify the names of the format and character set columns:

CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter inso_filter lexer global_lexer language column language format column format charset column characterset');
You can use the charset_filter if all documents are in text format; a sketch of that variant follows.
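The following is a minimal sketch of the charset_filter variant, assuming the CHARSET_FILTER filter type and its charset attribute; the cs_filter preference name is hypothetical:

ctx_ddl.create_preference('cs_filter', 'CHARSET_FILTER');
ctx_ddl.set_attribute('cs_filter', 'charset', 'UTF8');

CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter cs_filter lexer global_lexer language column language charset column characterset');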