|Oracle9i Globalization Support Guide
Release 1 (9.0.1)
Part Number A90236-02
This chapter illustrates how to use Unicode in an Oracle database environment.
Dealing with many different languages in the same application or database has been complicated and difficult for a long time. To overcome the limitations of existing character encodings, several organizations began working on the creation of a global character set in the late 1980s. The need for this became even greater with the development of the World Wide Web in the mid-1990s. The Internet has changed how we do business today, with an emphasis on the global market that has made a universal character set a major requirement. This global character set needs to contain all major living scripts, support legacy data and implementations, and be simple enough that a single implementation of a product is sufficient for worldwide use. This character set should also support multilingual users and organizations, conform to international standards, and enable world-wide interchange of data. This global character set exists, is in wide use, and is called Unicode.
Unicode is a universal encoded character set that allows you to store information from any language using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.
Oracle started supporting Unicode as a database character set in Oracle7. In Oracle9i, Unicode support has been expanded so that you can find the right solution for your globalization needs. Oracle9i supports Unicode 3.0, the third version of the Unicode standard.
There are two different ways to encode Unicode 3.0 characters:
UTF-16 encoding is the 16-bit encoding of Unicode in which the character codes 0x0000 through 0x007F contain the ASCII characters. One Unicode character can be 2 bytes or 4 bytes in this encoding. Characters from both European and most Asian scripts are represented in 2 bytes. Surrogate pairs (described below) are represented in 4 bytes.
UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding in which the character codes 0x00 through 0x7F have the same meaning as ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes or 4 bytes in this encoding. Characters from the European scripts are represented in either 1 or 2 bytes, while characters from most Asian scripts are represented in 3 bytes. Surrogate pairs (described below) are represented in 4 bytes.
Unicode can be extended to encode more than one million characters. In UTF-16, these additional characters are represented by surrogate pairs. Surrogate pairs allow representation of characters added in Unicode 3.0 and in future extensions of the Unicode standard, and they require 4 bytes in both the UTF-8 and UTF-16 encodings.
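As an illustrative sketch (the table and data are hypothetical), you can observe the difference between character counts and byte counts in a database whose character set is UTF8 by comparing the LENGTH and LENGTHB functions:

```sql
-- Hypothetical table in a database whose character set is UTF8
CREATE TABLE enc_demo (val VARCHAR2(30));

INSERT INTO enc_demo VALUES ('abc');     -- ASCII: 1 byte per character
INSERT INTO enc_demo VALUES ('Grüße');   -- ü and ß take 2 bytes each in UTF8

-- LENGTH counts characters; LENGTHB counts bytes
SELECT val, LENGTH(val) AS char_len, LENGTHB(val) AS byte_len
  FROM enc_demo;
```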
Figure 5-1 shows some characters and their character codes in UTF-16 and UTF-8 encoding. The last character is a treble clef (a music symbol), a surrogate pair that has been added to the Unicode 3.0 standard.
You can store Unicode characters in an Oracle9i database in two ways. The following sections explain how to use the two Unicode solutions and how to choose between them:
You can create a Unicode database that allows you to store UTF-8 encoded characters in the SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG). If you prefer to implement Unicode support incrementally, you can instead store Unicode data in either the UTF-16 or UTF-8 encoding form in the SQL NCHAR datatypes (NCHAR, NVARCHAR2, and NCLOB). The SQL NCHAR datatypes are called Unicode datatypes because they are used for storing Unicode data only.
The Oracle9i database has the concept of a database character set, which specifies the encoding to be used in the SQL CHAR datatypes as well as in metadata such as table names, column names, and SQL statements. A Unicode database is a database with UTF-8 as the database character set. There are three Oracle character sets that implement the UTF-8 encoding. The first two are designed for ASCII-based platforms, while the third should be used on EBCDIC platforms.
The AL32UTF8 character set encodes characters in one to three bytes. Surrogate pairs require four bytes. It is for ASCII-based platforms.
The UTF8 character set encodes characters in one to three bytes. Surrogate pairs require six bytes. It is for ASCII-based platforms.
The UTFE character set should be used as the database character set on EBCDIC platforms to support the UTF-8 encoding.
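To verify which character sets a database actually uses, the database and national character sets can be queried from the NLS_DATABASE_PARAMETERS view:

```sql
-- NLS_CHARACTERSET is the database character set;
-- NLS_NCHAR_CHARACTERSET is the national character set
SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');
```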
To create a database with the UTF8 character set, use the CREATE DATABASE statement. For example:
CREATE DATABASE myunicodedatabase
  CONTROLFILE REUSE
  LOGFILE '/u01/oracle/utfdb/redo01.log' SIZE 1M REUSE
          '/u01/oracle/utfdb/redo02.log' SIZE 1M REUSE
  DATAFILE '/u01/oracle/utfdb/system01.dbf' SIZE 10M REUSE
           AUTOEXTEND ON NEXT 10M MAXSIZE 200M
  CHARACTER SET UTF8;
An alternative to storing Unicode data in the database character set is to use the SQL NCHAR datatypes. You can store Unicode characters in columns of these datatypes regardless of how the database character set has been defined. The NCHAR datatype has been redefined in Oracle9i to be a Unicode datatype exclusively; in other words, it stores data in a Unicode encoding only. You can use the SQL NCHAR datatypes in the same way you use the SQL CHAR datatypes.
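As a minimal sketch (the table and column names are illustrative, since the original example is not included in this excerpt), a table using the Unicode datatypes might look like:

```sql
-- NCHAR and NVARCHAR2 lengths are specified in characters, not bytes
CREATE TABLE employees_intl (
  employee_id  NUMBER PRIMARY KEY,
  last_name    NVARCHAR2(50),
  job_title    NCHAR(30),
  resume       NCLOB
);
```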
The encoding used in the SQL NCHAR datatypes is specified as the national character set of the database. You can specify one of the following two Oracle character sets as the national character set:

AL16UTF16: This is the default character set for the SQL NCHAR datatypes. It encodes Unicode data in the UTF-16 encoding.

UTF8: When UTF8 is specified for the SQL NCHAR datatypes, the data stored in them is in the UTF-8 encoding.

By default, data is stored in the UTF-16 encoding in the SQL NCHAR datatypes, and the length specified for NCHAR and NVARCHAR2 columns is always a number of characters instead of a number of bytes.
You can specify the national character set for the SQL NCHAR datatypes when you create a database using the CREATE DATABASE statement. The following statement creates a database with WE8ISO8859P1 as the database character set and AL16UTF16 as the national character set:
CREATE DATABASE myunicodedatabase
  CONTROLFILE REUSE
  LOGFILE '/u01/oracle/utfdb/redo01.log' SIZE 1M REUSE
          '/u01/oracle/utfdb/redo02.log' SIZE 1M REUSE
  DATAFILE '/u01/oracle/utfdb/system01.dbf' SIZE 10M REUSE
           AUTOEXTEND ON NEXT 10M MAXSIZE 200M
  CHARACTER SET WE8ISO8859P1
  NATIONAL CHARACTER SET AL16UTF16;
This example also illustrates the fact that you can store Unicode data in a non-Unicode database by using the SQL NCHAR datatypes. See "Migrating to Use the NCHAR Datatypes" for more information about migrating data to the SQL NCHAR datatypes.
In order to choose the right Unicode solution for your database, you need to consider the following:
What are the main programming languages used in your applications? How do they support Unicode?
How easily can your data and applications be migrated to take advantage of the Unicode solution?
How much performance overhead are you willing to accept in order to use Unicode in the database?
Is your data mostly Asian or European? Do you need to store multilingual documents into LOB columns?
What type of applications are you implementing: a packaged application or a customized end-user application?
This section describes some general guidelines for choosing a Unicode database or a Unicode datatype solution. The final decision largely depends on your exact environment and requirements.
You should use a Unicode database when:
If your existing application is mainly written in Java and PL/SQL and your main concern is to minimize the code changes required to support multiple languages, you may want to use a Unicode database solution. As long as the datatypes used to store your data remain the SQL CHAR datatypes, the Java and PL/SQL code accessing these columns does not need to change.
If the database character set of your existing database is US7ASCII, you may want to choose a Unicode database solution because you can migrate your database using a single ALTER DATABASE CHARACTER SET statement. No data conversion is required because ASCII is a subset of UTF-8.
If the multilingual data will be evenly distributed in your existing schema tables and you are not sure which ones will contain multilingual data, then you should use the Unicode database solution because it does not require you to identify which columns store what data.
If you need to write SQL statements and PL/SQL code in Unicode, you must use the Unicode database solution. SQL statements and PL/SQL code are converted into the database character set before being processed. If your SQL statements and PL/SQL code contain characters that cannot be converted to your database character set, those characters will be lost. A common place where you would use Unicode data in a SQL statement is in a string literal.
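For example, a sketch of such a literal (the table and data are hypothetical) in a UTF8 database:

```sql
-- The string literal below contains non-ASCII characters. In a UTF8
-- database it is stored without loss; if the database character set
-- were US7ASCII, the accented characters would be corrupted when the
-- statement is converted to the database character set.
INSERT INTO products_table (product_id, product_name)
VALUES (1001, 'Café Crème');
```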
If you want to store multilingual documents as BLOBs and use Oracle Text for content searching, you must use a Unicode database. The BLOB data is converted to the database character set before being indexed by Oracle Text. If your database character set is not UTF8, data will be lost when the documents contain characters that cannot be converted to the database character set.
After you decide to use a Unicode database solution, you need to decide which UTF-8 character set is best for your database. See Table 5-1 for advantages and disadvantages of Unicode database solutions.
You should use Unicode datatypes when:
If you want to add Unicode support to your existing database without migrating the character set, you should consider using the Unicode datatypes to store Unicode data. You can add columns of the SQL NCHAR datatypes to existing tables or new tables to support multiple languages incrementally.
If you are building a packaged application that will be sold to customers, you may want to build it using the SQL NCHAR datatypes, because the SQL NCHAR datatype is a reliable Unicode datatype: the data is always stored in Unicode, and the length of the data is always specified in UTF-16 code units. As a result, you need only test the application once, and it will run on customer databases with any database character set.
If performance is your biggest concern, you should consider using a single-byte database character set and storing Unicode data in the SQL NCHAR datatypes. Databases that use a multibyte database character set such as UTF8 incur a performance overhead.
If your applications are written in Visual C/C++ or Visual Basic running on Windows, you may want to use the SQL NCHAR datatypes, because you can store UTF-16 data in them in the same way that you store it in a wchar_t buffer in Visual C/C++ or a string buffer in Visual Basic. You can avoid buffer overflows in your client applications because the lengths of the wchar_t and string datatypes match the lengths of the SQL NCHAR datatypes in the database.
Once you decide to use a Unicode datatype solution, you need to decide which character set to choose as your national character set. See Table 5-2 for advantages and disadvantages of Unicode datatype solutions.
Oracle9i provides two solutions to store Unicode characters in the database: a Unicode database solution and a Unicode datatype solution. After you select the Unicode database solution, the Unicode datatype solution or the combination of both, you need to determine the character set to be used in the Unicode database and/or the Unicode datatype. Different Unicode character sets have different advantages and disadvantages. Table 5-1 and Table 5-2 illustrate advantages and disadvantages for different Unicode solutions:
Table 5-1 Advantages and disadvantages of Unicode database character sets (columns: Database Character Set, Advantages, Disadvantages)
Table 5-2 Advantages and disadvantages of national character sets for Unicode datatypes (columns: National Character Set, Advantages, Disadvantages)
This section describes a few typical scenarios for storing Unicode characters in an Oracle9i database.
An American company running a Java application would like to add German and French support in the next release of their application, and add Japanese support at a later time. The company currently has the following system configuration:
In this case, a typical solution is to choose UTF8 for the database character set because of the following reasons:
The database can be migrated from US7ASCII to UTF8 with a single ALTER DATABASE CHARACTER SET statement. No data conversion is required because US7ASCII is a subset of UTF8.
A European company running its applications mainly on Windows platforms wants to add new Windows applications written in Visual C/C++, which will use the existing database to support Japanese and Chinese customer names. The company currently has the following system configuration:
In this case, a typical solution is to use NVARCHAR2 datatypes to store Unicode characters, to keep WE8ISO8859P1 as the database character set, and to use AL16UTF16 as the national character set. The reasons for this are:
Adding SQL NCHAR columns for the Japanese and Chinese customer names meets the requirement without migrating the entire database.
Data in the SQL NCHAR datatypes is treated as UTF-16. This is the same as the way strings are treated when using wchar_t strings in Windows C/C++ programs. This reduces programming complexity.
A Japanese company wants to develop a new Java application on Oracle9i. The company projects that the application will support as many languages as possible in the long run.
The application will use the BLOB datatype to store documents in multiple languages.
In this case, the typical solution is to create a Unicode database using AL32UTF8 as the database character set and to use the SQL NCHAR datatypes to store multilingual data. The national character set should be set to AL16UTF16. The reasons for this solution are:
To support content searching on multilingual documents stored in BLOBs, Oracle Text requires the database character set to be one of the UTF-8 character sets. Because the applications may retrieve relational data in UTF-8 XML format (where surrogate characters are stored as four bytes), AL32UTF8 should be used as the database character set to avoid redundant data conversion when UTF-8 data is retrieved or inserted.
The application can use the SQL NCHAR datatypes for its relational data, as both Java and Windows support the UTF-16 character datatype, and the length of a character string is always measured as a number of characters.
AL16UTF16 should be used as the national character set for the SQL NCHAR datatypes because AL16UTF16 offers better performance and storage efficiency.
It is important to separate the task of character set migration from the task of database version migration. For example, if you have an Oracle8i non-Unicode database and you want to migrate it to an Oracle9i Unicode database, first migrate the database to Oracle9i, then migrate the data to Unicode.
This section describes how to migrate your data to Unicode in Oracle9i. "Migrating to Use the NCHAR Datatypes" describes how to migrate non-Unicode SQL CHAR datatypes to SQL NCHAR datatypes, and how to migrate pre-Oracle9i SQL NCHAR datatypes to Oracle9i SQL NCHAR datatypes.
Before you actually migrate your data to Unicode, you need to identify areas of possible data character set conversions and truncation of data. Oracle strongly recommends that you analyze your database using the Character Set Scanner Utility for possible problems before actually migrating the database.
There are three general approaches to migrating data from a non-Unicode character set to Unicode:
In most cases, a full export and import is recommended to properly convert all data to a Unicode character set. It is important to be aware of data truncation issues because character datatype columns might need to be extended before importing to handle the increase of data byte size.
The general steps for migrating to Unicode with a full export and import are: export the entire database, create a new database with a Unicode character set (extending character columns where truncation could otherwise occur), and import the data into the new database.
If, and only if, the current database character set is US7ASCII and all the data is in the 7-bit range, you can use the ALTER DATABASE CHARACTER SET statement to expedite migration to a Unicode database character set. Note that this approach cannot be taken on EBCDIC platforms, because UTFE is not a strict superset of any EBCDIC character set.
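A minimal sketch of this approach (the full procedure, including a clean shutdown and restricted startup, is covered in the migration documentation):

```sql
-- Valid only when every stored character is 7-bit ASCII,
-- because UTF8 is a strict superset of US7ASCII
ALTER DATABASE CHARACTER SET UTF8;
```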
To migrate to Unicode in this way, issue the ALTER DATABASE CHARACTER SET statement against the existing database.
Another approach is to issue an ALTER DATABASE CHARACTER SET statement followed by selective imports. This method can be used when the distribution of convertible data is known and the data is stored within a small number of tables.
The general steps for migrating to Unicode using selective imports are: export the tables containing convertible data, delete their rows, issue the ALTER DATABASE CHARACTER SET statement, and then import the exported tables back into the database.
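A sketch of the selective-import sequence (the table name is hypothetical; the export and import steps are performed with the Export and Import utilities outside SQL):

```sql
-- 1. Export the affected tables with the Export utility (exp). Then:

-- 2. Remove the convertible rows so that no stored data needs
--    conversion when the character set is changed
TRUNCATE TABLE customer_names;

-- 3. Change the database character set (all remaining data is 7-bit)
ALTER DATABASE CHARACTER SET UTF8;

-- 4. Re-import the exported tables with the Import utility (imp);
--    the data is converted to UTF8 during the import
```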
Release 8.0 of the Oracle Server introduced a national character (NCHAR) datatype that allows for a second, alternate character set in addition to the database character set. The NCHAR datatype supports a number of special, fixed-width Asian character sets that were introduced to provide higher-performance processing of Asian character data.
In Oracle9i, the SQL NCHAR datatypes are limited to the Unicode character set encodings (UTF8 and AL16UTF16) only. Any other character sets that were available for the NCHAR datatype in the Oracle8 Server, including Asian character sets (for example, JA16SJISFIXED), are no longer supported.
The migration steps for existing NCHAR, NVARCHAR2, and NCLOB columns through export and import are as follows:

Export the SQL NCHAR columns from the Oracle8 or Oracle8i database.

Import the SQL NCHAR columns into Oracle9i.
The Oracle9i migration utility can also convert your Oracle8 and Oracle8i NCHAR columns to Oracle9i NCHAR columns. A SQL NCHAR upgrade script called utlchar.sql is supplied with the migration utility. You should run it at the end of the migration to convert your Oracle8 and Oracle8i NCHAR columns to the new Oracle9i NCHAR columns. Once the script has been executed, the data cannot be downgraded, because there is no downgrade SQL NCHAR script. The only way to move back to Oracle8 or Oracle8i is to drop all NCHAR columns, downgrade the database, and import the old NCHAR data from a previous Oracle8 or Oracle8i export file. Make sure you have a backup (export file) of your Oracle8 or Oracle8i NCHAR data, in case you need to downgrade your database in the future.
To take advantage of the new Unicode NCHAR datatypes, you can also use the Export and Import utilities to migrate SQL CHAR columns to SQL NCHAR columns: export the SQL CHAR columns that you want to convert, re-create them as SQL NCHAR columns, and import the data into the new columns.
In addition to choosing a Unicode solution, the following should also be taken into consideration when the database schema is designed to support multiple languages:
When you use the NCHAR and NVARCHAR2 datatypes for storing multilingual data, the column size specified for a column is always in character semantics, that is, in terms of the number of Unicode code units. The following table shows the maximum sizes of the NCHAR and NVARCHAR2 datatypes for the AL16UTF16 and UTF8 national character sets.

|National Character Set||Maximum Size of NCHAR Datatype||Maximum Size of NVARCHAR2 Datatype|
|AL16UTF16||1000 characters||2000 characters|
|UTF8||2000 bytes||4000 bytes|
When you use the CHAR and VARCHAR2 datatypes for storing multilingual data, the column size specified for each column is, by default, in bytes. If the database needs to support Thai, Arabic, or multibyte languages such as Chinese and Japanese, the limits for the CHAR and VARCHAR2 columns may need to be extended. This is because the number of bytes required to encode these languages in UTF8 or AL32UTF8 may be significantly larger than for English and Western European languages. For example, one Thai character requires 3 bytes in UTF8 or AL32UTF8. In addition, the maximum sizes for the CHAR, VARCHAR2, and NVARCHAR2 datatypes are 2000 bytes, 4000 bytes, and 4000 bytes respectively. If applications need to store more than 4000 bytes, you should use the CLOB datatype for the data.
The Unicode character set includes characters of most written languages around the world, but it does not tell you the language to which a given character belongs. In other words, a character such as ä does not carry information about whether it is a French or German character. In order to present information in the language a user desires, data stored in a Unicode database should be accompanied by information identifying the language to which the data belongs.
There are many ways for a database schema to relate data to a language. Here is one example.
For data such as product descriptions or product names, you can add a language column of VARCHAR2 datatype to the product table to identify the language of the corresponding product information. This enables applications to retrieve the information in the desired language. The possible values for this language column are the 3-letter abbreviations of the valid NLS_LANGUAGE values of the database.
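A sketch of such a schema (the table and column names are illustrative, chosen to match the view example that follows):

```sql
-- language_id holds the abbreviated NLS_LANGUAGE value, which is
-- what SYS_CONTEXT('USERENV', 'LANG') returns for the session
CREATE TABLE products_table (
  product_id    NUMBER,
  language_id   VARCHAR2(3),
  product_name  NVARCHAR2(100),
  PRIMARY KEY (product_id, language_id)
);
```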
See Appendix A, "Locale Data", for a list of valid NLS_LANGUAGE values and their abbreviations.
You can also create a view to select the data of the current language. For example:
CREATE OR REPLACE VIEW product AS
  SELECT product_id, product_name
    FROM products_table
   WHERE language_id = SYS_CONTEXT('USERENV', 'LANG');
Fine-grained access control allows you to limit the degree to which a user can view information in a table or view. Typically, this is done by appending a WHERE clause to the table or view. Once you add a WHERE clause as a fine-grained access policy to a table or view, Oracle9i automatically appends the WHERE clause to any SQL statement on the table at run time, so that only the rows satisfying the WHERE clause can be accessed.
You can use this feature to avoid specifying the desired language of a user in each and every SELECT statement in your applications. The following WHERE clause limits the view of a table to the rows corresponding to the desired language of a user:

language_id = SYS_CONTEXT('USERENV', 'LANG')
You specify this WHERE clause as a fine-grained access policy for product_table as follows (note that the single quotation marks inside the policy predicate must be doubled):

DBMS_RLS.ADD_POLICY ('scott', 'product_table', 'lang_policy', 'scott',
  'language_id = sys_context(''userenv'', ''LANG'')', 'select');
Every SELECT statement on the table product_table will then automatically have the WHERE clause appended at run time.
You can store documents in multiple languages in CLOB, NCLOB, or BLOB columns and set up Oracle Text to enable content searching for the documents.
Data in CLOB columns is always stored internally as UTF-16 when the database character set is of varying width, such as UTF8 or AL32UTF8. Document contents are converted to UTF-16 when they are inserted into a CLOB column. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian-language document, such as Japanese, in a CLOB column requires less space than the same document in a LONG column using UTF8 (typically around 30% less, depending on the contents of the document).
Documents in NCLOB columns are also stored as UTF-16, regardless of the database character set or national character set. The storage space requirement is the same as for CLOB columns. Document contents are converted to UTF-16 when they are inserted into an NCLOB column. If you want to store multilingual documents in a non-Unicode database, you should choose NCLOB. However, content search on NCLOB is not yet supported.
Documents in BLOB format are stored as they are. No data conversion occurs during insert and retrieval. However, SQL string manipulation functions (such as SUBSTR) and collation operations (such as ORDER BY) are not applicable to the BLOB datatype.
The following table lists the advantages and disadvantages for datatypes when storing documents:
Oracle Text allows you to build indexes for content search on multilingual documents stored as BLOBs. It uses a language-specific lexer to parse the BLOB data and produce a list of searchable keywords.

You need to create a multi-lexer for multilingual document searching to work. The multi-lexer chooses a language-specific lexer for each row, based on a language column. This section describes the high-level steps to create indexes for documents in multiple languages.
The first step in creating the multi-lexer is to create language-specific lexer preferences for each supported language. The following example creates English, German, and Japanese lexers with PL/SQL procedures:
ctx_ddl.create_preference('english_lexer', 'basic_lexer');
ctx_ddl.set_attribute('english_lexer', 'index_themes', 'yes');
ctx_ddl.create_preference('german_lexer', 'basic_lexer');
ctx_ddl.set_attribute('german_lexer', 'composite', 'german');
ctx_ddl.set_attribute('german_lexer', 'alternate_spelling', 'german');
ctx_ddl.set_attribute('german_lexer', 'mixed_case', 'yes');
ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');
Once the language-specific lexer preferences are created, they need to be gathered together under a single multi-lexer preference. First, create the multi-lexer preference using the MULTI_LEXER object.
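The creation call itself is not reproduced in this excerpt; assuming the global_lexer name used in the next example, a minimal call would be:

```sql
ctx_ddl.create_preference('global_lexer', 'multi_lexer');
```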
Next, add the language-specific lexers to the multi-lexer preference using the CTX_DDL.ADD_SUB_LEXER procedure:
ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'default', 'english_lexer');
This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.
The multi-lexer decides which lexer to use for each row based on a language column. This is a character column in the table that stores the language of the document in the text column. Use the Oracle language name to identify the language of a document in this column. For instance, if you use CLOBs to store your documents, you must add the language column to the table where the documents are stored:
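The table definition is not shown in this excerpt; a representative sketch (the names are assumed from the BLOB example later in this section) would be:

```sql
CREATE TABLE globaldoc (
  doc_id    NUMBER PRIMARY KEY,
  language  VARCHAR2(30),   -- Oracle language name, e.g. 'japanese'
  text      CLOB
);
```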
To create an index for this table, use the multi-lexer preference and specify the name of the language column:
CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer global_lexer language column language');
In addition to the language column, character set and format columns must be added to the table where your documents are stored. The character set column stores the character set of each document, using the Oracle character set names. The format column specifies whether a document is a text or binary document. For instance, your table would look like:
CREATE TABLE globaldoc (
  doc_id        NUMBER PRIMARY KEY,
  language      VARCHAR2(30),
  characterset  VARCHAR2(30),
  format        VARCHAR2(10),
  text          BLOB
);
With the format column, you can put word-processing or spreadsheet documents into the table and specify binary in the format column. For text documents such as HTML, XML, and plain text, specify text in the format column. With the character set column, you can store text documents in different character sets.
When you create the index, specify the names of the format and character set columns:
CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter inso_filter lexer global_lexer language column language
               format column format charset column characterset');
You may use the CHARSET_FILTER if all documents are in text format.
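A sketch of that alternative (the preference name is illustrative; CHARSET_FILTER converts text documents to the database character set, using the charset column to determine each document's source character set):

```sql
ctx_ddl.create_preference('global_filter', 'CHARSET_FILTER');
ctx_ddl.set_attribute('global_filter', 'charset', 'UTF8');

CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter global_filter lexer global_lexer
               language column language
               charset column characterset');
```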