This chapter illustrates how to use Unicode in an Oracle database environment.
Dealing with many different languages in the same application or database has been complicated and difficult for a long time. To overcome the limitations of existing character encodings, several organizations began working on the creation of a global character set in the late 1980s. The need for this became even greater with the development of the World Wide Web in the mid-1990s. The Internet has changed how we do business today, with an emphasis on the global market that has made a universal character set a major requirement. This global character set needs to contain all major living scripts, support legacy data and implementations, and be simple enough that a single implementation of a product is sufficient for worldwide use. This character set should also support multilingual users and organizations, conform to international standards, and enable world-wide interchange of data. This global character set exists, is in wide use, and is called Unicode.
Unicode is a universal encoded character set that allows you to store information from any language using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.
The Unicode standard has been adopted by many software and hardware vendors. Many operating systems and browsers now support Unicode. Unicode is required by modern standards such as XML, Java, JavaScript, LDAP, CORBA 3.0, and WML. It is also synchronized with the ISO/IEC 10646 standard.
Oracle started supporting Unicode as a database character set in Oracle7. In Oracle9i, Unicode support has been expanded so that you can find the right solution for your globalization needs. Oracle9i supports Unicode 3.0, the third version of the Unicode standard.
There are two different ways to encode Unicode 3.0 characters:
UTF-16 is the 16-bit encoding of Unicode, in which the character codes 0x0000 through 0x007F represent the ASCII characters. One Unicode character can be 2 bytes or 4 bytes in this encoding. Characters from European and most Asian scripts are represented in 2 bytes. Surrogate pairs (described below) are represented in 4 bytes.
UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding in which the character codes 0x00 through 0x7F have the same meaning as ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes or 4 bytes in this encoding. Characters from the European scripts are represented in either 1 or 2 bytes, while characters from most Asian scripts are represented in 3 bytes. Surrogate pairs (described below) are represented in 4 bytes.
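One way to see these byte counts from SQL is the DUMP function combined with CONVERT. The following minimal sketch assumes an ASCII-based database and shows the two UTF-8 bytes of the character ä:

-- Expect output like: Typ=1 Len=2: c3,a4
SELECT DUMP(CONVERT('ä', 'UTF8'), 16) FROM dual;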
Unicode can be extended to encode more than one million characters. Characters outside the original 16-bit range are called supplementary characters; in UTF-16 they are represented by surrogate pairs, two 16-bit code units that together encode one character. Surrogate pairs allow the representation of characters defined in Unicode 3.0 and in future extensions of the Unicode standard, and they require 4 bytes in both the UTF-8 and UTF-16 encodings.
Figure 5-1 shows some characters and their character codes in UTF-16 and UTF-8 encoding. The last character is a treble clef (a music symbol), a surrogate pair that has been added to the Unicode 3.0 standard.
You can store Unicode characters in an Oracle9i database in two ways. The following sections explain how to use the two Unicode solutions and how to choose between them:
You can create a Unicode database that allows you to store UTF-8 encoded characters as SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG). If you prefer to implement Unicode support incrementally, you can store Unicode data in either the UTF-16 or UTF-8 encoding form in SQL NCHAR datatypes (NCHAR, NVARCHAR2, and NCLOB). The SQL NCHAR datatypes are called Unicode datatypes because they are used for storing Unicode data only.
The Oracle9i database has the concept of a database character set, which specifies the encoding used in the SQL CHAR datatypes as well as in metadata such as table names, column names, and SQL statements. A Unicode database is a database with UTF-8 as the database character set. There are three Oracle character sets that implement the UTF-8 encoding. The first two are designed for ASCII-based platforms, while the third should be used on EBCDIC platforms.
The AL32UTF8 character set encodes characters in one to three bytes. Surrogate pairs require four bytes. It is for ASCII-based platforms.
The UTF8 character set encodes characters in one to three bytes. Surrogate pairs require six bytes. It is for ASCII-based platforms.
The UTFE character set should be used as the database character set on EBCDIC platforms to support the UTF-8 encoding.
To create a database with the UTF8 character set, use the CREATE DATABASE statement and include the CHARACTER SET UTF8 clause. For example:

CREATE DATABASE myunicodedatabase
  CONTROLFILE REUSE
  LOGFILE '/u01/oracle/utfdb/redo01.log' SIZE 1M REUSE,
          '/u01/oracle/utfdb/redo02.log' SIZE 1M REUSE
  DATAFILE '/u01/oracle/utfdb/system01.dbf' SIZE 10M REUSE
    AUTOEXTEND ON NEXT 10M MAXSIZE 200M
  CHARACTER SET UTF8
An alternative way to store Unicode data is to use the SQL NCHAR datatypes. You can store Unicode characters in columns of these datatypes regardless of how the database character set has been defined. The NCHAR datatype has been redefined in Oracle9i to be a Unicode datatype exclusively; in other words, it stores data in the Unicode encoding only. You can use the SQL NCHAR datatypes in the same way you use the SQL CHAR datatypes. For example, you can create a table using the NVARCHAR2 and NCHAR datatypes as follows:

CREATE TABLE product (
  id          NUMBER,
  ename       NCHAR(100),
  description NVARCHAR2(1000))
The encoding used in the SQL NCHAR datatypes is specified as the national character set of the database. You can specify one of the following two Oracle character sets as the national character set:

AL16UTF16: This is the default character set for the SQL NCHAR datatypes. It encodes Unicode data in the UTF-16 encoding.

UTF8: When UTF8 is specified for the SQL NCHAR datatypes, the data stored in these datatypes is in UTF-8 encoding.
By default, data in the SQL NCHAR datatypes is stored in the UTF-16 encoding, and the length specified for NCHAR and NVARCHAR2 columns is always a number of characters rather than a number of bytes.
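As an illustration of the character-based length semantics, the following minimal sketch (the table name is hypothetical, and AL16UTF16 is assumed as the national character set) contrasts the LENGTH and LENGTHB functions:

CREATE TABLE nchar_demo (val NVARCHAR2(10));

-- The column holds up to 10 characters, not 10 bytes.
INSERT INTO nchar_demo VALUES (N'0123456789');

-- LENGTH counts characters (10); LENGTHB counts bytes
-- (20 under AL16UTF16, where each of these characters is 2 bytes).
SELECT LENGTH(val), LENGTHB(val) FROM nchar_demo;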
You can specify the national character set for the SQL NCHAR datatypes when you create a database using the CREATE DATABASE statement. The following statement creates a database with WE8ISO8859P1 as the database character set and AL16UTF16 as the national character set:

CREATE DATABASE myunicodedatabase
  CONTROLFILE REUSE
  LOGFILE '/u01/oracle/utfdb/redo01.log' SIZE 1M REUSE,
          '/u01/oracle/utfdb/redo02.log' SIZE 1M REUSE
  DATAFILE '/u01/oracle/utfdb/system01.dbf' SIZE 10M REUSE
    AUTOEXTEND ON NEXT 10M MAXSIZE 200M
  CHARACTER SET WE8ISO8859P1
  NATIONAL CHARACTER SET AL16UTF16
This example also illustrates that you can store Unicode data in a non-Unicode database by using the SQL NCHAR datatypes.

See Also:
"Migrating to Use the NCHAR Datatypes" for more information about migrating data to the SQL NCHAR datatypes
In order to choose the right Unicode solution for your database, consider the following questions:

What are the main programming languages used in your applications? How do they support Unicode?
How easily can your data and applications be migrated to take advantage of the Unicode solution?
How much performance overhead are you willing to accept in order to use Unicode in the database?
Is your data mostly Asian or European? Do you need to store multilingual documents in LOB columns?
What type of application are you implementing: a packaged application or a customized end-user application?
This section describes some general guidelines for choosing a Unicode database or a Unicode datatype solution. The final decision largely depends on your exact environment and requirements.
You should use a Unicode database in the following cases:

If your existing application is mainly written in Java and PL/SQL and your main concern is to minimize the code changes required to support multiple languages, you may want to use a Unicode database solution. As long as the datatypes used to store your data remain the SQL CHAR datatypes, the Java and PL/SQL code that accesses these columns does not need to change.

If the database character set of your existing database is US7ASCII, you may want to choose a Unicode database solution, because you can migrate the database using a single ALTER DATABASE statement. No data conversion is required because ASCII is a subset of UTF-8.

If the multilingual data will be evenly distributed across your existing schema tables and you are not sure which tables will contain multilingual data, you should use the Unicode database solution, because it does not require you to identify which columns store which data.

If you need to write SQL statements and PL/SQL code in Unicode, you must use the Unicode database solution. SQL statements and PL/SQL code are converted into the database character set before being processed, so any characters that cannot be converted to your database character set are lost. A common place to use Unicode data in a SQL statement is in a string literal, as the sketch below shows.
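A minimal sketch of the string-literal case; the table and column names are hypothetical. The literal survives only if every character in it can be represented in the database character set:

-- In a UTF8 database this literal is stored intact; in a US7ASCII
-- database, the accented characters would be corrupted when the
-- statement text is converted to the database character set.
INSERT INTO menu_items (id, label)
  VALUES (1, 'Gemüse und Süßwaren');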
If you want to store documents in BLOB columns and use Oracle Text for content searching, you must use a Unicode database. The BLOB data is converted to the database character set before being indexed by Oracle Text. If your database character set is not UTF8, data will be lost when the documents contain characters that cannot be converted to the database character set.
After you decide to use a Unicode database solution, you need to decide which UTF-8 character set is best for your database. See Table 5-1 for advantages and disadvantages of Unicode database solutions.
You should use the Unicode datatypes in the following cases:

If you want to add Unicode support to your existing database without migrating the database character set, consider using the Unicode datatypes to store Unicode data. You can add columns of the SQL NCHAR datatypes to existing tables or new tables to support multiple languages incrementally.

If you are building a packaged application that will be sold to customers, you may want to build the application using the SQL NCHAR datatypes, because the SQL NCHAR datatype is a reliable Unicode datatype in which the data is always stored in Unicode and the length of the data is always specified in UTF-16 code units. As a result, you need only test the application once, and the application will run on customer databases with any database character set.

If performance is your biggest concern, consider using a single-byte database character set and storing Unicode data in the SQL NCHAR datatypes. Databases that use a multibyte database character set such as UTF8 incur a performance overhead.

If your applications are written in Visual C/C++ or Visual Basic running on Windows, you may want to use the SQL NCHAR datatypes, because you can store UTF-16 data in them in the same way that you store it in a wchar_t buffer in Visual C/C++ and in a string buffer in Visual Basic. You can avoid buffer overflow in your client applications because the lengths of the wchar_t and string datatypes match the lengths of the SQL NCHAR datatypes in the database.
Once you decide to use a Unicode datatype solution, you need to decide which character set to choose as your national character set. See Table 5-2 for advantages and disadvantages of Unicode datatype solutions.
Oracle9i provides two solutions for storing Unicode characters in the database: a Unicode database solution and a Unicode datatype solution. After you select the Unicode database solution, the Unicode datatype solution, or a combination of both, you need to determine the character set to be used in the Unicode database and/or the Unicode datatypes. Different Unicode character sets have different advantages and disadvantages; Table 5-1 and Table 5-2 summarize them for the different Unicode solutions.
This section describes a few typical scenarios for storing Unicode characters in an Oracle9i database.
An American company running a Java application wants to add German and French support in the next release of the application, and Japanese support at a later time. The character set of the company's existing database is US7ASCII.

In this case, a typical solution is to choose UTF8 for the database character set, for the following reason:

The database can be migrated from US7ASCII to UTF8 with a single ALTER DATABASE statement. No data conversion is required because US7ASCII is a subset of UTF8.
A European company that runs its applications mainly on Windows wants to add new Windows applications written in Visual C/C++. The new applications will use the existing database to support Japanese and Chinese customer names. The character set of the company's existing database is WE8ISO8859P1.

In this case, a typical solution is to use the NCHAR and NVARCHAR2 datatypes to store Unicode characters, keeping WE8ISO8859P1 as the database character set and using AL16UTF16 as the national character set. The reasons for this are:

Storing the Japanese and Chinese customer names in NCHAR columns meets the requirement without migrating the entire database.

Character lengths in the SQL NCHAR datatypes are measured in UTF-16 code units, the same way they are measured for wchar_t strings in Windows C/C++ programs. This reduces programming complexity.
A Japanese company wants to develop a new Java application on Oracle9i. The company projects that the application will support as many languages as possible in the long run, and the application will use the BLOB datatype to store documents in multiple languages.

In this case, the typical solution is to create a Unicode database using AL32UTF8 as the database character set and to use the SQL NCHAR datatypes to store multilingual data. The national character set should be set to AL16UTF16. The reasons for this solution are:

To support content searching on documents stored in BLOBs, Oracle Text requires the database character set to be one of the UTF-8 character sets. Because the application may retrieve relational data in UTF-8 XML format (where supplementary characters are stored as four bytes), AL32UTF8 should be used as the database character set to avoid redundant data conversion when UTF-8 data is retrieved or inserted.

The application can use the SQL NCHAR datatypes for its relational data, because both Java and Windows support the UTF-16 character datatype and the length of a character string is always measured in the number of characters.

AL16UTF16 should be used as the national character set for the SQL NCHAR datatypes because it offers better performance and storage efficiency.
It is important to separate the task of character set migration from the task of database version migration. For example, if you have an Oracle8i non-Unicode database and you want to migrate it to an Oracle9i Unicode database, you must first migrate the database to Oracle9i and then migrate the data to Unicode.
This section describes how to migrate your data to Unicode in Oracle9i. "Migrating to Use the NCHAR Datatypes" describes how to migrate non-Unicode SQL CHAR datatypes to SQL NCHAR datatypes. It also describes how to migrate pre-Oracle9i SQL NCHAR datatypes to Oracle9i SQL NCHAR datatypes.
Before you actually migrate your data to Unicode, identify areas where character set conversion or truncation of data could occur. Oracle strongly recommends that you analyze your database with the Character Set Scanner Utility for possible problems before actually migrating the database.

There are three general approaches to migrating data from a non-Unicode character set to Unicode:
In most cases, a full export and import is recommended to properly convert all data to a Unicode character set. Be aware of data truncation issues: character datatype columns might need to be extended before importing to handle the increase in data byte size.

The general steps to migrate to Unicode using a full export and import are:

1. Export the database.
2. Create a new database with a Unicode database character set.
3. Create the tables in the new database, extending character column sizes where necessary to handle the increase in byte size.
4. Import the export file into the new database.
If, and only if, the current database character set is US7ASCII and all the data is in the 7-bit range, you can use the ALTER DATABASE CHARACTER SET statement to expedite migration to a Unicode database character set. Note that this approach cannot be used on EBCDIC platforms, because UTFE is not a strict superset of any EBCDIC character set.
Use the following steps to migrate to Unicode using the ALTER DATABASE CHARACTER SET statement:

1. Scan the database with the Character Set Scanner Utility to verify that all data is in the 7-bit range.
2. Issue the ALTER DATABASE CHARACTER SET statement.
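A minimal sketch of the second step, assuming the database has already been prepared as described in "Changing the Character Set After Database Creation":

SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER SYSTEM ENABLE RESTRICTED SESSION;
ALTER DATABASE OPEN;
-- Works only because US7ASCII is a strict subset of UTF8:
ALTER DATABASE CHARACTER SET UTF8;
SHUTDOWN IMMEDIATE;
STARTUP;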
"Changing the Character Set After Database Creation" for the steps to change the database character set
See Also:
Another approach is to issue an ALTER DATABASE CHARACTER SET statement followed by selective imports. This method can be used when the distribution of convertible data is known and the data is stored within a small number of tables.
The steps to migrate to Unicode using selective imports are:

1. Export the tables that contain convertible data.
2. Delete the data from those tables.
3. Issue the ALTER DATABASE CHARACTER SET statement.
4. Import the tables back into the database.

A sketch of this sequence follows.
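A minimal sketch using the Export and Import utilities; the scott schema and the convtab table are hypothetical:

-- 1. Export the table holding convertible data:
exp scott/tiger TABLES=(convtab) FILE=convtab.dmp

-- 2. In SQL*Plus, delete the data so that no convertible bytes remain:
TRUNCATE TABLE convtab;

-- 3. Change the database character set (see the earlier sketch):
ALTER DATABASE CHARACTER SET UTF8;

-- 4. Import the table; the data is converted to UTF8 on the way in:
imp scott/tiger TABLES=(convtab) FILE=convtab.dmp IGNORE=y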
Oracle Server release 8.0 introduced a national character (NCHAR) datatype that allows for a second, alternate character set in addition to the original database character set. The NCHAR datatype supports a number of special, fixed-width Asian character sets that were introduced to provide higher-performance processing of Asian character data.

In Oracle9i, the SQL NCHAR datatypes are limited to the Unicode character set encodings (UTF8 and AL16UTF16) only. Any other Oracle8 Server character sets that were available for the NCHAR datatype, including Asian character sets such as JA16SJISFIXED, are no longer supported.
The migration steps for existing NCHAR, NVARCHAR2, and NCLOB columns through export and import are as follows:

1. Export the SQL NCHAR columns from the Oracle8 or Oracle8i database.
2. Drop the SQL NCHAR columns.
3. Upgrade the database to Oracle9i.
4. Import the SQL NCHAR columns into Oracle9i.
The Oracle9i migration utility can also convert Oracle8 and Oracle8i NCHAR columns to Oracle9i NCHAR columns. A SQL NCHAR upgrade script called utlchar.sql is supplied with the migration utility. Run it at the end of the database migration to convert Oracle8 and Oracle8i NCHAR columns to the new Oracle9i NCHAR columns. Once the script has been executed, the data cannot be downgraded, because there is no downgrade SQL NCHAR script. The only way to move back to Oracle8 or Oracle8i is to drop all NCHAR columns, downgrade the database, and import the old NCHAR data from a previous Oracle8 or Oracle8i export file. Make sure you have a backup (export file) of your Oracle8 or Oracle8i NCHAR data in case you need to downgrade your database in the future.
To take advantage of the new Unicode NCHAR datatypes, you can also use the Export and Import utilities to migrate SQL CHAR columns to SQL NCHAR columns:

1. Export the SQL CHAR columns that you want to convert to SQL NCHAR.
2. Drop those SQL CHAR columns.
3. Import the columns as SQL NCHAR columns into the Oracle9i database.

A sketch of an alternative in-place conversion follows.
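For a small table, the conversion can also be sketched in place with SQL; the product table and its ename column are hypothetical, and TO_NCHAR converts data to the national character set:

-- Add a Unicode column alongside the existing one.
ALTER TABLE product ADD (ename_n NVARCHAR2(100));

-- Copy the data, converting it to the national character set.
UPDATE product SET ename_n = TO_NCHAR(ename);

-- After applications switch to ename_n, the old column can be dropped.
ALTER TABLE product DROP COLUMN ename;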
See Also:
The Oracle9i migration documentation for more NCHAR migration information
In addition to choosing a Unicode solution, the following should also be taken into consideration when the database schema is designed to support multiple languages:
When you use the NCHAR and NVARCHAR2 datatypes for storing multilingual data, the size specified for a column is always in character semantics, that is, in the number of Unicode code units. The following table shows the maximum sizes of the NCHAR and NVARCHAR2 datatypes for the AL16UTF16 and UTF8 national character sets.

National Character Set | Maximum Size of NCHAR Datatype | Maximum Size of NVARCHAR2 Datatype
---|---|---
AL16UTF16 | 1000 characters | 2000 characters
UTF8 | 2000 bytes | 4000 bytes
When you use the CHAR and VARCHAR2 datatypes for storing multilingual data, the size specified for each column is, by default, in number of bytes. If the database needs to support Thai, Arabic, or multibyte languages such as Chinese and Japanese, the size limits for the CHAR, VARCHAR, and VARCHAR2 columns may need to be extended. This is because the number of bytes required to encode these languages in UTF8 or AL32UTF8 may be significantly larger than for English and Western European languages. For example, one Thai character requires 3 bytes in UTF8 or AL32UTF8. In addition, the maximum sizes for the CHAR, VARCHAR, and VARCHAR2 datatypes are 2000 bytes, 4000 bytes, and 4000 bytes respectively. If applications need to store more than 4000 bytes, use the CLOB datatype.
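Oracle9i also lets you declare CHAR and VARCHAR2 column sizes in character rather than byte semantics, which keeps the declared limits language-independent. A minimal sketch with a hypothetical table:

-- Each column stores up to the stated number of characters,
-- however many bytes each character needs in UTF8 or AL32UTF8.
CREATE TABLE translated_item (
  item_id     NUMBER,
  name        VARCHAR2(100 CHAR),
  description VARCHAR2(1000 CHAR));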
The Unicode character set includes characters of most written languages around the world, but it does not identify the language to which a given character belongs. In other words, a character such as ä does not carry information about whether it is a French or German character. In order to present information in the language a user desires, data stored in a Unicode database should be accompanied by information identifying the language to which the data belongs.
There are many ways for a database schema to relate data to a language. Here is one example.
For data such as product descriptions or product names, you can add a language column (language_id) of CHAR or VARCHAR2 datatype to the product table to identify the language of the corresponding product information. This enables applications to retrieve the information in the desired language. The possible values for this language column are the 3-letter abbreviations of the valid NLS_LANGUAGE values of the database.
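A minimal sketch of such a table follows; the products_table and language_id names match the view example below, while the remaining columns are hypothetical:

CREATE TABLE products_table (
  product_id   NUMBER,
  language_id  VARCHAR2(3),     -- abbreviation of a valid NLS_LANGUAGE value
  product_name NVARCHAR2(100));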
You can also create a view to select the data of the current language. For example:

CREATE OR REPLACE VIEW product AS
  SELECT product_id, product_name
  FROM products_table
  WHERE language_id = SYS_CONTEXT('USERENV', 'LANG');
Fine-grained access control allows you to limit the degree to which a user can view information in a table or view. Typically, this is done by appending a WHERE clause. Once you add a WHERE clause as a fine-grained access policy to a table or view, Oracle9i automatically appends the WHERE clause to any SQL statement on the table at run time, so that only those rows satisfying the WHERE clause can be accessed.
You can use this feature to avoid specifying the desired language of a user in the WHERE clause of each and every SELECT statement in your applications. The following WHERE clause limits the view of a table to the rows corresponding to the desired language of a user:

WHERE language_id = SYS_CONTEXT('USERENV', 'LANG')

You can specify this WHERE clause as a fine-grained access policy on product_table as follows:

DBMS_RLS.ADD_POLICY('scott', 'product_table', 'lang_policy', 'scott',
                    'language_id = sys_context(''USERENV'', ''LANG'')', 'select');
Any SELECT statement on the product_table table then automatically has the WHERE clause appended at run time.
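Note that the fifth argument of DBMS_RLS.ADD_POLICY names a policy function that returns the predicate text, rather than the predicate itself. A minimal sketch under that reading, with hypothetical names:

CREATE OR REPLACE FUNCTION lang_predicate (
  object_schema IN VARCHAR2,
  object_name   IN VARCHAR2) RETURN VARCHAR2
IS
BEGIN
  -- The returned string is appended as the run-time WHERE clause.
  RETURN 'language_id = sys_context(''USERENV'', ''LANG'')';
END;
/

BEGIN
  DBMS_RLS.ADD_POLICY('scott', 'product_table', 'lang_policy',
                      'scott', 'lang_predicate', 'select');
END;
/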
You can store documents in multiple languages in CLOB, NCLOB, or BLOB columns and set up Oracle Text to enable content searching on the documents.
Data in CLOB columns is always stored as UTF-16 internally when the database character set is of varying width, such as UTF8 or AL32UTF8. Document contents are converted to UTF-16 when they are inserted into a CLOB column. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document, such as Japanese, in a CLOB column requires less storage space than the same document in a LONG column using UTF8 (typically around 30% less, depending on the contents of the document).
Documents in NCLOB columns are also stored as UTF-16 regardless of the database character set or national character set. The storage space requirement is the same as for CLOB columns. Document contents are converted to UTF-16 when they are inserted into an NCLOB column. If you want to store multilingual documents in a non-Unicode database, you should choose NCLOB. However, content search on NCLOB is not yet supported.
Documents in BLOB format are stored as they are. No data conversion occurs during insert and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) are not applicable to the BLOB datatype.
The following table summarizes the advantages and disadvantages of these datatypes for storing documents:

Datatype | Advantages | Disadvantages
---|---|---
CLOB | Content search is supported; SQL string functions apply | Contents are converted to UTF-16, which doubles the storage required for English documents
NCLOB | Can store multilingual documents even in a non-Unicode database | Content search is not yet supported
BLOB | No data conversion on insert and retrieval | SQL string manipulation and collation functions are not applicable
Oracle Text allows you to build indexes for content search on multilingual documents stored in CLOB and BLOB columns. It uses a language-specific lexer to parse the CLOB or BLOB data and produce a list of searchable keywords.

For multilingual document searching to work, you need to create a multi-lexer. The multi-lexer chooses a language-specific lexer for each row, based on a language column. This section describes the high-level steps to create indexes for documents in multiple languages.
The first step in creating the multi-lexer is the creation of language-specific lexer preferences for each language supported. The following example creates English, German, and Japanese lexers with PL/SQL procedures:

ctx_ddl.create_preference('english_lexer', 'basic_lexer');
ctx_ddl.set_attribute('english_lexer', 'index_themes', 'yes');
ctx_ddl.create_preference('german_lexer', 'basic_lexer');
ctx_ddl.set_attribute('german_lexer', 'composite', 'german');
ctx_ddl.set_attribute('german_lexer', 'alternate_spelling', 'german');
ctx_ddl.set_attribute('german_lexer', 'mixed_case', 'yes');
ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');
Once the language-specific lexer preferences are created, they need to be gathered together under a single multi-lexer preference. First, create the multi-lexer preference using the MULTI_LEXER object:

ctx_ddl.create_preference('global_lexer', 'multi_lexer');
Now add the language-specific lexers to the multi-lexer preference using the add_sub_lexer call:

ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'default', 'english_lexer');
This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.
The multi-lexer decides which lexer to use for each row based on a language column. This is a character column in the table that stores the language of the document in the text column. Use the Oracle language name to identify the language of a document in this column. For instance, if you use CLOBs to store your documents, add the language column to the table where the documents are stored:

CREATE TABLE globaldoc (
  doc_id   NUMBER PRIMARY KEY,
  language VARCHAR2(30),
  text     CLOB);
To create an index for this table, use the multi-lexer preference and specify the name of the language column:

CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer global_lexer language column language');
In addition to the language column, character set and format columns must be added to the table where your documents are stored. The character set column stores the character set of each document, using the Oracle character set names. The format column specifies whether a document is a text or binary document. For instance, the table would look like:

CREATE TABLE globaldoc (
  doc_id       NUMBER PRIMARY KEY,
  language     VARCHAR2(30),
  characterset VARCHAR2(30),
  format       VARCHAR2(10),
  text         BLOB);
With the format column, you can put word-processing or spreadsheet documents into the table and specify binary in the format column. For text documents such as HTML, XML, and plain text, specify text in the format column. With the character set column, you can store text documents in different character sets.
When you create the index, specify the names of the format and character set columns:

CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter inso_filter lexer global_lexer language column language format column format charset column characterset');
You can use the charset_filter if all documents are in text format; a sketch of that variant follows.
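The following is a minimal sketch of the charset_filter variant, assuming the CHARSET_FILTER filter type and its charset attribute; the cs_filter preference name is hypothetical:

ctx_ddl.create_preference('cs_filter', 'CHARSET_FILTER');
ctx_ddl.set_attribute('cs_filter', 'charset', 'UTF8');

CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter cs_filter lexer global_lexer language column language charset column characterset');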