1 Overview of Migrating Databases to Unicode

This chapter discusses the database character set migration process. First, it explains what a database character set is and what it means to migrate one. It discusses the basic issues related to the migration process and the tools available to help with those tasks.

This chapter contains the following sections:

What Is a Database Character Set?
Introduction to Character Set Migration
Why Unicode Is the Right Choice
About Character Set Migration Tools
Overview of Character Set Migration Considerations

What Is a Database Character Set?

In Oracle, a database ch aracter set defines a set of characters that can be stored in the database and defines how each character maps to a particular sequence of bytes stored in memory or disk to represent this character. What Oracle calls a character set is also known in the industry as a character encoding, which can be thought of as a set of characters plus their encoding in bytes.

A database character set is used for VARCHAR2, CHAR, LONG, and CLOB data, as well as for SQL, PL/SQL, and Java stored code in the database. The CLOB data is stored in the database character set only if the character set is single-byte. Otherwise, it is stored in AL16UTF16 (that is, Unicode UTF-16 encoding in big-endian form, which is abbreviated as UTF-16BE).

A database character set may be single-byte, where each character has one byte, or multibyte varying width, where each character may be 1, 2, 3, or 4 bytes with the maximum dependent on the particular character set. Single-byte character sets are easier and faster to process but have a very limited capability of 256 codes -- too few even for all European languages. Multibyte character sets require more complex string processing but can accommodate thousands of characters.

Almost all character sets other than Unicode (frequently called legacy character sets) were each designed for a specific language or group of languages. For single-byte character sets, this was logical because of the limited number of available codes. For multibyte character sets, this was presumably because of limited requirements of isolated local databases, the desire to limit the maximum required width of a character code, or expertise required to design a character set for multiple unrelated language groups.

An important concept when working with character sets is that of subsets and supersets. We say that character set A is a superset of character set B and that character set B is a subset of the character set A, if A contains or defines all characters of B plus some other characters. We say that character set A is a strict or binary superset of character set B and that character set B is a strict or binary subset of the character set A, if A contains or defines all characters of B and each such character has identical byte representation in both A and B. A binary subset or superset is also called a strict subset or superset.

When character data is transmitted from an environment where it is encoded in one character set to an environment where it is encoded in another character set, for example, from one database to another or from client to server, the data must undergo the process of character set conversion. That is, the code of a character in the source character set must be replaced with the code of the same character in the target character set. Characters that are not available in the target character set are converted to replacement characters. The replacement character of a given source character may be defined explicitly in the target character set. For example, ä (a with an umlaut) is replaced by a. when converted to the character set US7ASCII. If no explicit replacement character is defined, the default replacement character of the target character set is used. This character is usually the question mark ? or the inverted question mark ¿. The default replacement character in Unicode character sets is the character U+FFFD.

By definition, if data is converted from a character set to its superset, no replacement characters are used because all characters are available in the superset. Also per definition, if data is converted from a character set to its binary (strict) superset, no character codes actually change. Oracle Database usually performs the conversion even to a binary superset, so that if the text to be converted contains byte sequences that do not form valid character codes in the source character set, these sequences are converted to the replacement characters in the target character set.

Introduction to Character Set Migration

A database defined in a legacy character set can store data only in one or a few languages as determined by its character set. If a need arises to store data in other languages in this database, the character set of the database must be changed to a superset that defines both the existing and the new required characters. The need may come from internationalization of operations, moving to an Internet sales model, or from consolidating databases of international subsidiaries or buying and acquired companies into one database.

If the identified superset is also a binary superset, the change of the database character set declaration must happen only in metadata (schema information) in the data dictionary. All existing data, per definition, remains valid in the binary superset. The change is simple and fast. If the identified superset is not a binary superset, binary character codes (byte representation) must be converted from the representation in the old character set to the representation in the new character set, that is, re-encoded.

An important observation, relevant to determining if existing data requires conversion, is that even if the old database character set is not a binary subset of the new database character set, many characters may actually have the same binary codes in both character sets. In particular, all standard ASCII characters with codes between 0 and 127 form a binary subset of each Oracle database character sets supported on ASCII-based platforms (that is, all platforms except EBCDIC-based IBM z/OS and Fujitsu BS2000). Therefore, even if the planned database character set change is, for example, from WE8ISO8859P1 to AL32UTF8 - which are not in a binary subset-superset relationship - but all characters that are in the database have codes in the range 0 to 127, then no characters really need to be converted, because all characters will have the same codes after the change.

If character data in a database to be converted can be classified into data that needs no conversion, called changeless throughout this guide, and data that requires conversion, called convertible, a lot or resources can be saved by skipping the conversion of the changeless data.

When the character codes are converted, various issues may arise. The most common one is that the new representation may have more bytes causing the converted value to no longer satisfy the containing column's length constraint or even the maximum length of its data type. For conversion to succeed, those issues have to be identified and cleaned up first, for example, by lengthening columns, shortening texts, or migrating to a larger data type.

The process of identifying the suitable superset, classifying character data as changeless or convertible, checking for issues and conversion requirements, fixing the issues, changing the declaration in the data dictionary and converting data as needed is called character set migration. The issues and the subset/superset relationship of existing data in the database may be identified with the help of the Database Character Set Scanner utility, discussed in Oracle Database Globalization Support Guide, or with the help of the Database Migration Assistant for Unicode.

The migration process can be performed as an A-to-B migration, when data is copied from a source database into a new database defined with the target character set, or as an inline migration, when data is converted inside a database and the character set information of the database is changed.

The character set migration should not be confused with the repair process of updating the database character set declaration to match the encoding of data in the database. In certain incorrect configurations of Oracle client software, known as pass-through or garbage-in garbage-out configurations, the character data in the database may be interpreted by applications in a different character set from the one declared for the database. The goal of the repair process is to match the database character set declaration with the expectations of the applications. Such repair processes, per definition, do not involve conversion of character codes.

Why Unicode Is the Right Choice

An important step in the character set migration process is to gather the requirements for what you need to achieve. For example, a very common situation is that your company is internationalizing its operation and you now must support data storage of more world languages than are available with your existing database character set. Because many legacy computer systems initially required support for only one or possibly a few languages, the original character set chosen might have had a limited repertoire of characters that could be supported. A common case was in America, where 7-bit ASCII was satisfactory for supporting English data exclusively. Another common situation occurred in Europe, where a variety of 8-bit character sets supported European languages plus English. These choices were reasonable and provided a balance of economy and performance. Today, however, these choices are a barrier to supporting the global market, thus necessitating character set migration. It is common to need to consolidate multiple servers into a single instance, with the existing servers having various character sets. In this case, you would logically choose Unicode as the standard.

The character set defined in the Unicode Standard supports all contemporary written languages with significant use and a few historical scripts. It also supports various symbols, for example, those used in technical, scientific, and musical notations. It is the native or recommended character set of many technologies, such as Java, Windows, HTML, or XML. There is no other character set that is so universal. In addition, Unicode adoption is increasing rapidly with great support from within the industry.

Oracle supports two encodings of Unicode as the database character set: UTF-8 through the AL32UTF8 character set and CESU-8 through the UTF8 character set. (Note the use of the hyphen in the name of Unicode encoding and the lack of a hyphen in the name of Oracle character set. This differentiation is used throughout Oracle documentation.) UTF-8 is a multibyte varying width Unicode encoding using 1 to 4 bytes per character. CESU-8 is a compatibility-only encoding, discouraged for information exchange by the Unicode standard. CESU-8 is very similar to UTF-8 except that so-called supplementary characters encoded by UTF-8 in 4 bytes, CESU-8 encodes as pairs of 3-byte codes. Oracle Database has deprecated the UTF8 character set, so it should not be used as a database character set except when explicitly required by an application, such as Oracle E-Business Suite Release 11i.

AL32UTF8 encodes ASCII characters in 1 byte, characters from European, and Middle East languages in 2 bytes, characters from South and East Asian languages in 3 bytes. Therefore, storage requirements of Unicode are usually higher than storage requirements of a legacy character set for the same language. This is valid for all languages except those requiring only ASCII (English, Dutch, Indonesian, Swahili, and so on).

When looking for a superset to satisfy business requirements, you might find that there is a legacy character set that fulfills those requirements. This could happen, for example, if the database is US7ASCII storing English, which can also be stored in any other Oracle database character set, and the new languages to be added to the database all belong to a single group of compatible languages, supported by some legacy character set. For example, all the new languages are Western European and can be stored in WE8MSWIN252. Another example is if a few new characters must be stored in the database, such as the Euro sign and smart quotes ("), and the new characters are supported by an extended variant of the existing database character set. An example of such an extended variant is WE8MSWIN1252, which is a single-byte binary superset of WE8ISO8859P1.

If more than one language group is to be stored, Unicode is the only choice.

Oracle recommends migrating to Unicode for its universality and compatibility with contemporary and future technologies and language requirements. Migration from one legacy character set to another legacy character set should be considered only if all of the following statements are true for the environment to be migrated:

The new character set is a binary superset of the old character set, requiring only a simple declaration change, and no data conversion.
The new character sets is single-byte, thus offering better performance and more compact storage than Unicode.
No future language requirements are expected in a foreseeable future that will not be compatible with the new character set.
Migration to AL32UTF8 would require significant resources due to availability and/or compatibility requirements.

Two more reasons to consider when choosing the new character set are:

If the database grows fast, a conversion in the future will be much more expensive than today. Therefore, as Unicode seems to be the ultimate goal in the industry, it is best to migrate to it as soon as possible.
If the application is an Internet application, or it uses Java, or some other Unicode-based technology, then Unicode as the database character set will provide a more consistent, safer and more efficient environment, where there is no risk of losing data because of the inability of the system to store characters that may unexpectedly come from the application, and where character set conversions are limited to efficient transformations between Unicode character encodings.

Another Unicode encoding, UTF-16BE, named AL16UTF16 by Oracle Database, is used as the national character set for NCHAR, NVARCHAR2, and NCLOB data types. Data requiring the Unicode character set may be stored in columns of those data types, even if the database character set is not migrated to AL32UTF8. But those data types have some restrictions, for example, they are not supported by Oracle Text and require changes to client API calls (Oracle Call Interface (OCI) or Java Database Connectivity (JDBC)). Oracle does not recommend the use of national character set data types, and NCHAR data types are not discussed in this guide. See Oracle Database Globalization Support Guide for more coverage of NCHAR.

About Character Set Migration Tools

You can change the database character set of an existing database. The following sections discuss how to achieve this:

Identifying Migration Issues with the Database Character Set Scanner
Changing Character Set Metadata Using the CSALTER Script
Converting Character Data Using the Export/Import and Data Pump Utilities
Migrating a Database Using the Database Migration Assistant for Unicode

Identifying Migration Issues with the Database Character Set Scanner

The Database Character Set Scanner (CSSCAN) is a command-line utility that assesses the feasibility of migrating an Oracle database to a new database character set. The Database Character Set Scanner checks all character data in the database and tests for the effects and problems of changing the character set encoding. A summary report is generated at the end of the scan that shows the scope of work required to convert the database to a new character set.

The Database Character Set Scanner is available with all database editions except the Express Edition. Most of its scanning functionality has been included in the Database Migration Assistant for Unicode, discussed in this Guide, but the Database Character Set Scanner must still be used if any of the following is true:

Migration is to a legacy character set.
The national (NCHAR) character set is being migrated.
The database character set declaration is being repaired (see "Introduction to Character Set Migration").
The scan results are required by the CSALTER script.

See Also:

Oracle Database Globalization Support Guide for more information about the Database Character Set Scanner

Changing Character Set Metadata Using the CSALTER Script

The CSALTER script is part of the Database Character Set Scanner utility. The script has two tasks:

Change the character set metadata information stored in various system tables in the data dictionary to the new character set.
Convert character data in the CLOB columns in the data dictionary tables, if the database character set is changed from single-byte to multibyte or from multibyte to single-byte.

The CSALTER script regards more than just tables owned by SYS as the data dictionary. It considers most of the schemas present in default template databases, which are created during Oracle Database software installation, as the data dictionary as well.

The CSALTER script can be run only if none of the character data in the database, with the exception of CLOB data in the data dictionary tables, needs re-encoding to remain valid in the new character set. To assure the integrity of the database, the CSALTER script requires that the Database Character Set Scanner be run first to confirm that no data requires conversion, that is, the target character set is a binary superset of the data in the database. CSALTER aborts with an error message if the scan results contain information about any "exceptional" data anywhere in the database or any "convertible" data outside of the data dictionary CLOB columns.

The CLOB columns in the data dictionary are treated especially because AL16UTF16, which is used to store CLOB data in multibyte databases, is never a binary superset or subset of any single-byte character set. So even if the target character set is a superset, the CLOBs still need to be converted. As data dictionary tables cannot be dropped or truncated, no single-byte database character set could ever be changed to any multibyte character set without CSALTER converting the data dictionary CLOBs.

The CLOB columns in user tables are not converted by CSALTER; the CSALTER script will not run if CLOB data is in user tables and the character set change is from single-byte to multibyte.

Most of the CSALTER functionality has been included in the Database Migration Assistant for Unicode, discussed in this guide, but CSALTER must still be used, if:

migration is to a legacy character set, or
the national (NCHAR) character set is being migrated, or
the database character set declaration is being repaired (see "Introduction to Character Set Migration")

See Also:

Oracle Database Globalization Support Guide for information about how to change the database character set using the CSALTER script

Converting Character Data Using the Export/Import and Data Pump Utilities

If the Database Character Set Scanner finds some data requiring conversion in the database, the CSALTER script cannot change the database character set. Some other method must also be applied to convert the data.One of the common ways to convert the character data is to use the Export utility to export database objects from the database and then the Import utility to import these objects either to the same or to another database. If the database character set at the time of the export differs from the database character set at the time of the import, the Import utility automatically converts the data as required.The advantage of using the Export and Import utilities to convert data compared to other methods, such as spooling data and loading it back with SQL*Loader, is that they automatically take care of most steps needed to re-create exported objects in their original form.

See Also:

Oracle Database Globalization Support Guide for information about how to change the database character set using a full export and import process

Full Versus Selective Export/Import

Export and import may be applied as steps in the following two methods of the character set migration:

Migration through full database export and import

In this method, an empty database in the target character set is created first. Then, all contents of the database is exported from the source database and imported into the target database. The created copy of the original database has all character data converted as required. The source database is dropped and the copy becomes the production database.
Migration through selective export and import

In this method, all tables identified by the Database Character Set Scanner utility as containing convertible data – and only these tables – are exported, and then truncated. The Database Character Set Scanner is run again and this time, it finds no convertible data. The CSALTER script can now be executed to change the database character set. Then, the previously exported data is imported back into the database, undergoing the required character set conversion.

Circumstances that favor the full export/import method are:

The data dictionary of the database contains convertible data. For example, table and/or column names (identifiers) contain non-ASCII characters requiring conversion. As data dictionary tables cannot be truncated, the selective export/import approach cannot be used.
The number of tables requiring conversion is high, making manual selection of tables to export and truncate a costly task.
The character set migration process is combined with database upgrade to a higher release and/or migration to another platform.
A lot of tables in the database are fragmented and the character set migration process is an opportunity to perform defragmentation and cleanup.

The following circumstances point to the selective export/import method instead:

The full export/import converts all text in the database, more than may be needed; it also needs to insert all non-textual data. Therefore, the full export/import method is usually much too slow for the short maintenance windows of large production databases.
The selective export/import method requires temporary disk space for the exported tables only. The full method requires temporary storage for the database copy and for the full database export file. This amounts to additional space almost twice the size of the source database.

Data Pump Utility

The Export and Import utilities come in two versions, the original Export (exp) and Import (imp) and the Data Pump Export (expdp) and Import (impdp). The Data Pump versions were introduced in Oracle Database Release 10g. The original versions are desupported for general use as of Oracle Database Release 11g. Both versions are command-line tools and both can be used for the purpose discussed here.

The original utilities are not compatible with Data Pump utilities. Files produced by exp cannot be imported by impdp and files produced by expdp cannot be imported by imp.

The original Export and Import are pure client-side utilities. The Export utility retrieves data over an Oracle Net network connection and creates the export file on the client. Data Pump Export contains a small client-control utility, but the actual export job is performed by a database server process and the export file is produced on the database server host. Similarly, the original Import utility reads the export file on the client and inserts data over an Oracle Net network connection into the target database, while the Data Pump utility reads the export file directly on the database server host. The Data Pump architecture makes the Data Pump utility export and import process much faster, but if the export file is to be created or read on a computer other than the source database host, the target system must have a dummy database installed.

See Also:

Oracle Database Utilities for detailed information about the original and Data Pump Export/Import utilities

Conversion Issues

Conversion, as mentioned in "Introduction to Character Set Migration" and described in "Character Set Migration: Data Integrity", is associated with possible issues, mainly involving length problems. Those issues should have been identified by the Database Character Set Scanner and handled by the user before the import. Otherwise, in both methods discusses, the import process might fail.

When using the Export/Import utilities, restrictions regarding objects not supported by the given utility version must be considered, as well. For example, Oracle Data Pump Release 10g does not support exporting an XML schema. Such objects might need to be moved manually.

See Also:

Oracle Database Globalization Support Guide for more information about the database migration process with Export/Import utilities

Migrating a Database Using the Database Migration Assistant for Unicode

The DMU offers an intuitive and user-friendly GUI that helps you streamline the migration process through an interface that minimizes the workload and ensures that all migration issues are addressed, along with guaranteeing that the data conversion is carried out correctly and efficiently.

Some advantages of the DMU are that it does the following:

Guides you through the workflow

An important advantage of the DMU is that it offers a logical workflow to guide you through the entire process of migrating character sets.
Offers suggestions for handling certain problems

The DMU can help you when you run into certain problems, such as errors or failures during the scanning or cleansing of the data.
Supports selective conversion of data

The DMU enables you to convert only the data that must be converted, at the table, column, and row level.
Offers progress monitoring

The DMU provides a GUI to visualize how the steps are progressing.
Offers interactive visualization features

The DMU enables you to analyze data and see the results in the GUI in an interactive way. It also enables you to see the data itself in the GUI and cleanse it interactively from identified migration issues.
Provides the only supported tool for inline conversion

With the DMU, Oracle Database supports inline conversion of database contents. This offers performance and security advantage over other existing conversion methods.
Allows cleansing actions to be scheduled for later execution during the conversion step

Postponing of cleansing actions, such as data type migration, ensures that the production database and applications are not affected until the actual migration downtime window.

This release of the Database Migration Assistant has a few restrictions with respect to what databases it can convert. For example, in most cases, it does not convert databases with convertible data in the data dictionary. The export/import migration methods could be used to overcome these limitations.

See Also:

Chapter 4, "Performing Basic DMU Tasks" for more information

Overview of Character Set Migration Considerations

The following sections discuss the matters you should consider before migrating your current character set using the Database Migration Assistant for Unicode:

Character Set Migration: Data Integrity
Character Set Migration: Dependent Objects
Character Set Migration: Read-Only and Inaccessible Objects
Character Set Migration: Downtime
Character Set Migration: Failure Recovery
Character Set Migration: Application Impact

Character Set Migration: Data Integrity

There are several data integrity issues that might arise from re-encoding of character data in the database, including the following:

Data Expansion
Invalid Binary Storage Representation of Data
Partitioning
Maximum Index Key Size
Unique Keys and Primary Keys
Derived or Encrypted Data
Character Data Stored in Binary Data Types

Data Expansion

When you migrate from a legacy encoding to Unicode, character values will likely expand in conversion because their encodings will have more bytes. This results in increased space requirements for the database. A further issue is that the widths for CHAR and VARCHAR2 columns may not be sufficient after the character set has been migrated to Unicode. Thus, there is a risk of the data being truncated. The column length constraints have to be increased, or, if they are already at the data type limit, the columns might need to be migrated to the CLOB data type.

A related issue is with non-ASCII object names, which typically have a 30-byte limit. The names of such objects must be manually truncated or changed to use only ASCII characters, in both the database and affected applications, because it is impossible to increase the limit of non-ASCII object names.

Invalid Binary Storage Representation of Data

For user data, a common problem is that the data in a column is not actually in the declared database character set. Instead, it is in another character set or it is binary (for example, it is composed of images, documents in proprietary word processor formats, or text encrypted with custom methods), or perhaps there are multiple character sets in a single column.

The preceding situation is possible in the pass-through configuration. In this configuration, client character set is defined (usually through the NLS_LANG client setting) to be equal to the database character set. Because the character sets are equal, the Oracle communication protocol does not attempt any client/server character set conversion, logically assuming that the conversion is not needed. This configuration is correct as long as data coming from a client application is indeed in the declared character set. But because no character set conversion nor character validation is performed on the data, any sequence of bytes can actually be stored in the database and retrieved unchanged. As long as only the client application interprets the data, the stored byte sequences could correspond to a character encoding different from the database character set or they may not correspond to readable text at all.

If default character set conversion is applied to such incorrect data during migration, some of the data does not survive the conversion, and produces garbage. This results from the default application of a code mapping from the old database character set to the new character set to convert the stored sequences of bytes. This mapping is certainly incorrect for binary data, which should not change at all, and usually incorrect for data interpreted by applications as being in a character set different from the old database character set.

To avoid the problem, there are a few things you can do. One is to change the data type of the column to RAW or LONG RAW. This approach is recommended for binary data. Another possibility is to skip conversion of a column with suspected problems and let the application continue using the pass-through configuration after updating the client character set declaration to the new database character set. This approach is generally not recommended but might be the only choice if data in a single column is stored in multiple actual character sets and discovering these character sets automatically is not possible. Finally, the DMU enables you to declare the actual character set of any column as being different from the source database character set. In this way, you can tell DMU how to perform the conversion according to correct code mapping for data in the column.

Partitioning

Partitioning by range compares partitioning key values with partition boundary values to distribute rows into partitions. This comparison looks at the binary storage representation of character values. The binary storage representation of both the keys and the boundary values could change when converted to Unicode in a way that would not preserve their ordering. This could lead to three problems:

A given partitioning key might not sort between the lower and the upper boundaries of the partition that the key is in.
Partitions are internally numbered in the same order as the order of their boundary values. If the boundary values change their sort order during conversion to Unicode, the internal numbering would become invalid.
In rare cases, two partition boundaries could become equal because in some multibyte Eastern Asian character sets two different characters may map to the same Unicode character. Unicode does not separately encode certain shape variants of certain Chinese characters, which are encoded in legacy character sets.

A partitioned table might need to be re-created using modified DDL, if its partitioning boundaries use non-ASCII characters. If only the partitioning key values were affected, rows would need to be moved between partitions, so row movement must be enabled for the table.

Maximum Index Key Size

The key size of an index is calculated using maximum lengths of key columns. The maximum supported index key size is limited by the database block size. To ensure that expanded column data can be held in a given column after conversion to Unicode, users may try to expand the column byte length or alter the column length semantics from BYTE to CHAR. If an index was created on such a column, the resulting larger index key might exceed the supported maximum index key size.

You can try to solve this problem by:

Increasing the block size
Dropping the index
Removing some columns from a composite index

Unique Keys and Primary Keys

There are two situations where uniqueness of key values of a unique constraint or a primary key constraint might be violated after conversion of the key values to Unicode, even though the pre-conversion key values satisfy the constraint:

Two different text values might become the same value after conversion to AL32UTF8 because some characters in some character sets map to the same Unicode code point. For example, two JA16SJIS code points 0xED40 and 0xFA5C are mapped to U+7E8A.
Two text values differing only at their ends might become the same if, during conversion, they were truncated because of their post-conversion length exceeding the maximum column length.

Usually, the key values must be modified manually to solve these issues.

Derived or Encrypted Data

Applications might store information that has been derived from character data in a way that depends on the binary storage representation of this data. For example, an application might store text length separately from the text in a numerical column. If the text expands, such a column will not be updated automatically. Or, an application might calculate and store a checksum or an encrypted form of a character value. The checksum or cipher text, if based on the binary storage representation in the database character set, becomes invalid when the representation changes because of character set conversion.

Such derived values must be resynchronized after the migration. For encrypted text, this usually means that the text must be decrypted just before the conversion, then converted, and then immediately re-encrypted. For security reasons, the time during which the text remains decrypted must be kept to minimum and the database server might require extra protection to prevent a security breach.

Oracle Transparent Data Encryption functionality is generally transparent for the migration process as well, and manual decryption and re-encryption are not needed.

Character Data Stored in Binary Data Types

Applications might store character data in binary data types, such as RAW, LONG RAW, and BLOB. The applications might use the database character set to interpret and process this data. If the database character set is migrated but the character data in the binary data types is not re-encoded to match the new database character set, the applications will fail. Automatic migration tools, such as the Database Migration Assistant for Unicode described in this guide, have no information about which binary columns contain character data and might need character set conversion. Furthermore, character data may be intermixed with binary data in binary structures, so that simple conversion of entire column values is not possible. Therefore, if required, you must manually locate and convert any character data in binary data types.

Character Set Migration: Dependent Objects

When data in a table is modified during the conversion process, the modification impacts all objects that store a copy of this data, such as indexes (standard, functional, and domain) and materialized views. Depending on the conversion method, these objects might need to be dropped and re-created or they will be synchronized automatically by the database. For performance reasons, it might be more efficient to drop the dependent objects and re-create them after migration, instead of relying on the automatic synchronization.

Triggers are also affected, because they might react to modifications in a way that is not acceptable for business requirements. For example, a trigger sets the last modification date and the modifying user ID to that of the migration process, while the business rules requires that only user-initiated modifications are audited. Therefore, triggers must be disabled before the underlying table can be converted.

Most of the preceding issues are automatically handled by the DMU.

Character Set Migration: Read-Only and Inaccessible Objects

Certain tables are read-only, such as t ables explicitly marked as read-only, external tables, and tables stored in read-only tablespaces. Such tables cannot be converted before they are made read write. Except for the explicit marking of a table as read-only, which the migration process can temporarily remove, the other mentioned reasons have to be dealt with manually.

External tables do not usually require conversion. If the character set of the source file for an external table (SQL*Loader or Data Pump) is properly declared, the database will automatically convert character data to the new database character set each time the data is fetched from the file. For performance reasons, consider converting the file permanently to avoid this conversion. If the character set of the source file is not defined correctly, and the file content can be fetched only due to a pass-through configuration, similar to what is described in "Invalid Binary Storage Representation of Data", the character set declaration of the file must be corrected. In the case of Data Pump files, this is not a trivial task.

Read-only tablespaces might be stored on read-only media, such as CD-ROM or DVD-ROM. If convertible data is present in such a tablespace, you must move the tablespace to standard disk storage, make it read/write, convert the database, make the tablespace read-only again, and copy it back to the read-only media. If there are many such tablespaces in a database, for example, used for archival, and making them all read/write before the conversion is not feasible due to disk storage requirements, consider creating a dedicated archival database in the old database character set, and moving the read-only tablespaces one-by-one to this new database using the transportable tablespace feature. Then, the original database can be migrated. The read-only data in the archival database can be accessed through a database link, which will take care of the necessary character set conversion when the data is retrieved.

Tables can be stored in tablespaces or data files that have been put into the offline state. Data in such tables are not accessible. You can neither scan the data for migration issues nor actually convert it. Offline tablespaces and data files containing tables with character columns must be brought online before the character set migration process can begin.

Character Set Migration: Downtime

The downtime window is the time when all applications accessing the production database are shut down and the database is accessible only to migration tools. A downtime window is required to perform the actual conversion of the database. Scanning for issues and cleansing them may be performed, to a great extent, parallel to standard production work, though preferably during off-peak hours.

Keep the downtime windows as short as possible during the migration process. In addition to the actual time that the migration takes, consider the possibility of something going wrong, and how long resolving the problem might take. The tolerated downtime can be anything from a matter of hours for mission-critical applications to several days for a common data store.

The migration methods usually offer some flexibility regarding the accepted time between the last scan for data issues and the actual conversion of the database. If the last scan is performed in a production database before the downtime window starts, the downtime window is smaller but there is some risk that new issues might be introduced in the time between this last scan and the downtime window. By analyzing the downtime requirements and the database access patterns, you can find the balance between the risk of introducing data integrity issues and the inaccessibility of the database for production use.

The DMU offers the flexibility to scan only selected tables, which enables you to scan only those tables in the downtime window for which the risk of introducing new issues is significant.

Character Set Migration: Failure Recovery

When you perform a migration, the process may abruptly terminate for many reasons, such as a software defect or hardware malfunction. In that case, the database could be left in an inconsistent state. If the migration utility is not able to resume the migration, you may need to bring your database back to a consistent state, usually by recovering from a backup or using the Flashback Database feature. Dealing with failures during the migration process is a necessary part of your migration planning.

See Also:

Oracle Database Backup and Recovery Advanced User's Guide for more information about how to recover a database

Character Set Migration: Application Impact

The character set migratio n of a database, especially from a single-byte to a multibyte character set, may have a significant impact on applications using this database. The following could affect the application:

If table columns are lengthened or migrated to other data types to solve length expansion issues, buffer variables might need to be enlarged in the application code and application logic might need to be adapted to process CLOB data.
An application could make an assumption about maximum byte width of a character code in the database character set.
An application could rely on a specific binary ordering of character values. In Unicode, the ordering of non-ASCII values frequently differs from the ordering in a legacy character set.
To take full advantage of a new multibyte database character set, it might be necessary to adapt single-byte applications to process multibyte text correctly. This is not a trivial change.

All the preceding application issues must usually be handled manually. Application developers have to review and test their applications with a migrated test copy of the production database.