1 Overview of Migrating Databases to Unicode

This chapter discusses the process of database character set migration.

Besides explaining what a database character set is and what it means to migrate one, the chapter discusses the basic issues related to migration and the tools available to help with these tasks.

1.1 What Is a Database Character Set?

A database character set defines a set of characters that can be stored in the database and defines how each character maps to a sequence of bytes stored in memory or on disk to represent it.

A character set is also known in the industry as a character encoding. You can think of a character encoding as a set of characters plus their encoding in bytes.
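
You can check the character set of an existing database by querying the NLS_DATABASE_PARAMETERS view:

    -- Returns the character set used for VARCHAR2, CHAR, LONG, and CLOB data
    SELECT value
      FROM nls_database_parameters
     WHERE parameter = 'NLS_CHARACTERSET';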

A database character set is used for the following:

  • VARCHAR2

  • CHAR

  • LONG

  • CLOB

  • SQL, PL/SQL, and Java stored code in the database

The CLOB data is stored in the database character set only if the character set is single-byte. Otherwise, it is stored in AL16UTF16, which is Unicode UTF-16 encoding in big-endian form, abbreviated as UTF-16BE.

A database character set has either of the following characteristics:

  • Single-byte

    Each character occupies exactly one byte. Single-byte character sets are easier and faster to process but have a limited capacity of 256 codes, which is too few even for all European languages.

  • Multibyte

    Each character may be 1, 2, 3, or 4 bytes. The maximum depends on the particular character set. Multibyte character sets require more complex string processing but can accommodate thousands of characters.

Almost all character sets other than Unicode, which are frequently called legacy character sets, were designed for a specific language or group of languages. For single-byte character sets, this design was a logical consequence of the limited number of available codes. For multibyte character sets, the reasons were the limited requirements of isolated local databases, the desire to limit the maximum required width of a character code, or the expertise required to design a character set for multiple unrelated language groups.

If character set A contains or defines all characters of set B plus some other characters, then A is a superset of B, and B is a subset of A. If A contains or defines all characters of B, and if each character has an identical byte representation in both A and B, then A is a binary superset of B, and B is a binary subset of A. A binary subset or superset is also called a strict subset or superset.

When character data is transmitted from an environment where it is encoded in one character set to an environment where it is encoded in another character set, for example, from one database to another or from client to server, the data must undergo character set conversion: the code of a character in the source character set must be replaced with the code of the same character in the target character set.

Characters that are not available in the target character set are converted to replacement characters. The replacement of a given source character may be defined explicitly in the target character set. For example, ä (a with an umlaut) is replaced by a when converted to the character set US7ASCII. If no explicit replacement character is defined, then the default replacement character of the target character set is used. This character is usually the question mark ? or the inverted question mark ¿. The default replacement character in Unicode character sets is the character U+FFFD.
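
You can observe this replacement behavior with the SQL CONVERT function. The following sketch assumes a WE8ISO8859P1 source string containing ä; the exact output depends on the release and client configuration:

    -- ä has an explicit replacement (a) in US7ASCII; characters without
    -- a defined replacement become the default replacement character ?
    SELECT CONVERT('Bär', 'US7ASCII', 'WE8ISO8859P1') AS converted
      FROM dual;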

By definition, if data is converted from a character set to its superset, no replacement characters are used because all characters are available in the superset. Also, if data is converted from a character set to its binary (strict) superset, no character codes actually change. Oracle Database usually performs the conversion even to a binary superset. Consequently, if the text to be converted contains byte sequences that do not form valid character codes in the source character set, then the database converts these sequences to the replacement characters in the target character set.

1.2 Introduction to Character Set Migration

A database defined in a legacy character set can store data only in the languages determined by its character set. To store data in other languages, the character set must be changed to a superset that defines both the existing and new required characters.

You may have various reasons to migrate character sets, including the following:

  • Internationalization of operations

  • Moving to an online sales model

  • Consolidating databases of international subsidiaries or acquired companies

If the superset is a binary superset, then the database character set declaration must change only in metadata (schema information) in the data dictionary. Existing data remains valid in the binary superset. The change is simple and fast. If the identified superset is not a binary superset, however, then binary character codes (byte representation) must be re-encoded. The representation in the old character set must be converted to the representation in the new character set.

Even if the old database character set is not a binary subset of the new character set, many characters may have the same binary codes. In particular, standard ASCII characters with codes between 0 and 127 form a binary subset of every Oracle database character set supported on ASCII-based platforms (that is, all platforms except EBCDIC-based IBM z/OS and Fujitsu BS2000). For example, if WE8ISO8859P1 must change to AL32UTF8 (which are not in a binary subset-superset relationship), and if all characters have codes between 0 and 127, then no conversion is necessary. All characters will have the same codes after the change.
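
You can verify this with the SQL DUMP function, which displays the stored byte codes of a value. In the following sketch, the pure ASCII literal produces the same byte codes whether the database character set is WE8ISO8859P1 or AL32UTF8:

    -- Format 1016 prints the bytes in hexadecimal together with the
    -- character set name; 'Abc' yields 41,62,63 in any ASCII-based set
    SELECT DUMP('Abc', 1016) FROM dual;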

Database character data that needs no conversion under a given character set change is called changeless. Data that requires conversion is called convertible. You can save significant resources by skipping the conversion of changeless data.

When character codes are converted, issues may arise. Most commonly, the new representation may have more bytes, causing the converted value to no longer satisfy the containing column's length constraint or even the maximum length of its data type. For conversion to succeed, you must identify and clean up these issues. For example, you may need to lengthen columns, shorten text, or migrate to a larger data type.
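
For example, lengthening a column, or switching it from byte to character length semantics, is often a single ALTER TABLE statement. The table and column names below are hypothetical:

    -- Allow up to 50 characters regardless of their byte width
    -- (subject to the overall byte limit of the data type)
    ALTER TABLE customers MODIFY (last_name VARCHAR2(50 CHAR));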

Character set migration is a process that includes the following tasks:

  • Identifying the suitable superset

  • Classifying character data as changeless or convertible

  • Checking for issues and conversion requirements

  • Fixing any issues

  • Changing the declaration in the data dictionary and converting data as needed

You can identify the issues and the subset/superset relationship of existing data in the database with the help of the Database Character Set Scanner utility or the Database Migration Assistant for Unicode (DMU).

You can perform an A-to-B migration, where data is copied from a source database into a new database defined with the target character set. Alternatively, you can perform an inline migration, where data is converted inside a database and the character set information is changed.

The character set migration is distinct from the repair process of updating the database character set declaration to match the encoding of data in the database. In certain incorrect configurations of Oracle client software, known as pass-through or garbage-in garbage-out configurations, the character data in the database may be interpreted by applications in a different character set from the one declared for the database. The repair goal is to match the database character set declaration with the expectations of the applications. This type of repair does not involve conversion of character codes.

See Also:

Oracle Database Globalization Support Guide to learn more about the Database Character Set Scanner utility

1.3 Why Unicode Is the Right Choice

Oracle recommends migrating to Unicode for its universality and compatibility with contemporary and future technologies and language requirements.

An important step in the character set migration process is gathering requirements for what you need to achieve. A typical situation is a company that is expanding worldwide and must support data storage of languages unavailable with the existing character set. Because many legacy systems initially supported only one or a few languages, the original character set might have had a limited repertoire of supported characters. Common cases were the United States, where 7-bit ASCII was sufficient for English-only data, and Europe, where a variety of 8-bit character sets supported European languages plus English. These choices were reasonable and provided a balance of economy and performance.

Today, a limited character set can be a barrier to supporting the global market, thus necessitating character set migration. A common requirement is to consolidate multiple servers with differing character sets into a single instance. In this case, Unicode is the logical standard.

The character set defined in the Unicode standard supports all contemporary written languages with significant use and a few historical scripts. It also supports symbols such as those used in technical, scientific, and musical notations. Unicode is the native or recommended character set of many technologies, such as Java, Windows, HTML, or XML. There is no character set more universal than Unicode. Adoption of Unicode is increasing rapidly with great support from within the industry.

Oracle supports two encodings of Unicode as the database character set:

  • UTF-8 through the AL32UTF8 character set

    UTF-8 is a multibyte variable-width Unicode encoding that uses 1 to 4 bytes per character. AL32UTF8 encodes ASCII characters in 1 byte, characters from European and Middle Eastern languages in 2 bytes, and characters from South and East Asian languages in 3 bytes. Therefore, the storage requirements of Unicode are usually higher than those of a legacy character set for the same language. This holds for all languages except those requiring only ASCII (English, Dutch, Indonesian, Swahili, and so on).

  • CESU-8 through the UTF8 character set

    CESU-8 is a compatibility-only encoding, discouraged for information exchange by the Unicode standard. CESU-8 is similar to UTF-8 except that supplementary characters, which UTF-8 encodes in 4 bytes, are encoded by CESU-8 as pairs of 3-byte codes. Oracle Database has deprecated the UTF8 character set, so do not use it as a database character set except when it is explicitly required by an application, such as Oracle E-Business Suite Release 11i.

Note:

A hyphen appears in the name of a Unicode encoding (for example, UTF-8) but not in the name of the corresponding Oracle character set (for example, UTF8). This convention is used throughout Oracle documentation.

When looking for a superset to satisfy business requirements, you might find that a legacy character set fulfills those requirements. This can happen, for example, if the database is US7ASCII storing English, which can also be stored in any other Oracle database character set, and the new languages to be added to the database all belong to a single group of compatible languages supported by some legacy character set. For example, all the new languages are Western European and can be stored in WE8MSWIN1252. Another example is when only a few new characters must be stored in the database, such as the Euro sign and smart quotes (" "), and the new characters are supported by an extended variant of the existing database character set. An example of such an extended variant is WE8MSWIN1252, which is a single-byte binary superset of WE8ISO8859P1.

If more than one language group is to be stored, then Unicode is the only choice.

Migration from one legacy character set to another legacy character set should be considered only if all of the following statements are true for the environment to be migrated:

  • The new character set is a binary superset of the old character set, requiring only a simple declaration change, and no data conversion.

  • The new character set is single-byte, thus offering better performance and more compact storage than Unicode.

  • No language requirements are expected in the foreseeable future that would be incompatible with the new character set.

  • Migration to AL32UTF8 would require significant resources due to availability and/or compatibility requirements.

Two more reasons to consider when choosing the new character set are:

  • If the database grows fast, a conversion in the future will be much more expensive than today. Therefore, as Unicode seems to be the ultimate goal in the industry, it is best to migrate to it as soon as possible.

  • If the application is an Internet application, or it uses Java or another Unicode-based technology, then Unicode as the database character set provides a more consistent, safer, and more efficient environment: there is no risk of losing data because the system cannot store characters that may unexpectedly come from the application, and character set conversions are limited to efficient transformations between Unicode character encodings.

Another Unicode encoding, UTF-16BE, named AL16UTF16 by Oracle Database, is used as the national character set for NCHAR, NVARCHAR2, and NCLOB data types. Data requiring the Unicode character set may be stored in columns of those data types, even if the database character set is not migrated to AL32UTF8. But those data types have some restrictions, for example, they are not supported by Oracle Text and require changes to client API calls (Oracle Call Interface (OCI) or Java Database Connectivity (JDBC)). Oracle does not recommend the use of national character set data types, and NCHAR data types are not discussed in this guide.

See Also:

Oracle Database Globalization Support Guide for more coverage of NCHAR

1.4 About Character Set Migration Tools

Oracle Database provides various tools to help you change the database character set of an existing database.

1.4.1 Identification of Migration Issues with the Database Character Set Scanner (CSSCAN)

The Database Character Set Scanner (CSSCAN) is a command-line utility that assesses the feasibility of migrating an Oracle database to a new database character set.

Note:

The Database Character Set Scanner (CSSCAN) is not available in Oracle Database 12c.

The Database Character Set Scanner checks all character data in the database and tests for the effects and problems of changing the character set encoding. A summary report is generated at the end of the scan that shows the scope of work required to convert the database to a new character set.

The Database Character Set Scanner is available with all database editions except the Express Edition. Most of its scanning functionality has been included in the Database Migration Assistant for Unicode, discussed in this guide, but the Database Character Set Scanner must still be used if any of the following is true:

  • Migration is to a legacy character set.

  • The national (NCHAR) character set is being migrated.

  • The scan results are required by the CSALTER script.

1.4.2 Changing Character Set Metadata Using the CSALTER Script

The CSALTER script changes character set metadata in the data dictionary to the new character set. If the character set changes from single-byte to multibyte (or vice versa), then the script converts CLOB column data in the data dictionary tables.

Note:

The CSALTER script is part of the Database Character Set Scanner utility. CSALTER is not available in Oracle Database 12c.

The CSALTER script does not limit the data dictionary to SYS tables. The script also treats most schemas of the default template databases, which are created during database installation, as part of the data dictionary.

You can run CSALTER only if no character data in the database, except CLOB data in the data dictionary tables, needs re-encoding to remain valid in the new character set. To ensure the integrity of the database, CSALTER requires that you confirm that no data requires conversion by running the Database Character Set Scanner. If the scan results contain information about "exceptional" data anywhere in the database or "convertible" data outside of the data dictionary CLOB columns, then CSALTER fails with an error.
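
On releases where both tools ship, the classic sequence is a full scan followed by the script. The parameter values below are only illustrative:

    -- At the OS prompt, scan the whole database for conversion to AL32UTF8
    csscan "sys AS SYSDBA" FULL=Y TOCHAR=AL32UTF8 ARRAY=1024000 PROCESS=2

    -- If the scan reports no convertible or exceptional data outside the
    -- data dictionary CLOB columns, run CSALTER in SQL*Plus
    @?/rdbms/admin/csalter.plb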

CLOB columns in the data dictionary are treated specially because AL16UTF16, which stores CLOB data in multibyte databases, is never a binary superset or subset of any single-byte character set. Even if the target character set is a superset, CLOB data must be converted. Because data dictionary tables cannot be dropped or truncated, no single-byte database character set could be changed to any multibyte character set without CSALTER converting the data dictionary CLOB data.

The CLOB columns in user tables are not converted by CSALTER; the CSALTER script will not run if CLOB data is in user tables and the character set change is from single-byte to multibyte.

Most CSALTER functionality has been included in the Database Migration Assistant for Unicode, but you must still use CSALTER in the following situations:

  • You migrate to a legacy character set.

  • The national (NCHAR) character set is being migrated.

1.4.3 Converting Character Data Using the Export/Import and Data Pump Utilities

When the Database Character Set Scanner finds data requiring conversion, the CSALTER script cannot change the database character set. The Export/Import and Data Pump utilities solve this problem.

A common way to convert character data is to use the Export utility to export database objects, and then the Import utility to import them into the same or another database. If the database character set differs between export and import, then the Import utility automatically converts the data as required. The advantage of this technique, as opposed to techniques such as spooling data and loading it back with SQL*Loader, is that it automatically takes care of most steps needed to re-create exported objects in their original form.

See Also:

Oracle Database Globalization Support Guide for information about how to change the database character set using a full export and import process

1.4.3.1 Full Versus Selective Export/Import

You can migrate character sets either through full database export and import, or through selective export and import.

The migration techniques are as follows:

  • Migration through full database export and import

    In this technique, you first create an empty database using the target character set. Afterward, you export the contents of the source database, and then import it into the target database. The created copy of the original database has all character data converted as required. You drop the source database, making the copy the new production database.

  • Migration through selective export and import

    In this technique, you export only tables identified by the Database Character Set Scanner utility as containing convertible data, and then truncate them. You run the Database Character Set Scanner again. This time, the scan finds no convertible data. You can now change the database character set by executing CSALTER. Finally, you import the previously exported data back into the database, undergoing the required character set conversion.
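
The selective technique can be sketched as follows, using the original Export/Import utilities and hypothetical table names:

    -- 1. At the OS prompt, export the tables reported as convertible
    exp system TABLES=(hr.employees,hr.job_history) FILE=conv.dmp

    -- 2. In SQL*Plus, truncate the exported tables, then re-run the
    --    scanner to confirm that no convertible data remains
    TRUNCATE TABLE hr.employees;
    TRUNCATE TABLE hr.job_history;

    -- 3. Change the database character set with CSALTER

    -- 4. At the OS prompt, import the data back; it is converted during
    --    the import (IGNORE=Y inserts into the existing, truncated tables)
    imp system FILE=conv.dmp FULL=Y IGNORE=Y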

The following circumstances favor the full export/import method:

  • The data dictionary of the database contains convertible data. For example, table and/or column names (identifiers) contain non-ASCII characters requiring conversion. As data dictionary tables cannot be truncated, the selective export/import approach cannot be used.

  • The number of tables requiring conversion is high, making manual selection of tables to export and truncate a costly task.

  • The character set migration process is combined with database upgrade to a higher release and/or migration to another platform.

  • A lot of tables in the database are fragmented and the character set migration process is an opportunity to perform defragmentation and cleanup.

The following circumstances favor the selective export/import method:

  • The full export/import converts all text in the database, more than may be needed. It also must insert all non-textual data. Therefore, the full export/import method is usually much too slow for the short maintenance windows of large production databases.

  • The selective export/import method requires temporary disk space for the exported tables only. The full method requires temporary storage for the database copy and for the full database export file. This amounts to additional space almost twice the size of the source database.

1.4.3.2 Data Pump Utility

The Export and Import utilities come in two versions: original Export (exp) and Import (imp), and Data Pump Export (expdp) and Import (impdp).

The Data Pump versions were introduced in Oracle Database Release 10g. The original versions are desupported for general use as of Oracle Database Release 11g. You can use both versions for migration.

The original utilities are not compatible with Data Pump utilities. You cannot import files produced by exp using impdp, and you cannot import files produced by expdp using imp.

Original Export and Import are client-side utilities. Export retrieves data over an Oracle Net connection and creates the export file on the client. Data Pump Export contains a small client-control utility, but the export is performed by a database server process and the export file is produced on the server host. Similarly, original Import reads the export file on the client and inserts data over Oracle Net into the target database, whereas Data Pump reads the export file directly on the server host. Its architecture makes the Data Pump export and import process much faster, but if the export file must be created or read on a host other than the source, then the target system must have a dummy database installed.
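
The following sketch illustrates the server-side file handling of Data Pump. The directory object name and path are hypothetical:

    -- Data Pump reads and writes files through a directory object
    -- on the database server host
    CREATE DIRECTORY dp_dir AS '/u01/app/oracle/export';

    -- At the OS prompt: export from the source database, then import
    -- into the new database created with the target character set
    expdp system FULL=Y DIRECTORY=dp_dir DUMPFILE=full.dmp LOGFILE=exp.log
    impdp system FULL=Y DIRECTORY=dp_dir DUMPFILE=full.dmp LOGFILE=imp.log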

See Also:

Oracle Database Utilities for detailed information about the original and Data Pump Export/Import utilities

1.4.3.3 Conversion Issues

Conversion is associated with possible issues, mainly involving length problems.

Those issues should have been identified by the Database Character Set Scanner and handled by the user before the import. Otherwise, in both techniques discussed, the import might fail.

When using the Export and Import utilities, you must consider restrictions regarding objects not supported by the given utility version. For example, Oracle Data Pump Release 10g does not support exporting an XML schema. You might need to move such objects manually.

See Also:

Oracle Database Globalization Support Guide for more information about the database migration process with Export/Import utilities

1.4.4 Migrating a Database Using the Database Migration Assistant for Unicode

The Database Migration Assistant for Unicode (DMU) offers an intuitive and user-friendly GUI that helps you streamline the migration process.

The DMU interface minimizes the workload and ensures that all migration issues are addressed. The DMU guarantees that the data conversion is carried out correctly and efficiently.

Advantages of the DMU include the following:

  • Guides you through the workflow

    An important advantage of the DMU is that it offers a logical workflow to guide you through the entire process of migrating character sets.

  • Offers suggestions for handling certain problems

    The DMU can help you when you run into certain problems, such as errors or failures during the scanning or cleansing of the data.

  • Supports selective conversion of data

    The DMU enables you to convert only the data that must be converted, at the table, column, and row level.

  • Offers progress monitoring

    The DMU provides a GUI to visualize how the steps are progressing.

  • Offers interactive visualization features

    The DMU enables you to analyze data and see the results in the GUI in an interactive way. It also enables you to see the data itself in the GUI and cleanse it interactively from identified migration issues.

  • Provides the only supported tool for inline conversion

    With the DMU, Oracle Database supports inline conversion of database contents. This offers performance and security advantages over other existing conversion methods.

  • Allows cleansing actions to be scheduled for later execution during the conversion step

    Postponing cleansing actions, such as data type migration, ensures that the production database and applications are not affected until the actual migration downtime window.

This release of the Database Migration Assistant has a few restrictions with respect to what databases it can convert. For example, in most cases, it does not convert databases with convertible data in the data dictionary. The export/import migration methods could be used to overcome these limitations.

See Also:

Performing Basic DMU Tasks for more information

1.5 Overview of Character Set Migration Considerations

Before migrating your current character set using the DMU, consider data integrity, dependent objects, downtime and failure recovery, and several other important issues.

1.5.1 Character Set Migration: Data Integrity

Several data integrity issues might arise from re-encoding character data, including data expansion, invalid binary representation, and several others.

1.5.1.1 Data Expansion

When you migrate from a legacy encoding to Unicode, character values may expand because their encodings have more bytes, resulting in increased space consumption.

Also, the declared widths of CHAR and VARCHAR2 columns may become insufficient after the migration, creating a risk of truncated data. The column length constraints must be increased, or, if the constraints are already at the data type limit, the columns might need to be migrated to the CLOB type.
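
As a starting point for assessing this risk, you can list the columns that still use byte length semantics with a query like the following sketch (adjust the owner filter for your database):

    -- CHAR_USED = 'B' means the length constraint is expressed in bytes
    SELECT owner, table_name, column_name, data_length
      FROM dba_tab_columns
     WHERE data_type IN ('CHAR', 'VARCHAR2')
       AND char_used = 'B'
       AND owner NOT IN ('SYS', 'SYSTEM');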

Non-ASCII object names raise a related issue. Typically, these names are limited to 30 bytes in releases before Oracle Database 12c Release 2 (12.2). Because the identifier length limit cannot be increased in those releases, you must manually shorten the names of affected objects, or change them to use only ASCII characters, in both the database and the affected applications.

Note:

Starting with Oracle Database 12c Release 2 (12.2), the maximum limit for an identifier is 128 bytes. For the earlier Oracle Database versions, the maximum limit for an identifier is 30 bytes.

1.5.1.2 Invalid Binary Storage Representation of Data

Column data may not be in the declared database character set: it may be binary, in another character set, or in multiple character sets.

Examples include column data that is composed of images, documents in proprietary word processor formats, or text encrypted with custom methods.

In a pass-through configuration, the client character set is defined (usually through the NLS_LANG client setting) as equal to the database character set. The Oracle communication protocol does not attempt client/server character set conversion because conversion is assumed to be unnecessary. This configuration is correct when data from a client application is in the declared character set. But because no character set conversion or character validation is performed, any sequence of bytes can be stored and retrieved unchanged. If only the client application interprets the data, then the stored byte sequences could correspond to a character encoding different from the database character set, or not correspond to readable text at all.
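
For example, a pass-through setup on a client host typically looks like the following sketch, in which the declared client character set merely repeats the database character set:

    # Pass-through: the client declares the same character set as the
    # database, so no client/server conversion occurs and byte sequences
    # are stored and retrieved unchanged (Unix shell syntax shown)
    export NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1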

If default character set conversion is applied to such incorrect data during migration, some of the data does not survive the conversion, and produces garbage. This results from the default application of a code mapping from the old database character set to the new character set to convert the stored sequences of bytes. This mapping is certainly incorrect for binary data, which should not change at all, and usually incorrect for data interpreted by applications as being in a character set different from the old database character set.

To avoid the problem, you can do a few things. One is to change the data type of the column to RAW or LONG RAW. This approach is recommended for binary data. Another possibility is to skip conversion of a column with suspected problems and let the application continue using the pass-through configuration after updating the client character set declaration to the new database character set. This approach is generally not recommended but might be the only choice if data in a single column is stored in multiple actual character sets and discovering these character sets automatically is not possible. Finally, the DMU enables you to declare the actual character set of any column as being different from the source database character set. In this way, you can tell the DMU how to perform the conversion according to the correct code mapping for the data in the column.

1.5.1.3 Partitioning

To distribute rows, range partitioning uses the binary storage representation of character values to compare key values with boundary values. When converted to Unicode, the binary representation of these values could change, failing to preserve the ordering.

This lack of order preservation could lead to three problems:

  1. A given partitioning key might not sort between the lower and the upper boundaries of the partition that the key is in.

  2. Partitions are internally numbered in the same order as the order of their boundary values. If the boundary values change their sort order during conversion to Unicode, the internal numbering would become invalid.

  3. In rare cases, two partition boundaries could become equal because in some multibyte Eastern Asian character sets two different characters may map to the same Unicode character. Unicode does not separately encode certain shape variants of certain Chinese characters, which are encoded in legacy character sets.

A partitioned table might need to be re-created using modified DDL if its partitioning boundaries use non-ASCII characters. If only the partitioning key values are affected, then rows must be moved between partitions, so row movement must be enabled for the table, as shown below.
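
Row movement is enabled with a single statement; the table name below is hypothetical:

    -- Permit rows to migrate between partitions when key values change
    ALTER TABLE sales ENABLE ROW MOVEMENT;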

1.5.1.4 Maximum Index Key Size

In some cases, a large index key might exceed the maximum supported size.

This situation can occur because the index key size is calculated using the maximum lengths of the key columns, which are limited by the database block size. To ensure that expanded data can reside in a column after conversion to Unicode, users may try to expand the column byte length or alter the column length semantics from BYTE to CHAR. If an index was created on such a column, then the resulting larger index key might exceed the maximum.

You can try to solve this problem by:

  • Increasing the block size

  • Dropping the index

  • Removing some columns from a composite index

1.5.1.5 Unique Keys and Primary Keys

Uniqueness of the key values of a unique or primary constraint might be violated after conversion to Unicode, even though the preconversion key values satisfy the constraint.

The situations are as follows:

  • Two different text values might become the same value after conversion to AL32UTF8 because characters in some character sets map to the same Unicode code point. For example, two JA16SJIS code points 0xED40 and 0xFA5C map to U+7E8A.

  • Two text values differing only at their ends might become the same if, during conversion, they were truncated because of their post-conversion length exceeding the maximum column length.

Usually, the key values must be modified manually to solve these issues.

1.5.1.6 Derived or Encrypted Data

Applications might store information derived from character data that depends on its binary storage representation. Such derived values must be resynchronized after the migration.

For example, an application might store the length of a text value in a separate numerical column. If the text expands, then this length column is not updated automatically. Alternatively, an application might calculate and store a checksum or an encrypted form of a character value. If the checksum or cipher text is based on the binary storage representation in the database character set, then it becomes invalid when the representation changes because of character set conversion.
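
After the conversion, such derived values must be recomputed. A minimal sketch, assuming a hypothetical DOCUMENTS table in which the BODY_LEN column stores the byte length of the BODY column:

    -- Recompute stored byte lengths after the character set conversion
    UPDATE documents SET body_len = LENGTHB(body);
    COMMIT;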

For encrypted text, you must decrypt it just before the conversion, convert it, and then immediately re-encrypt it. Keep the time during which the text remains decrypted to a minimum; the database server might require extra protection to prevent a security breach.

Oracle Transparent Data Encryption functionality is generally transparent for the migration process. Manual decryption and re-encryption are not needed.

1.5.1.7 Character Data Stored in Binary Data Types

Applications might store character data in binary data types such as RAW, LONG RAW, and BLOB, and then use the database character set to interpret and process it. If the database character set is migrated but the data in the binary types is not re-encoded to match the new character set, then the applications fail.

Automatic migration tools such as the DMU have no information about which binary columns contain character data and might need character set conversion. Furthermore, character data may be intermixed with binary data in binary structures, so that simple conversion of entire column values is not possible. If required, you must manually locate and convert character data in binary data types.

1.5.2 Character Set Migration: Dependent Objects

Converting table data affects all objects that store a copy of it. Depending on the conversion method, you may need to drop and re-create dependent objects, or they will be synchronized automatically.

Dependent objects include indexes (standard, functional, and domain) and materialized views. Instead of relying on automatic synchronization, you may improve performance by dropping the dependent objects and re-creating them after the migration.

Triggers are also affected because they might react to modifications in a way that is not acceptable for business requirements. For example, a business rule requires that only user-initiated modifications are audited, but a trigger would set the last modification date and the modifying user ID to the date and ID of the migration process. Therefore, you must disable triggers before converting the underlying table.
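
For example, assuming a hypothetical ORDERS table, you can disable its triggers and drop a dependent index before the conversion, and then restore both afterward:

    ALTER TABLE orders DISABLE ALL TRIGGERS;
    DROP INDEX orders_cust_name_ix;

    -- ... convert the table data ...

    CREATE INDEX orders_cust_name_ix ON orders (customer_name);
    ALTER TABLE orders ENABLE ALL TRIGGERS;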

Most of the preceding issues are automatically handled by the DMU.

1.5.3 Character Set Migration: Read-Only and Inaccessible Objects

Read-only tables cannot be converted before they are made read/write.

Read-only tables are tables explicitly marked as read-only, external tables, and tables stored in read-only tablespaces. Except for the explicit marking of a table as read-only, which the migration process can temporarily remove, you must manage the other cases manually.

External tables do not usually require conversion. If the character set of the source file for an external table (SQL*Loader or Data Pump) is properly declared, then the database automatically converts character data to the new database character set when data is fetched from the file. To improve performance, consider converting the file permanently to avoid this conversion. If the character set of the source file is not defined correctly, and if the file content can be fetched only due to a pass-through configuration, similar to what is described in "Invalid Binary Storage Representation of Data", then you must correct the character set declaration of the file. For Data Pump files, this is not a trivial task.

Read-only tablespaces may be stored on read-only media. If convertible data is present in such a tablespace, then perform the following steps:

  1. Move the tablespace to standard disk storage.

  2. Make the tablespace read/write.

  3. Convert the database.

  4. Make the tablespace read-only again.

  5. Copy it back to the read-only media.
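
Steps 2 through 4 correspond to SQL statements like the following; the tablespace name is hypothetical:

    ALTER TABLESPACE hist_data READ WRITE;

    -- ... convert the database ...

    ALTER TABLESPACE hist_data READ ONLY;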

Many such tablespaces may exist in a database. Making them read/write before conversion may not be feasible due to disk storage requirements. In this case, consider the following strategy:

  1. Create a dedicated archival database in the old database character set.

  2. Move the read-only tablespaces one-by-one to the new database using the transportable tablespace feature.

  3. Migrate the original database.

  4. Access the read-only data in the archival database through a database link. This takes care of the necessary character set conversion when the data is retrieved.

Tables may be stored in tablespaces or data files that have been put into the offline state. Data in such tables is not accessible. You can neither scan the data for migration issues nor actually convert it. Before the character set migration can begin, you must bring the offline tablespaces and data files containing tables with character columns online.
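
You can identify and bring such storage online with statements like the following; the tablespace name is hypothetical:

    -- Find tablespaces that are not fully accessible
    SELECT tablespace_name, status
      FROM dba_tablespaces
     WHERE status = 'OFFLINE';

    ALTER TABLESPACE archive_2019 ONLINE;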

1.5.4 Character Set Migration: Downtime

A downtime window is required to perform the conversion of the database.

In the downtime window, applications accessing the production database are shut down, and the database is accessible only to migration tools. You can scan for issues and cleanse them in parallel with standard production work. Preferably, the downtime window occurs during off-peak hours.

Keep the downtime window as short as possible during migration. Consider the possibility of something going wrong, and how long resolving the problem might take. The tolerated downtime can be anything from a matter of hours for mission-critical applications to several days for a common data store.

The migration methods usually offer some flexibility regarding the accepted time between the last scan for data issues and the conversion of the database. If the last scan is performed on the production database before the downtime window starts, then the downtime window can be shorter. However, there is some risk that new issues might be introduced between this last scan and the downtime window. By analyzing the downtime requirements and the database access patterns, you can find the balance between the risk of introducing data integrity issues and the inaccessibility of the database for production use.

The DMU offers the flexibility to scan only selected tables. This feature enables you to scan only those tables in the downtime window for which the risk of introducing new issues is significant.

1.5.5 Character Set Migration: Failure Recovery

When you perform a migration, the process may abruptly terminate, leaving the database in an inconsistent state. Dealing with failures during the migration process is a necessary part of migration planning.

Many possible events may trigger termination, including a software defect or hardware malfunction. If the DMU cannot resume the migration, then you may need to bring your database back to a consistent state. Typical responses are recovering from a backup or using Oracle Flashback Database.

See Also:

Oracle Database Backup and Recovery User’s Guide for more information about how to recover a database

1.5.6 Character Set Migration: Application Impact

The character set migration of a database, especially from a single-byte to a multibyte character set, may have a significant effect on applications.

The following issues could affect the application:

  • If table columns are lengthened or migrated to other data types to solve length expansion issues, then you may need to enlarge buffer variables in the application code and adapt application logic to process CLOB data.

  • An application could make an assumption about the maximum byte width of a character code in the database character set.

  • An application could rely on a specific binary ordering of character values. In Unicode, the ordering of non-ASCII values frequently differs from the ordering in a legacy character set.

  • To take full advantage of a new multibyte database character set, you may need to adapt single-byte applications to process multibyte text correctly. This is not a trivial change.

You must handle the preceding application issues manually. Application developers must review and test their applications with a migrated test copy of the production database.