Oracle® Database Migration Assistant for Unicode Guide
A type of character set migration. In an A to B conversion, you create a new Unicode database and move all data from the existing database to the new one using the Export and Import utilities or Data Pump.
See inline conversion.
A wizard that enables you to perform three things in bulk:
Alter the length semantics of VARCHAR2 table columns from byte semantics to character semantics
You can perform the bulk cleansing for migrating to character length semantics in either immediate or scheduled cleansing mode. In the scheduled mode, the character length semantics migration takes place during the conversion phase, without any interruption to the production environment. When the DMU is used in validation mode, only the immediate cleansing option is available.
Set the "Allow Conversion of Data with Issues" column conversion property
If the scan results of the column show that some values have conversion issues, that is, the converted values will contain replacement characters or they will be truncated, you can still let the DMU convert the data by setting this property to Yes. This might be useful if you want to automatically truncate data that exceeds the column or data type limit or you are not concerned about a few corrupted values that happen to exist in the column but are of no real significance for applications that use this data.
Replace or remove occurrences of a byte or character pattern in column values
In some cases, after analyzing the nature of individual data exceptions, you may decide to modify the data by replacing the offending bytes or characters with a set of valid equivalent bytes or characters or removing them altogether. If the same problem symptom exists in multiple database objects, it may be desirable to perform such replacement operation in bulk. The pattern-based replacement cleansing option allows you to replace all occurrences of a source pattern with a target pattern for the selected database objects. You can perform the replacement for single bytes or characters or specify substrings of bytes or characters.
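The replacement operation described above can be sketched in a few lines of Python. This is an illustration only, not the DMU implementation; the replace_pattern helper and the sample byte values are hypothetical.

```python
# Hypothetical sketch of pattern-based replacement cleansing: every
# occurrence of a source byte pattern in the column values is replaced
# with a target pattern before conversion. Here the Windows-1252 en dash
# byte 0x96, invalid in many legacy character sets, is replaced in bulk
# with a plain ASCII hyphen.
def replace_pattern(value: bytes, source: bytes, target: bytes) -> bytes:
    """Replace all occurrences of a source byte pattern with a target pattern."""
    return value.replace(source, target)

rows = [b"2001\x962002", b"pages 10\x9620"]
cleansed = [replace_pattern(r, b"\x96", b"-") for r in rows]
print(cleansed)  # [b'2001-2002', b'pages 10-20']
```

The same idea extends to character-based patterns by operating on decoded strings instead of raw bytes.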
Character semantics is useful for defining the storage requirements for multibyte strings of varying widths. For example, in a Unicode database (AL32UTF8), suppose that you need to define a
VARCHAR2 column that can store up to five Chinese characters together with five English characters. Using byte semantics, this column requires 15 bytes for the Chinese characters, which are three bytes each, and 5 bytes for the English characters, which are one byte each, for a total of 20 bytes. Using character semantics, the column is simply defined as 10 characters. Character semantics allows the specification of the column size in terms of characters, independent of the database character set.
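The arithmetic above can be verified with standard UTF-8 encoding, which AL32UTF8 implements. This is a Python illustration; the sample string is arbitrary.

```python
# 5 Chinese characters (3 bytes each in UTF-8) plus 5 English characters
# (1 byte each): 10 characters, but 20 bytes.
value = "数据库迁移" + "abcde"
assert len(value) == 10                     # character semantics: 10
assert len(value.encode("utf-8")) == 20     # byte semantics: 15 + 5 = 20
```

A column declared as VARCHAR2(10 CHAR) therefore accommodates this value, while VARCHAR2(10 BYTE) does not.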
The process of moving from one database character set to another. In this guide, we discuss moving to Unicode.
You can check for and install the latest version of the DMU by using the DMU live update feature. This is done through the Check for Updates wizard, which tells you whether a new version of the DMU is available from the DMU update center.
The process of identifying and resolving the data issues that were found in the scanning process. These issues must be resolved before the database can be converted.
The CSALTER script is part of the Database Character Set Scanner utility. It is the most straightforward way to migrate a character set, but it can be used only if all of the schema data is a strict subset of the new character set.
The CSREPAIR script enables you to repair the database character set metadata when using the pass-through configuration. In the pass-through configuration, the client character set is defined (usually through the
NLS_LANG client setting) to be equal to the database character set, though the character data in your database could be stored in a different character set from the declared database character set. The
CSREPAIR script works with the DMU client by accessing the DMU repository. Note that it only changes the database character set declaration in the data dictionary metadata and does not convert any database data.
The Database Character Set Scanner (
CSSCAN) is a command-line utility that assesses the feasibility of migrating an Oracle database to a new database character set. The Database Character Set Scanner checks all character data in the database and tests for the effects and problems of changing the character set encoding. A summary report is generated at the end of the scan that shows the scope of work required to convert the database to a new character set.
An intuitive and user-friendly GUI tool to migrate your character set. It helps you streamline the migration process through an interface that minimizes the workload and ensures that all migration issues are addressed.
The Database Scan Report contains the output from a scan of the database in a form that enables you to view results in many different ways. You can customize this report as well as filter by the status of the data. As an example, you can filter the report to only display all objects that contain some data that exceeds its data type limit after conversion.
A wizard that enables you to create a diagnostic package for later analysis by Oracle to help resolve any problems that occurred while using the DMU.
When you migrate from a legacy encoding to Unicode, character values are likely to expand in conversion because their encodings require more bytes. This results in increased space requirements for the database. A further issue is that the declared widths of
VARCHAR2 columns may not be sufficient after the character set has been migrated to Unicode. Thus, there is a risk of the data being truncated. The column length constraints have to be increased, or, if they are already at the data type limit, the columns might need to be migrated to the
CLOB data type.
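The expansion effect can be demonstrated with Python's codecs, using WE8ISO8859P1 (ISO Latin-1) as a sample legacy encoding. This is an illustration; the sample value is arbitrary.

```python
# A 13-character value occupies 13 bytes in Latin-1 but 15 bytes in UTF-8,
# because 'ü' and 'ß' each expand from 1 byte to 2 bytes.
legacy_bytes = "Müller-Straße".encode("latin-1")
unicode_bytes = legacy_bytes.decode("latin-1").encode("utf-8")
assert len(legacy_bytes) == 13
assert len(unicode_bytes) == 15
# A VARCHAR2(13 BYTE) column that held the legacy value exactly would be
# two bytes too narrow for the converted value.
```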
External tables do not usually require conversion. If the character set of the source file for an external table (SQL*Loader or Data Pump) is properly declared, the database will automatically convert character data to the new database character set each time the data is fetched from the file. For performance reasons, consider converting the file permanently to avoid this conversion.
A type of migration in which you identify non-ASCII data first, and then only that data and its metadata are updated during the conversion phase of the migration. The number of modified database values is usually only a small percentage of all values stored in the database.
See A to B conversion.
The process of installing required patches and supporting packages, as well as creating a tablespace for the repository, a database connection, and then installing the migration repository.
A common problem with user data is that the data in a column may not actually be in the declared database character set. Instead, it may be in another character set, it may be binary (for example, composed of images, documents in proprietary word processor formats, or text encrypted with custom methods), or there may be multiple character sets in a single column. This problem is possible in a pass-through configuration, in which the client character set is defined (usually through the
NLS_LANG client setting) to be equal to the database character set.
The typical steps, and their order, for migrating the database character set to Unicode.
A repository that contains the objects to be processed, details on data that had an error flagged, the progress of a scan or conversion, and some other items necessary to migrate to Unicode.
In a pass-through configuration, the client character set is defined (usually through the
NLS_LANG client setting) to be equal to the database character set. Because the character sets are equal, the Oracle communication protocol does not attempt any client/server character set conversion, logically assuming that the conversion is not needed. This configuration is correct as long as data coming from a client application is indeed in the declared character set. But because no character set conversion nor character validation is performed on the data, any sequence of bytes can actually be stored in the database and retrieved unchanged. As long as only the client application interprets the data, the stored byte sequences could correspond to a character encoding different from the database character set or they may not correspond to readable text at all.
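The effect can be simulated in Python. This is an illustration only: cp1252 stands in for a database character set declared as WE8MSWIN1252, while the client actually sends UTF-8 bytes.

```python
# The client sends UTF-8 bytes, but the database character set is declared
# as Windows-1252. In a pass-through configuration no conversion happens,
# so the bytes are stored and returned verbatim.
stored = "naïve".encode("utf-8")            # what the client really sends
assert stored.decode("utf-8") == "naïve"    # the same client reads it back fine
# Any tool that trusts the declared character set misinterprets the bytes:
assert stored.decode("cp1252") == "naÃ¯ve"
```

This is why only the original client application can interpret such data correctly, and why CSREPAIR (which fixes only the declaration) can be appropriate when the stored bytes are already in the intended target character set.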
Data preparation is the process of ensuring that no database data to be migrated will cause problems during or after the actual conversion. The elements of data preparation are scanning and cleansing.
Read-only tables and tablespaces cannot be converted unless they are first made read-write. A table can be read-only because it is explicitly marked as read-only, because it is an external table, or because it is stored in a read-only tablespace. Except for the explicit read-only marking, which the migration process can temporarily remove, these conditions must be resolved manually.
A wizard that enables you to manage the information necessary for each step of the migration through the use of a repository. For any database without a migration repository, a repository is automatically created each time you connect to that database. You can also start the wizard from the DMU user interface.
See Migration Repository.
Scanning is the process of assessing the feasibility of migrating the data to Unicode. As part of this process, the DMU reads character values from the database, converts them to the target character set, and counts how many values change in conversion, do not fit into their columns, do not fit into their data types, or contain invalid character codes. Additional statistics, such as the maximum post-conversion length of values in a column, are calculated as well.
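The classification performed during scanning can be sketched as follows. The function and category names here are hypothetical, not the DMU's internal logic; conversion is to AL32UTF8 (standard UTF-8).

```python
REPLACEMENT = "\ufffd"   # Unicode replacement character

def scan_value(raw: bytes, source_cs: str, col_limit_bytes: int,
               type_limit_bytes: int = 4000) -> str:
    """Classify one column value for conversion to UTF-8 (illustrative only)."""
    text = raw.decode(source_cs, errors="replace")
    if REPLACEMENT in text:
        return "invalid representation"      # invalid character codes
    converted = text.encode("utf-8")
    if len(converted) > type_limit_bytes:
        return "exceeds data type limit"
    if len(converted) > col_limit_bytes:
        return "exceeds column limit"
    return "needs no conversion" if converted == raw else "needs conversion"

print(scan_value(b"abc", "latin-1", 10))     # needs no conversion
print(scan_value(b"caf\xe9", "latin-1", 4))  # exceeds column limit
print(scan_value(b"caf\xe9", "ascii", 10))   # invalid representation
```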
A wizard that guides you through the scanning process, and informs you if any data needs cleansing before it can be converted permanently without data loss.
A database character set that is included within a superset character set. For example, ASCII is a subset of Unicode.
A database character set that includes a subset character set. For example, Unicode is a superset of ASCII.
To shorten something by cutting off or removing a part. In the context of data migration, the important consideration is that there is a risk of data being truncated when
VARCHAR2 columns are not sufficiently wide after the migration.
UTF8 (CESU-8) is a varying-width character set, which means that a character can be 1, 2, or 3 bytes long, and supplementary characters are represented as a pair of 3-byte codes, for 6 bytes in total.
AL32UTF8 is a varying-width character set, which means that a character can be 1, 2, 3, or 4 bytes long, and supplementary characters are represented as 4 bytes.
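The difference for a supplementary character can be checked in Python. This is an illustration; because Python has no built-in CESU-8 codec, the CESU-8 form is reconstructed manually from the UTF-16 surrogate pair.

```python
import struct

ch = "\U0001F600"                      # a supplementary (non-BMP) character

# AL32UTF8 = standard UTF-8: one 4-byte sequence.
assert len(ch.encode("utf-8")) == 4

# UTF8 (CESU-8): each UTF-16 surrogate code unit is encoded as a 3-byte
# sequence, giving a pair of 3 bytes (6 bytes in total).
high, low = struct.unpack(">2H", ch.encode("utf-16-be"))
cesu8 = b"".join(chr(unit).encode("utf-8", "surrogatepass")
                 for unit in (high, low))
assert len(cesu8) == 6
```

This 4-byte versus 6-byte difference is one reason AL32UTF8 is the preferred migration target when supplementary characters are expected.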