Glossary

A to B conversion

A type of character set migration. In an A to B conversion, you create a new Unicode database and move all data from the existing database to the new one using the Export and Import utilities or Data Pump.

See inline conversion.

Bulk Cleansing Wizard

A wizard that enables you to perform three things in bulk:

  • alter the length semantics of CHAR and VARCHAR2 table columns from byte semantics to character semantics in bulk

    You can perform the bulk cleansing for migrating to character length semantics in either immediate or scheduled cleansing mode. In the scheduled mode, the character length semantics migration will take place during the conversion phase without any interruption to the production environment. When the DMU is used in validation mode, only the immediate cleansing option will be available

  • Set the "Allow Conversion of Data with Issues" column conversion property

    If the scan results of the column show that some values have conversion issues, that is, the converted values will contain replacement characters or they will be truncated, you can still let the DMU convert the data by setting this property to Yes. This might be useful if you want to automatically truncate data that exceeds the column or data type limit or you are not concerned about a few corrupted values that happen to exist in the column but are of no real significance for applications that use this data.

  • Replace or remove occurrences of a byte or character pattern in column values

    In some cases, after analyzing the nature of individual data exceptions, you may decide to modify the data by replacing the offending bytes or characters with a set of valid equivalent bytes or characters or removing them altogether. If the same problem symptom exists in multiple database objects, it may be desirable to perform such replacement operation in bulk. The pattern-based replacement cleansing option allows you to replace all occurrences of a source pattern with a target pattern for the selected database objects. You can perform the replacement for single bytes or characters or specify substrings of bytes or characters.

character semantics

Character semantics is useful for defining the storage requirements for multibyte strings of varying widths. For example, in a Unicode database (AL32UTF8), suppose that you need to define a VARCHAR2 column that can store up to five Chinese characters along with five English characters. Using byte semantics, this column requires 15 bytes for the Chinese characters, which are three bytes long, and 5 bytes for the English characters, which are one byte long, for a total of 20 bytes. Using character semantics, the column requires 10 characters. Character semantics allows the specification of the column size in terms of characters independent of the database character set.

character set migration

The process of moving from one database character set to another. In this guide, we discuss moving to Unicode.

Check for Updates Wizard

You can check for and install the latest version of the DMU by using the DMU live update feature. This is done with a Check for Updates Wizard, and it enables you to find out if a new version of the DMU is available from the DMU update center.

cleansing

The process of identifying and resolving the data issues that were found in the scanning process. These issues must be resolved before the database can be converted

CSALTER script

The CSALTER script is part of the Database Character Set Scanner utility. It is the most straightforward way to migrate a character set, but it can be used only if all of the schema data is a strict subset of the new character set.

CSREPAIR script

The CSREPAIR script enables you to repair the database character set metadata when using the pass-through configuration. In the pass-through configuration, the client character set is defined (usually through the NLS_LANG client setting) to be equal to the database character set, though the character data in your database could be stored in a different character set from the declared database character set. The CSREPAIR script works with the DMU client by accessing the DMU repository. Note that it only changes the database character set declaration in the data dictionary metadata and does not convert any database data.

CSSCAN script

The Database Character Set Scanner (CSSCAN) is a command-line utility that assesses the feasibility of migrating an Oracle database to a new database character set. The Database Character Set Scanner checks all character data in the database and tests for the effects and problems of changing the character set encoding. A summary report is generated at the end of the scan that shows the scope of work required to convert the database to a new character set.

Database Migration Assistant for Unicode (DMU)

An intuitive and user-friendly GUI tool to migrate your character set. It helps you streamline the migration process through an interface that minimizes the workload and ensures that all migration issues are addressed.

Database Scan Report

The Database Scan Report contains the output from a scan of the database in a form that enables you to view results in many different ways. You can customize this report as well as filter by the status of the data. As an example, you can filter the report to only display all objects that contain some data that exceeds its data type limit after conversion.

Diagnostic Package Creation Wizard

A wizard that enables you to create a diagnostic package for later analysis by Oracle to help resolve any problems that occurred while using the DMU.

expansion

When you migrate from a legacy encoding to Unicode, character values will likely expand in conversion because their encodings will have more bytes. This results in increased space requirements for the database. A further issue is that the widths for CHAR and VARCHAR2 columns may not be sufficient after the character set has been migrated to Unicode. Thus, there is a risk of the data being truncated. The column length constraints have to be increased, or, if they are already at the data type limit, the columns might need to be migrated to the CLOB data type.

external tables

External tables do not usually require conversion. If the character set of the source file for an external table (SQL*Loader or Data Pump) is properly declared, the database will automatically convert character data to the new database character set each time the data is fetched from the file. For performance reasons, consider converting the file permanently to avoid this conversion.

inline migration

A type of migration where you identify non-ASCII data first, and then only that data and its metadata is updated during the conversion phase of the migration. The amount of modified database values is usually only a small percentage of all values stored in the database.

See A to B conversion.

initialization

The process of installing required patches and supporting packages, as well as creating a tablespace for the repository, a database connection, and then installing the migration repository.

invalid binary representation

A common problem with user data is that the data in a column may not actually be in the declared database character set. Instead, it may be in another character set or it is binary (for example, if it is composed of images, documents in proprietary word processor formats, or text encrypted with custom methods), or perhaps there are multiple character sets in a single column. This problem is possible in a pass-through configuration, which has the client character set defined (usually through the NLS_LANG client setting) to be equal to the database character set.

migration

See character set migration.

migration workflow

The typical steps, and their order, for migrating the database character set to Unicode.

Migration Repository

A repository that contains the objects to be processed, details on data that had an error flagged, the progress of a scan or conversion, and some other items necessary to migrate to Unicode.

pass-through configuration

In a pass-through configuration, client character set is defined (usually through the NLS_LANG client setting) to be equal to the database character set. Because the character sets are equal, the Oracle communication protocol does not attempt any client/server character set conversion, logically assuming that the conversion is not needed. This configuration is correct as long as data coming from a client application is indeed in the declared character set. But because no character set conversion nor character validation is performed on the data, any sequence of bytes can actually be stored in the database and retrieved unchanged. As long as only the client application interprets the data, the stored byte sequences could correspond to a character encoding different from the database character set or they may not correspond to readable text at all.

preparation

Data preparation is the process to ensure that no database data to be migrated will cause problems during or after the actual conversion. The elements of data preparation are scanning and cleansing.

read-only objects

Tables and tablespaces that are read-only cannot be converted unless they are made read-write. Certain tables are read-only, such as tables explicitly marked as read-only, external tables, and tables stored in read-only tablespaces. Such tables cannot be converted before they are made read-write. Except for the explicit marking of a table as read-only, which the migration process can temporarily remove, the other mentioned reasons have to be dealt with manually.

Repository Configuration Wizard

A wizard that enables you to manage the information necessary for each step of the migration through the use of a repository. For any database without a migration repository, a repository is automatically created each time you connect to that database. You can also start the wizard from the DMU user interface.

See Migration Repository.

scanning

Scanning is the process of assessing the feasibility of migrating the data to Unicode. As part of this process, the DMU reads character values from the database, converts them to the target character set, and counts how many values change in conversion, do not fit into their columns, do not fit into their data types, or contain invalid character codes. Additional statistics, such as the maximum post-conversion length of values in a column, are calculated as well.

Scan Wizard

A wizard that guides you through the scanning process, and informs you if any data needs cleansing before it can be converted permanently without data loss.

subset

A database character set that is included within a superset character set. An example of this is ASCII being contained within Unicode.

superset

A database character set that includes the subset character set. An example of this is that Unicode contains the subset of ASCII.

truncation

To shorten something by cutting off or removing a part. In the context of data migration, the important consideration is that there is a risk of data being truncated when CHAR and VARCHAR2 columns are not sufficiently wide after the migration.

Unicode UTF-8

UTF8 (CESU-8) is a varying-width character set, where a character can be 1-3 bytes, and supplementary characters are represented by a pair of 3 bytes.

Unicode AL32UTF8

AL32UTF8 is a varying-width character set, which means that a character can be 1, 2, 3, or 4 bytes long, and supplementary characters are represented as 4 bytes.

validation

The DMU can validate the contents of an existing AL32UTF8 or UTF8 database. Such a database might have been converted to Unicode earlier or initially created in Unicode. In either case, you can check the data for current problems.