Choosing an Instance Character Set

Assessing your requirements

Before you can select an appropriate instance character set, you must assess the requirements for that instance. Take into account the natural language and geographic location of your users and the character sets used by existing data sources and text data. The character set that you choose must be complete enough to represent all the data that users will want to store or manipulate, whether it originates in the database, an external file, or a persistent analytic workspace. However, it should not be more complete than necessary.

Default character set: UTF-8

UTF-8 is a variable-width character set encoding that can represent all of the characters included in the Unicode standard, including graphical characters for Japanese, Korean, Chinese, English, Greek, Hebrew, and Russian. It uses 1, 2, or 3 bytes for each character in the Basic Multilingual Plane. (Standard UTF-8 uses 4 bytes for supplementary characters outside that plane.)
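The variable width is easy to observe. The short Python sketch below is for illustration only and is not part of OLAP Services: it encodes one sample character from several scripts with Python's standard UTF-8 codec and reports how many bytes each occupies. The sample characters and the `utf8_width` helper are assumptions chosen for this example.

```python
# Illustrative only: count the bytes standard UTF-8 uses for sample characters.
samples = {
    "English": "A",    # U+0041
    "Greek": "Ω",      # U+03A9
    "Hebrew": "א",     # U+05D0
    "Russian": "Ж",    # U+0416
    "Japanese": "日",  # U+65E5
    "Korean": "한",    # U+D55C
}


def utf8_width(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


widths = {name: utf8_width(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(f"{name}: {width} byte(s)")
```

The ASCII letter takes one byte, the Greek, Hebrew, and Cyrillic letters take two, and the Japanese and Korean samples take three.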

Performance considerations

Processing data stored in UTF-8 is slower than processing data stored in either a single-byte character set or a smaller, fixed-width, multi-byte character set.

You can reduce processing time simply by changing the character set used by OLAP Services. However, if your Oracle database stores data from around the world, or your OLAP applications access NCHAR or NVARCHAR columns, then OLAP Services must use UTF-8.

Rules for choosing a character set

You need to identify a supported character set that is extensive enough to represent all of the data accessed by your OLAP applications, yet no larger than necessary. The basic rule is this:

The OLAP Services instance character set should be the same as the RDBMS database character set whenever possible.

In the following cases, OLAP Services might not have the same character set as the database:

- If OLAP applications are going to access any NCHAR or NVARCHAR columns in the database, then specify UTF8.
- If data will be read from external files or spreadsheets, and the data cannot be represented by the database character set, then specify UTF8. Note that this data cannot be stored in the database because the character sets are incompatible. It can only be stored in an analytic workspace or an external file.

  However, in these cases, the database character set can be used:
  - The external files and spreadsheets use the same character set as the database.
  - The external files or spreadsheets use a different character set from the database, but all characters can be represented in the database character set (that is, the character set of the file is a subset of the character set of the database). For example, if the database uses Japanese EUC, and the files use Japanese Shift-JIS, then OLAP Services should use Japanese EUC.
- If the database uses a non-ASCII character set, then use the ASCII equivalent if possible. For example, if the database character set is WE8EBCDIC37 (8-bit EBCDIC, Western European code page 37), then specify WE8ISO8859P1 (8-bit ISO Latin-1). If there is no ASCII equivalent, then specify UTF8.
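The rules above can be read as a small decision procedure. The following Python sketch is one possible way to express it, not an OLAP Services API. The function name and every parameter are hypothetical; each parameter simply stands for the answer to one of the questions the rules ask.

```python
from typing import Optional


def choose_instance_charset(
    db_charset: str,
    accesses_nchar: bool = False,
    external_data_fits_db_charset: bool = True,
    db_charset_is_ascii_based: bool = True,
    ascii_equivalent: Optional[str] = None,
) -> str:
    """Apply the rules in order and return a character set name.

    All names here are hypothetical and exist only for this illustration.
    """
    # Exception 1: NCHAR or NVARCHAR access requires UTF8.
    if accesses_nchar:
        return "UTF8"
    # Exception 2: external data the database character set cannot represent.
    if not external_data_fits_db_charset:
        return "UTF8"
    # Exception 3: non-ASCII (for example, EBCDIC) database character set.
    if not db_charset_is_ascii_based:
        return ascii_equivalent if ascii_equivalent else "UTF8"
    # Basic rule: use the same character set as the database.
    return db_charset


# The EBCDIC example from the text: use the ASCII equivalent.
print(choose_instance_charset(
    "WE8EBCDIC37",
    db_charset_is_ascii_based=False,
    ascii_equivalent="WE8ISO8859P1",
))
```

Checking the UTF8 cases first reflects the order in which the exceptions are listed. When no exception applies, the function falls through to the basic rule and returns the database character set.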

To specify the character set, set the NLS_LANG configuration parameter, described in "Specifying the instance character set".

What happens when different character sets are used?

The following table indicates what happens when data moves from one part of the software to another, such as between OLAP Services and a client application or between a text file and OLAP Services.

IF the character set of a client or external file . . .	THEN . . .
uses the same numeric codes to represent the same graphic characters as the instance character set,	no translation is required. This is the optimal situation because no processing time is required to convert the data.
uses the same range of numeric codes but uses them to represent different graphic characters,	OLAP Services must know the identity of the other character set so that it can map the numeric codes to the correct graphic characters. No data is lost, but some processing time is required to convert the data from one character set to another.
contains numeric codes for graphic characters that are not defined in the instance character set,	data may be lost because it cannot be represented in OLAP Services. The instance character set must be a superset of the character sets used by other parts of the system.
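
The three situations in the table can be reproduced with any set of character encodings. The Python sketch below is an illustration under stated assumptions, not a description of how OLAP Services converts data: it uses Python codec names (`ascii`, `latin-1`, `cp037`, `shift_jis`, `euc_jp`) as stand-ins for the corresponding Oracle character sets, and the `classify` helper is hypothetical. It also classifies one sample string rather than a whole character set.

```python
def classify(text: str, source_charset: str, instance_charset: str) -> str:
    """Classify moving `text` from source_charset into instance_charset.

    Returns "no translation", "translation", or "data loss", mirroring the
    three rows of the table for this particular piece of text.
    """
    source_bytes = text.encode(source_charset)
    try:
        instance_bytes = text.encode(instance_charset)
    except UnicodeEncodeError:
        # At least one character has no code in the instance character set.
        return "data loss"
    if source_bytes == instance_bytes:
        # Same numeric codes for the same characters.
        return "no translation"
    # Representable, but the numeric codes must be remapped.
    return "translation"


examples = [
    ("Sales", "ascii", "latin-1"),      # identical codes
    ("Café", "cp037", "latin-1"),       # EBCDIC to ISO Latin-1
    ("日本語", "shift_jis", "euc_jp"),  # Shift-JIS to Japanese EUC
    ("日本語", "shift_jis", "latin-1"), # not representable in Latin-1
]
results = [classify(*example) for example in examples]
for example, result in zip(examples, results):
    print(example, "->", result)
```

The first example needs no translation, the second and third need translation but lose nothing, and the last one loses data because ISO Latin-1 has no codes for Japanese characters.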