Character Set Length Semantics Affect Data Storage
Character-length and byte-length semantics are supported to resolve potential ambiguity between column length and storage size. Multibyte character set encodings, such as UTF-8 or AL32UTF8, are supported. Multibyte encodings require varying amounts of storage per character, depending on the character. For example, a UTF-8 character may require from 1 to 4 bytes. If a column is defined as CHAR(10) with character semantics, any 10 characters fit in the column regardless of character set encoding. With UTF-8 character set encoding, however, those 10 characters can require up to 40 bytes.
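The 1-to-4-byte range and the 40-byte worst case can be checked outside the database. The following is a minimal Python sketch; the sample characters and the `utf8_width` helper are illustrative assumptions, not part of the product.

```python
# Sample characters chosen to cover each UTF-8 width (1 to 4 bytes).
# The selection is illustrative only.
samples = {
    "A": "Latin letter",
    "é": "accented Latin letter",
    "中": "CJK ideograph",
    "😀": "emoji outside the Basic Multilingual Plane",
}


def utf8_width(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"{description}: {utf8_width(ch)} byte(s)")

# Worst case for a 10-character value: every character needs the widest form.
worst_case_bytes = 10 * max(utf8_width(ch) for ch in samples)
print(f"worst case for 10 characters: {worst_case_bytes} bytes")
```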
Character semantics is useful for defining the storage requirements for multibyte strings of varying widths. For example, in a Unicode database (AL32UTF8), suppose that you need to define a VARCHAR2 column that can store up to five Chinese characters together with five English characters. Using byte semantics, this column requires 15 bytes for the Chinese characters, each of which is three bytes long, and 5 bytes for the English characters, each of which is one byte long, for a total of 20 bytes. Using character semantics, the column requires 10 characters.
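The arithmetic in this example can be reproduced with a short Python sketch. The sample string is an assumption chosen for illustration, and Python's UTF-8 codec stands in for the AL32UTF8 database character set.

```python
# Five Chinese characters followed by five English characters.
chinese = "数据库存储"
english = "table"
value = chinese + english

# Character semantics counts characters; byte semantics counts encoded bytes.
char_length = len(value)
byte_length = len(value.encode("utf-8"))

print(f"characters: {char_length}")  # size needed under character semantics
print(f"bytes:      {byte_length}")  # size needed under byte semantics
```

Under these assumptions, the same value needs a declared length of 10 with character semantics but 20 with byte semantics.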
The expressions in the following list use byte semantics. Note the BYTE qualifier in the CHAR and VARCHAR2 expressions.
- CHAR(5 BYTE)
- VARCHAR2(20 BYTE)
The expressions in the following list use character semantics. Note the CHAR qualifier in the VARCHAR2 expression.
- VARCHAR2(20 CHAR)
- SUBSTR(string, 1, 20)
By default, the CHAR and VARCHAR2 character data types are specified in bytes, not characters. Therefore, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.
The NCHAR and NVARCHAR2 character data types are encoded as UTF-16, which requires at least 2 bytes for each character. Thus, NCHAR(20) in a table definition allows 40 bytes for storing character data.
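The 2-bytes-per-character figure can be illustrated with Python's UTF-16 codec. This is a sketch only; the `utf16_bytes` helper is hypothetical, and it uses the big-endian codec so that no byte order mark is added to the count.

```python
def utf16_bytes(text: str) -> int:
    """Return the UTF-16 size of text in bytes, without a byte order mark."""
    return len(text.encode("utf-16-be"))


# Twenty characters from the Basic Multilingual Plane take 2 bytes each.
twenty_chars = "A" * 20
print(utf16_bytes(twenty_chars))

# A supplementary character is stored as a surrogate pair (4 bytes),
# which is why the text says "at least" 2 bytes for each character.
print(utf16_bytes("😀"))
```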
The NLS_LENGTH_SEMANTICS general connection attribute determines whether a new column of character data type uses byte or character semantics. It enables you to create CHAR and VARCHAR2 columns using either byte-length or character-length semantics without having to add the explicit qualifier. NCHAR and NVARCHAR2 columns are always character-based. Existing columns are not affected.
The default value for NLS_LENGTH_SEMANTICS is BYTE. Specifying the BYTE or CHAR qualifier in a data type expression overrides the NLS_LENGTH_SEMANTICS value.