Character Set Length Semantics Affect Data Storage
Character length and byte length semantics are supported to resolve potential ambiguity between column length and storage size. Multibyte character sets, such as UTF-8 or AL32UTF8, are supported. Multibyte encodings require a varying number of bytes per character, depending on the character. For example, a UTF-8 character may require from 1 to 4 bytes. If a column is defined as CHAR(10) with character semantics, all 10 characters fit in the column regardless of character set encoding. With UTF-8 encoding, however, storing those 10 characters can require up to 40 bytes.
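The 1-to-4-byte range and the 40-byte worst case can be checked with a short Python sketch. The sample characters are chosen here for illustration only and are not taken from the product documentation. Python's built-in UTF-8 codec stands in for the database character set.

```python
# Illustrative sample characters, one for each UTF-8 width (an assumption of this sketch).
samples = {
    "latin_a": "A",              # U+0041, 1 byte in UTF-8
    "e_acute": "\u00e9",         # U+00E9, 2 bytes in UTF-8
    "cjk_zhong": "\u4e2d",       # U+4E2D, 3 bytes in UTF-8
    "emoji_grin": "\U0001F600",  # U+1F600, 4 bytes in UTF-8
}

# Byte width of each sample character once encoded as UTF-8.
widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# Worst case for a 10-character value: every character needs 4 bytes.
worst_case = samples["emoji_grin"] * 10
worst_case_chars = len(worst_case)
worst_case_bytes = len(worst_case.encode("utf-8"))

print(widths)
print(worst_case_chars, worst_case_bytes)
```

The same 10-character value therefore occupies between 10 and 40 bytes, which is why a length expressed in characters does not by itself determine storage size.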
Character semantics is useful for defining the storage requirements for multibyte strings of varying widths. For example, in a Unicode database (AL32UTF8), suppose that you need to define a VARCHAR2 column that can store up to five Chinese characters together with five English characters. Using byte semantics, this column requires 15 bytes for the Chinese characters, each of which is three bytes long, and 5 bytes for the English characters, each of which is one byte long, for a total of 20 bytes. Using character semantics, the column requires 10 characters.
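The arithmetic in this example can be reproduced with a minimal Python sketch. The two sample strings are assumptions chosen for illustration, and Python's UTF-8 codec is used as a stand-in for AL32UTF8.

```python
# Hypothetical sample data: five Chinese characters and five English letters.
chinese = "中文数据库"
english = "hello"
value = chinese + english

# Length under character semantics: count of characters.
char_length = len(value)

# Length under byte semantics: count of bytes after UTF-8 encoding.
chinese_bytes = len(chinese.encode("utf-8"))
english_bytes = len(english.encode("utf-8"))
byte_length = len(value.encode("utf-8"))

print(char_length, chinese_bytes, english_bytes, byte_length)
```

Under these assumptions the value needs a declared length of 20 with byte semantics but only 10 with character semantics.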
The expressions in the following list use byte semantics. Note the BYTE qualifier in the CHAR and VARCHAR2 expressions.
- CHAR(5 BYTE)
- VARCHAR2(20 BYTE)
The expressions in the following list use character semantics. Note the CHAR qualifier in the VARCHAR2 expression.
- VARCHAR2(20 CHAR)
- SUBSTR(string, 1, 20)
By default, the CHAR and VARCHAR2 character data types are specified in bytes, not characters. Therefore, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.
The NCHAR and NVARCHAR2 character data types are encoded as UTF-16, which requires at least 2 bytes for each character. Thus, NCHAR(20) in a table definition allows 40 bytes for storing character data.
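The 40-byte figure follows from two bytes per character, as the following Python sketch suggests. It uses Python's `utf-16-le` codec, which writes no byte order mark, as a stand-in for the database's UTF-16 storage. The sample values are illustrative assumptions.

```python
# Twenty characters from the Basic Multilingual Plane (illustrative value).
value = "A" * 20

# Each such character occupies one 2-byte UTF-16 code unit.
utf16_bytes = len(value.encode("utf-16-le"))

# A supplementary character (here U+1F600) needs two code units, or 4 bytes,
# which is why the text says "at least" 2 bytes per character.
supplementary_bytes = len("\U0001F600".encode("utf-16-le"))

print(utf16_bytes, supplementary_bytes)
```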
The NLS_LENGTH_SEMANTICS general connection attribute determines whether a new column of character data type uses byte or character semantics. It enables you to create CHAR and VARCHAR2 columns using either byte-length or character-length semantics without having to add the explicit qualifier. NCHAR and NVARCHAR2 columns are always character-based. Existing columns are not affected.
The default value for NLS_LENGTH_SEMANTICS is BYTE. Specifying the BYTE or CHAR qualifier in a data type expression overrides the NLS_LENGTH_SEMANTICS value.
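The override rule can be modeled with a small Python sketch. This is a simplified illustration, not the database's implementation: the functions `effective_semantics` and `fits` are hypothetical names introduced here, the character set is assumed to be UTF-8, and Python's `len` on a string is taken as the character count.

```python
from typing import Optional


def effective_semantics(qualifier: Optional[str],
                        nls_length_semantics: str = "BYTE") -> str:
    """Hypothetical model: an explicit BYTE or CHAR qualifier wins;
    otherwise the NLS_LENGTH_SEMANTICS value (default BYTE) applies."""
    return qualifier if qualifier is not None else nls_length_semantics


def fits(value: str, declared_length: int, semantics: str) -> bool:
    """Hypothetical check of whether a value fits the declared length,
    measured in UTF-8 bytes or in characters."""
    if semantics == "BYTE":
        size = len(value.encode("utf-8"))
    else:
        size = len(value)
    return size <= declared_length


sample = "中文数据库"  # illustrative value: 5 characters, 15 bytes in UTF-8

# Declared length 10, no qualifier, default attribute value: byte semantics.
fits_default = fits(sample, 10, effective_semantics(None))

# Declared length 10 with an explicit CHAR qualifier: character semantics.
fits_char = fits(sample, 10, effective_semantics("CHAR"))

print(fits_default, fits_char)
```

In this model, a declared length of 10 is too small for the sample under the default byte semantics, while the explicit CHAR qualifier makes it fit regardless of the attribute value.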