Character-Set Considerations for SODA for C

SODA for C and Character-Set Encodings for JSON Data: Client and Database

SODA for C involves two kinds of JSON-data character-set encodings: client-side and database.

The JSON standard defines JSON data as Unicode data: it is always encoded with a Unicode character set. On the client side, however, SODA for C relaxes this restriction: you can use data in other encodings, provided that it otherwise has JSON syntax.

On the client side:

The non-Unicode encodings that you can use with a SODA for C client are all of those allowed by Oracle Call Interface (OCI), with the exception of EBCDIC: you cannot use an EBCDIC character set for SODA documents.
The Unicode encodings that you can use with a SODA for C client are UTF-8, UTF-16 BE (big-endian), and UTF-16 LE (little-endian). These correspond to Oracle Database character sets AL32UTF8, AL16UTF16, and AL16UTF16LE, respectively. You cannot use UTF-32, because it is not an OCI client-side encoding.

On the database side (that is, for the content column of a collection):

Oracle recommends that you use AL32UTF8, which implements Unicode UTF-8, as the database character set.
The encoding used for JSON data in the content column of a collection depends on the SQL type:
- VARCHAR2 — The documents are encoded in the database character set (AL32UTF8, if you follow the recommendation above), because VARCHAR2 data is always stored in the database character set.
- BLOB — The documents are encoded as UTF-8, UTF-16 BE, or UTF-16 LE. Which of these Unicode encodings is used depends on how the input documents were encoded on the client side, as is explained in Writing JSON Documents To the Database From the Client.
- CLOB — The documents are encoded as UCS-2. A CLOB instance is encoded as UCS-2 whenever the database character set is multibyte (as is AL32UTF8).

If client-side and database-side encodings are the same (they are both Unicode) then no conversion is needed from one to the other.

But if they differ then SODA automatically converts from one character set to the other. If a character used in a document on the client side has no corresponding Unicode character then conversion to the database character set when writing the document is lossy. Similarly, if a character used in a document on the database side has no corresponding character in the client-side character set then conversion when reading the document is lossy.
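To make the idea of lossy conversion concrete, here is a minimal sketch in plain C. It does not use OCI and does not reproduce how SODA or Oracle Database converts character sets. The function name `to_latin1_lossy` and the choice of `?` as the replacement character are assumptions made for illustration. It converts Unicode code points to ISO-8859-1 and counts the characters that the target character set cannot represent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert an array of Unicode code points to ISO-8859-1 (Latin-1) bytes.
 * Code points above U+00FF have no Latin-1 equivalent, so each one is
 * replaced with '?' and counted. Returns the number of replaced characters.
 * Illustrative only: not the conversion routine used by SODA or OCI.
 */
static size_t to_latin1_lossy(const uint32_t *cps, size_t n, unsigned char *out)
{
    size_t lost = 0;
    assert(cps != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cps[i] <= 0xFFu) {
            out[i] = (unsigned char)cps[i];  /* representable: copy as-is */
        } else {
            out[i] = '?';                    /* not representable: information is lost */
            lost++;
        }
    }
    return lost;
}
```

A nonzero return value means that the converted text can no longer be mapped back to the original. This is the situation the paragraph above describes for a client character set that lacks a character stored in the database.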

For example:

Suppose that your client-side encoding is JA16SJIS, and the content column for your SODA collection is configured to store JSON data using SQL data type VARCHAR2. When you write data to your collection SODA automatically converts it from JA16SJIS to the database character set (AL32UTF8).
Suppose that your client-side encoding is AL16UTF16LE, and your collection is configured to store JSON data using SQL data type BLOB. Because data type BLOB supports encoding AL16UTF16LE, no conversion is needed.

By default, the character set used by OCI is defined by environment variable NLS_LANG. You can override this for a given OCI client using OCI function OCIEnvNlsCreate() with parameter charset.

In particular, you can use OCIEnvNlsCreate() to create an environment handle that defines the character set used by a given client as OCI_UTF16ID (UTF-16), which cannot be set from NLS_LANG. Character set OCI_UTF16ID designates a UTF-16 encoding whose endianness (big-endian or little-endian) depends on the platform where the client is run.
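Because the endianness designated by OCI_UTF16ID follows the client platform, it can help to check which byte order a given platform actually produces. The following sketch is plain C with no OCI calls. The helper names `host_is_little_endian` and `utf16_native_bytes` are hypothetical, introduced here only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the host stores the least-significant byte first, else 0. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);   /* inspect the first byte in memory */
    return first == 1;
}

/*
 * Copy UTF-16 code units to a byte buffer in host byte order.
 * 'out' must have room for 2 * n bytes.
 */
static void utf16_native_bytes(const uint16_t *units, size_t n, unsigned char *out)
{
    assert(units != NULL && out != NULL);
    memcpy(out, units, n * sizeof units[0]);
}
```

On a little-endian platform the code unit 0x007B (the character `{`) is laid out as the bytes 7B 00, and on a big-endian platform as 00 7B. A document buffer built this way has the platform endianness that the paragraph above describes.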

When a document is written to the database from a client application, or a document is read from the database to a client application, the application tells OCI what client-side encoding to use for the document. It does this by way of parameter docFlags, which is passed to either a document-handle creation function or a convenience function for writing content into a document without providing a document handle. That is, parameter docFlags controls the encoding of documents on the client side.

Writing JSON Documents To the Database From the Client

SODA for C functions that create a document handle are named with prefix OCISodaDocCreate. They all accept parameter docFlags.

SODA for C also provides convenience functions for writing JSON content to the database without providing a document handle. These functions are named with suffix WithCtnt (standing for “with content”). They also accept parameter docFlags.

For writing, parameter docFlags can have either of these values:

OCI_DEFAULT — Use the character set defined by the environment handle or, if none is set for the handle, by environment variable NLS_LANG.

You must supply document content in the encoding that is specified by the environment handle or NLS_LANG. Otherwise, the result of a write operation is unpredictable.

The character set can be any that is valid for OCI (Unicode or non-Unicode), with the exception of EBCDIC. (If it is OCI_UTF16ID then you must supply the document in a UTF-16 encoding whose endianness matches that of the platform where the client runs.)

If you write a document that is not encoded as Unicode to a BLOB column using OCI_DEFAULT then SODA converts the content to UTF-8 before writing.

OCI_SODA_DETECT_JSON_ENC — Automatically detect the encoding of the document content as UTF-8, UTF-16 LE (little-endian), or UTF-16 BE (big-endian).

You must supply document content in one of those encodings. Otherwise, the result of a write operation is unpredictable.
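The source does not say how SODA performs automatic detection. The sketch below shows one common heuristic, under stated assumptions: it checks for a byte-order mark and otherwise relies on the fact that JSON text begins with an ASCII character, so in UTF-16 one byte of the first code unit is zero. The type `json_enc` and the function `detect_json_encoding` are hypothetical names, not part of OCI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; these names are not OCI constants. */
typedef enum { ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE } json_enc;

/*
 * Guess the Unicode encoding of a JSON document from its first bytes.
 * Heuristic only: input in any other encoding gives a meaningless answer.
 */
static json_enc detect_json_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    /* A byte-order mark, if present, is decisive. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16BE;

    /* The first JSON character is ASCII, so its UTF-16 form has one zero byte. */
    if (len >= 2 && buf[0] != 0 && buf[1] == 0)
        return ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0 && buf[1] != 0)
        return ENC_UTF16BE;

    return ENC_UTF8;
}
```

A heuristic like this can only choose among the three Unicode encodings, which is consistent with the requirement above that content be supplied in one of them.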

Use cases for working with JSON data on the client side:

To work in a non-Unicode encoding or in a single Unicode encoding, use OCI_DEFAULT.
To work in a mix of Unicode encodings (UTF-8, UTF-16 LE, UTF-16 BE) in the same application, use OCI_SODA_DETECT_JSON_ENC. (With OCI_DEFAULT, documents are assumed to be in the single encoding specified by the environment handle or NLS_LANG.)
To work in a UTF-16 encoding that has a different endianness from that of the client-side platform, use OCI_SODA_DETECT_JSON_ENC.

If the client-side character set differs from the character set of the content column in the database, SODA converts the document, when writing, to the character set of the content column. To avoid any such conversion, use BLOB as the content data type (BLOB is the default), and supply the content with encoding UTF-8 or UTF-16 (BE or LE). If you do this then it does not matter which value (OCI_DEFAULT or OCI_SODA_DETECT_JSON_ENC) you use for parameter docFlags.

Reading JSON Documents From the Database To the Client

SODA for C functions (such as OCISodaFindOneWithKey()) that read content into a client-side document also provide parameter docFlags, which you use to specify the client-side encoding to use for the retrieved content.

For reading, parameter docFlags can have any of these values:

OCI_DEFAULT — Use the character set defined by the environment handle or, if none is set for the handle, by environment variable NLS_LANG. (This is the same as for document writes to the database.)
OCI_SODA_AS_STORED — Use the same encoding used to store the document in the database. This value is valid only for use with a collection that uses BLOB storage; otherwise, an error is raised.
OCI_SODA_AS_AL32UTF8 — Use UTF-8 as the encoding.

If the client-side character set differs from the character set of the content column in the database, SODA converts the document, when reading, to the character set specified for the client. To avoid any such conversion, use BLOB as the content data type (BLOB is the default), and use OCI_SODA_AS_STORED for parameter docFlags.
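As an illustration of the kind of conversion involved when a document stored as UTF-16 is read as UTF-8, the following plain C sketch transcodes UTF-16 code units to UTF-8. It is not the routine that SODA or OCI uses. The function name `utf16_to_utf8` and its error convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Transcode UTF-16 code units (already in host byte order) to UTF-8.
 * 'out' must have room for 3 bytes per input code unit.
 * Returns the number of bytes written, or (size_t)-1 on an unpaired surrogate.
 */
static size_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {           /* high surrogate */
            if (i + 1 >= n || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return (size_t)-1;                    /* no low surrogate follows */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i + 1] - 0xDC00);
            i++;                                      /* consume the low surrogate */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                        /* stray low surrogate */
        }
        if (cp < 0x80) {                              /* 1-byte sequence */
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {                      /* 2-byte sequence */
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                    /* 3-byte sequence */
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {                                      /* 4-byte sequence */
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

Conversion between Unicode encodings is lossless, but it still costs a pass over every document. Reading the content as stored avoids that pass, which is the point of the recommendation above.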