211 DBMS_VECTOR_CHAIN
The DBMS_VECTOR_CHAIN package provides APIs to support advanced operations with Oracle AI Vector Search. These functions accept their respective input parameters in JSON format.
211.1 Summary of DBMS_VECTOR_CHAIN Subprograms
The DBMS_VECTOR_CHAIN package enables advanced operations with Oracle AI Vector Search, such as chunking and embedding data, along with text generation and summarization capabilities. It is more suitable for text processing with similarity search, using functionality that can be pipelined together for an end-to-end search.

This table lists the DBMS_VECTOR_CHAIN subprograms and briefly describes them.
Table 211-1 DBMS_VECTOR_CHAIN Package Subprograms
Subprogram | Description |
---|---|
Chainable Utility (UTL) Functions: These functions are a set of modular and flexible functions within vector utility PL/SQL packages. You can chain these together to automate end-to-end data transformation and similarity search operations. | |
UTL_TO_TEXT | Extracts plain text data from documents |
UTL_TO_CHUNKS | Splits data into smaller pieces or chunks |
UTL_TO_EMBEDDING and UTL_TO_EMBEDDINGS | Converts data to one or more vector embeddings |
UTL_TO_SUMMARY | Extracts a summary from documents |
UTL_TO_GENERATE_TEXT | Generates text for a prompt (input string) |
Credential Helper Procedures: These procedures enable you to securely manage authentication credentials in the database. You require these credentials to enable access to third-party service providers for making REST calls. | |
CREATE_CREDENTIAL | Creates a credential name |
DROP_CREDENTIAL | Drops an existing credential name |
Chunker Helper Procedures: These procedures enable you to configure vocabulary and language data (abbreviations), to be used with the VECTOR_CHUNKS SQL function and the UTL_TO_CHUNKS function. | |
CREATE_VOCABULARY | Loads your token vocabulary file into the database |
DROP_VOCABULARY | Removes existing vocabulary data |
CREATE_LANG_DATA | Loads your language data file into the database |
DROP_LANG_DATA | Removes existing abbreviation data |
Note:
The DBMS_VECTOR_CHAIN package requires you to install the CONTEXT component of Oracle Text, an Oracle Database technology that provides indexing, term extraction, text analysis, text summarization, word and theme searching, and other utilities.

Because this package depends on the text processing capabilities of Oracle Text, the UTL_TO_TEXT and UTL_TO_SUMMARY chainable utility functions and all of the chunker helper procedures are available only in this package.
211.1.1 CREATE_CREDENTIAL
Use the DBMS_VECTOR_CHAIN.CREATE_CREDENTIAL credential helper procedure to create a credential name for storing user authentication details in Oracle Database.
Purpose
To securely manage authentication credentials in the database. You require these credentials to enable access during REST API calls to your chosen third-party service provider, such as Cohere, Google AI, Hugging Face, Oracle Cloud Infrastructure (OCI) Generative AI, OpenAI, or Vertex AI.
A credential name holds authentication parameters, such as user name, password, access token, private key, or fingerprint.
Note that if you are using Oracle Database as the service provider, then you do not need to create a credential.
WARNING:
Certain features of the database may allow you to access services offered separately by third-parties, for example, through the use of JSON specifications that facilitate your access to REST APIs.
Your use of these features is solely at your own risk, and you are solely responsible for complying with any terms and conditions related to use of any such third-party services. Notwithstanding any other terms and conditions related to the third-party services, your use of such database features constitutes your acceptance of that risk and express exclusion of Oracle's responsibility or liability for any damages resulting from such access.
Syntax
DBMS_VECTOR_CHAIN.CREATE_CREDENTIAL (
CREDENTIAL_NAME IN VARCHAR2,
PARAMS IN JSON DEFAULT NULL
);
CREDENTIAL_NAME
Specify a name of the credential that you want to create for holding authentication parameters.
PARAMS
Specify authentication parameters in JSON format, based on your chosen service provider.

For Oracle Cloud Infrastructure (OCI) Generative AI:

{
  "user_ocid": "<user ocid>",
  "tenancy_ocid": "<tenancy ocid>",
  "compartment_ocid": "<compartment ocid>",
  "private_key": "<private key>",
  "fingerprint": "<fingerprint>"
}

For Cohere, Google AI, Hugging Face, OpenAI, and Vertex AI:

{ "access_token": "<access token>" }
Table 211-2 Parameter Details
Parameter | Description |
---|---|
user_ocid | Oracle Cloud Identifier (OCID) of the user, as listed on the User Details page in the OCI console. |
tenancy_ocid | OCID of your tenancy, as listed on the Tenancy Details page in the OCI console. |
compartment_ocid | OCID of your compartment, as listed on the Compartments information page in the OCI console. |
private_key | OCI private key. Note: The generated private key may appear as: -----BEGIN RSA PRIVATE KEY----- <private key string> -----END RSA PRIVATE KEY----- You pass the <private key string> value (excluding the BEGIN and END lines), either as a single line or as multiple lines. |
fingerprint | Fingerprint of the OCI profile key, as listed on the User Details page under API Keys in the OCI console. |
access_token | Access token obtained from your third-party service provider. |
Required Privilege
You need the CREATE CREDENTIAL privilege to call this API.
Examples
- For Generative AI:

declare
  jo json_object_t;
begin
  jo := json_object_t();
  jo.put('user_ocid','ocid1.user.oc1..aabbalbbaa1112233aabbaabb1111222aa1111bb');
  jo.put('tenancy_ocid','ocid1.tenancy.oc1..aaaaalbbbb1112233aaaabbaa1111222aaa111a');
  jo.put('compartment_ocid','ocid1.compartment.oc1..ababalabab1112233abababab1111222aba11ab');
  jo.put('private_key','AAAaaaBBB11112222333...AAA111AAABBB222aaa1a/+');
  jo.put('fingerprint','01:1a:a1:aa:12:a1:12:1a:ab:12:01:ab:a1:12:ab:1a');
  dbms_vector_chain.create_credential(
    credential_name => 'OCI_CRED',
    params          => json(jo.to_string));
end;
/
- For Cohere:

declare
  jo json_object_t;
begin
  jo := json_object_t();
  jo.put('access_token', 'A1Aa0abA1AB1a1Abc123ab1A123ab123AbcA12a');
  dbms_vector_chain.create_credential(
    credential_name => 'COHERE_CRED',
    params          => json(jo.to_string));
end;
/
To run an end-to-end example scenario using this procedure, see Oracle Database AI Vector Search User's Guide.
211.1.2 CREATE_LANG_DATA
Use the DBMS_VECTOR_CHAIN.CREATE_LANG_DATA chunker helper procedure to load your own language data file into the database.
Purpose
To create custom language data for your chosen language (specified using the LANGUAGE chunking parameter).
A language data file contains language-specific abbreviation tokens. You can supply this data to the chunker to help in accurately determining sentence boundaries of chunks, by using knowledge of the input language's end-of-sentence (EOS) punctuations, abbreviations, and contextual rules.
Usage Notes

- All supported languages are distributed with the default language-specific abbreviation dictionaries. You can create language data based on the abbreviation tokens loaded in the schema.table.column, using a user-specified language data name (PREFERENCE_NAME).
- After loading your language data, you can use language-specific chunking by specifying the LANGUAGE chunking parameter with VECTOR_CHUNKS or UTL_TO_CHUNKS.
- You can query these data dictionary views to access existing language data:
  - ALL_VECTOR_LANG displays all available language data.
  - USER_VECTOR_LANG displays language data from the schema of the current user.
  - ALL_VECTOR_ABBREV_TOKENS displays abbreviation tokens from all available language data.
  - USER_VECTOR_ABBREV_TOKENS displays abbreviation tokens from the language data owned by the current user.
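After loading language data, a quick sanity check is to query the user views listed above. A minimal sketch (each view returns the language data or abbreviation tokens visible to the current user):

```sql
-- List language data created in the current schema.
SELECT * FROM user_vector_lang;

-- Inspect the abbreviation tokens behind that language data.
SELECT * FROM user_vector_abbrev_tokens;
```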
Syntax
DBMS_VECTOR_CHAIN.CREATE_LANG_DATA (
PARAMS IN JSON default NULL
);
PARAMS

Specify the input parameters in JSON format:
{
TABLE_NAME,
COLUMN_NAME,
LANGUAGE,
PREFERENCE_NAME
}
Table 211-3 Parameter Details
Parameter | Description | Required | Default Value |
---|---|---|---|
table_name | Name of the table (along with the optional table owner) in which you want to load the language data | Yes | No value |
column_name | Column name in the language data table in which you want to load the language data | Yes | No value |
language | Any supported language name, as listed in Supported Languages and Data File Locations | Yes | No value |
preference_name | User-specified preference name for this language data | Yes | No value |
Example
declare
params CLOB := '{"TABLE_NAME" : "eos_data_1",
"COLUMN_NAME" : "TOKEN",
"LANGUAGE" : "INDONESIAN",
"PREFERENCE_NAME" : "my_lang_1"}';
begin
DBMS_VECTOR_CHAIN.CREATE_LANG_DATA(
JSON (params));
end;
/
To run an end-to-end example scenario using this procedure, see Create and Use Custom Language Data in Oracle Database AI Vector Search User's Guide.
211.1.3 CREATE_VOCABULARY
Use the DBMS_VECTOR_CHAIN.CREATE_VOCABULARY chunker helper procedure to load your own token vocabulary file into the database.
Purpose
To create custom token vocabulary that is recognized by the tokenizer used by your vector embedding model.
A vocabulary contains a set of tokens (words and word pieces) that are collected during a model's statistical training process. You can supply this data to the chunker to help in accurately selecting the text size that approximates the maximum input limit imposed by the tokenizer of your embedding model.
Usage Notes

- Usually, the supported vocabulary files (containing recognized tokens) are included as part of a model's distribution. Oracle recommends using the vocabulary files associated with your model.

  If a vocabulary file is not available, then you may download one of the following files, depending on the tokenizer type:

  - WordPiece: vocabulary file (vocab.txt) for the "bert-base-uncased" (English) or "bert-base-multilingual-cased" model
  - Byte-Pair Encoding (BPE): vocabulary file (vocab.json) for the "GPT2" model. Use the following Python script to extract the file:

    import json
    import sys

    with open(sys.argv[1], encoding="utf-8") as f:
        d = json.load(f)
        for term in d:
            print(term)

  - SentencePiece: vocabulary file (tokenizer.json) for the "xlm-roberta-base" model. Use the following Python script to extract the file:

    import json
    import sys

    with open(sys.argv[1], encoding="utf-8") as f:
        d = json.load(f)
        for entry in d["model"]["vocab"]:
            print(entry[0])

  Ensure that you save your vocabulary files in UTF-8 encoding.

- You can create a vocabulary based on the tokens loaded in the schema.table.column, using a user-specified vocabulary name (VOCABULARY_NAME).

  After loading your vocabulary data, you can use the BY VOCABULARY chunking mode (with VECTOR_CHUNKS or UTL_TO_CHUNKS) to split input data by counting the number of tokens.

- You can query these data dictionary views to access existing vocabulary data:
  - ALL_VECTOR_VOCAB displays all available vocabularies.
  - USER_VECTOR_VOCAB displays vocabularies from the schema of the current user.
  - ALL_VECTOR_VOCAB_TOKENS displays a list of tokens from all available vocabularies.
  - USER_VECTOR_VOCAB_TOKENS displays a list of tokens from the vocabularies owned by the current user.
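As a quick check, you can confirm a loaded vocabulary through the user views above and then chunk with it. A minimal sketch, assuming a vocabulary was created under the hypothetical name doc_vocab:

```sql
-- Confirm the vocabulary is visible to the current user.
SELECT * FROM user_vector_vocab;

-- Split text by counting tokens against the custom vocabulary.
SELECT C.column_value AS chunk
FROM dbms_vector_chain.utl_to_chunks(
       to_clob('Oracle AI Vector Search supports chunking and embedding.'),
       json('{"by":"vocabulary", "vocabulary":"doc_vocab", "max":"50"}')) C;
```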
Syntax
DBMS_VECTOR_CHAIN.CREATE_VOCABULARY(
PARAMS IN JSON default NULL
);
PARAMS

Specify the input parameters in JSON format:
{
TABLE_NAME,
COLUMN_NAME,
VOCABULARY_NAME,
FORMAT,
CASED
}
Table 211-4 Parameter Details
Parameter | Description | Required | Default Value |
---|---|---|---|
table_name | Name of the table (along with the optional table owner) in which you want to load the vocabulary file | Yes | No value |
column_name | Column name in the vocabulary table in which you want to load the vocabulary file | Yes | No value |
vocabulary_name | User-specified name of the vocabulary, along with the optional owner name (if other than the current owner) | Yes | No value |
format | Tokenization format of the vocabulary: bert (for WordPiece), gpt2 (for BPE), or xlm (for SentencePiece) | Yes | No value |
cased | Character-casing of the vocabulary, that is, vocabulary to be treated as cased or uncased | No | false |
Example
DECLARE
params clob := '{"table_name" : "doc_vocabtab",
"column_name" : "token",
"vocabulary_name" : "doc_vocab",
"format" : "bert",
"cased" : false}';
BEGIN
dbms_vector_chain.create_vocabulary(json(params));
END;
/
To run an end-to-end example scenario using this procedure, see Create and Use Custom Vocabulary in Oracle Database AI Vector Search User's Guide.
211.1.4 DROP_CREDENTIAL
Use the DBMS_VECTOR_CHAIN.DROP_CREDENTIAL credential helper procedure to drop an existing credential name from the data dictionary.
Syntax
DBMS_VECTOR_CHAIN.DROP_CREDENTIAL (
CREDENTIAL_NAME IN VARCHAR2
);
CREDENTIAL_NAME
Specify the credential name that you want to drop.
Examples
- For Generative AI:

exec dbms_vector_chain.drop_credential('OCI_CRED');

- For Cohere:

exec dbms_vector_chain.drop_credential('COHERE_CRED');
211.1.5 DROP_LANG_DATA
Use the DBMS_VECTOR_CHAIN.DROP_LANG_DATA chunker helper procedure to remove abbreviation data from the data dictionary.
Syntax
DBMS_VECTOR_CHAIN.DROP_LANG_DATA(
PREF_NAME IN VARCHAR2
);
PREF_NAME

Specify the name of the language data (preference) that you want to drop.
Example
DBMS_VECTOR_CHAIN.DROP_LANG_DATA(
'INDONESIAN'
);
211.1.6 DROP_VOCABULARY
Use the DBMS_VECTOR_CHAIN.DROP_VOCABULARY chunker helper procedure to remove vocabulary data from the data dictionary.
Syntax
DBMS_VECTOR_CHAIN.DROP_VOCABULARY(
VOCABULARY_NAME IN VARCHAR2
);
VOCABULARY_NAME
Specify the name of the vocabulary that you want to drop, in the form:
vocabulary_name
or
owner.vocabulary_name
Example
DBMS_VECTOR_CHAIN.DROP_VOCABULARY(
'MY_VOCAB_1'
);
211.1.7 UTL_TO_CHUNKS
Use the DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS chainable utility function to split a larger plain text document into smaller chunks (pieces of words, sentences, or paragraphs).
Purpose
To perform a text to chunks transformation, by internally calling the VECTOR_CHUNKS SQL function for the operation.
Syntax
DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS (
    DATA    IN  CLOB | VARCHAR2,
    PARAMS  IN  JSON default NULL
) return VECTOR_ARRAY_T;
DATA
This function accepts the input data type as CLOB or VARCHAR2.

It returns an array of CLOBs, where each CLOB contains a chunk along with its metadata in JSON format, as follows:
{
"chunk_id":NUMBER,
"chunk_offset":NUMBER,
"chunk_length":NUMBER,
"chunk_data":"VARCHAR2(4000)"
}
For example:

{"chunk_id":1,"chunk_offset":1,"chunk_length":6,"chunk_data":"sample"}
- chunk_id specifies the chunk ID for each chunk.
- chunk_offset specifies the original position of each chunk in the source document, relative to the start of the document, which has a position of 1.
- chunk_length specifies the character length of each chunk.
- chunk_data displays text pieces from each chunk.
PARAMS
Specify the input parameters in JSON format.
{
  "by" : mode,
  "max" : max,
  "overlap" : overlap,
  "split" : split_condition,
  "custom_list" : [ split_chars1, ... ],
  "vocabulary" : vocabulary_name,
  "language" : nls_language,
  "normalize" : normalize_mode,
  "norm_options" : [ normalize_option1, ... ],
  "extended" : boolean
}
For example:
JSON('{ "by":"vocabulary",
"vocabulary":"myvocab",
"max":"100",
"overlap":"0",
"split":"custom",
"custom_list": [ "<p>" , "<s>" ],
"language":"american",
"normalize":"options",
"norm_options": [ "WHITESPACE" ] }')
All these parameters are aligned with the VECTOR_CHUNKS SQL function, as follows:

- by, max, overlap, split, language, and extended specify the same values as those of VECTOR_CHUNKS.
- vocabulary is the name of the custom vocabulary; same as the vocabulary name that you specify when using the by vocabulary mode.
- custom_list is an array of the custom split characters; same as the split by custom condition.
- norm_options is an array of normalization options; same as the normalize parameter.
For a complete description of these parameters, see VECTOR_CHUNKS in Oracle Database SQL Language Reference.
Example
SELECT D.id doc,
JSON_VALUE(C.column_value, '$.chunk_id' RETURNING NUMBER) AS id,
JSON_VALUE(C.column_value, '$.chunk_offset' RETURNING NUMBER) AS pos,
JSON_VALUE(C.column_value, '$.chunk_length' RETURNING NUMBER) AS siz,
JSON_VALUE(C.column_value, '$.chunk_data') AS txt
FROM docs D,
dbms_vector_chain.utl_to_chunks(D.text,
JSON('{ "by":"words",
"max":"100",
"overlap":"0",
"split":"recursively",
"language":"american",
"normalize":"all" }')) C;
To run end-to-end example scenarios using this function, see Convert Text to Chunks With Custom Chunking Specifications and Convert File to Text to Chunks to Embeddings: Using a Pretrained ONNX Model in Oracle Database AI Vector Search User's Guide.
211.1.8 UTL_TO_EMBEDDING and UTL_TO_EMBEDDINGS
Use the DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDING and DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS chainable utility functions to convert plain text to one or more vector embeddings.
Purpose
To perform a text to embedding transformation by accessing either of the following:

- Oracle Database as the service provider: Calls the pretrained ONNX format embedding model that you have loaded into the database (default setting)
- Third-party embedding model: Makes a REST call to your chosen third-party service provider, such as Cohere, Google AI, Hugging Face, Oracle Cloud Infrastructure (OCI) Generative AI, OpenAI, or Vertex AI
WARNING:
Certain features of the database may allow you to access services offered separately by third-parties, for example, through the use of JSON specifications that facilitate your access to REST APIs.
Your use of these features is solely at your own risk, and you are solely responsible for complying with any terms and conditions related to use of any such third-party services. Notwithstanding any other terms and conditions related to the third-party services, your use of such database features constitutes your acceptance of that risk and express exclusion of Oracle's responsibility or liability for any damages resulting from such access.
Syntax
DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDING (
DATA IN CLOB,
PARAMS IN JSON default NULL
) return VECTOR;
DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS (
DATA IN VECTOR_ARRAY_T,
PARAMS IN JSON default NULL
) return VECTOR_ARRAY_T;
DATA
UTL_TO_EMBEDDING converts text (CLOB) to a single embedding (VECTOR).

UTL_TO_EMBEDDINGS converts an array of chunks (VECTOR_ARRAY_T) to an array of embeddings (VECTOR_ARRAY_T).
The embedding output includes:
{
"embed_id":NUMBER,
"embed_data":"VARCHAR2(4000)",
"embed_vector":"CLOB"
}
- embed_id displays the ID number of each embedding.
- embed_data displays the input text that is transformed into embeddings.
- embed_vector displays the generated vector representations.
PARAMS
Specify input parameters in JSON format, depending on the service provider that you want to use.
{
"provider": "database",
"model": "<pretrained ONNX embedding model file name>"
}
Table 211-5 Database Provider Parameter Details
Parameter | Description |
---|---|
provider | Specify database (default setting) to use Oracle Database as the provider. With this setting, you use a pretrained ONNX format embedding model that you have loaded into the database. |
model | User-specified name under which the imported pretrained ONNX embedding model is stored in Oracle Database. If you do not have a pretrained embedding model in ONNX format, then perform the steps listed in Oracle Database AI Vector Search User's Guide. |
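With the database provider, the call reduces to naming the imported model. A minimal sketch, assuming a pretrained ONNX embedding model has already been loaded under the hypothetical name my_embedding_model:

```sql
-- Generate a single embedding with an in-database ONNX model.
-- "my_embedding_model" is an assumed model name.
var params clob;
exec :params := '{"provider": "database", "model": "my_embedding_model"}';

select dbms_vector_chain.utl_to_embedding('Hello world', json(:params)) from dual;
```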
If using a third-party provider:
Set the following parameters along with additional embedding parameters specific to your provider:
- For UTL_TO_EMBEDDING:

{
  "provider": "<AI service provider>",
  "credential_name": "<credential name>",
  "url": "<REST endpoint URL for embedding service>",
  "model": "<REST provider embedding model name>",
  "transfer_timeout": <maximum wait time for the request to complete>,
  "<additional REST provider parameter>": "<REST provider parameter value>"
}

- For UTL_TO_EMBEDDINGS:

{
  "provider": "<AI service provider>",
  "credential_name": "<credential name>",
  "url": "<REST endpoint URL for embedding service>",
  "model": "<REST provider embedding model name>",
  "transfer_timeout": <maximum wait time for the request to complete>,
  "batch_size": "<number of vectors to request at a time>",
  "<additional REST provider parameter>": "<REST provider parameter value>"
}
Table 211-6 Third-Party Provider Parameter Details
Parameter | Description |
---|---|
provider | Third-party service provider that you want to access for this operation. A REST call is made to the specified provider to access its embedding model. Specify one of the following values: cohere, googleai, huggingface, ocigenai, openai, or vertexai. |
credential_name | Name of the credential in the form: schema.credential_name. A credential name holds authentication credentials to enable access to your provider for making REST API calls. You need to first set up your credential by calling the DBMS_VECTOR_CHAIN.CREATE_CREDENTIAL helper procedure. |
url | URL of the API endpoint for each REST call. |
model | Name of the third-party embedding model in the form: schema.model_name. If you do not specify a schema, then the schema of the procedure invoker is used. Note: For accurate results, ensure that the chosen model matches the vocabulary file used for chunking. If you are not using a vocabulary file, then ensure that the input length is defined within the token limits of your model. |
transfer_timeout | Maximum time to wait for the request to complete. The default value is 60 seconds. You can increase this value for busy web servers. |
batch_size | Maximum number of vectors to request at a time. For example, for a batch size of 50, if 100 chunks are passed, then two REST requests are sent, each with an array of 50 inputs. For REST calls, it is more efficient to send a batch of inputs at a time rather than requesting a single input per call. Increasing the batch size can provide better performance, whereas reducing the batch size may reduce memory and data usage, especially if your provider has a rate limit. The default or maximum allowed value depends on the third-party provider settings. |
Additional REST provider parameters:

Cohere example:

{
  "provider": "cohere",
  "credential_name": "COHERE_CRED",
  "url": "https://api.cohere.example.com/embed",
  "model": "embed-model",
  "input_type": "search_query"
}

Google AI example:

{
  "provider": "googleai",
  "credential_name": "GOOGLEAI_CRED",
  "url": "https://googleapis.example.com/models/",
  "model": "embed-model"
}

Hugging Face example:

{
  "provider": "huggingface",
  "credential_name": "HF_CRED",
  "url": "https://api.huggingface.example.com/",
  "model": "embed-model"
}

Generative AI example:

{
  "provider": "ocigenai",
  "credential_name": "OCI_CRED",
  "url": "https://generativeai.oci.example.com/embedText",
  "model": "embed-model",
  "batch_size": 10
}

OpenAI example:

{
  "provider": "openai",
  "credential_name": "OPENAI_CRED",
  "url": "https://api.openai.example.com/embeddings",
  "model": "embed-model"
}

Vertex AI example:

{
  "provider": "vertexai",
  "credential_name": "VERTEXAI_CRED",
  "url": "https://googleapis.example.com/models/",
  "model": "embed-model"
}
Table 211-7 Additional REST Provider Parameter Details
Parameter | Description |
---|---|
input_type | Type of input to vectorize. |
For more information on additional parameters, refer to your third-party provider's documentation.
Note:
The generated embedding results may be different between requests for the same input and configuration, depending on your embedding model or floating point precision. However, this does not affect your queries (and provides semantically correct results) because the vector distance will be similar.

Examples
You can use UTL_TO_EMBEDDING in a SELECT clause and UTL_TO_EMBEDDINGS in a FROM clause, as follows:
- The following examples use UTL_TO_EMBEDDING to generate a vector embedding with Hello world as the input, by accessing the Generative AI embedding model:

-- select example

var params clob;
exec :params := '
{
  "provider": "ocigenai",
  "credential_name": "OCI_CRED",
  "url": "https://generativeai.oci.example.com/embedText",
  "model": "embed.modelname",
  "batch_size": 10
}';

select dbms_vector_chain.utl_to_embedding('Hello world', json(:params)) from dual;

-- PL/SQL example

declare
  input clob;
  params clob;
  v vector;
begin
  input := 'Hello world';

  params := '
{
  "provider": "ocigenai",
  "credential_name": "OCI_CRED",
  "url": "https://generativeai.oci.example.com/embedText",
  "model": "embed.modelname"
}';

  v := dbms_vector_chain.utl_to_embedding(input, json(params));
  dbms_output.put_line(vector_serialize(v));
exception
  when OTHERS THEN
    DBMS_OUTPUT.PUT_LINE (SQLERRM);
    DBMS_OUTPUT.PUT_LINE (SQLCODE);
end;
/
To run an end-to-end example scenario using UTL_TO_EMBEDDING, see Convert Text String to Embedding Using Third-Party APIs in Oracle Database AI Vector Search User's Guide.

- The following example uses UTL_TO_EMBEDDINGS to generate vector embeddings with a PDF document (stored in the documentation_tab table) as the input, by calling the ONNX model my_embedding_model.onnx:

SELECT et.*
FROM documentation_tab dt,
     dbms_vector_chain.utl_to_embeddings(
       dbms_vector_chain.utl_to_chunks(
         dbms_vector_chain.utl_to_text(dt.data)),
       json(:embed_params)) et;
To run an end-to-end example scenario using UTL_TO_EMBEDDINGS, see Convert File to Text to Chunks to Embeddings: Using a Pretrained ONNX Model in Oracle Database AI Vector Search User's Guide.
211.1.9 UTL_TO_GENERATE_TEXT
Use the DBMS_VECTOR_CHAIN.UTL_TO_GENERATE_TEXT chainable utility function to generate text for a given prompt, by accessing third-party text generation models.
Purpose
To communicate with Large Language Models (LLMs) for generating a textual description for prompts, given as input to LLM-powered chat interfaces.
A prompt can be an input text string, such as a question that you ask an LLM. For example, "What is Oracle Text?". A prompt can also be a set of instructions or a command, such as "Summarize the following ...", "Draft an email asking for ...", or "Rewrite the following ...", and can include results from a search. This API responds with a textual answer, description, or summary based on the specified task in the prompt.
For this operation, this API makes a REST call to a third-party service provider, such as Cohere, Google AI, Hugging Face, Oracle Cloud Infrastructure (OCI) Generative AI, OpenAI, or Vertex AI.
WARNING:
Certain features of the database may allow you to access services offered separately by third-parties, for example, through the use of JSON specifications that facilitate your access to REST APIs.
Your use of these features is solely at your own risk, and you are solely responsible for complying with any terms and conditions related to use of any such third-party services. Notwithstanding any other terms and conditions related to the third-party services, your use of such database features constitutes your acceptance of that risk and express exclusion of Oracle's responsibility or liability for any damages resulting from such access.
Syntax
This function accepts the input as CLOB containing text data. It then processes this information to generate a new CLOB containing the generated text.
DBMS_VECTOR_CHAIN.UTL_TO_GENERATE_TEXT (
DATA IN CLOB,
PARAMS IN JSON default NULL
) return CLOB;
DATA
Specify the textual prompt as CLOB for the DATA clause.
PARAMS
Specify the following input parameters in JSON format, depending on the service provider that you want to access for text generation:
{
"provider": "<AI service provider>",
"credential_name": "<credential name>",
"url": "<REST endpoint URL for text generation service>",
"model": "<text generation model name>",
"transfer_timeout": <maximum wait time for the request to complete>,
"<additional REST provider parameter>": "<REST provider parameter value>"
}
Table 211-8 UTL_TO_GENERATE_TEXT Parameter Details
Parameter | Description |
---|---|
provider | Supported REST provider that you want to access to generate text. Specify one of the following values: cohere, googleai, huggingface, ocigenai, openai, or vertexai. |
credential_name | Name of the credential in the form: schema.credential_name. A credential name holds authentication credentials to enable access to your provider for making REST API calls. You need to first set up your credential by calling the DBMS_VECTOR_CHAIN.CREATE_CREDENTIAL helper procedure. |
url | URL of the API endpoint for each REST call. |
model | Name of the third-party text generation model in the form: schema.model_name. If the model name is not schema-qualified, then the schema of the procedure invoker is used. |
transfer_timeout | Maximum time to wait for the request to complete. The default value is 60 seconds. You can increase this value for busy web servers. |
Additional REST provider parameters:

Cohere example:

{
  "provider": "Cohere",
  "credential_name": "COHERE_CRED",
  "url": "https://api.cohere.example.com/generateText",
  "model": "generate-text-model"
}

Google AI example:

{
  "provider": "googleai",
  "credential_name": "GOOGLEAI_CRED",
  "url": "https://googleapis.example.com/models/",
  "model": "generate-text-model"
}

Hugging Face example:

{
  "provider": "huggingface",
  "credential_name": "HF_CRED",
  "url": "https://api.huggingface.example.com/models/",
  "model": "generate-text-model"
}

Generative AI example:

{
  "provider": "OCIGenAI",
  "credential_name": "GENAI_CRED",
  "url": "https://generativeai.oci.example.com/generateText",
  "model": "generate-text-model",
  "inferenceRequest": {
    "maxTokens": 300,
    "temperature": 1
  }
}

OpenAI example:

{
  "provider": "openai",
  "credential_name": "OPENAI_CRED",
  "url": "https://api.openai.example.com",
  "model": "generate-text-model",
  "max_tokens": 60,
  "temperature": 1.0
}

Vertex AI example:

{
  "provider": "vertexai",
  "credential_name": "VERTEXAI_CRED",
  "url": "https://googleapis.example.com/models/",
  "model": "generate-text-model",
  "generation_config": {
    "temperature": 0.9,
    "topP": 1,
    "candidateCount": 1,
    "maxOutputTokens": 256
  }
}
Table 211-9 Additional REST Provider Parameter Details
Parameter | Description |
---|---|
max_tokens (OpenAI) or maxTokens (Generative AI) | Maximum number of tokens in the output text. |
temperature | Degree of randomness used when generating the output text. To generate the same output for a prompt, use 0. To generate random new text for that prompt, increase the temperature. Note: Start with the temperature set to 0 and increase it only if you need more varied output. |
topP | Probability of tokens in the output, in the range of 0.0 to 1.0. A lower value provides less random responses and a higher value provides more random responses. |
candidateCount | Number of response variations to return. |
maxOutputTokens | Maximum number of tokens to generate for each response. |
For more information on additional parameters, refer to your third-party provider's documentation.
Example
The following statements generate text by making a REST call to Generative AI. Here, the prompt is "What is Oracle Text?".
-- select example
var params clob;
exec :params := '
{
"provider": "ocigenai",
"credential_name": "OCI_CRED",
"url": "https://generativeai.oci.example.com/generateText",
"model": "generate.modelname"
}';
select dbms_vector_chain.utl_to_generate_text(
'What is Oracle Text?',
json(:params)) from dual;
-- PL/SQL example
declare
input clob;
params clob;
output clob;
begin
input := 'What is Oracle Text?';
params := '
{
"provider": "ocigenai",
"credential_name": "OCI_CRED",
"url": "https://generativeai.oci.example.com/generateText",
"model": "generate.modelname"
}';
output := dbms_vector_chain.utl_to_generate_text(input, json(params));
dbms_output.put_line(output);
if output is not null then
dbms_lob.freetemporary(output);
end if;
exception
when OTHERS THEN
DBMS_OUTPUT.PUT_LINE (SQLERRM);
DBMS_OUTPUT.PUT_LINE (SQLCODE);
end;
/
To run an end-to-end example scenario, see Generate Text Using Third-Party APIs in Oracle Database AI Vector Search User's Guide.
211.1.10 UTL_TO_TEXT
Use the DBMS_VECTOR_CHAIN.UTL_TO_TEXT chainable utility function to convert an input document (for example, PDF, DOC, JSON, XML, or HTML) to plain text.
Purpose
To perform a file to text transformation by using the Oracle Text component (CONTEXT) of Oracle Database.
Syntax
DBMS_VECTOR_CHAIN.UTL_TO_TEXT (
DATA IN CLOB | BLOB,
PARAMS IN JSON default NULL
) return CLOB;
DATA
This function accepts the input data type as CLOB or BLOB.

It returns a plain text version of the document as CLOB.
You can either pass the input document directly to the API or specify a file stored in Oracle Database.
For a complete list of all the supported document formats, see Oracle Text Reference.
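For example, to convert a file rather than a table column, you can wrap a BFILE into a BLOB and pass it directly. A minimal sketch, where the directory object DEMO_DIR and the file name report.pdf are assumptions:

```sql
-- Convert a document in a directory object to plain text.
-- DEMO_DIR and report.pdf are assumed to exist.
select dbms_vector_chain.utl_to_text(
         to_blob(bfilename('DEMO_DIR', 'report.pdf')))
from dual;
```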
PARAMS
Specify the following input parameter in JSON format:
{
"plaintext":"true or false",
"charset":"UTF8"
}
Table 211-10 Parameter Details
Parameter | Description |
---|---|
plaintext | Plain text output. The default value for this parameter is true, that is, the document is returned as plain text. If you do not want to return the document as plain text, then set this parameter to false. |
charset | Character set encoding. Currently, only UTF8 is supported. |
Example
select DBMS_VECTOR_CHAIN.UTL_TO_TEXT (
t.blobdata,
json('{
"plaintext":"true",
"charset":"UTF8"
}')
) from tab t;
To run an end-to-end example scenario using this function, see Oracle Database AI Vector Search User's Guide.
211.1.11 UTL_TO_SUMMARY
Use the DBMS_VECTOR_CHAIN.UTL_TO_SUMMARY chainable utility function to convert plain text to a summary.
Purpose

To perform a text to summary transformation by accessing either of the following:

- Oracle Database as the AI service provider: Uses a document gist (or summary) generated by Oracle Text (default setting)
- Third-party summarization model: Makes a REST API call to your chosen third-party AI service provider, such as Cohere, Google AI, Hugging Face, Oracle Cloud Infrastructure (OCI) Generative AI, OpenAI, or Vertex AI
A summary is a short and concise extract with key features of a document that best represents what the document is about as a whole. A summary can be free-form paragraphs or bullet points based on the format that you specify.
WARNING:
Certain features of the database may allow you to access services offered separately by third-parties, for example, through the use of JSON specifications that facilitate your access to REST APIs.
Your use of these features is solely at your own risk, and you are solely responsible for complying with any terms and conditions related to use of any such third-party services. Notwithstanding any other terms and conditions related to the third-party services, your use of such database features constitutes your acceptance of that risk and express exclusion of Oracle's responsibility or liability for any damages resulting from such access.
Syntax
DBMS_VECTOR_CHAIN.UTL_TO_SUMMARY (
DATA IN CLOB,
PARAMS IN JSON default NULL
) return CLOB;
DATA
This function accepts the input data type as CLOB.
It returns a summary of the input document in plain text as CLOB.
PARAMS
Specify input parameters in JSON format, depending on the service provider that you want to use for text summarization.
If using Oracle Database as the provider:
{
"provider": "database",
"glevel": "<summary format>",
"numParagraphs": <number in the range 1-16>,
"maxPercent": <number in the range 1-100>,
"num_themes": <number in the range 1-50>,
"language": "<name of the language>"
}
Table 211-11 Database Provider Parameter Details
Parameter | Description |
---|---|
provider | Specify database (default setting) to use Oracle Database as the provider. A document gist generated by Oracle Text is used as the summary. |
glevel | Format to display the summary: S for a sentence-level summary, P for a paragraph-level summary |
numParagraphs | Maximum number of document paragraphs (or sentences) selected for the summary. The default value is 16. The numParagraphs parameter is used only when it produces a smaller summary than maxPercent, because both parameters work together. |
maxPercent | Maximum number of document paragraphs (or sentences) selected for the summary, as a percentage of the total paragraphs (or sentences) in the document. The default value is 10. The maxPercent parameter is used only when it produces a smaller summary than numParagraphs, because both parameters work together. |
num_themes | Number of theme summaries to produce. For example, if you specify 10, then the top 10 theme summaries are returned. If you specify 0 or null, then all themes in the document are returned. The default value is 50. If the document contains more than 50 themes, then only the top 50 themes show conceptual hierarchy. |
language | Language name of your summary text, as listed in Supported Languages and Data File Locations. |
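As a minimal sketch of the database provider (no credential or REST endpoint is required in this case), assuming a hypothetical table doc_tab with a CLOB column text_data:

```sql
-- Summarize with Oracle Database as the provider (Oracle Text gist).
-- The table doc_tab and column text_data are hypothetical.
select dbms_vector_chain.utl_to_summary(
  t.text_data,
  json('{
    "provider": "database",
    "glevel": "P",
    "numParagraphs": 2
  }')
) from doc_tab t;
```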
If using a third-party provider:
{
"provider": "<AI service provider>",
"credential_name": "<credential name>",
"url": "<REST endpoint URL for summarization service>",
"model": "<REST provider summarization model name>",
"transfer_timeout": <maximum wait time for the request to complete>,
"<additional REST provider parameter>": "<REST provider parameter value>"
}
Table 211-12 Third-Party Provider Parameter Details
Parameter | Description |
---|---|
provider | Third-party service provider that you want to access to get the summary. A REST call is made to the specified provider to access its text summarization model. Specify one of the following values: cohere, googleai, huggingface, ocigenai, openai, vertexai |
credential_name | Name of the credential in the form: schema.credential_name. A credential name holds authentication credentials to enable access to your provider for making REST API calls. You need to first set up your credential by calling the DBMS_VECTOR_CHAIN.CREATE_CREDENTIAL helper function, and then refer to the credential name here. |
url | URL of the API endpoint for each REST call. |
model | Name of the third-party text summarization model in the form: schema.model_name. If the model name is not schema-qualified, then the schema of the procedure invoker is used. Note: For Generative AI, you must specify cohere.command. |
transfer_timeout | Maximum time to wait for the request to complete. The default value is 60 seconds. You can increase this value for busy web servers. |
Additional REST provider parameters:
{
"provider": "cohere",
"credential_name": "COHERE_CRED",
"url": "https://api.cohere.example.com/summarize",
"model": "summarize-model",
"length": "medium",
"format": "paragraph",
"temperature": 1.0
}
{
"provider": "googleai",
"credential_name": "GOOGLEAI_CRED",
"url": "https://googleapis.example.com/models/",
"model": "summarize-model",
"generation_config": {
"temperature": 0.9,
"topP": 1,
"candidateCount": 1,
"maxOutputTokens": 256
}
}
{
"provider": "huggingface",
"credential_name": "HF_CRED",
"url": "https://api.huggingface.example.co/models/",
"model": "summarize-model"
}
{
"provider": "ocigenai",
"credential_name": "OCI_CRED",
"url": "https://generativeai.oci.example.com/summarizeText",
"model": "summarize-model",
"length": "medium",
"format": "paragraph"
}
{
"provider": "openai",
"credential_name": "OPENAI_CRED",
"url": "https://api.openai.example.com",
"model": "summarize-model",
"max_tokens": 256,
"temperature": 1.0
}
{
"provider": "vertexai",
"credential_name": "VERTEXAI_CRED",
"url": "https://googleapis.example.com/models/",
"model": "summarize-model",
"generation_config": {
"temperature": 0.9,
"topP": 1,
"candidateCount": 1,
"maxOutputTokens": 256
}
}
Table 211-13 Additional REST Provider Parameter Details
Parameter | Description |
---|---|
length | Approximate length of the summary text: short, medium, long, or auto. Note: For Generative AI, you must enter this value in uppercase. |
format | Format to display the summary: paragraph, bullets, or auto. Note: For Generative AI, you must enter this value in uppercase. |
temperature | Degree of randomness used when generating output text, in the range of 0.0-5.0. To generate the same output for a prompt, use 0. The default temperature is 1. Note: To summarize a text, start with the temperature set to 0, and increase it as you regenerate the summaries to get more creative output. |
extractiveness | How much to reuse the input in the summary: low (summaries tend to paraphrase), high (summaries lean toward reusing sentences from the input), or auto. Note: For Generative AI, you must enter this value in uppercase. |
max_tokens | Maximum number of tokens in the output text. |
topP | Probability of tokens in the output, in the range of 0.0-1.0. A lower value provides less random responses and a higher value provides more random responses. |
candidateCount | Number of response variations to return, in the range of 1-8. |
maxOutputTokens | Maximum number of tokens to generate for each response. |
Note:
When you enter the length, format, and extractiveness values for Generative AI, be sure to enter them in uppercase letters.
For more information on additional parameters, refer to your third-party provider's documentation.
Examples
These statements generate a summary from an extract on "Transactions" by accessing the Generative AI summarization model.
-- select example
var params clob;
exec :params := '
{
"provider": "ocigenai",
"credential_name": "OCI_CRED",
"url": "https://generativeai.oci.example.com/summarizeText",
"model": "summarize.modelname",
"temperature": "0.0",
"extractiveness": "LOW"
}';
select dbms_vector_chain.utl_to_summary(
'A transaction is a logical, atomic unit of work that contains one or more SQL statements. An RDBMS must be able to group SQL statements so that they are either all committed, which means they are applied to the database, or all rolled back, which means they are undone. An illustration of the need for transactions is a funds transfer from a savings account to a checking account. The transfer consists of the following separate operations:
1. Decrease the savings account.
2. Increase the checking account.
3. Record the transaction in the transaction journal.
Oracle Database guarantees that all three operations succeed or fail as a unit. For example, if a hardware failure prevents a statement in the transaction from executing, then the other statements must be rolled back.
Transactions set Oracle Database apart from a file system. If you perform an atomic operation that updates several files, and if the system fails halfway through, then the files will not be consistent. In contrast, a transaction moves an Oracle database from one consistent state to another. The basic principle of a transaction is "all or nothing": an atomic operation succeeds or fails as a whole.',
json(:params)) from dual;
-- PL/SQL example
declare
input clob;
params clob;
output clob;
begin
input := 'A transaction is a logical, atomic unit of work that contains one or more SQL statements. An RDBMS must be able to group SQL statements so that they are either all committed, which means they are applied to the database, or all rolled back, which means they are undone. An illustration of the need for transactions is a funds transfer from a savings account to a checking account. The transfer consists of the following separate operations:
1. Decrease the savings account.
2. Increase the checking account.
3. Record the transaction in the transaction journal.
Oracle Database guarantees that all three operations succeed or fail as a unit. For example, if a hardware failure prevents a statement in the transaction from executing, then the other statements must be rolled back.
Transactions set Oracle Database apart from a file system. If you perform an atomic operation that updates several files, and if the system fails halfway through, then the files will not be consistent. In contrast, a transaction moves an Oracle database from one consistent state to another. The basic principle of a transaction is "all or nothing": an atomic operation succeeds or fails as a whole.';
params := '
{
"provider": "ocigenai",
"credential_name": "OCI_CRED",
"url": "https://generativeai.oci.example.com/summarizeText",
"model": "summarize.modelname",
"length": "MEDIUM",
"format": "PARAGRAPH",
"temperature": 1.0
}';
output := dbms_vector_chain.utl_to_summary(input, json(params));
dbms_output.put_line(output);
if output is not null then
dbms_lob.freetemporary(output);
end if;
exception
when OTHERS THEN
DBMS_OUTPUT.PUT_LINE (SQLERRM);
DBMS_OUTPUT.PUT_LINE (SQLCODE);
end;
/
To run an end-to-end example scenario using this function, see Oracle Database AI Vector Search User's Guide.
211.2 Supported Languages and Data File Locations
These are the supported languages for which language data files are distributed by default in the specified directories.
Language Name | Abbreviation | Data File |
---|---|---|
AFRIKAANS | af | |
AMERICAN | us | |
ARABIC | ar | |
BASQUE | eu | |
BELARUSIAN | be | |
BRAZILIAN PORTUGUESE | ptb | |
BULGARIAN | bg | |
CANADIAN FRENCH | frc | |
CATALAN | ca | |
CROATIAN | hr | |
CYRILLIC SERBIAN | csr | |
CZECH | cs | |
DANISH | dk | |
DARI | prs | |
DUTCH | nl | |
EGYPTIAN | eg | |
ENGLISH | gb | |
ESTONIAN | et | |
FINNISH | sf | |
FRENCH | f | |
GALICIAN | ga | |
GERMAN | d | |
GERMAN DIN | din | |
GREEK | el | |
HEBREW | iw | |
HINDI | hi | |
HUNGARIAN | hu | |
ICELANDIC | is | |
INDONESIAN | in | |
ITALIAN | i | |
JAPANESE | ja | |
KOREAN | ko | |
LATIN AMERICAN SPANISH | esa | |
LATIN BOSNIAN | lbs | |
LATIN SERBIAN | lsr | |
LATVIAN | lv | |
LITHUANIAN | lt | |
MACEDONIAN | mk | |
MALAY | ms | |
MEXICAN SPANISH | esm | |
NORWEGIAN | n | |
NYNORSK | nn | |
PERSIAN | fa | |
POLISH | pl | |
PORTUGUESE | pt | |
ROMANIAN | ro | |
RUSSIAN | ru | |
SIMPLIFIED CHINESE | zhs | |
SLOVAK | sk | |
SLOVENIAN | sl | |
SPANISH | e | |
SWEDISH | s | |
THAI | th | |
TRADITIONAL CHINESE | zht | |
TURKISH | tr | |
UKRAINIAN | uk | |
URDU | ur | |
Related Topics