- Oracle AI Vector Search User's Guide
- Generate Vector Embeddings
- Vector Generation Examples
- Perform Chunking With Embedding
- Convert Text to Chunks With Custom Chunking Specifications
Convert Text to Chunks With Custom Chunking Specifications
A chunked output, especially for long and complex documents, sometimes loses contextual meaning or coherence with its parent content. In this example, you can see how to refine your chunks by applying custom chunking specifications.
Here, you use the VECTOR_CHUNKS
SQL function or the UTL_TO_CHUNKS()
PL/SQL function from the DBMS_VECTOR_CHAIN
package.
- Connect to Oracle Database as a local user.
- Log in to SQL*Plus as the
SYS
user, connecting asSYSDBA
, to your PDB:conn sys/password@CDB_PDB as sysdba
CREATE TABLESPACE tbs1 DATAFILE 'tbs5.dbf' SIZE 20G AUTOEXTEND ON EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;
- Create a local test user (
docuser
) and grant necessary privileges:DROP USER docuser cascade;
CREATE USER docuser identified by docuser DEFAULT TABLESPACE tbs1 quota unlimited on tbs1;
GRANT DB_DEVELOPER_ROLE to docuser;
- Connect to Oracle Database as
docuser
and alter the environment settings for your session:CONN docuser/password@CDB_PDB
SET ECHO ON SET FEEDBACK 1 SET NUMWIDTH 10 SET LINESIZE 80 SET TRIMSPOOL ON SET TAB OFF SET PAGESIZE 10000 SET LONG 10000
- Log in to SQL*Plus as the
- Create a table to store chunks.
Create a relational table (
documentation_tab
) to store unstructured text chunks in it:DROP TABLE IF EXISTS documentation_tab; CREATE TABLE documentation_tab ( id NUMBER, text VARCHAR2(2000));
INSERT INTO documentation_tab VALUES(1, 'Oracle AI Vector Search stores and indexes vector embeddings'|| ' for fast retrieval and similarity search.'||CHR(10)||CHR(10)|| ' About Oracle AI Vector Search'||CHR(10)|| ' Vector Indexes are a new classification of specialized indexes'|| ' that are designed for Artificial Intelligence (AI) workloads that allow'|| ' you to query data based on semantics, rather than keywords.'||CHR(10)||CHR(10)|| ' Why Use Oracle AI Vector Search?'||CHR(10)|| ' The biggest benefit of Oracle AI Vector Search is that semantic search'|| ' on unstructured data can be combined with relational search on business'|| ' data in one single system.'||CHR(10)); COMMIT;
SET LINESIZE 1000; SET PAGESIZE 200; COLUMN doc FORMAT 999; COLUMN id FORMAT 999; COLUMN pos FORMAT 999; COLUMN siz FORMAT 999; COLUMN txt FORMAT a60; COLUMN data FORMAT a80;
- Call VECTOR_CHUNKS.
Call the
VECTOR_CHUNKS
SQL function and specify the following custom chunking parameters:SELECT D.id doc, C.chunk_offset pos, C.chunk_length siz, C.chunk_text txt FROM documentation_tab D, VECTOR_CHUNKS(D.text BY words MAX 50 OVERLAP 0 SPLIT BY recursively LANGUAGE american NORMALIZE all) C;
To call the same operation using the
UTL_TO_CHUNKS
function from theDBMS_VECTOR_CHAIN
package, run:SELECT D.id doc, JSON_VALUE(C.column_value, '$.chunk_id' RETURNING NUMBER) AS id, JSON_VALUE(C.column_value, '$.chunk_offset' RETURNING NUMBER) AS pos, JSON_VALUE(C.column_value, '$.chunk_length' RETURNING NUMBER) AS siz, JSON_VALUE(C.column_value, '$.chunk_data') AS txt FROM documentation_tab D, dbms_vector_chain.utl_to_chunks(D.text, JSON('{"by":"words", "max":"50", "overlap":"0", "split":"recursively", "language":"american", "normalize":"all"}')) C;
This returns a set of three chunks, which are split by words recursively using blank lines, new lines, and spaces:
DOC POS SIZ TXT ---- ---- ---- ------------------------------------------------------------ 1 1 108 Oracle AI Vector Search stores and indexes vector embeddings for fast retrieval and similarity search. 1 109 234 About Oracle AI Vector Search Vector Indexes are a new classification of specialized index es that are designed for Artificial Intelligence (AI) worklo ads that allow you to query data based on semantics, rather than keywords. 1 343 204 Why Use Oracle AI Vector Search? The biggest benefit of Oracle AI Vector Search is that seman tic search on unstructured data can be combined with relatio nal search on business data in one single system.
- To further clean up the chunking results, supply JSON parameters for
UTL_TO_CHUNKS
:SELECT C.* FROM documentation_tab D, dbms_vector_chain.utl_to_chunks(D.text, JSON('{"by":"words", "max":"50", "overlap":"0", "split":"recursively", "language":"american", "normalize":"all"}')) C;
This returns chunk texts in JSON format:COLUMN_VALUE -------------------------------------------------------------------------------- {"chunk_id":1,"chunk_offset":1,"chunk_length":108,"chunk_data":"Oracle AI Vector Search stores and indexes vector embeddings for fast retrieval and similarity s earch."} {"chunk_id":2,"chunk_offset":109,"chunk_length":234,"chunk_data":"About Oracle A I Vector Search\nVector Indexes are a new classification of specialized indexes that are designed for Artificial Intelligence (AI) workloads that allow you to q uery data based on semantics, rather than keywords."} {"chunk_id":3,"chunk_offset":343,"chunk_length":204,"chunk_data":"Why Use Oracle AI Vector Search?\nThe biggest benefit of Oracle AI Vector Search is that seman tic search on unstructured data can be combined with relational search on busine ss data in one single system."}
The chunking results contain:-
chunk_id
: Chunk ID for each chunk -
chunk_offset
: Original position of each chunk in the source document, relative to the start of document (which has a position of1
) -
chunk_length
: Character length of each chunk -
chunk_data
: Text pieces from each chunk
-
Related Topics
Parent topic: Perform Chunking With Embedding