VECTOR_CHUNKS
chunks_table_arguments::=
chunking_spec::=
split_characters_list::=
custom_split_characters_list
normalization_spec
custom_normalization_spec
normalization_mode
chunking_mode::=
Purpose
VECTOR_CHUNKS is a one-to-many table function that takes a row of textual input and returns the output as one or more chunks with metadata.
The chunking output includes:
- chunk_offset: Position of each chunk (NUMBER) in the source document, relative to the start of the document, which has a position of 1
- chunk_length: Character length (NUMBER) of each chunk
- chunk_text: Text pieces from each chunk
VECTOR_CHUNKS takes as input one of the following data types: VARCHAR2, CHAR, CLOB, NVARCHAR2, NCLOB, NCHAR. It is aware of the database character set and the national language character set.
It returns as output a text chunk as VARCHAR2 or NVARCHAR2.
Table 7-13 Input and Output Data Type Details
Input Data Type | Database NLS Parameter | Input Encoding | Output Data Type | Output Offset |
---|---|---|---|---|
|
|
Any |
|
|
|
|
Any |
|
byte |
|
Note: |
Any |
|
|
|
|
(also |
|
|
|
|
(also |
|
|
|
|
(also |
|
|
Note:
- The SQL NCHAR, NVARCHAR2, and NCLOB data types support Unicode data only. You can use either the UTF8 or the AL16UTF16 character set. The default is AL16UTF16.
- The VARCHAR2 input data type is limited to 4000 bytes unless the MAX_STRING_SIZE parameter is set to EXTENDED, which increases the limit to 32767 bytes.
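Both notes depend on database-level settings. As a minimal sketch, assuming your account can query the standard NLS_DATABASE_PARAMETERS and V$PARAMETER views, the following queries show which character sets and which MAX_STRING_SIZE setting apply in your database:

```sql
-- Database character set (used by VARCHAR2, CHAR, CLOB) and
-- national character set (used by NVARCHAR2, NCHAR, NCLOB).
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');

-- STANDARD keeps the 4000-byte VARCHAR2 limit; EXTENDED raises it to 32767 bytes.
-- Querying V$PARAMETER requires the appropriate privileges (assumption).
SELECT name, value
FROM v$parameter
WHERE name = 'max_string_size';
```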
Parameters
All chunking parameters are optional, and the default chunking specifications are automatically applied to your chunk data.
When specifying chunking parameters for this API, ensure that you provide these parameters only in the listed order.
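Because every parameter is optional, the function can be called with the text column alone. The sketch below assumes the documentation_tab table created in the Example section of this page. The first query relies entirely on the defaults; the second supplies only the first two parameters, in the listed order, reusing values from that example:

```sql
-- No chunking specification: all parameters take their default values.
SELECT D.id, C.chunk_offset, C.chunk_length, C.chunk_text
FROM documentation_tab D, VECTOR_CHUNKS(D.text) C;

-- Partial specification: only the chunking mode and the maximum size are set;
-- the remaining parameters keep their defaults.
SELECT D.id, C.chunk_offset, C.chunk_length, C.chunk_text
FROM documentation_tab D, VECTOR_CHUNKS(D.text BY words MAX 200) C;
```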
Table 7-14 Parameters Table
Parameter | Description and Acceptable Values |
---|---|
BY | Specifies the mode for splitting your data, that is, whether to split by counting the number of characters, words, or vocabulary tokens. |
MAX | Specifies a limit on the maximum size of each chunk. This setting splits the input text at the fixed point where the maximum limit is reached in the larger text. The units of this limit depend on the chunking mode (characters, words, or vocabulary tokens). |
SPLIT BY | Specifies where to split the input text when it reaches the maximum size limit. This helps to keep related data together by defining appropriate boundaries for chunks. |
OVERLAP | Specifies the amount (as a positive integer literal or zero) of the preceding text that the chunk should contain, if any. This helps in logically splitting up related text (such as a sentence) by including some amount of the preceding chunk text. The amount of overlap depends on how the maximum size of the chunk is measured (in characters, words, or vocabulary tokens). The overlap begins at the specified split condition. |
LANGUAGE | Specifies the language of your input data. This clause is important, especially when your text contains certain characters (for example, punctuation or abbreviations) that may be interpreted differently in another language. Valid value: Any NLS-supported language name or language abbreviation, as listed in Oracle Database Globalization Support Guide. You must use double quotation marks (") for any language name that contains spaces. For one-word language names, quotation marks are not needed. For example: LANGUAGE american |
NORMALIZE | Automatically pre-processes or post-processes issues (such as multiple consecutive spaces and smart quotes) that may arise when documents are converted into text. Oracle recommends that you use this mode to extract good-quality chunks. Note: To apply specific normalization modes, you must specify them as a comma-separated list. Default value: None |
EXTENDED | Increases the output limit of a VARCHAR2 string to 32767 bytes, without requiring you to set the MAX_STRING_SIZE parameter to EXTENDED. |
Example
-- Create and populate a small sample table
CREATE TABLE documentation_tab (
  id   NUMBER,
  text VARCHAR2(2000));

INSERT INTO documentation_tab
  VALUES (1, 'sample');

COMMIT;

-- SQL*Plus display settings
SET LINESIZE 100;
SET PAGESIZE 20;
COLUMN pos FORMAT 999;
COLUMN siz FORMAT 999;
COLUMN txt FORMAT a60;

PROMPT SQL VECTOR_CHUNKS
-- Return one row per chunk: offset, length, and text
SELECT D.id id, C.chunk_offset pos, C.chunk_length siz, C.chunk_text txt
FROM documentation_tab D, VECTOR_CHUNKS(D.text BY words
                                        MAX 200
                                        OVERLAP 10
                                        SPLIT BY recursively
                                        LANGUAGE american
                                        NORMALIZE all) C;
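Chunks are often stored for later processing. The following sketch is not part of the original example: it materializes the output of the query above in a hypothetical doc_chunks table (the table name and the column aliases chunk_count and longest_chunk are chosen only for illustration) and then summarizes the chunks produced for each source row. It assumes that CREATE TABLE ... AS SELECT is permitted on this query in your environment.

```sql
-- doc_chunks is a hypothetical table name used only for this illustration.
CREATE TABLE doc_chunks AS
SELECT D.id, C.chunk_offset, C.chunk_length, C.chunk_text
FROM documentation_tab D, VECTOR_CHUNKS(D.text BY words
                                        MAX 200
                                        OVERLAP 10
                                        SPLIT BY recursively
                                        LANGUAGE american
                                        NORMALIZE all) C;

-- Number of chunks and length of the longest chunk for each source row.
SELECT id, COUNT(*) AS chunk_count, MAX(chunk_length) AS longest_chunk
FROM doc_chunks
GROUP BY id
ORDER BY id;
```

Because VECTOR_CHUNKS is one-to-many, each row of documentation_tab can yield several rows in doc_chunks, so id alone is not unique there; the pair (id, chunk_offset) identifies a chunk.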
See Also:
- For a complete set of examples on each of the chunking parameters listed in the preceding table, see Explore Chunking Techniques and Examples in the Oracle AI Vector Search User's Guide.
- To run an end-to-end example scenario using this function, see Convert Text to Chunks With Custom Chunking Specifications in the Oracle AI Vector Search User's Guide.