MySQL HeatWave User Guide
This section describes how to generate vector embeddings for files or folders stored in Object Storage, and load the embeddings into a vector store table.
The following sections in this topic describe how to ingest files into a vector store:
Review the MySQL HeatWave GenAI requirements and privileges.
If not already done, create an Oracle Cloud Infrastructure (OCI) bucket for storing files that you want to ingest into the vector store. Then, upload the files to the bucket.
Vector store can ingest files in the following formats: PDF, PPT, TXT, HTML, and DOC.
To run the steps in this topic, create an Object Storage
bucket with the name demo_bucket
.
Download the
MySQL AI;
user guide PDF, then upload it to
demo_bucket
.
Connect to your MySQL HeatWave DB System.
X protocol is not supported for Ingest Files Using Asynchronous Load. To set up a vector store using this method, ensure that you use the classic MySQL protocol while connecting to the database:
mysqlsh -uDBSystemAdminName
-pDBSystemPassword
-hDBSystemPrivateIP
--sqlc
To create and store vector store tables using the steps
described in this topic, you can create a new database
demo_db
:
mysql> CREATE DATABASE demo_db;
The
VECTOR_STORE_LOAD
routine creates and loads vector embeddings asynchronously
into the vector store. You can ingest the source files into
the vector store using the following methods:
This section describes how to load source documents from the bucket into the vector table using the uniform resource identifier (URI) of the object.
To use this method, you need to enable the DB system to access the Oracle Cloud Infrastructure Object Storage bucket. For more information, see Resource Principals.
To set up a new vector store using an object URI, perform the following steps:
To create the vector store table, use a new or existing database:
mysql> USE DBName
;
Replace DBName
with the
database name.
For example:
mysql> USE demo_db;
As of MySQL 9.3.1, this step is not required. MySQL HeatWave no
longer uses the mysql_task_management
schema, so you can delete the schema if you are not using
it.
In earlier versions of MySQL, if you are loading a vector store table on a database system for the first time, create a new schema dedicated for task management:
SELECT mysql_task_management_ensure_schema();
This creates a new schema,
mysql_task_management
, which keeps
track of the tasks that run in the background. This schema
contains internal tables that store the task details such
as task ID and task logs. These internal tables must not
be altered as it can cause the asynchronous task created
by VECTOR_STORE_LOAD
to fail.
Optionally, to specify a name for the vector store table
and language to use, set the @options
variable:
mysql> SET @options = JSON_OBJECT("table_name", "VectorStoreTableName
", "language", "Language
");
Replace the following:
VectorStoreTableName
: the
name you want for the vector store table.
Language
: the two-letter
ISO 639-1
code for the language you
want to use. Default language is
en
, which is English. To view the
list of supported languages, see
Languages.
The language
parameter is supported
as of MySQL 9.0.1-u1.
For example:
mysql> SET @options = JSON_OBJECT("table_name", "demo_embeddings", "language", "en");
As of MySQL 9.1.2, Optical Character Recognition (OCR) is enabled by default.
To learn more about the available routine options, see VECTOR_STORE_LOAD Syntax.
To ingest the file from the Object Storage, create vector
embeddings, and load the vector embeddings into MySQL HeatWave,
use the VECTOR_STORE_LOAD
routine:
mysql> CALL sys.VECTOR_STORE_LOAD("oci://BucketName
@Namespace
/Path
/Filename
", @options);
Replace the following:
BucketName
: the OCI Object
Storage bucket name.
Namespace
: the name of the
bucket namespace.
Path
: path to the folder
that contains the source file.
Filename
: the filename with
the file extension.
For example:
mysql> CALL sys.VECTOR_STORE_LOAD("oci://demo_bucket@demo_namespace/heatwave-en.pdf", @options);
This creates an asynchronous task that runs in background
and loads the vector embeddings into the specified vector
store table. The output of the
VECTOR_STORE_LOAD
routine contains the
following:
An ID of the task which was created.
A task query that you can use to track the progress of task.
If MySQL HeatWave GenAI detects multiple files with the same or
different file formats in a single load, it creates a
separate table for every format it finds. The table name
for each format is the specified or default table name
followed by the format. For example,
demo_embeddings_pdf
is the name of the
table that contains PDF files.
After the task is completed, verify that embeddings are loaded in the vector store table:
mysql> SELECT COUNT(*) FROM VectorStoreTableName
;
For example:
mysql> SELECT COUNT(*) FROM demo_embeddings;
If you see a numerical value in the output, your embeddings are successfully loaded in the vector store table.
To view the details of the vector store table, use the following statement:
mysql> DESCRIBE demo_embeddings;
+-------------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+-------+
| document_name | varchar(1024) | NO | | NULL | |
| metadata | json | NO | | NULL | |
| document_id | int unsigned | NO | PRI | NULL | |
| segment_number | int unsigned | NO | PRI | NULL | |
| segment | varchar(1024) | NO | | NULL | |
| segment_embedding | vector(384) | NO | | NULL | |
+-------------------+---------------+------+-----+---------+-------+
This section describes how to ingest source documents from the using pre-authenticated requests (PAR). Use this method if OCI bucket access is not enabled on your DB system.
For confidential data, Using the Uniform Resource Identifier with Asynchronous Load is recommended for ingesting the source files into the vector store as it is a more secure method.
To set up a new vector store, perform the following steps:
To create the vector store table, use a new or existing database:
mysql> USE DBName
;
Replace DBName
with the
database name.
For example:
mysql> USE demo_db;
As of MySQL 9.3.1, this step is not required.
In earlier versions of MySQL, if you are loading a vector store table on a database system for the first time, create a new schema dedicated for task management:
SELECT mysql_task_management_ensure_schema();
This creates a new schema,
mysql_task_management
, which keeps
track of the tasks that run in the background. This schema
contains internal tables that store the task details such
as task ID and task logs. These internal tables must not
be altered as it can cause the asynchronous task created
by VECTOR_STORE_LOAD
to fail.
Optionally, to specify a name for the vector store table
and language to use, set the @options
variable:
mysql> SET @options = JSON_OBJECT("table_name", "VectorStoreTableName
", "language", "Language
");
Replace the following:
VectorStoreTableName
: the
name you want for the vector store table.
Language
: the two-letter
ISO 639-1
code for the language you
want to use. Default language is
en
, which is English. To view the
list of supported languages, see
Languages.
The language
parameter is supported
as of MySQL 9.0.1-u1.
For example:
mysql> SET @options = JSON_OBJECT("table_name", "demo_embeddings_par", "language", "en");
As of MySQL 9.1.2, Optical Character Recognition (OCR) is enabled by default.
To learn more about the available routine options, see VECTOR_STORE_LOAD Syntax.
To ingest the file from the Object Storage, create vector
embeddings, and load the vector embeddings into MySQL HeatWave,
use the VECTOR_STORE_LOAD
routine:
mysql> CALL sys.VECTOR_STORE_LOAD("PAR
", @options);
Replace PAR
with PAR of the
bucket, folder, or file that you want to use to set up the
vector store.
To learn how to create PAR for your , see Creating a PAR Request in .
If you are creating a PAR for a folder or Object Storage bucket, then select Enable Object Listing in the Create Pre-Authenticated Request dialog to enable object listing.
For example:
mysql> CALL sys.VECTOR_STORE_LOAD("https://demo.objectstorage.demo-region.oci.customer-oci.com/p/demo-url/n/demo/b/demo-bucket/o/heatwave-en.pdf",
@options);
This creates a task that runs in background and loads the
vector embeddings into the specified vector store table.
The output of the VECTOR_STORE_LOAD
routine contains the following:
An ID of the task which was created.
A task query that you can use to track the progress of task.
If MySQL HeatWave GenAI detects files with different file formats in
a single load, it creates a separate table for every
format it finds. The table name for each format is the
specified or default table name followed by the format.
For example, demo_embeddings_par_pdf
is
the name of the table that contains PDF files.
After the task is completed, verify that embeddings are loaded in the vector store table:
mysql> SELECT COUNT(*) FROM VectorStoreTableName
;
For example:
mysql> SELECT COUNT(*) FROM demo_embeddings_par;
If you see a numerical value in the output, your embeddings are successfully loaded in the vector store table.
To view the details of the vector store table, use the following statement:
mysql> DESCRIBE demo_embeddings_par;
+-------------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+-------+
| document_name | varchar(1024) | NO | | NULL | |
| metadata | json | NO | | NULL | |
| document_id | int unsigned | NO | PRI | NULL | |
| segment_number | int unsigned | NO | PRI | NULL | |
| segment | varchar(1024) | NO | | NULL | |
| segment_embedding | vector(384) | NO | | NULL | |
+-------------------+---------------+------+-----+---------+-------+
The
HEATWAVE_LOAD
routine creates and loads vector embeddings into the vector
store using auto parallel load.
To ingest files using the
HEATWAVE_LOAD
routine, perform the following steps:
To create a vector store table, use a new or existing database:
mysql> USE DBName
;
Replace DBName
with the name
you want for the new database.
For example:
mysql> USE demo_db;
To ingest the file from Object Storage and create vector
embeddings in a new vector store table, set the
@dl_tables
variable:
SET @dl_tables = '[{ "db_name": "DBName
", "tables": [{ "table_name": "VectorStoreTableName
", "engine_attribute": { "dialect": { "format": "FileFormat
", "language": "Language
" }, "file": [{"par": "PAR
"}], "chunking": { "split_by": "SplitBy
", "max":Max
, "by": "By
", "truncate":Truncate
, "overlap":Overlap
} } }] }]';
Replace the following:
DBName
: the database name.
VectorStoreTableName
: the
name you want for the vector store table.
FileFormat
: the formats of
the files to be ingested into the vector store table.
The supported file formats are
html
, pdf
,
ppt
, pptx
txt
, doc
, and
docx
. To ingest multiple files with
different unstructured data file formats into the
vector store table in a single load, replace
FileFormat
with
auto_unstructured
.
Language
: the two-letter
ISO 639-1
code for the language you
want to use. Default language is
en
, which is English. To view the
list of supported languages, see
Languages.
The language
parameter is supported
as of MySQL 9.0.1-u1.
PAR
: the pre-authenticated
request (PAR) detail of the bucket, folder, or file
that you want to use to set up the vector store.
To learn how to create PAR for your , see Creating a PAR request in Object Storage.
If you are creating a PAR for a folder or Object Storage bucket, then select Enable Object Listing to enable object listing in the Create Pre-Authenticated Request dialog while creating the PAR.
As of MySQL 9.3.2, you can optionally set the
following chunking
parameter values
for customized text segmentation during vector store
creation:
SplitBy
: method to use
for splitting the text into segments. It can be
one of the following:
page
: for text segmentation
based on the pagination available in the
document. This segmentation method is
supported for PDF and PPT documents only. If
used for other documents, the routine falls
back to the document
text
segmentation method for unsupported documents.
paragraph
: for text
segmentation based on the paragraphs
identified in the document. Wherein, a
paragraph is a piece of text separated from
another piece of text using
\n\n
characters.
sentence
: for text
segmentation based on the sentences identified
in the document. Wherein, a sentence is a
sequence of words that is separated from
another sequence of words using a punctuation
that marks the end of a sentence:
.
, !
, or
?
. For PDF files, when OCR
is enabled, any sentence that spills into the
next page is considered as a two separate
sentences.
document
: for putting an
entire document into one text segment. In case
the max
parameter is not
set, you cannot overlap across documents.
recursive
: for using the
default text segmentation method used while
creating vector store tables in previous
versions of MySQL. This method provides
backward compatibility with vector store
tables created in previous versions of MySQL.
Default value is recursive
.
Max
: maximum number of
characters or words to be included in each
segment. If left unspecified, there is no maximum
size enforcement on the chunks of text. For words,
this value can be up to 100000
and for characters, this value can be up to
1000000
. This is unspecified by
default.
Overlap
: maximum number
of characters or words to overlap between segments
on each side. For words, this value can be up to
50000
and for characters, this
value can be up to 500000
.
Default value is 0
.
By
: the unit to use for
defining the maximum and overlap text segment
limits using the max
and
overlap
parameter. It can be
set to characters
or
words
. Default value is
characters
.
Truncate
: set to
true
to enable truncation of
text segments that are too large for the embedding
model to handle without truncation. However, when
tructation is enabled, the end of the text segment
is trimmed out, and you might lose the information
from the part of the text segment that is trimmed
out.
Set to false
to disable
truncation and instead throw an error when a text
segment that is too large for the embeeding model
is found. In this case, the entire load fails and
exits with an error. You can adjust the text
segmentation method used to ensure that the size
of the text segments is within the segment size
limit of the embedding model.
Default value is true
.
For example:
mysql> SET @dl_tables = '[{
"db_name": "demo_db",
"tables": [{
"table_name": "demo_embeddings_apl",
"engine_attribute": {
"dialect": {
"format": "pdf",
"language": "en"
},
"file": [{"par": "https://demo.objectstorage.demo-region.oci.customer-oci.com/p/demo-url/n/demo/b/demo_bucket/o/heatwave-en.pdf"}],
"chunking": {
"split_by": "recursive"
}
}
}]
}]';
As of MySQL 9.1.2,
Optical
Character Recognition (OCR) is enabled by default.
In MySQL 9.1.0 and 9.1.1, to enable OCR, set the
ocr
dialect parameter to
true
.
To prepare for loading the vector embeddings into the
MySQL HeatWave system, set the @options
variable:
mysql> SET @options = JSON_OBJECT('mode', 'normal');
To load the vector embeddings into MySQL HeatWave, use the
HEATWAVE_LOAD
routine:
mysql> CALL sys.HEATWAVE_LOAD(CAST(@dl_tables AS JSON), @options);
This creates and stores the vector embeddings in the specified vector store table.
Verify that embeddings are loaded in the vector store table:
mysql> SELECT COUNT(*) FROM VectorStoreTableName
;
For example:
mysql> SELECT COUNT(*) FROM demo_embeddings_apl;
If you see a numerical value in the output, your embeddings are successfully loaded in the table.
To view the details of the vector store table, use the following statement:
mysql> DESCRIBE demo_embeddings_apl;
+-------------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+-------+
| document_name | varchar(1024) | NO | | NULL | |
| metadata | json | NO | | NULL | |
| document_id | int unsigned | NO | PRI | NULL | |
| segment_number | int unsigned | NO | PRI | NULL | |
| segment | varchar(1024) | NO | | NULL | |
| segment_embedding | vector(384) | NO | | NULL | |
+-------------------+---------------+------+-----+---------+-------+
To avoid being billed for the resources that you created for this topic, ensure that you delete them:
Delete the database:
mysql> DROP DATABASE demo_db;
Delete demo_bucket
. For more
information, see
Deleting
the Bucket.
Learn how to Update the Vector Store.
Learn how to Perform Vector Search With Retrieval-Augmented Generation.
Learn how to Start a Conversational Chat.