Load Character Vector Data Using SQL*Loader Example
In this example, you can see how to use SQL*Loader to load vector data into a five-dimension vector space.
Let's imagine we have the following text documents classifying galaxies by their types:
- DOC1: "Messier 31 is a barred spiral galaxy in the Andromeda constellation which has a lot of barred spiral galaxies."
- DOC2: "Messier 33 is a spiral galaxy in the Triangulum constellation."
- DOC3: "Messier 58 is an intermediate barred spiral galaxy in the Virgo constellation."
- DOC4: "Messier 63 is a spiral galaxy in the Canes Venatici constellation."
- DOC5: "Messier 77 is a barred spiral galaxy in the Cetus constellation."
- DOC6: "Messier 91 is a barred spiral galaxy in the Coma Berenices constellation."
- DOC7: "NGC 1073 is a barred spiral galaxy in Cetus constellation."
- DOC8: "Messier 49 is a giant elliptical galaxy in the Virgo constellation."
- DOC9: "Messier 60 is an elliptical galaxy in the Virgo constellation."
You can create vectors representing the preceding galaxy's classes using the following five-dimension vector space based on the count of important words appearing in each document:
Table 5-1 Five dimension vector space
Galaxy Classes | Intermediate | Barred | Spiral | Giant | Elliptical |
---|---|---|---|---|---|
M31 | 0 | 2 | 2 | 0 | 0 |
M33 | 0 | 0 | 1 | 0 | 0 |
M58 | 1 | 1 | 1 | 0 | 0 |
M63 | 0 | 0 | 1 | 0 | 0 |
M77 | 0 | 1 | 1 | 0 | 0 |
M91 | 0 | 1 | 1 | 0 | 0 |
M49 | 0 | 0 | 0 | 1 | 1 |
M60 | 0 | 0 | 0 | 0 | 1 |
NGC1073 | 0 | 1 | 1 | 0 | 0 |
This naturally gives you the following vectors:
- M31:
[0,2,2,0,0]
- M33:
[0,0,1,0,0]
- M58:
[1,1,1,0,0]
- M63:
[0,0,1,0,0]
- M77:
[0,1,1,0,0]
- M91:
[0,1,1,0,0]
- M49:
[0,0,0,1,1]
- M60:
[0,0,0,0,1]
- NGC1073:
[0,1,1,0,0]
You can use SQL*Loader to load this data into the GALAXIES
database table defined as:
drop table galaxies purge;
create table galaxies (id number, name varchar2(50), doc varchar2(500), embedding vector);
Based on the data described previously, you can create the following galaxies_vec.csv
file:
1:M31:Messier 31 is a barred spiral galaxy in the Andromeda constellation which has a lot of barred spiral galaxies.:[0,2,2,0,0]:
2:M33:Messier 33 is a spiral galaxy in the Triangulum constellation.:[0,0,1,0,0]:
3:M58:Messier 58 is an intermediate barred spiral galaxy in the Virgo constellation.:[1,1,1,0,0]:
4:M63:Messier 63 is a spiral galaxy in the Canes Venatici constellation.:[0,0,1,0,0]:
5:M77:Messier 77 is a barred spiral galaxy in the Cetus constellation.:[0,1,1,0,0]:
6:M91:Messier 91 is a barred spiral galaxy in the Coma Berenices constellation.:[0,1,1,0,0]:
7:M49:Messier 49 is a giant elliptical galaxy in the Virgo constellation.:[0,0,0,1,1]:
8:M60:Messier 60 is an elliptical galaxy in the Virgo constellation.:[0,0,0,0,1]:
9:NGC1073:NGC 1073 is a barred spiral galaxy in Cetus constellation.:[0,1,1,0,0]:
Here is a possible SQL*Loader control file galaxies_vec.ctl
:
recoverable
LOAD DATA
infile 'galaxies_vec.csv'
INTO TABLE galaxies
fields terminated by ':'
trailing nullcols
(
id,
name char (50),
doc char (500),
embedding char (32000)
)
Note:
You cannot use comma-delimited vectors (vectors separated by commas) as the field terminator in your CSV file. You must use another deliminator. In these examples the deliminator is a colon (:
).
After you have created the two files galaxies_vec.csv
and galaxies_vec.ctl
, you can run the following sequence of instructions directly from your favorite SQL command line tool:
host sqlldr vector/vector@CDB1_PDB1 control=galaxies_vec.ctl log=galaxies_vec.log
SQL*Loader: Release 23.0.0.0.0 - Development on Thu Jan 11 19:46:21 2024
Version 23.4.0.23.00
Copyright (c) 1982, 2024, Oracle and/or its affiliates. All rights reserved.
Path used: Conventional
Commit point reached - logical record count 10
Table GALAXIES2:
9 Rows successfully loaded.
Check the log file:
galaxies_vec.log
for more information about the load.
SQL>
select * from galaxies;
ID NAME DOC EMBEDDING
--- ------ ---------------------------------------------------------- ---------------------------------
1 M31 Messier 31 is a barred spiral galaxy in the Andromeda ... [0,2.0E+000,2.0E+000,0,0]
2 M33 Messier 33 is a spiral galaxy in the Triangulum ... [0,0,1.0E+000,0,0]
3 M58 Messier 58 is an intermediate barred spiral galaxy ... [1.0E+000,1.0E+000,1.0E+000,0,0]
4 M63 Messier 63 is a spiral galaxy in the Canes Venatici ... [0,0,1.0E+000,0,0]
5 M77 Messier 77 is a barred spiral galaxy in the Cetus ... [0,1.0E+000,1.0E+000,0,0]
6 M91 Messier 91 is a barred spiral galaxy in the Coma ... [0,1.0E+000,1.0E+000,0,0]
7 M49 Messier 49 is a giant elliptical galaxy in the Virgo ... [0,0,0,1.0E+000,1.0E+000]
8 M60 Messier 60 is an elliptical galaxy in the Virgo ... [0,0,0,0,1.0E+000]
9 NGC1073 NGC 1073 is a barred spiral galaxy in Cetus ... [0,1.0E+000,1.0E+000,0,0]
9 rows selected.
SQL>
Here is the resulting log file for this load (galaxies_vec.log
):
cat galaxies_vec.log
SQL*Loader: Release 23.0.0.0.0 - Development on Thu Jan 11 19:46:21 2024
Version 23.4.0.23.00
Copyright (c) 1982, 2024, Oracle and/or its affiliates. All rights reserved.
Control File: galaxies_vec.ctl
Data File: galaxies_vec.csv
Bad File: galaxies_vec.bad
Discard File: none specified
(Allow all discards)
Number to load: ALL
Number to skip: 0
Errors allowed: 50
Bind array: 250 rows, maximum of 1048576 bytes
Continuation: none specified
Path used: Conventional
Table GALAXIES, loaded from every logical record.
Insert option in effect for this table: INSERT
TRAILING NULLCOLS option in effect
Column Name Position Len Term Encl Datatype
----------- ---------- ----- ---- ---- ----------
ID FIRST * : CHARACTER
NAME NEXT 50 : CHARACTER
DOC NEXT 500 : CHARACTER
EMBEDDING NEXT 32000 : CHARACTER
value used for ROWS parameter changed from 250 to 31
Table GALAXIES2:
9 Rows successfully loaded.
0 Rows not loaded due to data errors.
0 Rows not loaded because all WHEN clauses were failed.
Space allocated for bind array: 1017234 bytes(31 rows)
Read buffer bytes: 1048576
Total logical records skipped: 0
Total logical records read: 9
Total logical records rejected: 0
Total logical records discarded: 1
Run began on Thu Jan 11 19:46:21 2024
Run ended on Thu Jan 11 19:46:24 2024
Elapsed time was: 00:00:02.43
CPU time was: 00:00:00.03
$
Note:
This example usesembedding char (32000)
vectors. For very large vectors,
you can use the LOBFILE
feature
Related Topics
Parent topic: Load Vector Data Using SQL*Loader