Load Binary Vector Data Using SQL*Loader Example
In this example, you can see how to use SQL*Loader to load binary vector data files.
The vectors in a binary fvecs (.fvecs) file are stored in raw 32-bit
little-endian format.
Each vector takes 4 + d*4 bytes: the first 4 bytes give the dimensionality d
(that is, the number of dimensions in the vector), and the following d*4 bytes
hold the vector data, as described in the following table:
Table 5-2 Fields for Vector Dimensions and Components
| Field | Field Type | Description |
|---|---|---|
| d | int | The vector dimension |
| components | array of floats | The vector components |
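The layout above can be illustrated with a short Python sketch. It is not part of SQL*Loader; the function name `read_fvecs` and the in-memory sample are hypothetical, and the sketch assumes the dimension is a signed 32-bit little-endian integer followed by float32 components, as the table describes.

```python
import struct

def read_fvecs(data: bytes) -> list[list[float]]:
    """Parse the raw bytes of an .fvecs file into a list of vectors."""
    vectors = []
    offset = 0
    while offset < len(data):
        # First 4 bytes: little-endian 32-bit int giving the dimension d.
        (d,) = struct.unpack_from("<i", data, offset)
        offset += 4
        # Next d*4 bytes: d little-endian float32 components.
        components = struct.unpack_from(f"<{d}f", data, offset)
        offset += 4 * d
        vectors.append(list(components))
    return vectors

# Hypothetical sample: two vectors of dimension 3 and 2, built in memory.
sample = struct.pack("<i3f", 3, 1.0, 2.0, 3.0) + struct.pack("<i2f", 2, 0.5, -0.5)
vectors = read_fvecs(sample)
```

Each record occupies 4 + d*4 bytes, so the sample above is (4 + 12) + (4 + 8) = 28 bytes long.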
Binary fvecs files must be defined in the control file as follows:
- You must specify LOBFILE.
- You must specify the syntax format fvecs to indicate that the datafile contains binary dimensions.
- You must specify that the datafile contains raw binary data (raw).
Note:
SQL*Loader supports loading VECTOR columns from character
data and from binary floating point array fvecs files. In an
fvecs file, each binary 32-bit floating point array is preceded
by a four (4) byte value, which is the number of elements in the vector. There can be
multiple vectors in the file, possibly with different dimensions.
The following is an example of a control file used to load
VECTOR columns from binary floating point arrays. It uses the galaxies
vector example described in Understand Hierarchical Navigable Small World Indexes,
but in this case imports fvecs data using the control file syntax
format "fvecs":
LOAD DATA
infile 'galaxies_vec.csv'
INTO TABLE galaxies
fields terminated by ':'
trailing nullcols
(
id,
name char (50),
doc char (500),
embedding lobfile (constant '/u01/data/vector/embedding.fvecs' format "fvecs") raw
)
In this case, galaxies_vec.csv does not contain the vector data.
Instead, the vector data is read from the secondary LOBFILE
/u01/data/vector/embedding.fvecs, which contains the same vector
information as float32 binary numbers in fvecs format.
The contents of galaxies_vec.csv are:
1:M31:Messier 31 is a barred spiral galaxy in the Andromeda constellation which has a lot of barred spiral galaxies.:
2:M33:Messier 33 is a spiral galaxy in the Triangulum constellation.:
3:M58:Messier 58 is an intermediate barred spiral galaxy in the Virgo constellation.:
4:M63:Messier 63 is a spiral galaxy in the Canes Venatici constellation.:
5:M77:Messier 77 is a barred spiral galaxy in the Cetus constellation.:
6:M91:Messier 91 is a barred spiral galaxy in the Coma Berenices constellation.:
7:M49:Messier 49 is a giant elliptical galaxy in the Virgo constellation.:
8:M60:Messier 60 is an elliptical galaxy in the Virgo constellation.:
9:NGC1073:NGC 1073 is a barred spiral galaxy in Cetus constellation.:
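To experiment with this example, a matching fvecs file is needed. The following Python sketch shows one way such a file could be produced. The helper names `encode_fvecs` and `write_fvecs` are hypothetical, the vector values are placeholders rather than real embeddings, and the sketch assumes that the file holds one vector per row of galaxies_vec.csv, in the same order.

```python
import struct

def encode_fvecs(vectors):
    """Encode vectors as fvecs bytes: a 4-byte dimension, then float32 components."""
    chunks = []
    for vec in vectors:
        # Little-endian 32-bit dimension, then len(vec) little-endian float32 values.
        chunks.append(struct.pack("<i", len(vec)))
        chunks.append(struct.pack(f"<{len(vec)}f", *vec))
    return b"".join(chunks)

def write_fvecs(path, vectors):
    """Write the encoded vectors to a binary file."""
    with open(path, "wb") as f:
        f.write(encode_fvecs(vectors))

# Placeholder data only: nine 5-dimensional vectors, one per CSV row (assumed).
placeholder_vectors = [[float(i)] * 5 for i in range(1, 10)]
payload = encode_fvecs(placeholder_vectors)

# Example call (path taken from the control file above):
# write_fvecs("/u01/data/vector/embedding.fvecs", placeholder_vectors)
```

With nine vectors of dimension 5, the payload is 9 * (4 + 5*4) = 216 bytes.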