Load Binary Vector Data Using SQL*Loader Example

In this example, you can see how to use SQL*Loader to load binary vector data files.

The vectors in a binary (fvec) file are stored in raw 32-bit Little Endian format.

Each vector takes 4+d*4 bytes for the .fvecs file where the first 4 bytes indicate the dimensionality (d) of the vector (that is, the number of dimensions in the vector) followed by d*4 bytes representing the vector data, as described in the following table:

Table 5-2 Fields for Vector Dimensions and Components

Field Field Type Description

d

int The vector dimension

components

array of floats

The vector components

For binary fvec files, they must be defined as follows:

  • You must specify LOBFILE.
  • You must specify the syntax format fvecs to indicate that the dafafile contains binary dimensions.
  • You must specify that the datafile contains raw binary data (raw).

The following is an example of a control file used to load VECTOR columns from binary floating point arrays using the galaxies vector example described in Understand Hierarchical Navigable Small World Indexes, but in this case importing fvecs data, using the control file syntax format "fvecs":

Note:

SQL*Loader supports loading VECTOR columns from character data and binary floating point array fvec files. The format for fvec files is that each binary 32-bit floating point array is preceded by a four (4) byte value, which is the number of elements in the vector. There can be multiple vectors  in the file, possibly with different dimensions.

LOAD DATA
infile 'galaxies_vec.csv'
INTO TABLE galaxies
fields terminated by ':'
trailing nullcols
(
id,
name char (50),
doc  char (500),
embedding lobfile (constant '/u01/data/vector/embedding.fvecs' format "fvecs")  raw
)

The data contained in galaxies_vec.csv in this case does not have the vector data. Instead, the vector data will be read from the secondary LOBFILE in the /u01/data/vector directory (/u01/data/vector/embedding.fvecs), which contains the same information in float32 floating point binary numbers, but is in fvecs format:

1:M31:Messier 31 is a barred spiral galaxy in the Andromeda constellation which has a lot of barred spiral galaxies.:
2:M33:Messier 33 is a spiral galaxy in the Triangulum constellation.:
3:M58:Messier 58 is an intermediate barred spiral galaxy in the Virgo constellation.:
4:M63:Messier 63 is a spiral galaxy in the Canes Venatici constellation.:
5:M77:Messier 77 is a barred spiral galaxy in the Cetus constellation.:
6:M91:Messier 91 is a barred spiral galaxy in the Coma Berenices constellation.:
7:M49:Messier 49 is a giant elliptical galaxy in the Virgo constellation.:
8:M60:Messier 60 is an elliptical galaxy in the Virgo constellation.:
9:NGC1073:NGC 1073 is a barred spiral galaxy in Cetus constellation.: