MySQL AI User Guide
MySQL includes a bulk load extension to the
LOAD DATA
statement. It can do the
following:
Optimize the loading of data sorted by primary key.
Optimize the loading of unsorted data.
Optimize the loading of data from an object store.
Optimize the loading of data from a series of files.
Load a MySQL Shell dump file.
Load ZSTD compressed CSV files.
Monitor bulk load progress with the Performance Schema.
Large data support.
Use a second session to monitor bulk load progress:
If the data is sorted, there is a single stage:
loading
.
If the data is unsorted, there are two stages:
sorting
and loading
.
LOAD DATA
with
ALGORITHM=BULK
supports tables with at least
one column with the VECTOR
data
type. If you attempt to load a table without at least one column
with the VECTOR
data type, an
error occurs.
In addition to the requirement to have at least one
VECTOR
column,
LOAD DATA
with
ALGORITHM=BULK
supports the following data
types:
mysql>LOAD DATA [LOW_PRIORITY | CONCURRENT] [FROM] INFILE | URL | S3 '
file_prefix
' | 'options
' [COUNTN
] [IN PRIMARY KEY ORDER] INTO TABLEtbl_name
[CHARACTER SETcharset_name
] [COMPRESSION = {'ZSTD'}] [{FIELDS | COLUMNS} [TERMINATED BY 'string
'] [[OPTIONALLY] ENCLOSED BY 'char
'] [ESCAPED BY 'char
'] ] [LINES [TERMINATED BY 'string
'] ] [IGNOREnumber
{LINES | ROWS}] [PARALLEL =number
] [MEMORY =M
] [ALGORITHM = BULK]options
: { JSON_OBJECT("key
","value
"[,"key
","value
"] ...)"key","value"
: { "url-prefix","prefix
" ["url-sequence-start",0
] ["url-suffix","suffix
"] ["url-prefix-last-append","@
"] ["is-dryrun",{true|false}] } }
The additional LOAD DATA
clauses
are:
FROM
: Makes the statement more readable.
URL
: A URL accessible with a HTTP GET
request.
S3
: The AWS S3 file location.
This requires the user privilege
LOAD_FROM_S3
.
COUNT
: The number of files in a series of
files.
For COUNT 5
and
file_prefix
set to
data.csv.
, the five files would be:
data.csv.1
,
data.csv.2
,
data.csv.3
,
data.csv.4
, and
data.csv.5
.
IN PRIMARY KEY ORDER
: Use when the data
is already sorted. The values should be in ascending order
within the file.
For a file series, the primary keys in each file must be disjoint and in ascending order from one file to the next.
PARALLEL
: The number of concurrent
threads to use. A typical value might be 16, 32 or 48. The
default value is 16.
PARALLEL
does not require
CONCURRENT
.
MEMORY
: The amount of memory to use. A
typical value might be 512M or 4G. The default value is 1G.
ALGORITHM
: Set to BULK
for bulk load. The file format is CSV.
COMPRESSION
: The file compression
algorithm. Bulk load supports the ZSTD algorithm.
options
is a JSON object literal that
includes:
url-prefix
: The common URL prefix for
the files to load.
url-sequence-start
: The sequence
number for the first file.
The default value is 1, and the minimum value is 0. The value cannot be a negative number. The value can be a string or a number, for example, "134", or "default".
url-suffix
: The file suffix.
url-prefix-last-append
: The string to
append to the prefix of the last file.
This supports MySQL Shell dump files.
is-dryrun
: Set to
true
to run basic checks and report
if bulk load is possible on the given table. The default
value is false
.
To enable is-dryrun
, use any of the
following values: true
,
"true"
, "1"
,
"on"
or 1
.
To disable is-dryrun
, use any of the
following values: false
,
"false"
, "0"
,
"off"
or 0
.
LOAD DATA
with
ALGORITHM=BULK
does not support these
clauses:
LOAD DATA [LOCAL] [REPLACE | IGNORE] [PARTITION (partition_name
[,partition_name
] ...)] ] [LINES [STARTING BY 'string
'] ] [(col_name_or_user_var
[,col_name_or_user_var
] ...)] [SETcol_name
={expr
| DEFAULT} [,col_name
={expr
| DEFAULT}] ...]
An example that loads unsorted data from AWS S3 with 48 concurrent threads and 4G of memory:
mysql>GRANT LOAD_FROM_S3 ON *.* TO load_user@localhost;
mysql>LOAD DATA FROM S3 's3-us-east-1://innodb-bulkload-dev-1/lineitem.tbl' INTO TABLE lineitem FIELDS TERMINATED BY "|" OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' PARALLEL = 48 MEMORY = 4G ALGORITHM=BULK;
An example that loads eight files of sorted data from AWS
S3. The file_prefix
ends with a period.
The files are lineitem.tbl.1
,
lineitem.tbl.2
, ...
lineitem.tbl.8
:
mysql>GRANT LOAD_FROM_S3 ON *.* TO load_user@localhost;
mysql>LOAD DATA FROM S3 's3-us-east-1://innodb-bulkload-dev-1/lineitem.tbl.' COUNT 8 IN PRIMARY KEY ORDER INTO TABLE lineitem FIELDS TERMINATED BY "|" OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' ALGORITHM=BULK;
An example that performs a dry run on a sequence of MySQL Shell dump files compressed with the ZSTD algorithm:
mysql>GRANT LOAD_FROM_URL ON *.* TO load_user@localhost;
mysql>LOAD DATA FROM URL '{"url-prefix","https://example.com/bucket/test@lineitem@","url-sequence-start",0,"url-suffix",".tsv.zst","url-prefix-last-append","@","is-dryrun",true}' COUNT 20 INTO TABLE lineitem CHARACTER SET ???? COMPRESSION = {'ZSTD'} FIELDS TERMINATED BY "|" OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 20000 LINES ALGORITHM=BULK;
An example that loads data with the URI keyword (supported as of MySQL 9.4.0):
mysql>GRANT LOAD_FROM_URL ON *.* TO load_user@localhost;
mysql>LOAD DATA FROM URI 'https://data_files.com/data_files_1.tbl' INTO TABLE lineitem FIELDS TERMINATED BY "|" OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' ALGORITHM=BULK;
An example that monitors bulk load progress in a second session.
Review the list of stages with the following query:
mysql> SELECT NAME, ENABLED, TIMED FROM performance_schema.setup_instruments
WHERE ENABLED='YES' AND NAME LIKE "stage/bulk_load%";
Enable the events_stages_current
with
the following query:
mysql> UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES' WHERE NAME LIKE 'events_stages_current';
Use one session to run bulk load, and monitor progress in a second session:
mysql> SELECT thread_id, event_id, event_name, WORK_ESTIMATED, WORK_COMPLETED
FROM performance_schema.events_stages_current;
--------------
SELECT thread_id, event_id, event_name, WORK_ESTIMATED, WORK_COMPLETED FROM performance_schema.events_stages_current
--------------
+-----------+----------+----------------------------------+----------------+----------------+
| thread_id | event_id | event_name | WORK_ESTIMATED | WORK_COMPLETED |
+-----------+----------+----------------------------------+----------------+----------------+
| 49 | 5 | stage/bulk_load_unsorted/sorting | 1207551343 | 583008145 |
+-----------+----------+----------------------------------+----------------+----------------+
1 row in set (0.00 sec)