MySQL Shell 8.0 (part of MySQL 8.0)
MySQL Shell's parallel table import utility util.importTable(), introduced in MySQL Shell 8.0.17, provides rapid data import to a MySQL relational table for large data files. The utility analyzes an input data file, distributes it into chunks, and uploads the chunks to the target MySQL server using parallel connections. The utility can complete a large data import many times faster than a standard single-threaded upload using a LOAD DATA statement.
When you run the parallel table import utility, you specify the mapping between the fields in the data file or files and the columns in the MySQL table. You can set field- and line-handling options, as for the LOAD DATA statement, to handle data files in arbitrary formats. For multiple files, all the files must be in the same format. The default dialect for the utility maps to a file created using a SELECT...INTO OUTFILE statement with the default settings for that statement. The utility also has preset dialects that map to the standard data formats for CSV files (created on DOS or UNIX systems), TSV files, and JSON, and you can customize these using the field- and line-handling options as necessary. Note that JSON data must be in document-per-line format.
A number of features have been added to the parallel table import utility since it was introduced, so use the most recent version of MySQL Shell to get the utility's full functionality.
From MySQL Shell 8.0.22, the parallel table import utility can capture columns from the data file or files for input preprocessing, in the same way as with a LOAD DATA statement. The selected data can be discarded, or you can transform the data and assign it to a column in the target table.
Up to MySQL Shell 8.0.20, the data must be imported from a location that is accessible to the client host as a local disk. From MySQL Shell 8.0.21, the data can also be imported from an Oracle Cloud Infrastructure Object Storage bucket, specified by the osBucketName option.
Up to MySQL Shell 8.0.22, the parallel table import utility can import a single input data file to a single relational table. From MySQL Shell 8.0.23, the utility is also capable of importing a specified list of files, and it supports wildcard pattern matching to include all relevant files from a location. Multiple files uploaded by a single run of the utility are placed into a single relational table, so for example, data that has been exported from multiple hosts could be merged into a single table to be used for analytics.
Up to MySQL Shell 8.0.21, the parallel table import utility only accepts an uncompressed input data file. The utility analyzes the data file, distributes it into chunks, and uploads the chunks to the relational table in the target MySQL server, dividing the chunks up between the parallel connections. From MySQL Shell 8.0.22, the utility can also accept data files compressed in the gzip (.gz) and zstd (.zst) formats, detecting the format automatically based on the file extension. The utility uploads a compressed file from storage in the compressed format, saving bandwidth for that part of the transfer. Compressed files cannot be distributed into chunks, so instead the utility uses its parallel connections to decompress and upload multiple files simultaneously to the target server. If there is only one input data file, the upload of a compressed file can only use a single connection.
MySQL Shell's parallel table import utility supports the output from MySQL Shell's table export utility, which can compress the data file it produces as output, and can export it to a local folder or an Object Storage bucket. The default dialect for the parallel table import utility is the default for the output file produced by the table export utility. The parallel table import utility can also be used to upload files from other sources.
MySQL Shell's dump loading utility util.loadDump() is designed to import the combination of chunked output files and metadata produced by MySQL Shell's instance dump utility util.dumpInstance(), schema dump utility util.dumpSchemas(), and table dump utility util.dumpTables(). The parallel table import utility can be used in combination with the dump loading utility if you want to modify any of the data in the chunked output files before uploading it to the target server. To do this, first use the dump loading utility to load only the DDL for the selected table, to create the table on the target server. Then use the parallel table import utility to capture and transform data from the output files for the table, and import it to the target table. Repeat that process as necessary for any other tables where you want to modify the data. Finally, use the dump loading utility to load the DDL and data for any remaining tables that you do not want to modify, excluding the tables that you did modify. For a description of the procedure, see Modifying Dumped Data.
The parallel table import utility requires an existing classic MySQL protocol connection to the target MySQL server. Each thread opens its own session to send chunks of the data to the MySQL server, or in the case of compressed files, to send multiple files in parallel. You can adjust the number of threads, number of bytes sent in each chunk, and maximum rate of data transfer per thread, to balance the load on the network and the speed of data transfer. The utility cannot operate over X Protocol connections, which do not support LOAD DATA statements.
The data file or files to be imported must be in one of the following locations:
A location that is accessible to the client host as a local disk.
A remote location that is accessible to the client host through HTTP or HTTPS, specified with a URL. Pattern matching is not supported for files accessed in this way.
An Oracle Cloud Infrastructure Object Storage bucket (from MySQL Shell 8.0.21).
The data is imported to a single relational table in the MySQL server to which the active MySQL session is connected.
The parallel table import utility uses LOAD DATA LOCAL INFILE statements to upload data, so the local_infile system variable must be set to ON on the target server. You can do this by issuing the following statement in SQL mode before running the parallel table import utility:
SET GLOBAL local_infile = 1;
To avoid a known potential security issue with LOAD DATA LOCAL, when the MySQL server replies to the parallel table import utility's LOAD DATA requests with file transfer requests, the utility only sends the predetermined data chunks, and ignores any specific requests attempted by the server. For more information, see Security Considerations for LOAD DATA LOCAL.
In the MySQL Shell API, the parallel table import utility is a function of the util global object, and has the following signature:

importTable ({file_name | file_list}, options)

file_name is a string specifying the name and path for a single file containing the data to be imported. Alternatively, file_list is an array of file paths specifying multiple data files. On Windows, backslashes must be escaped in file paths, or you can use forward slashes instead.
For files that are accessible to the client host on a local disk, you can prefix the directory path with the file:// schema, or allow it to default to that. For files accessed in this way, file paths can contain the wildcards * (multiple characters) and ? (single character) for pattern matching.
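As an illustration of these wildcard semantics, Python's fnmatch module applies the same conventions (* for any run of characters, ? for exactly one); this is a stand-in for the concept, not MySQL Shell's own matching code:

```python
from fnmatch import fnmatch

# * matches any run of characters; ? matches exactly one character.
print(fnmatch("data_b1.csv", "data_b*"))     # True
print(fnmatch("data_1.csv", "data_?.csv"))   # True
print(fnmatch("data_12.csv", "data_?.csv"))  # False: ? cannot match two characters
```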
For files that are accessible to the client host through HTTP or HTTPS, provide a URL or a list of URLs, prefixed with the http:// or https:// schema as appropriate, in the format http[s]://host.domain[:port]/path. For files accessed in this way, pattern matching is not available. The HTTP server must support the Range request header, and must return the Content-Range response header to the client.
For files in an Oracle Cloud Infrastructure Object Storage bucket, specify a path to the file in the bucket, and use the osBucketName option to specify the bucket name.
options is a dictionary of import options that can be omitted if it is empty. The options are listed after the examples.
The function returns void, or throws an exception in the event of an error. If the import is stopped partway, whether by the user with Ctrl+C or by an error, the utility stops sending data. When the server finishes processing the data it has already received, messages are returned showing the chunk that was being imported by each thread at the time, the percentage complete, and the number of records that were updated in the target table.
The following examples, the first in MySQL Shell's JavaScript mode and the second in MySQL Shell's Python mode, import the data in a single CSV file /tmp/productrange.csv to the products table in the mydb database, skipping a header row in the file:
mysql-js> util.importTable("/tmp/productrange.csv", {schema: "mydb", table: "products", dialect: "csv-unix", skipRows: 1, showProgress: true})
mysql-py> util.import_table("/tmp/productrange.csv", {"schema": "mydb", "table": "products", "dialect": "csv-unix", "skipRows": 1, "showProgress": True})
The following example in MySQL Shell's Python mode only specifies the dialect for the CSV file. mydb is the active schema for the MySQL Shell session. The utility therefore imports the data in the file /tmp/productrange.csv to the productrange table in the mydb database:
mysql-py> \use mydb
mysql-py> util.import_table("/tmp/productrange.csv", {"dialect": "csv-unix"})
The following example in MySQL Shell's Python mode imports the data from multiple files, including a mix of individually named files, ranges of files specified using wildcard pattern matching, and compressed files:
mysql-py> util.import_table(
[
"data_a.csv",
"data_b*",
"data_c*",
"data_d.tsv.zst",
"data_e.tsv.zst",
"data_f.tsv.gz",
"/backup/replica3/2021_01_12/data_g.tsv",
"/backup/replica3/2021_01_13/*.tsv",
],
{"schema": "mydb", "table": "productrange"}
)
The parallel table import utility can also be invoked from the command line using the mysqlsh command interface. With this interface, you invoke the utility as in the following examples:
mysqlsh mysql://root:@127.0.0.1:3366 --ssl-mode=DISABLED -- util import-table /r/mytable.dump --schema=mydb --table=regions --bytes-per-chunk=10M --linesTerminatedBy=$'\r\n'
When you import multiple data files, ranges of files specified using wildcard pattern matching are expanded by MySQL Shell's glob pattern matching logic if they are quoted, as in the following example. Otherwise they are expanded by the pattern matching logic for the user shell where you entered the mysqlsh command.
mysqlsh mysql://root:@127.0.0.1:3366 -- util import-table data_a.csv "data_b*" data_d.tsv.zst --schema=mydb --table=productrange --osBucketName=mybucket
When you use the mysqlsh command interface to invoke the parallel table import utility, the columns option is not supported because array values are not accepted, so the input lines in your data file must contain a matching field for every column in the target table.
Also note that as shown in the above example, line feed characters must be passed using ANSI-C quoting in shells that support this function (such as bash, ksh, mksh, and zsh). For information on this interface, see Section 5.8, “API Command Line Interface”.
The following import options are available for the parallel table import utility to specify where and how the data is imported:
schema: "db_name"
The name of the target database on the connected MySQL
server. If you omit this option, the utility attempts to
identify and use the schema name in use for the current
MySQL Shell session, as specified in a connection URI
string, \use
command, or MySQL Shell
option. If the schema name is not specified and cannot be
identified from the session, an error is returned.
table: "table_name"
The name of the target relational table. If you omit this option, the utility assumes the table name is the name of the data file without the extension. The target table must exist in the target database.
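The default table name (the data file's name without its extension) can be sketched in Python; this mirrors the stated rule, not MySQL Shell's actual implementation:

```python
from pathlib import Path

# Default target table name: the file name with its last extension removed.
# (A sketch of the documented rule, not MySQL Shell's code.)
print(Path("/tmp/productrange.csv").stem)  # productrange
```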
columns: array of column names
An array of strings containing column names from the import file or files, given in the order that they map to columns in the target relational table. Use this option if the imported data does not contain all the columns of the target table, or if the order of the fields in the imported data differs from the order of the columns in the table. If you omit this option, input lines are expected to contain a matching field for each column in the target table.
From MySQL Shell 8.0.22, you can use this option to capture columns from the import file or files for input preprocessing, in the same way as with a LOAD DATA statement. When you use an integer value in place of a column name in the array, that column in the import file or files is captured as a user variable @int, for example @1. The selected data can be discarded, or you can use the decodeColumns option to transform the data and assign it to a column in the target table.
In this example in MySQL Shell's JavaScript mode, the second and fourth columns from the import file are assigned to the user variables @1 and @2, and no decodeColumns option is present to assign them to any column in the target table, so they are discarded.
mysql-js> util.importTable('file.txt', {
table: 't1',
columns: ['column1', 1, 'column2', 2, 'column3']
});
decodeColumns: dictionary
A dictionary of key-value pairs that assigns import file columns captured as user variables by the columns option to columns in the target table, and specifies preprocessing transformations for them in the same way as the SET clause of a LOAD DATA statement.
This option is available from MySQL Shell 8.0.22.
In this example in MySQL Shell's JavaScript mode, the first input column from the data file is used as the first column in the target table. The second input column, which has been assigned to the variable @1 by the columns option, is subjected to a division operation before being used as the value of the second column in the target table.
mysql-js> util.importTable('file.txt', {
columns: ['column1', 1],
decodeColumns: {'column2': '@1 / 100'}
});
In this example in MySQL Shell's JavaScript mode, the input columns from the data file are both assigned to variables, then transformed in various ways and used to populate the columns of the target table:
mysql-js> util.importTable('file.txt', {
table: 't1',
columns: [1, 2],
decodeColumns: {
'a': '@1',
'b': '@2',
'sum': '@1 + @2',
'multiple': '@1 * @2',
'power': 'POW(@1, @2)'
}
});
skipRows: number
Skip this number of rows of data at the beginning of the import file, or in the case of multiple import files, at the beginning of every file included in the file list. You can use this option to omit an initial header line containing column names from the upload to the table. The default is that no rows are skipped.
replaceDuplicates: [true|false]
Whether input rows that have the same value for a primary key or unique index as an existing row should be replaced (true) or skipped (false). The default is false.
dialect: [default|csv|csv-unix|tsv|json]
Use a set of field- and line-handling options appropriate for the specified file format. You can use the selected dialect as a base for further customization, by also specifying one or more of the linesTerminatedBy, fieldsTerminatedBy, fieldsEnclosedBy, fieldsOptionallyEnclosed, and fieldsEscapedBy options to change the settings. The default dialect maps to a file created using a SELECT...INTO OUTFILE statement with the default settings for that statement. This is the default for the output file produced by MySQL Shell's table export utility. Other dialects are available to suit CSV files (created on either DOS or UNIX systems), TSV files, and JSON data. The settings applied for each dialect are as follows:
Table 8.2 Dialect settings for parallel table import utility

| dialect | linesTerminatedBy | fieldsTerminatedBy | fieldsEnclosedBy | fieldsOptionallyEnclosed | fieldsEscapedBy |
|---|---|---|---|---|---|
| default | [LF] | [TAB] | [empty] | false | \ |
| csv | [CR][LF] | , | " | true | \ |
| csv-unix | [LF] | , | " | false | \ |
| tsv | [CR][LF] | [TAB] | " | true | \ |
| json | [LF] | [LF] | [empty] | false | [empty] |
The carriage return and line feed values for the dialects are operating system independent.
If you use the linesTerminatedBy, fieldsTerminatedBy, fieldsEnclosedBy, fieldsOptionallyEnclosed, and fieldsEscapedBy options, depending on the escaping conventions of your command interpreter, the backslash character (\) might need to be doubled if you use it in the option values.
Like the MySQL server with the
LOAD DATA
statement,
MySQL Shell does not validate the field- and
line-handling options that you specify. Inaccurate
selections for these options can cause data to be
imported into the wrong fields, partially, and/or
incorrectly. Always verify your settings before
starting the import, and verify the results
afterwards.
linesTerminatedBy: "characters"

One or more characters (or an empty string) that terminates each of the lines in the input data file or files. The default is as for the specified dialect, or a linefeed character (\n) if the dialect option is omitted. This option is equivalent to the LINES TERMINATED BY option for the LOAD DATA statement. Note that the utility does not provide an equivalent for the LINES STARTING BY option for the LOAD DATA statement, which is set to the empty string.
fieldsTerminatedBy: "characters"

One or more characters (or an empty string) that terminates each of the fields in the input data file or files. The default is as for the specified dialect, or a tab character (\t) if the dialect option is omitted. This option is equivalent to the FIELDS TERMINATED BY option for the LOAD DATA statement.
fieldsEnclosedBy: "character"

A single character (or an empty string) that encloses each of the fields in the input data file or files. The default is as for the specified dialect, or the empty string if the dialect option is omitted. This option is equivalent to the FIELDS ENCLOSED BY option for the LOAD DATA statement.
fieldsOptionallyEnclosed: [ true | false ]

Whether the character given for fieldsEnclosedBy encloses all of the fields in the input data file or files (false), or encloses the fields only in some cases (true). The default is as for the specified dialect, or false if the dialect option is omitted. This option makes the fieldsEnclosedBy option equivalent to the FIELDS OPTIONALLY ENCLOSED BY option for the LOAD DATA statement.
fieldsEscapedBy: "character"

The character that begins escape sequences in the input data file or files. If this is not provided, escape sequence interpretation does not occur. The default is as for the specified dialect, or a backslash (\) if the dialect option is omitted. This option is equivalent to the FIELDS ESCAPED BY option for the LOAD DATA statement.
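As a rough analogy for how these field-handling options interact, Python's csv module exposes similar parameters (delimiter for fieldsTerminatedBy, quotechar for fieldsEnclosedBy, escapechar for fieldsEscapedBy). This is an illustration of the concepts only, not MySQL Shell's parser:

```python
import csv
import io

# delimiter ~ fieldsTerminatedBy, quotechar ~ fieldsEnclosedBy,
# escapechar ~ fieldsEscapedBy (a loose analogy, not MySQL Shell's parser).
data = io.StringIO('1,"Widget, large",9.99\n')
reader = csv.reader(data, delimiter=",", quotechar='"', escapechar="\\")
print(next(reader))  # ['1', 'Widget, large', '9.99']
```

Note how the quoted field keeps its embedded comma, just as an enclosed field does under the csv dialects.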
osBucketName: "string"

Added in MySQL Shell 8.0.21. The name of the Oracle Cloud Infrastructure Object Storage bucket where the input data file is located. By default, the [DEFAULT] profile in the Oracle Cloud Infrastructure CLI configuration file located at ~/.oci/config is used to establish a connection to the bucket. You can substitute an alternative profile to be used for the connection with the ociConfigFile and ociProfile options. For instructions to set up a CLI configuration file, see SDK and CLI Configuration File.
osNamespace: "string"
Added in MySQL Shell 8.0.21. The Oracle Cloud
Infrastructure namespace where the Object Storage bucket
named by osBucketName
is located. The
namespace for an Object Storage bucket is displayed in the
Bucket Information tab of the bucket details page in the
Oracle Cloud Infrastructure console, or can be obtained
using the Oracle Cloud Infrastructure command line
interface.
ociConfigFile: "string"

Added in MySQL Shell 8.0.21. An Oracle Cloud Infrastructure CLI configuration file that contains the profile to use for the connection, instead of the one in the default location ~/.oci/config.
ociProfile: "string"

Added in MySQL Shell 8.0.21. The profile name of the Oracle Cloud Infrastructure profile to use for the connection, instead of the [DEFAULT] profile in the Oracle Cloud Infrastructure CLI configuration file used for the connection.
characterSet: "charset"

Added in MySQL Shell 8.0.21. This option specifies a character set encoding with which the input data is interpreted during the import. Setting the option to binary means that no conversion is done during the import. When you omit this option, the import uses the character set specified by the character_set_database system variable to interpret the input data.
bytesPerChunk: "size"

For a list of multiple input data files, this option is not available. For a single input data file, this option specifies the number of bytes (plus any additional bytes required to reach the end of the row) that threads send for each LOAD DATA call to the target server. The utility distributes the data into chunks of this size for threads to pick up and send to the target server. The chunk size can be specified as a number of bytes, or using the suffixes k (kilobytes), M (megabytes), or G (gigabytes). For example, bytesPerChunk="2k" makes threads send chunks of approximately 2 kilobytes. The minimum chunk size is 131072 bytes, and the default chunk size is 50M.
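The suffix rules can be sketched in Python. This is an illustration of the documented behavior, not MySQL Shell's parsing code, and the clamping of a below-minimum request to the 131072-byte floor is an assumption here:

```python
MIN_CHUNK = 131072  # documented minimum chunk size in bytes

def parse_chunk_size(value: str) -> int:
    """Interpret a bytesPerChunk value with an optional k/M/G suffix
    (a sketch of the documented rules, not MySQL Shell's code)."""
    suffixes = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if value and value[-1] in suffixes:
        size = int(value[:-1]) * suffixes[value[-1]]
    else:
        size = int(value)
    # Assumption: requests below the minimum are raised to the floor.
    return max(size, MIN_CHUNK)

print(parse_chunk_size("2k"))   # 131072 (2048 is below the minimum)
print(parse_chunk_size("50M"))  # 52428800 (the default chunk size)
```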
threads: number
The maximum number of parallel threads to use to send the data in the input file or files to the target server. If you do not specify a number of threads, the default maximum is 8. For a list of multiple input data files, the utility creates the specified or maximum number of threads. For a single input data file, the utility calculates an appropriate number of threads to create up to this maximum, using the following formula:
min{max{1, threads}, chunks}
where threads is the maximum number of threads, and chunks is the number of chunks that the data will be split into, which is calculated by dividing the file size by the bytesPerChunk size and then adding 1. The calculation ensures that if the maximum number of threads exceeds the number of chunks that will actually be sent, the utility does not create more threads than necessary.
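The formula can be sketched in Python as an illustration of the documented calculation (not MySQL Shell's source code):

```python
def effective_threads(threads: int, file_size: int, bytes_per_chunk: int) -> int:
    """Number of threads actually created for a single input data file,
    per the documented formula min{max{1, threads}, chunks}."""
    # chunks = file size divided by bytesPerChunk, plus 1
    chunks = file_size // bytes_per_chunk + 1
    return min(max(1, threads), chunks)

# A 120M file with the default 50M chunk size splits into 3 chunks,
# so only 3 of the default 8 threads would be used.
print(effective_threads(8, 120 * 1024 ** 2, 50 * 1024 ** 2))  # 3
```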
Compressed files cannot be distributed into chunks, so instead the utility uses its parallel connections to upload multiple files at a time. If there is only one input data file, the upload of a compressed file can only use a single connection.
maxRate: "rate"
The maximum limit on data throughput in bytes per second
per thread. Use this option if you need to avoid
saturating the network or the I/O or CPU for the client
host or target server. The maximum rate can be specified
as a number of bytes, or using the suffixes k (kilobytes),
M (megabytes), G (gigabytes). For example,
maxRate="5M"
limits each thread to 5MB
of data per second, which for eight threads gives a
transfer rate of 40MB/second. The default is 0, meaning
that there is no limit.
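The arithmetic in the example above (a per-thread cap multiplied by the thread count) can be written out as:

```python
# maxRate="5M" caps each thread at 5 megabytes per second; the aggregate
# cap is that rate times the number of threads (here the default 8).
max_rate_per_thread = 5 * 1024 ** 2
threads = 8
print(max_rate_per_thread * threads)  # 41943040 bytes/second, i.e. 40MB/second
```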
showProgress: [ true | false ]

Display (true) or hide (false) progress information for the import. The default is true if stdout is a terminal (tty), and false otherwise.