MySQL AI User Guide
You can automatically create training and testing datasets
with the
TRAIN_TEST_SPLIT
routine.
Review the Requirements.
Get the Required Privileges to use AutoML.
Review the Data Types Supported For Machine Learning Tasks.
The
TRAIN_TEST_SPLIT
routine takes your datasets and prepares new tables for
training and testing machine learning models. Two new tables
in the same database are created with the following names:
[original_table_name]_train
[original_table_name]_test
The split of the data between training and testing datasets depends on the machine learning task.
Classification: A stratified split of data. For each class in the dataset, 80% of the samples go into the training dataset, and the remaining go into the testing dataset. If the number of samples in the 80% subset is fewer than five, then five samples are inserted into the training dataset.
Regression: A random split of data.
Forecasting: A time-based split of data. The data is
inserted in order according to
datetime_index values. The first 80%
of the samples go into the training dataset. The
remaining samples go into the testing dataset.
Unsupervised anomaly detection: A random split of data. 80% of the samples go into the training dataset, and the remaining samples go into the testing dataset.
Semi-supervised anomaly detection: A stratified split of data.
Anomaly detection for log data: A split of data based on primary key values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset. Review requirements when running Anomaly Detection for Logs.
Recommendations: A random split of data.
Topic modeling: A random split of data.
To run the
TRAIN_TEST_SPLIT
routine, you use the following parameters:
table_name: You must provide the
fully qualified name of the table that contains the
dataset to split
(schema_name.table_name).
target_column_name: Classification
and semi-supervised anomaly detection tasks require a
target column. All other tasks do not require a target
column. If a target column is not required, you can set
this parameter to NULL.
options: Set the following options as
needed as key-value pairs in JSON object format. If no
options are needed, set this to NULL.
task: Set the appropriate machine
learning task: classification,
regression,
forecasting,
anomaly_detection,
log_anomaly_detection,
recommendation, or
topic_modeling. If the machine
learning task is not set, the default task is
classification.
datetime_index: Required for
forecasting tasks. The column that has datetime
values.
The following data types for this column are supported:
semisupervised: If running an
anomaly detection task, set this to
true for semi-supervised
learning, or false for
unsupervised learning. If this is set to
NULL, then the default value of
false is selected.
To automatically generate a training and testing dataset:
Run the
TRAIN_TEST_SPLIT
routine.
mysql> CALL sys.TRAIN_TEST_SPLIT('table_name', 'target_column_name', options);
Replace table_name,
target_column_name, and
options with your own values.
For example:
mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.data_files_1', 'class', JSON_OBJECT('task', 'classification'));
Confirm the two datasets are created ([original_table_name]_train and [original_table_name]_test) by querying the tables in the database.
mysql> SHOW TABLES;
+-------------------------+
| Tables_in_data_files_db |
+-------------------------+
| data_files_1 |
| data_files_1_test |
| data_files_1_train |
+-------------------------+
Learn how to Train a Model.