MySQL HeatWave User Guide
ML_PREDICT_TABLE
generates
predictions for an entire table of unlabeled data and saves the
results to an output table. MySQL HeatWave AutoML performs the predictions
in parallel.
This topic has the following sections.
ML_PREDICT_TABLE
is a compute
intensive process. Limiting operations to batches of 10 to 100
rows by splitting large tables into smaller tables is
recommended. To do this, use the batch_size
option.
A call to ML_PREDICT_TABLE
can
include columns that were not present during
ML_TRAIN
. A table can include
extra columns, and still use the MySQL HeatWave AutoML model. This allows
side by side comparisons of target column labels, ground
truth, and predictions in the same table.
ML_PREDICT_TABLE
ignores any
extra columns, and appends them to the results.
The output table includes a primary key:
If the input table has a primary key, the output table has the same primary key.
If the input table does not have a primary key, the output table has a new primary key column that auto increments.
As of MySQL 8.4.1, the name of the new primary key
column is _4aad19ca6e_pk_id
. The
input table must not have a column with the name
_4aad19ca6e_pk_id
that is not a
primary key.
Before MySQL 8.4.1, the name of the new primary key
column is _id
. The input table must
not have a column with the name _id
that is not a primary key.
The output table includes the ml_results
column, which contains the prediction results and the data.
The combination of results and data must be less than 65,532
characters.
ML_PREDICT_TABLE
supports data
drift detection for classification, regression, and anomaly
detection (as of MySQL 9.3.2) models with the following:
The options
parameter includes
the additional_details
boolean value.
The ml_results
column includes the
drift
JSON object literal.
See Analyze Data Drift.
mysql>CALL sys.ML_PREDICT_TABLE(
table_name
,model_handle
,output_table_name
), [options
]);options
: { JSON_OBJECT("key
","value
"[,"key
","value
"] ...)"key","value"
: { ['threshold', 'N
'] ['topk', 'N
'] ['recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'}|NULL] ['remove_seen', {'true'|'false'}] ['batch_size', 'N
'] ['additional_details', {'true'|'false'}] ['prediction_interval', 'N
'] ['logad_options', JSON_OBJECT(("key
","value
"[,"key
","value
"] ...)"key","value"
: { ['summarize_logs', {'true'|'false'}] ['summary_threshold', 'N
'] } } }
Set the following required parameters:
table_name
: Specifies the fully
qualified name of the input table
(database_name.table_name
). The
input table should contain the same feature columns as the
training dataset. If the target column is included in the
input table, it is not considered when generating
predictions.
model_handle
: Specifies the model
handle or a session variable containing the model handle.
See Work with
Model Handles.
output_table_name
: Specifies the table
where predictions are stored. A fully qualified table name
must be specified
(database_name.table_name
). If
the table already exists, an error is returned.
Set the following options in JSON
format as
needed.
To view data drift detection values for classification and
regression models, set the
additional_details
option to
true
. The ml_results
includes the drift
JSON object literal.
batch_size
: The size of each batch (1
≤ batch_size
≤ 1,000) The default
is 1,000, and this provides the best results.
Additional options are available for recommendation, anomaly detection, and forecasting models.
Set the following options as needed for recommendation models.
threshold
: The optional threshold that
defines positive feedback, and a relevant sample. Only use
with ranking metrics. It can be used for either explicit
or implicit feedback.
topk
: The optional top number of
recommendations to provide. The default is
3
. Set a positive integer between 1 and
the number of rows in the table.
A recommendation
task with implicit
feedback can use both threshold
and
topk
.
recommend
: Specify what to recommend.
ratings
: Use this option to predict
ratings. This is the default value.
The target column is prediction
,
and the values are float
.
The input table must contain at least two columns with the same names as the user column and item column from the training model.
items
: Use this option to recommend
items for users.
The target column is
item_recommendation
, and the values
are:
JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
The input table must contain at least one column with the same name as the user column from the training model.
users
: Use this option to recommend
users for items.
The target column is
user_recommendation
, and the values
are:
JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
The input table must contain at least one column with the same name as the item column from the training model.
users_to_items
: This is the same as
items
.
items_to_users
: This is the same as
users
.
items_to_items
: Use this option to
recommend similar items for items.
The target column is
item_recommendation
, and the values
are:
JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"))
The input table must contain at least one column with the same name as the item column from the training model.
users_to_users
: Use this option to
recommend similar users for users.
The target column is
user_recommendation
, and the values
are:
JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"))
The input table must at least contain a column with the same name as the user column from the training model.
remove_seen
: If the input table
overlaps with the training table, and
remove_seen
is true
,
then the model will not repeat existing interactions. The
default is true
. Set
remove_seen
to false
to repeat existing interactions from the training table.
Set the following options as needed for anomaly detection models.
threshold
: The threshold you set on
anomaly detection models determines which rows in the
output table are labeled as anomalies with an anomaly
score of 1
, or normal with an anomaly
score of 0
. The value for the threshold
is the degree to which a row of data or log segment is
considered for anomaly detection. Any sample with an
anomaly score above the threshold is classified an
anomaly. The default value is (1 -
contamination
)-th percentile of all the
anomaly scores.
topk
: The optional top K rows to
display with the highest anomaly scores. Set a positive
integer between 1 and the number of rows in the table. If
topk
is not set,
ML_PREDICT_TABLE
uses
threshold
.
Do not set both threshold
and
topk
. Use threshold
or topk
, or set
options
to
NULL
.
logad_options
: A
JSON_OBJECT
that allows you to
configure the following options for running an anomaly
detection model on log data (MySQL 9.2.2 and later).
summarize_logs
: Allows you to
leverage MySQL HeatWave GenAI to generate textual summaries of
results. Enable this option by setting it to
TRUE
. If enabled, summaries are
generated for log segments that are labeled as an
anomaly or have anomaly scores higher than the value
set for the summary_threshold
.
summary_threshold
: Determines the
rows in the output table that are summarized. This
does not affect how the
contamination
and
threshold
options determine
anomalies. You can set a value greater than 0 and less
than 1. The default value is NULL
.
If NULL
is selected, only the log
segments tagged with is_anomaly
are
used to generate summaries.
Set the following options as needed for forecasting models.
prediction_interval
: Use this to
generate forecasted values with lower and upper bounds
based on a specific prediction interval (level of
confidence). For the
prediction_interval
value:
The default value is 0.95.
The data type for this value must be FLOAT.
The value must be greater than 0 and less than 1.
A typical usage example that specifies the fully qualified name of the table to generate predictions for, the session variable containing the model handle, and the fully qualified output table name.
mysql> CALL sys.ML_PREDICT_TABLE('census_data.census_train', @census_model, 'census_data.census_train_predictions', NULL);
To view ML_PREDICT_TABLE
results, query the output table. The table shows the
predictions and the feature column values used to make
each prediction. The table includes the primary key,
_4aad19ca6e_pk_id
, and the
ml_results
column, displays the
predictions and probabilities for each prediction.
mysql> SELECT * FROM census_train_predictions LIMIT 5;
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | revenue | Prediction | ml_results |
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
| 1 | 37 | Private | 99146 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1977 | 50 | United-States | >50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.58, ">50K": 0.42}} |
| 2 | 34 | Private | 27409 | 9th | 5 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.76, ">50K": 0.24}} |
| 3 | 30 | Private | 299507 | Assoc-acdm | 12 | Separated | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.99, ">50K": 0.01}} |
| 4 | 62 | Self-emp-not-inc | 102631 | Some-college | 10 | Widowed | Farming-fishing | Unmarried | White | Female | 0 | 0 | 50 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.9, ">50K": 0.1}} |
| 5 | 51 | Private | 153486 | Some-college | 10 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.7, ">50K": 0.3}} |
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
5 rows in set (0.0014 sec)
The following example generates a table of recommendations. The output recommends the top three items that particular users will like.
mysql>CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample', @model, 'mlcorpus.table_predictions_users', JSON_OBJECT("recommend", "items", "topk", 3));
Query OK, 0 rows affected (5.0672 sec) mysql>SELECT * FROM mlcorpus.table_predictions_users LIMIT 3;
+-------------------+---------+---------+--------+--------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | user_id | item_id | rating | ml_results | +-------------------+---------+---------+--------+--------------------------------------------------------------------------------+ | 1 | 1026 | 13763 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.43, 3.37, 3.18]}} | | 2 | 992 | 16114 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.38, 3.17]}} | | 3 | 1863 | 4527 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.37, 3.18]}} | +-------------------+---------+---------+--------+--------------------------------------------------------------------------------+
The following example generates a table of anomaly
detection predictions. A threshold value of 1% is
specified. The ml_results
column
displays the prediction if each row is an anomaly, and the
probability for that prediction.
mysql>CALL sys.ML_PREDICT_TABLE('volcano_data.volcano_data_train', 'anomaly_model', 'volcano_data.volcano_predictions', JSON_OBJECT('threshold', 0.01));
Query OK, 0 rows affected (4.8868 sec) mysql>SELECT * FROM mlcorpus_anomaly_detection.volcanoes-predictions_threshold LIMIT 5;
+-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | V1 | V2 | V3 | target | ml_results | +-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+ | 1 | 806.0 | 962.0 | 0.353207 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.978, "anomaly": 0.022}} | | 2 | 326.0 | 254.0 | 0.368590 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9873, "anomaly": 0.0127}} | | 3 | 586.0 | 654.0 | 0.409559 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9786, "anomaly": 0.0214}} | | 4 | 370.0 | 816.0 | 0.404861 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9885, "anomaly": 0.0115}} | | 5 | 918.0 | 952.0 | 0.419940 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9767, "anomaly": 0.0233}} | +-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+ 5 rows in set (0.0409 sec)
The following example generates a table of anomaly
detection predictions by using semi-supervised learning.
It overrides the ensemble_score
value
from the ML_TRAIN
routine
to a new value of 0.5. The ml_results
column displays the prediction if each row is an anomaly,
and the probability for that prediction.
mysql>CALL sys.ML_PREDICT_TABLE('anomaly_data.anomaly_data_train', 'semisupervised_model', 'anomaly_data.anomaly_predictions_semisupervised', CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
Query OK, 0 rows affected (1.5698 sec) mysql>SELECT * FROM anomaly_predictions_semisupervised LIMIT 5;
+-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | att1 | att2 | att3 | att4 | att5 | att6 | att7 | att8 | att9 | 1utlier | ml_results | +-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+ | 1 | 0.114285714285714280 | 0.50000000000000000 | 0.46808510638297873 | 0.020145495243424735 | 0.252808988764044950 | 0.08848812744214006 | 0.666666666666666600 | 0.753125 | 0.6804123711340206 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.992, "anomaly": 0.008}} | | 2 | 0.171428571428571430 | 0.47159090909090910 | 0.25531914893617020 | 0.017907106883044210 | 0.258426966292134850 | 0.08806732792305380 | 0.472222222222222200 | 0.712500 | 0.6855670103092784 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9986, "anomaly": 0.0014}} | | 3 | 0.142857142857142850 | 0.50568181818181820 | 0.12765957446808510 | 0.020145495243424735 | 0.258426966292134850 | 0.08854824165915239 | 0.416666666666666700 | 0.693750 | 0.6804123711340206 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9945, "anomaly": 0.0055}} | | 4 | 0.114285714285714280 | 0.47727272727272730 | 0.25531914893617020 | 0.020145495243424735 | 0.252808988764044950 | 0.08860835587616471 | 0.527777777777777800 | 0.718750 | 0.6804123711340206 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9984, "anomaly": 0.0016}} | | 5 | 0.114285714285714280 | 0.47727272727272730 | 0.34042553191489360 | 0.016787912702853944 | 0.247191011235955050 | 0.08794709948902915 | 0.583333333333333400 | 0.740625 | 0.6855670103092784 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9989, "anomaly": 0.0011}} | +-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+ 5 rows in set (0.0491 sec)
The following example generates a table of anomaly detection predictions for log data (MySQL 9.2.2 and later). It disables log summaries in the results.
mysql>CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @logad_model, 'mlcorpus.log_anomaly_test_out', JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', FALSE)));
mysql>SELECT * FROM mlcorpus.log_anomaly_test_out LIMIT 1;
+----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | id | parsed_log_segment | ml_results | +----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | 1 | 2024-04-11T14:39:45.443597Z 1 [Note] [MY-013546] [InnoDB] Atomic write enabled | {"index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], | | | 2024-04-11T14:39:45.443618Z 1 [Note] [MY-012932] [InnoDB] PUNCH HOLE support available | "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} | | | 2024-04-11T14:39:45.443631Z 1 [Note] [MY-012944] [InnoDB] Uses event mutexes | | | | 2024-04-11T14:39:45.443635Z 1 [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier | | | | 2024-04-11T14:39:45.443646Z 1 [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.2.13 | | | | 2024-04-11T14:40:25.128143Z 0 [Note] [MY-010264] [Server] - '127.0.0.1' resolves to '127.0.0.1'; | | | | 2024-04-11T14:40:25.128182Z 0 [Note] [MY-010251] [Server] Server socket created on IP: '127.0.0.1'. | | | | 2024-04-11T14:40:25.128245Z 0 [Note] [MY-010252] [Server] Server hostname (bind-address): '10.0.1.125'; port: 3306 | | | | 2024-04-11T14:40:25.128272Z 0 [Note] [MY-010264] [Server] - '10.0.1.125' resolves to '10.0.1.125'; | | | | 2024-04-26T13:01:30.287325Z 0 [Warning] [MY-015116] [Server] Background histogram update on nexus.fetches: | | | | Lock wait timeout exceeded; try restarting transaction | | +----+------+------+----------+--------+----------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+