MySQL AI User Guide
ML_PREDICT_TABLE
generates predictions for an entire table of unlabeled data.
MySQL HeatWave AutoML performs the predictions in parallel.
This topic has the following sections.
ML_PREDICT_TABLE
is a compute intensive process. If
ML_PREDICT_TABLE
takes a long time to complete, manually limit input tables to
a maximum of 1,000 rows.
A call to
ML_PREDICT_TABLE
can include columns that were not present during
ML_TRAIN
.
A table can include extra columns, and still use the
MySQL HeatWave AutoML model. This allows side by side comparisons of
target column labels, ground truth, and predictions in the
same table.
ML_PREDICT_TABLE
ignores any extra columns, and appends them to the results.
The output table includes a primary key:
If the input table has a primary key, the output table has the same primary key.
If the input table does not have a primary key, the output
table has a new primary key column that auto increments.
The name of the new primary key column is
_4aad19ca6e_pk_id
. The input table must
not have a column with the name
_4aad19ca6e_pk_id
that is not a primary
key.
The output of predictions includes the
ml_results
column, which contains the
prediction results and the data. The combination of results
and data must be less than 65,532 characters.
You have the option to specify the input table and output table as the same table if specific conditions are met. See Input Tables and Output Tables to learn more.
ML_PREDICT_TABLE
supports data drift detection for classification and
regression models with the following:
The options
parameter includes
the additional_details
boolean value.
The ml_results
column includes the
drift
JSON object literal.
See Analyze Data Drift.
mysql>CALL sys.ML_PREDICT_TABLE(
table_name
,model_handle
,output_table_name
), [options
]);options
: { JSON_OBJECT("key
","value
"[,"key
","value
"] ...)"key","value"
: { ['threshold', 'N
'] ['topk', 'N
'] ['recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'}|NULL] ['remove_seen', {'true'|'false'}] ['additional_details', {'true'|'false'}] ['prediction_interval', 'N
'] ['logad_options', JSON_OBJECT(("key
","value
"[,"key
","value
"] ...)"key","value"
: { ['summarize_logs', {'true'|'false'}] ['summary_threshold', 'N
'] } } }
Set the following required parameters:
table_name
: Specifies the fully
qualified name of the input table
(database_name.table_name
). The
input table should contain the same feature columns as the
training dataset. If the target column is included in the
input table, it is not considered when generating
predictions.
model_handle
: Specifies the model
handle or a session variable containing the model handle.
See Work with
Model Handles.
output_table_name
: Specifies the table
where predictions are stored. A fully qualified table name
must be specified
(database_name.table_name
). You
have the option to specify the input table and output
table as the same table if specific conditions are met.
See
Input Tables and Output Tables
to learn more.
Set the following options in JSON
format as
needed.
To view data drift detection values for classification and
regression models, set the
additional_details
option to
true
. The ml_results
includes the drift
JSON object literal.
Additional options are available for recommendation, anomaly detection, and forecasting models.
Set the following options as needed for recommendation models.
threshold
: The optional threshold that
defines positive feedback, and a relevant sample. Only use
with ranking metrics. It can be used for either explicit
or implicit feedback.
topk
: The optional top number of
recommendations to provide. The default is
3
. Set a positive integer between 1 and
the number of rows in the table.
A recommendation
task with implicit
feedback can use both threshold
and
topk
.
recommend
: Specify what to recommend.
ratings
: Use this option to predict
ratings. This is the default value.
The target column is prediction
,
and the values are float
.
The input table must contain at least two columns with the same names as the user column and item column from the training model.
items
: Use this option to recommend
items for users.
The target column is
item_recommendation
, and the values
are:
JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
The input table must contain at least one column with the same name as the user column from the training model.
users
: Use this option to recommend
users for items.
The target column is
user_recommendation
, and the values
are:
JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
The input table must contain at least one column with the same name as the item column from the training model.
users_to_items
: This is the same as
items
.
items_to_users
: This is the same as
users
.
items_to_items
: Use this option to
recommend similar items for items.
The target column is
item_recommendation
, and the values
are:
JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"))
The input table must contain at least one column with the same name as the item column from the training model.
users_to_users
: Use this option to
recommend similar users for users.
The target column is
user_recommendation
, and the values
are:
JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"))
The input table must at least contain a column with the same name as the user column from the training model.
remove_seen
: If the input table
overlaps with the training table, and
remove_seen
is true
,
then the model will not repeat existing interactions. The
default is true
. Set
remove_seen
to false
to repeat existing interactions from the training table.
If you run
ML_PREDICT_TABLE
with the log_anomaly_detection
task, at
least one column must act as the primary key to establish the
temporal order of logs.
Set the following options as needed for anomaly detection models.
threshold
: The threshold you set on
anomaly detection models determines which rows in the
output table are labeled as anomalies with an anomaly
score of 1
, or normal with an anomaly
score of 0
. The value for the threshold
is the degree to which a row of data or log segment is
considered for anomaly detection. Any sample with an
anomaly score above the threshold is classified an
anomaly. The default value is (1 -
contamination
)-th percentile of all the
anomaly scores.
topk
: The optional top K rows to
display with the highest anomaly scores. Set a positive
integer between 1 and the number of rows in the table. If
topk
is not set,
ML_PREDICT_TABLE
uses threshold
.
Do not set both threshold
and
topk
. Use threshold
or topk
, or set
options
to
NULL
.
logad_options
: A
JSON_OBJECT
that allows you to
configure the following options for running an anomaly
detection model on log data.
summarize_logs
: Allows you to
leverage MySQL HeatWave GenAI to generate textual summaries of
results. Enable this option by setting it to
TRUE
. If enabled, summaries are
generated for log segments that are labeled as an
anomaly or have anomaly scores higher than the value
set for the summary_threshold
.
summary_threshold
: Determines the
rows in the output table that are summarized. This
does not affect how the
contamination
and
threshold
options determine
anomalies. You can set a value greater than 0 and less
than 1. The default value is NULL
.
If NULL
is selected, only the log
segments tagged with is_anomaly
are
used to generate summaries.
Set the following options as needed for forecasting models.
prediction_interval
: Use this to
generate forecasted values with lower and upper bounds
based on a specific prediction interval (level of
confidence). For the
prediction_interval
value:
The default value is 0.95.
The data type for this value must be FLOAT.
The value must be greater than 0 and less than 1.
A typical usage example that specifies the fully qualified name of the table to generate predictions for, the session variable containing the model handle, and the fully qualified output table name.
mysql> CALL sys.ML_PREDICT_TABLE('census_data.census_train', @census_model, 'census_data.census_train_predictions', NULL);
To view
ML_PREDICT_TABLE
results, query the output table. The table shows the
predictions and the feature column values used to make
each prediction. The table includes the primary key,
_4aad19ca6e_pk_id
, and the
ml_results
column, which uses
JSON
format:
mysql> SELECT * FROM census_train_predictions LIMIT 5;
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | revenue | Prediction | ml_results |
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
| 1 | 37 | Private | 99146 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1977 | 50 | United-States | >50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.58, ">50K": 0.42}} |
| 2 | 34 | Private | 27409 | 9th | 5 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.76, ">50K": 0.24}} |
| 3 | 30 | Private | 299507 | Assoc-acdm | 12 | Separated | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.99, ">50K": 0.01}} |
| 4 | 62 | Self-emp-not-inc | 102631 | Some-college | 10 | Widowed | Farming-fishing | Unmarried | White | Female | 0 | 0 | 50 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.9, ">50K": 0.1}} |
| 5 | 51 | Private | 153486 | Some-college | 10 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.7, ">50K": 0.3}} |
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
5 rows in set (0.0014 sec)
The following example generates a table of recommendations. The output recommends the top three items that particular users will like.
mysql>CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample', @model, 'mlcorpus.table_predictions_users', JSON_OBJECT("recommend", "items", "topk", 3));
Query OK, 0 rows affected (5.0672 sec) mysql>SELECT * FROM mlcorpus.table_predictions_users LIMIT 3;
+-------------------+---------+---------+--------+--------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | user_id | item_id | rating | ml_results | +-------------------+---------+---------+--------+--------------------------------------------------------------------------------+ | 1 | 1026 | 13763 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.43, 3.37, 3.18]}} | | 2 | 992 | 16114 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.38, 3.17]}} | | 3 | 1863 | 4527 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.37, 3.18]}} | +-------------------+---------+---------+--------+--------------------------------------------------------------------------------+
The following example generates a table of anomaly detection predictions. A threshold value of 1% is specified.
mysql>CALL sys.ML_PREDICT_TABLE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', @anomaly, 'mlcorpus_anomaly_detection.volcanoes-predictions_threshold', JSON_OBJECT('threshold', 0.01));
Query OK, 0 rows affected (12.77 sec) mysql>SELECT * FROM mlcorpus_anomaly_detection.volcanoes-predictions_threshold LIMIT 5;
+-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | V1 | V2 | V3 | target | ml_results | +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | 1 | 128 | 802 | 0.47255 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | | 2 | 631 | 642 | 0.387302 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.96, 'anomaly': 0.04}} | | 3 | 438 | 959 | 0.556034 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.74, 'anomaly': 0.26}} | | 4 | 473 | 779 | 0.407626 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.87, 'anomaly': 0.13}} | | 5 | 67 | 933 | 0.383843 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ 5 rows in set (0.00 sec)
The following example generates a table of anomaly
detection predictions by using semi-supervised learning.
It overrides the ensemble_score
value
from the
ML_TRAIN
routine to a new value of 0.5.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.anomaly_train',@semsup_gknn, 'mlcorpus.preds_gknn_weighted', CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
The following example generates a table of anomaly detection predictions for log data. It disables log summaries in the results.
mysql>CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @logad_model, 'mlcorpus.log_anomaly_test_out', JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', FALSE)));
mysql>SELECT * FROM mlcorpus.log_anomaly_test_out LIMIT 1;
+----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | id | parsed_log_segment | ml_results | +----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | 1 | 2024-04-11T14:39:45.443597Z 1 [Note] [MY-013546] [InnoDB] Atomic write enabled | {"index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], | | | 2024-04-11T14:39:45.443618Z 1 [Note] [MY-012932] [InnoDB] PUNCH HOLE support available | "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} | | | 2024-04-11T14:39:45.443631Z 1 [Note] [MY-012944] [InnoDB] Uses event mutexes | | | | 2024-04-11T14:39:45.443635Z 1 [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier | | | | 2024-04-11T14:39:45.443646Z 1 [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.2.13 | | | | 2024-04-11T14:40:25.128143Z 0 [Note] [MY-010264] [Server] - '127.0.0.1' resolves to '127.0.0.1'; | | | | 2024-04-11T14:40:25.128182Z 0 [Note] [MY-010251] [Server] Server socket created on IP: '127.0.0.1'. | | | | 2024-04-11T14:40:25.128245Z 0 [Note] [MY-010252] [Server] Server hostname (bind-address): '10.0.1.125'; port: 3306 | | | | 2024-04-11T14:40:25.128272Z 0 [Note] [MY-010264] [Server] - '10.0.1.125' resolves to '10.0.1.125'; | | | | 2024-04-26T13:01:30.287325Z 0 [Warning] [MY-015116] [Server] Background histogram update on nexus.fetches: | | | | Lock wait timeout exceeded; try restarting transaction | | +----+------+------+----------+--------+----------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+