10.2.6 ML_PREDICT

ML_PREDICT_TABLE Overview

ML_PREDICT_TABLE is a compute intensive process. Depending on your MySQL version, we recommend the following:

Before MySQL 9.4.1, limit operations to batches of rows by splitting large tables into smaller tables by using the batch_size option.
For MySQL 9.4.1 and later, if ML_PREDICT_TABLE takes a long time to complete, manually limit input tables to a maximum of 1,000 rows.

A call to ML_PREDICT_TABLE can include columns that were not present during ML_TRAIN. A table can include extra columns, and still use the MySQL HeatWave AutoML model. This allows side by side comparisons of target column labels, ground truth, and predictions in the same table. ML_PREDICT_TABLE ignores any extra columns, and appends them to the results.

The output table includes a primary key:

If the input table has a primary key, the output table has the same primary key.
If the input table does not have a primary key, the output table has a new primary key column that auto increments.
- As of MySQL 8.4.1, the name of the new primary key column is _4aad19ca6e_pk_id. The input table must not have a column with the name _4aad19ca6e_pk_id that is not a primary key.
- Before MySQL 8.4.1, the name of the new primary key column is _id. The input table must not have a column with the name _id that is not a primary key.

The output of predictions includes the ml_results column, which contains the prediction results and the data. The combination of results and data must be less than 65,532 characters.

As of MySQL 9.4.1, you have the option to specify the input table and output table as the same table if specific conditions are met. See Input Tables and Output Tables to learn more.

ML_PREDICT_TABLE supports data drift detection for classification, regression, and anomaly detection (as of MySQL 9.3.2) models with the following:

The options parameter includes the additional_details boolean value.
The ml_results column includes the drift JSON object literal.

See Analyze Data Drift.

ML_PREDICT_TABLE Syntax

mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name), options);
 
options: {JSON_OBJECT(keyvalue[, keyvalue]...)|NULL}
keyvalue:
{
  'threshold', 'N'
  |'topk', 'N'
  |'recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'|NULL}
  |'remove_seen', {'true'|'false'}
  |'batch_size', 'N'
  |'additional_details', {'true'|'false'}
  |'prediction_interval', 'N'
  |'item_metadata', JSON_OBJECT('table_name'[,'database_name.table_name'] ...)
  |'user_metadata', JSON_OBJECT('table_name'[,'database_name.table_name'] ...)
  |'logad_options', JSON_OBJECT(("key","value"[,"key","value"] ...)
}

logad_options: {JSON_OBJECT(keyvalue[, keyvalue]...)}
keyvalue:
{
  summarize_logs', {'true'|'false'}
  |'summary_threshold', 'N'

Required ML_PREDICT_TABLE Parameters

Set the following required parameters:

table_name: Specifies the fully qualified name of the input table (database_name.table_name). The input table should contain the same feature columns as the training dataset. If the target column is included in the input table, it is not considered when generating predictions.
model_handle: Specifies the model handle or a session variable containing the model handle. See Work with Model Handles.
output_table_name: Specifies the table where predictions are stored. A fully qualified table name must be specified (database_name.table_name). As of MySQL 9.4.1, you have the option to specify the input table and output table as the same table if specific conditions are met. See Input Tables and Output Tables to learn more.

ML_PREDICT_TABLE Options

Set the following options in JSON format as needed.

To view data drift detection values for classification and regression models, set the additional_details option to true. The ml_results includes the drift JSON object literal.
batch_size: Deprecated as of MySQL 9.4.1. The size of each batch. You can set a value between 1 and 1,000. The default is 1,000, and this provides the best results. If on MySQL 9.4.1 and later and ML_PREDICT_TABLE takes a long time to complete, manually limit input tables to a maximum of 1,000 rows.

Additional options are available for recommendation, anomaly detection, and forecasting models.

Options for Recommendation Models

Set the following options as needed for recommendation models.

threshold: The optional threshold that defines positive feedback, and a relevant sample. Only use with ranking metrics. It can be used for either explicit or implicit feedback.
topk: The optional top number of recommendations to provide. The default is 3. Set a positive integer between 1 and the number of rows in the table.
A recommendation task with implicit feedback can use both threshold and topk.
recommend: Specify what to recommend.
- ratings: Use this option to predict ratings. This is the default value.
  The target column is prediction, and the values are float.
  The input table must contain at least two columns with the same names as the user column and item column from the training model.
- items: Use this option to recommend items for users.
  The target column is item_recommendation, and the values are:
```
JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
```
  The input table must contain at least one column with the same name as the user column from the training model.
- users: Use this option to recommend users for items.
  The target column is user_recommendation, and the values are:
```
JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
```
  The input table must contain at least one column with the same name as the item column from the training model.
- users_to_items: This is the same as items.
- items_to_users: This is the same as users.
- items_to_items: Use this option to recommend similar items for items.
  The target column is item_recommendation, and the values are:
```
JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"))
```
  The input table must contain at least one column with the same name as the item column from the training model.
- users_to_users: Use this option to recommend similar users for users.
  The target column is user_recommendation, and the values are:
```
JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"))
```
  The input table must at least contain a column with the same name as the user column from the training model.
remove_seen: If the input table overlaps with the training table, and remove_seen is true, then the model will not repeat existing interactions. The default is true. Set remove_seen to false to repeat existing interactions from the training table.
item_metadata: As of MySQL 9.5.0, it defines the table that has item descriptions. It is a JSON object that has the table_name option as a key, which specifies the table that has item descriptions. One column must be the same as the item_id in the input table.
user_metadata: As of MySQL 9.5.0, it defines the table that has user descriptions. It is a JSON object that has the table_name option as a key, which specifies the table that has user descriptions. One column must be the same as the user_id in the input table.
- table_name: To be used with the item_metadata and user_metadata options. It specifies the table name that has item or user descriptions. It must be a string in a fully qualified format (schema_name.table_name) that specifies the table name.

Requirements and Options for Anomaly Detection Models

As of MySQL 9.4.1, if you run ML_PREDICT_TABLE with the log_anomaly_detection task, at least one column must act as the primary key to establish the temporal order of logs.

Set the following options as needed for anomaly detection models.

threshold: The threshold you set on anomaly detection models determines which rows in the output table are labeled as anomalies with an anomaly score of 1, or normal with an anomaly score of 0. The value for the threshold is the degree to which a row of data or log segment is considered for anomaly detection. Any sample with an anomaly score above the threshold is classified an anomaly. The default value is (1 - contamination)-th percentile of all the anomaly scores.
topk: The optional top K rows to display with the highest anomaly scores. Set a positive integer between 1 and the number of rows in the table. If topk is not set, ML_PREDICT_TABLE uses threshold.
Do not set both threshold and topk. Use threshold or topk, or set options to NULL.
logad_options: A JSON_OBJECT that allows you to configure the following options for running an anomaly detection model on log data (MySQL 9.2.2 and later).
- summarize_logs: Allows you to leverage MySQL HeatWave GenAI to generate textual summaries of results. Enable this option by setting it to TRUE. If enabled, summaries are generated for log segments that are labeled as an anomaly or have anomaly scores higher than the value set for the summary_threshold.
- summary_threshold: Determines the rows in the output table that are summarized. This does not affect how the contamination and threshold options determine anomalies. You can set a value greater than 0 and less than 1. The default value is NULL. If NULL is selected, only the log segments tagged with is_anomaly are used to generate summaries.

Options for Forecasting Models

Set the following options as needed for forecasting models.

prediction_interval: Use this to generate forecasted values with lower and upper bounds based on a specific prediction interval (level of confidence). For the prediction_interval value:
- The default value is 0.95.
- The data type for this value must be FLOAT.
- The value must be greater than 0 and less than 1.

Syntax Examples

A typical usage example that specifies the fully qualified name of the table to generate predictions for, the session variable containing the model handle, and the fully qualified output table name.

mysql> CALL sys.ML_PREDICT_TABLE('census_data.census_train', @census_model, 'census_data.census_train_predictions', NULL);

To view ML_PREDICT_TABLE results, query the output table. The table shows the predictions and the feature column values used to make each prediction. The table includes the primary key, _4aad19ca6e_pk_id, and the ml_results column, displays the predictions and probabilities for each prediction.

mysql> SELECT * FROM census_train_predictions LIMIT 5;
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | age | workclass        | fnlwgt | education    | education-num | marital-status     | occupation        | relationship | race  | sex    | capital-gain | capital-loss | hours-per-week | native-country | revenue | Prediction | ml_results                                                                            |
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
|                 1 |  37 | Private          |  99146 | Bachelors    |            13 | Married-civ-spouse | Exec-managerial   | Husband      | White | Male   |            0 |         1977 |             50 | United-States  | >50K    | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.58, ">50K": 0.42}} |
|                 2 |  34 | Private          |  27409 | 9th          |             5 | Married-civ-spouse | Craft-repair      | Husband      | White | Male   |            0 |            0 |             50 | United-States  | <=50K   | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.76, ">50K": 0.24}} |
|                 3 |  30 | Private          | 299507 | Assoc-acdm   |            12 | Separated          | Other-service     | Unmarried    | White | Female |            0 |            0 |             40 | United-States  | <=50K   | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.99, ">50K": 0.01}} |
|                 4 |  62 | Self-emp-not-inc | 102631 | Some-college |            10 | Widowed            | Farming-fishing   | Unmarried    | White | Female |            0 |            0 |             50 | United-States  | <=50K   | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.9, ">50K": 0.1}}   |
|                 5 |  51 | Private          | 153486 | Some-college |            10 | Married-civ-spouse | Handlers-cleaners | Husband      | White | Male   |            0 |            0 |             40 | United-States  | <=50K   | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.7, ">50K": 0.3}}   |
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
5 rows in set (0.0014 sec)

The following example generates a table of recommendations. The output recommends the top three items that particular users will like.

mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample',  @model, 'mlcorpus.table_predictions_users',  
                                  JSON_OBJECT("recommend", "items", "topk", 3));
Query OK, 0 rows affected (5.0672 sec)

mysql> SELECT * FROM mlcorpus.table_predictions_users LIMIT 3;
+-------------------+---------+---------+--------+--------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | user_id | item_id | rating | ml_results                                                                     |
+-------------------+---------+---------+--------+--------------------------------------------------------------------------------+
|                 1 | 1026    | 13763   |      1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.43, 3.37, 3.18]}} |
|                 2 | 992     | 16114   |      1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.38, 3.17]}} |
|                 3 | 1863    | 4527    |      1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.37, 3.18]}} |
+-------------------+---------+---------+--------+--------------------------------------------------------------------------------+

The following example generates a table of anomaly detection predictions. A threshold value of 1% is specified. The ml_results column displays the prediction if each row is an anomaly, and the probability for that prediction.

mysql> CALL sys.ML_PREDICT_TABLE('volcano_data.volcano_data_train', 'anomaly_model', 'volcano_data.volcano_predictions', 
                                  JSON_OBJECT('threshold', 0.01));
Query OK, 0 rows affected (4.8868 sec)

mysql> SELECT * FROM mlcorpus_anomaly_detection.volcanoes-predictions_threshold LIMIT 5;
+-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | V1    | V2    | V3       | target | ml_results                                                                                 |
+-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+
|                 1 | 806.0 | 962.0 | 0.353207 | n      | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.978, "anomaly": 0.022}}   |
|                 2 | 326.0 | 254.0 | 0.368590 | n      | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9873, "anomaly": 0.0127}} |
|                 3 | 586.0 | 654.0 | 0.409559 | n      | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9786, "anomaly": 0.0214}} |
|                 4 | 370.0 | 816.0 | 0.404861 | n      | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9885, "anomaly": 0.0115}} |
|                 5 | 918.0 | 952.0 | 0.419940 | n      | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9767, "anomaly": 0.0233}} |
+-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+
5 rows in set (0.0409 sec)

The following example generates a table of anomaly detection predictions by using semi-supervised learning. It overrides the ensemble_score value from the ML_TRAIN routine to a new value of 0.5. The ml_results column displays the prediction if each row is an anomaly, and the probability for that prediction.

mysql> CALL sys.ML_PREDICT_TABLE('anomaly_data.anomaly_data_train', 'semisupervised_model', 
                                  'anomaly_data.anomaly_predictions_semisupervised', 
                                  CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
Query OK, 0 rows affected (1.5698 sec)
mysql> SELECT * FROM anomaly_predictions_semisupervised LIMIT 5;
+-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | att1                 | att2                | att3                | att4                 | att5                 | att6                | att7                 | att8     | att9               | 1utlier | ml_results                                                                                 |
+-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+
|                 1 | 0.114285714285714280 | 0.50000000000000000 | 0.46808510638297873 | 0.020145495243424735 | 0.252808988764044950 | 0.08848812744214006 | 0.666666666666666600 | 0.753125 | 0.6804123711340206 |       0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.992, "anomaly": 0.008}}   |
|                 2 | 0.171428571428571430 | 0.47159090909090910 | 0.25531914893617020 | 0.017907106883044210 | 0.258426966292134850 | 0.08806732792305380 | 0.472222222222222200 | 0.712500 | 0.6855670103092784 |       0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9986, "anomaly": 0.0014}} |
|                 3 | 0.142857142857142850 | 0.50568181818181820 | 0.12765957446808510 | 0.020145495243424735 | 0.258426966292134850 | 0.08854824165915239 | 0.416666666666666700 | 0.693750 | 0.6804123711340206 |       0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9945, "anomaly": 0.0055}} |
|                 4 | 0.114285714285714280 | 0.47727272727272730 | 0.25531914893617020 | 0.020145495243424735 | 0.252808988764044950 | 0.08860835587616471 | 0.527777777777777800 | 0.718750 | 0.6804123711340206 |       0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9984, "anomaly": 0.0016}} |
|                 5 | 0.114285714285714280 | 0.47727272727272730 | 0.34042553191489360 | 0.016787912702853944 | 0.247191011235955050 | 0.08794709948902915 | 0.583333333333333400 | 0.740625 | 0.6855670103092784 |       0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9989, "anomaly": 0.0011}} |
+-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+
5 rows in set (0.0491 sec)

The following example generates a table of anomaly detection predictions for log data (MySQL 9.2.2 and later). It disables log summaries in the results.

mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @logad_model, 'mlcorpus.log_anomaly_test_out', 
                                  JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', FALSE)));
mysql> SELECT * FROM mlcorpus.log_anomaly_test_out LIMIT 1; 
+----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+
| id | parsed_log_segment                                                                                                       | ml_results                                                                            |
+----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+
|  1 | 2024-04-11T14:39:45.443597Z 1 [Note] [MY-013546] [InnoDB] Atomic write enabled                                           | {"index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],                                        |
|    | 2024-04-11T14:39:45.443618Z 1 [Note] [MY-012932] [InnoDB] PUNCH HOLE support available                                   | "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} |
|    | 2024-04-11T14:39:45.443631Z 1 [Note] [MY-012944] [InnoDB] Uses event mutexes                                             |                                                                                       |
|    | 2024-04-11T14:39:45.443635Z 1 [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier |                                                                                       |
|    | 2024-04-11T14:39:45.443646Z 1 [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.2.13                              |                                                                                       |
|    | 2024-04-11T14:40:25.128143Z 0 [Note] [MY-010264] [Server] - '127.0.0.1' resolves to '127.0.0.1';                         |                                                                                       |
|    | 2024-04-11T14:40:25.128182Z 0 [Note] [MY-010251] [Server] Server socket created on IP: '127.0.0.1'.                      |                                                                                       |
|    | 2024-04-11T14:40:25.128245Z 0 [Note] [MY-010252] [Server] Server hostname (bind-address): '10.0.1.125'; port: 3306       |                                                                                       |
|    | 2024-04-11T14:40:25.128272Z 0 [Note] [MY-010264] [Server] - '10.0.1.125' resolves to '10.0.1.125';                       |                                                                                       |
|    | 2024-04-26T13:01:30.287325Z 0 [Warning] [MY-015116] [Server] Background histogram update on nexus.fetches:               |                                                                                       |
|    | Lock wait timeout exceeded; try restarting transaction                                                                   |                                                                                       |
+----+------+------+----------+--------+----------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+

10.2.6 ML_PREDICT_TABLE