MySQL HeatWave User Guide
After training the model, you can generate predictions.
To generate predictions, use the sample data from the two
anomaly detection datasets:
credit_card_train
and
training_data
. Both datasets have labeled
and unlabeled rows, but only the dataset for semi-supervised
learning uses this for training. The other dataset for log
data is trained using unsupervised learning. Having labels for
both datasets allows you to compare the predictions to the
actual values and determine if the predictions are reliable.
Once you determine the trained model is reliable for
generating predictions, you can start using it on unseen data.
Anomaly detection models produce anomaly scores, which
represent the degree to which a data point deviates from the
expected normal behavior. Higher scores indicate a greater
degree of abnormality, potentially signaling an anomaly that
warrants further investigation. In the results,
is_anomaly
generates a value of 1 for an
anomaly, or 0 for normal. The normal
value
represents the degree to which the row of data or log segment
exhibits normal behavior. The anomaly
value
represents the degree to which the row of data or log segment
exhibits anomalous behavior.
To detect anomalies, run the ML_PREDICT
routines on data with the same columns as the training model.
To detect anomalies in row data, you can run the
ML_PREDICT_ROW
or
ML_PREDICT_TABLE
routines.
To detect anomalies in log data (MySQL 9.2.2 and later),
you can only run the
ML_PREDICT_TABLE
routine.
This topic has the following sections.
Complete the following tasks:
The threshold you set on anomaly detection models determines which rows in the output table are labeled as anomalies. The value for the threshold is the degree to which a row of data or log segment is considered for anomaly detection. Any sample with an anomaly score above the threshold is classified an anomaly.
There are two ways to set threshold values for anomaly detection models.
Set the Contamination Value
You can set the contamination
option for
the ML_TRAIN
routine. This
option uses the following calculation to set the threshold:
(1 - contamination
)-th percentile of all
the anomaly scores.
The default contamination
value is 0.01.
The default threshold
value based on the
default contamination
value is the
0.99-th percentile of all the anomaly scores.
Set the Threshold Value
You can set the threshold
option for the
ML_PREDICT_TABLE
,
ML_PREDICT_ROW
, and
ML_SCORE
routines. The value
must be greater than 0 and less than 1.
If no value is set for the threshold
option, the value set for the
contamination
option in the
ML_TRAIN
routine determines
the threshold.
The following additional options are available:
An alternative to threshold
is
topk
. The results include the top K
rows with the highest anomaly scores. The
ML_PREDICT_TABLE
and
ML_SCORE
routines include
the topk
option, which is an integer
between 1 and the table length.
ML_SCORE
includes an
options parameter in JSON
format. The
options are threshold
and
topk
.
When running a semi-supervised model, the
ML_PREDICT_ROW
,
ML_PREDICT_TABLE
, and
ML_SCORE
routines have
the supervised_submodel_weight
option. It allows you to override the
ensemble_score
weighting estimated
during ML_TRAIN
with a
new value. The value must be greater than 0 and less
than 1.
When you run anomaly detection on log data (MySQL 9.2.2 and later), you have the option to leverage MySQL HeatWave GenAI for textual summaries of the results. To create summaries, use the following options:
summarize_logs
: Enable summaries by
setting this to TRUE
. If enabled,
summaries are generated for log segments that are
labeled as an anomaly or exceed the value set for the
summary_threshold
.
summary_threshold
: Determines the
rows in the output table that are summarized. This does
not affect how the contamination
and
threshold
options determine
anomalies. You can set a value greater than 0 and less
than 1. The default value is NULL
.
Summaries are generated for the following:
All rows labeled as anomalies.
If a value is set for
summary_threshold
, any
non-anomaly rows that exceed the value of the
summary_threshold
.
If the default NULL
value is used for
summary_threshold
, then only rows
labeled as anomalies are summarized.
Enabling the summary_threshold
option
and setting a very low threshold value can potentially
lead to a high number of summaries being generated, which
may substantially increase the time required to generate
output tables.
If not already done, load the model. You can use the
session variable for the model that is valid for the
duration of the connection. Alternatively, you can use
the model handle previously set. For the option to set
the user name, you can set it to
NULL
.
The following example uses the session variable.
mysql> CALL sys.ML_MODEL_LOAD(@semi_supervised_model, NULL);
The following example uses the model handle.
mysql> CALL sys.ML_MODEL_LOAD('anomaly_detection_semi_supervised_use_case', NULL);
Make predictions for the test dataset by using the
ML_PREDICT_TABLE
routine.
mysql> CALL sys.ML_PREDICT_TABLE(table_name
, model_handle
, output_table_name
), [options
]);
Replace table_name
,
model_handle
, and
output_table_name
with your
own values. Add options
as
needed.
The following example runs
ML_PREDICT_TABLE
on the testing dataset previously created.
mysql> CALL sys.ML_PREDICT_TABLE('anomaly_data.credit_card_train', @semi_supervised_model, 'anomaly_data.credit_card_predictions_semi', JSON_OBJECT('threshold', 0.55));
Where:
anomaly_data.credit_card_train
is
the fully qualified name of the input table that
contains the data to generate predictions for
(database_name.table_name
).
@model
is the session variable
for the model handle.
anomaly_data.credit_card_predictions_semi
is the fully qualified name of the output table with
predictions
(database_name.table_name
).
JSON_OBJECT('threshold', 0.55)
sets a threshold value of 55%, which means any row
that generates an anomaly score of over 55% is
labeled as an anomaly.
Query the target
and
ml_results
columns from the output
table. This allows you to compare the real value with
the generated anomaly prediction. Review
is_anomaly
to see if the row is
labeled as an anomaly (1) or normal (0). Review the
anomaly score for each prediction next to
normal
and
anomaly
. If needed, you can also
query all the columns from the table (SELECT *
FROM credit_card_predictions_semi
) to review
all the data at once.
mysql> SELECT target, ml_results FROM credit_card_predictions_semi;
+--------+--------------------------------------------------------------------------------------------+
| target | ml_results |
+--------+--------------------------------------------------------------------------------------------+
| 0 | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.43, "anomaly": 0.57}} |
| 1 | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.4377, "anomaly": 0.5623}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8677, "anomaly": 0.1323}} |
| NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8652, "anomaly": 0.1348}} |
| 1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.4921, "anomaly": 0.5079}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8487, "anomaly": 0.1513}} |
| NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.7622, "anomaly": 0.2378}} |
| 1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.57, "anomaly": 0.43}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8317, "anomaly": 0.1683}} |
| NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8539, "anomaly": 0.1461}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9264, "anomaly": 0.0736}} |
| 1 | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.4079, "anomaly": 0.5921}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8379, "anomaly": 0.1621}} |
| NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.7971, "anomaly": 0.2029}} |
| 1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.4623, "anomaly": 0.5377}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8816, "anomaly": 0.1184}} |
| NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8267, "anomaly": 0.1733}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8816, "anomaly": 0.1184}} |
| 1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.4661, "anomaly": 0.5339}} |
| NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8202, "anomaly": 0.1798}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9113, "anomaly": 0.0887}} |
| 1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.5078, "anomaly": 0.4922}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9378, "anomaly": 0.0622}} |
| NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8963, "anomaly": 0.1037}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.5262, "anomaly": 0.4738}} |
| 1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.5002, "anomaly": 0.4998}} |
| NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8767, "anomaly": 0.1233}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8878, "anomaly": 0.1122}} |
| 1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.4661, "anomaly": 0.5339}} |
| 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9037, "anomaly": 0.0963}} |
| NULL | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.4171, "anomaly": 0.5829}} |
+--------+--------------------------------------------------------------------------------------------+
31 rows in set (0.0479 sec)
To learn more about generating predictions for one or more rows of data, see Generate Predictions for a Row of Data.
If not already done, load the model. You can use the
session variable for the model that is valid for the
duration of the connection. Alternatively, you can use
the model handle previously set. For the option to set
the user name, you can set it to
NULL
.
The following example uses the session variable.
mysql> CALL sys.ML_MODEL_LOAD(@unsupervised_log_model, NULL);
The following example uses the model handle.
mysql> CALL sys.ML_MODEL_LOAD('anomaly_detection_log_use_case', NULL);
Make predictions for the test dataset by using the
ML_PREDICT_TABLE
routine.
mysql> CALL sys.ML_PREDICT_TABLE(table_name
, model_handle
, output_table_name
), [options
]);
Replace table_name
,
model_handle
, and
output_table_name
with your
own values. Add options
as
needed.
The following example runs
ML_PREDICT_TABLE
on the testing dataset previously created.
mysql> CALL sys.ML_PREDICT_TABLE('anomaly_log_data.testing_data', @unsupervised_log_model, 'anomaly_log_data.log_predictions_unsupervised',
JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', TRUE)));
Where:
anomaly_log_data.testing_data
is
the fully qualified name of the input table that
contains the data to generate predictions for
(database_name.table_name
).
@model
is the session variable
for the model handle.
anomaly_log_data.log_predictions_unsupervised
is the fully qualified name of the output table with
predictions
(database_name.table_name
).
JSON_OBJECT('logad_options',
JSON_OBJECT('summarize_logs', TRUE))
enables the textual summaries generated by
MySQL HeatWave GenAI. No threshold is set for the summaries, so
the default value of any labeled anomaly generates a
summary.
Query the output table and compare the real value with
the generated anomaly prediction. Use
\G
to view the output in an easily
readable format.
mysql> SELECT * FROM log_predictions_unsupervised\G
*************************** 1. row ***************************
id: 1
parsed_log_segment: User login failed: Invalid credentials Server response time increased Normal database query Unusual network traffic from IP: 10.0.0.5 System update completed successfully
Error log: Stack trace included User activity: Admin accessed settings Unlabeled log: Further investigation needed Security alert: Potential malware detected System shutdown initiated
ml_results: {"summary": "\nHere is a concise summary of the text:\n\nThe system encountered several issues, including an invalid login attempt, increased server response time, and unusual network traffic.
A potential malware detection was also triggered, prompting a security alert. The system has been shut down for further investigation.",
"index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}}
*************************** 2. row ***************************
id: 2
parsed_log_segment: Unusual network traffic from IP: 10.0.0.5 System update completed successfully Error log: Stack trace included User activity: Admin accessed settings Unlabeled log: Further investigation needed Security alert: Potential malware detected System shutdown initiated
ml_results: {"summary": "\nHere is a concise summary:\n\nA system update was completed successfully, but an error log indicates potential malware detection and requires further investigation.",
"index_map": [4, 5, 6, 7, 8, 9, 10], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}}
*************************** 3. row ***************************
id: 3
parsed_log_segment: User activity: Admin accessed settings Unlabeled log: Further investigation needed Security alert: Potential malware detected System shutdown initiated
ml_results: {"summary": "\nAn administrator has accessed the system settings, triggered a security alert for potential malware detection, and initiated a system shutdown.",
"index_map": [7, 8, 9, 10], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}}
*************************** 4. row ***************************
id: 4
parsed_log_segment: System shutdown initiated
ml_results: {"summary": "\nThe system is shutting down.", "index_map": [10], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}}
The size of the output table is based on the
window_size
and
window_stride
parameters when the
model is trained. Since this use case does not set these
parameters, the default values of 10 for
window_size
and 3 for
window_stride
is used. See
Log Anomaly Detection Options
to learn more.
Review the following in the output table:
is_anomaly
to see if the row is
labeled as an anomaly (1) or normal (0).
normal
and
anomaly
to see the anomaly score
for each.
index_map
to see which rows in
the input table are included in the prediction based
on the window_size
and
window_stride
.
summary
to see the generated text
summary describing the anomaly.
Learn how to Score an Anomaly Detection Model.