MySQL HeatWave User Guide
After preparing the data for an anomaly detection model, you can train the model.
This topic has the following sections.
Review and complete all the tasks to Prepare Data for an Anomaly Detection Model.
Consider the following based on the type of anomaly detection you are running:
Set the task
parameter to
anomaly_detection
for running anomaly
detection on table data, or
log_anomaly_detection
for running
anomaly detection on log data (MySQL 9.2.2 and later).
If running an unsupervised model, the
target_column_name
parameter must be
set to NULL
.
If running a semi-supervised model:
The target_column_name
parameter
must specify a column whose only allowed values are
0 (normal), 1 (anomalous), and NULL (unlabeled). All
rows are used to train the unsupervised component,
while the rows with a value different than NULL are
used to train the supervised component.
The experimental
option must be
set to semisupervised
.
If running anomaly detection on log data (MySQL 9.2.2 and later), the input table can only have the following columns:
The column containing the logs.
If including logs from different sources, a column
containing the source of each log. Identify this
column with the log_source_column
option.
If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
At least one column must act as the primary key to
establish the temporal order of logs. If the primary
key column (or columns) is not one of the previous
required columns (log data, source of log, or
label), then you must use the
exclude_column_list
option when
running ML_TRAIN
to
exclude all primary key columns that don't include
required data. See
Syntax Examples for Anomaly Detection Training
to review relevant examples.
If the input table has additional columns to the
ones permitted, you must use the
exclude_column_list
option to
exclude irrelevant columns.
Use the following JSON
options
:
contamination
: Represents an estimate
of the percentage of outliers in the training table.
The contamination factor is calculated as: estimated number of rows with anomalies/total number of rows in the training table.
The contamination value must be greater than 0 and less than 0.5. The default value is 0.01.
As of MySQL 8.4.0, model_list
is
supported to allow the selection of the Principal
Component Analysis (PCA) model and Generalized Local
Outlier Factor (GLOF) model. If no option is specified,
the default model is Generalized kth Nearest Neighbors
(GkNN). Selecting more than one model or an unsupported
model produces an error.
MySQL 9.0.1-u1 introduces the following options to train a semi-supervised anomaly detection model:
supervised_submodel_options
: Allows
you to set optional override parameters for the
supervised model component. The only model supported is
DistanceWeightedKNNClassifier
. The
following parameters are supported:
n_neighbors
: Sets the desired k
value that checks the k closest neighbors for each
unclassified point. The default value is 5 and the
value must be an integer greater than 0.
min_labels
: Sets the minimum
number of labeled data points required to train the
supervised component. If fewer labeled data points
are provided during training of the model,
ML_TRAIN
fails. The
default value is 20 and the value must be an integer
greater than 0.
ensemble_score
: This option specifies
the metric to use to score the ensemble of unsupervised
and supervised components. It identifies the optimal
weight between the two components based on the metric.
The supported metrics are accuracy
,
precision
, recall
,
and f1
. The default metric is
f1
.
MySQL 9.2.2 introduces anomaly detection for log data. The
following options are available as a separate
JSON_OBJECT
named
logad_options
:
additional_masking_regex
: Allows you
to mask log data by using regular expression in a
JSON_ARRAY
. By default, the following
parameters are automatically masked during training and
when generating anomaly scores.
IP
DATETIME
TIME
HEX
IPPORT
OCID
window_size
: Specifies the maximum
number of log lines to be grouped for anomaly detection.
The default value is 10.
window_stride
: Specifies the stride
value to use for segmenting log lines. For example,
there is log A, B, C, D, and E. The
window_size
is 3, and the
window_stride
is 2. The first row has
log A, B, and C. The second row has log C, D, and E. If
this value is equal to window_size
,
there is no overlapping of log segments. The default
value is 3.
log_source_column
: Specifies the
column name that contains the source identifier of the
respective log lines. Log lines are grouped according to
their respective source (for example, logs from multiple
MySQL databases that are in the same table). By default,
all log lines are assumed to be from the same source.
The following options are not supported for anomaly detection:
exclude_model_list
optimization_metric
Before MySQL 8.4.0, model_list
is not
supported because the only supported algorithm model is
Generalized kth Nearest Neighbors (GkNN). As of MySQL
8.4.0, model_list
is supported to
allow the selection of the Principal Component Analysis
(PCA) model and Generalized Local Outlier Factor (GLOF)
model.
You cannot run the following routines for a trained anomaly detection model:
ML_PREDICT_ROW
(only for
anomaly detection for logs)
Train the model with the
ML_TRAIN
routine and use the
credit_card_train
table previously
created. Before training the model, it is good practice to
define the model handle instead of automatically creating
one. See
Define Model Handle.
Optionally, set the value of the session variable, which sets the model handle to this same value.
mysql> SET @variable
= 'model_handle
';
Replace @variable
and
model_handle
with your own
definitions. For example:
mysql> SET @semi_supervised_model='anomaly_detection_semi_supervised_use_case';
The model handle is set to
anomaly_detection_semi_supervised_use_case
.
Run the ML_TRAIN
routine.
mysql> CALL sys.ML_TRAIN('table_name
', 'target_column_name
', JSON_OBJECT('task', 'task_name
'), model_handle
);
Replace table_name
,
target_column_name
,
task_name
, and
model_handle
with your own
values.
The following example runs
ML_TRAIN
on the training
dataset previously created.
mysql> CALL sys.ML_TRAIN('anomaly_data.credit_card_train', "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' as JSON), @semi_supervised_model);
Where:
anomaly_data.credit_card_train
is
the fully qualified name of the table that contains
the training dataset
(database_name.table_name
).
target
is the name of the target
column, which contains ground truth values to use
for semi-supervised learning.
CAST('{"task": "anomaly_detection",
"experimental": {"semisupervised": {}}}' as
JSON)
specifies the machine learning task
type. The experimental
parameter
is required to use a semi-supervised learning model.
All default values are used for semi-supervised
learning.
@semi_supervised_model
is the
session variable previously set that defines the
model handle to the name defined by the user:
anomaly_detection_semi_supervised_use_case
.
If you do not define the model handle before
training the model, the model handle is
automatically generated, and the session variable
only stores the model handle for the duration of the
connection. User variables are written as
@
.
Any valid name for a user-defined variable is
permitted. See
Work with
Model Handles to learn more.
var_name
When the training operation finishes, the model handle
is assigned to the
@semi_supervised_model
session
variable, and the model is stored in the model catalog.
View the entry in the model catalog with the following
query. Replace user1
with
your MySQL account name.
mysql> SELECT model_id, model_handle, train_table_name FROM ML_SCHEMA_user1
.MODEL_CATALOG WHERE model_handle = 'anomaly_detection_semi_supervised_use_case';
+----------+----------------------------------------------+-------------------------------------+
| model_id | model_handle | train_table_name |
+----------+----------------------------------------------+-------------------------------------+
| 3 | anomaly_detection_semi_supervised_use_case | anomaly_data.credit_card_train |
+----------+----------------------------------------------+-------------------------------------+
Train the model with the
ML_TRAIN
routine and use the
training_data
table previously created.
Before training the model, it is good practice to define the
model handle instead of automatically creating one. See
Define Model Handle.
Optionally, set the value of the session variable, which sets the model handle to this same value.
mysql> SET @variable
= 'model_handle
';
Replace @variable
and
model_handle
with your own
definitions. For example:
mysql> SET @unsupervised_log_model='anomaly_detection_log_use_case';
The model handle is set to
anomaly_detection_log_use_case
.
Run the ML_TRAIN
routine.
mysql> CALL sys.ML_TRAIN('table_name
', 'target_column_name
', JSON_OBJECT('task', 'task_name
'), model_handle
);
Replace table_name
,
target_column_name
,
task_name
, and
model_handle
with your own
values.
The following example runs
ML_TRAIN
on the training
dataset previously created.
mysql> CALL sys.ML_TRAIN('anomaly_log_data.training_data', NULL, JSON_OBJECT('task', 'log_anomaly_detection', 'exclude_column_list',
JSON_ARRAY('log_id', 'timestamp', 'target')), @unsupervised_log_model);
Where:
anomaly_log_data.training_data
is
the fully qualified name of the table that contains
the training dataset
(database_name.table_name
).
NULL
is set for the target column
because it is an unsupervised learning model, so no
labeled data is used to train the model.
JSON_OBJECT('task',
'log_anomaly_detection'
specifies the
machine learning task type.
'exclude_column_list', JSON_ARRAY('log_id',
'timestamp', 'target')
sets the required
options to run the model for anomaly detection on
logs. The columns log_id
and
timestamp
are excluded because
they are not any of the required columns for
training. See
Requirements for Anomaly Detection Training
to learn more. The target
column
is excluded because it is an unsupervised learning
model.
@unsupervised_log_model
is the
session variable previously set that defines the
model handle to the name defined by the user:
anomaly_detection_log_use_case
.
If you do not define the model handle before
training the model, the model handle is
automatically generated, and the session variable
only stores the model handle for the duration of the
connection. User variables are written as
@
.
Any valid name for a user-defined variable is
permitted. See
Work with
Model Handles to learn more.
var_name
When the training operation finishes, the model handle
is assigned to the
@unsupervised_log_model
session
variable, and the model is stored in the model catalog.
View the entry in the model catalog with the following
query. Replace user1
with
your MySQL account name.
mysql> SELECT model_id, model_handle, train_table_name FROM ML_SCHEMA_user1
.MODEL_CATALOG WHERE model_handle = 'anomaly_detection_log_use_case';
+----------+----------------------------------------------+-------------------------------------+
| model_id | model_handle | train_table_name |
+----------+----------------------------------------------+-------------------------------------+
| 4 | anomaly_detection_log_use_case | anomaly_log_data.training_data |
+----------+----------------------------------------------+-------------------------------------+
Learn how to Generate Predictions for an Anomaly Detection Model
Review additional Syntax Examples for Anomaly Detection Training