MySQL HeatWave User Guide
MySQL HeatWave AutoML includes data drift detection for the following models:
Classification
Regression
Anomaly detection (as of MySQL 9.3.2)
Review how to Create a Machine Learning Model.
Review use cases for Classification Data and Regression Analysis.
Machine learning typically makes an assumption that the training data and test data are similar. Over time, the similarity between the training data and the test data can decrease. This is known as data drift.
You can monitor data drift in the model catalog and when
running the ML_PREDICT_ROW
and
ML_PREDICT_TABLE
routines.
For the model catalog, the model_metadata
column includes the training_drift_metric
JSON object literal, which contains mean
and variance
numeric values. See
Model
Metadata.
mean
and variance
indicate the quality of the trained drift detector, and both
values should be low. The more important value is
mean
, and if it is greater than 1.0, then
drift evaluation for the test results might not be reliable.
For the ML_PREDICT_ROW
and
ML_PREDICT_TABLE
routines, the
options
parameter includes the
additional_details
boolean value. If this
option is enabled, the ml_results
column
includes the drift
JSON object literal,
which contains the metric
numeric value and
the attribution_percent
JSON object
literal.
metric
indicates the similarity between
training and test data. A low value indicates similar
values. A value grater than 1.0 indicates data drift, and
the prediction results are questionable.
attribution_percent
indicates the top
three features that contribute to data drift for each
result. The higher the percentage value, the greater the
contribution.
The workflow to analyze data drift includes the following:
Run ML_TRAIN
to train the
machine learning model with either the
classification
or
regression
task.
When training is complete, query the
model_metadata
column and review the
mean
and variance
values.
Run the ML_PREDICT_ROW
or
ML_PREDICT_TABLE
routines
on the trained model with the
additional_details
option set to
true
.
Review the drift
parameter in
ml_results
.
To analyze data drift in model metadata:
Train the model with
ML_TRAIN
.
mysql> CALL sys.ML_TRAIN('table_name
', 'target_column_name
', JSON_OBJECT('task', 'task_name
'), @variable
);
Replace table_name
,
target_column_name
,
task_name
, and
variable
with your own values.
For example:
mysql> CALL sys.ML_TRAIN('census_data.census_train', 'revenue', JSON_OBJECT('task', 'classification'), @census_model);
Where:
census_data.census_train
is the
fully qualified name of the table that contains the
training dataset
(schema_name.table_name
).
revenue
is the name of the target
column, which contains ground truth values.
JSON_OBJECT('task',
'classification')
specifies the machine
learning task type.
@census_model
is the name of the
user-defined session variable that stores the model
handle for the duration of the connection. User
variables are written as
@
.
Any valid name for a user-defined variable is
permitted. For example, var_name
@my_model
.
Learn more about
Model
Handles.
Query the model_metadata
column from
the model catalog. Optionally, use
JSON_PRETTY
to view the output in an
easily readable format.
mysql> SELECT JSON_PRETTY(model_metadata) FROM ML_SCHEMA_user1
.MODEL_CATALOG WHERE model_handle=model_handle
;
Replace user1
with your own
user name and model_handle
with
your own model handle. For example:
mysql> SELECT JSON_PRETTY(model_metadata) FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle=@census_model;
+---------------------------------------------------------+
| JSON_PRETTY(model_metadata) |
+---------------------------------------------------------+
| {
"task": "classification",
"notes": null,
"chunks": 1,
"format": "HWMLv2.0",
"n_rows": 100,
"status": "Ready",
"options": {
"task": "classification",
"model_explainer": "permutation_importance",
"prediction_explainer": "permutation_importance"
},
"n_columns": 14,
"column_names": [
"age",
"workclass",
"fnlwgt",
"education",
"education-num",
"marital-status",
"occupation",
"relationship",
"race",
"sex",
"capital-gain",
"capital-loss",
"hours-per-week",
"native-country"
],
"contamination": null,
"model_quality": "high",
"training_time": 73.90254211425781,
"algorithm_name": "RandomForestClassifier",
"training_score": -0.35963335633277893,
"build_timestamp": 1744377124,
"n_selected_rows": 80,
"training_params": {
"recommend": "ratings",
"force_use_X": false,
"recommend_k": 3,
"remove_seen": true,
"ranking_topk": 10,
"lsa_components": 100,
"ranking_threshold": 1,
"feedback_threshold": 1
},
"train_table_name": "census_data.census_train",
"model_explanation": {
"permutation_importance": {
"age": -0.0057,
"sex": 0.0002,
"race": 0.0001,
"fnlwgt": 0.0103,
"education": 0.0108,
"workclass": 0.0189,
"occupation": 0.0,
"capital-gain": 0.0304,
"capital-loss": 0.0,
"relationship": 0.0195,
"education-num": 0.0152,
"hours-per-week": 0.0235,
"marital-status": 0.0099,
"native-country": 0.0
}
},
"n_selected_columns": 11,
"target_column_name": "revenue",
"optimization_metric": "neg_log_loss",
"selected_column_names": [
"age",
"capital-gain",
"education",
"education-num",
"fnlwgt",
"hours-per-week",
"marital-status",
"race",
"relationship",
"sex",
"workclass"
],
"training_drift_metric": {
"mean": 0.3535,
"variance": 0.0597
}
} |
+---------------------------------------------------------+
1 row in set (0.0009 sec)
Where:
JSON_PRETTY
displays the
information in an easily readable format.
ML_SCHEMA_user1.MODEL_CATALOG
refers to the model catalog name. Replace
user1
with your own user name.
model_handle
refers to the session
variable for the trained model,
@census_model
. Learn more about
Model
Handles.
For training_drift_metric
, the output
generates a mean
value of 0.3535 and a
variance
value of 0.0597, which indicates
acceptable data drift.
To analyze data drift detection with a table of predictions:
If not done already, train the model to use. See Analyze Data Drift in Model Metadata.
Load the trained model. Update
@census_model
with your own session
variable for the trained model.
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
Run ML_PREDICT_TABLE
to
generate a table of predictions.
mysql> CALL sys.ML_PREDICT_TABLE(table_name
, model_handle
, output_table_name
), [options
]);
Replace table_name
,
model_handle
,
output_table_name
),and
options
with your own values.
For example:
mysql> CALL sys.ML_PREDICT_TABLE('census_data.`census_test`', @census_model, 'census_data.`census_test_predictions`', JSON_OBJECT('additional_details', true));
Where:
census_data.census_test
is the
fully qualified name of the test dataset table
(database_name.table_name
).
@census_model
is the session
variable that contains the model handle. See
Work with
Model Handles.
census_data.census_test_predictions
is the output table where predictions are stored.
JSON_OBJECT
includes the
additional_details
option set to
true
, so
ml_results
includes values for
metric
and
attribution_percent
.
Since a metric
value over 1.0 indicates
data drift, query rows in the output table that only have
a metric value over 1.0.
mysql> SELECT ml_results FROM table_name
WHERE JSON_EXTRACT(ml_results, '$.drift.metric') > 1.0;
Replace table_name
with your
own value. For example:
mysql> SELECT ml_results FROM census_test_predictions WHERE JSON_EXTRACT(ml_results, '$.drift.metric') > 1.0;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ml_results |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.67, ">50K": 0.33}, "drift": {"metric": 2.46, "attribution_percent": {"relationship": 22.07, "education-num": 19.92, "education": 12.36}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.9, ">50K": 0.1}, "drift": {"metric": 1.32, "attribution_percent": {"age": 31.25, "relationship": 17.36, "capital-gain": 17.03}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.99, ">50K": 0.01}, "drift": {"metric": 1.1, "attribution_percent": {"capital-gain": 30.94, "relationship": 17.8, "workclass": 17.49}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.78, ">50K": 0.22}, "drift": {"metric": 2.42, "attribution_percent": {"hours-per-week": 29.37, "age": 28.52, "capital-gain": 23.85}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.97, ">50K": 0.03}, "drift": {"metric": 1.09, "attribution_percent": {"education": 22.18, "relationship": 16.57, "capital-gain": 13.57}}} |
| {"predictions": {"revenue": ">50K"}, "probabilities": {"<=50K": 0.32, ">50K": 0.68}, "drift": {"metric": 3.18, "attribution_percent": {"relationship": 26.41, "education-num": 12.8, "capital-gain": 8.16}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.96, ">50K": 0.04}, "drift": {"metric": 1.11, "attribution_percent": {"marital-status": 23.34, "race": 16.02, "education": 12.83}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.89, ">50K": 0.11}, "drift": {"metric": 1.4, "attribution_percent": {"age": 27.26, "race": 18.98, "relationship": 15.49}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.91, ">50K": 0.09}, "drift": {"metric": 1.99, "attribution_percent": {"race": 23.89, "capital-gain": 21.38, "education": 16.22}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.78, ">50K": 0.22}, "drift": {"metric": 2.33, "attribution_percent": {"capital-gain": 31.64, "hours-per-week": 15.49, "education": 7.94}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.89, ">50K": 0.11}, "drift": {"metric": 1.38, "attribution_percent": {"sex": 23.55, "workclass": 23.55, "education-num": 15.71}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.62, ">50K": 0.38}, "drift": {"metric": 4.33, "attribution_percent": {"fnlwgt": 21.08, "relationship": 14.24, "workclass": 5.27}}} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.0014 sec)
The output displays the rows with high metric values (> 1.0), indicating data drift.
To anayze data drift detection with one or more rows of predictions:
If not done already, train the model to use. See Analyze Data Drift in Model Metadata.
Load the trained model. Update
@census_model
with your own session
variable for the trained model.
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
Run ML_PREDICT_ROW
to
generate predictions for a defined number of rows.
mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT("output_col_name
", schema
.`input_col_name
`, "output_col_name
", schema
.`input_col_name
`, ...), model_handle
, options
) FROM input_table_name
LIMIT N
;
The following example generates predictions for three rows of the table. The output is similar to the previous example.
mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT(
"age", census_test.`age`,
"workclass", census_test.`workclass`,
"fnlwgt", census_test.`fnlwgt`,
"education", census_test.`education`,
"education-num", census_test.`education-num`,
"marital-status", census_test.`marital-status`,
"occupation", census_test.`occupation`,
"relationship", census_test.`relationship`,
"race", census_test.`race`,
"sex", census_test.`sex`,
"capital-gain", census_test.`capital-gain`,
"capital-loss", census_test.`capital-loss`,
"hours-per-week", census_test.`hours-per-week`,
"native-country", census_test.`native-country`),
@census_model, JSON_OBJECT('additional_details', TRUE))FROM census_data.census_test LIMIT 3;
+--------------------------------------------------+
| sys.ML_PREDICT_ROW(JSON_OBJECT(
"age", census_test.`age`,
"workclass", census_test.`workclass`,
"fnlwgt", census_test.`fnlwgt`,
"education", census_test.`education`,
"education-num", census_test.`education-num`,
"ma |
+-------------------------------------------------+
|{ |
| "age": 37, |
| "sex": "Male", |
| "race": "White", |
| "fnlwgt": 99146, |
| "education": "Bachelors", |
| "workclass": "Private", |
| "Prediction": "<=50K", |
| "ml_results": { |
| "drift": { |
| "metric": 0, |
| "attribution_percent": { |
| "age": 0, |
| "fnlwgt": 46.67, |
| "capital-gain": 0}}, |
| "predictions": { |
| "revenue": "<=50K"}, |
| "probabilities": { |
| ">50K": 0.42, |
| "<=50K": 0.58}}, |
| "occupation": "Exec-managerial", |
| "capital-gain": 0, |
| "capital-loss": 1977, |
| "relationship": "Husband", |
| "education-num": 13, |
| "hours-per-week": 50, |
| "marital-status": "Married-civ-spouse", |
| "native-country": "United-States"} |
|{ |
| "age": 34, |
| "sex": "Male", |
| "race": "White", |
| "fnlwgt": 27409, |
| "education": "9th", |
| "workclass": "Private", |
| "Prediction": "<=50K", |
| "ml_results": { |
| "drift": { |
| "metric": 0.1, |
| "attribution_percent": { |
| "fnlwgt": 25, |
| "education": 33.31, |
| "workclass": 16.22}}, |
| "predictions": { |
| "revenue": "<=50K"}, |
| "probabilities": { |
| ">50K": 0.24, |
| "<=50K": 0.76}}, |
| "occupation": "Craft-repair", |
| "capital-gain": 0, |
| "capital-loss": 0, |
| "relationship": "Husband", |
| "education-num": 5, |
| "hours-per-week": 50, |
| "marital-status": "Married-civ-spouse", |
| "native-country": "United-States"} |
|{ |
| "age": 30, |
| "sex": "Female", |
| "race": "White", |
| "fnlwgt": 299507, |
| "education": "Assoc-acdm", |
| "workclass": "Private", |
| "Prediction": "<=50K", |
| "ml_results": { |
| "drift": { |
| "metric": 0.26, |
| "attribution_percent": { |
| "relationship": 21.36, |
| "education-num": 28.33, |
| "hours-per-week": 33.21}}, |
| "predictions": { |
| "revenue": "<=50K"}, |
| "probabilities": { |
| ">50K": 0.01, |
| "<=50K": 0.99}}, |
| "occupation": "Other-service", |
| "capital-gain": 0, |
| "capital-loss": 0, |
| "relationship": "Unmarried", |
| "education-num": 12, |
| "hours-per-week": 40, |
| "marital-status": "Separated", |
| "native-country": "United-States"} |
+-------------------------------------------------+
10 rows in set (6.8109 sec)
Where:
The first JSON_OBJECT
has output
column names and key-value pairs of the columns in the
trained table.
@census_model
is the session
variable that contains the model handle. Learn more
about Model
Handles.
The second JSON_OBJECT
includes the
additional_details
option set to
true
, so
ml_results
includes values for
metric
and
attribution_percent
.
census_data.census_test
is the
fully qualified name of the test dataset table
(database_name.table_name
).
The LIMIT
of 3 means that the
output includes a maximum of three rows from the
trained table.
The output allows you to review data drift values for the selected rows.
Review Machine Learning Use Cases to create machine learning models with sample datasets.