4.7.8 Analyzing Data Drift

Before You Begin

Review how to Create a Machine Learning Model.
Review use cases for Classification Data and Regression Analysis.

Data Drift Detection Overview

Machine learning typically makes an assumption that the training data and test data are similar. Over time, the similarity between the training data and the test data can decrease. This is known as data drift.

You can monitor data drift in the model catalog and when running the ML_PREDICT_ROW and ML_PREDICT_TABLE routines.

For the model catalog, the model_metadata column includes the training_drift_metric JSON object literal, which contains mean and variance numeric values. See Model Metadata.

mean and variance indicate the quality of the trained drift detector, and both values should be low. The more important value is mean, and if it is greater than 1.0, then drift evaluation for the test results might not be reliable.

For the ML_PREDICT_ROW and ML_PREDICT_TABLE routines, the options parameter includes the additional_details boolean value. If this option is enabled, the ml_results column includes the drift JSON object literal, which contains the metric numeric value and the attribution_percent JSON object literal.

metric indicates the similarity between training and test data. A low value indicates similar values. A value grater than 1.0 indicates data drift, and the prediction results are questionable.
attribution_percent indicates the top three features that contribute to data drift for each result. The higher the percentage value, the greater the contribution.

Workflow to Analyze Data Drift

The workflow to analyze data drift includes the following:

Run ML_TRAIN to train the machine learning model with either the classification or regression task.
When training is complete, query the model_metadata column and review the mean and variance values.
Run the ML_PREDICT_ROW or ML_PREDICT_TABLE routines on the trained model with the additional_details option set to true.
Review the drift parameter in ml_results.

Analyzing Data Drift in Model Metadata

To analyze data drift in model metadata:

Train the model with ML_TRAIN.
```
mysql> CALL sys.ML_TRAIN('table_name', 'target_column_name', JSON_OBJECT('task', 'task_name'), @variable);
```
Replace table_name, target_column_name, task_name, and variable with your own values. For example:
```
mysql> CALL sys.ML_TRAIN('census_data.census_train', 'revenue', JSON_OBJECT('task', 'classification'), @census_model);
```
Where:
- census_data.census_train is the fully qualified name of the table that contains the training dataset (schema_name.table_name).
- revenue is the name of the target column, which contains ground truth values.
- JSON_OBJECT('task', 'classification') specifies the machine learning task type.
- @census_model is the name of the user-defined session variable that stores the model handle for the duration of the connection. User variables are written as @var_name. Any valid name for a user-defined variable is permitted. For example, @my_model. Learn more about Model Handles.

Query the model_metadata column from the model catalog. Optionally, use JSON_PRETTY to view the output in an easily readable format.

mysql> SELECT JSON_PRETTY(model_metadata) FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle=model_handle;

Replace user1 with your own user name and model_handle with your own model handle. For example:

mysql> SELECT JSON_PRETTY(model_metadata) FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle=@census_model;
+---------------------------------------------------------+
| JSON_PRETTY(model_metadata)                             |
+---------------------------------------------------------+
| {
  "task": "classification",
  "notes": null,
  "chunks": 1,
  "format": "HWMLv2.0",
  "n_rows": 100,
  "status": "Ready",
  "options": {
    "task": "classification",
    "model_explainer": "permutation_importance",
    "prediction_explainer": "permutation_importance"
  },
  "n_columns": 14,
  "column_names": [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country"
  ],
  "contamination": null,
  "model_quality": "high",
  "training_time": 73.90254211425781,
  "algorithm_name": "RandomForestClassifier",
  "training_score": -0.35963335633277893,
  "build_timestamp": 1744377124,
  "n_selected_rows": 80,
  "training_params": {
    "recommend": "ratings",
    "force_use_X": false,
    "recommend_k": 3,
    "remove_seen": true,
    "ranking_topk": 10,
    "lsa_components": 100,
    "ranking_threshold": 1,
    "feedback_threshold": 1
  },
  "train_table_name": "census_data.census_train",
  "model_explanation": {
    "permutation_importance": {
      "age": -0.0057,
      "sex": 0.0002,
      "race": 0.0001,
      "fnlwgt": 0.0103,
      "education": 0.0108,
      "workclass": 0.0189,
      "occupation": 0.0,
      "capital-gain": 0.0304,
      "capital-loss": 0.0,
      "relationship": 0.0195,
      "education-num": 0.0152,
      "hours-per-week": 0.0235,
      "marital-status": 0.0099,
      "native-country": 0.0
    }
  },
  "n_selected_columns": 11,
  "target_column_name": "revenue",
  "optimization_metric": "neg_log_loss",
  "selected_column_names": [
    "age",
    "capital-gain",
    "education",
    "education-num",
    "fnlwgt",
    "hours-per-week",
    "marital-status",
    "race",
    "relationship",
    "sex",
    "workclass"
  ],
  "training_drift_metric": {
    "mean": 0.3535,
    "variance": 0.0597
  }
} |
+---------------------------------------------------------+
1 row in set (0.0009 sec)

Where:

JSON_PRETTY displays the information in an easily readable format.
ML_SCHEMA_user1.MODEL_CATALOG refers to the model catalog name. Replace user1 with your own user name.
model_handle refers to the session variable for the trained model, @census_model. Learn more about Model Handles.

For training_drift_metric, the output generates a mean value of 0.3535 and a variance value of 0.0597, which indicates acceptable data drift.

Analyzing Data Drift Detection with ML_PREDICT_TABLE

To analyze data drift detection with a table of predictions:

If not done already, train the model to use. See Analyzing Data Drift in Model Metadata.
Load the trained model. Update @census_model with your own session variable for the trained model.
```
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
```
Run ML_PREDICT_TABLE to generate a table of predictions.
```
mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name), [options]);
```
Replace table_name, model_handle, output_table_name),and options with your own values. For example:
```
mysql> CALL sys.ML_PREDICT_TABLE('census_data.`census_test`', @census_model, 'census_data.`census_test_predictions`', 
                                  JSON_OBJECT('additional_details', true));
```
Where:
- census_data.census_test is the fully qualified name of the test dataset table (database_name.table_name).
- @census_model is the session variable that contains the model handle. See Work with Model Handles.
- census_data.census_test_predictions is the output table where predictions are stored.
- JSON_OBJECT includes the additional_details option set to true, so ml_results includes values for metric and attribution_percent.

Since a metric value over 1.0 indicates data drift, query rows in the output table that only have a metric value over 1.0.

mysql> SELECT ml_results FROM table_name WHERE JSON_EXTRACT(ml_results, '$.drift.metric') > 1.0;

Replace table_name with your own value. For example:

mysql> SELECT ml_results FROM census_test_predictions WHERE JSON_EXTRACT(ml_results, '$.drift.metric') > 1.0;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ml_results                                                                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.67, ">50K": 0.33}, "drift": {"metric": 2.46, "attribution_percent": {"relationship": 22.07, "education-num": 19.92, "education": 12.36}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.9, ">50K": 0.1}, "drift": {"metric": 1.32, "attribution_percent": {"age": 31.25, "relationship": 17.36, "capital-gain": 17.03}}}          |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.99, ">50K": 0.01}, "drift": {"metric": 1.1, "attribution_percent": {"capital-gain": 30.94, "relationship": 17.8, "workclass": 17.49}}}    |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.78, ">50K": 0.22}, "drift": {"metric": 2.42, "attribution_percent": {"hours-per-week": 29.37, "age": 28.52, "capital-gain": 23.85}}}      |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.97, ">50K": 0.03}, "drift": {"metric": 1.09, "attribution_percent": {"education": 22.18, "relationship": 16.57, "capital-gain": 13.57}}}  |
| {"predictions": {"revenue": ">50K"}, "probabilities": {"<=50K": 0.32, ">50K": 0.68}, "drift": {"metric": 3.18, "attribution_percent": {"relationship": 26.41, "education-num": 12.8, "capital-gain": 8.16}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.96, ">50K": 0.04}, "drift": {"metric": 1.11, "attribution_percent": {"marital-status": 23.34, "race": 16.02, "education": 12.83}}}        |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.89, ">50K": 0.11}, "drift": {"metric": 1.4, "attribution_percent": {"age": 27.26, "race": 18.98, "relationship": 15.49}}}                 |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.91, ">50K": 0.09}, "drift": {"metric": 1.99, "attribution_percent": {"race": 23.89, "capital-gain": 21.38, "education": 16.22}}}          |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.78, ">50K": 0.22}, "drift": {"metric": 2.33, "attribution_percent": {"capital-gain": 31.64, "hours-per-week": 15.49, "education": 7.94}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.89, ">50K": 0.11}, "drift": {"metric": 1.38, "attribution_percent": {"sex": 23.55, "workclass": 23.55, "education-num": 15.71}}}          |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.62, ">50K": 0.38}, "drift": {"metric": 4.33, "attribution_percent": {"fnlwgt": 21.08, "relationship": 14.24, "workclass": 5.27}}}         |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.0014 sec)

The output displays the rows with high metric values (> 1.0), indicating data drift.

Analyzing Data Drift Detection with ML_PREDICT_ROW

To analyze data drift detection with one or more rows of predictions:

If not done already, train the model to use. See Analyzing Data Drift in Model Metadata.
Load the trained model. Update @census_model with your own session variable for the trained model.
```
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
```

Run ML_PREDICT_ROW to generate predictions for a defined number of rows.

mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT("output_col_name", schema.`input_col_name`, 
                                              "output_col_name", schema.`input_col_name`, ...), 
                                              model_handle, options) FROM input_table_name LIMIT N;

The following example generates predictions for three rows of the table. The output is similar to the previous example.

mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT(
	"age", census_test.`age`,
	"workclass", census_test.`workclass`,
	"fnlwgt", census_test.`fnlwgt`,
	"education", census_test.`education`,
	"education-num", census_test.`education-num`,
	"marital-status", census_test.`marital-status`,
	"occupation", census_test.`occupation`,
	"relationship", census_test.`relationship`,
	"race", census_test.`race`,
	"sex", census_test.`sex`,
	"capital-gain", census_test.`capital-gain`,
	"capital-loss", census_test.`capital-loss`,
	"hours-per-week", census_test.`hours-per-week`,
	"native-country", census_test.`native-country`),
	@census_model, JSON_OBJECT('additional_details', TRUE))FROM census_data.census_test LIMIT 3;
+--------------------------------------------------+
| sys.ML_PREDICT_ROW(JSON_OBJECT(
"age", census_test.`age`,
"workclass", census_test.`workclass`,
"fnlwgt", census_test.`fnlwgt`,
"education", census_test.`education`,
"education-num", census_test.`education-num`,
"ma                                               |
+-------------------------------------------------+
|{                                                |
|    "age": 37,                                   |
|    "sex": "Male",                               |
|    "race": "White",                             |
|    "fnlwgt": 99146,                             |
|    "education": "Bachelors",                    |
|    "workclass": "Private",                      |
|    "Prediction": "<=50K",                       |
|    "ml_results": {                              |
|        "drift": {                               |
|            "metric": 0,                         |
|            "attribution_percent": {             |
|                "age": 0,                        |
|                "fnlwgt": 46.67,                 |
|                "capital-gain": 0}},             |
|        "predictions": {                         |
|            "revenue": "<=50K"},                 |
|        "probabilities": {                       |
|            ">50K": 0.42,                        |
|            "<=50K": 0.58}},                     |
|    "occupation": "Exec-managerial",             |
|    "capital-gain": 0,                           |
|    "capital-loss": 1977,                        |
|    "relationship": "Husband",                   |
|    "education-num": 13,                         |
|    "hours-per-week": 50,                        |
|    "marital-status": "Married-civ-spouse",      |
|    "native-country": "United-States"}           |
|{                                                |
|    "age": 34,                                   |
|    "sex": "Male",                               |
|    "race": "White",                             |
|    "fnlwgt": 27409,                             |
|    "education": "9th",                          |
|    "workclass": "Private",                      |
|    "Prediction": "<=50K",                       |
|    "ml_results": {                              |
|        "drift": {                               |
|            "metric": 0.1,                       |
|            "attribution_percent": {             |
|                "fnlwgt": 25,                    |
|                "education": 33.31,              |
|                "workclass": 16.22}},            |
|        "predictions": {                         |
|            "revenue": "<=50K"},                 |
|        "probabilities": {                       |
|            ">50K": 0.24,                        |
|            "<=50K": 0.76}},                     |
|    "occupation": "Craft-repair",                |
|    "capital-gain": 0,                           |
|    "capital-loss": 0,                           |
|    "relationship": "Husband",                   |
|    "education-num": 5,                          |
|    "hours-per-week": 50,                        |
|    "marital-status": "Married-civ-spouse",      |
|    "native-country": "United-States"}           |
|{                                                |
|    "age": 30,                                   |
|    "sex": "Female",                             |
|    "race": "White",                             |
|    "fnlwgt": 299507,                            |
|    "education": "Assoc-acdm",                   |
|    "workclass": "Private",                      |
|    "Prediction": "<=50K",                       |
|    "ml_results": {                              |
|        "drift": {                               |
|            "metric": 0.26,                      |
|            "attribution_percent": {             |
|                "relationship": 21.36,           |
|                "education-num": 28.33,          |
|                "hours-per-week": 33.21}},       |
|        "predictions": {                         |
|            "revenue": "<=50K"},                 |
|        "probabilities": {                       |
|            ">50K": 0.01,                        |
|            "<=50K": 0.99}},                     |
|    "occupation": "Other-service",               |
|    "capital-gain": 0,                           |
|    "capital-loss": 0,                           |
|    "relationship": "Unmarried",                 |
|    "education-num": 12,                         |
|    "hours-per-week": 40,                        |
|    "marital-status": "Separated",               |
|    "native-country": "United-States"}           |
+-------------------------------------------------+
10 rows in set (6.8109 sec)

Where:

The first JSON_OBJECT has output column names and key-value pairs of the columns in the trained table.
@census_model is the session variable that contains the model handle. Learn more about Model Handles.
The second JSON_OBJECT includes the additional_details option set to true, so ml_results includes values for metric and attribution_percent.
census_data.census_test is the fully qualified name of the test dataset table (database_name.table_name).
The LIMIT of 3 means that the output includes a maximum of three rows from the trained table.

The output allows you to review data drift values for the selected rows.

What's Next

Review Machine Learning Use Cases to create machine learning models with sample datasets.