MySQL AI User Guide
MySQL AI includes data drift detection for classification and regression models.
Review how to Create a Machine Learning Model.
Review use cases for Classification Data and Regression Analysis.
Machine learning typically makes an assumption that the training data and test data are similar. Over time, the similarity between the training data and the test data can decrease. This is known as data drift.
          You can monitor data drift in the model catalog and when
          running the
          ML_PREDICT_ROW
          and
          ML_PREDICT_TABLE
          routines.
        
          For the model catalog, the model_metadata
          column includes the training_drift_metric
          JSON object literal, which contains mean
          and variance numeric values. See
          Model
          Metadata.
        
          mean and variance
          indicate the quality of the trained drift detector, and both
          values should be low. The more important value is
          mean, and if it is greater than 1.0, then
          drift evaluation for the test results might not be reliable.
        
          For the
          ML_PREDICT_ROW
          and
          ML_PREDICT_TABLE
          routines, the options parameter
          includes the additional_details boolean
          value. If this option is enabled, the
          ml_results column includes the
          drift JSON object literal, which contains
          the metric numeric value and the
          attribution_percent JSON object literal.
        
              metric indicates the similarity between
              training and test data. A low value indicates similar
              values. A value grater than 1.0 indicates data drift, and
              the prediction results are questionable.
            
              attribution_percent indicates the top
              three features that contribute to data drift for each
              result. The higher the percentage value, the greater the
              contribution.
            
The workflow to analyze data drift includes the following:
              Run
              ML_TRAIN
              to train the machine learning model with either the
              classification or
              regression task.
            
              When training is complete, query the
              model_metadata column and review the
              mean and variance
              values.
            
              Run the
              ML_PREDICT_ROW
              or
              ML_PREDICT_TABLE
              routines on the trained model with the
              additional_details option set to
              true.
            
              Review the drift parameter in
              ml_results.
            
To analyze data drift in model metadata:
              Train the model with
              ML_TRAIN.
            
mysql> CALL sys.ML_TRAIN('table_name', 'target_column_name', JSON_OBJECT('task', 'task_name'), @variable);
              Replace table_name,
              target_column_name,
              task_name, and
              variable with your own values.
              For example:
            
mysql> CALL sys.ML_TRAIN('census_data.census_train', 'revenue', JSON_OBJECT('task', 'classification'), @census_model);
Where:
                  census_data.census_train is the
                  fully qualified name of the table that contains the
                  training dataset
                  (schema_name.table_name).
                
                  revenue is the name of the target
                  column, which contains ground truth values.
                
                  JSON_OBJECT('task',
                  'classification') specifies the machine
                  learning task type.
                
                  @census_model is the name of the
                  user-defined session variable that stores the model
                  handle for the duration of the connection. User
                  variables are written as
                  @.
                  Any valid name for a user-defined variable is
                  permitted. For example, var_name@my_model.
                  Learn more about
                  Model
                  Handles.
                
              Query the model_metadata column from
              the model catalog. Optionally, use
              JSON_PRETTY to view the output in an
              easily readable format.
            
mysql> SELECT JSON_PRETTY(model_metadata) FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle=model_handle;
              Replace user1 with your own
              user name and model_handle with
              your own model handle. For example:
            
mysql> SELECT JSON_PRETTY(model_metadata) FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle=@census_model;
+---------------------------------------------------------+
| JSON_PRETTY(model_metadata)                             |
+---------------------------------------------------------+
| {
  "task": "classification",
  "notes": null,
  "chunks": 1,
  "format": "HWMLv2.0",
  "n_rows": 100,
  "status": "Ready",
  "options": {
    "task": "classification",
    "model_explainer": "permutation_importance",
    "prediction_explainer": "permutation_importance"
  },
  "n_columns": 14,
  "column_names": [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country"
  ],
  "contamination": null,
  "model_quality": "high",
  "training_time": 73.90254211425781,
  "algorithm_name": "RandomForestClassifier",
  "training_score": -0.35963335633277893,
  "build_timestamp": 1744377124,
  "n_selected_rows": 80,
  "training_params": {
    "recommend": "ratings",
    "force_use_X": false,
    "recommend_k": 3,
    "remove_seen": true,
    "ranking_topk": 10,
    "lsa_components": 100,
    "ranking_threshold": 1,
    "feedback_threshold": 1
  },
  "train_table_name": "census_data.census_train",
  "model_explanation": {
    "permutation_importance": {
      "age": -0.0057,
      "sex": 0.0002,
      "race": 0.0001,
      "fnlwgt": 0.0103,
      "education": 0.0108,
      "workclass": 0.0189,
      "occupation": 0.0,
      "capital-gain": 0.0304,
      "capital-loss": 0.0,
      "relationship": 0.0195,
      "education-num": 0.0152,
      "hours-per-week": 0.0235,
      "marital-status": 0.0099,
      "native-country": 0.0
    }
  },
  "n_selected_columns": 11,
  "target_column_name": "revenue",
  "optimization_metric": "neg_log_loss",
  "selected_column_names": [
    "age",
    "capital-gain",
    "education",
    "education-num",
    "fnlwgt",
    "hours-per-week",
    "marital-status",
    "race",
    "relationship",
    "sex",
    "workclass"
  ],
  "training_drift_metric": {
    "mean": 0.3535,
    "variance": 0.0597
  }
} |
+---------------------------------------------------------+
1 row in set (0.0009 sec)
Where:
                  JSON_PRETTY displays the
                  information in an easily readable format.
                
                  ML_SCHEMA_user1.MODEL_CATALOG
                  refers to the model catalog name. Replace
                  user1 with your own user name.
                
                  model_handle refers to the session
                  variable for the trained model,
                  @census_model. Learn more about
                  Model
                  Handles.
                
          For training_drift_metric, the output
          generates a mean value of 0.3535 and a
          variance value of 0.0597, which indicates
          acceptable data drift.
        
To analyze data drift detection with a table of predictions:
If not done already, train the model to use. See Analyzing Data Drift in Model Metadata.
              Load the trained model. Update
              @census_model with your own session
              variable for the trained model.
            
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
              Run
              ML_PREDICT_TABLE
              to generate a table of predictions.
            
mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name), [options]);
              Replace table_name,
              model_handle,
              output_table_name),and
              options with your own values.
              For example:
            
mysql> CALL sys.ML_PREDICT_TABLE('census_data.`census_test`', @census_model, 'census_data.`census_test_predictions`', 
                                  JSON_OBJECT('additional_details', true));
Where:
                  census_data.census_test is the
                  fully qualified name of the test dataset table
                  (database_name.table_name).
                
                  @census_model is the session
                  variable that contains the model handle. See
                  Work with
                  Model Handles.
                
                  census_data.census_test_predictions
                  is the output table where predictions are stored.
                
                  JSON_OBJECT includes the
                  additional_details option set to
                  true, so
                  ml_results includes values for
                  metric and
                  attribution_percent.
                
              Since a metric value over 1.0 indicates
              data drift, query rows in the output table that only have
              a metric value over 1.0.
            
mysql> SELECT ml_results FROM table_name WHERE JSON_EXTRACT(ml_results, '$.drift.metric') > 1.0;
              Replace table_name with your
              own value. For example:
            
mysql> SELECT ml_results FROM census_test_predictions WHERE JSON_EXTRACT(ml_results, '$.drift.metric') > 1.0;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ml_results                                                                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.67, ">50K": 0.33}, "drift": {"metric": 2.46, "attribution_percent": {"relationship": 22.07, "education-num": 19.92, "education": 12.36}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.9, ">50K": 0.1}, "drift": {"metric": 1.32, "attribution_percent": {"age": 31.25, "relationship": 17.36, "capital-gain": 17.03}}}          |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.99, ">50K": 0.01}, "drift": {"metric": 1.1, "attribution_percent": {"capital-gain": 30.94, "relationship": 17.8, "workclass": 17.49}}}    |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.78, ">50K": 0.22}, "drift": {"metric": 2.42, "attribution_percent": {"hours-per-week": 29.37, "age": 28.52, "capital-gain": 23.85}}}      |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.97, ">50K": 0.03}, "drift": {"metric": 1.09, "attribution_percent": {"education": 22.18, "relationship": 16.57, "capital-gain": 13.57}}}  |
| {"predictions": {"revenue": ">50K"}, "probabilities": {"<=50K": 0.32, ">50K": 0.68}, "drift": {"metric": 3.18, "attribution_percent": {"relationship": 26.41, "education-num": 12.8, "capital-gain": 8.16}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.96, ">50K": 0.04}, "drift": {"metric": 1.11, "attribution_percent": {"marital-status": 23.34, "race": 16.02, "education": 12.83}}}        |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.89, ">50K": 0.11}, "drift": {"metric": 1.4, "attribution_percent": {"age": 27.26, "race": 18.98, "relationship": 15.49}}}                 |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.91, ">50K": 0.09}, "drift": {"metric": 1.99, "attribution_percent": {"race": 23.89, "capital-gain": 21.38, "education": 16.22}}}          |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.78, ">50K": 0.22}, "drift": {"metric": 2.33, "attribution_percent": {"capital-gain": 31.64, "hours-per-week": 15.49, "education": 7.94}}} |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.89, ">50K": 0.11}, "drift": {"metric": 1.38, "attribution_percent": {"sex": 23.55, "workclass": 23.55, "education-num": 15.71}}}          |
| {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.62, ">50K": 0.38}, "drift": {"metric": 4.33, "attribution_percent": {"fnlwgt": 21.08, "relationship": 14.24, "workclass": 5.27}}}         |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.0014 sec)
The output displays the rows with high metric values (> 1.0), indicating data drift.
To analyze data drift detection with one or more rows of predictions:
If not done already, train the model to use. See Analyzing Data Drift in Model Metadata.
              Load the trained model. Update
              @census_model with your own session
              variable for the trained model.
            
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
              Run
              ML_PREDICT_ROW
              to generate predictions for a defined number of rows.
            
mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT("output_col_name", schema.`input_col_name`, 
                                              "output_col_name", schema.`input_col_name`, ...), 
                                              model_handle, options) FROM input_table_name LIMIT N;
The following example generates predictions for three rows of the table. The output is similar to the previous example.
mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT(
	"age", census_test.`age`,
	"workclass", census_test.`workclass`,
	"fnlwgt", census_test.`fnlwgt`,
	"education", census_test.`education`,
	"education-num", census_test.`education-num`,
	"marital-status", census_test.`marital-status`,
	"occupation", census_test.`occupation`,
	"relationship", census_test.`relationship`,
	"race", census_test.`race`,
	"sex", census_test.`sex`,
	"capital-gain", census_test.`capital-gain`,
	"capital-loss", census_test.`capital-loss`,
	"hours-per-week", census_test.`hours-per-week`,
	"native-country", census_test.`native-country`),
	@census_model, JSON_OBJECT('additional_details', TRUE))FROM census_data.census_test LIMIT 3;
+--------------------------------------------------+
| sys.ML_PREDICT_ROW(JSON_OBJECT(
"age", census_test.`age`,
"workclass", census_test.`workclass`,
"fnlwgt", census_test.`fnlwgt`,
"education", census_test.`education`,
"education-num", census_test.`education-num`,
"ma                                               |
+-------------------------------------------------+
|{                                                |
|    "age": 37,                                   |
|    "sex": "Male",                               |
|    "race": "White",                             |
|    "fnlwgt": 99146,                             |
|    "education": "Bachelors",                    |
|    "workclass": "Private",                      |
|    "Prediction": "<=50K",                       |
|    "ml_results": {                              |
|        "drift": {                               |
|            "metric": 0,                         |
|            "attribution_percent": {             |
|                "age": 0,                        |
|                "fnlwgt": 46.67,                 |
|                "capital-gain": 0}},             |
|        "predictions": {                         |
|            "revenue": "<=50K"},                 |
|        "probabilities": {                       |
|            ">50K": 0.42,                        |
|            "<=50K": 0.58}},                     |
|    "occupation": "Exec-managerial",             |
|    "capital-gain": 0,                           |
|    "capital-loss": 1977,                        |
|    "relationship": "Husband",                   |
|    "education-num": 13,                         |
|    "hours-per-week": 50,                        |
|    "marital-status": "Married-civ-spouse",      |
|    "native-country": "United-States"}           |
|{                                                |
|    "age": 34,                                   |
|    "sex": "Male",                               |
|    "race": "White",                             |
|    "fnlwgt": 27409,                             |
|    "education": "9th",                          |
|    "workclass": "Private",                      |
|    "Prediction": "<=50K",                       |
|    "ml_results": {                              |
|        "drift": {                               |
|            "metric": 0.1,                       |
|            "attribution_percent": {             |
|                "fnlwgt": 25,                    |
|                "education": 33.31,              |
|                "workclass": 16.22}},            |
|        "predictions": {                         |
|            "revenue": "<=50K"},                 |
|        "probabilities": {                       |
|            ">50K": 0.24,                        |
|            "<=50K": 0.76}},                     |
|    "occupation": "Craft-repair",                |
|    "capital-gain": 0,                           |
|    "capital-loss": 0,                           |
|    "relationship": "Husband",                   |
|    "education-num": 5,                          |
|    "hours-per-week": 50,                        |
|    "marital-status": "Married-civ-spouse",      |
|    "native-country": "United-States"}           |
|{                                                |
|    "age": 30,                                   |
|    "sex": "Female",                             |
|    "race": "White",                             |
|    "fnlwgt": 299507,                            |
|    "education": "Assoc-acdm",                   |
|    "workclass": "Private",                      |
|    "Prediction": "<=50K",                       |
|    "ml_results": {                              |
|        "drift": {                               |
|            "metric": 0.26,                      |
|            "attribution_percent": {             |
|                "relationship": 21.36,           |
|                "education-num": 28.33,          |
|                "hours-per-week": 33.21}},       |
|        "predictions": {                         |
|            "revenue": "<=50K"},                 |
|        "probabilities": {                       |
|            ">50K": 0.01,                        |
|            "<=50K": 0.99}},                     |
|    "occupation": "Other-service",               |
|    "capital-gain": 0,                           |
|    "capital-loss": 0,                           |
|    "relationship": "Unmarried",                 |
|    "education-num": 12,                         |
|    "hours-per-week": 40,                        |
|    "marital-status": "Separated",               |
|    "native-country": "United-States"}           |
+-------------------------------------------------+
10 rows in set (6.8109 sec)
Where:
                  The first JSON_OBJECT has output
                  column names and key-value pairs of the columns in the
                  trained table.
                
                  @census_model is the session
                  variable that contains the model handle. Learn more
                  about Model
                  Handles.
                
                  The second JSON_OBJECT includes the
                  additional_details option set to
                  true, so
                  ml_results includes values for
                  metric and
                  attribution_percent.
                
                  census_data.census_test is the
                  fully qualified name of the test dataset table
                  (database_name.table_name).
                
                  The LIMIT of 3 means that the
                  output includes a maximum of three rows from the
                  trained table.
                
The output allows you to review data drift values for the selected rows.
Review Machine Learning Use Cases to create machine learning models with sample datasets.