MySQL HeatWave User Guide

6.7.4.6 Generate Predictions for an Anomaly Detection Model

After training the model, you can generate predictions.

To generate predictions, use the sample data from the two anomaly detection datasets: credit_card_train and training_data. Both datasets have labeled and unlabeled rows, but only the dataset for semi-supervised learning uses this for training. The other dataset for log data is trained using unsupervised learning. Having labels for both datasets allows you to compare the predictions to the actual values and determine if the predictions are reliable. Once you determine the trained model is reliable for generating predictions, you can start using it on unseen data.

Anomaly detection models produce anomaly scores, which represent the degree to which a data point deviates from the expected normal behavior. Higher scores indicate a greater degree of abnormality, potentially signaling an anomaly that warrants further investigation. In the results, is_anomaly generates a value of 1 for an anomaly, or 0 for normal. The normal value represents the degree to which the row of data or log segment exhibits normal behavior. The anomaly value represents the degree to which the row of data or log segment exhibits anomalous behavior.

To detect anomalies, run the ML_PREDICT routines on data with the same columns as the training model.

This topic has the following sections.

Before You Begin

Complete the following tasks:

Anomaly Detection Model Options

The threshold you set on anomaly detection models determines which rows in the output table are labeled as anomalies. The value for the threshold is the degree to which a row of data or log segment is considered for anomaly detection. Any sample with an anomaly score above the threshold is classified an anomaly.

There are two ways to set threshold values for anomaly detection models.

Set the Contamination Value

You can set the contamination option for the ML_TRAIN routine. This option uses the following calculation to set the threshold: (1 - contamination)-th percentile of all the anomaly scores.

The default contamination value is 0.01. The default threshold value based on the default contamination value is the 0.99-th percentile of all the anomaly scores.

Set the Threshold Value

You can set the threshold option for the ML_PREDICT_TABLE, ML_PREDICT_ROW, and ML_SCORE routines. The value must be greater than 0 and less than 1.

If no value is set for the threshold option, the value set for the contamination option in the ML_TRAIN routine determines the threshold.

The following additional options are available:

  • An alternative to threshold is topk. The results include the top K rows with the highest anomaly scores. The ML_PREDICT_TABLE and ML_SCORE routines include the topk option, which is an integer between 1 and the table length.

  • ML_SCORE includes an options parameter in JSON format. The options are threshold and topk.

  • When running a semi-supervised model, the ML_PREDICT_ROW, ML_PREDICT_TABLE, and ML_SCORE routines have the supervised_submodel_weight option. It allows you to override the ensemble_score weighting estimated during ML_TRAIN with a new value. The value must be greater than 0 and less than 1.

Options for Anomaly Detection on Log Data

When you run anomaly detection on log data (MySQL 9.2.2 and later), you have the option to leverage MySQL HeatWave GenAI for textual summaries of the results. To create summaries, use the following options:

  • summarize_logs: Enable summaries by setting this to TRUE. If enabled, summaries are generated for log segments that are labeled as an anomaly or exceed the value set for the summary_threshold.

  • summary_threshold: Determines the rows in the output table that are summarized. This does not affect how the contamination and threshold options determine anomalies. You can set a value greater than 0 and less than 1. The default value is NULL.

    Summaries are generated for the following:

    • All rows labeled as anomalies.

    • If a value is set for summary_threshold, any non-anomaly rows that exceed the value of the summary_threshold.

    If the default NULL value is used for summary_threshold, then only rows labeled as anomalies are summarized.

Note

Enabling the summary_threshold option and setting a very low threshold value can potentially lead to a high number of summaries being generated, which may substantially increase the time required to generate output tables.

Generate Predictions for a Semi-Supervised Anomaly Detection Model
  1. If not already done, load the model. You can use the session variable for the model that is valid for the duration of the connection. Alternatively, you can use the model handle previously set. For the option to set the user name, you can set it to NULL.

    The following example uses the session variable.

    mysql> CALL sys.ML_MODEL_LOAD(@semi_supervised_model, NULL);
    

    The following example uses the model handle.

    mysql> CALL sys.ML_MODEL_LOAD('anomaly_detection_semi_supervised_use_case', NULL);
    
  2. Make predictions for the test dataset by using the ML_PREDICT_TABLE routine.

    mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name), [options]);
    

    Replace table_name, model_handle, and output_table_name with your own values. Add options as needed.

    The following example runs ML_PREDICT_TABLE on the testing dataset previously created.

    mysql> CALL sys.ML_PREDICT_TABLE('anomaly_data.credit_card_train', @semi_supervised_model, 'anomaly_data.credit_card_predictions_semi', JSON_OBJECT('threshold', 0.55));
    

    Where:

    • anomaly_data.credit_card_train is the fully qualified name of the input table that contains the data to generate predictions for (database_name.table_name).

    • @model is the session variable for the model handle.

    • anomaly_data.credit_card_predictions_semi is the fully qualified name of the output table with predictions (database_name.table_name).

    • JSON_OBJECT('threshold', 0.55) sets a threshold value of 55%, which means any row that generates an anomaly score of over 55% is labeled as an anomaly.

  3. Query the target and ml_results columns from the output table. This allows you to compare the real value with the generated anomaly prediction. Review is_anomaly to see if the row is labeled as an anomaly (1) or normal (0). Review the anomaly score for each prediction next to normal and anomaly. If needed, you can also query all the columns from the table (SELECT * FROM credit_card_predictions_semi) to review all the data at once.

    mysql> SELECT target, ml_results FROM credit_card_predictions_semi;
    +--------+--------------------------------------------------------------------------------------------+
    | target | ml_results                                                                                 |
    +--------+--------------------------------------------------------------------------------------------+
    |      0 | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.43, "anomaly": 0.57}}     |
    |      1 | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.4377, "anomaly": 0.5623}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8677, "anomaly": 0.1323}} |
    |   NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8652, "anomaly": 0.1348}} |
    |      1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.4921, "anomaly": 0.5079}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8487, "anomaly": 0.1513}} |
    |   NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.7622, "anomaly": 0.2378}} |
    |      1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.57, "anomaly": 0.43}}     |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8317, "anomaly": 0.1683}} |
    |   NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8539, "anomaly": 0.1461}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9264, "anomaly": 0.0736}} |
    |      1 | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.4079, "anomaly": 0.5921}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8379, "anomaly": 0.1621}} |
    |   NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.7971, "anomaly": 0.2029}} |
    |      1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.4623, "anomaly": 0.5377}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8816, "anomaly": 0.1184}} |
    |   NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8267, "anomaly": 0.1733}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8816, "anomaly": 0.1184}} |
    |      1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.4661, "anomaly": 0.5339}} |
    |   NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8202, "anomaly": 0.1798}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9113, "anomaly": 0.0887}} |
    |      1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.5078, "anomaly": 0.4922}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9378, "anomaly": 0.0622}} |
    |   NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8963, "anomaly": 0.1037}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.5262, "anomaly": 0.4738}} |
    |      1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.5002, "anomaly": 0.4998}} |
    |   NULL | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8767, "anomaly": 0.1233}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8878, "anomaly": 0.1122}} |
    |      1 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.4661, "anomaly": 0.5339}} |
    |      0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9037, "anomaly": 0.0963}} |
    |   NULL | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.4171, "anomaly": 0.5829}} |
    +--------+--------------------------------------------------------------------------------------------+
    31 rows in set (0.0479 sec)

To learn more about generating predictions for one or more rows of data, see Generate Predictions for a Row of Data.

Generate Predictions for an Unsupervised Anomaly Detection Model on Log Data
  1. If not already done, load the model. You can use the session variable for the model that is valid for the duration of the connection. Alternatively, you can use the model handle previously set. For the option to set the user name, you can set it to NULL.

    The following example uses the session variable.

    mysql> CALL sys.ML_MODEL_LOAD(@unsupervised_log_model, NULL);
    

    The following example uses the model handle.

    mysql> CALL sys.ML_MODEL_LOAD('anomaly_detection_log_use_case', NULL);
    
  2. Make predictions for the test dataset by using the ML_PREDICT_TABLE routine.

    mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name), [options]);
    

    Replace table_name, model_handle, and output_table_name with your own values. Add options as needed.

    The following example runs ML_PREDICT_TABLE on the testing dataset previously created.

    mysql> CALL sys.ML_PREDICT_TABLE('anomaly_log_data.testing_data', @unsupervised_log_model, 'anomaly_log_data.log_predictions_unsupervised', 
           JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', TRUE)));
    

    Where:

    • anomaly_log_data.testing_data is the fully qualified name of the input table that contains the data to generate predictions for (database_name.table_name).

    • @model is the session variable for the model handle.

    • anomaly_log_data.log_predictions_unsupervised is the fully qualified name of the output table with predictions (database_name.table_name).

    • JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', TRUE)) enables the textual summaries generated by MySQL HeatWave GenAI. No threshold is set for the summaries, so the default value of any labeled anomaly generates a summary.

  3. Query the output table and compare the real value with the generated anomaly prediction. Use \G to view the output in an easily readable format.

    mysql> SELECT * FROM log_predictions_unsupervised\G
    *************************** 1. row ***************************
                    id: 1
    parsed_log_segment: User login failed: Invalid credentials Server response time increased Normal database query Unusual network traffic from IP: 10.0.0.5 System update completed successfully 
                                           Error log: Stack trace included User activity: Admin accessed settings Unlabeled log: Further investigation needed Security alert: Potential malware detected System shutdown initiated
            ml_results: {"summary": "\nHere is a concise summary of the text:\n\nThe system encountered several issues, including an invalid login attempt, increased server response time, and unusual network traffic. 
                                       A potential malware detection was also triggered, prompting a security alert. The system has been shut down for further investigation.", 
                                       "index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}}
    *************************** 2. row ***************************
                    id: 2
    parsed_log_segment: Unusual network traffic from IP: 10.0.0.5 System update completed successfully Error log: Stack trace included User activity: Admin accessed settings Unlabeled log: Further investigation needed Security alert: Potential malware detected System shutdown initiated
            ml_results: {"summary": "\nHere is a concise summary:\n\nA system update was completed successfully, but an error log indicates potential malware detection and requires further investigation.", 
                                    "index_map": [4, 5, 6, 7, 8, 9, 10], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}}
    *************************** 3. row ***************************
                    id: 3
    parsed_log_segment: User activity: Admin accessed settings Unlabeled log: Further investigation needed Security alert: Potential malware detected System shutdown initiated
            ml_results: {"summary": "\nAn administrator has accessed the system settings, triggered a security alert for potential malware detection, and initiated a system shutdown.", 
                                    "index_map": [7, 8, 9, 10], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}}
    *************************** 4. row ***************************
                    id: 4
    parsed_log_segment: System shutdown initiated
            ml_results: {"summary": "\nThe system is shutting down.", "index_map": [10], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}}

    The size of the output table is based on the window_size and window_stride parameters when the model is trained. Since this use case does not set these parameters, the default values of 10 for window_size and 3 for window_stride is used. See Log Anomaly Detection Options to learn more.

    Review the following in the output table:

    • is_anomaly to see if the row is labeled as an anomaly (1) or normal (0).

    • normal and anomaly to see the anomaly score for each.

    • index_map to see which rows in the input table are included in the prediction based on the window_size and window_stride.

    • summary to see the generated text summary describing the anomaly.

What's Next