MySQL AI User Guide
Anomaly detection for logs allows you to detect anomalies in log data. To perform anomaly detection on logs, log data is cleaned, segmented, and encoded before running anomaly detection. This feature leverages the log template miner Drain3.
Consider the following when running anomaly detection on logs.
The input table can only have the following columns:
The column containing the logs.
If including logs from different sources, a column containing the source of each log. The values in this column contain the names of the sources that each log belongs to. These values are used to group each host's logs together. If this column is not present, it is assumed that all logs originate from the same source.
If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
                  At least one column must act as the primary key to
                  establish the temporal order of logs. If the primary
                  key column (or columns) is not one of the previous
                  required columns (log data, source of log, or label),
                  then you must use the
                  exclude_column_list option when
                  running
                  ML_TRAIN
                  to exclude all primary key columns that don't include
                  required data. See
                  Syntax Examples for Anomaly Detection Training
                  to review relevant examples.
                
              If the input table has additional columns to the ones
              permitted, you must use the
              exclude_column_list option when running
              ML_TRAIN
              to exclude irrelevant columns.
            
The data collected for anomaly detection can be unsupervised or semi-supervised. To run semi-supervised anomaly detection, you can provide a separate column in the input table with labels for the labeled log lines. This column labels identified anomalous logs with a value of 1, non-anomalous logs with 0, and unlabeled logs with NULL. See Semi-supervised Anomaly Detection to learn more.
In addition to the anomaly scores included in the output table, you have the option to leverage the GenAI feature of MySQL AI to provide textual log summaries.
              By default the following parameters are masked in the
              input data (training or test data): IP, DATETIME, TIME,
              HEX, IPPORT, and OCID. You have the option to mask
              additional regex patterns with the
              additional_masking_regex option.
            
              MySQL AI uses a combination of a keyword feature extractor
              and an embedding model to train models. This allows
              trained models to capture semantic meanings in log data.
              You have the option to select the keyword model and any
              embedding model supported by MySQL AI with the training
              options keyword_model and
              embedding_model. The available keyword
              model options are tf-idf and
              NULL. To review supported embedding
              models, run the following query: SELECT
              sys.ML_LIST_LLMS(); and see models that have
              capabilities with
              TEXT_EMBEDDINGS. The default keyword
              feature extractor is tf-idf, and the
              default embedding model is
              multilingual-e5-small. Using an
              embedding model causes higher memory usage. You can set
              either embedding_model or
              keyword_model to
              NULL, but you cannot set both to
              NULL.
            
Learn more about the following:
Learn how to Prepare Data for an Anomaly Detection Model.