MySQL HeatWave User Guide

6.8 Use MySQL HeatWave AutoML with Lakehouse

If you have external data that you want to use with MySQL HeatWave AutoML, you can load it directly from Object Storage with Lakehouse.


Before You Begin

MySQL HeatWave AutoML and Lakehouse Overview

For MySQL HeatWave on OCI, you have three options to access external data from Object Storage: Pre-Authenticated Requests (PAR), Resource Principals, or, as of MySQL 9.3.1, Uniform Resource Identifiers (URIs).


After loading data from Object Storage into Lakehouse, you can run MySQL HeatWave AutoML routines as needed with no changes.

If you run ML_PREDICT_ROW or ML_EXPLAIN_ROW on data loaded from Object Storage, you cannot use a FROM clause.

MySQL HeatWave AutoML commands operate on data loaded into MySQL HeatWave. If the original Lakehouse data in Object Storage is deleted or modified, this does not affect a MySQL HeatWave AutoML command until the data is unloaded from MySQL HeatWave.
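If you do need to pick up changes made to the source data in Object Storage, you can unload and reload the Lakehouse table. The following is a minimal sketch, assuming an external Lakehouse table named bank_marketing.bank_train (the table created later in this topic); the reload reads the current contents of the file in Object Storage:

    mysql> ALTER TABLE bank_marketing.bank_train SECONDARY_UNLOAD;
    mysql> ALTER TABLE bank_marketing.bank_train SECONDARY_LOAD;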

Workflow to Use MySQL HeatWave AutoML with Lakehouse

The workflow to use MySQL HeatWave AutoML with Lakehouse includes the following:

  1. Upload the file that has the data to a bucket in your Oracle Cloud account. See Upload File to Object Storage.

  2. Load the external file into MySQL HeatWave and create an external Lakehouse table. You have two options to do this. The preferred method is to automate the process with the heatwave_load command. See Load Data into an External Lakehouse Table. Alternatively, you can manually load the data. See Section 4.3.9, “Load Structured Data Manually” to learn more.

  3. Run MySQL HeatWave AutoML routines as needed. For ML_PREDICT_ROW and ML_EXPLAIN_ROW, you cannot use the FROM clause with Lakehouse tables. See Use ML_PREDICT_ROW and ML_EXPLAIN_ROW with Lakehouse Data to review how to do this.

MySQL HeatWave AutoML and Lakehouse Use Case

The following tasks use data from Bank Marketing. This data trains a classification machine learning model that predicts whether a client will subscribe to a term deposit. The target column is y.

To access the file from Object Storage, you create a pre-authenticated request. This is a temporary request that allows you to access files in Object Storage. See Object Storage Pre-Authenticated Requests to learn more.

Upload File to Object Storage

To upload the file that has the data to Object Storage:

  1. Visit Bank Marketing and download the bank+marketing.zip file.

  2. Unzip the file, and then unzip the bank.zip file. Refer to the bank.csv file for the remaining steps.

  3. Access your Oracle Cloud account and upload the bank.csv file to Object Storage. See Uploading an Object Storage Object to a Bucket.

  4. Create a pre-authenticated request for the bank.csv file. Select the Object Pre-Authenticated Request Target and enter the bank.csv Object Name. See Creating a Pre-Authenticated Request in Object Storage in the MySQL HeatWave on OCI Service Guide.

Load Data into an External Lakehouse Table

After uploading the file to Object Storage, define the name of the table that will store the data, the format of the external file, and the pre-authenticated request link that accesses the external file. You then use the heatwave_load command so that Auto Parallel Load automatically infers the characteristics of the table columns and loads the data into an external Lakehouse table. To learn more, see Section 4.3.8.1, “Load Data from Object Storage”.

  1. Connect to your MySQL HeatWave Database System.

  2. Create a session variable with the characteristics of the data to load and the external file:

    mysql> SET @input_list = '[{
      "db_name": "bank_marketing",
      "tables": [{
        "table_name": "bank_train",
        "engine_attribute": {
          "dialect": {"format": "csv", "has_header": true},
          "file": [{"par": "pre_authenticated_request"}]
        }
      }]
    }]';
    

    Where:

    • @input_list is the name of the session variable.

    • db_name identifies the database name to store the table. If the database does not exist, MySQL HeatWave automatically creates it.

    • table_name sets the table name to store the data: bank_train. If the table does not exist, MySQL HeatWave automatically creates it.

    • format defines the format of the external file: csv.

    • has_header identifies a header in the external file. Auto Parallel Load then infers the column names from the first row in the file.

    • par sets the pre-authenticated request link to access the file. Replace pre_authenticated_request with your own link.

    If you are on MySQL 9.1.2 or earlier, you need to update dialect with the field_delimiter and record_delimiter parameters. Set field_delimiter to "," and record_delimiter to "\\n". As of MySQL 9.2.0, Lakehouse Autopilot can automatically detect these values. See Lakehouse External Table Syntax to learn more.
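
    For MySQL 9.1.2 and earlier, the session variable from this step looks similar to the following sketch, with the delimiter parameters added to dialect (as before, replace pre_authenticated_request with your own link):

    mysql> SET @input_list = '[{
      "db_name": "bank_marketing",
      "tables": [{
        "table_name": "bank_train",
        "engine_attribute": {
          "dialect": {
            "format": "csv",
            "has_header": true,
            "field_delimiter": ",",
            "record_delimiter": "\\n"
          },
          "file": [{"par": "pre_authenticated_request"}]
        }
      }]
    }]';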

  3. Use the HEATWAVE_LOAD command to automatically load the data into an external Lakehouse table with the default settings. You have the option to test the load before implementing it. See Section 4.3.8.1, “Load Data from Object Storage” to learn more about different modes.

    mysql> CALL sys.HEATWAVE_LOAD(CAST(@input_list AS JSON), NULL);
    +------------------------------------------+
    | INITIALIZING HEATWAVE AUTO PARALLEL LOAD |
    +------------------------------------------+
    | Version: 4.24                            |
    |                                          |
    | Load Mode: normal                        |
    | Load Policy: disable_unsupported_columns |
    | Output Mode: normal                      |
    |                                          |
    +------------------------------------------+
    6 rows in set (0.0676 sec)
    
    +--------------------------------------------------------------------------------------------------------------------+
    | LAKEHOUSE AUTO SCHEMA INFERENCE                                                                                    |
    +--------------------------------------------------------------------------------------------------------------------+
    | Verifying external lakehouse tables: 1                                                                             |
    |                                                                                                                    |
    | SCHEMA                   TABLE                    TABLE IS           RAW     NUM. OF      ESTIMATED     SUMMARY OF |
    | NAME                     NAME                     CREATED      FILE SIZE     COLUMNS      ROW COUNT     ISSUES     |
    | ------                   -----                    --------     ---------     -------      ---------     ---------- |
    | `bank_marketing`         `bank_train`             NO          450.66 KiB          17         4.52 K                |
    |                                                                                                                    |
    | New schemas to be created: 0                                                                                       |
    | External lakehouse tables to be created: 1                                                                         |
    |                                                                                                                    |
    +--------------------------------------------------------------------------------------------------------------------+
    10 rows in set (0.0676 sec)
    
    +------------------------------------------------------------------------+
    | OFFLOAD ANALYSIS                                                       |
    +------------------------------------------------------------------------+
    | Verifying input schemas: 1                                             |
    | User excluded items: 0                                                 |
    |                                                                        |
    | SCHEMA                       OFFLOADABLE    OFFLOADABLE     SUMMARY OF |
    | NAME                              TABLES        COLUMNS     ISSUES     |
    | ------                       -----------    -----------     ---------- |
    | `bank_marketing`                       1             17                |
    |                                                                        |
    | Total offloadable schemas: 1                                           |
    |                                                                        |
    +------------------------------------------------------------------------+
    10 rows in set (0.0676 sec)
    
    +-----------------------------------------------------------------------------------------------------------------------------+
    | CAPACITY ESTIMATION                                                                                                         |
    +-----------------------------------------------------------------------------------------------------------------------------+
    | Default encoding for string columns: VARLEN (unless specified in the schema)                                                |
    | Estimating memory footprint for 1 schema(s)                                                                                 |
    |                                                                                                                             |
    |                                TOTAL       ESTIMATED       ESTIMATED       TOTAL     DICTIONARY      VARLEN       ESTIMATED |
    | SCHEMA                   OFFLOADABLE   HEATWAVE NODE      MYSQL NODE      STRING        ENCODED     ENCODED            LOAD |
    | NAME                          TABLES       FOOTPRINT       FOOTPRINT     COLUMNS        COLUMNS     COLUMNS            TIME |
    | ------                   -----------       ---------       ---------     -------     ----------     -------       --------- |
    | `bank_marketing`                   1        3.10 MiB        1.12 MiB          10              0          10          7.00 s |
    |                                                                                                                             |
    | Sufficient MySQL host memory available to load all tables.                                                                  |
    | Sufficient HeatWave cluster memory available to load all tables.                                                            |
    |                                                                                                                             |
    +-----------------------------------------------------------------------------------------------------------------------------+
    12 rows in set (0.0676 sec)
    
    +---------------------------------------------------------------------------------------------------------------------------------------+
    | EXECUTING LOAD SCRIPT                                                                                                                 |
    +---------------------------------------------------------------------------------------------------------------------------------------+
    | HeatWave Load script generated                                                                                                        |
    |   Retrieve load script containing 2 generated DDL command(s) using the query below:                                                   |
    | Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot_report"                           |
    |   SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id;                            |
    |                                                                                                                                       |
    | Adjusting load parallelism dynamically per internal/external table.                                                                   |
    | Using current parallelism of 32 thread(s) as maximum for internal tables.                                                             |
    |                                                                                                                                       |
    | Warning: Executing the generated script may alter column definitions and secondary engine flags in the schema                         |
    |                                                                                                                                       |
    | Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION |
    |                                                                                                                                       |
    | Proceeding to load 1 table(s) into HeatWave.                                                                                          |
    |                                                                                                                                       |
    | Applying changes will take approximately 7.01 s                                                                                       |
    |                                                                                                                                       |
    +---------------------------------------------------------------------------------------------------------------------------------------+
    16 rows in set (0.0676 sec)
    
    +-----------------------------------------------+
    | TABLE LOAD                                    |
    +-----------------------------------------------+
    | TABLE (1 of 1): `bank_marketing`.`bank_train` |
    | Commands executed successfully: 2 of 2        |
    | Warnings encountered: 0                       |
    | Table load succeeded!                         |
    |   Total columns loaded: 17                    |
    |   Elapsed time: 3.50 s                        |
    |                                               |
    +-----------------------------------------------+
    7 rows in set (0.0676 sec)
    
    +-------------------------------------------------------------------------------+
    | LOAD SUMMARY                                                                  |
    +-------------------------------------------------------------------------------+
    |                                                                               |
    | SCHEMA                          TABLES       TABLES      COLUMNS         LOAD |
    | NAME                            LOADED       FAILED       LOADED     DURATION |
    | ------                          ------       ------      -------     -------- |
    | `bank_marketing`                     1            0           17       3.50 s |
    |                                                                               |
    | Total errors encountered: 0                                                   |
    | Total warnings encountered: 0                                                 |
    |                                                                               |
    +-------------------------------------------------------------------------------+
    9 rows in set (0.0676 sec)
    
    Query OK, 0 rows affected (0.0676 sec)
    
    

    A summary is generated that allows you to review the details of the load and any warnings or errors.
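
    To spot-check the load, you can also query the new Lakehouse table directly; for example, the row count should be close to the 4.52 K estimated in the schema inference output, and DESCRIBE shows the inferred column definitions:

    mysql> SELECT COUNT(*) FROM bank_marketing.bank_train;
    mysql> DESCRIBE bank_marketing.bank_train;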

Once the data is loaded into Lakehouse, you can use MySQL HeatWave AutoML routines as needed with no changes.

For ML_PREDICT_ROW and ML_EXPLAIN_ROW, you must avoid the FROM clause. The following task shows how to use these commands specifically for data loaded into MySQL HeatWave from Lakehouse.

Use ML_PREDICT_ROW and ML_EXPLAIN_ROW with Lakehouse Data

To avoid using the FROM clause, you can insert the data to predict and explain directly into a JSON object.

  1. Train the model on the table. Passing JSON_OBJECT('task', 'classification') explicitly sets the machine learning task. Because classification is the default task, passing NULL for the options argument would have the same effect.

    mysql> CALL sys.ML_TRAIN('bank_marketing.bank_train', 'y', JSON_OBJECT('task', 'classification'), @bank_model);
    

    Where:

    • bank_marketing.bank_train is the fully qualified name of the table that contains the training dataset (schema_name.table_name).

    • y is the name of the target column, which contains ground truth values.

    • JSON_OBJECT('task', 'classification') specifies the machine learning task type.

    • @bank_model is the name of the user-defined session variable that stores the model handle for the duration of the connection. User variables are written as @var_name. Any valid name for a user-defined variable is permitted. For example, @my_model. Learn more about Model Handles.
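
    To see the handle that ML_TRAIN generated, you can query the session variable directly:

    mysql> SELECT @bank_model;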

  2. After training, load the trained model.

    mysql> CALL sys.ML_MODEL_LOAD(@bank_model, NULL);
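
    Trained models are stored in your model catalog. As a sketch, assuming your MySQL user name is admin (column names may vary between releases), you can list your models with:

    mysql> SELECT model_id, model_handle FROM ML_SCHEMA_admin.MODEL_CATALOG;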
    
  3. Insert the data to predict and explain directly into a JSON object named @row_input.

    mysql> SET @row_input = JSON_OBJECT(
            'age', 37,
            'job', 'admin.',
            'marital', 'married',
            'education', 'unknown',
            'default', 'no',
            'balance', 734,
            'housing', 'yes',
            'loan', 'no',
            'contact', 'unknown',
            'day', 21,
            'month', 'may',
            'duration', 1106,
            'campaign', 1,
            'pdays', -1,
            'previous', 0,
            'poutcome', 'unknown',
            'y', 'no');
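
    Typing values by hand is error prone. As an alternative sketch, you can build the same kind of JSON object from an existing row of the training table; this assumes the column names inferred from the CSV header, and quotes default in backticks because it is a reserved word:

    mysql> SELECT JSON_OBJECT(
            'age', age, 'job', job, 'marital', marital,
            'education', education, 'default', `default`,
            'balance', balance, 'housing', housing, 'loan', loan,
            'contact', contact, 'day', day, 'month', month,
            'duration', duration, 'campaign', campaign,
            'pdays', pdays, 'previous', previous,
            'poutcome', poutcome, 'y', y)
           INTO @row_input
           FROM bank_marketing.bank_train LIMIT 1;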
    
  4. Run ML_PREDICT_ROW to generate predictions on the data in the JSON object. Optionally, use \G to view the output in an easily readable format.

    mysql> SELECT sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL)\G
    *************************** 1. row ***************************
    sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL): 
    {"y": "no", "age": 37, "day": 21, "job": "admin.", "loan": "no", "month": "may", "pdays": -1, "balance": 734, "contact": "unknown", "default": "no", "housing": "yes", "marital": "married", "campaign": 1, "duration": 1106, "poutcome": "unknown", "previous": 0, "education": "unknown", 
    "Prediction": "no", "ml_results": {"predictions": {"y": "no"}, "probabilities": {"no": 0.7052, "yes": 0.2948}}}
    1 row in set (1.0027 sec)
    

    The prediction gives a value of no with a probability of approximately 71% (0.7052). This matches the labeled value for the row in the y column, which demonstrates that the trained model gave a reliable prediction.
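
    Because ML_PREDICT_ROW returns a JSON value, you can pull out individual fields with the standard JSON functions; for example, to extract just the predicted label:

    mysql> SELECT JSON_UNQUOTE(JSON_EXTRACT(
            sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL),
            '$.Prediction')) AS prediction;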

  5. Run ML_EXPLAIN_ROW with the permutation_importance prediction explainer to generate an explanation on the prediction previously generated. Optionally, use \G to view the output in an easily readable format.

    mysql> SELECT sys.ML_EXPLAIN_ROW(@row_input, @bank_model, 
            JSON_OBJECT('prediction_explainer', 'permutation_importance'))\G
    *************************** 1. row ***************************
    sys.ML_EXPLAIN_ROW(@row_input, @bank_model, 
            JSON_OBJECT('prediction_explainer', 'permutation_importance')): {"y": "no", "age": 37, "day": 21, "job": "admin.", "loan": "no", 
            "Notes": "month (may) had the largest impact towards predicting no, whereas duration (1106) contributed the most against predicting no", 
            "month": "may", "pdays": -1, "balance": 734, "contact": "unknown", "default": "no", "housing": "yes", "marital": "married", "campaign": 1, "duration": 1106, "poutcome": "unknown", "previous": 0, "education": "unknown", 
            "Prediction": "no", "ml_results": {"notes": "month (may) had the largest impact towards predicting no, whereas duration (1106) contributed the most against predicting no", 
            "predictions": {"y": "no"}, "attributions": {"month": 0.24, "pdays": 0.02, "contact": 0.15, "default": 0.0, "housing": 0.05, "duration": -0.29, "poutcome": -0.0, "previous": -0.02}}, "month_attribution": 0.24, "pdays_attribution": 0.02, "contact_attribution": 0.15, "default_attribution": 0, "housing_attribution": 0.05, "duration_attribution": -0.29, "poutcome_attribution": 0, "previous_attribution": -0.02}
    1 row in set (5.0770 sec)

    The explanation determines that the may value in the month column had the greatest impact on generating the previous prediction of no. It also determines that duration contributed the most against the prediction of no.
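
    After explaining individual predictions, you can also score the model against the labeled table to quantify its overall quality. The following is a sketch using the accuracy metric; check your release's documentation for the full list of supported metrics:

    mysql> CALL sys.ML_SCORE('bank_marketing.bank_train', 'y', @bank_model,
            'accuracy', @score, NULL);
    mysql> SELECT @score;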

What's Next