MySQL HeatWave User Guide
If you have external data that you want to use with MySQL HeatWave AutoML, you can load it directly from Object Storage with Lakehouse.
Lakehouse must be enabled in the MySQL HeatWave cluster. See Additional MySQL HeatWave Lakehouse Requirements to learn more.
For MySQL HeatWave on OCI, you have three options to access external data from Object Storage: Pre-Authenticated Requests (PAR), Resource Principals, or Uniform Resource Identifiers (URIs) (as of MySQL 9.3.1).
See the following to learn more:
Resource Principals in the MySQL HeatWave on OCI Service Guide.
Uploading an Object Storage Object to a Bucket in the MySQL HeatWave on OCI Service Guide.
Object Storage Pre-Authenticated Requests in the MySQL HeatWave on OCI Service Guide.
After loading data from Object Storage into Lakehouse, you can start running AutoML routines as needed with no changes, with one exception: if you run ML_PREDICT_ROW or ML_EXPLAIN_ROW on data loaded from Object Storage, you cannot use a FROM clause.
MySQL HeatWave AutoML commands operate on data loaded into MySQL HeatWave. If the original Lakehouse data in Object Storage is deleted or modified, this does not affect a MySQL HeatWave AutoML command until the data is unloaded from MySQL HeatWave.
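For illustration, unloading a table from MySQL HeatWave is a standard SECONDARY_UNLOAD statement. The sketch below assumes the bank_marketing.bank_train table created later in this topic; after the unload, AutoML commands can no longer operate on the table until it is reloaded:

mysql> ALTER TABLE bank_marketing.bank_train SECONDARY_UNLOAD;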
The workflow to use MySQL HeatWave AutoML with Lakehouse includes the following:
Upload the file that has the data to a bucket in your Oracle Cloud account. See Upload File to Object Storage.
Load the external file into MySQL HeatWave and create an external Lakehouse table. You have two options to do this. The preferred method is to automate the process with the heatwave_load command. See Load Data into an External Lakehouse Table. Alternatively, you can manually load the data. See Section 4.3.9, “Load Structured Data Manually” to learn more.
Run MySQL HeatWave AutoML routines as needed. For ML_PREDICT_ROW and ML_EXPLAIN_ROW, you cannot use the FROM clause with Lakehouse tables. See Use ML_PREDICT_ROW and ML_EXPLAIN_ROW with Lakehouse Data to review how to do this.
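As a hedged sketch of the manual alternative in the second step, an external Lakehouse table combines ENGINE=lakehouse, SECONDARY_ENGINE=RAPID, and an ENGINE_ATTRIBUTE that names the file and its dialect; the abbreviated column list and the PAR placeholder below are illustrative, not the exact commands from Section 4.3.9:

mysql> CREATE TABLE bank_marketing.bank_train (
         age INT,
         job VARCHAR(32)
         -- remaining columns omitted for brevity
       ) ENGINE=lakehouse SECONDARY_ENGINE=RAPID
       ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", "has_header": true},
                          "file": [{"par": "pre_authenticated_request"}]}';
mysql> ALTER TABLE bank_marketing.bank_train SECONDARY_LOAD;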
The following tasks use data from Bank Marketing. This data is used for a classification machine learning model, and predicts if a client will subscribe to a term deposit. The target column is y.
To access the file from Object Storage, you create a pre-authenticated request. This is a temporary request that allows you to access files in Object Storage. See Object Storage Pre-Authenticated Requests to learn more.
To upload the file that has the data to Object Storage:
Visit Bank Marketing and download the bank+marketing.zip file.
Unzip the file, and then unzip the bank.zip file. Refer to the bank.csv file for the remaining steps.
Access your Oracle Cloud account and upload the bank.csv file to Object Storage. See Uploading an Object Storage Object to a Bucket.
Create a pre-authenticated request for the bank.csv file. Select the Object Pre-Authenticated Request Target and enter the bank.csv Object Name. See Creating a Pre-Authenticated Request in Object Storage in the MySQL HeatWave on OCI Service Guide.
After uploading the file to Object Storage, set up the name of the table that will store the data, the format of the external file, and the pre-authenticated request link to access the external file. You then use the heatwave_load command to allow Auto Parallel Load to automatically infer the characteristics of the table columns and load the data into an external Lakehouse table. To learn more, see Section 4.3.8.1, “Load Data from Object Storage”.
Create a session variable with the characteristics of the data to load and the external file:
mysql> SET @input_list = '[{
"db_name": "bank_marketing",
"tables": [{
"table_name": "bank_train",
"engine_attribute": {
"dialect": {"format": "csv", "has_header": true},
"file": [{"par": "pre_authenticated_request"}]
}
}]
}]';
Where:
@input_list is the name of the session variable.
db_name identifies the database name to store the table. If the database does not exist, MySQL HeatWave automatically creates it.
table_name sets the table name to store the data: bank_train. If the table does not exist, MySQL HeatWave automatically creates it.
format defines the format of the external file: csv.
has_header identifies a header in the external file. Auto Parallel Load then infers the column names from the first row in the file.
par sets the pre-authenticated request link to access the file. Replace pre_authenticated_request with your own link.
If you are on MySQL 9.1.2 or earlier, you need to update dialect with the field delimiter and record delimiter parameters. Set the field delimiter with a value of ",", and the record delimiter with a value of "\\n". As of MySQL 9.2.0, Lakehouse Autopilot can automatically detect these values. See Lakehouse External Table Syntax to learn more.
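On MySQL 9.1.2 or earlier, the session variable might therefore look like the following sketch; the field_delimiter and record_delimiter key names are taken from the Lakehouse external table syntax, and pre_authenticated_request remains a placeholder for your own link:

mysql> SET @input_list = '[{
    "db_name": "bank_marketing",
    "tables": [{
        "table_name": "bank_train",
        "engine_attribute": {
            "dialect": {"format": "csv", "has_header": true,
                        "field_delimiter": ",", "record_delimiter": "\\n"},
            "file": [{"par": "pre_authenticated_request"}]
        }
    }]
}]';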
Use the HEATWAVE_LOAD command to automatically load the data into an external Lakehouse table with the default settings. You have the option to test the load before implementing it. See Section 4.3.8.1, “Load Data from Object Storage” to learn more about different modes.
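For example, a test run can be sketched by passing a mode option as the second argument instead of NULL; the JSON_OBJECT('mode', 'dryrun') option shown here is an assumption based on the load modes described in Section 4.3.8.1, and it validates the load without executing it:

mysql> CALL sys.HEATWAVE_LOAD(CAST(@input_list AS JSON), JSON_OBJECT('mode', 'dryrun'));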
mysql> CALL sys.HEATWAVE_LOAD(CAST(@input_list AS JSON), NULL);
+------------------------------------------+
| INITIALIZING HEATWAVE AUTO PARALLEL LOAD |
+------------------------------------------+
| Version: 4.24 |
| |
| Load Mode: normal |
| Load Policy: disable_unsupported_columns |
| Output Mode: normal |
| |
+------------------------------------------+
6 rows in set (0.0676 sec)
+--------------------------------------------------------------------------------------------------------------------+
| LAKEHOUSE AUTO SCHEMA INFERENCE |
+--------------------------------------------------------------------------------------------------------------------+
| Verifying external lakehouse tables: 1 |
| |
| SCHEMA TABLE TABLE IS RAW NUM. OF ESTIMATED SUMMARY OF |
| NAME NAME CREATED FILE SIZE COLUMNS ROW COUNT ISSUES |
| ------ ----- -------- --------- ------- --------- ---------- |
| `bank_marketing` `bank_train` NO 450.66 KiB 17 4.52 K |
| |
| New schemas to be created: 0 |
| External lakehouse tables to be created: 1 |
| |
+--------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.0676 sec)
+------------------------------------------------------------------------+
| OFFLOAD ANALYSIS |
+------------------------------------------------------------------------+
| Verifying input schemas: 1 |
| User excluded items: 0 |
| |
| SCHEMA OFFLOADABLE OFFLOADABLE SUMMARY OF |
| NAME TABLES COLUMNS ISSUES |
| ------ ----------- ----------- ---------- |
| `bank_marketing` 1 17 |
| |
| Total offloadable schemas: 1 |
| |
+------------------------------------------------------------------------+
10 rows in set (0.0676 sec)
+-----------------------------------------------------------------------------------------------------------------------------+
| CAPACITY ESTIMATION |
+-----------------------------------------------------------------------------------------------------------------------------+
| Default encoding for string columns: VARLEN (unless specified in the schema) |
| Estimating memory footprint for 1 schema(s) |
| |
| TOTAL ESTIMATED ESTIMATED TOTAL DICTIONARY VARLEN ESTIMATED |
| SCHEMA OFFLOADABLE HEATWAVE NODE MYSQL NODE STRING ENCODED ENCODED LOAD |
| NAME TABLES FOOTPRINT FOOTPRINT COLUMNS COLUMNS COLUMNS TIME |
| ------ ----------- --------- --------- ------- ---------- ------- --------- |
| `bank_marketing` 1 3.10 MiB 1.12 MiB 10 0 10 7.00 s |
| |
| Sufficient MySQL host memory available to load all tables. |
| Sufficient HeatWave cluster memory available to load all tables. |
| |
+-----------------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.0676 sec)
+---------------------------------------------------------------------------------------------------------------------------------------+
| EXECUTING LOAD SCRIPT |
+---------------------------------------------------------------------------------------------------------------------------------------+
| HeatWave Load script generated |
| Retrieve load script containing 2 generated DDL command(s) using the query below: |
| Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot_report" |
| SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id; |
| |
| Adjusting load parallelism dynamically per internal/external table. |
| Using current parallelism of 32 thread(s) as maximum for internal tables. |
| |
| Warning: Executing the generated script may alter column definitions and secondary engine flags in the schema |
| |
| Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION |
| |
| Proceeding to load 1 table(s) into HeatWave. |
| |
| Applying changes will take approximately 7.01 s |
| |
+---------------------------------------------------------------------------------------------------------------------------------------+
16 rows in set (0.0676 sec)
+-----------------------------------------------+
| TABLE LOAD |
+-----------------------------------------------+
| TABLE (1 of 1): `bank_marketing`.`bank_train` |
| Commands executed successfully: 2 of 2 |
| Warnings encountered: 0 |
| Table load succeeded! |
| Total columns loaded: 17 |
| Elapsed time: 3.50 s |
| |
+-----------------------------------------------+
7 rows in set (0.0676 sec)
+-------------------------------------------------------------------------------+
| LOAD SUMMARY |
+-------------------------------------------------------------------------------+
| |
| SCHEMA TABLES TABLES COLUMNS LOAD |
| NAME LOADED FAILED LOADED DURATION |
| ------ ------ ------ ------- -------- |
| `bank_marketing` 1 0 17 3.50 s |
| |
| Total errors encountered: 0 |
| Total warnings encountered: 0 |
| |
+-------------------------------------------------------------------------------+
9 rows in set (0.0676 sec)
Query OK, 0 rows affected (0.0676 sec)
A summary is generated that allows you to review the details of the load and any warnings or errors.
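If the summary reports warnings or errors, you can inspect their details through the sys.heatwave_autopilot_report view referenced in the load output; the type values filtered below are an assumption about how the report classifies entries:

mysql> SELECT log
       FROM sys.heatwave_autopilot_report
       WHERE type IN ("warn", "error")
       ORDER BY id;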
Once the data is loaded into Lakehouse, you can use MySQL HeatWave AutoML routines as needed with no changes. For ML_PREDICT_ROW and ML_EXPLAIN_ROW, you must avoid the FROM clause. The following task shows how to use these commands specifically for data loaded into MySQL HeatWave from Lakehouse. To avoid using the FROM clause, you can insert the data to predict and explain directly into a JSON object.
Train the model. Passing JSON_OBJECT('task', 'classification') explicitly selects the classification task; classification is also the default task, so passing NULL for the options would train the same kind of model.
mysql> CALL sys.ML_TRAIN('bank_marketing.bank_train', 'y', JSON_OBJECT('task', 'classification'), @bank_model);
Where:
bank_marketing.bank_train is the fully qualified name of the table that contains the training dataset (schema_name.table_name).
y is the name of the target column, which contains ground truth values.
JSON_OBJECT('task', 'classification') specifies the machine learning task type.
@bank_model is the name of the user-defined session variable that stores the model handle for the duration of the connection. User variables are written as @var_name. Any valid name for a user-defined variable is permitted, for example @my_model. Learn more about Model Handles.
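To confirm the handle that @bank_model now holds, you can display the variable and, as a sketch, look it up in the AutoML model catalog; the ML_SCHEMA_admin schema name below is an assumption (the catalog schema is named ML_SCHEMA_ followed by your MySQL user name):

mysql> SELECT @bank_model;
mysql> SELECT model_id, model_handle
       FROM ML_SCHEMA_admin.MODEL_CATALOG
       ORDER BY model_id DESC LIMIT 1;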
After training, load the trained model.
mysql> CALL sys.ML_MODEL_LOAD(@bank_model, NULL);
Insert the data to predict and explain directly into a JSON object named @row_input.
mysql> SET @row_input = JSON_OBJECT(
'age', 37,
'job', 'admin.',
'marital', 'married',
'education', 'unknown',
'default', 'no',
'balance', 734,
'housing', 'yes',
'loan', 'no',
'contact', 'unknown',
'day', 21,
'month', 'may',
'duration', 1106,
'campaign', 1,
'pdays', -1,
'previous', 0,
'poutcome', 'unknown',
'y', 'no');
Run ML_PREDICT_ROW to generate predictions on the data in the JSON object. Optionally, use \G to view the output in an easily readable format.
mysql> SELECT sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL)\G
*************************** 1. row ***************************
sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL):
{"y": "no", "age": 37, "day": 21, "job": "admin.", "loan": "no", "month": "may", "pdays": -1, "balance": 734, "contact": "unknown", "default": "no", "housing": "yes", "marital": "married", "campaign": 1, "duration": 1106, "poutcome": "unknown", "previous": 0, "education": "unknown",
"Prediction": "no", "ml_results": {"predictions": {"y": "no"}, "probabilities": {"no": 0.7052, "yes": 0.2948}}}
1 row in set (1.0027 sec)
The prediction gives a value of no with a probability of approximately 71% (0.7052). This is the same as the labeled value for the row in the y column. This demonstrates that the trained model gave a reliable prediction.
Run ML_EXPLAIN_ROW with the permutation_importance prediction explainer to generate an explanation for the prediction previously generated. Optionally, use \G to view the output in an easily readable format.
mysql> SELECT sys.ML_EXPLAIN_ROW(@row_input, @bank_model,
JSON_OBJECT('prediction_explainer', 'permutation_importance'))\G
*************************** 1. row ***************************
sys.ML_EXPLAIN_ROW(@row_input, @bank_model,
JSON_OBJECT('prediction_explainer', 'permutation_importance')): {"y": "no", "age": 37, "day": 21, "job": "admin.", "loan": "no",
"Notes": "month (may) had the largest impact towards predicting no, whereas duration (1106) contributed the most against predicting no",
"month": "may", "pdays": -1, "balance": 734, "contact": "unknown", "default": "no", "housing": "yes", "marital": "married", "campaign": 1, "duration": 1106, "poutcome": "unknown", "previous": 0, "education": "unknown",
"Prediction": "no", "ml_results": {"notes": "month (may) had the largest impact towards predicting no, whereas duration (1106) contributed the most against predicting no",
"predictions": {"y": "no"}, "attributions": {"month": 0.24, "pdays": 0.02, "contact": 0.15, "default": 0.0, "housing": 0.05, "duration": -0.29, "poutcome": -0.0, "previous": -0.02}}, "month_attribution": 0.24, "pdays_attribution": 0.02, "contact_attribution": 0.15, "default_attribution": 0, "housing_attribution": 0.05, "duration_attribution": -0.29, "poutcome_attribution": 0, "previous_attribution": -0.02}
1 row in set (5.0770 sec)
The explanation determines that the may value in the month column had the greatest impact on generating the previous prediction of no. It also determines that duration contributed the most against the prediction of no.
Learn how to Load Data from Object Storage into MySQL HeatWave Cluster using MySQL HeatWave Lakehouse.
Review how to Create a Machine Learning Model.
Review Machine Learning Use Cases to create machine learning models with sample datasets.