MySQL HeatWave User Guide
This topic describes how to prepare the data for a regression machine learning model. It uses a data sample generated by OCI GenAI. The regression use case is to predict house prices based on the size of the house, the address of the house, and the state the house is located in. To prepare the data for this use case, you set up a training dataset and a testing dataset. The training dataset has 20 records, and the testing dataset has 10 records.
In a real-life use case, you should prepare a larger number of records for training and testing, and ensure the predictions are valid and reliable before testing on unlabeled data. To ensure reliable predictions, you should also create a validation dataset. You can reserve 20% of the records in the training dataset for this purpose.
Learn how to Prepare Data.
To prepare the data for the regression model:
Create and use the database to store the data.
mysql> CREATE DATABASE regression_data;
mysql> USE regression_data;
Create the table to insert the sample data into. This is the training dataset.
mysql> CREATE TABLE house_price_training (
    id INT PRIMARY KEY,
    house_size INT,
    address TEXT,
    state TEXT,
    price INT
);
Insert the sample data into the table. Copy and paste the following commands.
INSERT INTO house_price_training (id, house_size, address, state, price)
VALUES
(1, 1500, '123 Main St', 'California', 500000),
(2, 2000, '456 Elm St', 'Texas', 650000),
(3, 1800, '789 Oak Ave', 'New York', 700000),
(4, 1200, '222 Pine Rd', 'Florida', 420000),
(5, 1600, '555 Maple Lane', 'Washington', 550000),
(6, 2500, '888 River Blvd', 'California', 800000),
(7, 1300, '333 Creek St', 'Texas', 480000),
(8, 1700, '666 Mountain Rd', 'Colorado', 520000),
(9, 1400, '999 Valley View', 'New York', 580000),
(10, 1900, '111 Ocean Blvd', 'Florida', 620000),
(11, 1550, '2222 Lake Dr', 'Illinois', 540000),
(12, 2100, '3333 Forest Ave', 'Texas', 750000),
(13, 1650, '4444 Desert Rd', 'Arizona', 570000),
(14, 1250, '5555 Riverbank St', 'Washington', 450000),
(15, 1850, '6666 Sky Blvd', 'California', 720000),
(16, 1350, '7777 Meadow Lane', 'Ohio', 490000),
(17, 2050, '8888 Hill St', 'New York', 850000),
(18, 1450, '9999 Creek Rd', 'Florida', 590000),
(19, 1750, '10101 Ocean Ave', 'Texas', 680000),
(20, 1580, '11111 Pine St', 'Illinois', 560000);
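Before moving on, it can help to confirm that the training data loaded as expected. The following query is a minimal sanity check and not part of the original procedure. It assumes the INSERT statement above ran without errors against an empty table.

```sql
-- Sanity check (illustrative): count the rows and inspect the
-- range of the target column in the training dataset.
SELECT COUNT(*)   AS row_count,
       MIN(price) AS min_price,
       MAX(price) AS max_price
FROM house_price_training;
-- With the sample data above, this should report
-- row_count = 20, min_price = 420000, max_price = 850000.
```

If the row count differs, re-check that the full INSERT statement was pasted, including the final semicolon.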
Create the table to use for generating predictions and explanations. This is the test dataset. It has the same columns as the training dataset, but the target column, price, is not considered when generating predictions or explanations.
mysql> CREATE TABLE house_price_testing (
    id INT PRIMARY KEY,
    house_size INT,
    address TEXT,
    state TEXT,
    price INT
);
Insert the sample data into the table. Copy and paste the following commands.
INSERT INTO house_price_testing (id, house_size, address, state, price)
VALUES
(1, 1400, '500 Elm St', 'Nevada', 470000),
(2, 1900, '200 River Rd', 'California', 630000),
(3, 1600, '300 Mountain Ave', 'Colorado', 530000),
(4, 2200, '400 Lake Blvd', 'New York', 780000),
(5, 1300, '500 Creek Lane', 'Texas', 460000),
(6, 1700, '600 Valley View Rd', 'Florida', 510000),
(7, 1500, '700 Ocean St', 'Washington', 500000),
(8, 1800, '800 Sky Blvd', 'Oregon', 600000),
(9, 1200, '900 Meadow Ave', 'Illinois', 430000),
(10, 2100, '1000 Hill Rd', 'New Jersey', 760000);
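The introduction recommends reserving 20% of the training records as a validation dataset. The following is a minimal sketch of one way to do that in plain SQL. The table name house_price_validation and the rule of taking every fifth id are assumptions made for illustration; they are not prescribed by this guide. Because this step removes rows from house_price_training, skip it if you want to follow later topics with the full 20-record table.

```sql
-- Hypothetical validation table with the same structure as the training table.
CREATE TABLE house_price_validation LIKE house_price_training;

-- Copy every fifth record (ids 5, 10, 15, 20), which is 4 of 20 rows, or 20%.
-- The modulo rule is an illustrative, deterministic split, not a random sample.
INSERT INTO house_price_validation
SELECT *
FROM house_price_training
WHERE id % 5 = 0;

-- Remove the copied records so the validation rows are not also used for training.
DELETE FROM house_price_training
WHERE id % 5 = 0;
```

With larger real-world data, a random selection is usually preferable to a fixed id pattern, since ids may correlate with how the data was collected.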
Learn how to Train a Regression Model.