5.8.2 Data Placement Keys

When data is loaded into MySQL HeatWave, it is partitioned by the table primary key and sliced horizontally for distribution among MySQL HeatWave nodes by default. However, this may not be optimal for queries involving JOIN or GROUP BY operations as they incur costs due to data redistribution among MySQL HeatWave nodes at query execution time

In case of such queries, the data placement key feature permits partitioning data by JOIN or GROUP BY key columns instead of primary keys. To use this feature, define data placement keys on the most frequently used JOIN keys and the keys of the longest running queries.

Tip

For data placement key recommendations, use MySQL HeatWave Autopilot Advisor after loading tables into MySQL HeatWave and running queries. For more information, see Section 5.8, “Optimize Workloads for OLAP”.

This topic includes the following sections:

Before You Begin

Review the requirements.
Connect to your MySQL DB System.
The examples in this topic use the sample database airportdb. To learn how to download the sample database, see AirportDB Analytics Quickstart.

Define a Data Placement Key

Defining a data placement key requires adding a column comment with the data placement keyword string:

RAPID_COLUMN=DATA_PLACEMENT_KEY=N

where N is an index value that defines the priority order of data placement keys. The rules for assigning a data placement index are:

The index must start with 1.
Index values range from 1 to 16.
Each column must have an unique index value. For example, you cannot assign the index value of 2 to more than one column in the same table.
Always assign consecutive index values, without gaps.For example, if you define a data placement key column with an index value of 3, there must also be two other data placement key columns with index values of 1 and 2, respectively.

The following example shows multiple columns defined as data placement keys. Although a primary key is defined, data is partitioned by the data placement keys, which are prioritized over the primary key.

mysql> CREATE TABLE airline_discount(
id INT PRIMARY KEY, 
date DATE COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=1',
discount FLOAT COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=2');
Query OK, 0 rows affected (0.2448 sec)

When defining multiple columns as data placement keys, prioritize the keys according to query cost. For example, assign DATA_PLACEMENT_KEY=1 to the key of the costliest query, and DATA_PLACEMENT_KEY=2 to the key of the second costliest query, and so on.

Note

Other information is permitted in column comments. For example, it is permitted to specify a column description alongside a data placement keyword string:

COMMENT 'column_description RAPID_COLUMN=DATA_PLACEMENT_KEY=1'

View Data Placement Keys

To view the data placement type of the tables loaded into MySQL HeatWave, PrimaryKey means the data placement key is using the primary key of the table:

mysql> SELECT SCHEMA_NAME, TABLE_NAME, DATA_PLACEMENT_TYPE, LOAD_STATUS 
FROM performance_schema.rpd_table_id 
JOIN performance_schema.rpd_tables USING (ID);

You can also view data placement keys for an individual table using SHOW CREATE TABLE.

For information about data placement key index values, see Section 5.8.2, “Data Placement Keys”.

Verify Data Placement Partition Usage for JOIN and GROUP BY Queries

To determine whether a JOIN or GROUP BY query used data placement partitions, examine the QEP_TEXT column in the performance_schema.rpd_query_stats table. This column contains prepart (pre-partitioning) data, which indicates how the query was distributed across MySQL HeatWave nodes.

For GROUP BY operations, the prepart data is displayed as:

"prepart": #

Where, # represents the number of nodes.

For JOIN operations, the prepart data appears as:

"prepart": [#, #]

where [#, #] represents the number of MySQL HeatWave nodes used in the left and the right branches respectively. If either value is greater than 1, that branch used data placement partitions. If both values are 1 (i.e., "prepart": [1, 1]), then data placement partitions is not used both the branches.

To query QEP_TEXT prepart data for the last executed query:

mysql> SELECT CONCAT( '"prepart":[', (JSON_EXTRACT(QEP_TEXT->>"$**.prepart", '$[0][0]')), 
          "," ,(JSON_EXTRACT(QEP_TEXT->>"$**.prepart", '$[0][1]')) , ']' )
          FROM performance_schema.rpd_query_stats 
          WHERE query_id = (select max(query_id)
          FROM performance_schema.rpd_query_stats);
+-----------------------------------------------------------------------------+
| concat( '"prepart":[', (JSON_EXTRACT(QEP_TEXT->>"$**.prepart", '$[0][0]')), |
|"," ,(JSON_EXTRACT(QEP_TEXT->>"$**.prepart", '$[0][1]')) , ']' )             |
+-----------------------------------------------------------------------------+
| "prepart":[2,2]                                                             |
+-----------------------------------------------------------------------------+

Refer to the following example to understand how to use MySQL HeatWave data placement keys to improve query performance by controlling data distribution and verifying execution plan efficiency.

To begin with, check whether the target table (airline_discount) has any existing data placement keys, using SELECT statement:

mysql> SELECT COLUMN_NAME, COLUMN_COMMENT 
      FROM INFORMATION_SCHEMA.COLUMNS
WHERE 
      TABLE_NAME LIKE 'airline_discount' 
AND 
      COLUMN_COMMENT LIKE '%DATA_PLACEMENT_KEY%';/
   add output

Modify the table (airline_discount) to add (airline_discount)the from and to columns as placement keys, so that MySQL HeatWave can locate related data across nodes and reduce data movement during joins.
```
mysql> ALTER TABLE flight
      MODIFY `from` smallint NOT NULL COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=1',
      MODIFY `to` smallint NOT NULL COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=2';
Add output
```
Use the following query to inspect the number of partitions used in the most recent offloaded query:
```
mysql> SELECT QEP_TEXT->>"$**.nPreparts"  FROM performance_schema.rpd_query_stats
       WHERE query_id = (select max(query_id) FROM performance_schema.rpd_query_stats);

              Add output
```
The nPreparts value represents the number of partitions involved. A lower nPreparts value indicates more efficient data placement and reduced inter-node communication.

Important Notes

JOIN and GROUP BY query optimizations are only applied if at least one of the JOIN or GROUP BY relations has a key that matches the defined data placement key.
If you execute a JOIN operation with or without the JOIN and GROUP BY query optimization, a compilation-time cost model determines how the query is executed. The cost model uses estimated statistics.
A data placement key cannot be defined on a dictionary-encoded string column but are permitted on variable-length encoded columns. MySQL HeatWave applies variable-length encoding to string columns by default. See Section 5.8.1.4, “Encoding String Columns”.
A data placement key can only be defined on a column with a supported data type. See Section 4.2.1, “Supported Data Types for MySQL HeatWave”.
A data placement key column cannot be defined as a NOT SECONDARY column. See Section 4.2.6.1, “Exclude Table Columns”.

What's Next

Learn about Autopilot Advisor.