MySQL Analytics Engine
MySQL Analytics Engine enables provisioning of analytics clusters for accelerated processing of analytic queries. An analytics cluster comprises a MySQL DB System node and two or more query processing nodes. The MySQL DB System node includes an Analytics plugin that is responsible for cluster management, query scheduling, and returning query results to the MySQL DB System. The analytics nodes store data in memory and process analytics queries. Each analytics node contains an instance of the MySQL Analytics Engine.
The MySQL Analytics Engine is a distributed, scalable, shared-nothing, in-memory, columnar, query-processing engine designed for fast execution of analytic queries.
When an analytics cluster is enabled, queries that meet certain prerequisites are automatically offloaded from the MySQL DB System to the analytics cluster for accelerated execution. Queries are issued from a MySQL client or application that interacts with the Analytics Cluster by connecting to the MySQL DB System node. Results are returned to the MySQL DB System node and to the MySQL client or application that issued the query.
This guide describes how to deploy and manage an analytics cluster. It is intended for MySQL Analytics Service Administrators and assumes familiarity with MySQL and tools. After deploying an analytics cluster, refer to the MySQL Database Service Analytics Engine User's Guide for information about how to load data into an analytics cluster and run queries.
Alternatively, refer to the MySQL Analytics Engine Quickstart, which walks you though adding an analytics cluster, loading data, and running queries.
For information about MySQL, see the product documentation available at MySQL Documentation.
Adding an Analytics Cluster to a DB System
To add an analytics cluster to an existing MySQL DB System:
- The DB System was created using the BM.Standard.E2.64 or MySQL.Analytics.VM.Standard.E3 shape.
Managing Analytics Clusters
This topic describes how to manage your Analytics Clusters using the console. The following topics are described:
Analytics Cluster Information
This section describes the Analytics Cluster Information and Analytics Nodes sections of the MySQL DB System Details page, which enables you to manage, view, and edit your analytics cluster.
Table 12-1 Analytics Cluster Information
Field | Description |
---|---|
Cluster Size | The number of nodes in the cluster. |
Analytics Node Shape | The shape used for analytics cluster nodes. |
State | The state of the analytics cluster:
|
Created | The date and time the analytics cluster was created. |
Last Updated | The date and time the analytics cluster was last updated. |
Table 12-2 Analytics Nodes
Column | Description |
---|---|
Node id | The name of the analytics nodes. |
State | Current state of the analytics node:
|
Created | The date and time the analytics node was created. |
Checking Analytics Cluster Status
This topic describes how to check the status of an analytics cluster.
Cluster Failure and Recovery
This topic discusses analytics cluster failure and recovery.
When an analytics node failure is detected, MySQL Analytics automatically attempts to bring the node back online and reform the cluster.
Analytics cluster status can be monitored on the MySQL DB Systems list page. See Checking Analytics Cluster Status for more information. MySQL Analytics monitors node status using a heartbeat mechanism. If there is no response from a node after 60 seconds, the node is considered down, which triggers the recovery process.
When the analytics cluster is reformed, MySQL Analytics attempts to reload the tables that were previously loaded in the cluster. The tables are reloaded from the MySQL Server. The time to reload tables depends on data size. You can expect the operation to take approximately the same amount of time it took to load the tables into the analytics cluster initially.
To monitor the recovery process, you can connect to the MySQL Server using a client, such as MySQL Shell, to query the rapid_service_status variable, which reports the status of the cluster. For information about connecting to the MySQL Server, see Connecting to a DB System.
mysql> SHOW STATUS LIKE 'rapid_service_status';
+----------------------+--------------+
| Variable_name | Value |
+----------------------+--------------+
| rapid_service_status | CLUSTERREADY |
+----------------------+--------------+
-
OFFLINE
: The cluster has not formed yet. No operations can run. -
CLUSTERREADY
: The cluster has formed, but some tables are pending recovery. Operations are permitted on tables that are loaded. To determine which tables are loaded, you can query the RAPID Performance Schema tables.mysql> USE performance_schema; mysql> SELECT NAME, LOAD_STATUS from rpd_tables,rpd_table_id where rpd_tables.ID = rpd_table_id.ID;
-
ONLINE
: The cluster is established and previously loaded tables are reloaded. Operations are permitted. -
RECOVERYFAILED
: There was an unsuccessful attempt to recover the tables after a cluster failure. The cause of the failure could be a network error, a software failure, a hardware failure, or some other issue that cannot be resolved by the automated recovery process. In most such scenarios, stopping and starting the cluster can resolve the issue.One possible cause is that the cluster has run out of memory. After the analytics cluster is bootstrapped, ensure that you have enough cluster nodes for your data by performing a node count estimate. See Generating a Node Count Estimate.
Analytics Cluster Size Estimates (Auto Provisioning)
This topic describes Analytics Cluster Node Count Estimates.
Auto Provisioning provides recommendations on how many analytics nodes are needed to run a workload. When the service is started, database tables on which analytics queries are run need to be loaded to MySQL Analytics cluster memory. The size of the cluster needed depends on tables and columns required to load, and the compression achieved in memory for this data. Under-provisioning the cluster results in data load or query execution failure due to space limitations. Over-provisioning the cluster results in additional costs for unneeded resources. Based on the database tables the user intends to load to memory, Auto Provisioning uses machine learning to intelligently predict the number of analytics nodes needed.
The following topics are described:
Generating a Node Count Estimate
This topic describes how to generate an analytics node count estimate to determine the number of nodes required for your data.
A node count estimate is generated using Machine Learning techniques based on the node shape that you select and the data present in the MySQL DB System associated with the analytics cluster. You can generate a node count estimate when adding an analytics cluster to a MySQL DB System, or at any time after to adjust the number of nodes as your data increases or decreases in size.
- The data you intend to load into the analytics cluster must be present on the MySQL DB System.
- Before performing a node count estimate, prepare the data on the MySQL DB System:
- Exclude columns with data types that are not supported by the RAPID Query Processing Engine.
- Define string column encodings, if necessary. String columns are encoded as
VARLEN
columns by default.
- Optionally, log into your MySQL DB System and run
ANALYZE TABLE
on tables you intend to load into the analytics cluster. Estimates should generally be valid without runningANALYZE TABLE
, but runningANALYZE TABLE
ensures that estimates are as accurate as possible.
Node Count Estimate Table Errors
This topic describes table errors that may appear in node count estimate results.
Node Count Estimate Table Errors
Table Error | Description |
---|---|
TOO MANY COLUMNS TO LOAD | The table has too many columns. The column limit is 470. |
CONTAINS UNSUPPORTED COLUMN TYPES | The table has unsupported column types. Unsupported column types must be defined as NOT SECONDARY . For a list of supported column types and information about defining columns as NOT SECONDARY , refer to the MySQL Analytics User's Guide |
ALL COLUMNS MARKED AS NOT SECONDARY | There are no columns to load. All table columns are defined as NOT SECONDARY .
|
CONTAINS VARLEN COLUMN WITH >8000 BYTES | A VARLEN column exceeds the 8000 byte limit. For more information on VARLEN , see VARLEN Encoding |
ESTIMATION COULD NOT BE CALCULATED | The estimate could not be calculated. For example, a table estimate may not be available if statistics for VARLEN columns are unavailable.
|
MySQL Analytics Engine Quickstart
The MySQL Analytics Engine Quickstart walks you through the steps required to add an Analytics Cluster to a MySQL DB System, load data into the Analytics Cluster, and run queries.
The following topics are described:
Prerequisites
This topic lists the prerequisites for the MySQL Analytics Engine Quickstart.
- An operational MySQL DB System created using a BM.Standard.E2.64 or MySQL.Analytics.VM.Standard.E3 shape. If you do not have a MySQL DB System, refer to Getting Started with MySQL Database Service for the steps required to create one. Make note of the IP address of the MySQL Endpoint in the DB System, and the administration user and password.
- A running Compute instance (Oracle Linux is used in this Quickstart) attached to a public subnet on the same VCN as the MySQL DB System. Make note of the public IP address of the compute instance. For information about setting up a Compute instance, refer to Creating a Compute Instance.
- MySQL Shell 8.0.21 or higher installed on the Compute instance. For installation instructions, see Connecting to the MySQL DB System with SSH and MySQL Shell.
- Access to Object Storage and an existing bucket.
- The name of your Object Storage namespace. See Understanding Object Storage Namespaces for more information.
- A valid OCI CLI configuration file. See SDK and CLI Configuration File. If you have installed and configured the Command Line Interface in the default location, you have a valid configuration file. If you have not installed and configured the CLI, you must either install it, or create a configuration file manually. MySQL Shell references the CLI configuration file when accessing the Object storage bucket.
- In addition to the mandatory policies for the MySQL DB System, you, or your group, have been granted the
mysql-analytics
policies described in Policy Details for MySQL Database Service. - The
tpch
sample database referenced in the Quickstart is an ad-hoc decision support database derived from the TPC Benchmark™ H (TPC-H) specification. The database consist of eight separate tables. For an overview of the database schema, refer to the Logical Database Design section of the specification document. Example DDL statements are provided for creating the schema and tables, but you must supply your own data files for populating thetpch
sample database. You can generate data using thedbgen
tool provided in the TPC-H Tools download from TPC Download Current. At least 1GB of data is recommended.
Uploading Sample Data to Object Storage
This topic describes how to upload your sample data files for the tpch
sample database to an Object Storage bucket.
nation.tbl
region.tbl
part.tbl
supplier.tbl
partsupp.tbl
customer.tbl
orders.tbl
lineitem.tbl
- In the Oracle Cloud Infrastructure Console navigation menu, go to Object Storage, and then select Object Storage.
- From the list of Buckets, click Create Bucket or select your object storage bucket.
- On the Bucket Details page, under Objects, click Upload.
- In the Upload Objects panel, drag and drop the sample database files to the dropzone, or click select files to locate them on your machine.
- Click Upload, and then click Close.
Creating the Sample Database and Importing Data
This topic describes how to create the tpch
sample database on the MySQL DB System and import the sample data. The sample data must be available on the MySQL DB System before it can be loaded into the Analytics Cluster.
Sample database creation and import operations are performed using MySQL Shell. The MySQL Shell parallel table import utility provides fast data import for large data files. The utility analyzes an input data file, divides it into chunks, and uploads the chunks to the target MySQL DB System using parallel connections. The utility is capable of completing a large data import many times faster than a standard single-threaded upload using a LOAD DATA
statement. For additional information, see Parallel Table Import Utility.
If you prefer to use your own data instead of the
tpch
sample database, refer to Importing and Exporting for information about loading data into the MySQL DB System. After loading data into the MySQL DB System, refer to the MySQL Database Service Analytics Engine User Guide for information about preparing data and loading it into the Analytics Cluster.
tpch
sample database on the MySQL DB System and import data:
Adding an Analytics Cluster
This topic describes how to add an analytics cluster to your MySQL DB System.
To add an Analytics Cluster:
- In the Oracle Cloud Infrastructure Console navigation menu, go to MySQL, and then select DB Systems.
- On the MySQL DB Systems dialog, select a DB System. Select a different Compartment if necessary to locate the DB System that you want to use.
- On the MySQL DB Systems Information dialog, select Add Analytics Cluster from the More Actions drop-down menu.
- On the Add Analytics Cluster dialog, click Select Shape to select a shape for your analytics nodes. Select the same shape used by the MySQL DB System. The shape defines the number of CPU cores, the amount of RAM, and so on.
- Define the number of analytics nodes to create. The default number of nodes is 2, which is a sufficient number of nodes for the MySQL Analytics Service Quickstart, assuming roughly 1GB of data. Optionally, you can perform a node count estimate to determine an appropriate number of nodes to create. For details, see Generating a Node Count Estimate.
- Click Add Analytics Cluster to create the Analytics Cluster.
Loading Sample Data Into the Analytics Cluster
This topic describes how to load data into the analytics cluster.
RAPID
as the secondary engine, and executing table load operations.
Running Queries
This topic describes how to query data in the analytics cluster.
For other tpch
sample database queries that you can run, see Additional Queries. For additional information about running queries, refer to the MySQL Database Service Analytics Engine User Guide.
Additional Queries
This topic provides additional queries that you can run to test the Analytics Cluster.
TPCH-Q1: Pricing Summary Report Query
As described in the TPC Benchmark™ H (TPC-H) specification: "The Pricing Summary Report Query provides a summary pricing report for all lineitems shipped as of a given date. The date is within 60 - 120 days of the greatest ship date contained in the database. The query lists totals for extended price, discounted extended price, discounted extended price plus tax, average quantity, average extended price, and average discount. These aggregates are grouped by RETURNFLAG
and LINESTATUS
, and listed in ascending order of RETURNFLAG
and LINESTATUS
. A count of the number of lineitems in each group is included."
SELECT
l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM
lineitem
WHERE
l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
GROUP BY l_returnflag , l_linestatus
ORDER BY l_returnflag , l_linestatus;
TPCH-Q3: Shipping Priority Query
As described in the TPC Benchmark™ H (TPC-H) specification: "The Shipping Priority Query retrieves the shipping priority and potential revenue, defined as the sum of l_extendedprice * (1-l_discount)
, of the orders having the largest revenue among those that had not been shipped as of a given date. Orders are listed in decreasing order of revenue. If more than 10 unshipped orders exist, only the 10 orders with the largest revenue are listed."
SELECT
l_orderkey,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
o_orderdate,
o_shippriority
FROM
customer,
orders,
lineitem
WHERE
c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < DATE '1995-03-15'
AND l_shipdate > DATE '1995-03-15'
GROUP BY l_orderkey , o_orderdate , o_shippriority
ORDER BY revenue DESC , o_orderdate
LIMIT 10;
TPCH-Q9: Product Type Profit Measure Query
As described in the TPC Benchmark™ H (TPC-H) specification: "The Product Type Profit Measure Query finds, for each nation and each year, the profit for all parts ordered in that year that contain a specified substring in their names and that were filled by a supplier in that nation. The profit is defined as the sum of [(l_extendedprice*(1-l_discount)) - (ps_supplycost * l_quantity)]
for all lineitems describing parts in the specified line. The query lists the nations in ascending alphabetical order and, for each nation, the year and profit in descending order by year (most recent first). "
SELECT
nation, o_year, SUM(amount) AS sum_profit
FROM
(SELECT
n_name AS nation,
YEAR(o_ORDERdate) AS o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
FROM
part
STRAIGHT_JOIN partsupp
STRAIGHT_JOIN lineitem
STRAIGHT_JOIN supplier
STRAIGHT_JOIN orders
STRAIGHT_JOIN nation
WHERE
s_suppkey = l_suppkey
AND ps_suppkey = l_suppkey
AND ps_partkey = l_partkey
AND p_partkey = l_partkey
AND o_ORDERkey = l_ORDERkey
AND s_nationkey = n_nationkey
AND p_name LIKE '%green%') AS profit
GROUP BY nation , o_year
ORDER BY nation , o_year DESC;