MySQL 8.0 Reference Manual Including MySQL NDB Cluster 8.0

8.2.2.1 Optimizing IN and EXISTS Subquery Predicates with Semijoin Transformations

A semijoin is a preparation-time transformation that enables multiple execution strategies such as table pullout, duplicate weedout, first match, loose scan, and materialization. The optimizer uses semijoin strategies to improve subquery execution, as described in this section.

For an inner join between two tables, the join returns a row from one table as many times as there are matches in the other table. But for some questions, the only information that matters is whether there is a match, not the number of matches. Suppose that there are tables named class and roster that list classes in a course curriculum and class rosters (students enrolled in each class), respectively. To list the classes that actually have students enrolled, you could use this join:

SELECT class.class_num, class.class_name
    FROM class 
    INNER JOIN roster
    WHERE class.class_num = roster.class_num;

However, the result lists each class once for each enrolled student. For the question being asked, this is unnecessary duplication of information.

Assuming that class_num is a primary key in the class table, duplicate suppression is possible by using SELECT DISTINCT, but it is inefficient to generate all matching rows first only to eliminate duplicates later.

The same duplicate-free result can be obtained by using a subquery:

SELECT class_num, class_name
    FROM class
    WHERE class_num IN 
        (SELECT class_num FROM roster);

Here, the optimizer can recognize that the IN clause requires the subquery to return only one instance of each class number from the roster table. In this case, the query can use a semijoin; that is, an operation that returns only one instance of each row in class that is matched by rows in roster.

The following statement, which contains an EXISTS subquery predicate, is equivalent to the previous statement containing an IN subquery predicate:

SELECT class_num, class_name
    FROM class
    WHERE EXISTS 
        (SELECT * FROM roster WHERE class.class_num = roster.class_num);

In MySQL 8.0.16 and later, any statement with an EXISTS subquery predicate is subject to the same semijoin transforms as a statement with an equivalent IN subquery predicate.

Beginning with MySQL 8.0.17, the following subqueries are transformed into antijoins:

In short, any negation of a subquery of the form IN (SELECT ... FROM ...) or EXISTS (SELECT ... FROM ...) is transformed into an antijoin.

An antijoin is an operation that returns only rows for which there is no match. Consider the query shown here:

SELECT class_num, class_name
    FROM class
    WHERE class_num NOT IN 
        (SELECT class_num FROM roster);

This query is rewritten internally as the antijoin SELECT class_num, class_name FROM class ANTIJOIN roster ON class_num, which returns one instance of each row in class that is not matched by any rows in roster. This means that, for each row in class, as soon as a match is found in roster, the row in class can be discarded.

Antijoin transformations cannot in most cases be applied if the expressions being compared are nullable. An exception to this rule is that (... NOT IN (SELECT ...)) IS NOT FALSE and its equivalent (... IN (SELECT ...)) IS NOT TRUE can be transformed into antijoins.

Outer join and inner join syntax is permitted in the outer query specification, and table references may be base tables, derived tables, view references, or common table expressions.

In MySQL, a subquery must satisfy these criteria to be handled as a semijoin (or, in MySQL 8.0.17 and later, an antijoin if NOT modifies the subquery):

If a subquery meets the preceding criteria, MySQL converts it to a semijoin (or, in MySQL 8.0.17 or later, an antijoin if applicable) and makes a cost-based choice from these strategies:

Each of these strategies can be enabled or disabled using the following optimizer_switch system variable flags:

These flags are enabled by default. See Section 8.9.2, “Switchable Optimizations”.

The optimizer minimizes differences in handling of views and derived tables. This affects queries that use the STRAIGHT_JOIN modifier and a view with an IN subquery that can be converted to a semijoin. The following query illustrates this because the change in processing causes a change in transformation, and thus a different execution strategy:

CREATE VIEW v AS
SELECT *
FROM t1
WHERE a IN (SELECT b
           FROM t2);

SELECT STRAIGHT_JOIN *
FROM t3 JOIN v ON t3.x = v.a;

The optimizer first looks at the view and converts the IN subquery to a semijoin, then checks whether it is possible to merge the view into the outer query. Because the STRAIGHT_JOIN modifier in the outer query prevents semijoin, the optimizer refuses the merge, causing derived table evaluation using a materialized table.

EXPLAIN output indicates the use of semijoin strategies as follows: