MySQL 5.7 Reference Manual Including MySQL NDB Cluster 7.5 and NDB Cluster 7.6
A semijoin is a preparation-time transformation that enables multiple execution strategies such as table pullout, duplicate weedout, first match, loose scan, and materialization. The optimizer uses semijoin strategies to improve subquery execution, as described in this section.
For an inner join between two tables, the join returns a row
from one table as many times as there are matches in the other
table. But for some questions, the only information that
matters is whether there is a match, not the number of
matches. Suppose that there are tables named
class
and roster
that
list classes in a course curriculum and class rosters
(students enrolled in each class), respectively. To list the
classes that actually have students enrolled, you could use
this join:
SELECT class.class_num, class.class_name FROM class INNER JOIN roster WHERE class.class_num = roster.class_num;
However, the result lists each class once for each enrolled student. For the question being asked, this is unnecessary duplication of information.
Assuming that class_num
is a primary key in
the class
table, duplicate suppression is
possible by using
SELECT
DISTINCT
, but it is inefficient to generate all
matching rows first only to eliminate duplicates later.
The same duplicate-free result can be obtained by using a subquery:
SELECT class_num, class_name FROM class WHERE class_num IN (SELECT class_num FROM roster);
Here, the optimizer can recognize that the
IN
clause requires the subquery to return
only one instance of each class number from the
roster
table. In this case, the query can
use a semijoin; that is,
an operation that returns only one instance of each row in
class
that is matched by rows in
roster
.
Outer join and inner join syntax is permitted in the outer query specification, and table references may be base tables, derived tables, or view references.
In MySQL, a subquery must satisfy these criteria to be handled as a semijoin:
It must be an IN
(or
=ANY
) subquery that appears at the top
level of the WHERE
or
ON
clause, possibly as a term in an
AND
expression. For example:
SELECT ... FROM ot1, ... WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, ot_
and i
it_
represent tables in the outer and inner parts of the
query, and
i
oe_
and
i
ie_
represent expressions that refer to columns in the outer
and inner tables.
i
It must not contain a GROUP BY
or
HAVING
clause.
It must not be implicitly grouped (it must contain no aggregate functions).
It must not have ORDER BY
with
LIMIT
.
The statement must not use the
STRAIGHT_JOIN
join type in the outer
query.
The STRAIGHT_JOIN
modifier must not be
present.
The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
The subquery may be correlated or uncorrelated.
DISTINCT
is permitted, as is
LIMIT
unless ORDER BY
is
also used.
If a subquery meets the preceding criteria, MySQL converts it to a semijoin and makes a cost-based choice from these strategies:
Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query.
Duplicate Weedout: Run the semijoin as if it was a join and remove duplicate records using a temporary table.
FirstMatch: When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows.
LooseScan: Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.
Materialize the subquery into an indexed temporary table that is used to perform a join, where the index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables; if not, the table is scanned. For more information about materialization, see Section 8.2.2.2, “Optimizing Subqueries with Materialization”.
Each of these strategies can be enabled or disabled using the
following optimizer_switch
system variable flags:
The semijoin
flag
controls whether semijoins are used.
If semijoin
is enabled,
the firstmatch
,
loosescan
,
duplicateweedout
, and
materialization
flags
enable finer control over the permitted semijoin
strategies.
If the duplicateweedout
semijoin strategy is disabled, it is not used unless all
other applicable strategies are also disabled.
If duplicateweedout
is
disabled, on occasion the optimizer may generate a query
plan that is far from optimal. This occurs due to
heuristic pruning during greedy search, which can be
avoided by setting
optimizer_prune_level=0
.
These flags are enabled by default. See Section 8.9.2, “Switchable Optimizations”.
The optimizer minimizes differences in handling of views and
derived tables. This affects queries that use the
STRAIGHT_JOIN
modifier and a view with an
IN
subquery that can be converted to a
semijoin. The following query illustrates this because the
change in processing causes a change in transformation, and
thus a different execution strategy:
CREATE VIEW v AS SELECT * FROM t1 WHERE a IN (SELECT b FROM t2); SELECT STRAIGHT_JOIN * FROM t3 JOIN v ON t3.x = v.a;
The optimizer first looks at the view and converts the
IN
subquery to a semijoin, then checks
whether it is possible to merge the view into the outer query.
Because the STRAIGHT_JOIN
modifier in the
outer query prevents semijoin, the optimizer refuses the
merge, causing derived table evaluation using a materialized
table.
EXPLAIN
output indicates the
use of semijoin strategies as follows:
Semijoined tables show up in the outer select. For
extended EXPLAIN
output,
the text displayed by a following
SHOW WARNINGS
shows the
rewritten query, which displays the semijoin structure.
(See Section 8.8.3, “Extended EXPLAIN Output Format”.) From this you
can get an idea about which tables were pulled out of the
semijoin. If a subquery was converted to a semijoin, you
can see that the subquery predicate is gone and its tables
and WHERE
clause were merged into the
outer query join list and WHERE
clause.
Temporary table use for Duplicate Weedout is indicated by
Start temporary
and End
temporary
in the Extra
column. Tables that were not pulled out and are in the
range of EXPLAIN
output
rows covered by Start temporary
and
End temporary
have their
rowid
in the temporary table.
FirstMatch(
in the tbl_name
)Extra
column indicates join
shortcutting.
LooseScan(
in the m
..n
)Extra
column indicates use of
the LooseScan strategy. m
and
n
are key part numbers.
Temporary table use for materialization is indicated by
rows with a select_type
value of
MATERIALIZED
and rows with a
table
value of
<subquery
.
N
>