聊聊MySQL的子查询

1. 背景

在之前介绍MySQL执行计划的博文中已经谈及了一些关于子查询相关的执行计划与优化。本文将重点介绍MySQL中与子查询相关的内容，设计子查询优化策略，包含半连接子查询的优化与非半连接子查询的优化。其中非半连接子查询优化由于策略较少不作详细展开。

2. 子查询概念

子查询简单理解就是在sql中嵌套了select查询子句。子查询的优点在于它的可读性会比较高（相比写join和union）。子查询根据查询结果的形式可以大致分为标量子查询、列子查询、行子查询、表子查询等。根据与外部语句的关联性，也可以分为相关子查询和独立子查询。

有一个谬误是所有子查询能做的事情，用连接也能做。

举例，如下在where中嵌套一个其他表最大值子查询来等值比较的子查询就没法用join做到（不要钻牛角尖硬套join）



select x from a where x  = (select max(y) from b);

3. 子查询的效率

很多人都会关心子查询效率，是否子查询真的就很慢？这个问题很难回答。但在MySQL低版本（5.5以下）对子查询的支持确实不是太好。在《高性能MySQL》一书中的6.5.1（关联子查询）章节中也提及了MySQL对于where子句中带有in子查询的处理，就是用exists转写为关联子查询，这样的话查询复杂度非常高，性能很差。所以在这种情况下，建议把子查询改写为关联查询并保证从表的连接字段上有索引。

4. 子查询的优化

在MySQL 5.6及更高版本，有了更多的子查询优化策略，具体如下：

对于in或者=any子查询：
- Semi-Join（半连接）
- Materialization（物化）
- EXISTS策略
对于not in或者<>any子查询：
- Materialization（物化）
- EXISTS策略
对于派生表（from子句中的子查询）：
- 把派生表结果与外部查询块合并
- 把派生表物化为临时表

4.1 半连接优化

只有满足如下条件的子查询，MySQL可以应用半连接优化。

in或者=any出现在where或者on的顶层
子查询为不含union的单一select
子查询不含有group和having
子查询不能带有聚合函数
子查询不能带有order by... limit
子查询不能带有straight_join查询提示
外层表和内层表的总数不能超过join允许最大的表数量

如果满足如上条件，MySQL会基于成本代价从如下几种优化策略选择一个来进行半连接子查询优化

4.1.1 Table pullout优化

Table pullout优化将半连接子查询的表提取到外部查询，将查询改写为join。这与人们在使用低版本MySQL时的常见的手工调优策略很相似。

下面来举例看看：

首先建立两张表，一张为stu（学生表），一张为stu_cls（学生班级关联表）

create table stu (id int primary key auto_increment);

create table stu_cls(s_id int, c_id int, unique key(s_id));

往stu表里初始化一些数据

insert into stu select null;

insert into stu select null from stu;

insert into stu select null from stu;

insert into stu select null from stu;

insert into stu select null from stu;

insert into stu select null from stu;

insert into stu select null from stu;

insert into stu select null from stu;

insert into stu select null from stu;

insert into stu select null from stu;

往stu_cls表中初始化一些数据

insert into stu_cls select 1,1;

insert into stu_cls select 2,1;

insert into stu_cls select 3,2;

insert into stu_cls select 4,3;

insert into stu_cls select 5,2;

insert into stu_cls select 6,2;

insert into stu_cls select 7,4;

insert into stu_cls select 8,2;

insert into stu_cls select 9,1;

insert into stu_cls select 10,3;

在MySQL5.5的环境下，当查询下面的执行计划时，显示的结果很可能是这样的

mysql> explain extended select * from stu where id in (select s_id from stu_cls)\G

*************************** 1. row ***************************

           id: 1

  select_type: PRIMARY

        table: stu

         type: index

possible_keys: NULL

          key: PRIMARY

      key_len: 4

          ref: NULL

         rows: 512

     filtered: 100.00

        Extra: Using where; Using index

*************************** 2. row ***************************

           id: 2

  select_type: DEPENDENT SUBQUERY

        table: stu_cls

         type: index_subquery

possible_keys: s_id

          key: s_id

      key_len: 5

          ref: func

         rows: 1

     filtered: 100.00

        Extra: Using index; Using where

2 rows in set, 1 warning (0.01 sec)

可以看到stu进行了一次索引全扫描（因为我们的试验表结构比较简单，stu表里就一个主键id)，子查询的类型为相关子查询。近一步查看MySQL改写过的SQL。

mysql> show warnings\G

*************************** 1. row ***************************

  Level: Note

   Code: 1003

Message: select `test`.`stu`.`id` AS `id` from `test`.`stu` where <in_optimizer>(`test`.`stu`.`id`,<exists>(<index_lookup>(<cache>(`test`.`stu`.`id`) in stu_cls on s_id where (<cache>(`test`.`stu`.`id`) = `test`.`stu_cls`.`s_id`))))

1 row in set (0.00 sec)

可以看到在MySQL5.5中采用了EXISTS策略将in子句转写为了exists子句。

在MySQL5.7的环境下，上面的执行计划如下所示

mysql> explain select * from stu where id in (select s_id from stu_cls)\G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: stu_cls

   partitions: NULL

         type: index

possible_keys: s_id

          key: s_id

      key_len: 5

          ref: NULL

         rows: 10

     filtered: 100.00

        Extra: Using where; Using index

*************************** 2. row ***************************

           id: 1

  select_type: SIMPLE

        table: stu

   partitions: NULL

         type: eq_ref

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: test.stu_cls.s_id

         rows: 1

     filtered: 100.00

        Extra: Using index

2 rows in set, 1 warning (0.00 sec)

可以看到执行计划中的两行id是一样的，从rows列的信息可以看到，预估扫描记录数仅仅为之前的2%；查看改写过的SQL如下

mysql> show warnings\G

*************************** 1. row ***************************

  Level: Note

   Code: 1003

Message: /* select#1 */ select `test`.`stu`.`id` AS `id` from `test`.`stu_cls` join `test`.`stu` where (`test`.`stu`.`id` = `test`.`stu_cls`.`s_id`)

1 row in set (0.00 sec)

可以看到原来的子查询已经被改写为join了。

值得一提的是，Table pullout优化没有单独的优化器开关。如果想要禁用Table pullout优化，则需要禁用semijoin优化。如下所示：

mysql> set optimizer_switch='semijoin=off';

Query OK, 0 rows affected (0.00 sec)

mysql> explain select * from stu where id in (select s_id from stu_cls)\G

*************************** 1. row ***************************

           id: 1

  select_type: PRIMARY

        table: stu

   partitions: NULL

         type: index

possible_keys: NULL

          key: PRIMARY

      key_len: 4

          ref: NULL

         rows: 512

     filtered: 100.00

        Extra: Using where; Using index

*************************** 2. row ***************************

           id: 2

  select_type: SUBQUERY

        table: stu_cls

   partitions: NULL

         type: index

possible_keys: s_id

          key: s_id

      key_len: 5

          ref: NULL

         rows: 10

     filtered: 100.00

        Extra: Using index

2 rows in set, 1 warning (0.00 sec)

mysql> show warnings\G

*************************** 1. row ***************************

  Level: Note

   Code: 1003

Message: /* select#1 */ select `test`.`stu`.`id` AS `id` from `test`.`stu` where <in_optimizer>(`test`.`stu`.`id`,`test`.`stu`.`id` in ( <materialize> (/* select#2 */ select `test`.`stu_cls`.`s_id` from `test`.`stu_cls` where 1 ), <primary_index_lookup>(`test`.`stu`.`id` in <temporary table> on <auto_key> where ((`test`.`stu`.`id` = `materialized-subquery`.`s_id`)))))

1 row in set (0.00 sec)

可以看到当禁用semijoin优化后，MySQL5.7会采用非半连接子查询的物化策略来优化来处理此查询。

4.1.2 FirstMatch优化

下面来讲讲FirstMatch优化，这也是在处理半连接子查询时可能会用到的一种优化策略。

下面展示了一个构造FirstMatch的小demo例子：

create table department (id int primary key auto_increment);

create table employee (id int primary key auto_increment, dep_id int, key(dep_id));

mysql> explain select * from department where id in (select dep_id from employee)\G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: department

   partitions: NULL

         type: index

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: NULL

         rows: 1

     filtered: 100.00

        Extra: Using index

*************************** 2. row ***************************

           id: 1

  select_type: SIMPLE

        table: employee

   partitions: NULL

         type: ref

possible_keys: dep_id

          key: dep_id

      key_len: 5

          ref: test.department.id

         rows: 1

     filtered: 100.00

        Extra: Using index; FirstMatch(department)

2 rows in set, 1 warning (0.00 sec)

mysql> show warnings\G

*************************** 1. row ***************************

  Level: Note

   Code: 1003

Message: /* select#1 */ select `test`.`department`.`id` AS `id` from `test`.`department` semi join (`test`.`employee`) where (`test`.`employee`.`dep_id` = `test`.`department`.`id`)

1 row in set (0.00 sec)

我们可以看到上面查询计划中，两个id都为1，且extra中列可以看到FirstMatch(department)。MySQL使用了连接来处理此查询，对于department表的行，只要能在employee表中找到1条满足即可以不必再检索employee表。从语义角度来看，和IN-to-EXIST策略转换为Exist子句是相似的，区别就是FirstMath以连接形式执行查询，而不是子查询。

4.1.3 Semi-join Materialization优化

Semi-join Materialization优化策略是将半连接的子查询物化为临时表。

我们可以将上面FirstMatch优化策略中的例子进行改造以构造Materialization优化的例子。

mysql> alter table employee drop index dep_id;

Query OK, 0 rows affected (0.10 sec)

Records: 0  Duplicates: 0  Warnings: 0

mysql> explain select * from department where id in (select dep_id from employee)\G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: <subquery2>

   partitions: NULL

         type: ALL

possible_keys: NULL

          key: NULL

      key_len: NULL

          ref: NULL

         rows: NULL

     filtered: 100.00

        Extra: Using where

*************************** 2. row ***************************

           id: 1

  select_type: SIMPLE

        table: department

   partitions: NULL

         type: eq_ref

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: <subquery2>.dep_id

         rows: 1

     filtered: 100.00

        Extra: Using index

*************************** 3. row ***************************

           id: 2

  select_type: MATERIALIZED

        table: employee

   partitions: NULL

         type: ALL

possible_keys: NULL

          key: NULL

      key_len: NULL

          ref: NULL

         rows: 1

     filtered: 100.00

        Extra: NULL

3 rows in set, 1 warning (0.00 sec)

mysql> show warnings\G

*************************** 1. row ***************************

  Level: Note

   Code: 1003

Message: /* select#1 */ select `test`.`department`.`id` AS `id` from `test`.`department` semi join (`test`.`employee`) where (`test`.`department`.`id` = `<subquery2>`.`dep_id`)

1 row in set (0.00 sec)

上述操作，先是drop掉了employee表上的dep_id索引，然后查看相同的执行计划。可以看到查询计划返回了三段结果，其中前两个id为1，最后一个id为2，且select_type为MATERIALIZED。这说明MySQL对于此查询会创建临时表，会将employee表物化为临时表,然后再以临时表为驱动表与department进行连接。

使用该优化策略需要打开优化器的semijoin和materialization开关。

4.1.4 Duplicate Weedout优化

这种优化策略就是将半连接子查询转换为INNER JOIN，再删除重复重复记录。

weed out有清除杂草的含义，所谓Duplicate Weedout优化策略就是先连接，得到有重复的结果集，再从中删除重复记录，好比是清理杂草。

由于构造Duplicate Weedout例子比较困难，我们不妨这里显式关闭半连接其他几个优化策略:

set optimizer_switch = 'materialization=off';

set optimizer_switch = 'firstmatch=off';

set optimizer_switch = 'loosescan=off';

以上面Semi-join Materialization优化中的表继续查询相同的执行计划:

mysql> explain select * from department where id in (select dep_id from employee)\G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: employee

   partitions: NULL

         type: ALL

possible_keys: NULL

          key: NULL

      key_len: NULL

          ref: NULL

         rows: 1

     filtered: 100.00

        Extra: Using where; Start temporary

*************************** 2. row ***************************

           id: 1

  select_type: SIMPLE

        table: department

   partitions: NULL

         type: eq_ref

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: test.employee.dep_id

         rows: 1

     filtered: 100.00

        Extra: Using index; End temporary

2 rows in set, 1 warning (0.00 sec)

mysql> show warnings\G

*************************** 1. row ***************************

  Level: Note

   Code: 1003

Message: /* select#1 */ select `test`.`department`.`id` AS `id` from `test`.`department` semi join (`test`.`employee`) where (`test`.`department`.`id` = `test`.`employee`.`dep_id`)

1 row in set (0.00 sec)

可以看到两段结果中Extra中分别有Start temporary和End temporary，这正是Duplicate Weedout过程中创建临时表，保存中间结果到临时表，从临时表中删除重复记录的过程。

4.1.5 LooseScan优化

LooseScan（松散索引扫描）和group by优化中的LooseScan差不多。

它采用松散索引扫描读取子查询的表，然后作为驱动表，外部表作为被驱动表进行连接。

它可以优化如下形式的子查询

select ... from ... where expr in (select key_part1 from ... where ...);

select ... from ... where expr in (select key_part2 from ... where key_part1=常量);

我们新建一张employee_department关联表

create table employee_department(id int primary key auto_increment,d_id int,e_id int,unique key(e_id,d_id));

往employee和新建的employee_department关联表中填充一些数据（填充语句省略）

查询如下执行计划：

mysql> explain select * from employee where id in (select e_id from employee_department)\G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: employee_department

   partitions: NULL

         type: index

possible_keys: e_id

          key: e_id

      key_len: 10

          ref: NULL

         rows: 4

     filtered: 100.00

        Extra: Using where; Using index; LooseScan

*************************** 2. row ***************************

           id: 1

  select_type: SIMPLE

        table: employee

   partitions: NULL

         type: eq_ref

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: test.employee_department.e_id

         rows: 1

     filtered: 100.00

        Extra: NULL

2 rows in set, 1 warning (0.00 sec)

mysql> show warnings\G

*************************** 1. row ***************************

  Level: Note

   Code: 1003

Message: /* select#1 */ select `test`.`employee`.`id` AS `id`,`test`.`employee`.`dep_id` AS `dep_id` from `test`.`employee` semi join (`test`.`employee_department`) where (`test`.`employee`.`id` = `test`.`employee_departm

可以看到返回的结果中第一段的extra中带有LooseScan，且两段结果的id也都为1，也即采用了半连接查询优化中的松散索引扫描。

此查询MySQL会以employee_department来作为驱动表，读取employee_department的唯一索引，连接employee得到最终结果集。

LooseScan对应的优化器开关为loosecan

4.2 非半连接优化

以上部分介绍了基于半连接的子查询优化，对于如下一些非半连接子查询，则无法使用上面的几种优化方式了：

in与or混在一起

select *

from xxx

where expr1 in (select ...) or expr2;

用了not in

select *

from xxx

where expr not in (select ...);

select中套了子查询

select ...,(select ...)

from xxx;

having里套了子查询

select ...

from xxx

where expr

having (select ...);

子查询里用了union

select ...

form

where expr in (select ... union select ...)

这种非半连接子查询优化比较少，主要就是Materialization（物化）和IN-to-EXISTS（in转exist子句），这里不作过多介绍。

5. 参考

MySQL官方手册

《深入理解MariaDB与MySQL》