MySQL Execution Plan--DISTINCT语句优化
问题描述
在很多业务场景中业务需要过滤掉重复数据,对于MySQL数据库可以有多种SQL写法能实现这种需求,如:
-
使用DISTINCT,如:
SELECT DISTINCT username FROM hotel_owner WHERE username IN ('user001','user002');
-
使用GROUP BY,如:
SELECT username FROM hotel_owner WHERE username IN ('user001','user002') GROUP BY username;
-
使用LIMIT 1,如:
SELECT username FROM( SELECT username FROM hotel_owner WHERE username = 'user001' LIMIT 1 ) AS T1 UNION ALL SELECT username FROM( SELECT username FROM hotel_owner WHERE username = 'user002' LIMIT 1 ) AS T2
-
使用EXIST,如:
SELECT username FROM ( SELECT 'user001' AS username UNION ALL SELECT 'user002' AS username ) AS T1 WHERE EXISTS( SELECT username FROM hotel_owner AS T2 WHERE T1.username = T1.username )
-
使用临时变量、使用公共表达式+rownumber(MYSQL 8.0)等其他
当前hotel_owner表上有索引idx_username(username)
,针对上面两个用户的数据量为:
mysql> SELECT username,count(1) AS usercount
-> FROM hotel_owner
-> WHERE username IN ('user001','user002')
-> GROUP BY username;
+-------------+-----------+
| username | usercount |
+-------------+-----------+
| user002 | 16455 |
| user001 | 18718 |
+-------------+-----------+
2 rows in set (0.02 sec)
上面4种SQL都能得到相同的执行结果,但查询性能相差50倍以上。
问题原因
MySQL Server架构可分为MySQL网络连接层、MySQL服务层、MySQL存储引擎层三层:
- MySQL网络连接层,负责处理客户端请求连接。
- MySQL服务层,负责解析SQL语句生成直接计划,由查询执行引擎与存储引擎层进行交互处理,将处理结果返回给客户端。
- MySQL存储引擎层,负责MySQL中数据的存储与提取,与底层系统文件进行交互。MySQL存储引擎是插件式的,服务器中的查询执行引擎通过接口与存储引擎进行通信,接口屏蔽了不同存储引擎之间的差异 。
MySQL查询处理流程如下:
由于MySQL架构的分层设计和不同存储引擎内部实现的差异性,MySQL服务层的查询优化器无法针对某个存储引擎进行定制开发,导致MySQL查询优化器在某些场景下无法生成"相对更优"的执行计划,需要研发人员"使用查询提示"或"改写SQL语句"来改变SQL语句的执行计划和提示SQL语句的执行效率。
通过MySQL内部工具profiling能清楚得到上面四种SQL语句的实际执行耗时:
*************************** 1. row ***************************
Query_ID: 1
Duration: 0.02456375
Query: SELECT DISTINCT username
FROM hotel_owner
WHERE username IN ('user001','user002')
*************************** 2. row ***************************
Query_ID: 2
Duration: 0.02770700
Query: SELECT username
FROM hotel_owner
WHERE username IN ('user001','user002')
GROUP BY username
*************************** 3. row ***************************
Query_ID: 3
Duration: 0.00054050
Query: SELECT username
FROM(
SELECT username
FROM hotel_owner
WHERE username = 'user001'
LIMIT 1
) AS T1
UNION ALL
SELECT username
FROM(
SELECT username
FROM hotel_owner
WHERE username = 'user002'
LIMIT 1
) AS T2
*************************** 4. row ***************************
Query_ID: 4
Duration: 0.00083600
Query: SELECT username
FROM (
SELECT 'user001' AS username
UNION ALL
SELECT 'user002' AS username
) AS T1
WHERE EXISTS(
SELECT username
FROM hotel_owner AS T2
WHERE T1.username = T1.username
)
DISTINCT
方式和GROUP BY
方式耗时接近,耗时分别为24ms和27ms。
LIMIT 1
方式和EXISTS
方式耗时接近,耗时分别为0.5毫秒和0.8ms。
其中DISTINCT
方式的执行计划和执行成本明细为:
mysql> DESC SELECT DISTINCT username
-> FROM hotel_owner
-> WHERE username IN ('user001','user002') \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: hotel_owner
partitions: NULL
type: range
possible_keys: idx_seq_usr,idx_username
key: idx_username
key_len: 152
ref: NULL
rows: 66282
filtered: 100.00
Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)
mysql> SHOW PROFILE CPU,BLOCK IO,SWAPS FOR QUERY 1;
+----------------------+----------+----------+------------+--------------+---------------+-------+
| Status | Duration | CPU_user | CPU_system | Block_ops_in | Block_ops_out | Swaps |
+----------------------+----------+----------+------------+--------------+---------------+-------+
| starting | 0.000065 | NULL | NULL | NULL | NULL | NULL |
| checking permissions | 0.000006 | NULL | NULL | NULL | NULL | NULL |
| Opening tables | 0.000017 | NULL | NULL | NULL | NULL | NULL |
| init | 0.000024 | NULL | NULL | NULL | NULL | NULL |
| System lock | 0.000007 | NULL | NULL | NULL | NULL | NULL |
| optimizing | 0.000008 | NULL | NULL | NULL | NULL | NULL |
| statistics | 0.000158 | NULL | NULL | NULL | NULL | NULL |
| preparing | 0.000018 | NULL | NULL | NULL | NULL | NULL |
| Sorting result | 0.000004 | NULL | NULL | NULL | NULL | NULL |
| executing | 0.000001 | NULL | NULL | NULL | NULL | NULL |
| Sending data | 0.024214 | NULL | NULL | NULL | NULL | NULL |
| end | 0.000003 | NULL | NULL | NULL | NULL | NULL |
| query end | 0.000007 | NULL | NULL | NULL | NULL | NULL |
| closing tables | 0.000004 | NULL | NULL | NULL | NULL | NULL |
| freeing items | 0.000021 | NULL | NULL | NULL | NULL | NULL |
| cleaning up | 0.000008 | NULL | NULL | NULL | NULL | NULL |
+----------------------+----------+----------+------------+--------------+---------------+-------+
16 rows in set, 1 warning (0.00 sec)
而LIMIT 1
方式的执行成本明细为:
mysql> DESC SELECT username
-> FROM(
-> SELECT username
-> FROM hotel_owner
-> WHERE username = 'user001'
-> LIMIT 1
-> ) AS T1
-> UNION ALL
-> SELECT username
-> FROM(
-> SELECT username
-> FROM hotel_owner
-> WHERE username = 'user002'
-> LIMIT 1
-> ) AS T2 \G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: <derived2>
partitions: NULL
type: system
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
*************************** 2. row ***************************
id: 2
select_type: DERIVED
table: hotel_owner
partitions: NULL
type: ref
possible_keys: idx_username
key: idx_username
key_len: 152
ref: const
rows: 34788
filtered: 100.00
Extra: Using index
*************************** 3. row ***************************
id: 3
select_type: UNION
table: <derived4>
partitions: NULL
type: system
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
*************************** 4. row ***************************
id: 4
select_type: DERIVED
table: hotel_owner
partitions: NULL
type: ref
possible_keys: idx_username
key: idx_username
key_len: 152
ref: const
rows: 31494
filtered: 100.00
Extra: Using index
4 rows in set, 1 warning (0.00 sec)
mysql> SHOW PROFILE CPU,BLOCK IO,SWAPS FOR QUERY 3;
+----------------------+----------+----------+------------+--------------+---------------+-------+
| Status | Duration | CPU_user | CPU_system | Block_ops_in | Block_ops_out | Swaps |
+----------------------+----------+----------+------------+--------------+---------------+-------+
| starting | 0.000099 | NULL | NULL | NULL | NULL | NULL |
| checking permissions | 0.000003 | NULL | NULL | NULL | NULL | NULL |
| checking permissions | 0.000004 | NULL | NULL | NULL | NULL | NULL |
| Opening tables | 0.000060 | NULL | NULL | NULL | NULL | NULL |
| init | 0.000066 | NULL | NULL | NULL | NULL | NULL |
| System lock | 0.000007 | NULL | NULL | NULL | NULL | NULL |
| optimizing | 0.000004 | NULL | NULL | NULL | NULL | NULL |
| optimizing | 0.000008 | NULL | NULL | NULL | NULL | NULL |
| statistics | 0.000083 | NULL | NULL | NULL | NULL | NULL |
| preparing | 0.000018 | NULL | NULL | NULL | NULL | NULL |
| executing | 0.000001 | NULL | NULL | NULL | NULL | NULL |
| Sending data | 0.000027 | NULL | NULL | NULL | NULL | NULL |
| statistics | 0.000005 | NULL | NULL | NULL | NULL | NULL |
| preparing | 0.000005 | NULL | NULL | NULL | NULL | NULL |
| optimizing | 0.000004 | NULL | NULL | NULL | NULL | NULL |
| optimizing | 0.000003 | NULL | NULL | NULL | NULL | NULL |
| statistics | 0.000044 | NULL | NULL | NULL | NULL | NULL |
| preparing | 0.000006 | NULL | NULL | NULL | NULL | NULL |
| executing | 0.000001 | NULL | NULL | NULL | NULL | NULL |
| Sending data | 0.000020 | NULL | NULL | NULL | NULL | NULL |
| statistics | 0.000004 | NULL | NULL | NULL | NULL | NULL |
| preparing | 0.000002 | NULL | NULL | NULL | NULL | NULL |
| executing | 0.000002 | NULL | NULL | NULL | NULL | NULL |
| Sending data | 0.000007 | NULL | NULL | NULL | NULL | NULL |
| executing | 0.000001 | NULL | NULL | NULL | NULL | NULL |
| Sending data | 0.000002 | NULL | NULL | NULL | NULL | NULL |
| end | 0.000002 | NULL | NULL | NULL | NULL | NULL |
| query end | 0.000007 | NULL | NULL | NULL | NULL | NULL |
| removing tmp table | 0.000002 | NULL | NULL | NULL | NULL | NULL |
| query end | 0.000001 | NULL | NULL | NULL | NULL | NULL |
| closing tables | 0.000001 | NULL | NULL | NULL | NULL | NULL |
| removing tmp table | 0.000003 | NULL | NULL | NULL | NULL | NULL |
| closing tables | 0.000001 | NULL | NULL | NULL | NULL | NULL |
| removing tmp table | 0.000002 | NULL | NULL | NULL | NULL | NULL |
| closing tables | 0.000009 | NULL | NULL | NULL | NULL | NULL |
| freeing items | 0.000021 | NULL | NULL | NULL | NULL | NULL |
| cleaning up | 0.000008 | NULL | NULL | NULL | NULL | NULL |
+----------------------+----------+----------+------------+--------------+---------------+-------+
37 rows in set, 1 warning (0.00 sec)
DISTINCT
方式和LIMIT 1
方式都使用索引,其中最大耗时差异在Sending data部分。
DISTINCT
方式的Sending data部分耗时:
| Sending data | 0.024214 | NULL | NULL | NULL | NULL | NULL |
LIMIT 1
方式的Sending data部分耗时:
| Sending data | 0.000027 | NULL | NULL | NULL | NULL | NULL |
| Sending data | 0.000020 | NULL | NULL | NULL | NULL | NULL |
| Sending data | 0.000007 | NULL | NULL | NULL | NULL | NULL |
差异原因:
DISTINCT
方式需要扫描所有满足WHERE条件的16455+18718条记录并将这些记录返回到MySQL Server层,由MySQL Server层负责数据去重处理(DISTINCT)并返回给客户端。LIMIT 1
方式针对每个子查询仅需要扫描到第1条满足WHERE条件的记录并将这些记录返回到MySQL Server层,由MySQL Server层负责数据合并(UNION ALL)并返回给客户端。
当满足WHERE条件的记录较少时,无论使用上述4种SQL种的任意1种方式都能快速返回结果,但随着满足WHERE条件的记录增多时,需要结合实际的业务需求和数据分布来编写"高效SQL"。
优化建议
在编写SQL语句时,不仅需要根据业务需求编写"正确SQL",还需要根据"实际数据分布"编写"高效SQL"。