SQL Optimization Strategies and Execution Plans (Rereading the MySQL Documentation)
1. Execution Plans
EXPLAIN works with SELECT, DELETE, INSERT, REPLACE, and UPDATE statements. With the help of EXPLAIN, you can see where you should add indexes to tables so that the statement executes faster by using indexes to find rows. You can also use EXPLAIN to check whether the optimizer joins the tables in an optimal order. To give a hint to the optimizer to use a join order corresponding to the order in which the tables are named in a SELECT statement, begin the statement with SELECT STRAIGHT_JOIN rather than just SELECT.
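As a minimal sketch (the orders/customers tables and columns are hypothetical), comparing the optimizer's chosen join order with a forced one:

```sql
-- Let MySQL pick the join order:
EXPLAIN
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE customers.country = 'DE';

-- Force the join order to follow the order in which the tables are named:
EXPLAIN
SELECT STRAIGHT_JOIN *
FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE customers.country = 'DE';
```

Comparing the two EXPLAIN outputs shows whether the optimizer's order differs from the written order, and at what estimated cost.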
EXPLAIN returns a row of information for each table used in the SELECT statement. It lists the tables in the output in the order that MySQL would read them while processing the statement. MySQL resolves all joins using a nested-loop join method. This means that MySQL reads a row from the first table, and then finds a matching row in the second table, the third table, and so on. When all tables are processed, MySQL outputs the selected columns and backtracks through the table list until a table is found for which there are more matching rows. The next row is read from this table and the process continues with the next table.
2. Index Merge and the Execution Plan
The Index Merge access method retrieves rows with multiple range scans and merges their results into one. This access method merges index scans from a single table only, not scans across multiple tables. The merge can produce unions, intersections, or unions-of-intersections of its underlying scans.
In EXPLAIN output, the Index Merge method appears as index_merge in the type column. In this case, the key column contains a list of indexes used, and key_len contains a list of the longest key parts for those indexes. The Index Merge access method has several algorithms, which are displayed in the Extra field of EXPLAIN output: Using intersect(...), Using union(...), or Using sort_union(...).
The Index Merge intersection algorithm performs simultaneous scans on all used indexes and produces the intersection of row sequences that it receives from the merged index scans. If all columns used in the query are covered by the used indexes, full table rows are not retrieved (EXPLAIN output contains Using index in Extra field in this case).
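A sketch of a query shape that can trigger an intersection merge, assuming a table t with separate single-column indexes idx_key1(key1) and idx_key2(key2):

```sql
-- If the optimizer chooses Index Merge here, EXPLAIN shows type = index_merge,
-- key = idx_key1,idx_key2, and Extra = Using intersect(idx_key1,idx_key2);
-- since both referenced columns are covered by the used indexes,
-- Extra would also contain Using index.
EXPLAIN SELECT COUNT(*) FROM t WHERE key1 = 10 AND key2 = 20;
```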
3. Condition Filtering and the Execution Plan
What is condition filtering?
In join processing, prefix rows are those rows passed from one table in a join to the next. In general, the optimizer attempts to put tables with low prefix counts early in the join order to keep the number of row combinations from increasing rapidly. To the extent that the optimizer can use information about conditions on rows selected from one table and passed to the next, the more accurately it can compute row estimates and choose the best execution plan.
Without condition filtering, the prefix row count for a table is based on the estimated number of rows selected by the WHERE clause according to whichever access method the optimizer chooses. Condition filtering enables the optimizer to use other relevant conditions in the WHERE clause not taken into account by the access method, and thus improve its prefix row count estimates. For example, even though there might be an index-based access method that can be used to select rows from the current table in a join, there might also be additional conditions for the table in the WHERE clause that can filter (further restrict) the estimate for qualifying rows passed to the next table.
How it maps into the execution plan
In EXPLAIN output, the rows column indicates the row estimate for the chosen access method, and the filtered column reflects the effect of condition filtering. filtered values are expressed as percentages. The maximum value is 100, which means no filtering of rows occurred. Values decreasing from 100 indicate increasing amounts of filtering.
The prefix row count (the number of rows estimated to be passed from the current table in a join to the next) is the product of the rows and filtered values. That is, the prefix row count is the estimated row count, reduced by the estimated filtering effect. For example, if rows is 1000 and filtered is 20%, condition filtering reduces the estimated row count of 1000 to a prefix row count of 1000 × 20% = 1000 × .2 = 200.
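The rows × filtered arithmetic can be checked directly; this small helper mirrors the formula above:

```python
def prefix_row_count(rows: int, filtered_pct: float) -> float:
    """Estimated rows passed from this table to the next table in the join:
    the access-method row estimate reduced by the condition-filtering effect."""
    return rows * filtered_pct / 100.0

# The example from the text: rows = 1000, filtered = 20%.
print(prefix_row_count(1000, 20.0))  # 200.0
```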
Example
SELECT * FROM employee JOIN department ON employee.dept_no = department.dept_no WHERE employee.first_name = 'John' AND employee.hire_date BETWEEN '2018-01-01' AND '2018-06-01';
Table details:
(1) Table employee has 1024 rows and table department has 12 rows. Both tables have an index on the join column dept_no, and employee also has an index on first_name.
(2) In employee, 8 rows match first_name = 'John'; 150 rows match hire_date BETWEEN '2018-01-01' AND '2018-06-01'; only one row matches both conditions.
Without condition filtering, the execution plan is:
| id | table      | type   | possible_keys    | key     | ref     | rows | filtered |
|----|------------|--------|------------------|---------|---------|------|----------|
| 1  | employee   | ref    | name,h_date,dept | name    | const   | 8    | 100.00   |
| 1  | department | eq_ref | PRIMARY          | PRIMARY | dept_no | 1    | 100.00   |
Read this as follows: using the index on first_name, MySQL fetches the 8 rows of employee that match first_name = 'John'. With no condition filtering (filtered is 100%), all 8 rows are passed to the next joined table: rows × filtered = 8 × 100% = 8.
If condition filtering is taken into account, the execution plan becomes:
| id | table      | type   | possible_keys    | key     | ref     | rows | filtered |
|----|------------|--------|------------------|---------|---------|------|----------|
| 1  | employee   | ref    | name,h_date,dept | name    | const   | 8    | 16.31    |
| 1  | department | eq_ref | PRIMARY          | PRIMARY | dept_no | 1    | 100.00   |
This means that, before passing rows from employee to the next joined table, the optimizer also accounts for the other WHERE condition, the range on employee.hire_date. The 16.31% comes from the optimizer's statistics for that condition (it is not a simple 1/8 ratio), so the prefix row count becomes 8 × 16.31% ≈ 1.3 rows instead of 8.
4. ORDER BY and the Execution Plan (filesort)
In some cases, MySQL may use an index to satisfy an ORDER BY clause and avoid the extra sorting involved in performing a filesort operation. If an index cannot be used to satisfy an ORDER BY clause, MySQL performs a filesort operation that reads table rows and sorts them. A filesort constitutes an extra sorting phase in query execution. A filesort operation uses temporary disk files as necessary if the result set is too large to fit in memory.
In other words, in most cases we rely on an index so that ORDER BY does not trigger a filesort. When a filesort does occur, it may use disk to handle larger result sets.
This shows up in the execution plan as follows:
• If the Extra column of EXPLAIN output does not contain Using filesort, the index is used and a filesort is not performed.
• If the Extra column of EXPLAIN output contains Using filesort, the index is not used and a filesort is performed.
System settings related to filesort:
sort_buffer_size, max_sort_length, and read_rnd_buffer_size, among others; the Sort_merge_passes status variable counts the merge passes the sort algorithm has had to perform.
On the application side, a simple optimization is to cap the result set with LIMIT.
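These can be inspected at runtime; the variable and status names below are the standard MySQL ones:

```sql
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'max_sort_length';
SHOW VARIABLES LIKE 'read_rnd_buffer_size';
-- Sort_merge_passes is a status counter: a steadily growing value suggests
-- sort_buffer_size may be too small for the workload.
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';
```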
Some types of queries are particularly suited to completely in-memory filesort operations. For example, the optimizer can use filesort to efficiently handle in memory, without temporary files, the ORDER BY operation for queries (and subqueries) of the following form:
SELECT ... FROM single_table ... ORDER BY non_index_column [DESC] LIMIT [M,]N;
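As a sketch (table t and its columns are hypothetical), contrasting the two cases in EXPLAIN:

```sql
-- Assuming an index exists on key_col, the optimizer can read rows in index
-- order, and Extra shows no "Using filesort":
EXPLAIN SELECT * FROM t ORDER BY key_col LIMIT 10;

-- Ordering on an unindexed column forces a sort ("Using filesort" in Extra):
EXPLAIN SELECT * FROM t ORDER BY non_key_col LIMIT 10;
```

The second query is also the single-table ORDER BY ... LIMIT form above, which the optimizer can often handle entirely in memory.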
5. GROUP BY and the Execution Plan: Loose Index Scan
The most efficient way to process GROUP BY is when an index is used to directly retrieve the grouping columns. With this access method, MySQL uses the property of some index types that the keys are ordered (for example, BTREE, but not HASH indexes). This property enables use of lookup groups in an index without having to consider all keys in the index that satisfy all WHERE conditions. This access method considers only a fraction of the keys in an index, so it is called a Loose Index Scan. When there is no WHERE clause, a Loose Index Scan reads as many keys as the number of groups, which may be a much smaller number than that of all keys. If the WHERE clause contains range predicates, a Loose Index Scan looks up the first key of each group that satisfies the range conditions, and again reads the smallest possible number of keys.
If Loose Index Scan is applicable to a query, the EXPLAIN output shows Using index for group-by in the Extra column.
Example
Table t1 has columns (c1, c2, c3, c4) and an index idx(c1, c2, c3).
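A possible definition matching that description (the column types are assumed):

```sql
CREATE TABLE t1 (
  c1 INT,
  c2 INT,
  c3 INT,
  c4 INT,
  INDEX idx (c1, c2, c3)
);
```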
The following queries can use the Loose Index Scan method:
-- The GROUP BY names only columns that form a leftmost prefix of the index and no other columns.
SELECT c1, c2 FROM t1 GROUP BY c1, c2;

-- If, instead of GROUP BY, the query has a DISTINCT clause, all distinct attributes refer to
-- columns that form a leftmost prefix of the index.
SELECT DISTINCT c1, c2 FROM t1;

-- The only aggregate functions used in the select list (if any) are MIN() and MAX(), and all of
-- them refer to the same column. The column must be in the index and must immediately follow
-- the columns in the GROUP BY.
SELECT c1, MIN(c2) FROM t1 GROUP BY c1;

SELECT c1, c2 FROM t1 WHERE c1 < const GROUP BY c1, c2;

-- Any other parts of the index than those from the GROUP BY referenced in the query must be
-- constants (that is, they must be referenced in equalities with constants), except for the
-- argument of MIN() or MAX() functions.
SELECT MAX(c3), MIN(c3), c1, c2 FROM t1 WHERE c2 > const GROUP BY c1, c2;
SELECT c2 FROM t1 WHERE c1 < const GROUP BY c1, c2;
SELECT c1, c2 FROM t1 WHERE c3 = const GROUP BY c1, c2;
Queries that cannot use Loose Index Scan:
It is not applicable if the query has GROUP BY c2, c3 (the columns are not a leftmost prefix) or GROUP BY c1, c2, c4 (c4 is not in the index).
Concrete examples:
-- There are aggregate functions other than MIN() or MAX():
SELECT c1, SUM(c2) FROM t1 GROUP BY c1;

-- The columns in the GROUP BY clause do not form a leftmost prefix of the index:
SELECT c1, c2 FROM t1 GROUP BY c2, c3;

-- The query refers to a part of a key that comes after the GROUP BY part,
-- and for which there is no equality with a constant
-- (were the query to include WHERE c3 = const, Loose Index Scan could be used):
SELECT c1, c3 FROM t1 GROUP BY c1, c2;
Extension: other aggregate functions that can use Loose Index Scan
AVG(DISTINCT), SUM(DISTINCT), and COUNT(DISTINCT) are supported.
Assume that there is an index idx(c1,c2,c3) on table t1(c1,c2,c3,c4). The Loose Index Scan access method can be used for the following queries:
SELECT COUNT(DISTINCT c1), SUM(DISTINCT c1) FROM t1;
SELECT COUNT(DISTINCT c1, c2), COUNT(DISTINCT c2, c1) FROM t1;
6. GROUP BY and the Execution Plan: Tight Index Scan
If there are range conditions in the WHERE clause, this method reads only the keys that satisfy these conditions. Otherwise, it performs an index scan. Because this method reads all keys in each range defined by the WHERE clause, or scans the whole index if there are no range conditions, it is called a Tight Index Scan. With a Tight Index Scan, the grouping operation is performed only after all keys that satisfy the range conditions have been found.
For this method to work, it is sufficient that there be a constant equality condition for all columns in a query referring to parts of the key coming before or in between parts of the GROUP BY key. The constants from the equality conditions fill in any “gaps” in the search keys so that it is possible to form complete prefixes of the index. These index prefixes then can be used for index lookups. If the GROUP BY result requires sorting, and it is possible to form search keys that are prefixes of the index, MySQL also avoids extra sorting operations because searching with prefixes in an ordered index already retrieves all the keys in order.
Example
Table t1 has columns (c1, c2, c3, c4) and an index idx(c1, c2, c3).
The following queries do not work with the Loose Index Scan access method described previously, but still work with the Tight Index Scan access method.
-- There is a gap in the GROUP BY, but it is covered by the condition c2 = 'a':
SELECT c1, c2, c3 FROM t1 WHERE c2 = 'a' GROUP BY c1, c3;

-- The GROUP BY does not begin with the first part of the key,
-- but there is a condition that provides a constant for that part:
SELECT c1, c2, c3 FROM t1 WHERE c1 = 'a' GROUP BY c2, c3;
7. Subqueries with Materialization and the Execution Plan
The optimizer uses materialization to enable more efficient subquery processing. Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory. The first time MySQL needs the subquery result, it materializes that result into a temporary table. Any subsequent time the result is needed, MySQL refers again to the temporary table. The optimizer may index the table with a hash index to make lookups fast and inexpensive. The index contains unique values to eliminate duplicates and make the table smaller.
Subquery materialization uses an in-memory temporary table when possible, falling back to on-disk storage if the table becomes too large. Using a temporary table for materialization makes it possible to execute the subquery only once rather than once per row of the outer query.
The following restrictions apply to the use of subquery materialization:
• The types of the inner and outer expressions must match. For example, the optimizer might be able to use materialization if both expressions are integer or both are decimal, but cannot if one expression is integer and the other is decimal.
• The inner expression cannot be a BLOB.
How this shows up in the execution plan
Use of EXPLAIN with a query provides some indication of whether the optimizer uses subquery materialization:
• Compared to query execution that does not use materialization, select_type may change from DEPENDENT SUBQUERY to SUBQUERY. This indicates that, for a subquery that would be executed once per outer row, materialization enables the subquery to be executed just once.
• For extended EXPLAIN output, the text displayed by a following SHOW WARNINGS includes materialize and materialized-subquery.
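A sketch of a query shape eligible for materialization, assuming tables t1(a) and t2(b) whose column types match (one of the restrictions above):

```sql
EXPLAIN SELECT * FROM t1 WHERE t1.a IN (SELECT t2.b FROM t2);
-- If materialization is chosen, select_type for the subquery is SUBQUERY
-- rather than DEPENDENT SUBQUERY, and the rewritten query shown by
-- SHOW WARNINGS contains materialize / materialized-subquery markers.
SHOW WARNINGS;
```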
8. Subqueries with the EXISTS Strategy and the Execution Plan
mysql> EXPLAIN SELECT outer_expr IN (SELECT t2.maybe_null_key FROM t2, t3 WHERE ...) FROM t1;
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
          ...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: ref_or_null
possible_keys: maybe_null_key
          key: maybe_null_key
      key_len: 5
          ref: func
         rows: 2
        Extra: Using where; Using index
...
The additional OR ... IS NULL condition makes query execution slightly more complicated (and some optimizations within the subquery become inapplicable), but generally this is tolerable. The situation is much worse when outer_expr can be NULL.
EXPLAIN SELECT t1.col1, t1.col1 IN (SELECT t2.key1 FROM t2 WHERE t2.col2=t1.col2) FROM t1\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
          ...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: index_subquery
possible_keys: key1
          key: key1
      key_len: 5
          ref: func
         rows: 2
        Extra: Using where; Full scan on NULL key
For multiple-table subqueries, execution of NULL IN (SELECT ...) is particularly slow because the join optimizer does not optimize for the case where the outer expression is NULL. It assumes that subquery evaluations with NULL on the left side are very rare, even if there are statistics that indicate otherwise. On the other hand, if the outer expression might be NULL but never actually is, there is no performance penalty.
To help the query optimizer better execute your queries, use these suggestions:
• Declare a column as NOT NULL if it really is. This also helps other aspects of the optimizer by simplifying condition testing for the column.
• If you need not distinguish a NULL from FALSE subquery result, you can avoid the slow execution path: replace a comparison of the form outer_expr IN (SELECT inner_expr FROM ...) with (outer_expr IS NOT NULL) AND (outer_expr IN (SELECT inner_expr FROM ...)).
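Applying that rewrite to the earlier EXPLAIN example (tables t1/t2 as above, with a possibly-NULL t1.col1):

```sql
-- If a NULL subquery result need not be distinguished from FALSE,
-- the IS NOT NULL guard prevents evaluation of NULL IN (SELECT ...):
SELECT t1.col1
FROM t1
WHERE t1.col1 IS NOT NULL
  AND t1.col1 IN (SELECT t2.key1 FROM t2 WHERE t2.col2 = t1.col2);
```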
9. Supplementary Notes
1. When loading a table from a text file, use LOAD DATA. This is usually 20 times faster than using INSERT statements.
2. Declare columns to be NOT NULL if possible. It makes SQL operations faster, by enabling better use of indexes and eliminating overhead for testing whether each value is NULL. You also save some storage space, one bit per column. If you really need NULL values in your tables, use them. Just avoid the default setting that allows NULL values in every column.
3. Use the most efficient (smallest) data types possible. MySQL has many specialized types that save disk space and memory. For example, use the smaller integer types if possible to get smaller tables. MEDIUMINT is often a better choice than INT because a MEDIUMINT column uses 25% less space.
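The 25% figure follows directly from the documented storage sizes of MySQL's integer types; a quick check:

```python
# Storage sizes in bytes, per the MySQL data-type documentation.
int_sizes = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}

saving = 1 - int_sizes["MEDIUMINT"] / int_sizes["INT"]
print(f"MEDIUMINT uses {saving:.0%} less space than INT")  # 25%
```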
4. The internal representation of a MySQL table has a maximum row size limit of 65,535 bytes, even if the storage engine is capable of supporting larger rows. BLOB and TEXT columns only contribute 9 to 12 bytes toward the row size limit because their contents are stored separately from the rest of the row.
5. The maximum row size for an InnoDB table, which applies to data stored locally within a database page, is slightly less than half a page for 4KB, 8KB, 16KB, and 32KB innodb_page_size settings. For example, the maximum row size is slightly less than 8KB for the default 16KB InnoDB page size. For 64KB pages, the maximum row size is slightly less than 16KB.
6. Make your redo log files big, even as big as the buffer pool. When InnoDB has written the redo log files full, it must write the modified contents of the buffer pool to disk in a checkpoint. Small redo log files cause many unnecessary disk writes. Although historically big redo log files caused lengthy recovery times, recovery is now much faster and you can confidently use large redo log files. The size and number of redo log files are configured using the innodb_log_file_size and innodb_log_files_in_group configuration options.
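A my.cnf sketch for sizing the redo log (the values are illustrative, not recommendations):

```ini
[mysqld]
# Total redo capacity = innodb_log_file_size * innodb_log_files_in_group;
# sizing it near the buffer pool size reduces checkpoint-driven writes.
innodb_log_file_size      = 1G
innodb_log_files_in_group = 2
```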
7. MySQL uses row-level locking for InnoDB tables to support simultaneous write access by multiple sessions, making them suitable for multi-user, highly concurrent, and OLTP applications. MySQL uses table-level locking for MyISAM, MEMORY, and MERGE tables, permitting only one session to update those tables at a time. This locking level makes these storage engines more suitable for read-only, read-mostly, or single-user applications.
8. The main execution flow of a SQL statement
Inside the MySQL server, a SQL statement passes through these main stages: SQL parsing, preparation based on the syntax tree, the optimizer's logical transformations, the optimizer's cost preparation, cost-model-based optimization, additional optimizations, and execution of the resulting plan.
9. The cost model
How does the optimizer choose the cheapest of the candidate plans? It needs a quantitative, comparable metric; that metric is the cost. The system picks the plan with the smallest cost value.
The cost model divides operations into two layers: the Server layer, which is mainly CPU cost, and the Engine (storage engine) layer, which is mainly I/O cost. For example, reading a data page from disk has cost io_block_read_cost = 1.0, and evaluating whether a row matches a condition has cost row_evaluate_cost = 0.2. The other cost constants include:
(1) memory_temptable_create_cost (default 2.0): cost of creating an in-memory temporary table;
(2) memory_temptable_row_cost (default 0.2): per-row cost of an in-memory temporary table;
(3) key_compare_cost (default 0.1): cost of comparing keys, for example when sorting;
(4) disk_temptable_create_cost (default 40.0): cost of creating an internal on-disk (MyISAM or InnoDB) temporary table;
(5) disk_temptable_row_cost (default 1.0): per-row cost of an internal on-disk (MyISAM or InnoDB) temporary table.
As of MySQL 5.7, the default values of these operation costs are configurable.
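As a sketch of how such constants combine, a simplified cost comparison in the spirit of the model (the formulas are illustrative, not MySQL's exact ones):

```python
# Default cost constants from the text (MySQL 5.7 defaults).
IO_BLOCK_READ_COST = 1.0   # read one data page from disk (engine layer)
ROW_EVALUATE_COST  = 0.2   # evaluate one row against a condition (server layer)

def table_scan_cost(pages: int, rows: int) -> float:
    """Illustrative cost: read every page, then evaluate every row."""
    return pages * IO_BLOCK_READ_COST + rows * ROW_EVALUATE_COST

def index_range_cost(matching_rows: int) -> float:
    """Illustrative cost: one page read plus one evaluation per matching row."""
    return matching_rows * (IO_BLOCK_READ_COST + ROW_EVALUATE_COST)

# A 100-page, 10,000-row table where only 50 rows match the predicate:
print(table_scan_cost(100, 10_000))   # ~2100.0
print(index_range_cost(50))           # ~60.0 -> the range read wins
```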
To compute the total cost of a plan, the optimizer also consults statistics such as the table's row count, metadata, and index information.
If MySQL's optimizer trace feature is enabled, the computed cost of each candidate plan can be inspected through INFORMATION_SCHEMA.OPTIMIZER_TRACE.
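The trace can be enabled per session with the standard statement sequence (the sample query is hypothetical):

```sql
SET optimizer_trace = 'enabled=on';
SELECT * FROM t1 WHERE c1 < 10;  -- the statement to analyze
-- The trace JSON includes a cost value for each candidate plan considered:
SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE\G
SET optimizer_trace = 'enabled=off';
```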
10. References
1. The MySQL official documentation
2. The Meituan tech team blog (美团技术分享)