MySQL Execution Plan--如果优雅地按照时间范围进行数据抽取

1、问题场景

在很多业务存在增量抽取数据的场景，当抽取数据量较大时，通常采用分批方式按照主键进行抽取，以抽取表footprint_detail_info最近一天数据为例。表footprint_detail_info的结构为：

CREATE TABLE `footprint_detail_info` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id',
  `status` tinyint(4) NOT NULL DEFAULT '0' COMMENT '状态',
  ...省略其他业务字段...
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '创建日期',
  `update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间',
  PRIMARY KEY (`id`),
  KEY `idx_create_time` (`create_time`),
  KEY `idx_update_time` (`update_time`),
) ENGINE=InnoDB AUTO_INCREMENT=89241820 DEFAULT CHARSET=utf8mb4

需要抽取最近一天update_time>'2022-07-29 00:00:00'的数据，第一次抽取SQL为：

SELECT 
id,`status`,update_time 
FROM footprint_detail_info 
WHERE update_time > '2022-07-29 00:00:00' 
ORDER BY id LIMIT 1000;

假设第一次返回记录中最大id为1329455，则第二次的抽取SQL为：

SELECT 
id,`status`,update_time 
FROM footprint_detail_info 
WHERE update_time > '2022-07-29 00:00:00'
AND id > 1329455
ORDER BY id LIMIT 1000;

后面依次使用上一次的最大ID最为下一次的起始ID，直到取回数据小于1000行结束。

2、问题分析

以第二次抽取SQL为例，其执行计划为：

# 查看执行计划
EXPLAIN SELECT 
id,`status`,update_time 
FROM footprint_detail_info 
WHERE update_time > '2022-07-29 00:00:00'
AND id > 1329455
ORDER BY id LIMIT 1000;
# 执行计划输出
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: footprint_detail_info
         type: range
possible_keys: PRIMARY,idx_update_time
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 14450630
        Extra: Using where
1 row in set (0.00 sec)

由于查询SQL使用ORDER BY id LIMIT 1000，由于对大量记录进行排序的操作成本较高，MySQL查询优化器选择使用主键索引来避免排序操作，对应伪代码为：

match_rows=[]
# 按照主键索引遍历满足id > 1329455条件的记录
for row_item in [scan table rows by primary key where id > 1329455 ]:
    # 判断记录是否满足update_time>'2022-07-29 00:00:00'条件
    if row_item.update_time>'2022-07-29 00:00:00':
        match_rows.append(row_item)
        # 判断已匹配记录数满足LIMIT 1000限制
        if len(match_rows)==1000:
            break
 # 由于按照主键索引遍历，因此记录按照主键有序
# 无需对集合重新排序
 return match_rows

影响查询效率的关键因素在于记录满足条件update_time>'2022-07-29 00:00:00'的概率：

如果满足条件概率为0.9，则需要遍历1000/0.9=1111行记录
如果满足条件概率为0.1，则需要遍历1000/0.1=10000行记录
如果满足条件概率为0.001，则需要遍历1000/0.1=1000000行记录
如果满足条件的总行数小于1000行，则需要遍历所有id > 1329455的记录。

可以通过改写SQL来避免主键扫描，如将排序条件改为ORDER BY id + 0 LIMIT 1000 ，对应执行计划为：

# 查看执行计划
EXPLAIN SELECT 
id,`status`,update_time 
FROM footprint_detail_info 
WHERE update_time > '2022-07-29 00:00:00'
AND id > 1329455
ORDER BY id + 0 LIMIT 1000;
# 执行计划输出
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: footprint_detail_info
         type: range
possible_keys: PRIMARY,idx_update_time
          key: idx_update_time
      key_len: 4
          ref: NULL
         rows: 558808
        Extra: Using index condition; Using filesort
1 row in set (0.00 sec)

对应伪代码为：

match_rows=[]
index_rows=[]
# 获取所有满足update_time>'2022-07-29 00:00:00'条件的索引记录
for index_row in [scan index rows by index idx_update_time where row_item.update_time>'2022-07-29 00:00:00' ]:
    if index_row.id > 1329455:
        index_rows.append(row_item)
# 对索引记录按照update_time有序，需要按照主键重新排序
sort_index_rows = index_rows.sort(key=lambda x: x.id)
# 按照LIMIT 1000条件取前1000行记录
top_index_rows = sort_index_rows[0:1000]
for index_row in top_index_rows:
    # 使用主键索引进行回标查找
    row_item = [seek table rows by primary key where id = index_row.id]
    match_rows.append(row_item)
return match_rows

影响查询效率的关键因素在于记录满足条件update_time>'2022-07-29 00:00:00'的记录数，满足条件记录越多，加载记录并排序的操作成本越高。

针对上述两种查询方式，无论是满足条件的概率或满足条件的记录数都取决于业务场景，无法保证查询语句稳定的查询性能。

3、解决方案

表footprint_detail_info上idx_update_time (update_time)索引等价于idx_update_time (update_time,id)，索引记录按照update_time+id顺序存储，如果将(update_time,id) 当作一个特殊列，将SQL编写为：

SELECT 
id,`status`,update_time 
FROM footprint_detail_info 
WHERE (update_time, id) > ('2022-07-29 00:00:00',1329455)
ORDER BY update_time,id LIMIT 1000;

其对应执行计划为：

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: footprint_detail_info
         type: index
possible_keys: NULL
          key: idx_update_time
      key_len: 4
          ref: NULL
         rows: 1000
        Extra: Using where

对应伪代码(理论)为：

match_rows=[]
index_rows=[]
# 获取所有满足(update_time,id) > ('2022-07-29 00:00:00',1329455)条件的索引记录
for index_row in [scan index rows by index idx_update_time 
                 where (update_time,id) > ('2022-07-29 00:00:00',1329455)]:
    index_rows.append(index_row)
    # 判断已匹配记录数满足LIMIT 1000限制
    if len(index_rows)==1000:
        break
 for index_row in index_rows:
    # 使用主键索引进行回标查找
    row_item = [seek table rows by primary key where id = index_row.id]
    match_rows.append(row_item)
 return match_rows

上面查询即避免数据满足过滤条件概率问题又避免查询结果排序问题，能很好保证查询语句稳定的查询性能。

从查询语句对应的执行计划看，查询使用索引且仅需要扫描有限的索引记录，查询性能应该很快且很稳定，但实际执行时发现性能极差。

如果仅使用update_time列来进行过滤和排序，其对应执行计划为：

# 查看执行计划
EXPLAIN SELECT 
id,`status`,update_time 
FROM footprint_detail_info 
WHERE update_time > '2022-07-29 00:00:00'
ORDER BY update_time LIMIT 1000;
# 执行计划输出
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: footprint_detail_info
         type: range
possible_keys: idx_update_time
          key: idx_update_time
      key_len: 4
          ref: NULL
         rows: 72420
        Extra: Using index condition
1 row in set (0.00 sec)

仔细观察两个执行计划可发现差异为Extra: Using where和Using index condition，Using where表示MySQL在执行过程中未使用索引来过滤，而是将数据从InnoDB存储引擎加载到MySQL Server层，在MySQL Server层使用WHERE过滤条件对数据进行过滤。

根本原因为MySQL能识别(update_time,id) > ('2022-07-29 00:00:00',1329455)语法但无法对(update_time,id 使用索引，需要将SQL改写为：

SELECT 
id,`status`,update_time 
FROM footprint_detail_info 
WHERE ((update_time > '2022-07-29 00:00:00') 
OR (update_time = '2022-07-29 00:00:00' AND id > 1329455))
ORDER BY update_time,id LIMIT 1000;

对应执行计划为：

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: footprint_detail_info
         type: range
possible_keys: PRIMARY,idx_update_time
          key: idx_update_time
      key_len: 8
          ref: NULL
         rows: 662481
        Extra: Using index condition

查询使用索引(Using index condition)且没有排序(Using filesort)，虽然预估扫描行数较高，但由于LIMIT语句的限制，每次查询实际扫描记录数较小，查询性能稳定，实际测试中排除物理IO影响后能保证在5ms内完成。

4、方案扩展

如果SQL过滤条件中除主键列和索引列外还包含其他列，可能存在：

数据记录在"其他列"上满足过滤条件的概率较低，查询需要扫描大量记录。
执行计划根据其他过滤条件使用"其他列"索引，查询需要排序大量记录。

为避免上述两类情况，可以通过子查询方式将不同过滤条件拆分开，如需要抽取最近一天update_time>'2022-07-29 00:00:00' and status=1的数据，首先先确认本次查询需要扫描的数据区间：

# 获取本轮查询数据区间的最大值。
SELECT 
update_time,id 
FROM footprint_detail_info 
WHERE ((update_time > '2022-07-29 00:00:00') OR (update_time = '2022-07-29 00:00:00' AND id >= 0))
ORDER BY update_time,id LIMIT 1000,1;

如果返回值为NONE，则证明剩余数据不足1000行，直接全量获取：

SELECT 
id,`status`,update_time 
FROM footprint_detail_info
INNER JOIN 
(
    SELECT 
    id AS rid
    FROM footprint_detail_info 
    WHERE ((update_time > '2022-07-29 00:00:00') OR (update_time = '2022-07-29 00:00:00' AND id >= 0))
) AS T1
ON T1.rid = footprint_detail_info.id
WHERE footprint_detail_info.status = 1

如果返回值不为NONE，如返回数据 '2022-07-29 00:00:07',89191455，则将该值作为本次查询的上限和下次查询下限：

本轮查询为：

SELECT 
id,`status`,update_time 
FROM footprint_detail_info
INNER JOIN 
(
    SELECT 
    id AS rid
    FROM footprint_detail_info 
    WHERE ((update_time > '2022-07-29 00:00:00') OR (update_time = '2022-07-29 00:00:00' AND id >= 0))
    AND   ((update_time < '2022-07-29 00:00:07') OR (update_time = '2022-07-29 00:00:07' AND id < 89191455))
) AS T1
ON T1.rid = footprint_detail_info.id
WHERE footprint_detail_info.status = 1

对应执行计划为：

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 997
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: footprint_detail_info
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: T1.rid
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: footprint_detail_info
         type: range
possible_keys: PRIMARY,idx_update_time
          key: idx_update_time
      key_len: 8
          ref: NULL
         rows: 997
        Extra: Using where; Using index
3 rows in set (0.00 sec)