Paginated Queries After Table Sharding

Overview

In scenarios with large data volumes, such as daily statistics jobs, the data can grow enormous over time (hundreds of millions of rows), while a large share of the old data is only rarely queried. In this situation, keeping everything in the original single table becomes a performance bottleneck for inserting new data, so the data for this kind of business is usually split across databases or tables. Table splitting generally comes in two flavors: vertical splitting and horizontal splitting.

Vertical Splitting

As the name suggests, vertical splitting divides a table's data into a left part and a right part (each part becoming its own table); put simply, it splits by column. The goal is to reduce the amount of data touched when accessing each row, often in combination with a column-oriented storage engine or a file storage system to speed up access to large objects. For example, for image data, a table usually stores only the image's URL in a column rather than the image content itself.

Since this kind of split generally does not change the total number of rows, it has little impact on paginated queries and is not discussed further here.

Horizontal Splitting

Horizontal splitting divides the data into upper and lower parts (each part becoming its own table): the upper part might hold last month's data and the lower part this month's, and as the business grows there will be tables for next month, the month after that, and so on.

This is a common way to handle large data sets, but because the data is scattered across different tables, paginated queries run into some problems.

Solution

Suppose that, to speed up access to the daily statistics, the data has been split into three tables: oa_statistic_2025_a, oa_statistic_2025_b and oa_statistic_2025_c.

The three tables contain the following data:

oa_statistic_2025_a

id created_time
1904828645457788928 2025-01-01 01:20:45
1904828642072985600 2025-01-01 12:59:04
1904828643016704000 2025-01-01 20:35:21
1904828646275678208 2025-01-02 01:28:05
1904828645797527552 2025-01-03 14:36:56
1904828642161065984 2025-01-03 18:49:48
1904828641582252032 2025-01-04 05:12:00
1904828644891557888 2025-01-05 14:11:14
1904828642349809664 2025-01-05 19:12:04
1904828646212763648 2025-01-06 07:15:59

oa_statistic_2025_b

id created_time
1904828640227491840 2025-01-01 00:39:03
1904828643675209728 2025-01-01 19:38:42
1904828643511631872 2025-01-01 22:08:43
1904828641108295680 2025-01-03 06:08:00
1904828641498365952 2025-01-03 11:58:52
1904828646179209216 2025-01-05 02:52:25
1904828642253340672 2025-01-06 14:08:36
1904828640554647552 2025-01-07 01:06:33
1904828643033481216 2025-01-07 04:50:19
1904828641838104576 2025-01-08 01:44:56

oa_statistic_2025_c

id created_time
1904828643012509696 2025-01-01 06:14:16
1904828642609856512 2025-01-01 16:46:56
1904828645503926272 2025-01-02 21:02:03
1904828645780750336 2025-01-04 16:51:36
1904828640760168448 2025-01-05 15:46:41
1904828645327765504 2025-01-06 16:46:41
1904828641477394432 2025-01-07 16:21:11
1904828645617172480 2025-01-09 10:32:10
1904828643540992000 2025-01-10 06:39:48
1904828642962178048 2025-01-11 10:47:03

Because the sharding here is hash-based (some schemes shard by date instead, which works better for time-ordered queries like this one), a paginated query has no choice but to examine the data in every table.
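
To make the routing concrete, here is a minimal sketch of hash-based shard routing, assuming records are assigned to a table by hashing the record id; the class name HashShardRouter and the modulo scheme are illustrative assumptions, not the actual routing used for the tables above:

// A sketch of hash-based routing: pick the target table from the hash of the id.
// A date-based scheme would instead derive the table name from created_time.
import java.util.List;

public class HashShardRouter {
    private static final List<String> TABLES = List.of(
            "oa_statistic_2025_a", "oa_statistic_2025_b", "oa_statistic_2025_c");

    public static String route(long id) {
        // floorMod keeps the bucket non-negative even for negative hash values
        int bucket = Math.floorMod(Long.hashCode(id), TABLES.size());
        return TABLES.get(bucket);
    }
}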

UNION ALL Over All Tables

Suppose we need the second page of data ordered by creation time ascending, with a page size of ten. A straightforward approach is to combine all the tables with UNION ALL and paginate over the union, which yields the correct result:

SELECT *
FROM (SELECT *
      FROM (SELECT 'a' AS table_name, a.id, a.created_time
            FROM oa_statistic_2025_a a
            ORDER BY created_time) AS a
      UNION ALL
      SELECT *
      FROM (SELECT 'b' AS table_name, b.id, b.created_time
            FROM oa_statistic_2025_b b
            ORDER BY created_time) AS b
      UNION ALL
      SELECT *
      FROM (SELECT 'c' AS table_name, c.id, c.created_time
            FROM oa_statistic_2025_c c
            ORDER BY created_time) AS c) AS t
ORDER BY created_time
LIMIT 10 OFFSET 10;

The result is as follows:

table_name id created_time
b 1904828641108295680 2025-01-03 06:08:00
b 1904828641498365952 2025-01-03 11:58:52
a 1904828645797527552 2025-01-03 14:36:56
a 1904828642161065984 2025-01-03 18:49:48
a 1904828641582252032 2025-01-04 05:12:00
c 1904828645780750336 2025-01-04 16:51:36
b 1904828646179209216 2025-01-05 02:52:25
a 1904828644891557888 2025-01-05 14:11:14
c 1904828640760168448 2025-01-05 15:46:41
a 1904828642349809664 2025-01-05 19:12:04

The Two-Query Method

As written, the UNION ALL approach makes every shard hand over its rows and sorts them all again in the outer query; even if the LIMIT were pushed down, each shard would still have to produce its first \(f + n\) rows. The two-query method below reduces the per-shard work to roughly one page per table.

The Algorithm

The algorithm proceeds as follows:

  1. Let the page size be \(n\), the offset \(f\), and the number of shard tables \(m\). Against the original single table (before sharding), the query would be:

    SELECT *
    FROM table_name
    WHERE TRUE
    ORDER BY created_time
    LIMIT n OFFSET f
    
  2. Let each shard's first-query offset be \(f' = \lfloor f/m \rfloor\). First, query every shard at that offset:

    SELECT * FROM tab1 WHERE TRUE ORDER BY created_time LIMIT n OFFSET f'
    
    SELECT * FROM tab2 WHERE TRUE ORDER BY created_time LIMIT n OFFSET f'
    

    …

    Let the result set of each subquery be \(r_1, r_2, \ldots\)

  3. Compare the first row of each result set and take the smallest record's created_time as \(min\); record the largest created_time of each result set as \(max_1, max_2, \ldots\)

  4. Query each table a second time to obtain the rows each shard can actually contribute:

    SELECT * FROM tab1 WHERE created_time BETWEEN min AND max1 ORDER BY created_time;
    
    SELECT * FROM tab2 WHERE created_time BETWEEN min AND max2 ORDER BY created_time;
    

    …

    Let the second-round result sets be \(r_1', r_2', \ldots\)

  5. Let the base merged offset be \(f'' = f \bmod m\). For each shard, compute a forward offset \(f''_1, f''_2, \ldots\) relative to its first query (the number of rows its second query returns before the first row of its first-query result) and accumulate these into \(f''\). Finally, page over the union of the second-round result sets (see the sketch after this list):

    SELECT *
    FROM (
        r1'
        UNION ALL
        r2'
        UNION ALL
        …
    ) AS t
    ORDER BY created_time
    LIMIT n OFFSET f''
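
The offset bookkeeping in steps 2 and 5 is the easy part to get wrong, so here is a minimal sketch of the arithmetic under the notation above, assuming every shard holds at least \(f'\) rows (the code implementation below also compensates for shards that do not):

// Offset arithmetic for the two-query method (a sketch, not production code).
// n = page size, f = global offset, m = number of shard tables.
public class TwoQueryOffsets {

    // First-query per-shard offset: f' = floor(f / m)
    static int firstPassOffset(int f, int m) {
        return f / m;
    }

    // Offset over the merged second-query results:
    // f'' = (f mod m) + the sum of each shard's forward offset, where a shard's
    // forward offset counts the rows its second query returns *before* the
    // first row of its first-query result.
    static int mergedOffset(int f, int m, int[] forwardOffsets) {
        int merged = f % m;
        for (int fo : forwardOffsets) merged += fo;
        return merged;
    }

    public static void main(String[] args) {
        // Values from the worked demo below: f = 10, m = 3
        System.out.println(firstPassOffset(10, 3));                   // 3
        System.out.println(mergedOffset(10, 3, new int[]{0, 0, 1}));  // 2
    }
}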
    

A Concrete Demo

Using the example above, \(n=10\), \(f=10\), \(m=3\): the first-query offset for each shard is \(f' = \lfloor 10/3 \rfloor = 3\), and the base offset over the merged shard data is \(f'' = 10 \bmod 3 = 1\).

The steps are as follows:

  1. Run the first query against each of the three tables

    • oa_statistic_2025_a

      SELECT 'a' AS table_name, a.id, a.created_time
      FROM oa_statistic_2025_a a
      ORDER BY created_time
      LIMIT 10 OFFSET 3;
      

      The resulting data set:

      table_name id created_time
      a 1904828646275678208 2025-01-02 01:28:05
      a 1904828645797527552 2025-01-03 14:36:56
      a 1904828642161065984 2025-01-03 18:49:48
      a 1904828641582252032 2025-01-04 05:12:00
      a 1904828644891557888 2025-01-05 14:11:14
      a 1904828642349809664 2025-01-05 19:12:04
      a 1904828646212763648 2025-01-06 07:15:59
      a 1904828641078935552 2025-01-07 10:49:26
      a 1904828645180964864 2025-01-07 13:29:36
      a 1904828644237246464 2025-01-08 03:05:51
    • oa_statistic_2025_b

      SELECT 'b' AS table_name, b.id, b.created_time
      FROM oa_statistic_2025_b b
      ORDER BY created_time
      LIMIT 10 OFFSET 3;
      

      The resulting data set:

      table_name id created_time
      b 1904828641108295680 2025-01-03 06:08:00
      b 1904828641498365952 2025-01-03 11:58:52
      b 1904828646179209216 2025-01-05 02:52:25
      b 1904828642253340672 2025-01-06 14:08:36
      b 1904828640554647552 2025-01-07 01:06:33
      b 1904828643033481216 2025-01-07 04:50:19
      b 1904828641838104576 2025-01-08 01:44:56
      b 1904828644279189504 2025-01-08 04:22:01
      b 1904828644442767360 2025-01-08 06:57:43
      b 1904828641976516608 2025-01-08 17:51:34
    • oa_statistic_2025_c

      SELECT 'c' AS table_name, c.id, c.created_time
      FROM oa_statistic_2025_c c
      ORDER BY created_time
      LIMIT 10 OFFSET 3;
      

      The resulting data set:

      table_name id created_time
      c 1904828645780750336 2025-01-04 16:51:36
      c 1904828640760168448 2025-01-05 15:46:41
      c 1904828645327765504 2025-01-06 16:46:41
      c 1904828641477394432 2025-01-07 16:21:11
      c 1904828645617172480 2025-01-09 10:32:10
      c 1904828643540992000 2025-01-10 06:39:48
      c 1904828642962178048 2025-01-11 10:47:03
      c 1904828642521776128 2025-01-12 05:43:39
      c 1904828644837031936 2025-01-12 07:46:20
      c 1904828641364148224 2025-01-12 08:42:18
  2. The smallest record across the three result sets is (a, 1904828646275678208, '2025-01-02 01:28:05'), and the largest record in each result set is:

    (a, 1904828644237246464, '2025-01-08 03:05:51'), (b, 1904828641976516608, '2025-01-08 17:51:34'), (c, 1904828641364148224, '2025-01-12 08:42:18')

    So run the second query against each table:

    • oa_statistic_2025_a

      SELECT 'a' AS table_name, a.id, a.created_time
      FROM oa_statistic_2025_a a
      WHERE a.created_time BETWEEN '2025-01-02 01:28:05' AND '2025-01-08 03:05:51'
      ORDER BY created_time;
      

      The resulting data set:

      table_name id created_time
      a 1904828646275678208 2025-01-02 01:28:05
      a 1904828645797527552 2025-01-03 14:36:56
      a 1904828642161065984 2025-01-03 18:49:48
      a 1904828641582252032 2025-01-04 05:12:00
      a 1904828644891557888 2025-01-05 14:11:14
      a 1904828642349809664 2025-01-05 19:12:04
      a 1904828646212763648 2025-01-06 07:15:59
      a 1904828641078935552 2025-01-07 10:49:26
      a 1904828645180964864 2025-01-07 13:29:36
      a 1904828644237246464 2025-01-08 03:05:51

      Forward offset relative to the first query: \(f''_1 = 0\)

    • oa_statistic_2025_b

      SELECT 'b' AS table_name, b.id, b.created_time
      FROM oa_statistic_2025_b b
      WHERE b.created_time BETWEEN '2025-01-02 01:28:05' AND '2025-01-08 17:51:34'
      ORDER BY created_time;
      

      The resulting data set:

      table_name id created_time
      b 1904828641108295680 2025-01-03 06:08:00
      b 1904828641498365952 2025-01-03 11:58:52
      b 1904828646179209216 2025-01-05 02:52:25
      b 1904828642253340672 2025-01-06 14:08:36
      b 1904828640554647552 2025-01-07 01:06:33
      b 1904828643033481216 2025-01-07 04:50:19
      b 1904828641838104576 2025-01-08 01:44:56
      b 1904828644279189504 2025-01-08 04:22:01
      b 1904828644442767360 2025-01-08 06:57:43
      b 1904828641976516608 2025-01-08 17:51:34

      Forward offset relative to the first query: \(f''_2 = 0\)

    • oa_statistic_2025_c

      SELECT 'c' AS table_name, c.id, c.created_time
      FROM oa_statistic_2025_c c
      WHERE c.created_time BETWEEN '2025-01-02 01:28:05' AND '2025-01-12 08:42:18'
      ORDER BY created_time;
      

      The resulting data set:

      table_name id created_time
      c 1904828645503926272 2025-01-02 21:02:03
      c 1904828645780750336 2025-01-04 16:51:36
      c 1904828640760168448 2025-01-05 15:46:41
      c 1904828645327765504 2025-01-06 16:46:41
      c 1904828641477394432 2025-01-07 16:21:11
      c 1904828645617172480 2025-01-09 10:32:10
      c 1904828643540992000 2025-01-10 06:39:48
      c 1904828642962178048 2025-01-11 10:47:03
      c 1904828642521776128 2025-01-12 05:43:39
      c 1904828644837031936 2025-01-12 07:46:20
      c 1904828641364148224 2025-01-12 08:42:18

      Forward offset relative to the first query: \(f''_3 = 1\) (the second query returned one row, (c, 1904828645503926272, '2025-01-02 21:02:03'), that precedes the first row of the first query's result)

  3. Union these result sets and apply the combined offset \(f'' = (f \bmod m) + f''_1 + f''_2 + f''_3 = 1 + 0 + 0 + 1 = 2\):

    SELECT *
    FROM ((SELECT 'a' AS table_name, a.id, a.created_time
           FROM oa_statistic_2025_a a
           WHERE a.created_time BETWEEN '2025-01-02 01:28:05' AND '2025-01-08 03:05:51'
           ORDER BY created_time)
          UNION ALL
          (SELECT 'b' AS table_name, b.id, b.created_time
           FROM oa_statistic_2025_b b
           WHERE b.created_time BETWEEN '2025-01-02 01:28:05' AND '2025-01-08 17:51:34'
           ORDER BY created_time)
          UNION ALL
          (SELECT 'c' AS table_name, c.id, c.created_time
           FROM oa_statistic_2025_c c
           WHERE c.created_time BETWEEN '2025-01-02 01:28:05' AND '2025-01-12 08:42:18'
           ORDER BY created_time)) AS t
    ORDER BY t.created_time
    LIMIT 10 OFFSET 2;
    

    The resulting data set (identical to the result of the UNION ALL query shown earlier):

    table_name id created_time
    b 1904828641108295680 2025-01-03 06:08:00
    b 1904828641498365952 2025-01-03 11:58:52
    a 1904828645797527552 2025-01-03 14:36:56
    a 1904828642161065984 2025-01-03 18:49:48
    a 1904828641582252032 2025-01-04 05:12:00
    c 1904828645780750336 2025-01-04 16:51:36
    b 1904828646179209216 2025-01-05 02:52:25
    a 1904828644891557888 2025-01-05 14:11:14
    c 1904828640760168448 2025-01-05 15:46:41
    a 1904828642349809664 2025-01-05 19:12:04

Code Implementation

The core code is shown below:

package org.xhliu.sharding.service;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.github.pagehelper.Page;
import com.github.pagehelper.PageHelper;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.util.Assert;
import org.springframework.util.CollectionUtils;
import org.xhliu.sharding.dto.SplitStatisticDto;
import org.xhliu.sharding.mapper.OaStatisticMapper;
import org.xhliu.sharding.rpo.SplitStatisticRpo;

import javax.annotation.Resource;
import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/**
 * @author lxh
 */
@Service
@Transactional
public class SplitStatisticService {

    private final static Logger log = LoggerFactory.getLogger(SplitStatisticService.class);

    @Resource
    private OaStatisticMapper oaStatisticMapper;

    @Resource
    private ObjectMapper mapper;

    public Page<SplitStatisticDto> multiplePageSearch(SplitStatisticRpo splitStatisticRpo) {
        Assert.notNull(splitStatisticRpo, "request parameter must not be null");
        Integer pageNo = splitStatisticRpo.getPageNo();
        Integer pageSize = splitStatisticRpo.getPageSize();

        // Compute each shard's first-query offset
        int allOffset = (pageNo - 1) * pageSize;
        List<String> tableNames = currSplitTableNames();
        int offset = allOffset / tableNames.size();
        int limit = pageSize + (allOffset - tableNames.size() * offset); // compensate for the remainder lost to integer division
        int maxLimit = pageSize * tableNames.size(); // the most rows any single table may need to return

        // First query: fetch each shard's page at the per-shard offset
        Map<String, List<SplitStatisticDto>> firstSubDataMap = Maps.newTreeMap();
        for (String tableName : tableNames) {
            SplitStatisticRpo rpo = SplitStatisticRpo.valueOf(splitStatisticRpo);
            rpo.setOffset(offset);
            rpo.setTableName(tableName);
            rpo.setPageSize(limit);
            firstSubDataMap.put(tableName, oaStatisticMapper.firstPageSearch(rpo));
        }

        // Compensate the offset for shards that hold fewer rows than the per-shard offset
        int subCnt = 0;
        for (Map.Entry<String, List<SplitStatisticDto>> entry : firstSubDataMap.entrySet()) {
            if (CollectionUtils.isEmpty(entry.getValue())) {
                SplitStatisticRpo rpo = SplitStatisticRpo.valueOf(splitStatisticRpo);
                rpo.setTableName(entry.getKey());
                long cnt = oaStatisticMapper.countFirstPageSearch(rpo);
                if (offset > cnt) {
                    subCnt += (int) ((long) offset - cnt);
                }
            }
        }


        // Scan each first-round result set to find the overall minimum record and each set's maximum record
        SplitStatisticDto min = null;
        Map<String, SplitStatisticDto> maxRecordMap = Maps.newTreeMap();
        for (Map.Entry<String, List<SplitStatisticDto>> entry : firstSubDataMap.entrySet()) {
            List<SplitStatisticDto> dtoList = entry.getValue();
            if (CollectionUtils.isEmpty(dtoList)) {
                SplitStatisticDto max = new SplitStatisticDto();
                max.setCreatedTime(LocalDateTime.MAX);
                maxRecordMap.put(entry.getKey(), max);
                continue;
            }

            SplitStatisticDto firstRow = dtoList.get(0);
            if (min == null) {
                min = firstRow;
            } else if (firstRow.getCreatedTime().isBefore(min.getCreatedTime())) {
                min = firstRow;
            }

            maxRecordMap.put(entry.getKey(), dtoList.get(dtoList.size() - 1));
        }
        if (Objects.isNull(min)) {
            return new Page<>();
        }

        // Second query: fetch the rows each shard actually contributes
        Map<String, List<SplitStatisticDto>> secondSubDataMap = Maps.newTreeMap();
        for (String tableName : tableNames) {
            SplitStatisticRpo rpo = SplitStatisticRpo.valueOf(splitStatisticRpo);
            SplitStatisticDto maxRecord = maxRecordMap.get(tableName);
            rpo.setTableName(tableName);
            rpo.setCondition("t.created_time BETWEEN '" + formatDateTime(min.getCreatedTime()) + "'"
                    + " AND '" + formatDateTime(maxRecord.getCreatedTime()) + "'");
            secondSubDataMap.put(tableName, oaStatisticMapper.secondPageSearch(rpo));
        }

        // Compute the offset required within the merged result set
        int allDataOffset = allOffset - (offset * tableNames.size()) + subCnt;
        for (String tableName : tableNames) {
            List<SplitStatisticDto> firstRows = firstSubDataMap.get(tableName);
            List<SplitStatisticDto> secondRows = secondSubDataMap.get(tableName);
            if (CollectionUtils.isEmpty(firstRows)) {
                allDataOffset += secondRows.size();
                continue;
            }

            int f = 0;
            SplitStatisticDto firstRow = firstRows.get(0);
            for (SplitStatisticDto secondRow : secondRows) {
                if (firstRow.equals(secondRow)) {
                    break;
                }
                f++;
            }
            allDataOffset += f;
        }

        // Data skew is possible; in that case, consider fetching more rows from the skewed table
        if (allDataOffset >= pageSize) {
            for (String tableName : tableNames) {
                List<SplitStatisticDto> firstQuery = firstSubDataMap.get(tableName);
                if (CollectionUtils.isEmpty(firstQuery)) continue;
                if (min != firstQuery.get(0)) continue;

                SplitStatisticRpo rpo = SplitStatisticRpo.valueOf(splitStatisticRpo);
                rpo.setTableName(tableName);
                rpo.setPageSize(maxLimit);
                rpo.setOffset(offset);
                secondSubDataMap.put(tableName, oaStatisticMapper.firstPageSearch(rpo));
            }
        }

        // Merge all the data as the final result
        List<SplitStatisticDto> data = Lists.newArrayList();
        for (Map.Entry<String, List<SplitStatisticDto>> entry : secondSubDataMap.entrySet()) {
            data.addAll(entry.getValue());
        }
        data.sort(Comparator.comparing(SplitStatisticDto::getCreatedTime));
        Page<SplitStatisticDto> res = new Page<>();
        res.addAll(data.subList(Math.min(allDataOffset, data.size()),
                Math.min(data.size(), allDataOffset + pageSize)));
        return res;
    }

    protected String formatDateTime(LocalDateTime dateTime) {
        // Zero-pad every field: '%2s' left-pads with spaces, which the first replaceAll
        // turns into '0'; '$' stands in for the date/time separator so it survives the
        // padding step, and the second replaceAll restores it to a space
        return String.format("%4s-%2s-%2s$%2s:%2s:%2s",
                        dateTime.getYear(), dateTime.getMonthValue(), dateTime.getDayOfMonth(),
                        dateTime.getHour(), dateTime.getMinute(), dateTime.getSecond()
                ).replaceAll(" ", "0")
                .replaceAll("\\$", " ");
    }

    protected List<String> currSplitTableNames() {
        return Lists.newArrayList("oa_statistic_2025_a",
                "oa_statistic_2025_b", "oa_statistic_2025_c");
    }
}
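
For reference, a hypothetical invocation might look like the following; the setters on SplitStatisticRpo are assumed to mirror the getters used above, and the real usage is covered by the linked unit tests:

// Hypothetical usage: fetch page 2 with 10 rows per page (n = 10, f = 10).
SplitStatisticRpo rpo = new SplitStatisticRpo();  // assumed no-arg constructor
rpo.setPageNo(2);     // assumed setter matching getPageNo()
rpo.setPageSize(10);  // assumed setter matching getPageSize()
// splitStatisticService: an injected SplitStatisticService instance
Page<SplitStatisticDto> page = splitStatisticService.multiplePageSearch(rpo);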

Unit tests: https://github.com/LiuXianghai-coder/Spring-Study/blob/master/sharding/src/test/java/org/xhliu/sharding/service/SplitStatisticServiceTest.java

Full demo: https://github.com/LiuXianghai-coder/Spring-Study/tree/master/sharding
