kingbase ES group by 语句优化

1、group by 分组语句

在SQL中group by主要用来进行分组统计，分组字段放在group by的后面；分组结果一般需要借助聚合函数实现。

group by语法结构 1、常用语法

语法结构 SELECT column_name1,column_name2, … 聚合函数1,聚合函数2 , … FROM table_name GROUP BY column_name1,column_name2, … 说明：1、group by中的分组字段和select后的字段要保持一致；2、通常group by和聚合函数一起使用，但也可以不包含聚合函数，根据业务需求决定；3、常用的聚合函数有max()、min()、avg()、sum()、count()等。

2、影响group by的内存参数

work_mem:因为group by 语句很会用到sort、hash中的任何一中方式，如果work_mem 配置不合适就是用到temp file 进行排序，调整work_mem的目的是使相关操作尽可能在内存中执行

max_parallel_workers：并行相关参数可以条并行扫描相关的速度，是通过消耗cpu资源提升查询效率得一种方法

max_parallel_workers_per_gather：同上

3、聚合方式对group 的影响（重点介绍）

金仓数据库中针对group by 函数有两种聚合方式hashagg、groupagg

HashAggregate

对于hash聚合来说，数据库会根据group by字段后面的值算出hash值，并根据前面使用的聚合函数在内存中维护对应的列表。如果select后面有两个聚合函数，那么在内存中就会维护两个对应的数据。同样的，有n个聚合函数就会维护n个同样的数组。对于hash算法来说，数组的长度肯定是大于group by的字段的distinct值的个数的，且跟这个值应该呈线性关系，group by后面的值越唯一，使用的内存也就越大。

执行计划示例：

aligputf8=# explain select count(1) from pg_class group by oid;
QUERY PLAN
----------------------------------------------------------------------
HashAggregate (cost=1721.40..2020.28 rows=23910 width=4)
Group By: oid
-> Seq Scan on pg_class (cost=0.00..1004.10 rows=143460 width=4)
Settings: enable_seqscan=on
(4 rows)

GroupAggregate

对于普通聚合函数，使用group聚合，其原理是先将表中的数据按照group by的字段排序，这样子同一个group by的值就在一起，这样就只需要对排好序的数据进行一次全扫描，就可以得到聚合的结果了。

执行计划示例：

aligputf8=# set enable_hashagg = off;
SET
aligputf8=# explain select count(1) from pg_class group by oid;
QUERY PLAN
----------------------------------------------------------------------------
GroupAggregate (cost=13291.66..14666.48 rows=23910 width=4)
Group By: oid
-> Sort (cost=13291.66..13650.31 rows=143460 width=4)
Sort Key: oid
-> Seq Scan on pg_class (cost=0.00..1004.10 rows=143460 width=4)
Settings: enable_hashagg=off; enable_seqscan=on
(6 rows)

在不考虑聚合函数的前提下，从上面的两个执行计划的cost来说，GroupAgg 由于需要排序，效率很差，消耗是HashAggregate的7倍。

但是HashAgg也不是万能的，从定义中可以看到在select 查询中用到一个聚合函数hashagg就需要额外维护一个数组。如果的情况下hashagg效率会随着聚合函数的增加直线上升。

对比两种聚合方式

hashagg 不进行排序操作，所以如果对结果集顺序不做要求的情况下有限hashagg
hashagg 的资源消耗会随着聚合函数的增加直线上升。

实验数据

1、构造1000万行测试数据

create table t1(id int primary key, a int, b int);
declare
i int;
begin
i=1;
while(i<=10000000) loop
insert into t1 values(i, i, i);
i=i+1;
end loop;
end;

测试场景一：不带聚合函数

HashAggregate聚合

kingbase=# explain (analyze true ,buffers true) SELECT /*+hashagg*/ a%10 as m,b%100 n from t1 group by m ,n;

QUERY PLAN
HashAggregate (cost=254348.64..404349.12 rows=10000032 width=8) (actual time=4264.745..4317.458 rows=100 loops=1)
Group Key: (a % 10), (b % 100)
Buffers: shared hit=10948 read=43400
-> Seq Scan on t1 (cost=0.00..204348.48 rows=10000032 width=8) (actual time=0.098..2026.376 rows=10000000 loops=1)
Buffers: shared hit=10948 read=43400
Planning Time: 0.059 ms
Execution Time: 4524.736 ms
(7 rows)

GroupAggregate聚合

kingbase=# explain (analyze true ,buffers true) SELECT a%10 as m,b%100 n from t1 group by m ,n;

QUERY PLAN
Group (cost=1367027.26..1492027.66 rows=10000032 width=8) (actual time=8414.506..12554.006 rows=100 loops=1)
Group Key: ((a % 10)), ((b % 100))
Buffers: shared hit=11015 read=43336
-> Sort (cost=1367027.26..1392027.34 rows=10000032 width=8) (actual time=8414.498..11298.553 rows=10000000 loops=1)
Sort Key: ((a % 10)), ((b % 100))
Sort Method: quicksort Memory: 861967kB
Buffers: shared hit=11015 read=43336
-> Seq Scan on t1 (cost=0.00..204348.48 rows=10000032 width=8) (actual time=0.030..2049.870 rows=10000000 loops=1)
Buffers: shared hit=11012 read=43336
Planning Time: 0.060 ms
Execution Time: 12606.549 ms
(11 rows)

针对1000万数据在没有聚合函数情况下 hashagg性能是groupagg的两倍以上

测试场景二：少量聚合函数

HashAggregate聚合

kingbase=#explain (analyze true,buffers true) SELECT /+hashagg/ a%10 as m,b%100 n ,max(a),max(b),avg(a),avg(b) from t1 group by m ,n;

QUERY PLAN
HashAggregate (cost=10000354348.96..10000554349.60 rows=10000032 width=80) (actual time=5248.529..5308.127 rows=100 loops=1)
Group Key: (a % 10), (b % 100)
Buffers: shared hit=11076 read=43272
-> Seq Scan on t1 (cost=0.00..204348.48 rows=10000032 width=16) (actual time=0.055..2152.745 rows=10000000 loops=1)
Buffers: shared hit=11076 read=43272
Planning Time: 0.188 ms
Execution Time: 5541.141 ms
(7 rows)

GroupAggregate聚合

kingbase=# explain (analyze true ,buffers true) SELECT a%10 as m,b%100 n ,max(a),max(b),avg(a),avg(b) from t1 group by m ,n;

QUERY PLAN
GroupAggregate (cost=1367027.26..1742028.46 rows=10000032 width=80) (actual time=9070.775..14161.918 rows=100 loops=1)
Group Key: ((a % 10)), ((b % 100))
Buffers: shared hit=11044 read=43304
-> Sort (cost=1367027.26..1392027.34 rows=10000032 width=16) (actual time=9010.429..11943.928 rows=10000000 loops=1)
Sort Key: ((a % 10)), ((b % 100))
Sort Method: quicksort Memory: 861967kB
Buffers: shared hit=11044 read=43304
-> Seq Scan on t1 (cost=0.00..204348.48 rows=10000032 width=16) (actual time=0.017..2160.858 rows=10000000 loops=1)
Buffers: shared hit=11044 read=43304
Planning Time: 0.118 ms
Execution Time: 14221.171 ms
(11 rows)

测试场景三：大量聚合函数

HashAggregate聚合

explain(analyze true,buffers true)
SELECT/+hashagg/a%10 as m,b%100 n ,max(a),max(b),avg(a),avg(b),min(a),min(b),sum(a),sum(b),sum(a+b),sum(a-b) from t1 group by m ,n;

QUERY PLAN
HashAggregate (cost=10000554349.60..10000754350.24 rows=10000032 width=120) (actual time=6070.836..6131.290 rows=100 loops=1)
Group Key: (a % 10), (b % 100)
Buffers: shared hit=11204 read=43144
-> Seq Scan on t1 (cost=0.00..204348.48 rows=10000032 width=16) (actual time=0.096..2148.660 rows=10000000 loops=1)
Buffers: shared hit=11204 read=43144
Planning Time: 0.138 ms
Execution Time: 6359.172 ms
(7 rows)

本次只是大量聚合函数测试可以看到HashAggregate 相较于场景二时间变长1s中左右，具体多少聚合函数会对hashagg 有比较大的影响还需要大家深入探讨

测试场景四：hashagg 走索引

总结

可以看出，对于GroupAgg来说，消耗的内存基本上是恒定的，无论group by哪个字段。当聚合函数较少的时候，速度也相对较慢，但是相对稳定。

HashAgg在少数聚合函数是表现优异，但是很多聚合函数，性能跟消耗的内存差异很明显。尤其是受group by字段的唯一性很明显，字段count（district）值越大，hash聚合消耗的内存越多，性能下降剧烈。

所以在sql中有大量聚合函数，group by 的字段由相对比较唯一的时候，应该用GroupAgg，而不能用HashAgg。

posted @ 2023-09-18 15:19 KINGBASE研究院阅读(234) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

KINGBASE研究院

kingbase ES group by 语句优化

1、group by 分组语句

2、影响group by的内存参数

3、聚合方式对group 的影响（重点介绍）

HashAggregate

GroupAggregate

对比两种聚合方式

实验数据

测试场景一：不带聚合函数

测试场景二：少量聚合函数

测试场景三：大量聚合函数

测试场景四：hashagg 走索引

总结

公告