clickHouse的SummingMergeTree引擎以及AggregatingMergeTree引擎使用介绍
一、SummingMergeTree
1.SummingMergeTree介绍
求和引擎继承自 MergeTree。区别在于,当合并 SummingMergeTree 表的数据片段时,ClickHouse 会把所有具有相同主键的行合并为一行,该行包含了被合并的行中具有数值数据类型的列的汇总值。如果主键的组合方式使得单个键值对应于大量的行,则可以显著的减少存储空间并加快数据查询的速度。
2.建表语句
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] ( name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], ... ) ENGINE = SummingMergeTree([columns]) [PARTITION BY expr] [ORDER BY expr] [SAMPLE BY expr] [SETTINGS name=value, ...]
columns - 包含了将要被汇总的列的列名的元组。可选参数。
所选的列必须是数值类型,并且不可位于主键中。
3.使用示例
--本地表 create table test.summing_table_test1 ( v1 Int32, v2 Int32, name String, total_date DateTime ) ENGINE = SummingMergeTree((v1,v2)) order by (name) partition by toDate(total_date) SETTINGS index_granularity = 8192; --写入测试数据: insert into test.summing_table_test1 values (1,2,'a',now()),(2,2,'a',now()-1*60*60),(3,4,'b',now()); --强制合并 optimize table test.summing_table_test1 FINAL; --查询数据: SELECT * FROM test.summing_table_test1 Query id: 2da82c96-2a90-496a-83fe-8a6528ba336c ┌─v1─┬─v2─┬─name─┬──────────total_date─┐ │ 3 │ 4 │ a │ 2021-10-13 11:41:12 │ │ 3 │ 4 │ b │ 2021-10-13 11:41:12 │ └────┴────┴──────┴─────────────────────┘
二、AggregatingMergeTree
1.AggregatingMergeTree 介绍
该表引擎继承自MergeTree,可以使用 AggregatingMergeTree 表来做增量数据统计聚合。如果要按一组规则来合并减少行数,则使用 AggregatingMergeTree 是合适的。AggregatingMergeTree是通过预先定义的聚合函数计算数据并通过二进制的格式存入表内。
是SummingMergeTree的加强版,SummingMergeTree能做的是对非主键列进行sum聚合,而AggregatingMergeTree则可以指定各种聚合函数。
2.建表语句
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] ( name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], ... ) ENGINE = AggregatingMergeTree() [PARTITION BY expr] [ORDER BY expr] [SAMPLE BY expr] [TTL expr] [SETTINGS name=value, ...]
3.使用示例
1)计算汇总员工工资
--建表: CREATE TABLE emp_aggregatingmergeTree ( emp_id UInt16 COMMENT '员工id', name String COMMENT '员工姓名', work_place String COMMENT '工作地点', age UInt8 COMMENT '员工年龄', depart String COMMENT '部门', salary AggregateFunction(sum, Decimal32(2)) COMMENT '工资' ) ENGINE = AggregatingMergeTree() ORDER BY (emp_id, name) PRIMARY KEY emp_id PARTITION BY work_place; ORDER BY (emp_id,name) -- 注意排序key是两个字段 PRIMARY KEY emp_id -- 主键是一个字段 --对于AggregateFunction类型的列字段,在进行数据的写入和查询时与其他的表引擎有很大区别,在写入数据时,需要调用-State函数;而在查询数据时,则需要调用相应的-Merge函数。对于上面的建表语句而言,需要使用sumState函数进行数据插入 -- 插入数据, -- 注意:需要使用INSERT…SELECT语句进行数据插入 INSERT INTO TABLE emp_aggregatingmergeTree SELECT 1,'tom','上海',25,'信息部',sumState(toDecimal32(10000,2)); INSERT INTO TABLE emp_aggregatingmergeTree SELECT 1,'tom','上海',25,'信息部',sumState(toDecimal32(20000,2)); -- 查询数据 SELECT emp_id,name,sumMerge(salary) FROM emp_aggregatingmergeTree GROUP BY emp_id,name; -- 结果输出 ┌─emp_id─┬─name─┬─sumMerge(salary)─┐ │ 1 │ tom │ 30000.00 │ └────────┴──────┴──────────────────┘ --AggregatingMergeTree通常作为物化视图的表引擎,与普通MergeTree搭配使用。物化视图是作为其他数据表上层的一种查询视图。 -- 创建一个MereTree引擎的明细表 -- 用于存储全量的明细数据 -- 对外提供实时查询 CREATE TABLE emp_mergetree_base ( emp_id UInt16 COMMENT '员工id', name String COMMENT '员工姓名', work_place String COMMENT '工作地点', age UInt8 COMMENT '员工年龄', depart String COMMENT '部门', salary Decimal32(2) COMMENT '工资' ) ENGINE = MergeTree() ORDER BY (emp_id, name) PARTITION BY work_place; -- 创建一张物化视图 -- 使用AggregatingMergeTree表引擎 CREATE MATERIALIZED VIEW view_emp_agg ENGINE = AggregatingMergeTree() PARTITION BY emp_id ORDER BY (emp_id, name) AS SELECT emp_id, name, sumState(salary) AS salary FROM emp_mergetree_base GROUP BY emp_id, name; -- 向基础明细表emp_mergetree_base插入数据 INSERT INTO emp_mergetree_base VALUES (1,'tom','上海',25,'技术部',20000),(1,'tom','上海',26,'人事部',10000); -- 查询物化视图 SELECT emp_id,name,sumMerge(salary) FROM view_emp_agg GROUP BY emp_id,name; -- 结果 ┌─emp_id─┬─name─┬─sumMerge(salary)─┐ │ 1 │ tom │ 50000.00 │ └────────┴──────┴──────────────────┘
2)展示每一个节点cpu 利用率的当前值
使用argMaxState 聚合列
create materialized view cpu_last_point_idle_mv engine = AggregatingMergeTree() partition by tuple() order by tags_id populate as select argMaxState(create_date,created_at) as created_data, maxState(create_at) as max_created_max, argMaxState(time,created_at) as time, tags_id, argMaxState(usage_idle,created_at) as usage_idle from cpu group by tags_id
argMax(a,b) 函数返回 b 最大值时 a的值
State 为聚合函数的后缀,聚合函数加此后缀不直接返回结果,返回聚合函数的中间结果,该中间结果可在AggregatingMergeTree 引擎中使用
使用Merge函数后缀得到聚合结果
create view cpu_last_point_idle_v as select argMaxMerge(created_date) as created_date, maxMerge(max_created_at) as created_at, argMaxMerge(time) as time, tags_id, argMaxMerge(usage_idle) as usage_idle from cpu_last_point_idle_mv group by tags_id
查询结果视图
select tags_id, 100 - usage_idle usage from cpu_last_point_idle_v order by usage desc,tags_id asc limit 10
3)创建一个跟踪tb_test_MergeTree_basic表的物化视图
create materialized view tb_test_AggregatingMergeTree_view ENGINE = AggregatingMergeTree() PARTITION BY (brandId,shopId) ORDER BY (brandId,shopId) as select brandId,shopId,sumState(saleMoney) saleMoney,sumState(saleQty) saleQty,countState(1) saleNum,uniqState(vipId) vipNum from tb_test_MergeTree_basic group by brandId,shopId
b64d9704419c :) create materialized view tb_test_AggregatingMergeTree_view ENGINE = AggregatingMergeTree() PARTITION BY (brandId,shopId) ORDER BY (brandId,shopId) as select brandId,shopId,sumState(saleMoney) saleMoney,sumState(saleQty) saleQty,countState(1) saleNum,uniqState(vipId) vipNum from tb_test_MergeTree_basic group by brandId,shopId CREATE MATERIALIZED VIEW tb_test_AggregatingMergeTree_view ENGINE = AggregatingMergeTree() PARTITION BY (brandId, shopId) ORDER BY (brandId, shopId) AS SELECT brandId, shopId, sumState(saleMoney) AS saleMoney, sumState(saleQty) AS saleQty, countState(1) AS saleNum, uniqState(vipId) AS vipNum FROM tb_test_MergeTree_basic GROUP BY brandId, shopId Ok. 0 rows in set. Elapsed: 0.012 sec. b64d9704419c :)
show table 可见比普通表多了“.inner.”前缀
目录名称也比普通表多了一些类似乱码的字符
建视图前已经存在的数据不能跟踪
tb_test_MergeTree_basic 表原来已经在创建物化视图的时候已经有数据了
b64d9704419c :) select * from tb_test_MergeTree_basic SELECT * FROM tb_test_MergeTree_basic ┌─brandId─┬─shopId─┬───saleDate─┬─saleMoney─┬─saleQty─┬─vipId─┐ │ 429 │ 6002 │ 2020-10-07 │ 200.5 │ 40 │ 10002 │ └─────────┴────────┴────────────┴───────────┴─────────┴───────┘ ┌─brandId─┬─shopId─┬───saleDate─┬─saleMoney─┬─saleQty─┬─vipId─┐ │ 429 │ 6002 │ 2020-10-05 │ 200.5 │ 10 │ 10001 │ └─────────┴────────┴────────────┴───────────┴─────────┴───────┘ ┌─brandId─┬─shopId─┬───saleDate─┬─saleMoney─┬─saleQty─┬─vipId─┐ │ 429 │ 6001 │ 2020-10-07 │ 200.5 │ 30 │ 10003 │ └─────────┴────────┴────────────┴───────────┴─────────┴───────┘ ┌─brandId─┬─shopId─┬───saleDate─┬─saleMoney─┬─saleQty─┬─vipId─┐ │ 429 │ 6002 │ 2020-10-04 │ 200.5 │ 40 │ 10001 │ └─────────┴────────┴────────────┴───────────┴─────────┴───────┘ ┌─brandId─┬─shopId─┬───saleDate─┬─saleMoney─┬─saleQty─┬─vipId─┐ │ 429 │ 6001 │ 2020-10-01 │ 200.5 │ 10 │ 10001 │ │ 429 │ 6001 │ 2020-10-02 │ 200.5 │ 20 │ 10002 │ │ 429 │ 6001 │ 2020-10-03 │ 200.5 │ 30 │ 10003 │ │ 429 │ 6001 │ 2020-10-04 │ 200.5 │ 10 │ 10001 │ │ 429 │ 6001 │ 2020-10-05 │ 200.5 │ 20 │ 10001 │ └─────────┴────────┴────────────┴───────────┴─────────┴───────┘ ┌─brandId─┬─shopId─┬───saleDate─┬─saleMoney─┬─saleQty─┬─vipId─┐ │ 429 │ 6001 │ 2020-10-06 │ 200.5 │ 30 │ 10003 │ └─────────┴────────┴────────────┴───────────┴─────────┴───────┘
执行一次optimize table tb_test_AggregatingMergeTree_view
再查tb_test_AggregatingMergeTree_view视图
b64d9704419c :) select * from tb_test_AggregatingMergeTree_view SELECT * FROM tb_test_AggregatingMergeTree_view Ok. 0 rows in set. Elapsed: 0.003 sec. b64d9704419c :)
可见没有跟踪建表之前的已经存在的数据
可以跟踪建视图后再插入的数据
1)插入2条数据
insert into tb_test_MergeTree_basic values (429,6001,'2020-10-08 14:15:23',200.50,30,10003) insert into tb_test_MergeTree_basic values (429,6002,'2020-10-08 14:15:23',200.50,40,10002)
2)查看
b64d9704419c :) select * from tb_test_AggregatingMergeTree_view SELECT * FROM tb_test_AggregatingMergeTree_view ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6001 │ i@ │ │ │ ³Gw │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6002 │ i@ │ ( │ │ $a6㞠 │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ 2 rows in set. Elapsed: 0.008 sec. b64d9704419c :)
3)聚合结果
b64d9704419c :) select brandId,shopId,sumMerge(saleMoney) saleMoney,sumMerge(saleQty) saleQty,countMerge(saleNum) saleNum,uniqMerge(vipNum) vipNum from tb_test_AggregatingMergeTree_view group by brandId,shopId SELECT brandId, shopId, sumMerge(saleMoney) AS saleMoney, sumMerge(saleQty) AS saleQty, countMerge(saleNum) AS saleNum, uniqMerge(vipNum) AS vipNum FROM tb_test_AggregatingMergeTree_view GROUP BY brandId, shopId ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6002 │ 200.5 │ 40 │ 1 │ 1 │ │ 429 │ 6001 │ 200.5 │ 30 │ 1 │ 1 │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ 2 rows in set. Elapsed: 0.005 sec. b64d9704419c :)
4)继续插入新的数据
insert into tb_test_MergeTree_basic values (429,6001,'2020-10-09 14:15:23',200.50,10,10001) insert into tb_test_MergeTree_basic values (429,6001,'2020-10-09 14:15:23',200.50,20,10002) insert into tb_test_MergeTree_basic values (429,6001,'2020-10-09 14:15:23',200.50,30,10003) insert into tb_test_MergeTree_basic values (429,6001,'2020-10-09 14:15:23',200.50,10,10001) insert into tb_test_MergeTree_basic values (429,6001,'2020-10-09 14:15:23',200.50,20,10001) insert into tb_test_MergeTree_basic values (429,6001,'2020-10-09 14:15:23',200.50,30,10003) insert into tb_test_MergeTree_basic values (429,6002,'2020-10-09 14:15:23',200.50,40,10001) insert into tb_test_MergeTree_basic values (429,6002,'2020-10-09 14:15:23',200.50,10,10001) insert into tb_test_MergeTree_basic values (429,6001,'2020-10-10 14:15:23',200.50,10,10001)
5)查看
可见分区没有合并
b64d9704419c :) select * from tb_test_AggregatingMergeTree_view SELECT * FROM tb_test_AggregatingMergeTree_view ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6001 │ i@ │ │ │ l │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6001 │ i@ │ │ │ ³Gw │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6002 │ i@ │ │ │ l │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6001 │ i@ │ │ │ l │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6002 │ i@ │ ( │ │ $a6㞠 │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum───┐ │ 429 │ 6001 │ T@ │ d │ │ l ³Gw │ └─────────┴────────┴───────────┴─────────┴─────────┴──────────┘ ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6002 │ i@ │ ( │ │ l │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ 7 rows in set. Elapsed: 0.004 sec. b64d9704419c :)
6)观察自动跟踪
b64d9704419c :) select brandId,shopId,sumMerge(saleMoney) saleMoney,sumMerge(saleQty) saleQty,countMerge(saleNum) saleNum,uniqMerge(vipNum) vipNum from tb_test_AggregatingMergeTree_view group by brandId,shopId SELECT brandId, shopId, sumMerge(saleMoney) AS saleMoney, sumMerge(saleQty) AS saleQty, countMerge(saleNum) AS saleNum, uniqMerge(vipNum) AS vipNum FROM tb_test_AggregatingMergeTree_view GROUP BY brandId, shopId ┌─brandId─┬─shopId─┬─saleMoney─┬─saleQty─┬─saleNum─┬─vipNum─┐ │ 429 │ 6002 │ 601.5 │ 90 │ 3 │ 2 │ │ 429 │ 6001 │ 1604 │ 160 │ 8 │ 3 │ └─────────┴────────┴───────────┴─────────┴─────────┴────────┘ 2 rows in set. Elapsed: 0.010 sec. b64d9704419c :)
7)可见确实已经自动跟踪聚合了
但是只是创建视图后插入的数据才能跟踪,验证SQL 如下
select brandId,shopId,sum(saleMoney),sum(saleQty),count(1),uniq(vipId) from tb_test_MergeTree_basic where saleDate>='2020-10-08' group by brandId,shopId
执行结果
b64d9704419c :) select brandId,shopId,sum(saleMoney),sum(saleQty),count(1),uniq(vipId) from tb_test_MergeTree_basic where saleDate>='2020-10-08' group by brandId,shopId SELECT brandId, shopId, sum(saleMoney), sum(saleQty), count(1), uniq(vipId) FROM tb_test_MergeTree_basic WHERE saleDate >= '2020-10-08' GROUP BY brandId, shopId ┌─brandId─┬─shopId─┬─sum(saleMoney)─┬─sum(saleQty)─┬─count(1)─┬─uniq(vipId)─┐ │ 429 │ 6002 │ 601.5 │ 90 │ 3 │ 2 │ │ 429 │ 6001 │ 1604 │ 160 │ 8 │ 3 │ └─────────┴────────┴────────────────┴──────────────┴──────────┴─────────────┘ 2 rows in set. Elapsed: 0.003 sec. b64d9704419c :)
8)创建视图前原来已经存在的数据是不能被跟踪的
下面的这部分值不能被跟踪
select brandId,shopId,sum(saleMoney),sum(saleQty),count(1),uniq(vipId) from tb_test_MergeTree_basic where saleDate<'2020-10-08' group by brandId,shopId SELECT brandId, shopId, sum(saleMoney), sum(saleQty), count(1), uniq(vipId) FROM tb_test_MergeTree_basic WHERE saleDate < '2020-10-08' GROUP BY brandId, shopId ┌─brandId─┬─shopId─┬─sum(saleMoney)─┬─sum(saleQty)─┬─count(1)─┬─uniq(vipId)─┐ │ 429 │ 6002 │ 601.5 │ 90 │ 3 │ 2 │ │ 429 │ 6001 │ 1403.5 │ 150 │ 7 │ 3 │ └─────────┴────────┴────────────────┴──────────────┴──────────┴─────────────┘ 2 rows in set. Elapsed: 0.003 sec. b64d9704419c :)
4.总结
1)使用ORDER BY排序键作为聚合数据的依据
2)使用AggregateFunction字段类型定义聚合函数的类型以及聚合字段
3)只有在合并分区的时候才会触发聚合计算的逻辑
4)聚合只会发生在同分区内,不同分区的数据不会发生聚合
5)在进行数据计算时,因为同分区的数据已经基于ORDER BY排序,所以能够找到相邻且具有相同聚合key的数据
6)在聚合数据时,同一分区内,相同聚合key的多行数据会合并成一行,对于那些非主键、非AggregateFunction类型字段,则会取第一行数据
7)AggregateFunction类型字段使用二进制存储,在写入数据时,需要调用state函数;在读数据时,需要调用merge函数,*表示定义时使用的聚合函数
8)AggregateMergeTree通常作为物化视图的引擎,与普通的MergeTree搭配使用
5.注意
可以使用AggregatingMergeTree
表来做增量数据统计聚合,包括物化视图的数据聚合
1)AggregatingMergeTree表不能跟踪basic表,在执行inser select 之后查的数据无法进行聚合,只能inser select 之前的数据聚合
2)AggregatingMergeTree物化视图可以跟踪basic表,但是在视图创建前已经存在的数据不能被跟踪,只能跟踪聚合视图创建后新插入的数据