Clickhouse join优化之分桶优化
1、背景
ck在单表查询能够做到极致,但是在join上性能就相对尬尴,
A JOIN B 特别是当两张表的数据都不小的时候,经常就会有内存溢出,超时等等情况
特别是当AB都为分布表的时候
就拿常用的事件表(events_all)和用户表(users_all)做JOIN为例,都是分布表;
以下为clickhouse分桶测试(针对账号pid进行hash分桶,按月分区,相同的pid会落到同一个分片上,增加join效率,减少网络IO)
2、表结构创建
事件本地表CREATE TABLE test.join_event_local on cluster cluster_3shards_2replicas ( `ds` String, `gn` String, `log_type` String, `server_id` Int32, `pid` Int32, `plosgn` Int8, `role_uid` Int64, `log_time` Int32, `gold_num` Int32, `gold_type` Int32, `function_key` Int32, `direction` Int8, `item_id` Int32, `item_num` Int64, `uuid` String, `level` Int32, `vip` Int32, `remain` Int32, `salt` Int32, `plosgn_lo` Int8, `kv` String, `state` String, `flume2hdfs` String, `updatetime` String ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/join_event_local', '{replica}') PARTITION BY toYYYYMM(toDate(ds)) ORDER BY (gn, server_id, log_time, item_id, uuid, plosgn,intHash64(pid)) SAMPLE BY intHash64(pid) SETTINGS index_granularity = 8192;
事件分布式表
CREATE TABLE test.join_event_all on cluster cluster_3shards_2replicas as test.join_event_local ENGINE = Distributed('cluster_3shards_2replicas', 'test', 'join_event_local',intHash64(pid)) ;
用户本地表
CREATE TABLE test.join_user_local on cluster cluster_3shards_2replicas ( `ds` String, `gn` String, `pid` Int32, `qid` String, `active_time` Int64, `login_ip` Int64, `source` String, `bind` String, `os` String, `create_time` Int64, `last_login_time` Int64, `isvalid` Int32, `groupid` Int32, `status` Int8, `uuid_hash_low` String, `uuid_hash_high` String, `plosgn` Int32, `groupbase` Int32, `state` String, `timezone` String, `a` String, `b` String ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/join_user_local', '{replica}') PARTITION BY toYYYYMM(toDate(ds)) ORDER BY (gn, pid, intHash64(pid), active_time, login_ip, source, bind, os, create_time, last_login_time, isvalid, groupid, status, uuid_hash_low, uuid_hash_high, plosgn, groupbase) SAMPLE BY intHash64(pid) SETTINGS index_granularity = 8192 ;
用户分布式表
CREATE TABLE test.join_user_all on cluster cluster_3shards_2replicas as test.join_user_local ENGINE = Distributed('cluster_3shards_2replicas', 'test', 'join_user_local', intHash64(pid))
查询结果
***************非分桶join查询1*******************
测试sql1:账号关联,日期关联计数(256秒)
select count(a.pid) from (select pid,ds from test.event_all) a
join (select pid,ds from test.user_all) b on a.pid=b.pid and a.ds=b.ds group by ds order by ds limit 10;
***************分桶join查询1*******************
测试sql1:账号关联,日期关联计数 (分桶表常规写法查询,依然很慢,309秒)
select ds,count(a.pid) from (select pid,ds from test.join_event_all) a
join (select pid,ds from test.join_user_all) b on a.pid=b.pid and a.ds=b.ds group by ds order by ds limit 10;
(分桶表关联查询ck写法,62秒)
set distributed_product_mode = 'global';
select a.ds,count(a.pid) from test.join_event_all a
global join test.join_user_all b using(pid,ds) group by ds order by ds limit 10;
***************非分桶join查询2*******************
测试sql2:大表关联小表 账号关联:
select count(a.pid) from (select pid from test.event_all) a join (select pid from test.user_all) b on a.pid=b.pid; (非分桶 30秒)
***************分桶join查询2**********************
select count(a.pid) from test.join_event_all a
global join test.join_user_all b using(pid) limit 10; (分桶查询46秒)
注意:查询的时候不要B表括号起来select *from (select form A) a join (select from B) b ...这样 否则优化无效,这跟ck的解析有关
总结:当大表和大表进行关联的时候,效率提升明显(选定好常用的字段进行hash分桶,比如账号。。)(大小表关联条件单一的时候,可能效果不是太明显)
参考:https://blog.csdn.net/a495679822/article/details/118548564
posted on 2022-09-09 11:22 RICH-ATONE 阅读(1213) 评论(0) 编辑 收藏 举报