ClickHouse 多表查询优化
1、准备测试数据
#创建小表 CREATE TABLE datasets.visits_v2 ENGINE = CollapsingMergeTree(Sign) PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192 as select * from datasets.visits_v1 limit 10000; #创建join结果表:避免控制台疯狂打印数据 CREATE TABLE datasets.hits_v2 ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192 as select * from datasets.hits_v1 where 1 = 0;
2 用 IN 代替 JOIN
当多表联查时,查询的数据仅从其中一张表出时,可考虑用 IN 操作而不是JOIN
#正例:使用join insert into hits_v2 select a.* from datasets.hits_v1 a where a. CounterID in (select CounterID from datasets.visits_v1); #反例:使用join insert into table hits_v2 select a.* from datasets.hits_v1 a left join datasets.visits_v1 b on a. CounterID = b. CounterID;
3 大小表JOIN
多表join时要满足小表在右的原则,右表关联时被加载到内存中与左表进行比较,ClickHouse中无论是Left join 、Right join 还是 Inner join 永远都是拿着右表中的每一条记录到左表中查找该记录是否存在,所以右表必须是小表。
小表在右
insert into table datasets.hits_v2 select a.* from datasets.hits_v1 a left join datasets.visits_v2 b on a. CounterID = b. CounterID;
大表在右
insert into table datasets.hits_v2 select a.* from datasets.visits_v2 b left join datasets.hits_v1 a on a.CounterID = b. CounterID;
4 注意谓词下推(版本差异)
Explain syntax select a.* from datasets.hits_v1 a left join datasets.visits_v2 b on a. CounterID=b. CounterID having a.EventDate = '2014-03-17'; Explain syntax select a.* from datasets.hits_v1 a left join datasets.visits_v2 b on a. CounterID=b. CounterID having b.StartDate = '2014-03-17'; insert into datasets.hits_v2 select a.* from datasets.hits_v1 a left join datasets.visits_v2 b on a. CounterID=b. CounterID where a.EventDate = '2014-03-17'; insert into datasets.hits_v2 select a.* from ( select * from datasets.hits_v1 where EventDate = '2014-03-17' ) a left join datasets.visits_v2 b on a. CounterID=b. CounterID;
5 分布式表使用GLOBAL
两张分布式表上的IN和JOIN之前必须加上GLOBAL关键字,右表只会在接收查询请求的那个节点查询一次,并将其分发到其他节点上。如果不加GLOBAL关键字的话,每个节点都会单独发起一次对右表的查询,而右表又是分布式表,就导致右表一共会被查询N²次(N是该分布式表的分片数量),这就是查询放大,会带来很大开销。
6 使用字典表&提前过滤
将一些需要关联分析的业务创建成字典表进行join操作,前提是字典表不宜太大,因为字典表会常驻内存;通过增加逻辑过滤可以减少数据扫描,达到提高执行速度及降低内存消耗的目的