ClickHouse 多表查询优化

1、准备测试数据 

#创建小表
CREATE TABLE datasets.visits_v2
    ENGINE = CollapsingMergeTree(Sign)
        PARTITION BY toYYYYMM(StartDate)
        ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)
        SAMPLE BY intHash32(UserID)
        SETTINGS index_granularity = 8192
as
select *
from datasets.visits_v1
limit 10000;

#创建join结果表:避免控制台疯狂打印数据
CREATE TABLE datasets.hits_v2
    ENGINE = MergeTree()
        PARTITION BY toYYYYMM(EventDate)
        ORDER BY (CounterID, EventDate, intHash32(UserID))
        SAMPLE BY intHash32(UserID)
        SETTINGS index_granularity = 8192
as
select *
from datasets.hits_v1
where 1 = 0;

2 用 IN 代替 JOIN

当多表联查时,查询的数据仅从其中一张表出时,可考虑用 IN 操作而不是JOIN
#正例:使用join
insert into hits_v2
select a.*
from datasets.hits_v1 a
where a. CounterID in (select CounterID from datasets.visits_v1);

#反例:使用join
insert into table hits_v2
select a.*
from datasets.hits_v1 a
         left join datasets.visits_v1 b on a. CounterID = b. CounterID;

3 大小表JOIN

多表join时要满足小表在右的原则,右表关联时被加载到内存中与左表进行比较,ClickHouse中无论是Left join 、Right join 还是 Inner join 永远都是拿着右表中的每一条记录到左表中查找该记录是否存在,所以右表必须是小表。
小表在右
insert into table datasets.hits_v2
select a.*
from datasets.hits_v1 a
         left join datasets.visits_v2 b on a. CounterID = b. CounterID;
大表在右
insert into table datasets.hits_v2
select a.*
from datasets.visits_v2 b
         left join datasets.hits_v1 a on a.CounterID = b. CounterID;

4 注意谓词下推(版本差异)

Explain syntax
select a.* from datasets.hits_v1 a left join datasets.visits_v2 b on a. CounterID=b. CounterID
having a.EventDate = '2014-03-17';


Explain syntax
select a.* from datasets.hits_v1 a left join datasets.visits_v2 b on a. CounterID=b. CounterID
having b.StartDate = '2014-03-17';

insert into datasets.hits_v2
select a.* from datasets.hits_v1 a left join datasets.visits_v2 b on a. CounterID=b. CounterID
where a.EventDate = '2014-03-17';

insert into datasets.hits_v2
select a.* from (
    select * from
    datasets.hits_v1
    where EventDate = '2014-03-17'
) a left join datasets.visits_v2 b on a. CounterID=b. CounterID;

5 分布式表使用GLOBAL

两张分布式表上的IN和JOIN之前必须加上GLOBAL关键字,右表只会在接收查询请求的那个节点查询一次,并将其分发到其他节点上。如果不加GLOBAL关键字的话,每个节点都会单独发起一次对右表的查询,而右表又是分布式表,就导致右表一共会被查询N²次(N是该分布式表的分片数量),这就是查询放大,会带来很大开销。

6 使用字典表&提前过滤

将一些需要关联分析的业务创建成字典表进行join操作,前提是字典表不宜太大,因为字典表会常驻内存;通过增加逻辑过滤可以减少数据扫描,达到提高执行速度及降低内存消耗的目的
 
posted @ 2022-11-24 09:57  晓枫的春天  阅读(1546)  评论(0编辑  收藏  举报