ClickHouse 使用Primary Key原因以及为什么与 Sorting Key 不同

官方地址

首先选择主键原因

Selecting the Primary Key​
The number of columns in the primary key is not explicitly limited. Depending on the data structure, you can include more or fewer columns in the primary key. This may:

Improve the performance of an index.

If the primary key is (a, b), then adding another column c will improve the performance if the following conditions are met:

There are queries with a condition on column c.
Long data ranges (several times longer than the index_granularity) with identical values for (a, b) are common. In other words, when adding another column allows you to skip quite long data ranges.
Improve data compression.

ClickHouse sorts data by primary key, so the higher the consistency, the better the compression.

Provide additional logic when merging data parts in the CollapsingMergeTree and SummingMergeTree engines.

In this case it makes sense to specify the sorting key that is different from the primary key.

A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during SELECT queries.

You can create a table without a primary key using the ORDER BY tuple() syntax. In this case, ClickHouse stores data in the order of inserting. If you want to save data order when inserting data by INSERT ... SELECT queries, set max_insert_threads = 1.

To select data in the initial order, use single-threaded SELECT queries.

翻译过来就是

主键中包含的列数是没有限制的,可以通过多个或者少量的列作为主键,可以达到的效果有:

1. 提高索引性能(有a,b两字段索引,当查询条件中有c 并且 a,b 值相同的数据很多(比index_granuclear长几倍),增加c列作为主键可以跳过很多数据)

2. 提升数据压缩(ClickHouse 根据主键对数据进行压缩,更高的一致性带来的是更好的数据压缩)

3. 在CollapsingMergeTree and SummingMergeTree 表引擎下合并数据时主键提供额外的合并逻辑(在这种情况下,指定一个不同于sorting key的主键是有意义的)

4. 一个长的主键会影响数据插入性能和内存消耗,但是主键额外的列不会影响clickhouse的查询select查询性能

5. 你可以创建一个表没有主键,使用ORDER BY tuple(),这种情况下ClickHouse按照插入顺序存储。在使用INSERT ... SELECT查询时,如果你想保证保存数据的顺序,必须设置set max_insert_threads = 1,让数据按照初始查询的顺序,使用单线程插入保存。

选择不同于sorting key 的原因

Choosing a Primary Key that Differs from the Sorting Key
It is possible to specify a primary key (an expression with values that are written in the index file for each mark) that is different from the sorting key (an expression for sorting the rows in data parts). In this case the primary key expression tuple must be a prefix of the sorting key expression tuple.

This feature is helpful when using the SummingMergeTree and AggregatingMergeTree table engines. In a common case when using these engines, the table has two types of columns: dimensions and measures. Typical queries aggregate values of measure columns with arbitrary GROUP BY and filtering by dimensions. Because SummingMergeTree and AggregatingMergeTree aggregate rows with the same value of the sorting key, it is natural to add all dimensions to it. As a result, the key expression consists of a long list of columns and this list must be frequently updated with newly added dimensions.

In this case it makes sense to leave only a few columns in the primary key that will provide efficient range scans and add the remaining dimension columns to the sorting key tuple.

ALTER of the sorting key is a lightweight operation because when a new column is simultaneously added to the table and to the sorting key, existing data parts do not need to be changed. Since the old sorting key is a prefix of the new sorting key and there is no data in the newly added column, the data is sorted by both the old and new sorting keys at the moment of table modification.

翻译过来总体意思就是

1:首先如果不设置primary key,primary key默认就是order by key
2:修改order by key比修改primary key简单很多
3:primary key必须是order by key的前缀
4:SummingMergeTree和AggregatingMergeTree中 才会一般出现这种情况,即primary key 和order by key不一样。
是为了给修改order by key留出空间,因为如果不设置primary key,因为原因1就会导致primary key和sort by key一样,这样修改sort by key的时候就会出问题,至于为什么要修改sort by key,是因为俩个聚合mergetree的作用决定的

posted @ 2022-10-19 17:19  衰草寒烟  阅读(1324)  评论(0编辑  收藏  举报