ClickHouse Data Types (Part 2)
4.4 Secondary Indexes
According to the current ClickHouse documentation, the secondary (data skipping) index feature is now enabled by default (older versions required enabling it explicitly). Create a test table:
```sql
create table t_order_mt2(
    id UInt32,
    sku_id String,
    total_amount Decimal(16,2),
    create_time Datetime,
    INDEX a total_amount TYPE minmax GRANULARITY 5
) engine = MergeTree
  partition by toYYYYMMDD(create_time)
  primary key (id)
  order by (id, sku_id);
```
Here GRANULARITY N sets the granularity of the secondary index relative to the primary index: one skip-index entry summarizes N primary-index granules.
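As a quick sanity check, the index definition can be inspected through a system table; a minimal sketch, assuming a ClickHouse release recent enough to ship system.data_skipping_indices:

```sql
-- Inspect the skip index declared on t_order_mt2
-- (assumes the server provides system.data_skipping_indices)
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = 'default' AND table = 't_order_mt2';
```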
Insert data:

```sql
insert into t_order_mt2 values
(101,'sku_001',1000.00,'2022-08-18 12:00:00'),
(102,'sku_002',2000.00,'2022-08-18 11:00:00'),
(102,'sku_004',2500.00,'2022-08-18 12:00:00'),
(102,'sku_002',2000.00,'2022-08-18 13:00:00'),
(102,'sku_002',12000.00,'2022-08-18 13:00:00'),
(102,'sku_002',600.00,'2022-08-19 12:00:00');
```
Files on disk:
```
/var/lib/clickhouse/data/default/t_order_mt2/20220818_1_1_0
[root@hadoop201 20220818_1_1_0]# ll
total 44
-r--r----- 2 clickhouse clickhouse 334 Aug  4 05:22 checksums.txt
-r--r----- 2 clickhouse clickhouse 118 Aug  4 05:22 columns.txt
-r--r----- 2 clickhouse clickhouse   1 Aug  4 05:22 count.txt
-r--r----- 2 clickhouse clickhouse 189 Aug  4 05:22 data.bin
-r--r----- 2 clickhouse clickhouse 144 Aug  4 05:22 data.mrk3
-r--r----- 2 clickhouse clickhouse  10 Aug  4 05:22 default_compression_codec.txt
-r--r----- 2 clickhouse clickhouse   8 Aug  4 05:22 minmax_create_time.idx
-r--r----- 2 clickhouse clickhouse   4 Aug  4 05:22 partition.dat
-r--r----- 2 clickhouse clickhouse   8 Aug  4 05:22 primary.idx
-r--r----- 2 clickhouse clickhouse  41 Aug  4 05:22 skp_idx_a.idx
-r--r----- 2 clickhouse clickhouse  24 Aug  4 05:22 skp_idx_a.mrk3
```
The secondary index files are present (skp_idx_a.idx and skp_idx_a.mrk3). Query the data:
```
SELECT * FROM t_order_mt2

Query id: 325b5819-a7b3-41d7-a74a-d9e507576428

┌──id─┬─sku_id──┬─total_amount─┬─────────create_time─┐
│ 102 │ sku_002 │       600.00 │ 2022-08-19 12:00:00 │
└─────┴─────────┴──────────────┴─────────────────────┘
┌──id─┬─sku_id──┬─total_amount─┬─────────create_time─┐
│ 101 │ sku_001 │      1000.00 │ 2022-08-18 12:00:00 │
│ 102 │ sku_002 │      2000.00 │ 2022-08-18 11:00:00 │
│ 102 │ sku_002 │      2000.00 │ 2022-08-18 13:00:00 │
│ 102 │ sku_002 │     12000.00 │ 2022-08-18 13:00:00 │
│ 102 │ sku_004 │      2500.00 │ 2022-08-18 12:00:00 │
└─────┴─────────┴──────────────┴─────────────────────┘
```
Comparison: running the test below shows that the secondary index does take effect for a query on a non-primary-key column (note the "Index `a` has dropped 1/2 granules" line in the trace):
```
[root@hadoop201 t_order_mt]# clickhouse-client --send_logs_level=trace <<< 'select * from t_order_mt2 where total_amount > toDecimal32(900., 2)';
[hadoop201] 2022.08.04 05:24:06.341226 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Debug> executeQuery: (from [::1]:51232, using production parser) select * from t_order_mt2 where total_amount > toDecimal32(900., 2)
[hadoop201] 2022.08.04 05:24:06.341773 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "total_amount > toDecimal32(900., 2)" moved to PREWHERE
[hadoop201] 2022.08.04 05:24:06.341978 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Trace> ContextAccess (default): Access granted: SELECT(id, sku_id, total_amount, create_time) ON default.t_order_mt2
[hadoop201] 2022.08.04 05:24:06.342048 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
[hadoop201] 2022.08.04 05:24:06.342260 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Debug> default.t_order_mt2 (7bf22023-4035-4f85-bbf2-202340355f85) (SelectExecutor): Key condition: unknown
[hadoop201] 2022.08.04 05:24:06.342440 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Debug> default.t_order_mt2 (7bf22023-4035-4f85-bbf2-202340355f85) (SelectExecutor): MinMax index condition: unknown
[hadoop201] 2022.08.04 05:24:06.342908 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Debug> default.t_order_mt2 (7bf22023-4035-4f85-bbf2-202340355f85) (SelectExecutor): Index `a` has dropped 1/2 granules.
[hadoop201] 2022.08.04 05:24:06.342931 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Debug> default.t_order_mt2 (7bf22023-4035-4f85-bbf2-202340355f85) (SelectExecutor): Selected 2/2 parts by partition key, 1 parts by primary key, 2/2 marks by primary key, 1 marks to read from 1 ranges
[hadoop201] 2022.08.04 05:24:06.342975 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Debug> MergeTreeSelectProcessor: Reading 1 ranges from part 20220818_1_1_0, approx. 5 rows starting from 0
[hadoop201] 2022.08.04 05:24:06.345015 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Information> executeQuery: Read 5 rows, 160.00 B in 0.003726225 sec., 1341 rows/sec., 41.93 KiB/sec.
[hadoop201] 2022.08.04 05:24:06.345058 [ 1486 ] {d930f810-9631-4433-9821-92c847b4133f} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
101  sku_001  1000.00   2022-08-18 12:00:00
102  sku_002  2000.00   2022-08-18 11:00:00
102  sku_002  2000.00   2022-08-18 13:00:00
102  sku_002  12000.00  2022-08-18 13:00:00
102  sku_004  2500.00   2022-08-18 12:00:00
```
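Instead of reading the trace log, newer servers can report index usage directly from the planner; a minimal sketch, assuming a release that supports EXPLAIN indexes = 1:

```sql
-- Show which indexes (partition key, primary key, skip index `a`)
-- prune granules for this query (assumes EXPLAIN indexes = 1 support)
EXPLAIN indexes = 1
SELECT * FROM t_order_mt2
WHERE total_amount > toDecimal32(900., 2);
```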
4.5 Data TTL
TTL (Time To Live): MergeTree can manage the lifecycle of whole rows or of individual columns.
Column-level TTL
Create the table: here lifecycle management is declared on the total_amount column. Ten seconds after create_time, once a merge occurs, the values in the total_amount column are automatically cleared (reset to the column default):
```sql
create table t_order_mt3(
    id UInt32,
    sku_id String,
    total_amount Decimal(16,2) TTL create_time + interval 10 SECOND,
    create_time Datetime
) engine = MergeTree
  partition by toYYYYMMDD(create_time)
  primary key (id)
  order by (id, sku_id);
```
Insert data:
```sql
insert into t_order_mt3 values
(106,'sku_001',1000.00,'2022-05-18 22:59:30'),
(107,'sku_002',2000.00,'2022-05-18 22:59:30'),
(110,'sku_003',600.00,'2022-05-19 12:00:00');
```
A few minutes later, the expired total_amount values are cleared:
```
SELECT * FROM t_order_mt3

┌──id─┬─sku_id──┬─total_amount─┬─────────create_time─┐
│ 110 │ sku_003 │       600.00 │ 2022-05-19 12:00:00 │
└─────┴─────────┴──────────────┴─────────────────────┘
┌──id─┬─sku_id──┬─total_amount─┬─────────create_time─┐
│ 106 │ sku_001 │         0.00 │ 2022-05-18 22:59:30 │
│ 107 │ sku_002 │         0.00 │ 2022-05-18 22:59:30 │
└─────┴─────────┴──────────────┴─────────────────────┘
```
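Expired values are physically removed only when a merge runs. If they are still visible, the merge can be forced by hand with the same OPTIMIZE statement used later in this article:

```sql
-- Force a merge so the TTL-expired total_amount values are cleared now
OPTIMIZE TABLE t_order_mt3 FINAL;
```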
Table-level TTL: the statement below makes whole rows expire 10 seconds after create_time:
```sql
alter table t_order_mt3 MODIFY TTL create_time + INTERVAL 10 SECOND;
```
The column referenced in the TTL expression must be of type Date or Datetime; the date column used for partitioning is recommended. A CREATE-time sketch follows the interval list below.
Available time intervals:
- SECOND
- MINUTE
- HOUR
- DAY
- WEEK
- MONTH
- QUARTER
- YEAR
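A table-level TTL can also be declared directly in CREATE TABLE instead of being added later with ALTER; a minimal sketch (the table name t_order_mt4 is hypothetical):

```sql
-- Whole rows expire one day after create_time
-- (t_order_mt4 is a hypothetical name for illustration)
create table t_order_mt4(
    id UInt32,
    sku_id String,
    total_amount Decimal(16,2),
    create_time Datetime
) engine = MergeTree
  partition by toYYYYMMDD(create_time)
  primary key (id)
  order by (id, sku_id)
  TTL create_time + INTERVAL 1 DAY;
```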
5 ReplacingMergeTree
ReplacingMergeTree is a variant of MergeTree. It inherits MergeTree's storage characteristics in full and only adds deduplication on top. Although MergeTree lets you declare a primary key, that primary key does not actually enforce uniqueness. If you want duplicate rows removed, ReplacingMergeTree can help.
Deduplication timing: duplicates are removed only during merges. Merges run in the background at an unpredictable time, so you cannot plan around them, and some data may remain undeduplicated. Deduplication scope: if the table is partitioned, deduplication happens only within a partition and never across partitions. ReplacingMergeTree is therefore of limited power: it is suited to clearing duplicates in the background to save space, but it does not guarantee that no duplicates are ever visible.
Create the table:
```sql
create table t_order_rmt(
    id UInt32,
    sku_id String,
    total_amount Decimal(16,2),
    create_time Datetime
) engine = ReplacingMergeTree(create_time)
  partition by toYYYYMMDD(create_time)
  primary key (id)
  order by (id, sku_id);
```
The argument to ReplacingMergeTree() is the version column: among duplicates, the row with the largest version value is kept. If no version column is supplied, the last row in insertion order is kept by default. Insert data:
```sql
insert into t_order_rmt values
(101,'sku_001',1000.00,'2022-05-18 12:00:00'),
(102,'sku_002',2000.00,'2022-05-18 11:00:00'),
(102,'sku_004',2500.00,'2022-05-18 12:00:00'),
(102,'sku_002',2000.00,'2022-05-18 13:00:00'),
(102,'sku_002',12000.00,'2022-05-18 13:00:00'),
(102,'sku_002',600.00,'2022-05-19 12:00:00');
```
First query:
```
SELECT * FROM t_order_rmt

┌──id─┬─sku_id──┬─total_amount─┬─────────create_time─┐
│ 102 │ sku_002 │       600.00 │ 2022-05-19 12:00:00 │
└─────┴─────────┴──────────────┴─────────────────────┘
┌──id─┬─sku_id──┬─total_amount─┬─────────create_time─┐
│ 101 │ sku_001 │      1000.00 │ 2022-05-18 12:00:00 │
│ 102 │ sku_002 │      2000.00 │ 2022-05-18 11:00:00 │
│ 102 │ sku_002 │      2000.00 │ 2022-05-18 13:00:00 │
│ 102 │ sku_002 │     12000.00 │ 2022-05-18 13:00:00 │
│ 102 │ sku_004 │      2500.00 │ 2022-05-18 12:00:00 │
└─────┴─────────┴──────────────┴─────────────────────┘
```
Manually trigger a merge:
```sql
OPTIMIZE TABLE t_order_rmt FINAL;
```
Query again:
```
SELECT * FROM t_order_rmt

┌──id─┬─sku_id──┬─total_amount─┬─────────create_time─┐
│ 102 │ sku_002 │       600.00 │ 2022-05-19 12:00:00 │
└─────┴─────────┴──────────────┴─────────────────────┘
┌──id─┬─sku_id──┬─total_amount─┬─────────create_time─┐
│ 101 │ sku_001 │      1000.00 │ 2022-05-18 12:00:00 │
│ 102 │ sku_002 │     12000.00 │ 2022-05-18 13:00:00 │
│ 102 │ sku_004 │      2500.00 │ 2022-05-18 12:00:00 │
└─────┴─────────┴──────────────┴─────────────────────┘
```
Conclusions
- The ORDER BY columns are what actually act as the unique key.
- Deduplication cannot cross partitions.
- Deduplication happens only within the same insert batch (in newer versions) or when parts are merged.
- Among duplicates, the row with the largest version value is kept.
- If the version values are equal, the last row in insertion order is kept.
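Since deduplication is only guaranteed after a merge, a query that must see deduplicated results can apply the merge logic at read time with the FINAL modifier, at a read-performance cost:

```sql
-- Deduplicate at read time instead of waiting for a background merge
-- (FINAL is more expensive than a plain read)
SELECT * FROM t_order_rmt FINAL;
```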