I recently ran into an odd requirement: in Flink SQL, every hour, compute the last-1-minute TPS at every 10-second point within the past hour.
The requirement is a bit convoluted, so here is an example:
Say the job fires at 11:00; it computes:
10:01:10 window: 10:00:10 to 10:01:10 (end exclusive)
10:01:20 window: 10:00:20 to 10:01:20 (end exclusive)
10:01:30 window: 10:00:30 to 10:01:30 (end exclusive)
...
10:59:50 window: 10:58:50 to 10:59:50 (end exclusive)
11:00:00 window: 10:59:00 to 11:00:00 (end exclusive)
Clearly, a requirement this odd cannot be computed directly in Flink SQL. A slightly decomposed version, computing the last-1-minute TPS every 10 seconds, is much more tractable.
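To make the target concrete, here is a toy Python sketch (not Flink code) that enumerates the windows the requirement asks for, assuming 1-minute windows ending at each 10-second point, as in the 10:59:50 and 11:00:00 examples above:

```python
from datetime import datetime, timedelta

def minute_windows(hour_end: datetime, step_s: int = 10, size_s: int = 60):
    """Enumerate (window_start, window_end) for every step_s-second
    evaluation point in the hour ending at hour_end; each window covers
    the size_s seconds before the (exclusive) end point."""
    point = hour_end - timedelta(hours=1) + timedelta(seconds=step_s)
    while point <= hour_end:
        yield (point - timedelta(seconds=size_s), point)
        point += timedelta(seconds=step_s)

# The 11:00 run over the 10:00-11:00 hour: 360 evaluation points,
# the last one covering 10:59:00 to 11:00:00.
windows = list(minute_windows(datetime(2023, 2, 10, 11, 0, 0)))
```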
Computing the last-1-minute TPS every 10 seconds
Decompose the requirement slightly: compute the last-1-minute TPS once every 10 seconds. With a Flink SQL sliding (HOP) window this is straightforward to write.
SQL code:
-- kafka source
drop table if exists user_log;
CREATE TABLE user_log
(
`event_time` TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL, -- from Debezium format
`partition_id` BIGINT METADATA FROM 'partition' VIRTUAL, -- from Kafka connector
`offset` BIGINT METADATA VIRTUAL, -- from Kafka connector
user_id VARCHAR,
item_id VARCHAR,
category_id VARCHAR,
behavior VARCHAR,
ts TIMESTAMP(3),
WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka'
,'topic' = 'user_log'
,'properties.bootstrap.servers' = 'localhost:9092'
,'properties.group.id' = 'user_log'
,'scan.startup.mode' = 'earliest-offset'
,'format' = 'json'
);
-- kafka sink
drop table if exists user_log_sink;
CREATE TABLE user_log_sink
(
window_start STRING,
window_end STRING,
tps DECIMAL(26, 6)
) WITH (
'connector' = 'kafka'
,'topic' = 'user_log_tps'
,'properties.bootstrap.servers' = 'localhost:9092'
,'properties.group.id' = 'user_log'
,'scan.startup.mode' = 'latest-offset'
,'format' = 'json'
);
-- compute the last-1-minute TPS every 10 seconds
insert into user_log_sink
select date_format(window_start, 'yyyy-MM-dd HH:mm:ss'), date_format(window_end, 'yyyy-MM-dd HH:mm:ss'), count(user_id)/60.0 tps
FROM TABLE(
HOP(TABLE user_log, DESCRIPTOR(ts), INTERVAL '10' SECOND, INTERVAL '60' SECOND))
group by window_start, window_end
;
SQL output:
{"window_start":"2023-02-10 13:29:50","window_end":"2023-02-10 13:30:50","tps":500.316667}
{"window_start":"2023-02-10 13:30:00","window_end":"2023-02-10 13:31:00","tps":503.333333}
{"window_start":"2023-02-10 13:30:10","window_end":"2023-02-10 13:31:10","tps":503.333333}
{"window_start":"2023-02-10 13:30:20","window_end":"2023-02-10 13:31:20","tps":503.333333}
{"window_start":"2023-02-10 13:30:30","window_end":"2023-02-10 13:31:30","tps":503.333333}
{"window_start":"2023-02-10 13:30:40","window_end":"2023-02-10 13:31:40","tps":503.333333}
{"window_start":"2023-02-10 13:30:50","window_end":"2023-02-10 13:31:50","tps":503.016667}
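The HOP semantics behind these results can be modelled outside Flink. Below is a hypothetical Python sketch (a toy model, not Flink's implementation) of how an event timestamp maps to its six overlapping 60-second windows on a 10-second slide, and how each window's count becomes a TPS:

```python
from datetime import datetime

SLIDE_S, SIZE_S = 10, 60

def hop_windows(ts: datetime):
    """Return every (window_start, window_end) a timestamp falls into for a
    HOP(slide=10s, size=60s) window: size/slide = 6 windows per event."""
    epoch_s = int(ts.timestamp())
    # Earliest window containing ts starts at the latest slide boundary
    # at or before ts, minus (size - slide).
    first_start = epoch_s - epoch_s % SLIDE_S - (SIZE_S - SLIDE_S)
    return [
        (datetime.fromtimestamp(s), datetime.fromtimestamp(s + SIZE_S))
        for s in range(first_start, first_start + SIZE_S, SLIDE_S)
    ]

def tps(events, window):
    """TPS of one window: events in [start, end) divided by the 60s size."""
    start, end = window
    return sum(1 for e in events if start <= e < end) / 60.0
```

Each event contributes to six window counts, which is exactly why consecutive rows in the output above overlap by 50 seconds.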
Computing the last-1-minute TPS on the hour
Decompose the original requirement differently: compute the last-1-minute TPS once per hour. With a Flink SQL tumbling (TUMBLE) window this is also easy.
- Note: for easier testing, the example below computes once every 10 minutes instead.
drop table if exists user_log_sink_1;
CREATE TABLE user_log_sink_1
(
window_start STRING,
window_end STRING,
ts_min STRING,
ts_max STRING,
tps DECIMAL(26, 6)
) WITH (
'connector' = 'kafka'
,'topic' = 'user_log_tps'
,'properties.bootstrap.servers' = 'localhost:9092'
,'properties.group.id' = 'user_log'
,'scan.startup.mode' = 'latest-offset'
,'format' = 'json'
);
-- Flink SQL surprisingly has no date_add function (the built-in date functions are underwhelming); work around it with TIMESTAMPDIFF: keep only rows whose ts is within one minute of the window end, i.e. the last 1 minute of data
-- for easier testing, this computes once every 10 minutes
insert into user_log_sink_1
select window_start, window_end, ts_min, ts_max, tps
from(
select date_format(window_start, 'yyyy-MM-dd HH:mm:ss') window_start
, date_format(window_end, 'yyyy-MM-dd HH:mm:ss') window_end
, count(user_id)/60.0 tps
, date_format(min(ts), 'yyyy-MM-dd HH:mm:ss') ts_min
, date_format(max(ts), 'yyyy-MM-dd HH:mm:ss') ts_max
FROM TABLE(
TUMBLE(TABLE user_log, DESCRIPTOR(ts), INTERVAL '10' MINUTE))
where TIMESTAMPDIFF( MINUTE, ts , window_end) <= 0
group by window_start, window_end
)
;
SQL output:
{"window_start":"2023-02-10 15:50:00","window_end":"2023-02-10 16:00:00","ts_min":"2023-02-10 15:59:00","ts_max":"2023-02-10 15:59:59","tps":1E+1}
{"window_start":"2023-02-10 16:00:00","window_end":"2023-02-10 16:10:00","ts_min":"2023-02-10 16:09:00","ts_max":"2023-02-10 16:09:59","tps":1E+1}
{"window_start":"2023-02-10 16:10:00","window_end":"2023-02-10 16:20:00","ts_min":"2023-02-10 16:19:00","ts_max":"2023-02-10 16:19:59","tps":10.816667}
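The TIMESTAMPDIFF trick deserves a closer look. A Python sketch of the same logic (a toy model of TIMESTAMPDIFF's whole-minute truncation, not Flink's implementation): inside a window ts is always before window_end, so the truncated minute difference is non-negative, and the `<= 0` filter really means `== 0`, i.e. ts lies within 60 seconds of the window end:

```python
from datetime import datetime

def minute_diff(t1: datetime, t2: datetime) -> int:
    """Whole-minute difference t2 - t1, truncated, like
    TIMESTAMPDIFF(MINUTE, t1, t2) for the non-negative case used here."""
    return int((t2 - t1).total_seconds()) // 60

window_end = datetime(2023, 2, 10, 16, 0, 0)

def in_last_minute(ts: datetime) -> bool:
    # Within the window ts < window_end, so the diff is >= 0 and
    # "<= 0" selects exactly the rows within 60s of the window end.
    return minute_diff(ts, window_end) <= 0
```

Note that ts_min shows as 15:59:00 in the output above only because date_format drops the milliseconds; a ts of exactly 15:59:00.000 would be one whole minute from the window end and filtered out.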
[Failed] Computing, on the hour, the last-1-minute TPS at every 10-second point in the past hour
Without decomposition the requirement gets hard. I sketched it in Flink SQL but could not make it work, mainly because Flink SQL does not support non-equi joins here; otherwise a self-join would have expressed it.
The join below is recognized as a non-equi join and fails (even with the conditions and TIMESTAMPDIFF( second, t1.window_time , t2.ts) <= 0 and TIMESTAMPDIFF( second, t1.window_time , t2.ts) >= -60 removed, the same error occurs; otherwise a Cartesian product should have worked):
insert into user_log_sink_2
select t1.window_start_str
,t1.window_end_str
, t1.window_time
, sum(t2.coun)/60.0 tps
from (
select date_format(window_start, 'yyyy-MM-dd HH:mm:ss') window_start_str
, date_format(window_end, 'yyyy-MM-dd HH:mm:ss') window_end_str
, window_end
, TO_TIMESTAMP(concat(substring(date_format(ts, 'yyyy-MM-dd HH:mm:ss'), 0, 18), '0')) window_time
FROM TABLE(
HOP(TABLE user_log, DESCRIPTOR(ts), INTERVAL '10' MINUTE, INTERVAL '10' MINUTE))
group by window_start, window_end, substring(date_format(ts, 'yyyy-MM-dd HH:mm:ss'), 0, 18)
) t1
left join
(select window_start, window_end, ts, count(1) coun
FROM TABLE(
HOP(TABLE user_log, DESCRIPTOR(ts), INTERVAL '10' MINUTE, INTERVAL '10' MINUTE))
group by window_start, window_end, ts
)t2 on t1.window_end = t2.window_end
and TIMESTAMPDIFF( second, t1.window_time , t2.ts) <= 0
and TIMESTAMPDIFF( second, t1.window_time , t2.ts) >= -60
group by t1.window_start_str
,t1.window_end_str
, t1.window_time
;
Error:
org.apache.flink.table.api.SqlParserException: SQL parse failed. Non-query expression encountered in illegal context
at org.apache.flink.table.planner.parse.CalciteParser.parseSqlList(CalciteParser.java:82)
at org.apache.flink.table.planner.delegation.ParserImpl.parse(ParserImpl.java:102)
at org.apache.flink.table.api.internal.StatementSetImpl.addInsertSql(StatementSetImpl.java:62)
at org.apache.flink.table.api.bridge.scala.internal.StreamStatementSetImpl.addInsertSql(StreamStatementSetImpl.scala:36)
at com.rookie.submit.main.SqlSubmit$.$anonfun$main$1(SqlSubmit.scala:96)
at com.rookie.submit.main.SqlSubmit$.$anonfun$main$1$adapted(SqlSubmit.scala:82)
at scala.collection.Iterator.foreach(Iterator.scala:937)
at scala.collection.Iterator.foreach$(Iterator.scala:937)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
at scala.collection.IterableLike.foreach(IterableLike.scala:70)
at scala.collection.IterableLike.foreach$(IterableLike.scala:69)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at com.rookie.submit.main.SqlSubmit$.main(SqlSubmit.scala:82)
at com.rookie.submit.main.SqlSubmit.main(SqlSubmit.scala)
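Since SQL falls short, one possible shape of a code-based computation can be sketched in-memory (a hypothetical toy model, not a DataStream implementation): bucket events by second, then at the hour boundary evaluate every 10-second point against its trailing 60 seconds:

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_tps(per_second: Counter, hour_end: datetime) -> dict:
    """Given per-second event counts keyed by datetime (second resolution),
    return {evaluation_point: last-1-minute TPS} for every 10-second
    point in the hour ending at hour_end."""
    out = {}
    point = hour_end - timedelta(minutes=59, seconds=50)
    while point <= hour_end:
        # The trailing minute: the 60 seconds [point - 60s, point).
        window = [point - timedelta(seconds=s) for s in range(1, 61)]
        out[point] = sum(per_second.get(sec, 0) for sec in window) / 60.0
        point += timedelta(seconds=10)
    return out
```

The real job would also need state eviction and watermark handling, which is where it stops being a few lines.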
Coming up next
Implementing the requirement in code: every hour, compute the last-1-minute TPS at every 10-second point within the past hour.
Welcome to follow the Flink菜鸟 WeChat official account, which publishes occasional posts on Flink development.