FlinkSQL
In a traditional database / batch setting the dataset is bounded, which naturally avoids the two big problems of stream computing:
1. The two input streams of a join may arrive with a time skew; how do we keep the join correct when the two sides are not aligned in time?
2. Streaming data is unbounded; history cannot be kept forever without enormous memory and disk pressure, so a trade-off has to be made.
For the first problem, Flink uses the retract mechanism: a row that has not yet found a match is first emitted with NULLs for the other side, and once the match arrives the earlier result is retracted and the correct result is emitted. This inflates the output and makes downstream state momentarily inconsistent, but as long as neither stream is backlogged the result is correct at every instant.
For the second problem, the trade-off has to follow the business logic: set a state expiration time via configuration, or use a Window Join instead.
set state.retention.time.min=1d;
set state.retention.time.max=2d;
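For reference, in open-source Flink SQL the equivalent knob is the idle-state TTL. A minimal sketch, assuming the SQL client; the key name is the standard table.exec.state.ttl option and the one-day value is only an illustrative choice:

-- expire per-key state that has not been accessed for one day
SET 'table.exec.state.ttl' = '1d';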
Stream-to-stream joins and stream-to-dimension-table joins
For stream-to-stream joins, Flink supports two join variants
① Regular Join
(INNER/LEFT/RIGHT/FULL Join)
append + retract
When data arrives from either side (L-Event, R-Event), the join operator first updates L-State and R-State respectively; L-Event is then joined against R-State and the result is sent downstream;
likewise, R-Event is joined against L-State and its results are sent downstream.
(SEMI Join AND ANTI Join)
Returns only columns from the left table; the right table is never output, it is used purely to filter the left table.
-- SEMI Join:
SELECT * FROM Employee WHERE DeptName IN (SELECT DeptName from Dept);
-- ANTI Join:
SELECT * FROM Employee WHERE DeptName NOT IN (SELECT DeptName from Dept);
A Regular Join meets the same requirements as batch mode, but its fatal drawback is that both tables must be cached in state, so with unbounded input the resources it occupies grow without limit.
② Interval Join
To solve the unbounded state growth of Regular Join, Interval Join introduces a time window: data outside the window is cleaned up by Flink, which greatly reduces resource usage. The time semantics of an Interval Join can be either Event Time or Processing Time;
Flink uses watermarks to clean up expired data and keep the state within a reasonable bound.
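A hedged sketch of an interval join; the Orders/Payments tables, their columns and the 10-minute bound are made-up placeholders, and both time columns are assumed to be declared event-time attributes with watermarks:

SELECT o.order_id, p.pay_id
FROM Orders o
JOIN Payments p ON o.order_id = p.order_id
    AND p.pay_time BETWEEN o.order_time AND o.order_time + INTERVAL '10' MINUTE;

State for rows whose time falls outside the interval can be dropped once the watermark passes, which is exactly the cleanup described above.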
Lookup Join, i.e. the dimension-table join
Lookup Join has two implementations, Sync and Async: hive/jdbc/hbase/kudu/redis all support Sync lookups, while only HBase supports Async lookups.
Sync mode is the default and fits Flink's single-threaded processing model: each time a new fact-table row arrives, the join key from the SQL is used to query the dimension table, and the combined result is sent downstream;
because everything runs serially on one thread, any slow step lowers operator throughput, and the slow step is usually the I/O against the external database. To reduce the cost of that frequent I/O, a cache is normally kept in memory to speed up lookups.
Hive: fastest, but unfriendly to large tables; all data is loaded into heap memory before the join, which easily causes frequent Full GC or even OOM.
Jdbc: slightly slower, but handles large tables. The implementation keeps a Cache in memory for previously fetched rows; if the hit rate is low, every join becomes a point query against the database and is slow (a cache-option sketch follows this list).
hbase: to lower the barrier of writing connectors and simplify the threading model, Flink provides a unified Async I/O facility to support the Async Lookup Join. Using a thread pool and CompletableFuture callbacks, Flink implements asynchronous, non-blocking logic that decouples I/O access from result emission and notifies via callbacks,
avoiding needless thread spinning.
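A hedged sketch of enabling that cache on a JDBC dimension table; the option keys are the open-source JDBC connector's lookup-cache options, while the table name, URL and sizes are illustrative:

CREATE TABLE dim_customers (
    id INT,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://hadoop102:3306/test',
    'table-name' = 'customers',
    'lookup.cache.max-rows' = '10000',  -- cap the in-memory cache size
    'lookup.cache.ttl' = '10min'        -- expire cached rows so stale data is bounded
);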
A Lookup Join satisfies the functional need, but its results are not exact: if the fact-table side is delayed while the dimension table is not, the computed result can be far off. For such scenarios FlinkSQL offers the
Temporal Join, which comes in two modes: Event-Time and Processing-Time.
Processing-time
With this semantic the fact table always joins against the latest version of the dimension table; it behaves like a Lookup Join, and is currently written the same way and works the same way.
Event-time
With this semantic each fact row joins against the dimension table as of that row's time. For the two tables joined in this mode,
the left table must be an append-only fact table;
the dimension table is a versioned table built from a changelog stream; the data can come from a mysql-binlog or from a changelog-format talos topic.
Note that both tables must declare a time column and generate watermarks, and the dimension table also needs a primary key to tell the versions apart.
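A hedged sketch of an event-time temporal join, following the common currency-rate example from the Flink docs; the table names, topic and format are placeholders, and the orders table is assumed to be append-only with a watermark on order_time:

-- versioned table built from a changelog source; the primary key distinguishes versions
CREATE TABLE currency_rates (
    currency STRING,
    rate DECIMAL(10, 4),
    update_time TIMESTAMP(3),
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND,
    PRIMARY KEY (currency) NOT ENFORCED
) WITH (
    'connector' = 'kafka',
    'topic' = 'rates',
    'properties.bootstrap.servers' = 'hadoop103:9092',
    'format' = 'debezium-json'   -- any changelog format would do
);

-- each order joins the rate version that was valid at the order's event time
SELECT o.order_id, o.price * r.rate AS converted_price
FROM orders AS o
LEFT JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
    ON o.currency = r.currency;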
There are two dimension-table join modes: Temporal Table Join and Temporal Table Function Join.
The use case is to enrich the fact table (probe table) with extra fields; the fields in the dimension table (build table) rarely change, and the dimension table is usually small.
Dimension-table joins are normally implemented as hash joins, and Flink's implementation is no exception.
A hash join has two steps:
first, hash the dimension table by key and build a hash table;
second, probe the hash table with the key of each fact-table row and emit the result. That is why the fact table is called the probe table and the dimension table the build table.
Temporal Table Join
Currently Temporal Table Join supports only INNER JOIN and LEFT JOIN;
its drawback is that Event time cannot be used as the time semantic, only Processing time. In other words, whatever time range the probe table is in, it always joins against the latest data in the build table.
The implementations fall into two classes.
The first is full loading, e.g. HDFS: Flink loads the whole HDFS file into memory and matches against this in-memory cache at join time;
the second is partial loading, e.g. JDBC, HBase, Redis: the key of the current probe-table row is used to query the database, backed by a cache.
To use this mode, a Processing time column must first be declared on the probe table;
for JDBC the data source is mainly MySQL: a MySQL table serves as the Temporal table. At join time, rows are loaded from MySQL by join key rather than in full, only the rows for that key, and they are kept in memory.
If the cache is hit, the cached data is used directly; otherwise the database is queried and the cache is updated. The JDBCLookupFunction class uses a Guava cache of type Cache<Row, List<Row>> to hold the dimension data.
To avoid memory overflow, cached entries get an expiration time and the cache size is bounded; when the cache fills up, entries are evicted with an LRU policy, and the eviction itself is left to Guava.
HBase loading is similar to JDBC, also non-full loading. The difference is that in the JDBC case Flink maintains its own cache of dimension data, while in the HBase case Flink does not manage the cached dimension data itself,
because the HBase client already caches RegionLocations in the MetaCache of its ConnectionManager during RPC calls; Flink adds no caching or optimization of its own and simply relies on the HBase client.
HDFS differs from both. When an HDFS source is used as the Temporal table, the data is loaded in full; the HDFS implementation defines a cache of type Map<Row, List<RowData>> to hold the dimension data.
Unlike the first two, entries in this cache never expire on their own; only when a file change is detected does Flink clear the cache and load the new data.
A check interval for dimension-file changes is configured, which can also be viewed as the cache's expiration time.
The latest file timestamp is kept in memory; once the check interval has elapsed, Flink reads the modification times of all files in the directory through the FileSystem interface, picks the newest one, and compares timestamps;
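In open-source Flink the Hive connector exposes a similar reload interval as a table option; a hedged sketch using a SQL hint, where lookup.join.cache.ttl is the Hive connector's option and the tables and the one-hour interval are illustrative:

SELECT o.order_id, d.dim_name
FROM orders AS o
JOIN hive_dim_table /*+ OPTIONS('lookup.join.cache.ttl'='60 min') */
    FOR SYSTEM_TIME AS OF o.proc_time AS d
    ON o.dim_id = d.id;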
Temporal Table Function Join
The join between the probe stream and the Temporal table is implemented through a UDTF.
The left input (probe table) must be an append-only table; the right input (build table) needs a primary key and a field used for versioning (usually a time field);
Flink stores the left and right rows by key into leftState and rightState, both of type MapState<Long, BaseRow>.
The difference is that leftState is keyed by an increasing sequence number, while rightState is keyed by the time column.
In the join logic, the left state is traversed first; the time column is taken from each element and used to binary-search the sorted rightState for the first row with rightState.rowTime <= leftState.rowTime; a match is emitted downstream and the element is then removed from leftState.
From this matching logic it is clear that Temporal Table Function Join currently supports only INNER Join.
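A hedged sketch of the query side, assuming a temporal table function Rates has already been registered over the rates changelog (the registration itself is done through the Table API, not SQL):

SELECT o.order_id, o.amount * r.rate AS converted
FROM orders AS o,
    LATERAL TABLE (Rates(o.order_time)) AS r
WHERE o.currency = r.currency;   -- only the INNER form is supported, matching the logic above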
bin/sql-client.sh embedded -s yarn-session
bin/sql-client.sh embedded -s yarn-session -i conf/sql-client-init.sql
JobManager Web Interface: http://hadoop104:45530
2023-11-12 11:03:45,147 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - The Flink YARN session cluster has been started in detached mode. In order to stop Flink gracefully, use the following command:
$ echo "stop" | ./bin/yarn-session.sh -id application_1699752302611_0002
If this should not be possible, then you can also kill Flink via YARN's web interface or via:
$ yarn application -kill application_1699752302611_0002
The changelog result mode can display retractions.
Retract stream:
an insert shows as +I,
a retraction shows as a retract message (-U/-D).
A retract stream contains two types of messages: add messages and retract messages.
A dynamic table is converted into a retract stream by encoding an INSERT as an add message, a DELETE as a retract message,
and an UPDATE as a retract message for the previous row plus an add message for the new row.
Upsert stream:
+U (upsert message)
-D (delete message)
An upsert stream contains two types of messages: upsert messages and delete messages. A dynamic table converted to an upsert stream needs a (possibly composite) unique key.
A dynamic table with a unique key is converted to a stream by encoding INSERT and UPDATE as upsert messages
and DELETE as delete messages.
The consuming operator must know the unique key in order to apply the messages correctly. The main difference from a retract stream is that an UPDATE is encoded with a single message, which is more efficient.
Append stream:
+I
A dynamic table that is modified only by INSERT can be converted to a stream simply by emitting the inserted rows.
https://www.cnblogs.com/codelives/p/14502962.html
Cause: the official pre-bundled Hadoop 2.7.5 package was not placed in Flink's lib directory.
Apache Flink® 1.9.0 is our latest stable release.
If you plan to use Apache Flink together with Apache Hadoop (run Flink on YARN, connect to HDFS, connect to HBase, or use some Hadoop-based file system connector)
then select the download that bundles the matching Hadoop version, download the optional pre-bundled Hadoop that matches your version and place it in the lib folder of Flink, or export your HADOOP_CLASSPATH
Creating tables in Flink
CREATE TABLE test (
    id INT,
    ts BIGINT,
    vc INT
) WITH (
    'connector' = 'print'
);

Flink SQL> create table test2 as select id,ts from test;
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.ValidationException: Connector 'print' can only be used as a sink. It cannot be used as a source.

CREATE TABLE test1 (
    `value` STRING
) LIKE test;

CREATE TABLE source (
    id INT,
    ts BIGINT,
    vc INT
) WITH (
    'connector' = 'datagen',
    'rows-per-second'='1',
    'fields.id.kind'='sequence',
    'fields.id.start'='1',
    'fields.id.end'='100000',
    'fields.ts.kind'='sequence',
    'fields.ts.start'='1',
    'fields.ts.end'='1000000',
    'fields.vc.kind'='random',
    'fields.vc.min'='1',
    'fields.vc.max'='100'
);

CREATE TABLE sink (
    id INT,
    ts BIGINT,
    vc INT
) WITH (
    'connector' = 'print'
);

insert into sink select * from source;
WITH source_with_total AS (
    SELECT id, vc+10 AS total FROM source
)
SELECT id, SUM(total) FROM source_with_total GROUP BY id;

select id, price from (values(1,1.2), (2,3.4)) as t(id, price);
+----+-------------+--------+
| op |          id |  price |
+----+-------------+--------+
| +I |           1 |    1.2 |
| +I |           2 |    3.4 |
+----+-------------+--------+
Received a total of 2 rows
Be careful with deduplication and the state TTL: if the TTL is set to, say, 10s, the dedup state may already have been cleared when a duplicate arrives later, so duplicates can slip through.
SELECT DISTINCT vc FROM source; select count(*) from source; +----+----------------------+ | op | EXPR$0 | +----+----------------------+ | +I | 1 | | -U | 1 | | +U | 2 | | -U | 2 | | +U | 3 | | -U | 3 | | +U | 4 | -U +U, select vc,count(*) from source group by vc; +----+-------------+----------------------+ | op | vc | EXPR$1 | +----+-------------+----------------------+ | +I | 65 | 1 | | +I | 60 | 1 | | +I | 43 | 1 | | +I | 61 | 1 | | +I | 32 | 1 | | +I | 82 | 1 | | +I | 59 | 1 | | +I | 52 | 1 | | +I | 84 | 1 | | -U | 65 | 1 | | +U | 65 | 2 | | +I | 88 | 1 |
Per dimension, per minute: the PV count and its max/min price;
CREATE TABLE source1 ( dim STRING, user_id BIGINT, price BIGINT, row_time AS cast(CURRENT_TIMESTAMP as timestamp(3)), WATERMARK FOR row_time AS row_time - INTERVAL '5' SECOND ) WITH ( 'connector' = 'datagen', 'rows-per-second' = '10', 'fields.dim.length' = '1', 'fields.user_id.min' = '1', 'fields.user_id.max' = '100000', 'fields.price.min' = '1', 'fields.price.max' = '100000' ); CREATE TABLE sink1 ( dim STRING, pv BIGINT, sum_price BIGINT, max_price BIGINT, min_price BIGINT, uv BIGINT, window_start bigint ) WITH ( 'connector' = 'print' ); 每个维度,每分钟的PV数量,它的最大/小价格; select current_timestamp; //年月日时分秒 毫秒 +----+-------------------------+ | op | current_timestamp | +----+-------------------------+ | +I | 2023-11-11 17:22:02.924 | +----+-------------------------+ Received a total of 1 row select UNIX_TIMESTAMP(cast(row_time as String)) from source1; //UNIX_TIMESTAMP 得到的是秒的时间戳; +----+----------------------+ | op | EXPR$0 | +----+----------------------+ | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694783 | | +I | 1699694784 | | +I | 1699694784 | | +I | 1699694784 | -- insert into sink1 select dim, count(*) as pv, sum(price) as sum_price, max(price) as max_price, min(price) as min_price, -- 计算 uv 数 count(distinct user_id) as uv, cast((UNIX_TIMESTAMP(CAST(row_time AS STRING))) / 60 as bigint) as window_start from source1 group by dim, -- UNIX_TIMESTAMP得到秒的时间戳,将秒级别时间戳 / 60 转化为 1min, cast((UNIX_TIMESTAMP(CAST(row_time AS STRING))) / 60 as bigint) ; +----+----+----+-----------+-----------+-----------+-----+-------------+ | op |dim | pv | sum_price | max_price | min_price | uv |window_start | +----+----+----+-----------+-----------+-----------+-----+-------------+ | +I | b | 1 | 13012 | 13012 | 13012 | 1 | 28328249 | | +I | 3 | 1 | 72899 | 72899 | 72899 | 1 | 28328249 | | +I | 0 | 1 | 55460 | 55460 | 55460 | 1 | 28328249 | | +I | 5 | 1 | 24504 | 24504 | 24504 | 1 | 28328249 | | +I | e | 1 | 85386 | 85386 | 85386 | 1 | 28328249 | | +I | 9 | 1 | 86145 | 86145 | 86145 | 1 | 28328249 | | -U | 5 | 1 | 24504 | 24504 | 24504 | 1 | 28328249 | | +U | 5 | 2 | 67734 | 43230 | 24504 | 2 | 28328249 | | +I | 4 | 1 | 88061 | 88061 | 88061 | 1 | 28328249 | | +I | f | 1 | 95985 | 95985 | 95985 | 1 | 28328249 | | -U | f | 1 | 95985 | 95985 | 95985 | 1 | 28328249 | | +U | f | 2 | 112573 | 95985 | 16588 | 2 | 28328249 | | -U | 3 | 1 | 72899 | 72899 | 72899 | 1 | 28328249 | | +U | 3 | 2 | 89077 | 72899 | 16178 | 2 | 28328249 | | +I | c | 1 | 83697 | 83697 | 83697 | 1 | 28328249 | | +I | 8 | 1 | 90883 | 90883 | 90883 | 1 | 28328249 | | -U | 9 | 1 | 86145 | 86145 | 86145 | 1 | 28328249 | | +U | 9 | 2 | 120406 | 86145 | 34261 | 2 | 28328249 | | -U | 5 | 2 | 67734 | 43230 | 24504 | 2 | 28328249 | | +U | 5 | 3 | 131080 | 63346 | 24504 | 3 | 28328249 | | -U | e | 1 | 85386 | 85386 | 85386 | 1 | 28328249 | | +U | e | 2 | 165188 | 85386 | 79802 | 2 | 28328249 |
Multi-dimensional analysis
SELECT supplier_id, product_id, rating, COUNT(*) FROM ( VALUES ('supplier1', 'product1', 4), ('supplier1', 'product2', 3), ('supplier2', 'product3', 3), ('supplier2', 'product4', 4) ) --自定义source: 供应商id、 产品id、 评级 AS Products(supplier_id, product_id, rating) GROUP BY GROUPING SETS( (supplier_id, product_id, rating), (supplier_id, product_id), (supplier_id, rating), (supplier_id), (product_id, rating), (product_id), (rating), () ); +----+-------------+------------+-------------+--------+ | op | supplier_id | product_id | rating | EXPR$3 | +----+-------------+------------+-------------+--------+ | +I | supplier1 | product1 | 4 | 1 | | +I | supplier1 | product1 | <NULL> | 1 | | +I | supplier1 | <NULL> | 4 | 1 | | +I | supplier1 | <NULL> | <NULL> | 1 | | +I | <NULL> | product1 | 4 | 1 | | +I | <NULL> | product1 | <NULL> | 1 | | +I | <NULL> | <NULL> | 4 | 1 | | +I | <NULL> | <NULL> | <NULL> | 1 | | +I | supplier1 | product2 | 3 | 1 | | +I | supplier1 | product2 | <NULL> | 1 | | +I | supplier1 | <NULL> | 3 | 1 | | -U | supplier1 | <NULL> | <NULL> | 1 | | +U | supplier1 | <NULL> | <NULL> | 2 | | +I | <NULL> | product2 | 3 | 1 | | +I | <NULL> | product2 | <NULL> | 1 | | +I | <NULL> | <NULL> | 3 | 1 | | -U | <NULL> | <NULL> | <NULL> | 1 | | +U | <NULL> | <NULL> | <NULL> | 2 | | +I | supplier2 | product3 | 3 | 1 | | +I | supplier2 | product3 | <NULL> | 1 | | +I | supplier2 | <NULL> | 3 | 1 | | +I | supplier2 | <NULL> | <NULL> | 1 | | +I | <NULL> | product3 | 3 | 1 | | +I | <NULL> | product3 | <NULL> | 1 | | -U | <NULL> | <NULL> | 3 | 1 | | +U | <NULL> | <NULL> | 3 | 2 | | -U | <NULL> | <NULL> | <NULL> | 2 | | +U | <NULL> | <NULL> | <NULL> | 3 | | +I | supplier2 | product4 | 4 | 1 | | +I | supplier2 | product4 | <NULL> | 1 | | +I | supplier2 | <NULL> | 4 | 1 | | -U | supplier2 | <NULL> | <NULL> | 1 | | +U | supplier2 | <NULL> | <NULL> | 2 | | +I | <NULL> | product4 | 4 | 1 | | +I | <NULL> | product4 | <NULL> | 1 | | -U | <NULL> | <NULL> | 4 | 1 | | +U | <NULL> | <NULL> | 4 | 2 | | -U | <NULL> | <NULL> | <NULL> | 3 | | +U | <NULL> | <NULL> | <NULL> | 4 | +----+-------------+--------------------------------+--- Received a total of 39 rows
Window aggregation (tumbling | sliding | session)
1)准备数据 CREATE TABLE ws ( id INT, vc INT, pt AS PROCTIME(), --处理时间 et AS cast(CURRENT_TIMESTAMP as timestamp(3)), --事件时间 WATERMARK FOR et AS et - INTERVAL '5' SECOND --watermark ) WITH ( 'connector' = 'datagen', 'rows-per-second' = '10', 'fields.id.min' = '1', 'fields.id.max' = '3', 'fields.vc.min' = '1', 'fields.vc.max' = '100' ); select * from ws; +----+----+----+-------------------------+-------------------------+ | op | id | vc | pt | et | +----+----+----+-------------------------+-------------------------+ | +I | 2 | 62 | 2023-11-11 17:49:10.695 | 2023-11-11 17:49:10.690 | | +I | 2 | 43 | 2023-11-11 17:49:10.699 | 2023-11-11 17:49:10.699 | | +I | 1 | 15 | 2023-11-11 17:49:10.700 | 2023-11-11 17:49:10.700 | | +I | 1 | 76 | 2023-11-11 17:49:10.700 | 2023-11-11 17:49:10.700 | | +I | 2 | 29 | 2023-11-11 17:49:10.700 | 2023-11-11 17:49:10.700 | | +I | 1 | 85 | 2023-11-11 17:49:10.700 | 2023-11-11 17:49:10.700 | | +I | 3 | 23 | 2023-11-11 17:49:10.700 | 2023-11-11 17:49:10.700 | | +I | 1 | 42 | 2023-11-11 17:49:10.700 | 2023-11-11 17:49:10.700 | | +I | 1 | 34 | 2023-11-11 17:49:10.700 | 2023-11-11 17:49:10.700 | | +I | 3 | 2 | 2023-11-11 17:49:10.700 | 2023-11-11 17:49:10.700 | | +I | 2 | 87 | 2023-11-11 17:49:11.683 | 2023-11-11 17:49:11.683 | | +I | 3 | 86 | 2023-11-11 17:49:11.683 | 2023-11-11 17:49:11.683 | | +I | 3 | 2 | 2023-11-11 17:49:11.683 | 2023-11-11 17:49:11.683 | | +I | 3 | 47 | 2023-11-11 17:49:11.683 | 2023-11-11 17:49:11.683 | | +I | 1 | 89 | 2023-11-11 17:49:11.683 | 2023-11-11 17:49:11.683 | Flink SQL> desc ws; +------+-----------------------------+-------+-----+--------------------------------------------+----------------------------+ | name | type | null | key | extras | watermark | +------+-----------------------------+-------+-----+--------------------------------------------+----------------------------+ | id | INT | TRUE | | | | | vc | INT | TRUE | | | | | pt | TIMESTAMP_LTZ(3) *PROCTIME* | FALSE | | AS PROCTIME() | | | et | TIMESTAMP(3) *ROWTIME* | FALSE | | AS CAST(CURRENT_TIMESTAMP AS TIMESTAMP(3)) | `et` - INTERVAL '5' SECOND | +------+-----------------------------+-------+-----+--------------------------------------------+----------------------------+ 2)滚动窗口示例(时间属性字段,窗口长度) 事件时间 event_time select id, TUMBLE_START(et, INTERVAL '5' SECOND) wstart, TUMBLE_END(et, INTERVAL '5' SECOND) wend, sum(vc) sumVc from ws group by id, TUMBLE(et, INTERVAL '5' SECOND); +----+-------------+-------------------------+-------------------------+-------------+ | op | id | wstart | wend | sumVc | +----+-------------+-------------------------+-------------------------+-------------+ | +I | 2 | 2023-11-11 17:51:50.000 | 2023-11-11 17:51:55.000 | 881 | | +I | 1 | 2023-11-11 17:51:50.000 | 2023-11-11 17:51:55.000 | 660 | | +I | 3 | 2023-11-11 17:51:50.000 | 2023-11-11 17:51:55.000 | 645 | | +I | 2 | 2023-11-11 17:51:55.000 | 2023-11-11 17:52:00.000 | 1082 | | +I | 3 | 2023-11-11 17:51:55.000 | 2023-11-11 17:52:00.000 | 851 | | +I | 1 | 2023-11-11 17:51:55.000 | 2023-11-11 17:52:00.000 | 871 | | +I | 2 | 2023-11-11 17:52:00.000 | 2023-11-11 17:52:05.000 | 886 | | +I | 3 | 2023-11-11 17:52:00.000 | 2023-11-11 17:52:05.000 | 888 | | +I | 1 | 2023-11-11 17:52:00.000 | 2023-11-11 17:52:05.000 | 562 | | +I | 3 | 2023-11-11 17:52:05.000 | 2023-11-11 17:52:10.000 | 901 | | +I | 2 | 2023-11-11 17:52:05.000 | 2023-11-11 17:52:10.000 | 1264 | | +I | 1 | 2023-11-11 17:52:05.000 | 2023-11-11 17:52:10.000 | 652 | | +I | 1 | 2023-11-11 17:52:10.000 | 2023-11-11 17:52:15.000 | 1004 | | +I | 2 | 2023-11-11 
17:52:10.000 | 2023-11-11 17:52:15.000 | 488 | | +I | 3 | 2023-11-11 17:52:10.000 | 2023-11-11 17:52:15.000 | 981 | | +I | 2 | 2023-11-11 17:52:15.000 | 2023-11-11 17:52:20.000 | 490 | proc_time select id, TUMBLE_START(pt, INTERVAL '5' SECOND) wstart, TUMBLE_END(pt, INTERVAL '5' SECOND) wend, sum(vc) sumVc from ws group by id, TUMBLE(pt, INTERVAL '5' SECOND); +----+-------------+-------------------------+-------------------------+-------------+ | op | id | wstart | wend | sumVc | +----+-------------+-------------------------+-------------------------+-------------+ | +I | 3 | 2023-11-11 17:54:50.000 | 2023-11-11 17:54:55.000 | 712 | | +I | 2 | 2023-11-11 17:54:50.000 | 2023-11-11 17:54:55.000 | 165 | | +I | 1 | 2023-11-11 17:54:50.000 | 2023-11-11 17:54:55.000 | 616 | | +I | 3 | 2023-11-11 17:54:55.000 | 2023-11-11 17:55:00.000 | 809 | | +I | 2 | 2023-11-11 17:54:55.000 | 2023-11-11 17:55:00.000 | 873 | | +I | 1 | 2023-11-11 17:54:55.000 | 2023-11-11 17:55:00.000 | 551 | | +I | 3 | 2023-11-11 17:55:00.000 | 2023-11-11 17:55:05.000 | 933 | | +I | 1 | 2023-11-11 17:55:00.000 | 2023-11-11 17:55:05.000 | 788 | | +I | 2 | 2023-11-11 17:55:00.000 | 2023-11-11 17:55:05.000 | 888 | | +I | 3 | 2023-11-11 17:55:05.000 | 2023-11-11 17:55:10.000 | 619 | | +I | 1 | 2023-11-11 17:55:05.000 | 2023-11-11 17:55:10.000 | 1500 | | +I | 2 | 2023-11-11 17:55:05.000 | 2023-11-11 17:55:10.000 | 420 | | +I | 3 | 2023-11-11 17:55:10.000 | 2023-11-11 17:55:15.000 | 1110 | | +I | 1 | 2023-11-11 17:55:10.000 | 2023-11-11 17:55:15.000 | 704 | | +I | 2 | 2023-11-11 17:55:10.000 | 2023-11-11 17:55:15.000 | 631 | 3)滑动窗口(时间属性字段,滑动步长,窗口长度) select id, HOP_START(et, INTERVAL '2' SECOND,INTERVAL '5' SECOND) wstart, HOP_END(et, INTERVAL '2' SECOND,INTERVAL '5' SECOND) wend, sum(vc) sumVc from ws group by id, HOP(et, INTERVAL '2' SECOND,INTERVAL '5' SECOND); +----+-------------+-------------------------+-------------------------+-------------+ | op | id | wstart | wend | sumVc | +----+-------------+-------------------------+-------------------------+-------------+ | +I | 1 | 2023-11-11 18:03:26.000 | 2023-11-11 18:03:31.000 | 289 | | +I | 3 | 2023-11-11 18:03:26.000 | 2023-11-11 18:03:31.000 | 192 | | +I | 2 | 2023-11-11 18:03:26.000 | 2023-11-11 18:03:31.000 | 464 | | +I | 1 | 2023-11-11 18:03:28.000 | 2023-11-11 18:03:33.000 | 717 | | +I | 2 | 2023-11-11 18:03:28.000 | 2023-11-11 18:03:33.000 | 552 | | +I | 3 | 2023-11-11 18:03:28.000 | 2023-11-11 18:03:33.000 | 857 | | +I | 3 | 2023-11-11 18:03:30.000 | 2023-11-11 18:03:35.000 | 975 | | +I | 1 | 2023-11-11 18:03:30.000 | 2023-11-11 18:03:35.000 | 1121 | | +I | 2 | 2023-11-11 18:03:30.000 | 2023-11-11 18:03:35.000 | 528 | | +I | 1 | 2023-11-11 18:03:32.000 | 2023-11-11 18:03:37.000 | 891 | | +I | 3 | 2023-11-11 18:03:32.000 | 2023-11-11 18:03:37.000 | 1134 | | +I | 2 | 2023-11-11 18:03:32.000 | 2023-11-11 18:03:37.000 | 276 | | +I | 1 | 2023-11-11 18:03:34.000 | 2023-11-11 18:03:39.000 | 773 | | +I | 2 | 2023-11-11 18:03:34.000 | 2023-11-11 18:03:39.000 | 387 | | +I | 3 | 2023-11-11 18:03:34.000 | 2023-11-11 18:03:39.000 | 1271 | | +I | 3 | 2023-11-11 18:03:36.000 | 2023-11-11 18:03:41.000 | 1301 | | +I | 1 | 2023-11-11 18:03:36.000 | 2023-11-11 18:03:41.000 | 593 | | +I | 2 | 2023-11-11 18:03:36.000 | 2023-11-11 18:03:41.000 | 671 | | +I | 2 | 2023-11-11 18:03:38.000 | 2023-11-11 18:03:43.000 | 730 | | +I | 3 | 2023-11-11 18:03:38.000 | 2023-11-11 18:03:43.000 | 1108 | | +I | 1 | 2023-11-11 18:03:38.000 | 2023-11-11 18:03:43.000 | 696 | 4)会话窗口(时间属性字段,会话间隔) select id, 
SESSION_START(et, INTERVAL '5' SECOND) wstart, SESSION_END(et, INTERVAL '5' SECOND) wend, sum(vc) sumVc from ws group by id, SESSION(et, INTERVAL '5' SECOND); +----+-------------+-------------------------+-------------------------+-------------+ | op | id | wstart | wend | sumVc | +----+-------------+-------------------------+-------------------------+-------------+
Window table-valued function (TVF) aggregation
滚动窗口 SELECT id, window_start, window_end, SUM(vc) as sumVC FROM TABLE( TUMBLE(TABLE ws, DESCRIPTOR(et), INTERVAL '5' SECONDS)) GROUP BY window_start, window_end, id; +----+-------------+-------------------------+-------------------------+-------------+ | op | id | window_start | window_end | sumVC | +----+-------------+-------------------------+-------------------------+-------------+ | +I | 3 | 2023-11-11 18:08:50.000 | 2023-11-11 18:08:55.000 | 767 | | +I | 1 | 2023-11-11 18:08:50.000 | 2023-11-11 18:08:55.000 | 728 | | +I | 2 | 2023-11-11 18:08:50.000 | 2023-11-11 18:08:55.000 | 852 | | +I | 3 | 2023-11-11 18:08:55.000 | 2023-11-11 18:09:00.000 | 681 | | +I | 1 | 2023-11-11 18:08:55.000 | 2023-11-11 18:09:00.000 | 738 | | +I | 2 | 2023-11-11 18:08:55.000 | 2023-11-11 18:09:00.000 | 959 | | +I | 3 | 2023-11-11 18:09:00.000 | 2023-11-11 18:09:05.000 | 850 | | +I | 1 | 2023-11-11 18:09:00.000 | 2023-11-11 18:09:05.000 | 864 | | +I | 2 | 2023-11-11 18:09:00.000 | 2023-11-11 18:09:05.000 | 908 | 2) 滑动窗口 要求: 窗口长度=滑动步长的整数倍 (底层会优化成多个小滚动窗口) 滑动步长需要是滑动窗口大小的整数倍,因为它底层会对滑动窗口做一个优化, 比如一条数据在2个窗口里, 这2个窗口都有它的状态,就有2份状态了,第一个窗口输出的时候它会输出,第二个窗口输出时它也会输出; 举例一个极端例子: 假如步长为1ms, 窗口大小为1h, 一条数据最多1h处理1ms, 步长及其的小,会被n个窗口重叠; 成整数倍就可以解决这个问题; 如果步长2s, 窗口长度3s, 会按步长一个个切分为小滚动窗口, 这样子每个数据只会属于一个小的滚动窗口,只会被存一个 计算一次; 长度需要是步长的整数倍; SELECT id, window_start, window_end, SUM(vc) sumVC FROM TABLE( HOP(TABLE ws, DESCRIPTOR(et), INTERVAL '2' SECONDS , INTERVAL '6' SECONDS)) GROUP BY window_start, window_end, id; +----+-------------+-------------------------+-------------------------+-------------+ | op | id | window_start | window_end | sumVC | +----+-------------+-------------------------+-------------------------+-------------+ | +I | 3 | 2023-11-11 18:24:44.000 | 2023-11-11 18:24:50.000 | 50 | | +I | 2 | 2023-11-11 18:24:44.000 | 2023-11-11 18:24:50.000 | 291 | | +I | 1 | 2023-11-11 18:24:44.000 | 2023-11-11 18:24:50.000 | 168 | | +I | 2 | 2023-11-11 18:24:46.000 | 2023-11-11 18:24:52.000 | 531 | | +I | 1 | 2023-11-11 18:24:46.000 | 2023-11-11 18:24:52.000 | 630 | | +I | 3 | 2023-11-11 18:24:46.000 | 2023-11-11 18:24:52.000 | 503 | | +I | 2 | 2023-11-11 18:24:48.000 | 2023-11-11 18:24:54.000 | 823 | | +I | 3 | 2023-11-11 18:24:48.000 | 2023-11-11 18:24:54.000 | 1126 | | +I | 1 | 2023-11-11 18:24:48.000 | 2023-11-11 18:24:54.000 | 860 | | +I | 3 | 2023-11-11 18:24:50.000 | 2023-11-11 18:24:56.000 | 1259 | | +I | 1 | 2023-11-11 18:24:50.000 | 2023-11-11 18:24:56.000 | 1121 | | +I | 2 | 2023-11-11 18:24:50.000 | 2023-11-11 18:24:56.000 | 888 | | +I | 1 | 2023-11-11 18:24:52.000 | 2023-11-11 18:24:58.000 | 857 | | +I | 3 | 2023-11-11 18:24:52.000 | 2023-11-11 18:24:58.000 | 1267 | | +I | 2 | 2023-11-11 18:24:52.000 | 2023-11-11 18:24:58.000 | 854 | | +I | 1 | 2023-11-11 18:24:54.000 | 2023-11-11 18:25:00.000 | 754 | | +I | 3 | 2023-11-11 18:24:54.000 | 2023-11-11 18:25:00.000 | 942 | | +I | 2 | 2023-11-11 18:24:54.000 | 2023-11-11 18:25:00.000 | 1084 | | +I | 3 | 2023-11-11 18:24:56.000 | 2023-11-11 18:25:02.000 | 1073 | | +I | 2 | 2023-11-11 18:24:56.000 | 2023-11-11 18:25:02.000 | 1192 | | +I | 1 | 2023-11-11 18:24:56.000 | 2023-11-11 18:25:02.000 | 666 | 3)累积窗口 累积窗口会在一定的统计周期内进行累积计算。累积窗口中有两个核心的参数:最大窗口长度(max window size)和累积步长(step)。所谓的最大窗口长度其实就是我们所说的“统计周期”,最终目的就是统计这段时间内的数据。 其实就是固定窗口间隔内提前触发的的滚动窗口 ,其实就是 Tumble Window + early-fire 的一个事件时间的版本。例如,从每日零点到当前这一分钟绘制累积 UV,其中 10:00 时的 UV 表示从 00:00 到 10:00 的 UV 总数。 累积窗口可以认为是首先开一个最大窗口大小的滚动窗口,然后根据用户设置的触发的时间间隔将这个滚动窗口拆分为多个窗口,这些窗口具有相同的窗口起点和不同的窗口终点。 注意: 窗口最大长度 = 累积步长的整数倍 统计当天(0-24h)实时累计的PV浏览量,每分钟更新一次, 
窗口大小也必须是累计步长的整数倍; SELECT id, window_start, window_end, SUM(vc) sumVC FROM TABLE( CUMULATE(TABLE ws, DESCRIPTOR(et), INTERVAL '2' SECOND , INTERVAL '6' SECOND)) GROUP BY window_start, window_end, id; 起始时间都一样,都相差1个步长; 窗口越来越大 +----+-------------+-------------------------+-------------------------+-------------+ | op | id | window_start | window_end | sumVC | +----+-------------+-------------------------+-------------------------+-------------+ | +I | 2 | 2023-11-11 18:34:48.000 | 2023-11-11 18:34:52.000 | 55 | | +I | 1 | 2023-11-11 18:34:48.000 | 2023-11-11 18:34:52.000 | 237 | | +I | 3 | 2023-11-11 18:34:48.000 | 2023-11-11 18:34:52.000 | 322 | | +I | 1 | 2023-11-11 18:34:48.000 | 2023-11-11 18:34:54.000 | 573 | | +I | 2 | 2023-11-11 18:34:48.000 | 2023-11-11 18:34:54.000 | 329 | | +I | 3 | 2023-11-11 18:34:48.000 | 2023-11-11 18:34:54.000 | 666 | | +I | 1 | 2023-11-11 18:34:54.000 | 2023-11-11 18:34:56.000 | 385 | | +I | 2 | 2023-11-11 18:34:54.000 | 2023-11-11 18:34:56.000 | 345 | | +I | 3 | 2023-11-11 18:34:54.000 | 2023-11-11 18:34:56.000 | 272 | | +I | 3 | 2023-11-11 18:34:54.000 | 2023-11-11 18:34:58.000 | 679 | | +I | 1 | 2023-11-11 18:34:54.000 | 2023-11-11 18:34:58.000 | 689 | | +I | 2 | 2023-11-11 18:34:54.000 | 2023-11-11 18:34:58.000 | 541 | | +I | 2 | 2023-11-11 18:34:54.000 | 2023-11-11 18:35:00.000 | 989 | | +I | 3 | 2023-11-11 18:34:54.000 | 2023-11-11 18:35:00.000 | 882 | | +I | 1 | 2023-11-11 18:34:54.000 | 2023-11-11 18:35:00.000 | 1051 | | +I | 2 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:02.000 | 263 | | +I | 3 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:02.000 | 251 | | +I | 1 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:02.000 | 426 | | +I | 3 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:04.000 | 365 | | +I | 2 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:04.000 | 717 | | +I | 1 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:04.000 | 802 | | +I | 3 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:06.000 | 745 | | +I | 1 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:06.000 | 1365 | | +I | 2 | 2023-11-11 18:35:00.000 | 2023-11-11 18:35:06.000 | 955 | 4)grouping sets多维分析 SELECT id, window_start, window_end, SUM(vc) sumVC FROM TABLE( TUMBLE(TABLE ws, DESCRIPTOR(et), INTERVAL '5' SECONDS)) GROUP BY window_start, window_end, rollup( (id) ) -- cube( (id) ) -- grouping sets( (id),() ) ; +----+-------------+-------------------------+-------------------------+-------------+ | op | id | window_start | window_end | sumVC | +----+-------------+-------------------------+-------------------------+-------------+ | +I | 2 | 2023-11-11 18:39:15.000 | 2023-11-11 18:39:20.000 | 282 | | +I | <NULL> | 2023-11-11 18:39:15.000 | 2023-11-11 18:39:20.000 | 1224 | | +I | 1 | 2023-11-11 18:39:15.000 | 2023-11-11 18:39:20.000 | 442 | | +I | 3 | 2023-11-11 18:39:15.000 | 2023-11-11 18:39:20.000 | 500 | | +I | 1 | 2023-11-11 18:39:20.000 | 2023-11-11 18:39:25.000 | 912 | | +I | 2 | 2023-11-11 18:39:20.000 | 2023-11-11 18:39:25.000 | 886 | | +I | 3 | 2023-11-11 18:39:20.000 | 2023-11-11 18:39:25.000 | 835 | | +I | <NULL> | 2023-11-11 18:39:20.000 | 2023-11-11 18:39:25.000 | 2633 | | +I | 1 | 2023-11-11 18:39:25.000 | 2023-11-11 18:39:30.000 | 603 | | +I | <NULL> | 2023-11-11 18:39:25.000 | 2023-11-11 18:39:30.000 | 2546 | | +I | 2 | 2023-11-11 18:39:25.000 | 2023-11-11 18:39:30.000 | 1063 | | +I | 3 | 2023-11-11 18:39:25.000 | 2023-11-11 18:39:30.000 | 880 | | +I | 2 | 2023-11-11 18:39:30.000 | 2023-11-11 18:39:35.000 | 876 | | +I | 1 | 2023-11-11 18:39:30.000 | 2023-11-11 18:39:35.000 | 896 | | +I | 3 | 
2023-11-11 18:39:30.000 | 2023-11-11 18:39:35.000 | 807 | | +I | <NULL> | 2023-11-11 18:39:30.000 | 2023-11-11 18:39:35.000 | 2579 |
Over aggregation
1)按照时间区间聚合 统计每个传感器前10秒到现在收到的水位数据条数。 SELECT id, et, vc, count(vc) OVER (PARTITION BY id ORDER BY et RANGE BETWEEN INTERVAL '10' SECOND PRECEDING AND CURRENT ROW) AS cnt FROM ws +----+-------------+-------------------------+-------------+----------------------+ | op | id | et | vc | cnt | +----+-------------+-------------------------+-------------+----------------------+ | +I | 1 | 2023-11-11 18:51:23.783 | 13 | 1 | | +I | 1 | 2023-11-11 18:51:23.797 | 76 | 5 | | +I | 1 | 2023-11-11 18:51:23.797 | 44 | 5 | | +I | 1 | 2023-11-11 18:51:23.797 | 77 | 5 | | +I | 1 | 2023-11-11 18:51:23.797 | 54 | 5 | | +I | 3 | 2023-11-11 18:51:23.797 | 15 | 3 | | +I | 3 | 2023-11-11 18:51:23.797 | 17 | 3 | | +I | 3 | 2023-11-11 18:51:23.797 | 3 | 3 | | +I | 2 | 2023-11-11 18:51:23.797 | 66 | 2 | | +I | 2 | 2023-11-11 18:51:23.797 | 48 | 2 | | +I | 1 | 2023-11-11 18:51:24.777 | 31 | 9 | | +I | 1 | 2023-11-11 18:51:24.777 | 7 | 9 | | +I | 1 | 2023-11-11 18:51:24.777 | 56 | 9 | | +I | 1 | 2023-11-11 18:51:24.777 | 81 | 9 | | +I | 2 | 2023-11-11 18:51:24.777 | 63 | 5 | | +I | 2 | 2023-11-11 18:51:24.777 | 4 | 5 | | +I | 2 | 2023-11-11 18:51:24.777 | 19 | 5 | | +I | 3 | 2023-11-11 18:51:24.777 | 16 | 6 | | +I | 3 | 2023-11-11 18:51:24.777 | 41 | 6 | | +I | 3 | 2023-11-11 18:51:24.777 | 90 | 6 | | +I | 3 | 2023-11-11 18:51:25.776 | 36 | 7 | | +I | 1 | 2023-11-11 18:51:25.777 | 87 | 13 | | +I | 1 | 2023-11-11 18:51:25.777 | 81 | 13 | | +I | 1 | 2023-11-11 18:51:25.777 | 83 | 13 | | +I | 1 | 2023-11-11 18:51:25.777 | 40 | 13 | | +I | 2 | 2023-11-11 18:51:25.777 | 18 | 9 | | +I | 2 | 2023-11-11 18:51:25.777 | 34 | 9 | | +I | 2 | 2023-11-11 18:51:25.777 | 6 | 9 | 也可以用WINDOW子句来在SELECT外部单独定义一个OVER窗口,可以多次使用: SELECT id, et, vc, count(vc) OVER w AS cnt, sum(vc) OVER w AS sumVC FROM ws WINDOW w AS ( PARTITION BY id ORDER BY et RANGE BETWEEN INTERVAL '10' SECOND PRECEDING AND CURRENT ROW ) (2)按照行数聚合 统计每个传感器前5条到现在数据的平均水位 SELECT id, et, vc, avg(vc) OVER (PARTITION BY id ORDER BY et ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS avgVC FROM ws +----+-------------+-------------------------+-------------+-------------+ | op | id | et | vc | avgVC | +----+-------------+-------------------------+-------------+-------------+ | +I | 2 | 2023-11-11 18:54:19.346 | 14 | 14 | | +I | 3 | 2023-11-11 18:54:19.361 | 28 | 28 | | +I | 3 | 2023-11-11 18:54:19.361 | 13 | 20 | | +I | 3 | 2023-11-11 18:54:19.361 | 96 | 45 | | +I | 3 | 2023-11-11 18:54:19.361 | 22 | 39 | | +I | 3 | 2023-11-11 18:54:19.361 | 86 | 49 | | +I | 2 | 2023-11-11 18:54:19.361 | 83 | 48 | | +I | 1 | 2023-11-11 18:54:19.361 | 92 | 92 | | +I | 1 | 2023-11-11 18:54:19.361 | 99 | 95 | | +I | 1 | 2023-11-11 18:54:19.361 | 48 | 79 | | +I | 2 | 2023-11-11 18:54:20.339 | 51 | 49 | | +I | 2 | 2023-11-11 18:54:20.339 | 82 | 57 | | +I | 2 | 2023-11-11 18:54:20.339 | 64 | 58 | | +I | 2 | 2023-11-11 18:54:20.339 | 66 | 60 | | +I | 2 | 2023-11-11 18:54:20.339 | 96 | 73 | | +I | 2 | 2023-11-11 18:54:20.339 | 30 | 64 | | +I | 1 | 2023-11-11 18:54:20.339 | 38 | 69 | | +I | 1 | 2023-11-11 18:54:20.339 | 17 | 58 | | +I | 1 | 2023-11-11 18:54:20.339 | 24 | 53 | | +I | 3 | 2023-11-11 18:54:20.339 | 8 | 42 | | +I | 1 | 2023-11-11 18:54:21.338 | 1 | 37 | | +I | 1 | 2023-11-11 18:54:21.338 | 58 | 31 | | +I | 1 | 2023-11-11 18:54:21.338 | 20 | 26 | | +I | 1 | 2023-11-11 18:54:21.338 | 84 | 34 | | +I | 1 | 2023-11-11 18:54:21.338 | 42 | 38 | | +I | 2 | 2023-11-11 18:54:21.338 | 9 | 57 | | +I | 3 | 2023-11-11 18:54:21.338 | 98 | 53 | | +I | 3 | 2023-11-11 18:54:21.338 | 55 | 60 | 
也可以用WINDOW子句来在SELECT外部单独定义一个OVER窗口: SELECT id, et, vc, avg(vc) OVER w AS avgVC, count(vc) OVER w AS cnt FROM ws WINDOW w AS ( PARTITION BY id ORDER BY et ROWS BETWEEN 5 PRECEDING AND CURRENT ROW ) TOP-N 目前在Flink SQL中没有能够直接调用的TOP-N函数,而是提供了稍微复杂些的变通实现方法,是固定写法,特殊支持的over用法。 2)案例 取每个传感器最高的3个水位值 select id, et, vc, rownum from ( select id, et, vc, row_number() over(partition by id order by vc desc) as rownum from ws ) where rownum<=3; +----+-------------+-------------------------+-------------+----------------------+ | op | id | et | vc | rownum | +----+-------------+-------------------------+-------------+----------------------+ | +I | 2 | 2023-11-11 18:58:21.414 | 90 | 1 | | +I | 1 | 2023-11-11 18:58:21.429 | 62 | 1 | | +I | 3 | 2023-11-11 18:58:21.429 | 78 | 1 | | -U | 2 | 2023-11-11 18:58:21.414 | 90 | 1 | | +U | 2 | 2023-11-11 18:58:21.429 | 99 | 1 | | +I | 2 | 2023-11-11 18:58:21.414 | 90 | 2 | | +I | 2 | 2023-11-11 18:58:21.429 | 59 | 3 | | -U | 3 | 2023-11-11 18:58:21.429 | 78 | 1 | | +U | 3 | 2023-11-11 18:58:21.429 | 80 | 1 | | +I | 3 | 2023-11-11 18:58:21.429 | 78 | 2 | | +I | 1 | 2023-11-11 18:58:21.429 | 42 | 2 | | +I | 3 | 2023-11-11 18:58:21.429 | 76 | 3 | | -U | 3 | 2023-11-11 18:58:21.429 | 80 | 1 | | +U | 3 | 2023-11-11 18:58:22.403 | 88 | 1 | | -U | 3 | 2023-11-11 18:58:21.429 | 78 | 2 | | +U | 3 | 2023-11-11 18:58:21.429 | 80 | 2 | | -U | 3 | 2023-11-11 18:58:21.429 | 76 | 3 | | +U | 3 | 2023-11-11 18:58:21.429 | 78 | 3 | | -U | 1 | 2023-11-11 18:58:21.429 | 42 | 2 | | +U | 1 | 2023-11-11 18:58:22.403 | 54 | 2 | | +I | 1 | 2023-11-11 18:58:21.429 | 42 | 3 | | -U | 1 | 2023-11-11 18:58:21.429 | 62 | 1 | | +U | 1 | 2023-11-11 18:58:22.403 | 100 | 1 | | -U | 1 | 2023-11-11 18:58:22.403 | 54 | 2 | | +U | 1 | 2023-11-11 18:58:21.429 | 62 | 2 | | -U | 1 | 2023-11-11 18:58:21.429 | 42 | 3 | | +U | 1 | 2023-11-11 18:58:22.403 | 54 | 3 | | -U | 2 | 2023-11-11 18:58:21.429 | 59 | 3 | | +U | 2 | 2023-11-11 18:58:22.403 | 61 | 3 | | -U | 1 | 2023-11-11 18:58:21.429 | 62 | 2 | | +U | 1 | 2023-11-11 18:58:22.404 | 97 | 2 | | -U | 1 | 2023-11-11 18:58:22.403 | 54 | 3 | | +U | 1 | 2023-11-11 18:58:21.429 | 62 | 3 | | -U | 3 | 2023-11-11 18:58:21.429 | 78 | 3 | | +U | 3 | 2023-11-11 18:58:23.404 | 80 | 3 | 去重: select id, et, vc, rownum from ( select id, et, vc, row_number() over(partition by id,vc order by et) as rownum from ws ) where rownum=1;
Join operations
WATERMARK FOR et AS et - INTERVAL '0.001' SECOND --watermark
Subtracting 0.001s allows events with identical timestamps: a watermark of t means all data with time <= t has arrived.
Without the subtraction, suppose t=5: the watermark declares that all data <= 5s has arrived, but several records can be generated within the same second, so another record with time 5 may still arrive,
and it would not even count as late. Subtracting 1 ms instead declares that only data <= 4999 ms is complete;
if the 1 ms were not subtracted, the data could never contain equal timestamps.
CREATE TABLE ws1 ( id INT, vc INT, pt AS PROCTIME(), --处理时间 et AS cast(CURRENT_TIMESTAMP as timestamp(3)), --事件时间 WATERMARK FOR et AS et - INTERVAL '0.001' SECOND --watermark ) WITH ( 'connector' = 'datagen', 'rows-per-second' = '1', 'fields.id.min' = '3', 'fields.id.max' = '5', 'fields.vc.min' = '1', 'fields.vc.max' = '100' ); select ws.id,ws1.id from ws inner join ws1 on ws.id = ws1.id; +----+-------------+-------------+ | op | id | id0 | +----+-------------+-------------+ | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | select ws.id,ws1.id from ws left join ws1 on ws.id = ws1.id; +----+-------------+-------------+ | op | id | id0 | +----+-------------+-------------+ | +I | 2 | <NULL> | | +I | 2 | <NULL> | ... | -D | 3 | <NULL> | | -D | 3 | <NULL> | | -D | 3 | <NULL> | | -D | 3 | <NULL> | | -D | 3 | <NULL> | | -D | 3 | <NULL> | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 3 | 3 | | +I | 1 | <NULL> | | +I | 1 | <NULL> | | +I | 3 | 3 | select ws.id, ws.vc, ws1.id, ws.vc from ws left join ws1 on ws.vc = ws1.vc; +----+-------------+-------------+-------------+-------------+ | op | id | vc | id0 | vc0 | +----+-------------+-------------+-------------+-------------+ | +I | 1 | 98 | <NULL> | 98 | | +I | 3 | 66 | <NULL> | 66 | ... | +I | 1 | 5 | <NULL> | 5 | | -D | 1 | 98 | <NULL> | 98 | | +I | 1 | 98 | 4 | 98 | | +I | 2 | 18 | <NULL> | 18 | | +I | 3 | 99 | <NULL> | 99 | | +I | 2 | 92 | <NULL> | 92 | | +I | 3 | 34 | <NULL> | 34 | | +I | 1 | 23 | <NULL> | 23 | | +I | 1 | 6 | <NULL> | 6 | | +I | 1 | 76 | <NULL> | 76 | | +I | 1 | 84 | <NULL> | 84 | | +I | 3 | 96 | 5 | 96 | | +I | 1 | 45 | <NULL> | 45 | | +I | 1 | 83 | <NULL> | 83 | | +I | 1 | 93 | <NULL> | 93 | select ws.id, ws.vc, ws1.id, ws.vc from ws right join ws1 on ws.vc = ws1.vc; +----+-------------+-------------+-------------+-------------+ | op | id | vc | id0 | vc0 | +----+-------------+-------------+-------------+-------------+ | +I | <NULL> | <NULL> | 3 | <NULL> | | -D | <NULL> | <NULL> | 3 | <NULL> | | +I | 2 | 94 | 3 | 94 | | +I | <NULL> | <NULL> | 4 | <NULL> | | +I | 1 | 94 | 3 | 94 | | +I | 3 | 48 | 5 | 48 | | +I | 3 | 94 | 3 | 94 | | +I | 3 | 48 | 5 | 48 | | +I | <NULL> | <NULL> | 5 | <NULL> | | +I | <NULL> | <NULL> | 5 | <NULL> | | +I | 1 | 76 | 3 | 76 | | -D | <NULL> | <NULL> | 4 | <NULL> | | +I | 1 | 51 | 4 | 51 | | -D | <NULL> | <NULL> | 5 | <NULL> | | +I | 3 | 21 | 5 | 21 | | +I | <NULL> | <NULL> | 5 | <NULL> | select ws.id, ws.vc, ws1.id, ws.vc from ws full join ws1 on ws.vc = ws1.vc; +----+-------------+-------------+-------------+-------------+ | op | id | vc | id0 | vc0 | +----+-------------+-------------+-------------+-------------+ | +I | <NULL> | <NULL> | 3 | <NULL> | | +I | 2 | 20 | <NULL> | 20 | | +I | 1 | 69 | <NULL> | 69 | | +I | 1 | 29 | <NULL> | 29 | | +I | 1 | 38 | <NULL> | 38 | | +I | <NULL> | <NULL> | 3 | <NULL> | | +I | 3 | 88 | <NULL> | 88 | | +I | 1 | 23 | <NULL> | 23 | | +I | 2 | 55 | <NULL> | 55 | | +I | 2 | 43 | <NULL> | 43 | | +I | <NULL> | <NULL> | 3 | <NULL> | | +I | 3 | 14 | <NULL> | 14 | | +I | 3 | 32 | <NULL> | 32 | | +I | 1 | 29 | <NULL> | 29 | | +I | 2 | 6 | <NULL> | 6 | | +I | 2 | 41 | <NULL> | 41 | | -D | <NULL> | <NULL> | 3 | <NULL> | | 
+I | 3 | 46 | 3 | 46 | | +I | 1 | 82 | <NULL> | 82 | | -D | <NULL> | <NULL> | 3 | <NULL> | | +I | 1 | 86 | 3 | 86 | | +I | 1 | 23 | <NULL> | 23 | | +I | 1 | 70 | <NULL> | 70 | | +I | <NULL> | <NULL> | 4 | <NULL> | | +I | 2 | 78 | <NULL> | 78 | | +I | 3 | 51 | <NULL> | 51 | 间隔联结查询: SELECT * FROM ws,ws1 WHERE ws.id = ws1. id AND ws.et BETWEEN ws1.et - INTERVAL '2' SECOND AND ws1.et + INTERVAL '2' SECOND ; +----+----+-----+-------------------------+-------------------------+-----+------+-------------------------+-------------------------+ | op | id | vc | pt | et | id0 | vc0 | pt0 | et0 | +----+----+-----+-------------------------+-------------------------+-----+------+-------------------------+-------------------------+ | +I | 3 | 57 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.163 | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.164 | | +I | 3 | 24 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.163 | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.164 | | +I | 3 | 93 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:42.164 | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.164 | | +I | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:42.164 | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.164 | | +I | 3 | 41 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:42.165 | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.164 | | +I | 3 | 82 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:43.163 | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.164 | | +I | 3 | 81 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:43.163 | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.164 | | +I | 3 | 36 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:43.163 | 3 | 37 | 2023-11-11 19:28:44.243 | 2023-11-11 19:28:44.164 | | +I | 3 | 9 | 2023-11-11 19:28:45.245 | 2023-11-11 19:28:45.164 | 3 | 37 | 2023-11-11 19:28:45.245 | 2023-11-11 19:28:44.164 | | +I | 3 | 7 | 2023-11-11 19:28:45.246 | 2023-11-11 19:28:45.164 | 3 | 37 | 2023-11-11 19:28:45.246 | 2023-11-11 19:28:44.164 | | +I | 3 | 65 | 2023-11-11 19:28:46.249 | 2023-11-11 19:28:46.164 | 3 | 37 | 2023-11-11 19:28:46.249 | 2023-11-11 19:28:44.164 | | +I | 3 | 49 | 2023-11-11 19:28:50.164 | 2023-11-11 19:28:48.163 | 3 | 26 | 2023-11-11 19:28:50.164 | 2023-11-11 19:28:50.163 | | +I | 3 | 61 | 2023-11-11 19:28:50.164 | 2023-11-11 19:28:49.163 | 3 | 26 | 2023-11-11 19:28:50.164 | 2023-11-11 19:28:50.163 | | +I | 3 | 49 | 2023-11-11 19:28:50.164 | 2023-11-11 19:28:49.163 | 3 | 26 | 2023-11-11 19:28:50.164 | 2023-11-11 19:28:50.163 | | +I | 3 | 28 | 2023-11-11 19:28:50.164 | 2023-11-11 19:28:49.164 | 3 | 26 | 2023-11-11 19:28:50.164 | 2023-11-11 19:28:50.163 | | +I | 3 | 89 | 2023-11-11 19:28:50.264 | 2023-11-11 19:28:50.163 | 3 | 26 | 2023-11-11 19:28:50.264 | 2023-11-11 19:28:50.163 | | +I | 3 | 2 | 2023-11-11 19:28:50.264 | 2023-11-11 19:28:50.164 | 3 | 26 | 2023-11-11 19:28:50.264 | 2023-11-11 19:28:50.163 | | +I | 3 | 58 | 2023-11-11 19:28:51.168 | 2023-11-11 19:28:51.163 | 3 | 26 | 2023-11-11 19:28:51.168 | 2023-11-11 19:28:50.163 | | +I | 3 | 64 | 2023-11-11 19:28:51.168 | 2023-11-11 19:28:51.163 | 3 | 26 | 2023-11-11 19:28:51.168 | 2023-11-11 19:28:50.163 | | +I | 3 | 87 | 2023-11-11 19:28:51.168 | 2023-11-11 19:28:51.164 | 3 | 26 | 2023-11-11 19:28:51.168 | 2023-11-11 19:28:50.163 | | +I | 3 | 92 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 | 3 | 26 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:50.163 | | +I | 3 | 44 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 | 3 | 26 | 2023-11-11 19:28:52.171 | 2023-11-11 
19:28:50.163 | | +I | 3 | 89 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:50.163 | 3 | 60 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 | | +I | 3 | 2 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:50.164 | 3 | 60 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 | | +I | 3 | 58 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:51.163 | 3 | 60 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 | | +I | 3 | 64 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:51.163 | 3 | 60 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 | | +I | 3 | 87 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:51.164 | 3 | 60 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 | | +I | 3 | 92 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 | 3 | 60 | 2023-11-11 19:28:52.171 | 2023-11-11 19:28:52.163 |
Dimension-table join queries
A Lookup Join is the dimension-table join: it joins against externally stored data in real time, and "lookup" simply means querying at that moment.
The joins above are all stream-to-stream joins, while a Lookup Join joins a stream with external storage such as Redis, MySQL or HBase. Only a processing-time attribute is supported.
Why only processing time? Because it fetches whatever the dimension row looks like at the moment of the query;
if the dimension table is updated afterwards, the name value fetched earlier does not change.
TableA JOIN dim_table FOR SYSTEM_TIME AS OF TableA.proc_time AS alias ON xx.field = alias.field

For example, with the dimension table in MySQL, the lookup join is written as:

CREATE TABLE Customers (
    id INT,
    name STRING,
    country STRING,
    zip STRING
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://hadoop102:3306/customerdb',
    'table-name' = 'customers'
);

-- for every row arriving in the Orders table, the customers table in MySQL is queried for the dimension data
SELECT o.order_id, o.total, c.country, c.zip
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.id;
Order by and limit
order by supports both Batch and Streaming, but is rarely used in real-time jobs.
In a streaming job the Order By clause must contain a time-attribute field, which must come first and be ascending.
SELECT * FROM ws ORDER BY et, id desc
with equal et, id is sorted in descending order;
2)limit
SELECT * FROM ws LIMIT 3
SQL Hints
When running a query, SQL Hints can be appended after a table name to temporarily override table properties; they apply only to the current job.
select * from ws1/*+ OPTIONS('rows-per-second'='10')*/;
+----+-------------+-------------+-------------------------+-------------------------+ | op | id | vc | pt | et | +----+-------------+-------------+-------------------------+-------------------------+ | +I | 3 | 57 | 2023-11-11 19:49:50.518 | 2023-11-11 19:49:50.511 | | +I | 5 | 26 | 2023-11-11 19:49:50.521 | 2023-11-11 19:49:50.521 | | +I | 3 | 76 | 2023-11-11 19:49:50.521 | 2023-11-11 19:49:50.521 | | +I | 4 | 61 | 2023-11-11 19:49:50.522 | 2023-11-11 19:49:50.522 | | +I | 5 | 74 | 2023-11-11 19:49:50.522 | 2023-11-11 19:49:50.522 | | +I | 3 | 40 | 2023-11-11 19:49:50.522 | 2023-11-11 19:49:50.522 | | +I | 3 | 81 | 2023-11-11 19:49:50.522 | 2023-11-11 19:49:50.522 | | +I | 4 | 74 | 2023-11-11 19:49:50.522 | 2023-11-11 19:49:50.522 | | +I | 3 | 13 | 2023-11-11 19:49:50.522 | 2023-11-11 19:49:50.522 | | +I | 3 | 93 | 2023-11-11 19:49:50.522 | 2023-11-11 19:49:50.522 | | +I | 3 | 70 | 2023-11-11 19:49:51.499 | 2023-11-11 19:49:51.499 | | +I | 5 | 20 | 2023-11-11 19:49:51.499 | 2023-11-11 19:49:51.499 |
ws has ids 1, 2, 3
ws1 has ids 3, 4, 5
Set operations
1)UNION 和 UNION ALL UNION:将集合合并并且去重 UNION ALL:将集合合并,不做去重。 (SELECT id FROM ws) UNION (SELECT id FROM ws1); +----+-------------+ | op | id | +----+-------------+ | +I | 3 | | +I | 2 | | +I | 1 | | +I | 4 | | +I | 5 | (SELECT id FROM ws) UNION ALL (SELECT id FROM ws1); +----+-------------+ | op | id | +----+-------------+ | +I | 1 | | +I | 1 | | +I | 2 | | +I | 1 | | +I | 2 | | +I | 2 | | +I | 2 | | +I | 1 | | +I | 2 | | +I | 1 | | +I | 5 | 2)Intersect 和 Intersect All Intersect:交集并且去重 Intersect ALL:交集不做去重 (SELECT id FROM ws) INTERSECT (SELECT id FROM ws1); +----+-------------+ | op | id | +----+-------------+ | +I | 3 | (SELECT id FROM ws) INTERSECT ALL (SELECT id FROM ws1); +----+-------------+ | op | id | +----+-------------+ | +U | 3 | | -U | 3 | | +U | 3 | | -U | 3 | | +U | 3 | | -U | 3 | | +U | 3 | | -U | 3 | | +U | 3 | | -U | 3 | | +U | 3 | | -U | 3 | | +U | 3 | | -U | 3 | | +U | 3 | | -U | 3 | 3)Except 和 Except All Except:差集并且去重 Except ALL:差集不做去重 (SELECT id FROM ws) EXCEPT (SELECT id FROM ws1); +----+-------------+ | op | id | +----+-------------+ | +I | 1 | | +I | 2 | | +I | 3 | | -D | 3 | (SELECT id FROM ws) EXCEPT ALL (SELECT id FROM ws1); +----+-------------+ | op | id | +----+-------------+ | +I | 1 | | +I | 3 | | -U | 3 | | +U | 3 | | +U | 3 | | +I | 2 | | -U | 2 | | +U | 2 | | +U | 2 | | -U | 1 | | +U | 1 | | +U | 1 | | -U | 1 | In 子查询的结果集只能有一列 SELECT id, vc FROM ws WHERE id IN (SELECT id FROM ws1) +----+-------------+-------------+ | op | id | vc | +----+-------------+-------------+ | +I | 3 | 55 | | +I | 3 | 80 | | +I | 3 | 68 | | +I | 3 | 41 | | +I | 3 | 97 | | +I | 3 | 97 | | +I | 3 | 58 |
Built-in functions
The *_LTZ variants solve time-zone issues;

Flink SQL> select current_timestamp;
+----+-------------------------+
| op |       current_timestamp |
+----+-------------------------+
| +I | 2023-11-11 20:02:33.583 |
+----+-------------------------+

select current_date;
+----+--------------+
| op | current_date |
+----+--------------+
| +I |   2023-11-11 |
+----+--------------+

Flink SQL> select localtimestamp;
+----+-------------------------+
| op |          localtimestamp |
+----+-------------------------+
| +I | 2023-11-11 20:04:00.695 |
+----+-------------------------+

Flink SQL> show functions;
Pluggable modules
Flink SQL> show modules;
+-------------+
| module name |
+-------------+
|        core |
+-------------+

bin/yarn-session.sh -d
bin/sql-client.sh embedded -s yarn-session -i conf/sql-client-init.sql

Flink SQL> show full modules;
+-------------+------+
| module name | used |
+-------------+------+
|        core | TRUE |
+-------------+------+
1 row in set

Flink SQL> select split('a,b', ',');
[ERROR] Could not execute SQL statement. Reason:
org.apache.calcite.sql.validate.SqlValidatorException: No match found for function signature split(<CHARACTER>, <CHARACTER>)

Flink SQL> load module hive with ('hive-version'='3.1.2');
[INFO] Execute statement succeed.
External connectors: Kafka
Create a Kafka mapping table
CREATE TABLE t1( id int, ts bigint, vc int, `event_time` TIMESTAMP(3) METADATA FROM 'timestamp', `partition` BIGINT METADATA VIRTUAL, --列名和元数据名一致可以省略 FROM 'xxxx', VIRTUAL 表示只读 `offset` BIGINT METADATA VIRTUAL ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'hadoop103:9092', 'properties.group.id' = 'atguigu', 'scan.startup.mode' = 'earliest-offset', -- 'earliest-offset', 'latest-offset', 'group-offsets', 'timestamp' and 'specific-offsets' 'sink.partitioner' = 'fixed', -- fixed 为flink实现的分区器,一个并行度只写往kafka一个分区 'topic' = 'ws1', 'format' = 'json' ) ; CREATE TABLE source ( id INT, ts BIGINT, vc INT ) WITH ( 'connector' = 'datagen', 'rows-per-second'='1', 'fields.id.kind'='random', 'fields.id.min'='1', 'fields.id.max'='10', 'fields.ts.kind'='sequence', 'fields.ts.start'='1', 'fields.ts.end'='1000000', 'fields.vc.kind'='random', 'fields.vc.min'='1', 'fields.vc.max'='100' ); insert into t1(id,ts,vc) select * from source; select * from t1; +----+----+-----+-----+-------------------------+-----------+--------+ | op | id | ts | vc | event_time | partition | offset | +----+----+-----+-----+-------------------------+-----------+--------+ | +I | 4 | 1 | 49 | 2023-11-12 10:06:14.785 | 0 | 0 | | +I | 7 | 2 | 83 | 2023-11-12 10:06:14.813 | 0 | 1 | | +I | 7 | 3 | 7 | 2023-11-12 10:06:15.818 | 0 | 2 | | +I | 5 | 4 | 12 | 2023-11-12 10:06:16.822 | 0 | 3 | | +I | 4 | 5 | 16 | 2023-11-12 10:06:17.828 | 0 | 4 | | +I | 3 | 6 | 19 | 2023-11-12 10:06:18.834 | 0 | 5 | | +I | 3 | 7 | 91 | 2023-11-12 10:06:19.839 | 0 | 6 | | +I | 5 | 8 | 19 | 2023-11-12 10:06:20.845 | 0 | 7 | | +I | 6 | 9 | 45 | 2023-11-12 10:06:21.850 | 0 | 8 | | +I | 5 | 10 | 32 | 2023-11-12 10:06:22.857 | 0 | 9 | | +I | 10 | 11 | 62 | 2023-11-12 10:06:23.863 | 0 | 10 | | +I | 8 | 12 | 75 | 2023-11-12 10:06:24.871 | 0 | 11 | | +I | 6 | 13 | 35 | 2023-11-12 10:06:25.880 | 0 | 12 | | +I | 5 | 14 | 76 | 2023-11-12 10:06:26.786 | 0 | 13 | | +I | 6 | 15 | 99 | 2023-11-12 10:06:27.793 | 0 | 14 | | +I | 2 | 16 | 93 | 2023-11-12 10:06:28.800 | 0 | 15 | | +I | 2 | 17 | 70 | 2023-11-12 10:06:29.807 | 0 | 16 | | +I | 5 | 18 | 48 | 2023-11-12 10:06:30.814 | 0 | 17 | | +I | 4 | 19 | 72 | 2023-11-12 10:06:31.821 | 0 | 18 | | +I | 7 | 20 | 51 | 2023-11-12 10:06:32.830 | 0 | 19 | | +I | 5 | 21 | 46 | 2023-11-12 10:06:33.838 | 0 | 20 | | +I | 6 | 22 | 60 | 2023-11-12 10:06:34.846 | 0 | 21 | | +I | 4 | 23 | 70 | 2023-11-12 10:06:35.852 | 0 | 22 |
The Kafka sink partitioner: when writing to Kafka, Flink acts as the Kafka producer;
the producer has a default partitioner: with a message key it hashes and takes the modulo, with an explicit partition number it writes to that partition;
if neither is given it uses the sticky partitioner; there is also a round-robin partitioner.
'sink.partitioner' = 'fixed': fixed is Flink's own partitioner, where each parallel subtask writes to exactly one Kafka partition.
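A hedged sketch of the documented alternatives for the Kafka SQL connector:

-- 'fixed'      : each Flink sink subtask writes to exactly one Kafka partition
-- 'round-robin': spread rows over all partitions (only applies when no message key is written)
-- 'default'    : fall back to the Kafka producer's own partitioner
'sink.partitioner' = 'round-robin'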
Retract stream: an update is sent as one delete message plus one insert message, so one update corresponds to two records.
Upsert stream: only two kinds of messages, upsert (update) and delete; an upsert stream cannot be expressed by the job code alone, it relies on the external system to merge by key.
upsert-kafka tables
If the table being written receives update operations, the plain kafka connector cannot handle it; use the Upsert Kafka connector instead.
The Upsert Kafka connector supports reading data from and writing data to a Kafka topic in upsert fashion.
As a source, the upsert-kafka connector produces a changelog stream in which each record represents an update or delete event. More precisely, the value in a record is interpreted as an UPDATE of the last value for the same key, if that key exists (if it does not, the update is treated as an INSERT). In table terms,
a record in the changelog stream is interpreted as an UPSERT, i.e. INSERT/UPDATE, because any existing row with the same key is overwritten. A message with a null value is treated as a DELETE.
As a sink, the upsert-kafka connector can consume a changelog stream. It writes INSERT/UPDATE_AFTER data as normal Kafka messages and DELETE data as Kafka messages with a null value (marking the key as deleted). Flink partitions the data by the value of the primary-key columns, so messages for the same key are ordered,
and therefore the update/delete messages for one key land in the same partition.
Stored in Kafka as key/value pairs:
if the record in Kafka is (1,a)
and we want to update it to (1,b):
with a retract stream this is -(1,a), +(1,b);
with upsert,
if the upsert-kafka connector reads these 2 records from Kafka, (1,a) and (1,b),
it reads from the beginning (an offset cannot be specified) and the result it gives is (1,b);
a delete is simply a null value;
it keeps all the records for the same primary key and automatically takes the latest.
Flink also partitions the data by the primary-key value, because Flink runs with multiple parallel subtasks;
if rows with the same key were processed by different subtasks the ordering would be lost,
so it partitions by the value of the primary-key columns.
Flink SQL> insert into t1 select id,ts,sum(vc) sumvc from source group by id,ts; [ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.ValidationException: Column types of query result and sink for 'default_catalog.mydatabase.t1' do not match. Cause: Different number of columns. Query schema: [id: INT, ts: BIGINT, sumvc: INT] Sink schema: [id: INT, ts: BIGINT, vc: INT, event_time: TIMESTAMP(3)] select id,sum(vc) sumvc from source group by id; +----+-------------+-------------+ | op | id | sumvc | +----+-------------+-------------+ | +I | 1 | 62 | | +I | 3 | 81 | | +I | 4 | 60 | | -U | 4 | 60 | | +U | 4 | 155 | | +I | 8 | 34 | | +I | 2 | 89 | | -U | 3 | 81 | | +U | 3 | 97 | | -U | 8 | 34 | | +U | 8 | 84 |
(1)创建upsert-kafka的映射表(必须定义主键) CREATE TABLE t2( id int , sumVC int , primary key (id) NOT ENFORCED ) WITH ( 'connector' = 'upsert-kafka', 'properties.bootstrap.servers' = 'hadoop102:9092', 'topic' = 'ws2', 'key.format' = 'json', 'value.format' = 'json' ) ; (2)插入upsert-kafka表 insert into t2 select id,sum(vc) sumVC from source group by id; (3)查询upsert-kafka表 upsert-kafka 无法从指定的偏移量读取,只会从主题的源读取。如此,才知道整个数据的更新过程。并且通过 -U,+U,+I 等符号来显示数据的变化过程。 select * from t2; +----+-------------+-------------+ | op | id | sumVC | +----+-------------+-------------+ | +I | 5 | 13 | | +I | 4 | 56 | | -U | 4 | 56 | | +U | 4 | 114 | | -U | 5 | 13 | | +U | 5 | 95 | | +I | 8 | 87 | | +I | 3 | 73 | | -U | 8 | 87 | | +U | 8 | 148 | | -U | 3 | 73 | | +U | 3 | 126 | | +I | 9 | 58 | | -U | 8 | 148 | | +U | 8 | 226 | | -U | 8 | 226 | | +U | 8 | 316 | | +I | 2 | 95 | | +I | 6 | 72 | | -U | 2 | 95 | | +U | 2 | 140 | | -U | 2 | 140 | | +U | 2 | 211 |
File
1) Create a FileSystem mapping table
CREATE TABLE t3 (
    id int,
    ts bigint,
    vc int
) WITH (
    'connector' = 'filesystem',
    'path' = 'hdfs://hadoop102:8020/data/t3',
    'format' = 'csv'
);
2) Write
insert into t3 values(1,210009987,3);
3) Query
select * from t3;
+----+-------------+----------------------+-------------+
| op |          id |                   ts |          vc |
+----+-------------+----------------------+-------------+
| +I |           1 |            210009987 |           3 |
+----+-------------+----------------------+-------------+
JDBC Mysql
CREATE TABLE `ws2` ( `id` int(11) NOT NULL, `ts` bigint(20) DEFAULT NULL, `vc` int(11) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 ; CREATE TABLE t4 ( id INT, ts BIGINT, vc INT, PRIMARY KEY (id) NOT ENFORCED ) WITH ( 'connector'='jdbc', 'url' = 'jdbc:mysql://hadoop102:3306/test?useUnicode=true&characterEncoding=UTF-8', 'username' = 'root', 'password' = '000000', 'connection.max-retry-timeout' = '60s', 'table-name' = 'ws2', 'sink.buffer-flush.max-rows' = '500', 'sink.buffer-flush.interval' = '5s', 'sink.max-retries' = '3', 'sink.parallelism' = '1' ); insert into t4 values(1,10000, 2); insert into t4 values(1,10000, 20); mysql> select * from ws2; +----+-------+------+ | id | ts | vc | +----+-------+------+ | 1 | 10000 | 2 | +----+-------+------+ 1 row in set (0.00 sec) mysql> select * from ws2; +----+-------+------+ | id | ts | vc | +----+-------+------+ | 1 | 10000 | 20 | +----+-------+------+ 1 row in set (0.00 sec)
savepoint
CREATE TABLE source ( id INT, ts BIGINT, vc INT ) WITH ( 'connector' = 'datagen', 'rows-per-second'='1', 'fields.id.kind'='random', 'fields.id.min'='1', 'fields.id.max'='10', 'fields.ts.kind'='sequence', 'fields.ts.start'='1', 'fields.ts.end'='1000000', 'fields.vc.kind'='random', 'fields.vc.min'='1', 'fields.vc.max'='100' ); CREATE TABLE sink ( id INT, ts BIGINT, vc INT ) WITH ( 'connector' = 'print' ); 停止作业, 触发savepoint SET state.checkpoints.dir='hdfs://hadoop102:8020/checkpoints'; SET state.savepoints.dir='hdfs://hadoop102:8020/savepoint'; insert into sink select * from source; show jobs; ---------------------------------+---------------------------------------------+----------+-------------------------+ | job id | job name | status | start time | +----------------------------------+---------------------------------------------+----------+-------------------------+ | ce22e75d58ca93478969629376e42bba | insert-into_default_catalog.mydatabase.t4 | FINISHED | 2023-11-12T03:27:06.613 | | 137a22fd08c28fa7c33821dfcbfa6962 | insert-into_default_catalog.mydatabase.t3 | FINISHED | 2023-11-12T03:09:53.954 | | 32f188f4725f51fb36618fedb6badc39 | collect | FINISHED | 2023-11-12T03:10:24.905 | | df987b6d6e486af59614125d99c5374d | insert-into_default_catalog.mydatabase.t4 | FINISHED | 2023-11-12T03:25:03.052 | | 140e4f0c0175b7c720eb2f36f6e0f3c9 | insert-into_default_catalog.mydatabase.sink | RUNNING | 2023-11-12T03:40:28.007 | +----------------------------------+---------------------------------------------+----------+-------------------------+ 停止作业,触发savepoint STOP JOB '140e4f0c0175b7c720eb2f36f6e0f3c9' WITH SAVEPOINT; +---------------------------------------------------------------+ | savepoint path | +---------------------------------------------------------------+ | hdfs://hadoop102:8020/savepoint/savepoint-140e4f-191c2883c7c6 | +---------------------------------------------------------------+ 从savepoint恢复 -- 设置从savepoint恢复的路径 SET execution.savepoint.path='hdfs://hadoop102:8020/savepoint/savepoint-140e4f-191c2883c7c6'; show jobs; +----------------------------------+---------------------------------------------+----------+-------------------------+ | job id | job name | status | start time | +----------------------------------+---------------------------------------------+----------+-------------------------+ | 140e4f0c0175b7c720eb2f36f6e0f3c9 | insert-into_default_catalog.mydatabase.sink | FINISHED | 2023-11-12T03:40:28.007 | | ce22e75d58ca93478969629376e42bba | insert-into_default_catalog.mydatabase.t4 | FINISHED | 2023-11-12T03:27:06.613 | | 137a22fd08c28fa7c33821dfcbfa6962 | insert-into_default_catalog.mydatabase.t3 | FINISHED | 2023-11-12T03:09:53.954 | | 32f188f4725f51fb36618fedb6badc39 | collect | FINISHED | 2023-11-12T03:10:24.905 | | df987b6d6e486af59614125d99c5374d | insert-into_default_catalog.mydatabase.t4 | FINISHED | 2023-11-12T03:25:03.052 | +----------------------------------+---------------------------------------------+----------+-------------------------+ insert into sink select * from source; -- 之后直接提交sql,就会从savepoint恢复 Latest Restore ID: 1 Restore Time: 2023-11-12 11:50:49 Type: Savepoint Path: hdfs://hadoop102:8020/savepoint/savepoint-305e93-1e617750771e --允许跳过无法还原的保存点状态 set 'execution.savepoint.ignore-unclaimed-state' = 'true'; 5)恢复后重置路径 指定execution.savepoint.path后,将影响后面执行的所有DML语句,可以使用RESET命令重置这个配置选项。 RESET execution.savepoint.path; 如果出现reset没生效的问题,可能是个bug,我们可以退出sql-client,再重新进,不需要重启flink的集群。
catalog
CREATE CATALOG my_jdbc_catalog WITH( 'type' = 'jdbc', 'default-database' = 'test', 'username' = 'root', 'password' = '000000', 'base-url' = 'jdbc:mysql://hadoop102:3306' ); 查看Catalog Flink SQL> show catalogs; +-----------------+ | catalog name | +-----------------+ | default_catalog | | my_jdbc_catalog | --查看当前的CATALOG SHOW CURRENT CATALOG; +----------------------+ | current catalog name | +----------------------+ | default_catalog | +----------------------+ 使用指定Catalog USE CATALOG my_jdbc_catalog; --查看当前的CATALOG SHOW CURRENT CATALOG; hive --service metastore & [1] 16943 $ 2023-11-12 12:10:41: Starting Hive Metastore Server $ netstat -anp | grep 9083 (Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.) tcp6 0 0 :::9083 :::* LISTEN 16943/java $ ps -ef | grep -i metastore 16943 2004 19 12:10 pts/0 00:00:09 /opt/module/jdk1.8.0_212/bin/java -Dproc_jar -Dproc_metastore -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/opt/module/hive/conf/parquet-logging.properties -Dyarn.log.dir=/opt/module/hadoop-3.1.3/logs -Dyarn.log.file=hadoop.log -Dyarn.home.dir=/opt/module/hadoop-3.1.3 -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/module/hadoop-3.1.3/lib/native -Xmx256m -Dhadoop.log.dir=/opt/module/hadoop-3.1.3/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/module/hadoop-3.1.3 -Dhadoop.id.str=atguigu -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/module/hive/lib/hive-metastore-3.1.2.jar org.apache.hadoop.hive.metastore.HiveMetaStore 17102 2004 0 12:11 pts/0 00:00:00 grep --color=auto -i metastore CREATE CATALOG myhive WITH ( 'type' = 'hive', 'default-database' = 'default', 'hive-conf-dir' = '/opt/module/hive/conf' ); 查看Catalog SHOW CATALOGS; Flink SQL> SHOW CATALOGS; +-----------------+ | catalog name | +-----------------+ | default_catalog | | my_jdbc_catalog | | myhive | +-----------------+ 使用指定Catalog USE CATALOG myhive; --查看当前的CATALOG SHOW CURRENT CATALOG; +----------------------+ | current catalog name | +----------------------+ | myhive | +----------------------+