Flink Basics

1. Flink Architecture

1.1 Flink Roles

  • Client: fetches and transforms the user code, then submits it to the JM.
  • JM (JobManager): schedules the job, further transforms the job graph, and distributes the tasks to the TMs.
  • TM (TaskManager): performs the actual data processing.

1.2 部署模式

    1. Standalone
#1. Configure the IP for web access
vim flink-conf.yaml

rest.address: 172.31.12.2
rest.bind-address: 172.31.12.2
env.java.opts.jobmanager: -Dfile.encoding=UTF-8 -Dlog4j2.formatMsgNoLookups=true -Dlog4j.formatMsgNoLookups=true
env.java.opts.taskmanager: -Dfile.encoding=UTF-8 -Dlog4j2.formatMsgNoLookups=true -Dlog4j.formatMsgNoLookups=true
classloader.check-leaked-classloader: false
env.java.opts.all: -Dfile.encoding=UTF-8

#2. Configure the masters file
vim masters

172.31.12.2:8081

#3. Start the cluster (JM and TMs); a TaskManager can also be started manually with taskmanager.sh
./bin/start-cluster.sh

#5. Web UI
http://172.31.12.2:8081

#6. Demo job
./bin/flink run ./examples/batch/WordCount.jar

    2. YARN
#vim conf/masters
172.31.15.13:8081

#vim conf/workers 
172.31.15.13
172.31.15.12
172.31.15.4
172.31.15.14
172.31.15.16

#vim conf/flink-conf.yaml
jobmanager.rpc.address: 172.31.15.13
jobmanager.rpc.port: 6123
jobmanager.bind-host: 0.0.0.0
# on each node, change this to that node's own IP
taskmanager.bind-host: 0.0.0.0
taskmanager.host: 172.31.15.4
rest.address: 172.31.15.13
rest.bind-address: 0.0.0.0
    3. K8s

The run modes below differ in the cluster's lifecycle, how resources are allocated, and whether the application's main() method runs on the client or on the JM.

  • Session mode: start one cluster and keep the session alive; jobs are submitted to the JM through the client. Resources are shared, which suits many small jobs.
# stop the standalone cluster first (stop-cluster.sh), then start the YARN session
yarn-session.sh -d -nm test 
flink run -d -c com.wsl.test wc2.jar # core1 7777

  • Per-job mode: a dedicated cluster is started for each job; when the job finishes, the cluster shuts down and its resources are released. Requires a resource management framework such as YARN or K8s.
# legacy syntax (older Flink versions, -m yarn-cluster)
flink run -d -m yarn-cluster -ynm wc -c com.wsl.day00.base.test wc2.jar
flink run -d -m yarn-cluster flink-1.13.0/examples/batch/WordCount.jar -input hdfs://master1:8020/input/wc.txt -output hdfs://master1:8020/output/result2 
# newer syntax (-t yarn-per-job)
flink run -d \
-p 1 \
-Dyarn.application.name=syslog   \
-Dtaskmanager.numberOfTaskSlots=3 \
-Dtaskmanager.memory.process.size=2048mb \
-Dyarn.containers.vcores=2 \
-Drest.flamegraph.enabled=true \
-Denv.java.opts="-XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
-t yarn-per-job \
-c com.wsl.day00.base.test \
wc2.jar 


flink list -t yarn-per-job -Dyarn.application.id=application_XXXX_YY
/data/module/hadoop-3.2.4/bin/yarn application -list
yarn application -list 2>/dev/null | awk '{print $2}' | grep dpi | wc -l # count applications whose name contains 'dpi'
yarn logs -applicationId application_1627612938926_0005 # view the application's logs

  • Application mode: the main() method is not executed on the client; the application is submitted to and run on the JM, and each application starts its own JM.
flink run-application -t yarn-application -c xxx xxx.jar

2. Core Concepts

Window classification

  • TimeWindow
  • CountWindow
  • SlidingProcessing, TumblingProcessing, Session, Global

//Keyed: records with the same key are sent to the same parallel subtask; each key defines its own set of windows that are computed independently.
stream.keyBy(...)
      .window(...)

//Non-keyed: parallelism = 1 (concrete window assigners are shown in the sketch below)
stream.windowAll(...)       
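
To make the two variants concrete, here is a minimal Java DataStream sketch (the class name WindowAssignerDemo and the toy (sensorId, value) tuples are illustrative, not from the original notes) showing tumbling, sliding, count and non-keyed windows:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowAssignerDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // toy source: (sensorId, value)
        DataStream<Tuple2<String, Integer>> stream = env.fromElements(
                Tuple2.of("s1", 1), Tuple2.of("s1", 2), Tuple2.of("s2", 3));

        // keyed tumbling processing-time window, 5s
        stream.keyBy(t -> t.f0)
              .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
              .sum(1)
              .print("tumbling");

        // keyed sliding processing-time window, 10s size, 5s slide
        stream.keyBy(t -> t.f0)
              .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
              .sum(1)
              .print("sliding");

        // keyed count window: fires every 3 elements per key
        stream.keyBy(t -> t.f0)
              .countWindow(3)
              .sum(1)
              .print("count");

        // non-keyed window: all records go through a single parallel subtask
        stream.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
              .sum(1)
              .print("windowAll");

        env.execute("window-assigner-demo");
    }
}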

Aggregation functions:

  • Full-window aggregation: apply(), process()
  • Incremental aggregation: aggregate(), reduce() (can be combined with a full-window function, see the sketch below)
  • Rich functions: runtime context (enables state programming), lifecycle methods
  • Process functions: side outputs, runtime context (state programming), lifecycle methods, timers
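
A hedged Java sketch of combining the two aggregation styles on the same toy (sensorId, value) stream: the AggregateFunction does the incremental work (only a (sum, count) accumulator is kept per window) and a ProcessWindowFunction wraps the pre-aggregated result with window metadata. The class names AvgAgg and WrapWithWindow are made up for illustration.

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class AggDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("s1", 3), Tuple2.of("s1", 5), Tuple2.of("s2", 7))
           .keyBy(t -> t.f0)
           .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
           // incremental part + full-window wrapper
           .aggregate(new AvgAgg(), new WrapWithWindow())
           .print();

        env.execute("agg-demo");
    }

    // incremental aggregation: accumulator is (sum, count), result is the average
    public static class AvgAgg implements AggregateFunction<Tuple2<String, Integer>, Tuple2<Long, Long>, Double> {
        @Override public Tuple2<Long, Long> createAccumulator() { return Tuple2.of(0L, 0L); }
        @Override public Tuple2<Long, Long> add(Tuple2<String, Integer> in, Tuple2<Long, Long> acc) {
            return Tuple2.of(acc.f0 + in.f1, acc.f1 + 1);
        }
        @Override public Double getResult(Tuple2<Long, Long> acc) { return acc.f0 / (double) acc.f1; }
        @Override public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
        }
    }

    // full-window part: receives only the pre-aggregated average plus window metadata
    public static class WrapWithWindow extends ProcessWindowFunction<Double, String, String, TimeWindow> {
        @Override
        public void process(String key, Context ctx, Iterable<Double> avg, Collector<String> out) {
            out.collect("key=" + key + " window=[" + ctx.window().getStart() + ","
                    + ctx.window().getEnd() + ") avg=" + avg.iterator().next());
        }
    }
}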

Watermark

.assignTimestampsAndWatermarks(forBoundedOutOfOrderness(...)) : when the watermark reaches the window end time the window fires its computation, but the window is not closed yet; every late record that still arrives triggers another computation.
.allowedLateness(Time.seconds(3)) : the window is only truly closed once the allowed lateness has also passed.
.getSideOutput(outputTag) : records that arrive after the window has closed are emitted through the side output.
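
Putting the three pieces together, a hedged Java sketch (toy (id, value, timestampMillis) records; the tag name "late-data" and the 5-second window are illustrative choices):

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LatenessDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // side-output tag for records that arrive after the window has finally closed
        OutputTag<Tuple3<String, Integer, Long>> lateTag =
                new OutputTag<Tuple3<String, Integer, Long>>("late-data") {};

        SingleOutputStreamOperator<Tuple3<String, Integer, Long>> result = env
                .fromElements(Tuple3.of("s1", 1, 1000L), Tuple3.of("s1", 2, 7000L), Tuple3.of("s1", 3, 1500L))
                // watermark = max event time seen - 1s (bounded out-of-orderness)
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(1))
                                .withTimestampAssigner((e, ts) -> e.f2))
                .keyBy(e -> e.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                // the window fires at its end time but stays open 3 more seconds; each late record re-fires it
                .allowedLateness(Time.seconds(3))
                // records later than watermark + allowed lateness go to the side output
                .sideOutputLateData(lateTag)
                .sum(1);

        result.print("on-time/updated");
        result.getSideOutput(lateTag).print("too-late");

        env.execute("lateness-demo");
    }
}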

State

  • Raw State
  • Managed State
    • Keyed State (see the ValueState sketch below)
    • Operator State (BroadcastState)
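
A small sketch of keyed state accessed through the runtime context: a hypothetical RichFlatMapFunction that remembers the last value per key in a ValueState and flags large jumps (the threshold of 10 and all names are illustrative).

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// keyed ValueState: remembers the last value per key and emits a message on a big jump
public class LevelJumpFunction extends RichFlatMapFunction<Tuple2<String, Integer>, String> {

    private transient ValueState<Integer> lastValue;

    @Override
    public void open(Configuration parameters) {
        lastValue = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-value", Types.INT));
    }

    @Override
    public void flatMap(Tuple2<String, Integer> in, Collector<String> out) throws Exception {
        Integer prev = lastValue.value();            // null for the first record of this key
        if (prev != null && Math.abs(in.f1 - prev) > 10) {
            out.collect(in.f0 + " jumped from " + prev + " to " + in.f1);
        }
        lastValue.update(in.f1);
    }
}

// usage: stream.keyBy(t -> t.f0).flatMap(new LevelJumpFunction())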

State backends

Management of local state.

  • Default, HashMapStateBackend: state is kept as objects on the TM's JVM heap
  • Embedded RocksDB, EmbeddedRocksDBStateBackend: state is persisted to the TM's local data directories, with serialization/deserialization on every access

Configuration

  • In flink-conf.yaml: state.backend: hashmap
  • Per job in code (see the sketch below)
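
A per-job sketch in code (the HDFS path is a placeholder; the RocksDB backend additionally requires the flink-statebackend-rocksdb dependency on the classpath):

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // default: state kept as objects on the TaskManager JVM heap
        env.setStateBackend(new HashMapStateBackend());

        // alternative (overrides the line above, pick one): embedded RocksDB,
        // state serialized into the TM's local data directories
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // where checkpoints of that state are written (placeholder path)
        env.getCheckpointConfig().setCheckpointStorage("hdfs://namenode:8020/flink/checkpoints");
    }
}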

checkpoint

Snapshots are taken periodically; on failure, the job restores the previous state from the latest checkpoint.

  • Periodic triggering: configured in code (see the sketch below)
  • What gets saved (barrier): once every task has processed up to the same record, the state at that point is saved as a snapshot.
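
A typical checkpoint setup in code, as a sketch (the interval, timeouts and the HDFS path are placeholder values):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // trigger a checkpoint every 5 s with exactly-once (aligned barrier) semantics
        env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig conf = env.getCheckpointConfig();
        conf.setCheckpointStorage("hdfs://namenode:8020/flink/chk"); // placeholder path
        conf.setCheckpointTimeout(60_000L);          // abort a checkpoint that takes longer than 1 min
        conf.setMinPauseBetweenCheckpoints(500L);    // pause between the end of one checkpoint and the start of the next
        conf.setMaxConcurrentCheckpoints(1);         // no overlapping checkpoints
        conf.setTolerableCheckpointFailureNumber(3); // tolerate a few failed checkpoints before failing the job
        // keep the last externalized checkpoint when the job is cancelled
        conf.setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // optionally enable unaligned barriers to reduce alignment back-pressure
        conf.enableUnalignedCheckpoints();
    }
}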

barrier

The marker that decides at which point in the stream the checkpoint is taken.

The JM periodically sends a command to the TMs to trigger a checkpoint (carrying a checkpoint id), and a barrier is injected into the source. The source task saves its own state, taking a checkpoint, and then forwards the barrier downstream, after which the source can continue reading new data; the operator tasks further downstream take their checkpoints in the same way as the barrier passes through them. At a keyBy repartition the barrier has to be broadcast to every parallel instance, so a downstream operator receives multiple barriers and must perform "barrier alignment": it may only start saving its state once the barriers from all parallel upstream partitions have arrived.

  • Barrier alignment: while the downstream task is waiting for the barriers of all parallel instances, an upstream task whose barrier has already arrived keeps sending data (that data already belongs to the next checkpoint):
    • At-least-once: the data is processed immediately; after a restart, the data between the two checkpoints may be recomputed.
    • Exactly-once: the data between the two checkpoints is not processed immediately but buffered instead.
  • Unaligned barriers: the data and the barrier are buffered directly (the buffered in-flight data becomes part of the checkpoint)
    • Exactly-once

End-to-end consistency

3. SQL

3.1 Concepts

Dynamic tables
Continuous queries: update queries (Update) and append queries (Append)
Table -> stream: append-only stream (+ only), retract stream (+ / -), upsert stream (upsert / delete); see the sketch below
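
A small Java Table API sketch of the same idea (the clicks table and its datagen fields are made up for illustration): a plain projection stays append-only, while a GROUP BY produces an updating table that has to be read as a changelog (retract) stream.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ChangelogDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        tEnv.executeSql(
                "CREATE TABLE clicks (user_name STRING, url STRING) WITH (" +
                " 'connector' = 'datagen', 'rows-per-second' = '1'," +
                " 'fields.user_name.length' = '2', 'fields.url.length' = '4')");

        // projection only -> append-only result, every row is an insert (+I)
        Table appendOnly = tEnv.sqlQuery("SELECT user_name, url FROM clicks");
        tEnv.toDataStream(appendOnly).print("append");

        // aggregation -> updating result; reading it requires a changelog (retract) stream
        Table counts = tEnv.sqlQuery("SELECT user_name, COUNT(*) AS cnt FROM clicks GROUP BY user_name");
        tEnv.toChangelogStream(counts).print("changelog"); // rows carry +I / -U / +U change flags

        env.execute("changelog-demo");
    }
}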

--1.vim sql-client-init-paimon.sql

CREATE CATALOG fs_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'hdfs://172.31.15.13:4007/paimon/fs'
);

USE CATALOG fs_catalog;
CREATE DATABASE MyDatabase;
USE MyDatabase;

SET 'sql-client.execution.result-mode' = 'tableau';
SET 'execution.runtime-mode' = 'streaming'; 
SET parallelism.default=1;
SET table.exec.state.ttl=1000;

--2. Start Flink (a YARN session)
yarn-session.sh -d

--3. Start the SQL client
sql-client.sh embedded -s yarn-session -i conf/sql-client-init.sql

3.2 Queries

1.DataGen & print

CREATE TABLE source ( 
     id INT, 
     ts BIGINT, 
     vc INT
 ) WITH ( 
'connector' = 'datagen', 
'rows-per-second'='1', 
'fields.id.kind'='random', 
'fields.id.min'='1',
'fields.id.max'='10',
'fields.ts.kind'='sequence',
'fields.ts.start'='1', 
'fields.ts.end'='1000000',
'fields.vc.kind'='random', 
'fields.vc.min'='1', 
'fields.vc.max'='10'
);

select * from source;
CREATE TABLE sink (
    id INT, 
    ts BIGINT, 
    vc INT
) WITH (
'connector' = 'print'
);

INSERT INTO sink select  * from source;

2.with

WITH source_with_total AS (
    SELECT id, vc+10 AS total
    FROM source
)

SELECT id, SUM(total)
FROM source_with_total
GROUP BY id;

3. Group aggregation

CREATE TABLE source2 (
dim STRING,  --dimension
user_id BIGINT,
price BIGINT,
row_time AS cast(CURRENT_TIMESTAMP as timestamp(3)),
WATERMARK FOR row_time AS row_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'datagen',
'rows-per-second' = '1',
'fields.dim.length' = '1',
'fields.user_id.min' = '1',
'fields.user_id.max' = '10',
'fields.price.min' = '1',
'fields.price.max' = '10'
);

select dim,
count(*) as pv,
sum(price) as sum_price,
max(price) as max_price,
min(price) as min_price,
-- number of distinct users (UV)
count(distinct user_id) as uv,
cast((UNIX_TIMESTAMP(CAST(row_time AS STRING))) / 60 as bigint) as window_start
from source2
group by
dim,
-- UNIX_TIMESTAMP returns a second-level timestamp; dividing by 60 buckets it into 1-minute groups,
cast((UNIX_TIMESTAMP(CAST(row_time AS STRING))) / 60 as bigint);

4. Group window aggregation (deprecated since 1.13)

CREATE TABLE ws (
  id INT,
  vc INT,
  pt AS PROCTIME(), --processing time
  et AS cast(CURRENT_TIMESTAMP as timestamp(3)), --event time
  WATERMARK FOR et AS et - INTERVAL '5' SECOND   --watermark
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10',
  'fields.id.min' = '1',
  'fields.id.max' = '3',
  'fields.vc.min' = '1',
  'fields.vc.max' = '1'
);
--tumbling window
select  
id,
TUMBLE_START(et, INTERVAL '5' SECOND)  wstart,
TUMBLE_END(et, INTERVAL '5' SECOND)  wend,
sum(vc) sumVc
from ws
group by id, TUMBLE(et, INTERVAL '5' SECOND);
--sliding (hop) window
select  
id,
HOP_START(et, INTERVAL '3' SECOND,INTERVAL '5' SECOND)   wstart,
HOP_END(et, INTERVAL '3' SECOND,INTERVAL '5' SECOND)  wend,
sum(vc) sumVc
from ws
group by id, HOP(et, INTERVAL '3' SECOND,INTERVAL '5' SECOND);

5. Window TVF (table-valued function) aggregation

--tumbling
SELECT 
window_start, 
window_end, 
id , 
SUM(vc) sumVC
FROM TABLE(
  TUMBLE(TABLE ws, DESCRIPTOR(et), INTERVAL '5' SECONDS))
GROUP BY window_start, window_end, id;
--hop (sliding)
SELECT 
window_start, 
window_end, 
id , 
SUM(vc) sumVC
FROM TABLE(
  HOP(TABLE ws, DESCRIPTOR(et), INTERVAL '5' SECONDS , INTERVAL '10' SECONDS))
GROUP BY window_start, window_end, id;
--cumulate window
SELECT 
window_start, 
window_end, 
id , 
SUM(vc) sumVC
FROM TABLE(
  CUMULATE(TABLE ws1, DESCRIPTOR(et), INTERVAL '2' SECONDS , INTERVAL '6' SECONDS))--step size, maximum window size
GROUP BY window_start, window_end, id;
--multi-dimensional analysis
SELECT 
window_start, 
window_end, 
id , 
SUM(vc) sumVC
FROM TABLE(
  TUMBLE(TABLE ws, DESCRIPTOR(et), INTERVAL '5' SECONDS))
GROUP BY window_start, window_end,
rollup( (id) );
--  cube( (id) )
--  grouping sets( (id),() )

6.Over

SELECT
  agg_func(agg_col) OVER (
    [PARTITION BY col1[, col2, ...]]
    ORDER BY time_col  --ORDER BY must be a time attribute column and ascending only
    range_definition), --row-based (e.g. the 10 preceding rows + the current row = 11 rows) or time-interval-based
  ...
FROM ...
--count of water-level records each sensor received from 10 seconds ago up to now
SELECT 
    id, 
    et, 
    vc,
    count(vc) OVER (
        PARTITION BY id
        ORDER BY et
        RANGE BETWEEN INTERVAL '10' SECOND PRECEDING AND CURRENT ROW
  ) AS cnt
FROM ws;

SELECT 
    id, 
    et, 
    vc,
count(vc) OVER w AS cnt,
sum(vc) OVER w AS sumVC
FROM ws
WINDOW w AS (
    PARTITION BY id
    ORDER BY et
    RANGE BETWEEN INTERVAL '10' SECOND PRECEDING AND CURRENT ROW
);

--average water level of each sensor over the previous 5 rows up to the current row
SELECT 
    id, 
    et, 
    vc,
    avg(vc) OVER (
    	PARTITION BY id
    	ORDER BY et
    	ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
) AS avgVC
FROM ws;

SELECT 
    id, 
    et, 
    vc,
avg(vc) OVER w AS avgVC,
count(vc) OVER w AS cnt
FROM ws
WINDOW w AS (
    PARTITION BY id
    ORDER BY et
    ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
);

7. Top-N

--ORDER BY does not have to be a time field and may be descending
--take the top 3 water-level values of each sensor
select 
    id,
    et,
    vc,
    rownum
from 
(
    select 
        id,
        et,
        vc,
        row_number() over(partition by id  order by vc desc ) as rownum
    from ws
)
where rownum<=3;

8. Deduplication

--ORDER BY must be a time attribute column; it may be descending (descending keeps the most recent row). Deduplication is optimized compared with Top-N.
--deduplicate the water-level values of each sensor
select 
    id,
    et,
    vc,
    rownum
from 
(
    select 
        id,
        et,
        vc,
        row_number() over(partition by id,vc  order by et  ) as rownum
    from ws
)
where rownum=1;

9. Joins

Regular joins, the same as in standard SQL

--Inner join: a row is emitted only once both streams have matched
SELECT *
FROM ws
INNER JOIN ws1
ON ws.id = ws1.id

--Left join: the left row is emitted as soon as it arrives, with NULL on the right side; when the matching right row arrives a retraction is issued (first -, then +)
SELECT *
FROM ws
LEFT JOIN ws1
ON ws.id = ws1.id

--Right Join
SELECT *
FROM ws
RIGHT JOIN ws1
ON ws.id = ws1.id

--Full Join
SELECT *
FROM ws
FULL OUTER JOIN ws1
ON ws.id = ws1.id

Interval join

SELECT 
  *
FROM ws,ws1
WHERE ws.id = ws1. id
AND ws.et BETWEEN ws1.et - INTERVAL '2' SECOND AND ws1.et + INTERVAL '2' SECOND 

Lookup join (dimension-table join)

--The joins above are stream-to-stream joins; the problem is that the history of both streams is kept in state (hence the TTL setting), whereas dimension data has to stay available permanently.
--A lookup join fits a large main stream and a small lookup table: for every record on the main stream the lookup table is queried, and the main stream's data is not stored.
main_table
JOIN dim_table FOR SYSTEM_TIME AS OF main_table.proc_time AS alias
ON main_table.col = alias.col
CREATE TABLE customers (
  id INT,
  name STRING,
  country STRING,
  zip STRING
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://hadoop102:3306/customerdb',
  'table-name' = 'customers'
);

--For every record arriving on the order table, the customers table in MySQL is queried for dimension data
SELECT o.order_id, o.total, c.country, c.zip
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;

10.Hints

select * from ws1/*+ OPTIONS('rows-per-second'='10')*/;

11.Module

--Supported modules: core, hive, user-defined

--1. Upload the jars: flink-sql-connector-hive-3.1.3_2.12-1.17.0.jar, hadoop-mapreduce-client-core-3.3.4.jar
--2. Load the module
load module hive with ('hive-version'='3.1.3');
SHOW MODULES;
-- Hive's split function can now be called
select split('a,b', ',');

3.3 Connector

Kafka connector

--put flink-sql-connector-kafka-1.17.0.jar into lib/
CREATE TABLE kafkat( 
  `event_time` TIMESTAMP(3) METADATA FROM 'timestamp',
  `partition` BIGINT METADATA VIRTUAL,
  `offset` BIGINT METADATA VIRTUAL,
id int, 
ts bigint , 
vc int )
WITH (
  'connector' = 'kafka',
  'properties.bootstrap.servers' = '172.31.15.21:9092',
  'properties.group.id' = 'wsl',
  'scan.startup.mode' = 'earliest-offset',
  'sink.partitioner' = 'fixed',
  'topic' = 'ws1',
  'format' = 'json'
); 

insert into kafkat(id,ts,vc) select * from source;
select * from kafkat;

upsert-kafka

--The Upsert Kafka connector reads data from and writes data to a Kafka topic in upsert fashion,
--acting as a source and sink that consume and produce changelog streams
CREATE TABLE kafkat2( 
    id int , 
    sumVC int ,
    primary key (id) NOT ENFORCED --a primary key must be defined
)
WITH (
  'connector' = 'upsert-kafka',
  'properties.bootstrap.servers' = '172.31.15.21:9092',
  'topic' = 'wsl2',
  'key.format' = 'json',
  'value.format' = 'json'
);


insert into kafkat2 select  id,sum(vc) sumVC  from source group by id;
select * from kafkat2;

File connector

--flink-connector-files-1.17.0.jar ships with Flink by default, but it conflicts with flink-sql-connector-hive-3.1.3_2.12-1.17.0.jar; either remove that jar, or replace lib/flink-table-planner-loader-1.17.0.jar with opt/flink-table-planner_2.12-1.17.0.jar
CREATE TABLE file( id int, ts bigint , vc int )
WITH (
  'connector' = 'filesystem',
  'path' = 'hdfs://172.31.15.12:4007/paimon/fs',
  'format' = 'csv'
);

insert into file select * from source;

JDBC connector

--flink-connector-jdbc-1.17-20230109.003314-120.jar
CREATE TABLE t4
(
    id INT,
    ts BIGINT,
    vc INT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector'='jdbc',
    'url' = 'jdbc:mysql://hadoop102:3306/test?useUnicode=true&characterEncoding=UTF-8',
    'username' = 'root',
    'password' = '000000',
    'connection.max-retry-timeout' = '60s',
    'table-name' = 'ws2',
    'sink.buffer-flush.max-rows' = '500',
    'sink.buffer-flush.interval' = '5s',
    'sink.max-retries' = '3',
    'sink.parallelism' = '1'
);

3.4 savepoint

--these can also be set in the configuration file
SET state.checkpoints.dir='hdfs://172.31.15.12:4007/chk'; 
SET state.savepoints.dir='hdfs://172.31.15.12:4007/sp';

--start the job
INSERT INTO sink select  * from source;
show jobs;
--stop the job with a savepoint
STOP JOB '228d70913eab60dda85c5e7f78b5782c' WITH SAVEPOINT;
--restart from the savepoint path
SET execution.savepoint.path='hdfs://172.31.15.12:4007/sp/savepoint-d139a5-80e7168d5664'; 
INSERT INTO sink select  * from source;

If you run into this situation, set the savepoint path first and then submit the job.

--RESET has no effect; the sql-client needs to be restarted
set pipeline.name=test;
reset pipeline.name;

3.5 Catalog

A catalog holds metadata. Flink's metadata can be persisted to external storage, or metadata from external systems can be mapped into Flink.

  • GenericInMemoryCatalog : default_catalog
  • JdbcCatalog
--flink-connector-jdbc-1.17-20230109.003314-120.jar
CREATE CATALOG my_jdbc_catalog WITH(
    'type' = 'jdbc',
    'default-database' = 'bigdata',
    'username' = 'root',
    'password' = 'Xnetworks@c0M',
    'base-url' = 'jdbc:mysql://172.31.15.7:3306'
);

SHOW CATALOGS;
SHOW CURRENT CATALOG;
USE CATALOG my_jdbc_catalog;
  • HiveCatalog