Flink Process Functions
1. Introduction
1. Overview
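A process function (ProcessFunction) defines a transformation on a data stream, so it can be classed as a transformation operator. Through its context it exposes a "timer service" (TimerService), with which we can access the events in the stream, their timestamps, the current watermark, and even register timers. The class extends AbstractRichFunction, so it has every capability of a rich function and can also access state and other runtime information. A process function can additionally emit data directly to a side output. In short, it is the place to implement arbitrary custom business logic.
A process function is used like any other transformation: call .process() on a DataStream and pass in a ProcessFunction.
A simple example: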
package cn.qz.process;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import java.time.Duration;
public class Process1 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Build the test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),
                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L),
                new MyEvent("tq", "/product?id=3", 1800L)
        );
        // Out-of-order stream; a bound of zero tolerates no lateness
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                        .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                            @Override
                            public long extractTimestamp(MyEvent element, long recordTimestamp) {
                                return element.getTimestamp();
                            }
                        }))
                // No keyBy or window is needed: a plain ProcessFunction runs directly on the DataStream
                .process(new ProcessFunction<MyEvent, String>() {
                    @Override
                    public void processElement(MyEvent value, Context ctx, Collector<String> out) throws Exception {
                        // Forward only the events of user "zs"
                        if ("zs".equals(value.user)) {
                            out.collect(value.getUser());
                        }
                    }
                }).print();
        executionEnvironment.execute();
    }
}
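Output (the number before > is the index of the parallel subtask that printed the line, so the order varies from run to run):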
5> zs
3> zs
4> zs
6> zs
7> zs
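A brief explanation:
1. ProcessFunction extends AbstractRichFunction and takes two type parameters: I, the input type, and O, the output type.
2. It defines two methods:
The abstract method processElement is called once for every element in the stream. It has no return value; results are emitted through the Collector.
The non-abstract method onTimer handles timed operations. It is called only when a previously registered timer fires, and timers are registered through the "timer service" TimerService. Only a keyed stream (KeyedStream) supports timers.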
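2. Categories
Flink provides roughly eight different process functions:
1. ProcessFunction: the most basic process function, passed as the argument when .process() is called directly on a DataStream.
2. KeyedProcessFunction: works on a keyed stream (KeyedStream). Timers can be used.
3. ProcessWindowFunction: the process function applied after windowing, and the typical full-window function. Works on a WindowedStream.
4. ProcessAllWindowFunction: passed to .process() on an AllWindowedStream.
5. CoProcessFunction: the process function for two connected streams. Works on ConnectedStreams.
6. ProcessJoinFunction: the process function for an interval join of two streams. Works on IntervalJoined.
7. BroadcastProcessFunction: the process function for a broadcast-connected stream. Works on a BroadcastConnectedStream.
8. KeyedBroadcastProcessFunction: the process function for a keyed broadcast-connected stream, also based on BroadcastConnectedStream. The difference from the previous one is that the connected stream here is produced by connecting a KeyedStream with a BroadcastStream.
The rest of this article looks mainly at KeyedProcessFunction and ProcessWindowFunction.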
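2. The keyed process function: KeyedProcessFunction
Keying a stream is what makes aggregation and windowed computation possible: after keyBy, the records are split into groups by key, and the groups are distributed across the parallel subtasks. In addition, only a KeyedStream supports setting timers through the TimerService. The usual pattern is therefore to partition with keyBy first and compute afterwards.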
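To make the grouping concrete, here is a minimal plain-Java sketch (no Flink dependency) of the idea that records with the same key always land on the same parallel subtask. The class KeyPartitionSketch and its helpers are hypothetical names, and the hashCode-modulo rule is a simplifying assumption: Flink's own assignment goes through key groups and a different hash, so the indices printed here will not match a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified model of keyBy: equal keys are always routed to the same subtask. */
final class KeyPartitionSketch {

    /** Hypothetical helper: maps a key to a subtask index in [0, parallelism). */
    static int subtaskFor(Object key, int parallelism) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** Groups the given keys by the subtask they would be sent to. */
    static Map<Integer, List<String>> route(List<String> keys, int parallelism) {
        Map<Integer, List<String>> bySubtask = new TreeMap<>();
        for (String key : keys) {
            bySubtask.computeIfAbsent(subtaskFor(key, parallelism), k -> new ArrayList<>()).add(key);
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        // Every "zs" ends up in the same list, and so does every "ls".
        System.out.println(route(Arrays.asList("zs", "ls", "tq", "zs", "ls"), 4));
    }
}
```

Because the subtask depends only on the key, any state or timer kept for that key is also local to one subtask, which is why timers are tied to keyed streams.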
1. Timers and the timer service (TimerService)
TimerService is Flink's basic service interface for time and timers. It contains the following six methods:
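// Returns the current processing time
long currentProcessingTime();
// Returns the current watermark (event time)
long currentWatermark();
// Registers a processing-time timer that fires once processing time reaches `time`
void registerProcessingTimeTimer(long time);
// Registers an event-time timer that fires once the watermark reaches or passes `time`
void registerEventTimeTimer(long time);
// Deletes the processing-time timer whose trigger time is `time`
void deleteProcessingTimeTimer(long time);
// Deletes the event-time timer whose trigger time is `time`
void deleteEventTimeTimer(long time);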
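The methods fall into two groups (processing time and event time) with three operations each: reading the current time, registering a timer, and deleting a timer.
For each key and timestamp there is at most one timer. If the same timer is registered several times, onTimer is still called only once. In addition, onTimer and processElement are invoked synchronously, so state is never modified concurrently.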
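The "at most one timer per key and timestamp" rule behaves like a set. The sketch below models it in plain Java. TimerDedupSketch and its methods are hypothetical names chosen for illustration; this is not how Flink implements its timer service internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy timer registry: one sorted set of timestamps per key, so duplicates collapse. */
final class TimerDedupSketch {

    private final Map<String, TreeSet<Long>> timersByKey = new TreeMap<>();

    /** Registering the same (key, time) pair twice leaves a single entry. */
    void register(String key, long time) {
        timersByKey.computeIfAbsent(key, k -> new TreeSet<>()).add(time);
    }

    /** Removes the timer for (key, time) if it exists. */
    void delete(String key, long time) {
        TreeSet<Long> timers = timersByKey.get(key);
        if (timers != null) {
            timers.remove(time);
        }
    }

    /** Fires and removes every timer with time <= now; returns "key@time" per firing. */
    List<String> advanceTo(long now) {
        List<String> fired = new ArrayList<>();
        for (Map.Entry<String, TreeSet<Long>> entry : timersByKey.entrySet()) {
            TreeSet<Long> timers = entry.getValue();
            while (!timers.isEmpty() && timers.first() <= now) {
                fired.add(entry.getKey() + "@" + timers.pollFirst());
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        TimerDedupSketch service = new TimerDedupSketch();
        service.register("Mary", 10_000L);
        service.register("Mary", 10_000L); // duplicate registration, ignored by the set
        service.register("Bob", 10_000L);  // same time but another key: a separate timer
        System.out.println(service.advanceTo(10_000L)); // prints [Bob@10000, Mary@10000]
    }
}
```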
1. Processing-time timers
package cn.qz.process;
import cn.qz.time.MyEvent;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.util.Calendar;
import java.util.Random;
public class ProcessingTimeTimer {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        // Processing-time semantics: no timestamps or watermarks need to be assigned
        SingleOutputStreamOperator<MyEvent> dataStreamSource = executionEnvironment.addSource(new ClickSource());
        // Timers require a KeyedStream
        dataStreamSource.keyBy(data -> true)
                .process(new KeyedProcessFunction<Boolean, MyEvent, String>() {
                    @Override
                    public void processElement(MyEvent value, Context ctx, Collector<String> out) throws Exception {
                        long currTs = ctx.timerService().currentProcessingTime();
                        out.collect("Data arrived, arrival time: " + new Timestamp(currTs));
                        // Register a timer that fires 10 seconds from now
                        ctx.timerService().registerProcessingTimeTimer(currTs + 10 * 1000L);
                    }

                    @Override
                    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                        out.collect("Timer fired S, trigger time: " + new Timestamp(timestamp));
                        // Sleep inside onTimer; processElement cannot run until this method returns
                        Thread.sleep(1 * 1000);
                        out.collect("Timer fired S, trigger time: " + new Timestamp(timestamp));
                    }
                })
                .print();
        executionEnvironment.execute();
    }
}

class ClickSource implements SourceFunction<MyEvent> {
    // Flag that controls whether the source keeps generating data
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<MyEvent> ctx) throws Exception {
        Random random = new Random(); // Picks random values from the fixed data sets below
        String[] users = {"Mary", "Alice", "Bob", "Cary"};
        String[] urls = {"./home", "./cart", "./fav", "./prod?id=1", "./prod?id=2"};
        while (running) {
            ctx.collect(new MyEvent(
                    users[random.nextInt(users.length)],
                    urls[random.nextInt(urls.length)],
                    Calendar.getInstance().getTimeInMillis()
            ));
            // Emit one click event every 5 seconds so the output is easy to follow
            Thread.sleep(5000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
Output:
Data arrived, arrival time: 2022-08-30 14:13:09.948
Data arrived, arrival time: 2022-08-30 14:13:14.97
Timer fired S, trigger time: 2022-08-30 14:13:19.948
Timer fired S, trigger time: 2022-08-30 14:13:19.948
Data arrived, arrival time: 2022-08-30 14:13:20.951
Timer fired S, trigger time: 2022-08-30 14:13:24.97
Timer fired S, trigger time: 2022-08-30 14:13:24.97
Data arrived, arrival time: 2022-08-30 14:13:25.973
Each timer fires 10 seconds after the element that registered it. The third element is reported about six seconds after the second one even though the source emits every five seconds: onTimer was sleeping for one second, and processElement is called only after it returns.
2. Event-time timers
package cn.qz.process;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;
public class EventTimeTimer {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        SingleOutputStreamOperator<MyEvent> stream = env.addSource(new CustomSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                            @Override
                            public long extractTimestamp(MyEvent element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        }));
        // Define an event-time timer on the KeyedStream
        stream.keyBy(data -> true)
                .process(new KeyedProcessFunction<Boolean, MyEvent, String>() {
                    @Override
                    public void processElement(MyEvent value, Context ctx, Collector<String> out) throws Exception {
                        out.collect("Data arrived, timestamp: " + ctx.timestamp());
                        out.collect("Data arrived, watermark: " + ctx.timerService().currentWatermark() + "\n ------- separator -------");
                        // Register a timer 10 seconds after the element's timestamp
                        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 10 * 1000L);
                    }

                    @Override
                    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                        out.collect("Timer fired, trigger time: " + timestamp);
                    }
                })
                .print();
        env.execute();
    }

    // Custom source that emits test data
    public static class CustomSource implements SourceFunction<MyEvent> {
        @Override
        public void run(SourceContext<MyEvent> ctx) throws Exception {
            // Emit the first test element right away
            ctx.collect(new MyEvent("Mary", "./home", 1000L));
            // Pause for 5 seconds to make the effect easier to see
            Thread.sleep(5000L);
            // Emit an element whose timestamp is 10 seconds later
            ctx.collect(new MyEvent("Mary", "./home", 11000L));
            Thread.sleep(5000L);
            // Emit an element whose timestamp is 10 seconds + 1 ms later
            ctx.collect(new MyEvent("Alice", "./cart", 11001L));
            Thread.sleep(5000L);
        }

        @Override
        public void cancel() { }
    }
}
Output:
Data arrived, timestamp: 1000
Data arrived, watermark: -9223372036854775808
------- separator -------
Data arrived, timestamp: 11000
Data arrived, watermark: 999
------- separator -------
Data arrived, timestamp: 11001
Data arrived, watermark: 10999
------- separator -------
Timer fired, trigger time: 11000
Timer fired, trigger time: 21000
Timer fired, trigger time: 21001
As the output shows, under event-time semantics a timer fires once the watermark has advanced to the registered time. The timer for 11000 fires only after the third element pushes the watermark to 11000; the last two fire when the source finishes and Flink advances the watermark to Long.MAX_VALUE.
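The firing order can be replayed without Flink. The following plain-Java sketch is a simplified model under stated assumptions: the watermark trails the largest timestamp seen by 1 ms (the behaviour of a monotonous-timestamps strategy), it is advanced right after each element, and it jumps to Long.MAX_VALUE when the input ends. EventTimeTimerSketch and its log format are hypothetical, and a real job emits watermarks periodically rather than per element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Replays a single-key stream and records which watermark each element sees and when timers fire. */
final class EventTimeTimerSketch {

    static List<String> run(long[] eventTimestamps, long delayMs) {
        List<String> log = new ArrayList<>();
        TreeSet<Long> timers = new TreeSet<>();
        long watermark = Long.MIN_VALUE;
        long maxTimestamp = Long.MIN_VALUE;
        for (long ts : eventTimestamps) {
            // The element is processed before the watermark it causes is emitted.
            log.add("element " + ts + " watermark " + watermark);
            timers.add(ts + delayMs);
            // Assumption: watermark = largest timestamp seen so far minus 1 ms.
            maxTimestamp = Math.max(maxTimestamp, ts);
            watermark = maxTimestamp - 1;
            fireDueTimers(timers, watermark, log);
        }
        // Assumption: end of input advances the watermark to the maximum value.
        fireDueTimers(timers, Long.MAX_VALUE, log);
        return log;
    }

    /** Fires every timer whose time is less than or equal to the watermark, earliest first. */
    private static void fireDueTimers(TreeSet<Long> timers, long watermark, List<String> log) {
        while (!timers.isEmpty() && timers.first() <= watermark) {
            log.add("timer " + timers.pollFirst());
        }
    }

    public static void main(String[] args) {
        for (String line : run(new long[] {1000L, 11000L, 11001L}, 10_000L)) {
            System.out.println(line);
        }
    }
}
```

With the three timestamps from the example, the model logs the watermarks Long.MIN_VALUE, 999 and 10999, then the timers 11000, 21000 and 21001, which is the same sequence as the job output.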
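3. Window process functions
The window-based ProcessWindowFunction and ProcessAllWindowFunction are also commonly used.
For window computations we can use the ready-made aggregations (sum/max/min) directly, or call reduce or aggregate to define our own incremental aggregation (ReduceFunction/AggregateFunction). For more complex cases that need window information or extra state, we can use a full-window function instead: all data is kept in the window and processed in one go when the window fires.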
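The difference between the two styles is mostly about what the window has to keep. The plain-Java sketch below contrasts them for a simple count; WindowAggregationSketch and its nested classes are hypothetical names and do not use the Flink API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Contrasts incremental aggregation with full-window processing for a count. */
final class WindowAggregationSketch {

    /** Incremental style: only the accumulator is stored, updated per element. */
    static final class IncrementalCount {
        private long count;

        void add(String url) {
            count++;
        }

        long fire() {
            return count;
        }
    }

    /** Full-window style: every element is buffered and processed when the window fires. */
    static final class FullWindowCount {
        private final List<String> buffer = new ArrayList<>();

        void add(String url) {
            buffer.add(url);
        }

        int buffered() {
            return buffer.size();
        }

        long fire() {
            long count = 0;
            for (String ignored : buffer) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) {
        IncrementalCount incremental = new IncrementalCount();
        FullWindowCount fullWindow = new FullWindowCount();
        for (String url : Arrays.asList("./home", "./cart", "./home")) {
            incremental.add(url);
            fullWindow.add(url);
        }
        System.out.println(incremental.fire() + " vs " + fullWindow.fire()
                + " (buffered elements: " + fullWindow.buffered() + ")");
    }
}
```

Both give the same answer, but the full-window variant holds every element until the window fires, which is the price paid for having all elements and the window metadata available at once.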
1. ProcessWindowFunction
ProcessWindowFunction is both a process function and a full-window function. Its definition is as follows:
public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window>
extends AbstractRichFunction {
public abstract void process(
KEY key, Context context, Iterable<IN> elements, Collector<OUT> out) throws Exception;
public void clear(Context context) throws Exception {}
/** The context holding window metadata. */
public abstract class Context implements java.io.Serializable {
/** Returns the window that is being evaluated. */
public abstract W window();
/** Returns the current processing time. */
public abstract long currentProcessingTime();
/** Returns the current event-time watermark. */
public abstract long currentWatermark();
/**
* State accessor for per-key and per-window state.
*
* <p><b>NOTE:</b>If you use per-window state you have to ensure that you clean it up by
* implementing {@link ProcessWindowFunction#clear(Context)}.
*/
public abstract KeyedStateStore windowState();
/** State accessor for per-key global state. */
public abstract KeyedStateStore globalState();
/**
* Emits a record to the side output identified by the {@link OutputTag}.
*
* @param outputTag the {@code OutputTag} that identifies the side output to emit to.
* @param value The record to emit.
*/
public abstract <X> void output(OutputTag<X> outputTag, X value);
}
}
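The process method differs from the processElement method seen earlier: it does not handle elements one at a time but receives the whole batch of elements in the window. There is also an additional clear method, which is mainly used to clean up when the window is purged.
There is no TimerService object. Time can only be read through the context's currentProcessingTime() and currentWatermark() methods. The context also provides accessors for per-window state and global state.
A window already carries its own point in time at which computation is triggered. If further timing is needed, a Trigger can be used: the TriggerContext inside a trigger plays a role similar to TimerService, offering ways to read the time and to register and delete timers.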
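2. ProcessAllWindowFunction
It works very much like ProcessWindowFunction above, except that it is based on AllWindowedStream: a stream that has not been keyed is windowed directly and .process() is called on it. Its API is as follows: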
public abstract class ProcessAllWindowFunction<IN, OUT, W extends Window>
extends AbstractRichFunction {
public abstract void process(Context context, Iterable<IN> elements, Collector<OUT> out)
throws Exception;
public void clear(Context context) throws Exception {}
...
}
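3. Use case: Top N
Suppose we want a real-time ranking of popular URLs over a period of time: the two most visited URLs of the last ten seconds, updated every five seconds.
Analysis: this calls for a sliding window with a size of 10 seconds and a slide of 5 seconds.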
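With a size of 10 seconds and a slide of 5 seconds, every event belongs to size / slide = 2 windows. The sketch below computes those windows in plain Java. SlidingWindowSketch is a hypothetical helper, and aligning window starts to multiples of the slide with no offset is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes the sliding windows [start, end) that contain a given timestamp. */
final class SlidingWindowSketch {

    /** Returns one {start, end} pair per window, latest window first. */
    static List<long[]> windowsFor(long timestamp, long sizeMs, long slideMs) {
        List<long[]> windows = new ArrayList<>();
        // Start of the most recent window that contains the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp, slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - sizeMs; start -= slideMs) {
            windows.add(new long[] {start, start + sizeMs});
        }
        return windows;
    }

    public static void main(String[] args) {
        // An event at 12 s falls into [10 s, 20 s) and [5 s, 15 s).
        for (long[] window : windowsFor(12_000L, 10_000L, 5_000L)) {
            System.out.println("[" + window[0] + ", " + window[1] + ")");
        }
    }
}
```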
0. Source code (the data source)
package cn.qz.process;
import cn.qz.time.MyEvent;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.util.Calendar;
import java.util.Random;
class ClickSource implements SourceFunction<MyEvent> {
    // Flag that controls whether the source keeps generating data
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<MyEvent> ctx) throws Exception {
        Random random = new Random(); // Picks random values from the fixed data sets below
        String[] users = {"Mary", "Alice", "Bob", "Cary"};
        String[] urls = {"./home", "./cart", "./fav", "./prod?id=1", "./prod?id=2"};
        while (running) {
            ctx.collect(new MyEvent(
                    users[random.nextInt(users.length)],
                    urls[random.nextInt(urls.length)],
                    Calendar.getInstance().getTimeInMillis()
            ));
            // Emit one click event per second so the output is easy to follow
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
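1. Implementation with ProcessAllWindowFunction
This approach is straightforward. All data of one window is computed in a single subtask. A HashMap first keeps each url and its visit count; the entries are then converted to Tuple2, stored in an ArrayList, sorted, and emitted.
Code: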
package cn.qz.process;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
public class ProcessAllWindowTopN {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        DataStreamSource<MyEvent> myEventDataStreamSource = executionEnvironment.addSource(new ClickSource());
        // Assign timestamps and watermarks
        SingleOutputStreamOperator<MyEvent> eventStream = myEventDataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forMonotonousTimestamps().withTimestampAssigner(
                new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }
        ));
        eventStream.map(new MapFunction<MyEvent, String>() {
                    @Override
                    public String map(MyEvent value) throws Exception {
                        return value.getUrl();
                    }
                })
                // Sliding window: size 10 s, slide 5 s
                .windowAll(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                .process(new ProcessAllWindowFunction<String, String, TimeWindow>() {
                    @Override
                    public void process(Context context, Iterable<String> elements, Collector<String> out) throws Exception {
                        // Count the visits per url: <url, count>
                        HashMap<String, Long> urlCountMap = new HashMap<>();
                        for (String url : elements) {
                            urlCountMap.merge(url, 1L, Long::sum);
                        }
                        // Copy the entries into a list of Tuple2 so they can be sorted by count
                        List<Tuple2<String, Long>> tuple2s = new ArrayList<>();
                        urlCountMap.forEach((k, v) -> tuple2s.add(Tuple2.of(k, v)));
                        // Descending by count; Long.compare avoids the overflow risk of int subtraction
                        tuple2s.sort(new Comparator<Tuple2<String, Long>>() {
                            @Override
                            public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
                                return Long.compare(o2.f1, o1.f1);
                            }
                        });
                        // Take the first two entries after sorting and build the output
                        StringBuilder result = new StringBuilder();
                        result.append("========================================\n");
                        for (int i = 0; i < 2; i++) {
                            if (tuple2s.size() >= (i + 1)) {
                                Tuple2<String, Long> temp = tuple2s.get(i);
                                String info = "Views No." + (i + 1) +
                                        " url: " + temp.f0 +
                                        " views: " + temp.f1 +
                                        " window end: " + new Timestamp(context.window().getEnd()) + "\n";
                                result.append(info);
                            }
                        }
                        result.append("========================================\n");
                        out.collect(result.toString());
                    }
                })
                // Print the result
                .print();
        executionEnvironment.execute();
    }
}
Output:
========================================
Views No.1 url: ./cart views: 1 window end: 2022-08-30 15:31:50.0
Views No.2 url: ./fav views: 1 window end: 2022-08-30 15:31:50.0
========================================
========================================
Views No.1 url: ./prod?id=2 views: 2 window end: 2022-08-30 15:31:55.0
Views No.2 url: ./prod?id=1 views: 2 window end: 2022-08-30 15:31:55.0
========================================
========================================
Views No.1 url: ./prod?id=2 views: 3 window end: 2022-08-30 15:32:00.0
Views No.2 url: ./prod?id=1 views: 3 window end: 2022-08-30 15:32:00.0
========================================
========================================
Views No.1 url: ./fav views: 3 window end: 2022-08-30 15:32:05.0
...
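The counting and ranking step inside process() can be tried in isolation. The following plain-Java sketch shows one way to write it. TopNSketch is a hypothetical helper, and breaking ties alphabetically by url is an assumption added here only to make the result deterministic; the job above leaves the order of ties to the HashMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Counts url occurrences and returns the n most frequent ones. */
final class TopNSketch {

    static List<Map.Entry<String, Long>> topN(Iterable<String> urls, int n) {
        // Count the visits per url.
        Map<String, Long> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(url, 1L, Long::sum);
        }
        // Sort descending by count, then ascending by url to break ties.
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Guard against windows that contain fewer than n distinct urls.
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> window = Arrays.asList("./home", "./cart", "./home", "./fav", "./cart", "./home");
        for (Map.Entry<String, Long> entry : topN(window, 2)) {
            System.out.println(entry.getKey() + " views: " + entry.getValue());
        }
    }
}
```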
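2. Implementation with ProcessWindowFunction + KeyedProcessFunction
This approach is more involved. After keying by url, the data of one window is spread across different subtasks, so the keyed window computation only yields one statistic per url and window (url, visit count, window start time, window end time). These statistics are then keyed again by window end time, and a KeyedProcessFunction collects and ranks them.
(1) Key by url first and aggregate into UrlViewCount objects (url, visit count, window start time, window end time).
(2) Then key by UrlViewCount.windowEnd and use a KeyedProcessFunction to compute the Top N.
Code (note the API that TopN uses to buffer elements in list state, and the timer API that triggers the computation):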
package cn.qz.process;
import cn.qz.time.MyEvent;
import cn.qz.window.UrlViewCount;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
public class ProcessWindowFunctionTopN {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        SingleOutputStreamOperator<MyEvent> eventSingleOutputStreamOperator = executionEnvironment.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                            @Override
                            public long extractTimestamp(MyEvent element, long recordTimestamp) {
                                return element.getTimestamp();
                            }
                        })
                );
        // Key by url and compute the visit count of each url per window
        SingleOutputStreamOperator<UrlViewCount> aggregate = eventSingleOutputStreamOperator.keyBy(data -> data.url)
                .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                .aggregate(new UrlViewCountAgg(), new UrlViewCountResult());
        // Collect the statistics that belong to the same window and rank them
        SingleOutputStreamOperator<String> result = aggregate.keyBy(data -> data.windowEnd)
                .process(new TopN(2));
        result.print("result");
        executionEnvironment.execute();
    }

    // Custom incremental aggregation: counts the elements
    public static class UrlViewCountAgg implements AggregateFunction<MyEvent, Long, Long> {
        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(MyEvent value, Long accumulator) {
            return accumulator + 1;
        }

        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }

        @Override
        public Long merge(Long a, Long b) {
            // Only called for merging windows (for example session windows); summing keeps it correct
            return a + b;
        }
    }

    // Custom full-window function that only wraps the count with the window information
    public static class UrlViewCountResult extends ProcessWindowFunction<Long, UrlViewCount, String, TimeWindow> {
        @Override
        public void process(String url, Context context, Iterable<Long> elements, Collector<UrlViewCount> out) throws Exception {
            // Combine the pre-aggregated count with the window bounds
            Long start = context.window().getStart();
            Long end = context.window().getEnd();
            out.collect(new UrlViewCount(url, elements.iterator().next(), start, end));
        }
    }

    // Custom process function: sort and take the top n
    public static class TopN extends KeyedProcessFunction<Long, UrlViewCount, String> {
        // n is passed in as a field
        private final int n;
        // List state that buffers the counts of one window
        private ListState<UrlViewCount> urlViewCountListState;

        public TopN(Integer n) {
            this.n = n;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            // Obtain the list state handle from the runtime context
            urlViewCountListState = getRuntimeContext().getListState(
                    new ListStateDescriptor<UrlViewCount>("url-view-count-list",
                            Types.POJO(UrlViewCount.class)));
        }

        @Override
        public void processElement(UrlViewCount value, Context ctx, Collector<String> out) throws Exception {
            // Buffer the count in list state
            urlViewCountListState.add(value);
            // Register a timer at window end + 1 ms; by then all counts of the window have arrived
            ctx.timerService().registerEventTimeTimer(ctx.getCurrentKey() + 1);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // Copy the buffered counts into an ArrayList so they can be sorted
            ArrayList<UrlViewCount> urlViewCountArrayList = new ArrayList<>();
            for (UrlViewCount urlViewCount : urlViewCountListState.get()) {
                urlViewCountArrayList.add(urlViewCount);
            }
            // Clear the state to release resources
            urlViewCountListState.clear();
            // Sort descending by count; Long.compare avoids the overflow risk of int subtraction
            urlViewCountArrayList.sort(new Comparator<UrlViewCount>() {
                @Override
                public int compare(UrlViewCount o1, UrlViewCount o2) {
                    return Long.compare(o2.count, o1.count);
                }
            });
            // Take the top n and build the output
            StringBuilder result = new StringBuilder();
            result.append("========================================\n");
            result.append("Window end: " + new Timestamp(timestamp - 1) + "\n");
            // A window may contain fewer than n distinct urls, so bound the loop by the list size
            for (int i = 0; i < Math.min(this.n, urlViewCountArrayList.size()); i++) {
                UrlViewCount urlViewCount = urlViewCountArrayList.get(i);
                String info = "No." + (i + 1) + " "
                        + "url: " + urlViewCount.url + " "
                        + "views: " + urlViewCount.count + "\n";
                result.append(info);
            }
            result.append("========================================\n");
            out.collect(result.toString());
        }
    }
}
UrlViewCount:
package cn.qz.window;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@AllArgsConstructor
@NoArgsConstructor
public class UrlViewCount {
    public String url;
    public Long count;
    public Long windowStart;
    public Long windowEnd;
}
Output:
result> ========================================
Window end: 2022-08-30 16:02:50.0
No.1 url: ./fav views: 3
No.2 url: ./prod?id=1 views: 1
========================================
result> ========================================
Window end: 2022-08-30 16:02:55.0
No.1 url: ./fav views: 5
No.2 url: ./prod?id=1 views: 2
========================================
result> ========================================
Window end: 2022-08-30 16:03:00.0
No.1 url: ./fav views: 3
No.2 url: ./prod?id=2 views: 3
========================================
...
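The second stage of this approach (group the per-url counts by window end, then rank inside each group) can also be modelled without Flink. In the sketch below, TwoStageTopNSketch and its Count class are hypothetical stand-ins for the job and for UrlViewCount, and the whole input is ranked at once instead of being driven by timers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups per-url counts by window end and ranks each group. */
final class TwoStageTopNSketch {

    /** Stand-in for UrlViewCount with only the fields the ranking needs. */
    static final class Count {
        final String url;
        final long count;
        final long windowEnd;

        Count(String url, long count, long windowEnd) {
            this.url = url;
            this.count = count;
            this.windowEnd = windowEnd;
        }
    }

    /** Returns, per window end, the urls of the n largest counts in descending order. */
    static Map<Long, List<String>> topNPerWindow(List<Count> counts, int n) {
        // Plays the role of keyBy(windowEnd): one bucket per window.
        Map<Long, List<Count>> byWindowEnd = new TreeMap<>();
        for (Count c : counts) {
            byWindowEnd.computeIfAbsent(c.windowEnd, k -> new ArrayList<>()).add(c);
        }
        // Plays the role of onTimer: sort each bucket and keep at most n entries.
        Map<Long, List<String>> result = new TreeMap<>();
        for (Map.Entry<Long, List<Count>> entry : byWindowEnd.entrySet()) {
            List<Count> sorted = new ArrayList<>(entry.getValue());
            sorted.sort((a, b) -> Long.compare(b.count, a.count));
            List<String> top = new ArrayList<>();
            for (int i = 0; i < Math.min(n, sorted.size()); i++) {
                top.add(sorted.get(i).url);
            }
            result.put(entry.getKey(), top);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Count> counts = Arrays.asList(
                new Count("./fav", 3, 10_000L),
                new Count("./home", 1, 10_000L),
                new Count("./cart", 2, 10_000L),
                new Count("./fav", 5, 15_000L));
        System.out.println(topNPerWindow(counts, 2)); // prints {10000=[./fav, ./cart], 15000=[./fav]}
    }
}
```

The window ending at 15000 has a single url, which is why the loop is bounded by the bucket size rather than by n alone.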
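3. Side outputs
Process functions have one more distinctive capability: they can emit custom data to a "side output". A side output can be thought of as a branch that forks off the main stream.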
package cn.qz.process;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import java.time.Duration;
public class Process1 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Build the test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),
                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L),
                new MyEvent("tq", "/product?id=3", 1800L)
        );
        // Define the side output tag; the anonymous subclass preserves the generic type information
        OutputTag<String> outputTag = new OutputTag<String>("late") {
        };
        // Out-of-order stream; a bound of zero tolerates no lateness
        SingleOutputStreamOperator<String> process = dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                        .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                            @Override
                            public long extractTimestamp(MyEvent element, long recordTimestamp) {
                                return element.getTimestamp();
                            }
                        }))
                // No keyBy or window is needed: a plain ProcessFunction runs directly on the DataStream
                .process(new ProcessFunction<MyEvent, String>() {
                    @Override
                    public void processElement(MyEvent value, Context ctx, Collector<String> out) throws Exception {
                        if ("zs".equals(value.user)) {
                            // Events of user "zs" stay in the main stream
                            out.collect(value.getUser());
                        } else {
                            // Everything else goes to the side output
                            ctx.output(outputTag, value.getUser());
                        }
                    }
                });
        // Print the main stream and the side output separately
        process.print();
        process.getSideOutput(outputTag).print("late");
        executionEnvironment.execute();
    }
}
Output:
late:7> ls
2> zs
late:6> ls
1> zs
5> zs
3> zs
4> zs
late:8> ls
late:3> tq
late:1> ls
late:2> ls
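The routing performed by processElement in the side-output example can be modelled in a few lines of plain Java. SideOutputSketch is a hypothetical illustration: the map keys "main" and "late" stand in for the main stream and the OutputTag, and no Flink API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Splits records into a main output and a tagged side output. */
final class SideOutputSketch {

    static final String MAIN = "main";
    static final String LATE = "late";

    /** Sends user "zs" to the main output and every other user to the side output. */
    static Map<String, List<String>> route(List<String> users) {
        Map<String, List<String>> outputs = new HashMap<>();
        outputs.put(MAIN, new ArrayList<>());
        outputs.put(LATE, new ArrayList<>());
        for (String user : users) {
            String target = "zs".equals(user) ? MAIN : LATE;
            outputs.get(target).add(user);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList(
                "zs", "zs", "zs", "zs", "zs", "ls", "ls", "ls", "ls", "ls", "tq");
        Map<String, List<String>> outputs = route(users);
        System.out.println("main: " + outputs.get(MAIN));
        System.out.println("late: " + outputs.get(LATE));
    }
}
```

For the eleven test records this gives five entries in the main output and six in the side output, the same split as in the job output.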