Flink Windows
1. Concepts
Windows are the core tool for processing unbounded streams. A window is defined to include its start time and exclude its end time, i.e. a left-closed, right-open interval: a 0-10s window is written [0, 10).
Note that Flink windows are not prepared statically; they are created dynamically. A window is created only when the first element falling into its range arrives. When the window's end time is reached, the window fires its computation and closes.
An example: the watermark delay is 2s, the tumbling window size is 10s, and the data arrives in this order: 2, 3, 11, 9, 12, 15, 22, 23
1. When the 2s element arrives, the first window [0, 10) is created.
2. The 3s element arrives and falls into the first window.
3. The 11s element arrives; it does not belong to the first window, so the second window is created and the element falls into it. Since the watermark delay is 2s, the event-time clock now reads 9s, so the first window has not yet reached its close time.
4. The 9s element arrives (out of order, but within the delay) and still falls into the first window.
5. The 12s element arrives; the watermark advances to 10s, so the first window fires and closes, and this element falls into the second window.
6. The 15s element arrives and falls into the second window.
7. The 22s element arrives; the watermark reaches 20s and the second window closes; the third window is opened and the element falls into it.
8. The 23s element arrives and falls into the third window.
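The walkthrough above can be reproduced in plain Java (no Flink dependency; the window size, delay, and event sequence are the values from the example, and the helper name is made up for illustration):

```java
import java.util.*;

public class WatermarkWindowDemo {
    // returns the start timestamps of windows, in the order they are closed
    static List<Long> closedWindows(long[] events, long delay, long size) {
        long maxTs = Long.MIN_VALUE;
        List<Long> closed = new ArrayList<>();
        Set<Long> seen = new HashSet<>();
        for (long ts : events) {
            maxTs = Math.max(maxTs, ts);
            long watermark = maxTs - delay;              // the event-time clock
            // a window [s, s + size) closes once the watermark reaches s + size
            for (long s = 0; s + size <= watermark; s += size) {
                if (seen.add(s)) closed.add(s);          // each window fires only once
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        // arrival order from the walkthrough: watermark delay 2s, window size 10s
        long[] events = {2, 3, 11, 9, 12, 15, 22, 23};
        System.out.println(closedWindows(events, 2, 10));  // [0, 10]
    }
}
```

The 12s element pushes the watermark to 10 and closes [0, 10); the 22s element pushes it to 20 and closes [10, 20), matching the steps above.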
2. Classification
- By what drives the window
Windows that slice the stream into time spans are time windows (Time); windows driven by a fixed element count are count windows (Count).
- By how elements are assigned to windows
Tumbling windows, sliding windows, session windows, and global windows.
1. Tumbling window: fixed-size windows, based on either time or count; every element belongs to exactly one window. Only a single window-size parameter is needed.
2. Sliding window: adjacent windows are not placed end to end; they can overlap, offset by a fixed step, like a window "sliding" forward in small steps. Two parameters are needed: the window size and the slide (which determines how often the window fires). For example, a window of length 1 hour with a slide of 5 minutes computes statistics over the last hour, once every 5 minutes. Sliding windows can likewise be defined by time or by count.
3. Session window: time-based only. Simply put: a session opens when data arrives, stays open while more data keeps coming, and closes (firing its computation) once no data has arrived for a certain time. It takes a gap parameter specifying the interval, or a custom SessionWindowTimeGapExtractor.
4. Global window: all data with the same key goes into one window; effectively there is no windowing, and computation is never triggered by default. To get output from it, a custom trigger must be defined.
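For the sliding case, the assignment rule can be sketched in plain Java (offset 0, non-negative timestamps; `assignWindows` is a hypothetical helper mirroring the [start, start + size) membership check, not Flink API):

```java
import java.util.*;

public class SlidingAssignDemo {
    // start timestamps of the sliding windows containing a given timestamp
    static List<Long> assignWindows(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide);          // latest window that can contain ts
        for (long s = lastStart; s > ts - size; s -= slide) {
            starts.add(s);                           // ts lies in [s, s + size)
        }
        return starts;
    }

    public static void main(String[] args) {
        // size 10, slide 5: every element belongs to size / slide = 2 windows
        System.out.println(assignWindows(7, 10, 5));   // [5, 0]
        System.out.println(assignWindows(2, 10, 5));   // [0, -5]
    }
}
```

Note that the earliest windows can have negative start timestamps, which is why window starts like -5000 show up in the sliding-window results later in this article.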
3. Basic usage
1. Keyed and non-keyed windows
Keyed: after a keyBy, the stream is split by key into multiple logical streams (a KeyedStream), and windows are applied per key.
stream.keyBy(...)
.window(...) // keyed window
Non-keyed: call .windowAll() directly on the DataStream (a non-parallel operation; any configured parallelism has no effect):
dataStreamSource.windowAll()
2. Window API basics
A window operation has two parts: a window assigner (WindowAssigner), which specifies the window type, and a window function (WindowFunction), which defines the window's computation logic.
dataStreamSource.keyBy(<key selector>)
.window(<window assigner>)
.aggregate(<window function>)
1. Window assigners
Time window assigners: {processing time, event time} × {tumbling, sliding, session} = 6 kinds
// tumbling processing-time window
dataStreamSource.keyBy(data -> data.user)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.aggregate(...)
// sliding processing-time window
dataStreamSource.keyBy(data -> data.user)
.window(SlidingProcessingTimeWindows.of(Time.seconds(5), Time.seconds(2)))
.aggregate(...)
// processing-time session window: fixed gap (commented) or dynamic gap via an extractor
// dataStreamSource.keyBy(data -> data.user)
// .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
// .aggregate(...)
dataStreamSource.keyBy(data -> data.user)
.window(ProcessingTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor<MyEvent>() {
@Override
public long extract(MyEvent element) {
return 0;
}
}))
.aggregate(...)
// tumbling event-time window
dataStreamSource.keyBy(data -> data.user)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.aggregate(...)
// sliding event-time window
dataStreamSource.keyBy(data -> data.user)
.window(SlidingEventTimeWindows.of(Time.seconds(5), Time.seconds(2)))
.aggregate(...)
// event-time session window
dataStreamSource.keyBy(data -> data.user)
.window(EventTimeSessionWindows.withGap(Time.seconds(10)))
.aggregate(...)
// dataStreamSource.keyBy(data -> data.user)
// .window(EventTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor<MyEvent>() {
// @Override
// public long extract(MyEvent element) {
// return 0;
// }
// }))
// .aggregate(...)
Count windows: tumbling count and sliding count
// tumbling count window
dataStreamSource.keyBy(data -> data.user)
.countWindow(10)
.aggregate(...)
// sliding count window: pass size (window size) and slide (step)
dataStreamSource.keyBy(data -> data.user)
.countWindow(10, 3)
.aggregate(...)
Global window: the low-level building block of count windows; normally used when defining custom windows
// global window
dataStreamSource.keyBy(data -> data.user)
.window(GlobalWindows.create())
Looking at the countWindow source shows it is built on a global window plus its own trigger (a useful reference when writing custom triggers):
org.apache.flink.streaming.api.datastream.KeyedStream#countWindow(long)
public WindowedStream<T, KEY, GlobalWindow> countWindow(long size) {
return window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(size)));
}
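The behavior of that PurgingTrigger.of(CountTrigger.of(size)) combination can be mimicked in plain Java (a sketch of the semantics, not the Flink implementation):

```java
import java.util.*;

public class CountTriggerDemo {
    // buffer elements; when the count reaches size, fire (emit) and purge (clear)
    static List<List<Integer>> process(int[] elements, int size) {
        List<List<Integer>> fired = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        for (int e : elements) {
            buffer.add(e);
            if (buffer.size() >= size) {   // CountTrigger says FIRE; PurgingTrigger upgrades it to FIRE_AND_PURGE
                fired.add(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
        return fired;   // a trailing partial buffer is never emitted, just like countWindow
    }

    public static void main(String[] args) {
        System.out.println(process(new int[]{1, 2, 3, 4, 5, 6, 7}, 3)); // [[1, 2, 3], [4, 5, 6]]
    }
}
```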
2. Window functions
The stream obtained from the window transformation is a WindowedStream. It is not a DataStream, so no further transformations can be applied to it directly; a window function must first process the collected data before we get a DataStream again.
By processing style, window functions split into incremental aggregate functions and full-window functions.
1. Incremental aggregate functions: reduce, aggregate
Incremental aggregation keeps one aggregate state per window and updates it each time an element arrives, instead of buffering all elements until the window fires.
Typical examples are ReduceFunction and AggregateFunction.
(1) ReduceFunction: reduces the elements collected in the window pairwise.
Example: with a tumbling window, collect the urls each user visits every 2s.
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.time.Duration;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
// build test data
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1500L),
new MyEvent("zs", "/product?id=1", 2000L),
new MyEvent("zs", "/product?id=2", 2300L),
new MyEvent("zs", "/product?id=3", 1800L),
new MyEvent("ls", "/user", 1000L),
new MyEvent("ls", "/order", 1500L),
new MyEvent("ls", "/product?id=1", 2000L),
new MyEvent("ls", "/product?id=2", 2300L),
new MyEvent("ls", "/product?id=3", 1800L)
);
// out-of-order stream (watermark delay 0s)
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// key by user, then window and aggregate
.keyBy(data -> data.user)
// tumbling event-time window: one window every 2s
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.reduce(new ReduceFunction<MyEvent>() {
@Override
public MyEvent reduce(MyEvent value1, MyEvent value2) throws Exception {
MyEvent myEvent = new MyEvent();
myEvent.setUrl(value1.getUrl() + "," + value2.getUrl());
myEvent.setUser(value1.getUser());
return myEvent;
}
})
.print();
executionEnvironment.execute();
}
}
Output:
7> MyEvent(user=zs, url=/user,/order,/product?id=3, timestamp=null)
7> MyEvent(user=zs, url=/product?id=1,/product?id=2, timestamp=null)
1> MyEvent(user=ls, url=/user,/order,/product?id=3, timestamp=null)
1> MyEvent(user=ls, url=/product?id=1,/product?id=2, timestamp=null)
Change to a sliding window and look at the results again:
1》Change the code to a sliding window:
.window(SlidingEventTimeWindows.of(Time.seconds(2), Time.seconds(1)))
2》Output:
1> MyEvent(user=ls, url=/user,/order,/product?id=3, timestamp=null)
7> MyEvent(user=zs, url=/user,/order,/product?id=3, timestamp=null)
7> MyEvent(user=zs, url=/user,/order,/product?id=1,/product?id=2,/product?id=3, timestamp=null)
7> MyEvent(user=zs, url=/product?id=1,/product?id=2, timestamp=null)
1> MyEvent(user=ls, url=/user,/order,/product?id=1,/product?id=2,/product?id=3, timestamp=null)
1> MyEvent(user=ls, url=/product?id=1,/product?id=2, timestamp=null)
(2) AggregateFunction
ReduceFunction has the limitation that its input and output types must match the stream's element type, so any reshaping has to happen in a map before the window. When we want the input, the intermediate state, and the output to be three different types, we use AggregateFunction.
The AggregateFunction interface:
public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {
// create the accumulator, the initial state of the aggregation; called once per aggregation task
ACC createAccumulator();
// fold a newly arrived element into the accumulator
ACC add(IN value, ACC accumulator);
// extract the output result from the accumulator
OUT getResult(ACC accumulator);
// merge two accumulators
ACC merge(ACC a, ACC b);
}
It can be seen as a generalized ReduceFunction. The three type parameters are: IN, the type of the elements in the input stream; ACC, the type of the intermediate aggregation state; and OUT, the type of the result.
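A plain-Java stand-in for the three type parameters, computing an average (here IN = long, ACC = a {sum, count} pair, OUT = double; the static methods mirror the interface but are not Flink API):

```java
public class AvgAggregateDemo {
    // ACC: long[]{sum, count}
    static long[] createAccumulator() { return new long[]{0L, 0L}; }

    // fold one input value into the accumulator
    static long[] add(long value, long[] acc) { return new long[]{acc[0] + value, acc[1] + 1}; }

    // OUT: the average, a different type from both IN and ACC
    static double getResult(long[] acc) { return (double) acc[0] / acc[1]; }

    public static void main(String[] args) {
        long[] acc = createAccumulator();
        for (long v : new long[]{2, 4, 9}) acc = add(v, acc);  // incremental: state updated per element
        System.out.println(getResult(acc));                    // 5.0
    }
}
```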
Example: PV (page views) and UV (unique visitors). PV/UV is the average number of visits per user. (Computed on a tumbling event-time window; the output is the user set plus the average click count.)
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.time.Duration;
import java.util.HashSet;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
// build test data
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1500L),
new MyEvent("zs1", "/product?id=1", 2000L),
new MyEvent("zs1", "/product?id=2", 2300L),
new MyEvent("zs2", "/product?id=3", 1800L),
new MyEvent("ls", "/user", 1000L),
new MyEvent("ls", "/order", 1500L),
new MyEvent("ls1", "/product?id=1", 2000L),
new MyEvent("ls1", "/product?id=2", 2300L),
new MyEvent("ls2", "/product?id=3", 1800L)
);
// out-of-order stream (watermark delay 0s)
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// put every element into the same group
.keyBy(data -> true)
// tumbling event-time window: one window every 2s
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.aggregate(new AggregateFunction<MyEvent, Tuple2<HashSet<String>, Long>, String>() {
@Override
public Tuple2<HashSet<String>, Long> createAccumulator() {
return Tuple2.of(new HashSet<String>(), 0L);
}
@Override
public Tuple2<HashSet<String>, Long> add(MyEvent value, Tuple2<HashSet<String>, Long> accumulator) {
// fold the element in, then return the updated accumulator
accumulator.f0.add(value.getUser());
return Tuple2.of(accumulator.f0, accumulator.f1 + 1L);
}
@Override
public String getResult(Tuple2<HashSet<String>, Long> accumulator) {
// called when the window fires; the result is sent downstream
return accumulator.f0 + "" + String.valueOf((double) accumulator.f1 / accumulator.f0.size());
}
@Override
public Tuple2<HashSet<String>, Long> merge(Tuple2<HashSet<String>, Long> a, Tuple2<HashSet<String>, Long> b) {
return null;
}
})
.print();
executionEnvironment.execute();
}
}
Output:
5> [ls, zs2, zs, ls2]1.5
5> [zs1, ls1]2.0
(3) In addition, Flink provides a set of predefined simple aggregations callable directly on a WindowedStream: sum/max/maxBy/min/minBy. They behave much like the KeyedStream aggregations and are implemented on top of AggregateFunction.
For example, max:
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.time.Duration;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
// build test data
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1500L),
new MyEvent("zs", "/product?id=1", 2000L),
new MyEvent("zs", "/product?id=2", 2300L),
new MyEvent("zs", "/product?id=3", 1800L),
new MyEvent("ls", "/user", 1000L),
new MyEvent("ls", "/order", 1500L),
new MyEvent("ls", "/product?id=1", 2000L),
new MyEvent("ls", "/product?id=2", 2300L),
new MyEvent("ls", "/product?id=3", 1800L)
);
// out-of-order stream (watermark delay 0s)
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// key by user
.keyBy(data -> data.getUser())
// tumbling event-time window: one window every 2s
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.max("timestamp")
.print();
executionEnvironment.execute();
}
}
Output:
7> MyEvent(user=zs, url=/user, timestamp=1800)
7> MyEvent(user=zs, url=/product?id=1, timestamp=2300)
1> MyEvent(user=ls, url=/user, timestamp=1800)
1> MyEvent(user=ls, url=/product?id=1, timestamp=2300)
Changed to maxBy:
.maxBy("timestamp")
Output:
1> MyEvent(user=ls, url=/product?id=3, timestamp=1800)
1> MyEvent(user=ls, url=/product?id=2, timestamp=2300)
7> MyEvent(user=zs, url=/product?id=3, timestamp=1800)
7> MyEvent(user=zs, url=/product?id=2, timestamp=2300)
As the output shows, max only guarantees that the requested field holds the maximum; the other fields keep the values of the first record in the window. maxBy returns the entire record that contains the maximum.
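The difference can be mimicked in plain Java (hypothetical Event class and helper names; a sketch of the semantics, not Flink's implementation):

```java
import java.util.Arrays;
import java.util.List;

public class MaxVsMaxByDemo {
    static class Event {
        final String url;
        final long ts;
        Event(String url, long ts) { this.url = url; this.ts = ts; }
    }

    // like max("timestamp"): keep the first record, replace only the compared field
    static Event maxField(List<Event> window) {
        Event first = window.get(0);
        long maxTs = first.ts;
        for (Event e : window) maxTs = Math.max(maxTs, e.ts);
        return new Event(first.url, maxTs);   // url stays from the first record
    }

    // like maxBy("timestamp"): keep the whole record holding the largest field
    static Event maxBy(List<Event> window) {
        Event best = window.get(0);
        for (Event e : window) if (e.ts > best.ts) best = e;
        return best;
    }

    public static void main(String[] args) {
        List<Event> window = Arrays.asList(
                new Event("/user", 1000), new Event("/order", 1500), new Event("/product?id=3", 1800));
        Event a = maxField(window), b = maxBy(window);
        System.out.println(a.url + "@" + a.ts);  // /user@1800
        System.out.println(b.url + "@" + b.ts);  // /product?id=3@1800
    }
}
```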
2. Full-window functions: apply, process
A full-window function first collects the window's data and buffers it internally, and only takes it out for computation when the window emits its result: a typical batch-processing approach.
These functions also get context information, such as the window's start time.
- WindowFunction
Its interface:
public interface WindowFunction<IN, OUT, KEY, W extends Window> extends Function, Serializable {
void apply(KEY key, W window, Iterable<IN> input, Collector<OUT> out) throws Exception;
}
Example: computing UV per tumbling window.
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
// build test data
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1500L),
new MyEvent("zs", "/product?id=1", 2000L),
new MyEvent("zs", "/product?id=2", 2300L),
new MyEvent("zs", "/product?id=3", 1800L),
new MyEvent("ls", "/user", 1000L),
new MyEvent("ls", "/order", 1500L),
new MyEvent("ls", "/product?id=1", 2000L),
new MyEvent("ls", "/product?id=2", 2300L),
new MyEvent("ls", "/product?id=3", 1800L)
);
// out-of-order stream
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// put every element into the same group
.keyBy(data -> true)
// tumbling event-time window: one window every 2s
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.apply(new WindowFunction<MyEvent, String, Boolean, TimeWindow>() {
@Override
public void apply(Boolean aBoolean, TimeWindow window, Iterable<MyEvent> input, Collector<String> out) throws Exception {
// compute UV
Set<String> usernames = new HashSet<String>();
for (MyEvent event : input) {
usernames.add(event.getUser());
}
String result = "key: " + aBoolean + "\tuv: " + usernames.size() + "窗口: " + new Timestamp(window.getStart()) + " - " + new Timestamp(window.getEnd());
out.collect(result);
}
})
.print();
executionEnvironment.execute();
}
}
Output:
5> key: true uv: 2窗口: 1970-01-01 08:00:00.0 - 1970-01-01 08:00:02.0
5> key: true uv: 2窗口: 1970-01-01 08:00:02.0 - 1970-01-01 08:00:04.0
- ProcessWindowFunction
An enhanced WindowFunction: through its context it can access the window metadata as well as the current watermark.
Its interface:
public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window>
extends AbstractRichFunction {
private static final long serialVersionUID = 1L;
public abstract void process(
KEY key, Context context, Iterable<IN> elements, Collector<OUT> out) throws Exception;
public void clear(Context context) throws Exception {}
}
Code:
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
// build test data
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1500L),
new MyEvent("zs", "/product?id=1", 2000L),
new MyEvent("zs", "/product?id=2", 2300L),
new MyEvent("zs", "/product?id=3", 1800L),
new MyEvent("ls", "/user", 1000L),
new MyEvent("ls", "/order", 1500L),
new MyEvent("ls", "/product?id=1", 2000L),
new MyEvent("ls", "/product?id=2", 2300L),
new MyEvent("ls", "/product?id=3", 1800L)
);
// out-of-order stream
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// key by user
.keyBy(data -> data.getUser())
// tumbling event-time window: one window every 2s
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.process(new ProcessWindowFunction<MyEvent, String, String, TimeWindow>() {
@Override
public void process(String s, Context context, Iterable<MyEvent> elements, Collector<String> out) throws Exception {
Long start = context.window().getStart();
Long end = context.window().getEnd();
Long currentWatermark = context.currentWatermark();
Long count = elements.spliterator().getExactSizeIfKnown();
// collect the elements into a result list
List<String> result = new ArrayList<>();
Iterator<MyEvent> iterator = elements.iterator();
while (iterator.hasNext()) {
result.add(iterator.next().toString());
}
out.collect("窗口" + start + " ~ " + end + "中共有" + count + "个元素,窗口闭合计算时,水位线处于:" + currentWatermark + " result: " + result);
}
})
.print();
executionEnvironment.execute();
}
}
Output:
7> 窗口0 ~ 2000中共有3个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=zs, url=/user, timestamp=1000), MyEvent(user=zs, url=/order, timestamp=1500), MyEvent(user=zs, url=/product?id=3, timestamp=1800)]
7> 窗口2000 ~ 4000中共有2个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=zs, url=/product?id=1, timestamp=2000), MyEvent(user=zs, url=/product?id=2, timestamp=2300)]
1> 窗口0 ~ 2000中共有3个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=ls, url=/user, timestamp=1000), MyEvent(user=ls, url=/order, timestamp=1500), MyEvent(user=ls, url=/product?id=3, timestamp=1800)]
1> 窗口2000 ~ 4000中共有2个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=ls, url=/product?id=1, timestamp=2000), MyEvent(user=ls, url=/product?id=2, timestamp=2300)]
3. Summary
Incremental aggregate functions are more efficient; full-window functions have the advantage of providing more information. The two can be combined: the second parameter of .reduce and .aggregate can be a full-window function.
Example: with a sliding window of size 10s and slide 5s, count the popularity (visit count) of each url:
(1) Define a POJO:
package cn.qz.window;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@AllArgsConstructor
@NoArgsConstructor
public class UrlViewCount {
public String url;
public Long count;
public Long windowStart;
public Long windowEnd;
}
(2) Test class:
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.time.Duration;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
executionEnvironment.setParallelism(1);
// build test data
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1500L),
new MyEvent("zs", "/product?id=1", 2000L),
new MyEvent("zs", "/product?id=2", 2300L),
new MyEvent("zs", "/product?id=3", 1800L),
new MyEvent("ls", "/user", 1000L),
new MyEvent("ls", "/order", 1500L),
new MyEvent("ls", "/product?id=1", 2000L),
new MyEvent("ls", "/product?id=2", 2300L),
new MyEvent("ls", "/product?id=3", 1800L),
new MyEvent("ww", "/user", 1000L),
new MyEvent("ww", "/order", 1500L)
);
// assign event-time timestamps and watermarks
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// key by url, then open a sliding window
.keyBy(data -> data.getUrl())
.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
// pass both an incremental aggregate function and a full-window function
.aggregate(new UrlViewCountAgg(), new UrlViewCountResult())
.print();
executionEnvironment.execute();
}
// custom incremental aggregate function: add one per element
public static class UrlViewCountAgg implements AggregateFunction<MyEvent, Long, Long> {
@Override
public Long createAccumulator() {
return 0L;
}
@Override
public Long add(MyEvent value, Long accumulator) {
return accumulator + 1;
}
@Override
public Long getResult(Long accumulator) {
return accumulator;
}
@Override
public Long merge(Long a, Long b) {
return null;
}
}
// custom window function that just wraps the window metadata
public static class UrlViewCountResult extends ProcessWindowFunction<Long, UrlViewCount, String, TimeWindow> {
@Override
public void process(String url, Context context, Iterable<Long> elements, Collector<UrlViewCount> out) throws Exception {
// wrap the output with the window info
Long start = context.window().getStart();
Long end = context.window().getEnd();
// the iterable holds exactly one element: the result of the incremental aggregate
out.collect(new UrlViewCount(url, elements.iterator().next(), start, end));
}
}
}
(3) Output
UrlViewCount(url=/user, count=3, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/order, count=3, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/product?id=2, count=2, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/product?id=3, count=2, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/product?id=1, count=2, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/order, count=3, windowStart=0, windowEnd=10000)
UrlViewCount(url=/product?id=3, count=2, windowStart=0, windowEnd=10000)
UrlViewCount(url=/product?id=1, count=2, windowStart=0, windowEnd=10000)
UrlViewCount(url=/product?id=2, count=2, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=3, windowStart=0, windowEnd=10000)
4. windowAll
UV statistics implemented with windowAll:
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
// build test data
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1500L),
new MyEvent("zs", "/product?id=1", 2000L),
new MyEvent("zs", "/product?id=2", 2300L),
new MyEvent("zs", "/product?id=3", 1800L),
new MyEvent("ls", "/user", 1000L),
new MyEvent("ls", "/order", 1500L),
new MyEvent("ls", "/product?id=1", 2000L),
new MyEvent("ls", "/product?id=2", 2300L),
new MyEvent("ls", "/product?id=3", 1800L)
);
// out-of-order stream
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// no keyBy needed: windowAll puts all data through one task
// .keyBy(data -> true)
// tumbling event-time window: one window every 2s
.windowAll(TumblingEventTimeWindows.of(Time.seconds(2)))
.apply(new AllWindowFunction<MyEvent, String, TimeWindow>() {
@Override
public void apply(TimeWindow window, Iterable<MyEvent> values, Collector<String> out) throws Exception {
// compute UV
Set<String> usernames = new HashSet<String>();
for (MyEvent event : values) {
usernames.add(event.getUser());
}
String result = "uv:" + usernames.size() + " 窗口:" + new Timestamp(window.getStart()) + " - " + new Timestamp(window.getEnd());
out.collect(result);
}
}).print();
executionEnvironment.execute();
}
}
Output:
1> uv:2 窗口:1970-01-01 08:00:02.0 - 1970-01-01 08:00:04.0
8> uv:2 窗口:1970-01-01 08:00:00.0 - 1970-01-01 08:00:02.0
5. Other APIs
- Trigger
A trigger decides when a window fires its computation. The Trigger is an internal property of the window operator: every window assigner (WindowAssigner) comes with a default trigger, and Flink's built-in window types already implement theirs. For example, event-time windows default to EventTimeTrigger; similarly there are ProcessingTimeTrigger and CountTrigger.
The Trigger class:
public abstract class Trigger<T, W extends Window> implements Serializable {
// called once for every element added to a window
public abstract TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx)
throws Exception;
// called when a registered processing-time timer fires
public abstract TriggerResult onProcessingTime(long time, W window, TriggerContext ctx)
throws Exception;
// called when a registered event-time timer fires
public abstract TriggerResult onEventTime(long time, W window, TriggerContext ctx)
throws Exception;
// called when the window is closed and destroyed; typically used to clean up custom state
public abstract void clear(W window, TriggerContext ctx) throws Exception;
}
All three event-handling methods return a TriggerResult, an enum:
/** No action is taken on the window. */
CONTINUE(false, false),
/** {@code FIRE_AND_PURGE} evaluates the window function and emits the window result. */
FIRE_AND_PURGE(true, true),
/**
* On {@code FIRE}, the window is evaluated and results are emitted. The window is not purged,
* though, all elements are retained.
*/
FIRE(true, false),
/**
* All elements in the window are cleared and the window is discarded, without evaluating the
* window function or emitting any elements.
*/
PURGE(false, true);
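The fire/purge flags can be illustrated with a small plain-Java simulation of how each value acts on a window's element buffer (a sketch of the semantics, not Flink code):

```java
import java.util.*;

public class TriggerResultDemo {
    // fire -> evaluate the window function and emit; purge -> drop the buffered elements
    static void apply(String result, List<Integer> buffer, List<List<Integer>> emitted) {
        boolean fire = result.equals("FIRE") || result.equals("FIRE_AND_PURGE");
        boolean purge = result.equals("PURGE") || result.equals("FIRE_AND_PURGE");
        if (fire) emitted.add(new ArrayList<>(buffer));  // window function sees the retained elements
        if (purge) buffer.clear();                       // state dropped; the window itself remains
    }

    public static void main(String[] args) {
        List<List<Integer>> emitted = new ArrayList<>();
        List<Integer> buf = new ArrayList<>(Arrays.asList(1, 2));
        apply("FIRE", buf, emitted);            // emits [1, 2], keeps the elements
        buf.add(3);
        apply("FIRE_AND_PURGE", buf, emitted);  // emits [1, 2, 3], clears the elements
        System.out.println(emitted + " buffer=" + buf);  // [[1, 2], [1, 2, 3]] buffer=[]
    }
}
```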
Example: compute the PV of each url over a 10s tumbling window, and add a custom trigger so that the window fires once every 1s.
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.time.Duration;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
executionEnvironment.setParallelism(1);
// build test data
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1200L)
);
// assign event-time timestamps and watermarks
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// key by url, tumbling window
.keyBy(r -> r.url)
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
.trigger(new MyTrigger())
.process(new WindowResult())
.print();
executionEnvironment.execute();
}
public static class WindowResult extends ProcessWindowFunction<MyEvent, UrlViewCount, String, TimeWindow> {
@Override
public void process(String s, Context context, Iterable<MyEvent> iterable, Collector<UrlViewCount> collector) throws Exception {
collector.collect(
new UrlViewCount(
s,
// number of elements in the iterable
iterable.spliterator().getExactSizeIfKnown(),
context.window().getStart(),
context.window().getEnd()
)
);
}
}
public static class MyTrigger extends Trigger<MyEvent, TimeWindow> {
@Override
public TriggerResult onElement(MyEvent event, long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
ValueState<Boolean> isFirstEvent = triggerContext.getPartitionedState(
new ValueStateDescriptor<Boolean>("first-event", Types.BOOLEAN)
);
if (isFirstEvent.value() == null) {
for (long i = timeWindow.getStart(); i < timeWindow.getEnd(); i = i + 1000L) {
triggerContext.registerEventTimeTimer(i);
}
isFirstEvent.update(true);
}
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onEventTime(long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
return TriggerResult.FIRE;
}
@Override
public TriggerResult onProcessingTime(long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
return TriggerResult.CONTINUE;
}
@Override
public void clear(TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
ValueState<Boolean> isFirstEvent = triggerContext.getPartitionedState(
new ValueStateDescriptor<Boolean>("first-event", Types.BOOLEAN)
);
isFirstEvent.clear();
}
}
}
Output:
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
- Evictor
An evictor removes certain elements from a window before or after the window function runs. It is set via the WindowedStream.evictor() method. The Evictor interface has two methods:
void evictBefore( // remove elements before the window function is applied
Iterable<TimestampedValue<T>> elements,
int size,
W window,
EvictorContext evictorContext);
void evictAfter( // remove elements after the window function is applied
Iterable<TimestampedValue<T>> elements,
int size,
W window,
EvictorContext evictorContext);
例如:
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.evictors.Evictor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.time.Duration;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
// 构造数据
DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
new MyEvent("zs", "/user", 1000L),
new MyEvent("zs", "/order", 1500L),
new MyEvent("zs", "/product?id=1", 2000L),
new MyEvent("zs", "/product?id=2", 2300L),
new MyEvent("zs", "/product?id=3", 1800L),
new MyEvent("ls", "/user", 1000L),
new MyEvent("ls", "/order", 1500L),
new MyEvent("ls", "/product?id=1", 2000L),
new MyEvent("ls", "/product?id=2", 2300L),
new MyEvent("ls", "/product?id=3", 1800L)
);
// 无序流
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// 用户分到相同的组进行计算
.keyBy(data -> true)
// 滚动事件时间窗口。 相当于每2s 是一个窗口
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.evictor(new Evictor<MyEvent, TimeWindow>() {
/**
*
* @param elements 包装了事件时间的元素
* @param size 窗口大小
* @param window 窗口信息
* @param evictorContext
*/
@Override
public void evictBefore(Iterable<TimestampedValue<MyEvent>> elements, int size, TimeWindow window, EvictorContext evictorContext) {
Iterator<TimestampedValue<MyEvent>> iterator = elements.iterator();
while (iterator.hasNext()) {
TimestampedValue<MyEvent> event = iterator.next();
long timestamp = event.getTimestamp(); // 事件时间
MyEvent value = event.getValue(); // 获取到值
if ("zs".equals(value.getUser())) {
System.out.println("remove value\t" + value + "\twindow: " + window.getStart() + "-" + window.getEnd());
iterator.remove();
}
}
}
@Override
public void evictAfter(Iterable<TimestampedValue<MyEvent>> elements, int size, TimeWindow window, EvictorContext evictorContext) {
}
})
.apply(new WindowFunction<MyEvent, String, Boolean, TimeWindow>() {
@Override
public void apply(Boolean aBoolean, TimeWindow window, Iterable<MyEvent> input, Collector<String> out) throws Exception {
// 计算UV
Set<String> usernames = new HashSet<String>();
for (MyEvent event : input) {
usernames.add(event.getUser());
}
String result = "key: " + aBoolean + "\tuv: " + usernames.size() + "窗口: " + new Timestamp(window.getStart()) + " - " + new Timestamp(window.getEnd());
out.collect(result);
}
})
.print();
executionEnvironment.execute();
}
}
结果:
remove value MyEvent(user=zs, url=/user, timestamp=1000) window: 0-2000
remove value MyEvent(user=zs, url=/order, timestamp=1500) window: 0-2000
remove value MyEvent(user=zs, url=/product?id=3, timestamp=1800) window: 0-2000
5> key: true uv: 1窗口: 1970-01-01 08:00:00.0 - 1970-01-01 08:00:02.0
remove value MyEvent(user=zs, url=/product?id=1, timestamp=2000) window: 2000-4000
remove value MyEvent(user=zs, url=/product?id=2, timestamp=2300) window: 2000-4000
5> key: true uv: 1窗口: 1970-01-01 08:00:02.0 - 1970-01-01 08:00:04.0
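上面 evictBefore 的核心逻辑(用迭代器遍历窗口数据并调用 iterator.remove() 移除元素,之后窗口函数只处理剩余元素)可以脱离 Flink 用纯 Java 模拟。下面是一个最小示意,其中 EvictorDemo、evictBefore、countUv 等命名均为示意用的假设,并非 Flink API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class EvictorDemo {

    // 模拟 evictBefore:在"窗口函数"执行前,移除指定用户的所有元素
    // (与示例中移除 "zs" 数据的逻辑一致)
    public static void evictBefore(List<String> users, String userToRemove) {
        Iterator<String> it = users.iterator();
        while (it.hasNext()) {
            if (userToRemove.equals(it.next())) {
                it.remove();
            }
        }
    }

    // 模拟窗口函数:对剩余元素计算 UV(去重后的用户数)
    public static int countUv(List<String> users) {
        Set<String> distinct = new HashSet<>(users);
        return distinct.size();
    }

    public static void main(String[] args) {
        List<String> window = new ArrayList<>(List.of("zs", "zs", "ls", "zs", "ls"));
        evictBefore(window, "zs");           // 先移除 zs 的数据
        System.out.println(countUv(window)); // 只剩 ls 的数据,UV 为 1
    }
}
```

这也解释了上面的结果:zs 的数据在进入窗口函数前就被移除,所以每个窗口统计出的 uv 都是 1。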
- 水位线和窗口函数结合使用
(1)例一:可以看出窗口关闭后,最后一条迟到的数据会被丢弃:
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
executionEnvironment.setParallelism(1);
SingleOutputStreamOperator<MyEvent> dataStreamSource = executionEnvironment.socketTextStream("192.168.13.111", 7777)
.map(new MapFunction<String, MyEvent>() {
@Override
public MyEvent map(String value) throws Exception {
String[] split = value.split(",");
return new MyEvent(split[0], split[1], Long.valueOf(split[2]));
}
});
// 无序流(且延迟时间是5s)
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// 根据user分组,开窗统计
.keyBy(data -> data.user)
// 滚动事件时间窗口。 相当于每10s 是一个窗口
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
.process(new WatermarkTestResult())
.print();
executionEnvironment.execute();
}
/**
* ProcessWindowFunction 依次为: <IN, OUT, KEY, W extends Window>
*/
private static class WatermarkTestResult extends ProcessWindowFunction<MyEvent, String, String, TimeWindow> {
@Override
public void process(String s, Context context, Iterable<MyEvent> elements, Collector<String> out) throws Exception {
Long start = context.window().getStart();
Long end = context.window().getEnd();
Long currentWatermark = context.currentWatermark();
Long count = elements.spliterator().getExactSizeIfKnown();
// 收集元素, 然后汇总到结果集
List<String> result = new ArrayList<>();
Iterator<MyEvent> iterator = elements.iterator();
while (iterator.hasNext()) {
result.add(iterator.next().toString());
}
out.collect("窗口" + start + " ~ " + end + "中共有" + count + "个元素,窗口闭合计算时,水位线处于:" + currentWatermark + " result: " + result);
}
}
}
1》在nc -l 7777 输入如下信息:
zs,./home,1000
zs,./cart,2000
zs,./prod?id=100,10000
zs,./prod?id=200,8000
zs,./prod?id=300,15000
2》控制台输出如下:(会触发计算)
窗口0 ~ 10000中共有3个元素,窗口闭合计算时,水位线处于:9999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000)]
3》再次在nc -l 终端输入如下信息
zs,./prod?id=400,9000
4》中断 nc -l 终端,查看控制台输出
5》控制台输出如下
窗口0 ~ 10000中共有3个元素,窗口闭合计算时,水位线处于:9999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000)]
窗口10000 ~ 20000中共有2个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=zs, url=./prod?id=100, timestamp=10000), MyEvent(user=zs, url=./prod?id=300, timestamp=15000)]
可以发现,当输入 [zs,./prod?id=300,15000] 时,流中会周期性地(默认200毫秒)插入一个时间戳为 15000L - 5*1000L - 1L = 9999 毫秒的水位线,已经到达窗口 [0, 10000) 的结束时间,所以触发该窗口的闭合计算,导致后面再输入时间戳为9000毫秒的数据无法进入窗口。
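上面的水位线计算和窗口触发条件可以用纯 Java 算一遍:水位线 = 当前观察到的最大事件时间 - 固定延迟 - 1 毫秒,窗口 [start, end) 在水位线到达 end - 1 时触发闭合计算。以下 WatermarkDemo、currentWatermark、shouldCloseWindow 均为示意用的假设命名,并非 Flink API:

```java
public class WatermarkDemo {

    // BoundedOutOfOrderness 策略下的水位线:最大事件时间 - 固定延迟 - 1ms
    public static long currentWatermark(long maxEventTimestamp, long maxOutOfOrdernessMillis) {
        return maxEventTimestamp - maxOutOfOrdernessMillis - 1L;
    }

    // 窗口 [start, end) 的触发条件:水位线 >= end - 1
    public static boolean shouldCloseWindow(long watermark, long windowEnd) {
        return watermark >= windowEnd - 1L;
    }

    public static void main(String[] args) {
        // 对应正文的例子:最大事件时间 15000ms、延迟 5s
        long wm = currentWatermark(15000L, 5_000L);
        System.out.println(wm);                             // 9999
        System.out.println(shouldCloseWindow(wm, 10_000L)); // true,窗口 [0, 10000) 闭合
        // 之后再来 timestamp=9000 的数据,小于当前水位线,属于迟到数据
        System.out.println(9000L <= wm);                    // true
    }
}
```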
(2)解决延迟数据思路
1》增大水位线延迟时间
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(6))
2》允许延迟
允许延迟(allowedLateness)是指:窗口到达结束时间时会触发一次计算并输出结果,但并不销毁窗口;在允许延迟的时间内,迟到数据到来后仍然会被处理。比如:对于8点-9点的窗口,不考虑水位线延迟,本来9点会触发计算并关闭窗口;设置允许延迟1分钟后,9点整会触发一次计算并输出结果,但不会关窗。后续到达的数据只要属于8点-9点的窗口,依然可以在之前统计的基础上继续计算,并再次输出一个更新后的结果。直到水位线到达9点零1分,这时才会真正清空状态、关闭窗口。
// 无序流(且延迟时间是5s)
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}))
// 根据user分组,开窗统计
.keyBy(data -> data.user)
// 滚动事件时间窗口。 相当于每10s 是一个窗口
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
// 方式二:允许窗口处理迟到数据,设置1分钟的等待时间
.allowedLateness(Time.minutes(1))
.process(new WatermarkTestResult())
.print();
继续按上面的操作步骤计算,结果:
窗口0 ~ 10000中共有3个元素,窗口闭合计算时,水位线处于:9999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000)]
窗口0 ~ 10000中共有4个元素,窗口闭合计算时,水位线处于:9999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000), MyEvent(user=zs, url=./prod?id=400, timestamp=9000)]
窗口10000 ~ 20000中共有2个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=zs, url=./prod?id=100, timestamp=10000), MyEvent(user=zs, url=./prod?id=300, timestamp=15000)]
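allowedLateness 下窗口的生命周期可以用纯 Java 示意:水位线到达 end - 1 时第一次触发计算;只有水位线到达 end + allowedLateness - 1 时才真正销毁窗口。以下类名与方法名为示意用的假设命名,并非 Flink API:

```java
public class LatenessDemo {

    // 是否已触发过首次计算(水位线到达窗口结束时间)
    public static boolean firstFired(long watermark, long windowEnd) {
        return watermark >= windowEnd - 1L;
    }

    // 窗口是否已被真正销毁(状态清空、不再接受迟到数据)
    public static boolean destroyed(long watermark, long windowEnd, long allowedLatenessMillis) {
        return watermark >= windowEnd + allowedLatenessMillis - 1L;
    }

    public static void main(String[] args) {
        long end = 10_000L, lateness = 60_000L; // 窗口 [0, 10000),允许延迟 1 分钟
        System.out.println(firstFired(9_999L, end));           // true:首次触发计算并输出结果
        System.out.println(destroyed(9_999L, end, lateness));  // false:窗口未销毁,迟到数据仍可更新结果
        System.out.println(destroyed(69_999L, end, lateness)); // true:水位线到达 end + 1 分钟,窗口销毁
    }
}
```

这正好对应上面的结果:timestamp=9000 的迟到数据到达时窗口 [0, 10000) 尚未销毁,所以又输出了一条包含 4 个元素的更新结果。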
3》将数据放入侧输出流
我们可以将迟到数据放入到"侧输出流"进行另外的处理。
package cn.qz.window;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
public class ReduceFunTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
executionEnvironment.setParallelism(1);
SingleOutputStreamOperator<MyEvent> dataStreamSource = executionEnvironment.socketTextStream("192.168.13.111", 7777)
.map(new MapFunction<String, MyEvent>() {
@Override
public MyEvent map(String value) throws Exception {
String[] split = value.split(",");
return new MyEvent(split[0], split[1], Long.valueOf(split[2]));
}
});
// 无序流(且延迟时间是1s)
SingleOutputStreamOperator<MyEvent> stream = dataStreamSource
.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(1))
.withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
@Override
public long extractTimestamp(MyEvent element, long recordTimestamp) {
return element.getTimestamp();
}
}));
// 定义侧输出流标签(使用匿名内部类,以保留泛型的类型信息)
OutputTag<MyEvent> outputTag = new OutputTag<MyEvent>("late") {
};
// 根据user分组,开窗统计
SingleOutputStreamOperator<String> result = stream.keyBy(data -> data.user)
// 滚动事件时间窗口
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
// 方式二:允许窗口处理迟到数据,设置1s的等待时间
// .allowedLateness(Time.seconds(1))
// 方式三:将最后的迟到数据输出到侧输出流
.sideOutputLateData(outputTag)
.process(new WatermarkTestResult());
// 必须用result, 不能用stream
result.print("result");
result.getSideOutput(outputTag).print("late");
// 为方便观察,可以将原始数据也输出
stream.print("input");
executionEnvironment.execute();
}
/**
* ProcessWindowFunction 依次为: <IN, OUT, KEY, W extends Window>
*/
private static class WatermarkTestResult extends ProcessWindowFunction<MyEvent, String, String, TimeWindow> {
@Override
public void process(String s, Context context, Iterable<MyEvent> elements, Collector<String> out) throws Exception {
Long start = context.window().getStart();
Long end = context.window().getEnd();
Long currentWatermark = context.currentWatermark();
Long count = elements.spliterator().getExactSizeIfKnown();
// 收集元素, 然后汇总到结果集
List<String> result = new ArrayList<>();
Iterator<MyEvent> iterator = elements.iterator();
while (iterator.hasNext()) {
result.add(iterator.next().toString());
}
out.collect("窗口" + start + " ~ " + end + "中共有" + count + "个元素,窗口闭合计算时,水位线处于:" + currentWatermark + " result: " + result);
}
}
}
结果:
input> MyEvent(user=zs, url=./home, timestamp=1000)
input> MyEvent(user=zs, url=./cart, timestamp=2000)
input> MyEvent(user=zs, url=./prod?id=100, timestamp=10000)
input> MyEvent(user=zs, url=./prod?id=200, timestamp=8000)
input> MyEvent(user=zs, url=./prod?id=300, timestamp=15000)
result> 窗口0 ~ 10000中共有3个元素,窗口闭合计算时,水位线处于:13999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000)]
input> MyEvent(user=zs, url=./prod?id=400, timestamp=9000)
late> MyEvent(user=zs, url=./prod?id=400, timestamp=9000)
result> 窗口10000 ~ 20000中共有2个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=zs, url=./prod?id=100, timestamp=10000), MyEvent(user=zs, url=./prod?id=300, timestamp=15000)]
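综合以上三种机制,一条数据到达时的去向可以用纯 Java 判断逻辑来示意:所属窗口尚未触发则正常进入窗口;已触发但在 allowedLateness 内则进入窗口并再次输出更新结果;窗口已销毁则只能进入侧输出流(未配置 sideOutputLateData 时被丢弃)。以下 LateDataDemo、Destination、classify 等命名为示意用的假设,并非 Flink API:

```java
public class LateDataDemo {

    public enum Destination { WINDOW, WINDOW_UPDATE, SIDE_OUTPUT }

    public static Destination classify(long eventTs, long watermark,
                                       long windowSize, long allowedLateness) {
        long windowStart = eventTs - (eventTs % windowSize); // 滚动窗口的起始时间
        long windowEnd = windowStart + windowSize;
        if (watermark < windowEnd - 1) {
            return Destination.WINDOW;        // 窗口尚未触发,正常进入
        } else if (watermark < windowEnd + allowedLateness - 1) {
            return Destination.WINDOW_UPDATE; // 已触发但未销毁,进入窗口并更新结果
        } else {
            return Destination.SIDE_OUTPUT;   // 窗口已销毁,只能走侧输出流
        }
    }

    public static void main(String[] args) {
        long size = 10_000L;
        // 对应上面的例子:延迟 1s、窗口 10s、未设置 allowedLateness。
        // 输入 15000 后水位线为 13999,此时 timestamp=9000 所属窗口 [0, 10000) 已销毁
        System.out.println(classify(9_000L, 13_999L, size, 0L));      // SIDE_OUTPUT
        // 若设置了 1 分钟 allowedLateness,同样的数据仍可进入窗口并更新结果
        System.out.println(classify(9_000L, 13_999L, size, 60_000L)); // WINDOW_UPDATE
    }
}
```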