Flink Windows

1. Concepts

Windows are the core mechanism for processing unbounded streams. A window is normally defined to include its start time and exclude its end time, i.e. a left-closed, right-open interval; for example, the 0-10s window is written [0, 10).

Note that Flink windows are not prepared statically in advance but created dynamically: a window is created only when the first element falling into its range arrives. When the window's end time is reached, the window fires its computation and closes.

An example: the watermark delay is set to 2s and the tumbling window size is 10s. Elements arrive in this order (timestamps in seconds): 2, 3, 11, 9, 12, 15, 22, 23

1. When the 2s element arrives, the first window [0, 10) is created.

2. When the 3s element arrives, it falls into the first window as expected.

3. When the 11s element arrives, it does not belong to the first window, so the second window [10, 20) is created and the element is placed into it. Since the watermark delay is 2s, the event-time clock now reads 9s, so the first window has not yet reached its close time.

4. When the 9s element arrives, it still falls into the first window.

5. When the 12s element arrives, the watermark advances to 10s, so the first window fires and closes; the element itself falls into the second window.

6. When the 15s element arrives, it falls into the second window.

7. When the 22s element arrives, the watermark reaches 20s and the second window closes; the third window [20, 30) is opened and the element falls into it.

8. When the 23s element arrives, it falls into the third window.
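
The walkthrough above can be reproduced with a few lines of plain Java (no Flink API; the watermark rule is the bounded-out-of-orderness one: watermark = max timestamp seen so far minus the delay):

```java
// Simulates the walkthrough: for each arriving element, print the tumbling
// window it falls into and the watermark after it is processed.
class WatermarkWalkthrough {

    // start of the tumbling window containing ts (windows aligned to 0)
    static long windowStart(long ts, long size) {
        return ts - (ts % size);
    }

    // bounded-out-of-orderness watermark: max timestamp seen so far minus delay
    static long watermark(long maxTs, long delay) {
        return maxTs - delay;
    }

    public static void main(String[] args) {
        long size = 10, delay = 2, maxTs = Long.MIN_VALUE;
        for (long ts : new long[]{2, 3, 11, 9, 12, 15, 22, 23}) {
            maxTs = Math.max(maxTs, ts);
            System.out.println("event " + ts + "s -> window [" + windowStart(ts, size)
                    + ", " + (windowStart(ts, size) + size) + "), watermark "
                    + watermark(maxTs, delay) + "s");
        }
    }
}
```

When the 12s element is processed the watermark reaches 10s, which is exactly what closes the first window in step 5.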

2. Classification

  1. By driver type

A window that slices data by time spans is a time window (Time); a window driven by a fixed element count is a count window (Count).

  2. By how windows assign data

These are: tumbling windows (Tumbling), sliding windows (Sliding), session windows (Session), and global windows (Global).

1. Tumbling window: can be thought of as a fixed window, based on either time or count; each element belongs to exactly one window. It takes a single parameter: the window size.

2. Sliding window: adjacent sliding windows are not laid end to end; they are offset from each other. If you picture a single window in motion, it "slides" forward in small steps. It takes two parameters: the window size and the sliding step, which determines how often the window computation fires.

For example, with a window of length 1 hour and a step of 5 minutes, the step is the computation frequency: statistics over the last hour are produced every 5 minutes. Like tumbling windows, sliding windows can be defined by time or by count.
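
As a sketch of the assignment rule: each element belongs to size/slide overlapping windows. The plain-Java snippet below mirrors the logic of Flink's sliding assigner for windows aligned to 0 (it is an illustration, not the actual Flink API):

```java
import java.util.ArrayList;
import java.util.List;

class SlidingAssign {

    // Returns the start timestamps (latest first) of every sliding window
    // that contains ts, for windows aligned to 0.
    static List<Long> windowStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide); // latest window starting at or before ts
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // an element at t=7 with size=10, slide=5 belongs to [5,15) and [0,10)
        System.out.println(windowStarts(7, 10, 5));
    }
}
```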

3. Session window: can only be defined on time. Simply put: when an element arrives, a session is opened; elements arriving within a certain period join the session; once no element arrives for longer than that period, the session ends automatically and its computation fires. It needs either a gap parameter specifying the time interval, or a custom SessionWindowTimeGapExtractor.
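
The gap rule can be sketched in plain Java (this only mirrors how a fixed gap splits sessions; Flink actually builds sessions incrementally by merging windows):

```java
import java.util.ArrayList;
import java.util.List;

class SessionSketch {

    // Splits ascending timestamps into sessions: a gap between consecutive
    // elements larger than sessionGap closes the current session.
    static List<List<Long>> sessions(long[] sortedTs, long sessionGap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > sessionGap) {
                result.add(current);          // gap exceeded: close the session
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) result.add(current);
        return result;
    }

    public static void main(String[] args) {
        // gap of 5: the 10s element arrives 7s after 3s, so a new session starts
        System.out.println(sessions(new long[]{1, 2, 3, 10, 11}, 5));
    }
}
```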

4. Global window: assigns all elements with the same key to one window. It can be seen as not windowing at all, and by default it never fires. To make it compute anything, you must supply a custom trigger (Trigger).

3. Basic usage

1. Keyed and non-keyed streams

On a keyed stream, after the keyBy operation the data stream is split by key into multiple logical streams, a KeyedStream.

stream.keyBy(...)
      .window(...)  // keyed windowing

A non-keyed stream defines windows directly on the DataStream by calling .windowAll() (a non-parallel operation; setting a parallelism on it has no effect):

dataStreamSource.windowAll(...)

2. Basic window API usage

A window operation has two parts: the window assigner (Window Assigner), which specifies the window type, and the window function (Window Function), which defines the computation to run on the window.

        dataStreamSource.keyBy(<key selector>)
                        .window(<window assigner>)
                        .aggregate(<window function>)
1. Window assigners

Time window assigners: {processing time, event time} × {tumbling, sliding, session} = 6 kinds

        // tumbling processing-time window
        dataStreamSource.keyBy(data -> data.user)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .aggregate(...)
        // sliding processing-time window
        dataStreamSource.keyBy(data -> data.user)
                .window(SlidingProcessingTimeWindows.of(Time.seconds(5), Time.seconds(2)))
                .aggregate(...)
        // processing-time session window
//        dataStreamSource.keyBy(data -> data.user)
//                .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
//                .aggregate(...)
        dataStreamSource.keyBy(data -> data.user)
                .window(ProcessingTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor<MyEvent>() {
                    @Override
                    public long extract(MyEvent element) {
                        return 0;
                    }
                }))
                .aggregate(...) 
            
        // tumbling event-time window
        dataStreamSource.keyBy(data -> data.user)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .aggregate(...)
        // sliding event-time window
        dataStreamSource.keyBy(data -> data.user)
                .window(SlidingEventTimeWindows.of(Time.seconds(5), Time.seconds(2)))
                .aggregate(...)
        // event-time session window
        dataStreamSource.keyBy(data -> data.user)
                .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
                .aggregate(...)
//        dataStreamSource.keyBy(data -> data.user)
//                .window(EventTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor<MyEvent>() {
//                    @Override
//                    public long extract(MyEvent element) {
//                        return 0;
//                    }
//                }))
//                .aggregate(...)            

Count windows: tumbling count and sliding count

        // tumbling count window
        dataStreamSource.keyBy(data -> data.user)
                .countWindow(10)
                .aggregate(...)
        // sliding count window: takes the window size and the slide step
        dataStreamSource.keyBy(data -> data.user)
                .countWindow(10, 3)
                .aggregate(...)    

Global window: the low-level primitive behind count windows; generally used when defining custom windows

        // global window
        dataStreamSource.keyBy(data -> data.user)
                .window(GlobalWindows.create())

Looking at the countWindow source shows that it is implemented on the global window with its own trigger (a useful reference when writing a custom trigger):

org.apache.flink.streaming.api.datastream.KeyedStream#countWindow(long)

    public WindowedStream<T, KEY, GlobalWindow> countWindow(long size) {
        return window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(size)));
    }
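
Conceptually, CountTrigger fires when the per-window element count reaches size, and wrapping it in PurgingTrigger turns each FIRE into FIRE_AND_PURGE, clearing the window contents after emitting. A minimal plain-Java sketch of that combined behavior (an illustration, not the Flink classes themselves):

```java
class CountTriggerSketch {

    private final long size;
    private long count = 0;

    CountTriggerSketch(long size) {
        this.size = size;
    }

    // Called per element; returns true to mean FIRE_AND_PURGE:
    // emit the window result and reset the counter (purge the contents).
    boolean onElement() {
        if (++count >= size) {
            count = 0;
            return true;
        }
        return false; // CONTINUE
    }
}
```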
2. Window functions

The stream obtained from the window transformation is a WindowedStream. This is not a DataStream, so no further transformations can be applied to it directly; a window function must be called to process the collected data before we finally get a DataStream again:

By processing style, window functions fall into incremental aggregation functions and full window functions.

1. Incremental aggregation functions: reduce, aggregate

These apply stream-processing thinking to the windowed data: the core idea is to keep an aggregation state and keep updating it as each element arrives, instead of buffering everything and computing at the end.

Typical examples are ReduceFunction and AggregateFunction.

(1) ReduceFunction: reduces the elements collected in the window pairwise.

For example: with a tumbling window, collect the URLs each user visits within every 2s:

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L)
        );

        // out-of-order stream (watermark delay 0s)
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // key by user, then window and aggregate
                .keyBy(data -> data.user)
                // tumbling event-time window: every 2s forms one window
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .reduce(new ReduceFunction<MyEvent>() {
                    @Override
                    public MyEvent reduce(MyEvent value1, MyEvent value2) throws Exception {
                        MyEvent myEvent = new MyEvent();
                        myEvent.setUrl(value1.getUrl() + "," + value2.getUrl());
                        myEvent.setUser(value1.getUser());
                        return myEvent;
                    }
                })
                .print();

        executionEnvironment.execute();
    }
}

Result:

7> MyEvent(user=zs, url=/user,/order,/product?id=3, timestamp=null)
7> MyEvent(user=zs, url=/product?id=1,/product?id=2, timestamp=null)
1> MyEvent(user=ls, url=/user,/order,/product?id=3, timestamp=null)
1> MyEvent(user=ls, url=/product?id=1,/product?id=2, timestamp=null)

Change it to a sliding window and check the result again:

1) Change the code to a sliding window:

.window(SlidingEventTimeWindows.of(Time.seconds(2), Time.seconds(1)))

2) Result:

1> MyEvent(user=ls, url=/user,/order,/product?id=3, timestamp=null)
7> MyEvent(user=zs, url=/user,/order,/product?id=3, timestamp=null)
7> MyEvent(user=zs, url=/user,/order,/product?id=1,/product?id=2,/product?id=3, timestamp=null)
7> MyEvent(user=zs, url=/product?id=1,/product?id=2, timestamp=null)
1> MyEvent(user=ls, url=/user,/order,/product?id=1,/product?id=2,/product?id=3, timestamp=null)
1> MyEvent(user=ls, url=/product?id=1,/product?id=2, timestamp=null)

(2) AggregateFunction

ReduceFunction has a limitation: its input and output types must match the element type. When some preprocessing is required, the stream has to be mapped first. Sometimes we want the input, the intermediate state, and the output to have three different types; that is what AggregateFunction provides.

The relevant AggregateFunction source:

public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {
    // creates the accumulator: the initial state of the aggregation; called once per aggregation task
    ACC createAccumulator();

    // folds a newly arrived element into the accumulator
    ACC add(IN value, ACC accumulator);

    // extracts the output result from the accumulator
    OUT getResult(ACC accumulator);

    // merges two accumulators
    ACC merge(ACC a, ACC b);
}

It can be seen as a general version of ReduceFunction. The type parameters are the input type, the accumulator type, and the output type: IN is the type of the elements in the input stream, ACC is the intermediate aggregation state, and OUT is the type of the returned result.

For example: PV (page views) and UV (unique visitors). PV/UV expresses the average number of visits per user. (Computed on a tumbling event-time window; outputs the set of users and the average click count.)

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;
import java.util.HashSet;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs1", "/product?id=1", 2000L),
                new MyEvent("zs1", "/product?id=2", 2300L),
                new MyEvent("zs2", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls1", "/product?id=1", 2000L),
                new MyEvent("ls1", "/product?id=2", 2300L),
                new MyEvent("ls2", "/product?id=3", 1800L)
        );

        // out-of-order stream (watermark delay 0s)
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // put all users into one group for the computation
                .keyBy(data -> true)
                // tumbling event-time window: every 2s forms one window
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .aggregate(new AggregateFunction<MyEvent, Tuple2<HashSet<String>, Long>, String>() {
                    @Override
                    public Tuple2<HashSet<String>, Long> createAccumulator() {
                        return Tuple2.of(new HashSet<String>(), 0L);
                    }

                    @Override
                    public Tuple2<HashSet<String>, Long> add(MyEvent value, Tuple2<HashSet<String>, Long> accumulator) {
                        // add the element, then return the updated accumulator
                        accumulator.f0.add(value.getUser());
                        return Tuple2.of(accumulator.f0, accumulator.f1 + 1L);
                    }

                    @Override
                    public String getResult(Tuple2<HashSet<String>, Long> accumulator) {
                        // when the window closes, compute the result and emit it downstream
                        return accumulator.f0 + "" + String.valueOf((double) accumulator.f1 / accumulator.f0.size());
                    }

                    @Override
                    public Tuple2<HashSet<String>, Long> merge(Tuple2<HashSet<String>, Long> a, Tuple2<HashSet<String>, Long> b) {
                        return null;
                    }
                })
                .print();

        executionEnvironment.execute();
    }
}

Result:

5> [ls, zs2, zs, ls2]1.5
5> [zs1, ls1]2.0

(3) In addition, Flink provides a series of predefined simple aggregations that can be called directly on a WindowedStream: sum/max/maxBy/min/minBy. They are very similar to the KeyedStream aggregations, and all are implemented on top of AggregateFunction.

For example, max:

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L)
        );

        // out-of-order stream (watermark delay 0s)
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // key by user
                .keyBy(data -> data.getUser())
                // tumbling event-time window: every 2s forms one window
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .max("timestamp")
                .print();

        executionEnvironment.execute();
    }
}

Result:

7> MyEvent(user=zs, url=/user, timestamp=1800)
7> MyEvent(user=zs, url=/product?id=1, timestamp=2300)
1> MyEvent(user=ls, url=/user, timestamp=1800)
1> MyEvent(user=ls, url=/product?id=1, timestamp=2300)

Changed to maxBy:

.maxBy("timestamp")

Result:

1> MyEvent(user=ls, url=/product?id=3, timestamp=1800)
1> MyEvent(user=ls, url=/product?id=2, timestamp=2300)
7> MyEvent(user=zs, url=/product?id=3, timestamp=1800)
7> MyEvent(user=zs, url=/product?id=2, timestamp=2300)

As the output shows, max only guarantees that the requested field is the maximum; the other fields come from the first record. maxBy returns the entire record holding the maximum.
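
That difference can be sketched in plain Java (the field names mirror MyEvent; this is an illustration of the semantics, not Flink's implementation):

```java
class MaxVsMaxBy {

    static class Event {
        String user, url;
        long timestamp;

        Event(String user, String url, long timestamp) {
            this.user = user;
            this.url = url;
            this.timestamp = timestamp;
        }
    }

    // max("timestamp"): keep the accumulated record and only overwrite the
    // compared field, so the other fields stay those of the first record.
    static Event max(Event acc, Event next) {
        if (next.timestamp > acc.timestamp) {
            acc.timestamp = next.timestamp;
        }
        return acc;
    }

    // maxBy("timestamp"): keep the whole record that holds the maximum.
    static Event maxBy(Event acc, Event next) {
        return next.timestamp > acc.timestamp ? next : acc;
    }
}
```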

2. Full window functions: apply, process

A full window function first collects the data and buffers it internally, and only takes it out to compute when the window is about to emit its result: the typical batch-processing approach.

In addition, these functions can access some context information, such as the window start time.

  1. WindowFunction

Its interface:

public interface WindowFunction<IN, OUT, KEY, W extends Window> extends Function, Serializable {

    void apply(KEY key, W window, Iterable<IN> input, Collector<OUT> out) throws Exception;
}

For example: computing UV with a tumbling window:

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L)
        );

        // out-of-order stream
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // put all users into one group for the computation
                .keyBy(data -> true)
                // tumbling event-time window: every 2s forms one window
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .apply(new WindowFunction<MyEvent, String, Boolean, TimeWindow>() {
                    @Override
                    public void apply(Boolean aBoolean, TimeWindow window, Iterable<MyEvent> input, Collector<String> out) throws Exception {
                        // compute UV
                        Set<String> usernames = new HashSet<String>();
                        for (MyEvent event : input) {
                            usernames.add(event.getUser());
                        }
                        String result = "key: " + aBoolean + "\tuv: " + usernames.size() + " window: " + new Timestamp(window.getStart()) + " - " + new Timestamp(window.getEnd());
                        out.collect(result);
                    }
                })
                .print();

        executionEnvironment.execute();
    }
}

Result:

5> key: true	uv: 2 window: 1970-01-01 08:00:00.0 - 1970-01-01 08:00:02.0
5> key: true	uv: 2 window: 1970-01-01 08:00:02.0 - 1970-01-01 08:00:04.0
  2. ProcessWindowFunction

It can be understood as an enhanced WindowFunction: through the context it can access the window as well as the watermark.

Its interface:

public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window>
        extends AbstractRichFunction {

    private static final long serialVersionUID = 1L;

    public abstract void process(
            KEY key, Context context, Iterable<IN> elements, Collector<OUT> out) throws Exception;
    
    public void clear(Context context) throws Exception {}

}

Code:

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L)
        );

        // out-of-order stream
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // key by user
                .keyBy(data -> data.getUser())
                // tumbling event-time window: every 2s forms one window
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .process(new ProcessWindowFunction<MyEvent, String, String, TimeWindow>() {
                    @Override
                    public void process(String s, Context context, Iterable<MyEvent> elements, Collector<String> out) throws Exception {
                        Long start = context.window().getStart();
                        Long end = context.window().getEnd();
                        Long currentWatermark = context.currentWatermark();
                        Long count = elements.spliterator().getExactSizeIfKnown();

                        // collect the elements into a result list
                        List<String> result = new ArrayList<>();
                        Iterator<MyEvent> iterator = elements.iterator();
                        while (iterator.hasNext()) {
                            result.add(iterator.next().toString());
                        }
                        out.collect("window " + start + " ~ " + end + " holds " + count + " elements; at window close the watermark is: " + currentWatermark + " result: " + result);
                    }
                })
                .print();

        executionEnvironment.execute();
    }
}

Result:

7> window 0 ~ 2000 holds 3 elements; at window close the watermark is: 9223372036854775807 result: [MyEvent(user=zs, url=/user, timestamp=1000), MyEvent(user=zs, url=/order, timestamp=1500), MyEvent(user=zs, url=/product?id=3, timestamp=1800)]
7> window 2000 ~ 4000 holds 2 elements; at window close the watermark is: 9223372036854775807 result: [MyEvent(user=zs, url=/product?id=1, timestamp=2000), MyEvent(user=zs, url=/product?id=2, timestamp=2300)]
1> window 0 ~ 2000 holds 3 elements; at window close the watermark is: 9223372036854775807 result: [MyEvent(user=ls, url=/user, timestamp=1000), MyEvent(user=ls, url=/order, timestamp=1500), MyEvent(user=ls, url=/product?id=3, timestamp=1800)]
1> window 2000 ~ 4000 holds 2 elements; at window close the watermark is: 9223372036854775807 result: [MyEvent(user=ls, url=/product?id=1, timestamp=2000), MyEvent(user=ls, url=/product?id=2, timestamp=2300)]
3. Summary

Incremental aggregation functions compute more efficiently, while the strength of full window functions is the extra information they provide. The two can be combined: the second parameter of .reduce and .aggregate accepts a full window function.

For example: use a sliding window of size 10s with a 5s slide to compute the popularity (visit count) of each URL:

(1) Define the result class:

package cn.qz.window;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class UrlViewCount {

    public String url;

    public Long count;

    public Long windowStart;

    public Long windowEnd;

}

(2) Test class:

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L),

                new MyEvent("ww", "/user", 1000L),
                new MyEvent("ww", "/order", 1500L)
        );

        // assign timestamps and watermarks (0s delay)
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // key by url and compute over a sliding window
                .keyBy(data -> data.getUrl())
                .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                // pass both an incremental aggregation function and a full window function
                .aggregate(new UrlViewCountAgg(), new UrlViewCountResult())
                .print();

        executionEnvironment.execute();
    }

    // custom incremental aggregation: increment by one per element
    public static class UrlViewCountAgg implements AggregateFunction<MyEvent, Long, Long> {

        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(MyEvent value, Long accumulator) {
            return accumulator + 1;
        }

        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }

        @Override
        public Long merge(Long a, Long b) {
            return null;
        }
    }

    // custom window process function: only wraps window info around the result
    public static class UrlViewCountResult extends ProcessWindowFunction<Long, UrlViewCount, String, TimeWindow> {

        @Override
        public void process(String url, Context context, Iterable<Long> elements, Collector<UrlViewCount> out) throws Exception {
            // wrap the output with window info
            Long start = context.window().getStart();
            Long end = context.window().getEnd();
            // the iterable holds exactly one element: the incremental aggregation result
            out.collect(new UrlViewCount(url, elements.iterator().next(), start, end));
        }
    }
}

(3) Result:

UrlViewCount(url=/user, count=3, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/order, count=3, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/product?id=2, count=2, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/product?id=3, count=2, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/product?id=1, count=2, windowStart=-5000, windowEnd=5000)
UrlViewCount(url=/order, count=3, windowStart=0, windowEnd=10000)
UrlViewCount(url=/product?id=3, count=2, windowStart=0, windowEnd=10000)
UrlViewCount(url=/product?id=1, count=2, windowStart=0, windowEnd=10000)
UrlViewCount(url=/product?id=2, count=2, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=3, windowStart=0, windowEnd=10000)
4. windowAll

Implementing the UV statistic with the windowAll function:

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L)
        );

        // out-of-order stream
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // no keyBy: windowAll runs on the non-keyed stream
//                .keyBy(data -> true)
                // tumbling event-time window: every 2s forms one window
                .windowAll(TumblingEventTimeWindows.of(Time.seconds(2)))
                .apply(new AllWindowFunction<MyEvent, String, TimeWindow>() {
                    @Override
                    public void apply(TimeWindow window, Iterable<MyEvent> values, Collector<String> out) throws Exception {
                        // compute UV
                        Set<String> usernames = new HashSet<String>();
                        for (MyEvent event : values) {
                            usernames.add(event.getUser());
                        }
                        String result = "uv:" + usernames.size() + " window:" + new Timestamp(window.getStart()) + " - " + new Timestamp(window.getEnd());
                        out.collect(result);
                    }
                }).print();

        executionEnvironment.execute();
    }
}

Result:

1> uv:2 window:1970-01-01 08:00:02.0 - 1970-01-01 08:00:04.0
8> uv:2 window:1970-01-01 08:00:00.0 - 1970-01-01 08:00:02.0
5. Other APIs
  1. Trigger

A trigger fires the window computation. Trigger is an internal property of the window operator; every window assigner (WindowAssigner) has a corresponding default trigger, and for Flink's built-in window types the triggers are already implemented. For example, the default trigger of event-time windows is EventTimeTrigger; similarly there are ProcessingTimeTrigger and CountTrigger.

The Trigger class:

public abstract class Trigger<T, W extends Window> implements Serializable {
    // called once for every element added to a window
    public abstract TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx)
            throws Exception;
    // called when a registered processing-time timer fires
    public abstract TriggerResult onProcessingTime(long time, W window, TriggerContext ctx)
            throws Exception;
    // called when a registered event-time timer fires
    public abstract TriggerResult onEventTime(long time, W window, TriggerContext ctx)
            throws Exception;
    // called when the window is closed and destroyed; typically used to clear custom state
    public abstract void clear(W window, TriggerContext ctx) throws Exception;
}

All three event-handler methods return a TriggerResult, an enum type:

    /** No action is taken on the window. */
    CONTINUE(false, false),

    /** {@code FIRE_AND_PURGE} evaluates the window function and emits the window result. */
    FIRE_AND_PURGE(true, true),

    /**
     * On {@code FIRE}, the window is evaluated and results are emitted. The window is not purged,
     * though, all elements are retained.
     */
    FIRE(true, false),

    /**
     * All elements in the window are cleared and the window is discarded, without evaluating the
     * window function or emitting any elements.
     */
    PURGE(false, true);
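Each constant carries two boolean flags (fire, purge) that tell the window operator whether to evaluate the window function and whether to clear the window contents. A minimal plain-Java sketch of that dispatch (an illustrative re-implementation for clarity, not Flink's actual operator code):

```java
import java.util.ArrayList;
import java.util.List;

public class TriggerResultDemo {
    // mirrors the flags behind Flink's TriggerResult constants (illustrative)
    enum Result {
        CONTINUE(false, false), FIRE(true, false),
        PURGE(false, true), FIRE_AND_PURGE(true, true);
        final boolean fire, purge;
        Result(boolean fire, boolean purge) { this.fire = fire; this.purge = purge; }
    }

    static List<Integer> buffer = new ArrayList<>(); // the window contents

    // what a window operator conceptually does with a TriggerResult
    static Integer handle(Result r) {
        Integer emitted = null;
        if (r.fire) {
            // evaluate the window function (here: a sum) and emit the result
            emitted = buffer.stream().mapToInt(Integer::intValue).sum();
        }
        if (r.purge) {
            buffer.clear(); // drop the window contents without emitting anything more
        }
        return emitted;
    }

    public static void main(String[] args) {
        buffer.add(1); buffer.add(2);
        System.out.println(handle(Result.FIRE));           // emits 3, elements retained
        System.out.println(buffer.size());                 // 2
        System.out.println(handle(Result.FIRE_AND_PURGE)); // emits 3, then clears
        System.out.println(buffer.size());                 // 0
    }
}
```

This is why FIRE can emit the same (growing) window repeatedly, while FIRE_AND_PURGE resets the contents after each emission.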

Example: compute the PV (page view) count for each url in a 10s tumbling window, and install a trigger so that the window result is computed once per second.

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1200L)
        );

        // key by url and count per tumbling window
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // key by url, tumbling-window statistics
                .keyBy(r -> r.url)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .trigger(new MyTrigger())
                .process(new WindowResult())
                .print();
        executionEnvironment.execute();
    }
    public static class WindowResult extends ProcessWindowFunction<MyEvent, UrlViewCount, String, TimeWindow> {
        @Override
        public void process(String s, Context context, Iterable<MyEvent> iterable, Collector<UrlViewCount> collector) throws Exception {
            collector.collect(
                    new UrlViewCount(
                            s,
                            // number of elements in the iterable
                            iterable.spliterator().getExactSizeIfKnown(),
                            context.window().getStart(),
                            context.window().getEnd()
                    )
            );
        }
    }

    public static class MyTrigger extends Trigger<MyEvent, TimeWindow> {
        @Override
        public TriggerResult onElement(MyEvent event, long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
            ValueState<Boolean> isFirstEvent = triggerContext.getPartitionedState(
                    new ValueStateDescriptor<Boolean>("first-event", Types.BOOLEAN)
            );
            if (isFirstEvent.value() == null) {
                for (long i = timeWindow.getStart(); i < timeWindow.getEnd(); i = i + 1000L) {
                    triggerContext.registerEventTimeTimer(i);
                }
                isFirstEvent.update(true);
            }
            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onEventTime(long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
            return TriggerResult.FIRE;
        }

        @Override
        public TriggerResult onProcessingTime(long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
            return TriggerResult.CONTINUE;
        }

        @Override
        public void clear(TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
            ValueState<Boolean> isFirstEvent = triggerContext.getPartitionedState(
                    new ValueStateDescriptor<Boolean>("first-event", Types.BOOLEAN)
            );
            isFirstEvent.clear();
        }
    }
}

Result (each registered 1s event-time timer fires the window with FIRE, which emits the current count without purging, so the same windows report repeatedly):

UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/order, count=1, windowStart=0, windowEnd=10000)
UrlViewCount(url=/user, count=1, windowStart=0, windowEnd=10000)
  2. Evictor

An evictor implements logic for removing elements from a window; it is attached via the WindowedStream.evictor method. The Evictor interface has two methods:

    void evictBefore( // eviction performed before the window function runs
            Iterable<TimestampedValue<T>> elements,
            int size,
            W window,
            EvictorContext evictorContext);

    void evictAfter( // eviction performed after the window function runs
            Iterable<TimestampedValue<T>> elements,
            int size,
            W window,
            EvictorContext evictorContext);

Example:

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.evictors.Evictor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.time.Duration;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // build test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L)
        );

        // out-of-order stream
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // put all records into the same group for the computation
                .keyBy(data -> true)
                // tumbling event-time window: every 2s forms one window
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .evictor(new Evictor<MyEvent, TimeWindow>() {
                    /**
                     * @param elements elements wrapped with their event timestamps
                     * @param size number of elements currently in the window
                     * @param window window metadata
                     * @param evictorContext
                     */
                    @Override
                    public void evictBefore(Iterable<TimestampedValue<MyEvent>> elements, int size, TimeWindow window, EvictorContext evictorContext) {
                        Iterator<TimestampedValue<MyEvent>> iterator = elements.iterator();
                        while (iterator.hasNext()) {
                            TimestampedValue<MyEvent> event = iterator.next();
                            long timestamp = event.getTimestamp(); // event timestamp
                            MyEvent value = event.getValue(); // the wrapped value
                            if ("zs".equals(value.getUser())) {
                                System.out.println("remove value\t" + value + "\twindow: " + window.getStart() + "-" + window.getEnd());
                                iterator.remove();
                            }
                        }
                    }

                    @Override
                    public void evictAfter(Iterable<TimestampedValue<MyEvent>> elements, int size, TimeWindow window, EvictorContext evictorContext) {

                    }
                })
                .apply(new WindowFunction<MyEvent, String, Boolean, TimeWindow>() {
                    @Override
                    public void apply(Boolean aBoolean, TimeWindow window, Iterable<MyEvent> input, Collector<String> out) throws Exception {
                        // compute UV (unique visitors)
                        Set<String> usernames = new HashSet<String>();
                        for (MyEvent event : input) {
                            usernames.add(event.getUser());
                        }
                        String result = "key: " + aBoolean + "\tuv: " + usernames.size() + "窗口: " + new Timestamp(window.getStart()) + " - " + new Timestamp(window.getEnd());
                        out.collect(result);
                    }
                })
                .print();

        executionEnvironment.execute();
    }
}

Result:

remove value	MyEvent(user=zs, url=/user, timestamp=1000)	window: 0-2000
remove value	MyEvent(user=zs, url=/order, timestamp=1500)	window: 0-2000
remove value	MyEvent(user=zs, url=/product?id=3, timestamp=1800)	window: 0-2000
5> key: true	uv: 1窗口: 1970-01-01 08:00:00.0 - 1970-01-01 08:00:02.0
remove value	MyEvent(user=zs, url=/product?id=1, timestamp=2000)	window: 2000-4000
remove value	MyEvent(user=zs, url=/product?id=2, timestamp=2300)	window: 2000-4000
5> key: true	uv: 1窗口: 1970-01-01 08:00:02.0 - 1970-01-01 08:00:04.0
  3. Combining watermarks with window functions

(1) Example 1: notice that the last, late record is lost:

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);

        SingleOutputStreamOperator<MyEvent> dataStreamSource = executionEnvironment.socketTextStream("192.168.13.111", 7777)
                .map(new MapFunction<String, MyEvent>() {
                    @Override
                    public MyEvent map(String value) throws Exception {
                        String[] split = value.split(",");
                        return new MyEvent(split[0], split[1], Long.valueOf(split[2]));
                    }
                });

        // out-of-order stream with a 5s bounded delay
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // key by user and aggregate per window
                .keyBy(data -> data.user)
                // tumbling event-time window of 10s
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(new WatermarkTestResult())
                .print();

        executionEnvironment.execute();
    }

    /**
     * ProcessWindowFunction type parameters, in order: <IN, OUT, KEY, W extends Window>
     */
    private static class WatermarkTestResult extends ProcessWindowFunction<MyEvent, String, String, TimeWindow> {
        @Override
        public void process(String s, Context context, Iterable<MyEvent> elements, Collector<String> out) throws Exception {
            Long start = context.window().getStart();
            Long end = context.window().getEnd();
            Long currentWatermark = context.currentWatermark();
            Long count = elements.spliterator().getExactSizeIfKnown();

            // collect the elements into a result list
            List<String> result = new ArrayList<>();
            Iterator<MyEvent> iterator = elements.iterator();
            while (iterator.hasNext()) {
                result.add(iterator.next().toString());
            }
            out.collect("窗口" + start + " ~ " + end + "中共有" + count + "个元素,窗口闭合计算时,水位线处于:" + currentWatermark + " result: " + result);
        }
    }
}

1》Enter the following in the nc -l 7777 terminal:

zs,./home,1000
zs,./cart,2000
zs,./prod?id=100,10000
zs,./prod?id=200,8000
zs,./prod?id=300,15000

2》The console prints the following (the window computation fires):

窗口0 ~ 10000中共有3个元素,窗口闭合计算时,水位线处于:9999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000)]

3》Enter one more line in the nc -l terminal:

zs,./prod?id=400,9000

4》Terminate the nc -l session and watch the console

5》The console prints:

窗口0 ~ 10000中共有3个元素,窗口闭合计算时,水位线处于:9999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000)]
窗口10000 ~ 20000中共有2个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=zs, url=./prod?id=100, timestamp=10000), MyEvent(user=zs, url=./prod?id=300, timestamp=15000)]

As you can see, once [zs,./prod?id=300,15000] is entered, a watermark with timestamp 15000L - 5*1000L - 1L = 9999 ms is inserted into the stream periodically (every 200 ms by default). That reaches the end of window [0, 10000), so the window fires and closes, and the record with timestamp 9000 ms entered afterwards can no longer join it.
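The arithmetic can be checked with a few lines of plain Java (no Flink dependency; the helper mirrors the behavior of bounded-out-of-orderness watermark generation): the watermark trails the largest timestamp seen so far by delay + 1 ms, so the 15000 ms event pushes it to 9999 ms, which closes window [0, 10000):

```java
public class WatermarkDemo {
    static long maxTs = Long.MIN_VALUE;

    // watermark = max seen timestamp - bounded delay - 1 ms
    static long watermark(long eventTs, long delayMs) {
        maxTs = Math.max(maxTs, eventTs);
        return maxTs - delayMs - 1;
    }

    public static void main(String[] args) {
        long windowEnd = 10_000L; // window [0, 10000); its max timestamp is 9999
        for (long ts : new long[]{1000L, 2000L, 10_000L, 8000L, 15_000L}) {
            long wm = watermark(ts, 5000L);
            System.out.println("event=" + ts + " watermark=" + wm
                    + (wm >= windowEnd - 1 ? "  -> window [0, 10000) fires" : ""));
        }
    }
}
```

Note that a late event (like the 9000 ms one) never moves the watermark backwards, which is exactly why it cannot reopen the already-fired window.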

(2) Approaches to handling late data

1》Increase the watermark delay

        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(6))

2》Allowed lateness

With allowed lateness, the window is not destroyed at its end time: it fires and emits a result, but late records arriving afterwards are still processed. For example, take an 8:00-9:00 window and ignore watermark delay: normally the window would fire and close at 9:00. With one minute of allowed lateness, the window still fires at 9:00 and emits a result, but does not close. Any later record that belongs to 8:00-9:00 is added on top of the previous aggregate, and an updated result is emitted again. Only when the watermark reaches 9:01 is the state actually cleared and the window closed.
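That lifecycle can be sketched in plain Java (an illustrative model, not Flink's implementation): after the first fire, each late element re-fires the window with an updated result until the watermark passes end + allowedLateness, after which late elements are dropped (or side-output):

```java
import java.util.ArrayList;
import java.util.List;

public class AllowedLatenessDemo {
    static final long END = 10_000L;      // window [0, 10000)
    static final long LATENESS = 60_000L; // allowedLateness of 1 minute
    static List<Long> contents = new ArrayList<>(); // window state, kept alive after the first fire

    /**
     * A late element arriving while watermark < END + LATENESS is added to the
     * still-alive window state, and an updated result (here: the element count)
     * is emitted again. Afterwards the state is gone and the element is dropped.
     * Returns the emitted count, or -1 if the element was dropped.
     */
    static long lateElement(long ts, long watermark) {
        if (watermark >= END + LATENESS) {
            return -1; // state already cleared: dropped (or sent to a side output)
        }
        contents.add(ts);
        return contents.size();
    }
}
```

For instance, with three elements already counted, a late element at watermark 9999 emits an updated count of 4, while one arriving after the watermark passes 70000 is dropped.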

        // out-of-order stream with a 5s bounded delay
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // key by user and aggregate per window
                .keyBy(data -> data.user)
                // tumbling event-time window of 10s
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // approach 2: let the window handle late data, waiting 1 minute
                .allowedLateness(Time.minutes(1))
                .process(new WatermarkTestResult())
                .print();

Repeating the input steps above, the result is:

窗口0 ~ 10000中共有3个元素,窗口闭合计算时,水位线处于:9999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000)]
窗口0 ~ 10000中共有4个元素,窗口闭合计算时,水位线处于:9999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000), MyEvent(user=zs, url=./prod?id=400, timestamp=9000)]
窗口10000 ~ 20000中共有2个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=zs, url=./prod?id=100, timestamp=10000), MyEvent(user=zs, url=./prod?id=300, timestamp=15000)]

3》Route late data to a side output

Late records can be routed to a "side output" stream for separate handling.

package cn.qz.window;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReduceFunTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);

        SingleOutputStreamOperator<MyEvent> dataStreamSource = executionEnvironment.socketTextStream("192.168.13.111", 7777)
                .map(new MapFunction<String, MyEvent>() {
                    @Override
                    public MyEvent map(String value) throws Exception {
                        String[] split = value.split(",");
                        return new MyEvent(split[0], split[1], Long.valueOf(split[2]));
                    }
                });


        // out-of-order stream (bounded delay of 1s)
        SingleOutputStreamOperator<MyEvent> stream = dataStreamSource
                .assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(1))
                        .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                            @Override
                            public long extractTimestamp(MyEvent element, long recordTimestamp) {
                                return element.getTimestamp();
                            }
                        }));

        // define the side-output tag
        OutputTag<MyEvent> outputTag = new OutputTag<MyEvent>("late") {
        };
        // key by user and aggregate per window
        SingleOutputStreamOperator<String> result = stream.keyBy(data -> data.user)
                // tumbling event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // approach 2: allow late data with a 1s wait (disabled here)
//                .allowedLateness(Time.seconds(1))
                // approach 3: route the remaining late data to a side output
                .sideOutputLateData(outputTag)
                .process(new WatermarkTestResult());

        // getSideOutput must be called on result, not on stream
        result.print("result");
        result.getSideOutput(outputTag).print("late");

        // also print the raw input for easier observation
        stream.print("input");

        executionEnvironment.execute();
    }

    /**
     * ProcessWindowFunction type parameters, in order: <IN, OUT, KEY, W extends Window>
     */
    private static class WatermarkTestResult extends ProcessWindowFunction<MyEvent, String, String, TimeWindow> {
        @Override
        public void process(String s, Context context, Iterable<MyEvent> elements, Collector<String> out) throws Exception {
            Long start = context.window().getStart();
            Long end = context.window().getEnd();
            Long currentWatermark = context.currentWatermark();
            Long count = elements.spliterator().getExactSizeIfKnown();

            // collect the elements into a result list
            List<String> result = new ArrayList<>();
            Iterator<MyEvent> iterator = elements.iterator();
            while (iterator.hasNext()) {
                result.add(iterator.next().toString());
            }
            out.collect("窗口" + start + " ~ " + end + "中共有" + count + "个元素,窗口闭合计算时,水位线处于:" + currentWatermark + " result: " + result);
        }
    }
}

Result:

input> MyEvent(user=zs, url=./home, timestamp=1000)
input> MyEvent(user=zs, url=./cart, timestamp=2000)
input> MyEvent(user=zs, url=./prod?id=100, timestamp=10000)
input> MyEvent(user=zs, url=./prod?id=200, timestamp=8000)
input> MyEvent(user=zs, url=./prod?id=300, timestamp=15000)
result> 窗口0 ~ 10000中共有3个元素,窗口闭合计算时,水位线处于:13999 result: [MyEvent(user=zs, url=./home, timestamp=1000), MyEvent(user=zs, url=./cart, timestamp=2000), MyEvent(user=zs, url=./prod?id=200, timestamp=8000)]
input> MyEvent(user=zs, url=./prod?id=400, timestamp=9000)
late> MyEvent(user=zs, url=./prod?id=400, timestamp=9000)
result> 窗口10000 ~ 20000中共有2个元素,窗口闭合计算时,水位线处于:9223372036854775807 result: [MyEvent(user=zs, url=./prod?id=100, timestamp=10000), MyEvent(user=zs, url=./prod?id=300, timestamp=15000)]
posted @ 2022-08-07 22:42 QiaoZhi