flink中对于window和watermark的一些理解

package com.chenxiang.flink.demo;

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Scanner;

/**
 * @author 闪电侠
 */
public class IOServer {

    public static void main(String[] args) throws Exception {
        ServerSocket serverSocket = new ServerSocket(9999);

        // (1) 接收新连接线程
        new Thread(() -> {
            while (true) {
                try {
                    // (1) 阻塞方法获取新的连接
                    Socket socket = serverSocket.accept();

                    // (2) 每一个新的连接都创建一个线程,负责读取数据
                    new Thread(() -> {
                        try {
                            while (true) {
                                socket.getOutputStream().write(new Scanner(System.in).next().getBytes());
                            }
                        } catch (IOException e) {
                        }
                    }).start();

                } catch (IOException e) {
                }

            }
        }).start();
    }
}

 

首先window的时间范围是一个自然时间范围,比如你定义了一个

TumblingEventTimeWindows.of(Time.seconds(3))
窗口,那么会生成类似如下的窗口(左闭右开):

[2018-03-03 03:30:00,2018-03-03 03:30:03)

[2018-03-03 03:30:03,2018-03-03 03:30:06)

...

[2018-03-03 03:30:57,2018-03-03 03:31:00)

当一个event time=2018-03-03 03:30:01的消息到来时,就会生成[2018-03-03 03:30:00,2018-03-03 03:30:03)这个窗口(而不是[2018-03-03 03:30:01,2018-03-03 03:30:04)这样的窗口),然后将消息buffer在这个窗口中(此时还不会触发窗口的计算),

当watermark(可以翻译成水位线,只会不断上升,不会下降,用来保证在水位线之前的数据一定都到达,org.apache.flink.streaming.api.datastream.DataStream#assignTimestampsAndWatermarks(org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks<T>)来定制水位线生成策略)超过窗口的endTime时,才会真正触发窗口计算逻辑,然后清除掉该窗口。这样后续再有在该窗口区间内的数据到来时(延迟到来的数据),将被丢弃不被处理。有时候我们希望可以容忍一定时间的数据延迟,我们可以通过org.apache.flink.streaming.api.datastream.WindowedStream#allowedLateness方法来指定一个允许的延迟时间,比如allowedLateness(Time.seconds(5))允许5秒延迟,还是用上面的样例距离,当watermark到达2018-03-03 03:30:03时,将不会移除窗口,但是会触发窗口计算函数,由于窗口还在,所以当还有延迟的消息时间落在该窗口范围内时,比如又有一条消息2018-03-03 03:30:02到来(已经小于水位线时间2018-03-03 03:30:03),将会再次触发窗口计算函数。

那么什么时候这个窗口会被移除后续的延迟数据将不被处理呢?比如[2018-03-03 03:30:00,2018-03-03 03:30:03)这个窗口,当允许延迟5秒时,将在watermark到达2018-03-03 03:30:08时(即[watermark-5秒=2018-03-03 03:30:03]到达窗口的endTime时),[2018-03-03 03:30:00,2018-03-03 03:30:03)这个窗口将被移除,后续再有事件时间落在该窗口的数据到来,将丢弃不处理。

附一个demo测试程序:

package windowing;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.text.SimpleDateFormat;
import java.util.Comparator;
import java.util.Date;
import java.util.LinkedList;

/**
 *
 * @author : xiaojun
 * @since 9:47 2018/4/2
 */
public class WatermarkTest {
    public static void main(String[] args) throws Exception {

        //2018/3/3 3:30:0
        Long baseTimestamp = 1520019000000L;

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getConfig().setAutoWatermarkInterval(2000);
        env.setParallelism(1);

        DataStream<Tuple2<String, Long>> raw = env.socketTextStream("localhost", 9999, "\n").map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                //每行输入数据形如: key1@0,key1@13等等,即在baseTimestamp的基础上加多少秒,作为当前event time
                String[] tmp = value.split("@");
                Long ts = baseTimestamp + Long.parseLong(tmp[1]) * 1000;
                return Tuple2.of(tmp[0], ts);
            }
        }).assignTimestampsAndWatermarks(new MyTimestampExtractor(Time.seconds(10))); //允许10秒乱序,watermark为当前接收到的最大事件时间戳减10秒

        DataStream<String> window = raw.keyBy(0)
                //窗口都为自然时间窗口,而不是说从收到的消息时间为窗口开始时间来进行开窗,比如3秒的窗口,那么窗口一次是[0,3),[3,6)....[57,0),如果10秒窗口,那么[0,10),[10,20),...
                .window(TumblingEventTimeWindows.of(Time.seconds(3)))
                // 允许5秒延迟
                //比如窗口[2018-03-03 03:30:00,2018-03-03 03:30:03),如果没有允许延迟的话,那么当watermark到达2018-03-03 03:30:03的时候,将会触发窗口函数并移除窗口,这样2018-03-03 03:30:03之前的数据再来,将被丢弃
                //在允许5秒延迟的情况下,那么窗口的移除时间将到watermark为2018-03-03 03:30:08,在watermark没有到达这个时间之前,你输入2018-03-03 03:30:00这个时间,将仍然会触发[2018-03-03 03:30:00,2018-03-03 03:30:03)这个窗口的计算
                .allowedLateness(Time.seconds(5))
                .apply(new WindowFunction<Tuple2<String, Long>, String, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
                        LinkedList<Tuple2<String, Long>> data = new LinkedList<>();
                        for (Tuple2<String, Long> tuple2 : input) {
                            data.add(tuple2);
                        }
                        data.sort(new Comparator<Tuple2<String, Long>>() {
                            @Override
                            public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
                                return o1.f1.compareTo(o2.f1);
                            }
                        });
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                        String msg = String.format("key:%s,  window:[ %s  ,  %s ), elements count:%d, elements time range:[ %s  ,  %s ]", tuple.getField(0)
                                , format.format(new Date(window.getStart()))
                                , format.format(new Date(window.getEnd()))
                                , data.size()
                                , format.format(new Date(data.getFirst().f1))
                                , format.format(new Date(data.getLast().f1))
                        );
                        out.collect(msg);

                    }
                });
        window.print();

        env.execute();

    }


    public static class MyTimestampExtractor implements AssignerWithPeriodicWatermarks<Tuple2<String, Long>> {


        private static final long serialVersionUID = 1L;

        /**
         * The current maximum timestamp seen so far.
         */
        private long currentMaxTimestamp;

        /**
         * The timestamp of the last emitted watermark.
         */
        private long lastEmittedWatermark = Long.MIN_VALUE;

        /**
         * The (fixed) interval between the maximum seen timestamp seen in the records
         * and that of the watermark to be emitted.
         */
        private final long maxOutOfOrderness;


        private SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        public MyTimestampExtractor(Time maxOutOfOrderness) {
            if (maxOutOfOrderness.toMilliseconds() < 0) {
                throw new RuntimeException("Tried to set the maximum allowed " +
                        "lateness to " + maxOutOfOrderness + ". This parameter cannot be negative.");
            }
            this.maxOutOfOrderness = maxOutOfOrderness.toMilliseconds();
            this.currentMaxTimestamp = Long.MIN_VALUE + this.maxOutOfOrderness;
        }

        public long getMaxOutOfOrdernessInMillis() {
            return maxOutOfOrderness;
        }


        @Override
        public final Watermark getCurrentWatermark() {
            // this guarantees that the watermark never goes backwards.
            long potentialWM = currentMaxTimestamp - maxOutOfOrderness;
            if (potentialWM >= lastEmittedWatermark) {
                lastEmittedWatermark = potentialWM;
            }
            System.out.println(String.format("call getCurrentWatermark======currentMaxTimestamp:%s  , lastEmittedWatermark:%s", format.format(new Date(currentMaxTimestamp)), format.format(new Date(lastEmittedWatermark))));
            return new Watermark(lastEmittedWatermark);
        }

        @Override
        public final long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
            long timestamp = element.f1;
            if (timestamp > currentMaxTimestamp) {
                currentMaxTimestamp = timestamp;
            }
            return timestamp;
        }
    }
}

 

赋windows上的nc工具:

https://download.csdn.net/download/xiao_jun_0820/10322207

启动本地9999端口:nc -l -p 9999
监听本地9999端口:nc localhost 9999

结果

1)
.assignTimestampsAndWatermarks(new MyTimestampExtractor(Time.seconds(0))); //允许0秒乱序,watermark为当前接收到的最大事件时间戳减10秒
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.allowedLateness(Time.seconds(5))
[2018-03-03 03:30:00, 2018-03-03 03:30:03)  
[2018-03-03 03:30:03, 2018-03-03 03:30:06)
[2018-03-03 03:30:06, 2018-03-03 03:30:09)
key1@1
key1@3  //触发窗口[2018-03-03 03:30:00,2018-03-03 03:30:03)计算
key:key1,  window:[ 2018-03-03 03:30:00  ,  2018-03-03 03:30:03 ), elements count:1, elements time range:[ 2018-03-03 03:30:01  ,  2018-03-03 03:30:01 ]
key1@2
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:30:03
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:30:03
key:key1,  window:[ 2018-03-03 03:30:00  ,  2018-03-03 03:30:03 ), elements count:2, elements time range:[ 2018-03-03 03:30:01  ,  2018-03-03 03:30:02 ]
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:30:03
key1@1
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:30:03
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:30:03
key:key1,  window:[ 2018-03-03 03:30:00  ,  2018-03-03 03:30:03 ), elements count:3, elements time range:[ 2018-03-03 03:30:01  ,  2018-03-03 03:30:02 ]
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:30:03

key1@4
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:30:03
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:04  , lastEmittedWatermark:2018-03-03 03:30:03
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:04  , lastEmittedWatermark:2018-03-03 03:30:04
key1@6
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:06  , lastEmittedWatermark:2018-03-03 03:30:04
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:06  , lastEmittedWatermark:2018-03-03 03:30:06
key:key1,  window:[ 2018-03-03 03:30:03  ,  2018-03-03 03:30:06 ), elements count:2, elements time range:[ 2018-03-03 03:30:03  ,  2018-03-03 03:30:04 ]
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:06  , lastEmittedWatermark:2018-03-03 03:30:06
key1@1
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:06  , lastEmittedWatermark:2018-03-03 03:30:06
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:06  , lastEmittedWatermark:2018-03-03 03:30:06
key:key1,  window:[ 2018-03-03 03:30:00  ,  2018-03-03 03:30:03 ), elements count:4, elements time range:[ 2018-03-03 03:30:01  ,  2018-03-03 03:30:02 ]
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:06  , lastEmittedWatermark:2018-03-03 03:30:06


key1@8  //因为5秒延迟,上一个窗口不触发并销毁
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:06  , lastEmittedWatermark:2018-03-03 03:30:06
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:08  , lastEmittedWatermark:2018-03-03 03:30:06
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:08  , lastEmittedWatermark:2018-03-03 03:30:08
key1@1
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:08  , lastEmittedWatermark:2018-03-03 03:30:08
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:08  , lastEmittedWatermark:2018-03-03 03:30:08
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:08  , lastEmittedWatermark:2018-03-03 03:30:08
key1@9
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:08  , lastEmittedWatermark:2018-03-03 03:30:08
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:09  , lastEmittedWatermark:2018-03-03 03:30:08
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:09  , lastEmittedWatermark:2018-03-03 03:30:09
key:key1,  window:[ 2018-03-03 03:30:06  ,  2018-03-03 03:30:09 ), elements count:2, elements time range:[ 2018-03-03 03:30:06  ,  2018-03-03 03:30:08 ]
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:09  , lastEmittedWatermark:2018-03-03 03:30:09
key1@5
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:09  , lastEmittedWatermark:2018-03-03 03:30:09
key:key1,  window:[ 2018-03-03 03:30:03  ,  2018-03-03 03:30:06 ), elements count:3, elements time range:[ 2018-03-03 03:30:03  ,  2018-03-03 03:30:05 ]
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:09  , lastEmittedWatermark:2018-03-03 03:30:09


2)
.assignTimestampsAndWatermarks(new MyTimestampExtractor(Time.seconds(10))); //允许10秒乱序,watermark为当前接收到的最大事件时间戳减10秒
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.allowedLateness(Time.seconds(5))
[2018-03-03 03:30:00, 2018-03-03 03:30:03)  
[2018-03-03 03:30:03, 2018-03-03 03:30:06)
[2018-03-03 03:30:06, 2018-03-03 03:30:09)
key1@1
call getCurrentWatermark======currentMaxTimestamp:292269055-12-03 00:47:14  , lastEmittedWatermark:292269055-12-03 00:47:04
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:01  , lastEmittedWatermark:292269055-12-03 00:47:04
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:01  , lastEmittedWatermark:2018-03-03 03:29:51
key1@3
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:01  , lastEmittedWatermark:2018-03-03 03:29:51
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:29:51
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:29:53
key1@13
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:03  , lastEmittedWatermark:2018-03-03 03:29:53
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:13  , lastEmittedWatermark:2018-03-03 03:29:53
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:13  , lastEmittedWatermark:2018-03-03 03:30:03
key:key1,  window:[ 2018-03-03 03:30:00  ,  2018-03-03 03:30:03 ), elements count:1, elements time range:[ 2018-03-03 03:30:01  ,  2018-03-03 03:30:01 ]
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:13  , lastEmittedWatermark:2018-03-03 03:30:03
key1@1
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:13  , lastEmittedWatermark:2018-03-03 03:30:03
======extractTimestamp======,currentMaxTimestamp:2018-03-03 03:30:13  , lastEmittedWatermark:2018-03-03 03:30:03
key:key1,  window:[ 2018-03-03 03:30:00  ,  2018-03-03 03:30:03 ), elements count:2, elements time range:[ 2018-03-03 03:30:01  ,  2018-03-03 03:30:01 ]
call getCurrentWatermark======currentMaxTimestamp:2018-03-03 03:30:13  , lastEmittedWatermark:2018-03-03 03:30:03

---------------------
原文:https://blog.csdn.net/xiao_jun_0820/article/details/79786517

posted on 2019-06-03 17:48  cxhfuujust  阅读(829)  评论(0编辑  收藏  举报

导航