Keyed Process Function (KeyedProcessFunction)

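In a Flink program, aggregations, windowed computations and similar operations generally start by applying the keyBy operator to partition the stream by key, which yields a KeyedStream. You specify a key, and records are divided into groups according to the key's hash code and distributed across parallel subtasks for computation. This is a logical split of the stream that lets Flink exploit parallelism to process large volumes of data in real time. In addition, setting timers through the TimerService is only supported on a KeyedStream. For these reasons, processing logic is usually defined after a keyBy.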
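The idea of routing by hash code can be sketched in plain Java. This is a deliberately simplified model, not Flink's actual routing: Flink applies a further hash and an intermediate layer of key groups before choosing a subtask, whereas the hypothetical `subtaskFor` helper below just takes the hash code modulo an assumed parallelism. It only illustrates the property that matters here: equal keys always land on the same subtask.

```java
import java.util.List;

final class KeyPartitionSketch {
    /** Maps a key to a subtask index in [0, parallelism) using the key's hash code. */
    static int subtaskFor(Object key, int parallelism) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4; // hypothetical number of parallel subtasks
        List<String> users = List.of("alice", "bob", "alice", "carol", "bob");
        for (String user : users) {
            // Every occurrence of the same user prints the same subtask index
            System.out.println(user + " -> subtask " + subtaskFor(user, parallelism));
        }
    }
}
```

Because the mapping depends only on the key, all records for one key are seen by one subtask, which is what makes per-key state and per-key timers possible.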

1. Timers (Timer) and the timer service (TimerService)

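A distinctive feature of KeyedProcessFunction is its flexible use of timers. Timers are the main mechanism for time-related operations in process functions. The logic to run when a timer fires goes in the .onTimer() method, which is invoked only if a timer was registered earlier and its trigger time has now been reached. Timers are registered through the timer service (TimerService) provided by the context, which is tied to the current runtime environment. As introduced earlier, the Context of a ProcessFunction provides a .timerService() method that returns a TimerService object: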

public abstract TimerService timerService();

TimerService is Flink's basic service interface for time and timers. It declares the following six methods:

/** Interface for working with time and timers. */
@PublicEvolving
public interface TimerService {

    /** Error string for {@link UnsupportedOperationException} on registering timers. */
    String UNSUPPORTED_REGISTER_TIMER_MSG = "Setting timers is only supported on a keyed streams.";

    /** Error string for {@link UnsupportedOperationException} on deleting timers. */
    String UNSUPPORTED_DELETE_TIMER_MSG = "Deleting timers is only supported on a keyed streams.";

    /** Returns the current processing time. */
    long currentProcessingTime();

    /** Returns the current event-time watermark. */
    long currentWatermark();

    /**
     * Registers a timer to be fired when processing time passes the given time.
     *
     * <p>Timers can internally be scoped to keys and/or windows. When you set a timer in a keyed
     * context, such as in an operation on {@link
     * org.apache.flink.streaming.api.datastream.KeyedStream} then that context will also be active
     * when you receive the timer notification.
     */
    void registerProcessingTimeTimer(long time);

    /**
     * Registers a timer to be fired when the event time watermark passes the given time.
     *
     * <p>Timers can internally be scoped to keys and/or windows. When you set a timer in a keyed
     * context, such as in an operation on {@link
     * org.apache.flink.streaming.api.datastream.KeyedStream} then that context will also be active
     * when you receive the timer notification.
     */
    void registerEventTimeTimer(long time);

    /**
     * Deletes the processing-time timer with the given trigger time. This method has only an effect
     * if such a timer was previously registered and did not already expire.
     *
     * <p>Timers can internally be scoped to keys and/or windows. When you delete a timer, it is
     * removed from the current keyed context.
     */
    void deleteProcessingTimeTimer(long time);

    /**
     * Deletes the event-time timer with the given trigger time. This method has only an effect if
     * such a timer was previously registered and did not already expire.
     *
     * <p>Timers can internally be scoped to keys and/or windows. When you delete a timer, it is
     * removed from the current keyed context.
     */
    void deleteEventTimeTimer(long time);
}

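The six methods fall into two groups, one based on processing time and one based on event time, and each group offers three operations: get the current time, register a timer, and delete a timer. Note that although every process function can access the TimerService, only process functions on a KeyedStream may call the register and delete methods. A DataStream that has not been partitioned by key does not support timers and can only query the current time.

For both processing-time and event-time timers, the TimerService keeps the timestamps in a priority queue, where they wait to be executed. A timer can be regarded as a piece of state of the processing operator on a KeyedStream, identified by its timestamp. The TimerService therefore de-duplicates timers by key and timestamp: for each key and timestamp there is at most one timer, and even if it is registered several times, onTimer() is called only once. This makes the code much simpler, because you can register timers for a key freely without worrying about duplicates; a timer on a given timestamp fires only once.

When a timer is registered on a KeyedStream, you pass in the timestamp at which it should fire, and the timer is scoped to the key of the element being processed, so every key keeps its own timers. The code needs no extra handling; the runtime processes each key independently. This property can be exploited by deliberately lowering the precision of the timestamp to reduce the number of timers and improve performance. For example, keeping only whole seconds when setting a timer means timers fire at most once per second for each key: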

// Round down to a whole second; the result is still in milliseconds
long coalescedTime = time / 1000 * 1000;
ctx.timerService().registerProcessingTimeTimer(coalescedTime);

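Note that a timer's timestamp must be in milliseconds, which is why the whole-second value is multiplied by 1000 again. By default, timers are distinguished at millisecond granularity.

Flink invokes .onTimer() and .processElement() synchronously, so state is never modified concurrently. Timers are also fault tolerant: they are saved together with state in consistent checkpoints. After a failure, Flink restarts the job, reads the state from the checkpoint, and restores the timers. Processing-time timers may already be overdue at that point, in which case they fire immediately on restart.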
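The de-duplication and coalescing behavior described above can be mimicked in plain Java without a Flink dependency. The sketch below is only a model under stated assumptions: `TimerDedupSketch`, `register`, `advanceTo` and `coalesce` are hypothetical names, and a sorted set stands in for the internal priority queue. It shows that registering the same (key, timestamp) pair repeatedly leaves a single pending timer, and that rounding to whole seconds collapses several registrations into one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class TimerDedupSketch {
    /** A timer is identified by its key and its trigger timestamp (milliseconds). */
    static final class Timer {
        final String key;
        final long timestamp;
        Timer(String key, long timestamp) { this.key = key; this.timestamp = timestamp; }
        @Override public String toString() { return key + "@" + timestamp; }
    }

    // Ordered by timestamp, then key; being a set, duplicate (key, timestamp) pairs collapse.
    private final TreeSet<Timer> queue = new TreeSet<>(
            Comparator.comparingLong((Timer t) -> t.timestamp).thenComparing(t -> t.key));

    /** Rounds a timestamp down to a whole second, still expressed in milliseconds. */
    static long coalesce(long timeMillis) {
        return timeMillis / 1000 * 1000;
    }

    /** Registers a timer; returns false if the same (key, timestamp) is already pending. */
    boolean register(String key, long timestamp) {
        return queue.add(new Timer(key, timestamp));
    }

    /** Fires, in timestamp order, every pending timer whose timestamp is <= the given time. */
    List<Timer> advanceTo(long time) {
        List<Timer> fired = new ArrayList<>();
        while (!queue.isEmpty() && queue.first().timestamp <= time) {
            fired.add(queue.pollFirst());
        }
        return fired;
    }

    int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        TimerDedupSketch timers = new TimerDedupSketch();
        for (long t : new long[] {1001L, 1500L, 1999L}) {
            timers.register("alice", coalesce(t)); // all three coalesce to 1000
        }
        timers.register("bob", coalesce(1200L));   // a different key keeps its own timer
        System.out.println(timers.pending());        // prints 2
        System.out.println(timers.advanceTo(1000L)); // prints [alice@1000, bob@1000]
    }
}
```

Calling `advanceTo` with a time well past several pending timers returns all of them at once, which is also a rough picture of overdue processing-time timers firing immediately after a restore.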

2. Using KeyedProcessFunction

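KeyedProcessFunction is arguably the core member of the process function family, and it can be viewed as an extension of ProcessFunction. On the KeyedStream obtained from keyBy, call .process() directly; the argument to pass is an implementation of KeyedProcessFunction.

stream.keyBy(t -> t.f0).process(new MyKeyedProcessFunction());

Like ProcessFunction, KeyedProcessFunction is an abstract class that extends AbstractRichFunction, declared with three type parameters as KeyedProcessFunction<K, I, O>. Its two core methods are defined in the source as follows: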

    /**
     * Process one element from the input stream.
     *
     * <p>This function can output zero or more elements using the {@link Collector} parameter and
     * also update internal state or set timers using the {@link Context} parameter.
     *
     * @param value The input value.
     * @param ctx A {@link Context} that allows querying the timestamp of the element and getting a
     *     {@link TimerService} for registering timers and querying the time. The context is only
     *     valid during the invocation of this method, do not store it.
     * @param out The collector for returning result values.
     * @throws Exception This method may throw exceptions. Throwing an exception will cause the
     *     operation to fail and may trigger recovery.
     */
    public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;

    /**
     * Called when a timer set using {@link TimerService} fires.
     *
     * @param timestamp The timestamp of the firing timer.
     * @param ctx An {@link OnTimerContext} that allows querying the timestamp, the {@link
     *     TimeDomain}, and the key of the firing timer and getting a {@link TimerService} for
     *     registering timers and querying the time. The context is only valid during the invocation
     *     of this method, do not store it.
     * @param out The collector for returning result values.
     * @throws Exception This method may throw exceptions. Throwing an exception will cause the
     *     operation to fail and may trigger recovery.
     */
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {}

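As you can see, the definition is almost identical to that of ProcessFunction; the only difference is the extra type parameter K, the type of the key used for partitioning. As before, you must implement the abstract .processElement() method, which handles each element of the stream. The non-abstract .onTimer() method defines the callback to run when a timer fires. Because timers can only be used on a KeyedStream, it is only with KeyedProcessFunction that you gain fine-grained control over time and .onTimer() becomes truly useful. The following is a concrete example that uses a processing-time timer: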

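import java.sql.Timestamp;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Event and ClickSource are the custom record type and source used by this example;
// they are not shown in this section. The class name is illustrative.
public class ProcessingTimeTimerTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        DataStreamSource<Event> eventDS = env.addSource(new ClickSource());

        eventDS.keyBy(data -> data.user)
                // Type parameters: KeyedProcessFunction<K, I, O> = <key, input, output>
                .process(new KeyedProcessFunction<String, Event, String>() {
                    @Override
                    public void processElement(Event value, Context ctx, Collector<String> out) throws Exception {
                        // Current processing time in milliseconds
                        long currTs = ctx.timerService().currentProcessingTime();
                        out.collect(ctx.getCurrentKey() + " data arrived at -> " + new Timestamp(currTs));
                        // Register a processing-time timer 10 seconds from now
                        ctx.timerService().registerProcessingTimeTimer(currTs + 10 * 1000);
                    }

                    @Override
                    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                        // timestamp is the time the timer was registered for
                        out.collect(ctx.getCurrentKey() + " timer fired at -> " + new Timestamp(timestamp));
                    }
                }).print();
        env.execute();
    }
}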

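In the code above, keyBy comes first because timers can only be used on a KeyedStream. Here .keyBy(data -> data.user) uses the user name as the key, so each user's records, and the timers they register, are handled under that user's key. The custom KeyedProcessFunction then does two things: .processElement() is called once for every record and registers a timer 10 seconds ahead, and .onTimer() is called when a timer fires. When the program runs, the console first prints the arrival message for a record and, 10 seconds later, the timer message for the same key; the two printed times are exactly 10 seconds apart. Since this example uses processing-time timers, you really do have to wait 10 seconds to see the result. What changes under event-time semantics? A slightly modified version of the code serves as a test: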

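import java.sql.Timestamp;
import java.time.Duration;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Event and ClickSource are the custom record type and source used by this example;
// they are not shown in this section. The class name is illustrative.
public class EventTimeTimerTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        SingleOutputStreamOperator<Event> eventDS = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                                .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                                    @Override
                                    public long extractTimestamp(Event element, long recordTimestamp) {
                                        // Event time comes from the record itself
                                        return element.timestamp;
                                    }
                                }));

        eventDS.keyBy(data -> data.user)
                .process(new KeyedProcessFunction<String, Event, String>() {
                    @Override
                    public void processElement(Event value, Context ctx, Collector<String> out) throws Exception {
                        // Current watermark (Long.MIN_VALUE until the first watermark is emitted).
                        // Note: this is the watermark, not the record's own timestamp.
                        long currTs = ctx.timerService().currentWatermark();
                        out.collect(ctx.getCurrentKey() + " data arrival timestamp -> " + new Timestamp(currTs) + " watermark " + ctx.timerService().currentWatermark());
                        // Register an event-time timer 10 seconds after the current watermark
                        ctx.timerService().registerEventTimeTimer(currTs + 10 * 1000);
                    }

                    // Called when the watermark reaches a registered timer
                    @Override
                    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                        out.collect(ctx.getCurrentKey() + " timer fired at -> " + new Timestamp(timestamp));
                    }
                }).print();

        env.execute();
    }
}

Because this uses event-time semantics, the timestamp must be extracted from each record, which is what the timestamp assigner does. Note what this listing actually measures: for every record it prints the current watermark (once as a Timestamp and once as a raw long) and registers an event-time timer 10 seconds after that watermark. Running the program produced the following output:

令狐冲 data arrival timestamp -> 292278994-08-17 15:12:55.192 watermark -9223372036854775808
令狐冲 timer fired at -> 292269055-12-03 00:47:14.192
任盈盈 data arrival timestamp -> 2022-07-06 12:31:55.509 watermark 1657081915509
莫大 data arrival timestamp -> 2022-07-06 12:31:56.521 watermark 1657081916521
依琳 data arrival timestamp -> 2022-07-06 12:31:57.529 watermark 1657081917529
任盈盈 data arrival timestamp -> 2022-07-06 12:31:58.53 watermark 1657081918530
令狐冲 data arrival timestamp -> 2022-07-06 12:31:59.545 watermark 1657081919545
风清扬 data arrival timestamp -> 2022-07-06 12:32:00.55 watermark 1657081920550
风清扬 data arrival timestamp -> 2022-07-06 12:32:01.551 watermark 1657081921551
依琳 data arrival timestamp -> 2022-07-06 12:32:02.554 watermark 1657081922554
风清扬 data arrival timestamp -> 2022-07-06 12:32:03.567 watermark 1657081923567
任盈盈 data arrival timestamp -> 2022-07-06 12:32:04.577 watermark 1657081924577
任盈盈 timer fired at -> 2022-07-06 12:32:05.509
任盈盈 data arrival timestamp -> 2022-07-06 12:32:05.587 watermark 1657081925587
莫大 timer fired at -> 2022-07-06 12:32:06.521
风清扬 data arrival timestamp -> 2022-07-06 12:32:06.599 watermark 1657081926599
依琳 timer fired at -> 2022-07-06 12:32:07.529
莫大 data arrival timestamp -> 2022-07-06 12:32:07.61 watermark 1657081927610
任盈盈 timer fired at -> 2022-07-06 12:32:08.53
莫大 data arrival timestamp -> 2022-07-06 12:32:08.623 watermark 1657081928623
令狐冲 timer fired at -> 2022-07-06 12:32:09.545
依琳 data arrival timestamp -> 2022-07-06 12:32:09.633 watermark 1657081929633
风清扬 timer fired at -> 2022-07-06 12:32:10.55
令狐冲 data arrival timestamp -> 2022-07-06 12:32:10.643 watermark 1657081930643
风清扬 timer fired at -> 2022-07-06 12:32:11.551
风清扬 data arrival timestamp -> 2022-07-06 12:32:11.653 watermark 1657081931653
依琳 timer fired at -> 2022-07-06 12:32:12.554

In the first line the watermark is still Long.MIN_VALUE, because no watermark has been emitted yet. The timer registered from that value is already due as soon as the first real watermark arrives, which explains the timer line that follows immediately. After that, each timer fires once the watermark passes the value printed on arrival plus 10 seconds: 任盈盈's first record saw watermark 1657081915509, and her timer fired at 12:32:05.509, that is, at 1657081925509.

The lag between data and watermark is easier to follow in a smaller test, which differs from the listing above. A custom source emits three records with timestamps 1000, 11000 and 11001, pausing 5 seconds after each; no watermark delay is configured; all records share one key; and for each record the function prints the record's timestamp and the current watermark, then registers an event-time timer 10 seconds after the record's timestamp. In that run each record produces two lines of arrival output followed by a separator line, and the records arrive 5 seconds apart. Right after the third record arrives, one timer message is printed; 5 seconds later the remaining two timer messages are printed and the program ends.

Notice that when a record arrives, the current watermark does not match its timestamp. When the first record arrives with timestamp 1000, the watermark is still Long.MIN_VALUE, because watermarks are generated periodically (every 200 ms by default) and do not change immediately. At the next generation point the watermark is computed from the largest timestamp seen so far, 1000. No delay is configured, but 1 ms is still subtracted by default, so the watermark advances to 999. When the second record, with timestamp 11000, arrives, the watermark again has not moved yet and is still 999; it always appears to lag behind the data.

This explains the program's behavior. Under event-time semantics, a timer fires when the watermark reaches its timestamp. The first record registers a timer at 1000 + 10 * 1000 = 11000. When the second record arrives the watermark is 999, so nothing fires; the watermark then advances to 10999, which is still short of 11000. Only the third record, with timestamp 11001, pushes the watermark to 11000 and fires the first timer. Five seconds after the third record there is no more data and the program is about to exit; at that point Flink automatically advances the watermark to the maximum long value (Long.MAX_VALUE). All timers that have not yet fired, here those at 21000 and 21001, fire together, which produces the last two timer messages on the console.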
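The three-record walk-through (timestamps 1000, 11000 and 11001, timers 10 seconds after each timestamp) can be replayed in plain Java without a Flink cluster. This is a rough model rather than Flink's implementation: `WatermarkTimerSketch` and its methods are hypothetical names, a single key is assumed so a timer is just a timestamp, and it is assumed that exactly one periodic watermark emission happens between consecutive records, which holds in the test because the 5-second pause is much longer than the 200 ms interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class WatermarkTimerSketch {
    /** Replays the records and returns a log of arrivals and timer firings. */
    static List<String> run(long[] timestamps, long outOfOrdernessMillis, long timerDelayMillis) {
        List<String> log = new ArrayList<>();
        TreeSet<Long> timers = new TreeSet<>(); // single key, so a timer is just a timestamp
        long watermark = Long.MIN_VALUE;        // value before the first periodic emission
        long maxTimestamp = Long.MIN_VALUE;

        for (long ts : timestamps) {
            // On arrival the watermark still reflects the previous emission
            log.add("arrive ts=" + ts + " watermark=" + watermark);
            timers.add(ts + timerDelayMillis);
            maxTimestamp = Math.max(maxTimestamp, ts);

            // Next periodic emission: largest timestamp minus the delay, minus 1 ms
            watermark = maxTimestamp - outOfOrdernessMillis - 1;
            fire(timers, watermark, log);
        }
        // End of input: the watermark jumps to Long.MAX_VALUE and flushes remaining timers
        fire(timers, Long.MAX_VALUE, log);
        return log;
    }

    /** Fires every pending timer whose timestamp is <= the watermark, earliest first. */
    private static void fire(TreeSet<Long> timers, long watermark, List<String> log) {
        while (!timers.isEmpty() && timers.first() <= watermark) {
            log.add("fire timer=" + timers.pollFirst() + " watermark=" + watermark);
        }
    }

    public static void main(String[] args) {
        run(new long[] {1000L, 11000L, 11001L}, 0L, 10_000L).forEach(System.out::println);
    }
}
```

In this model the second record sees watermark 999 and the third sees 10999; the timer at 11000 fires only after the third record moves the watermark to 11000, and the timers at 21000 and 21001 fire at the end of input.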

Posted 2022-07-12 09:15 by 晓枫的春天
