Keywords: user-defined function (UDF), operator, window, data source, watermark, event time

 

Window function types supported by Flink SQL | Realtime Compute for Apache Flink - Alibaba Cloud Help Center https://help.aliyun.com/zh/flink/developer-reference/overview-4

Time Attributes

Flink SQL supports two time attributes, Event Time and Processing Time (see the Time Attributes documentation for details). Flink can perform windowed aggregation over data based on either attribute. The differences between windowing on the two are as follows:

  • Event Time: the event time you supply, usually the original creation time of the data.

    The system closes windows based on Watermarks generated from the data's Event Time. A window ends, and its result is emitted, only when the Watermark exceeds the window-close time. If no data that advances the Watermark past that point ever flows into Flink, the window never emits output. The Watermark of a single subtask is monotonically increasing; across multiple subtasks or multiple source tables, the minimum Watermark is taken.

    Important
    • If the source table contains an out-of-order record with a future timestamp, or if some subtask or some partition of an upstream source table has no data, the window may never be triggered to close, producing abnormal results. You therefore need to set a reasonable out-of-orderness offset based on how disordered your data is, and make sure that all subtasks and all partitions of upstream source tables carry data. If a subtask or an upstream partition has no data, so the Watermark cannot advance and windows cannot close in time, you can add table.exec.source.idle-timeout: 10s to the additional Flink configuration to let windows close. For details on this parameter, see Configuration.

    • After data passes through a GroupBy, dual-stream JOIN, or OVER window node, the Watermark attribute is lost and Event Time can no longer be used for windowing.

  • Processing Time: the local system time at which the event is processed.

    Processing Time is generated by the Flink system and is not present in your original data, so you must explicitly define a Processing Time column.

    Note

    Because Processing Time is affected by how quickly events reach Flink and by the order in which Flink processes data internally, replaying (backfilling) the same data may produce different results each time.
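
A minimal Java sketch of the two attributes (all table and column names are assumptions, not from the source): the WATERMARK clause derives Event Time from order_time with a 5-second out-of-orderness offset, while proc_time is the explicitly defined Processing Time column mentioned above.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TimeAttributesSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id   STRING," +
                "  price      DECIMAL(10, 2)," +
                "  order_time TIMESTAMP(3)," +
                // Explicit Processing Time column (computed column).
                "  proc_time  AS PROCTIME()," +
                // Event Time: watermark = max(order_time) - 5s out-of-orderness offset.
                "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'datagen'" +
                ")");
    }
}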

 

 

Windows | Apache Flink https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/


In practice:

1.

env.createTemporarySystemFunction("SubstringFunction", SubstringFunction.class);
 
Registers the function under a name so it can be called from Table API and SQL queries (see the sketch below).
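
For context, a sketch of the function being registered, modeled on the Table API UDF documentation (the table name MyTable and column myField are assumptions):

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.call;

import org.apache.flink.table.functions.ScalarFunction;

// The scalar function behind the registration above.
public static class SubstringFunction extends ScalarFunction {
    public String eval(String s, Integer begin, Integer end) {
        return s.substring(begin, end);
    }
}

// Once registered, the function is addressable by name in Table API and SQL:
env.from("MyTable").select(call("SubstringFunction", $("myField"), 5, 12));
env.sqlQuery("SELECT SubstringFunction(myField, 5, 12) FROM MyTable");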
 
 

Summary:

1. Operators execute user-defined functions.

2.

Operators transform one or more DataStreams into a new DataStream. Programs can combine multiple transformations into sophisticated dataflow topologies.

In other words: transformations of state.

3. Windows are at the heart of processing infinite streams.

 

 

User-Defined Functions | Apache Flink https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/user_defined_functions/

User-Defined Functions #

Most operations require a user-defined function. This section lists different ways of how they can be specified. We also cover Accumulators, which can be used to gain insights into your Flink application.

Implementing an interface #

The most basic way is to implement one of the provided interfaces:

class MyMapFunction implements MapFunction<String, Integer> {
  public Integer map(String value) { return Integer.parseInt(value); }
}
data.map(new MyMapFunction());

Anonymous classes #

You can pass a function as an anonymous class:

data.map(new MapFunction<String, Integer> () {
  public Integer map(String value) { return Integer.parseInt(value); }
});

Java 8 Lambdas #

Flink also supports Java 8 Lambdas in the Java API.

data.filter(s -> s.startsWith("http://"));
data.reduce((i1,i2) -> i1 + i2);

Rich functions #

All transformations that require a user-defined function can instead take as argument a rich function. For example, instead of

class MyMapFunction implements MapFunction<String, Integer> {
  public Integer map(String value) { return Integer.parseInt(value); }
}

you can write

class MyMapFunction extends RichMapFunction<String, Integer> {
  public Integer map(String value) { return Integer.parseInt(value); }
}

and pass the function as usual to a map transformation:

data.map(new MyMapFunction());

Rich functions can also be defined as an anonymous class:

data.map (new RichMapFunction<String, Integer>() {
  public Integer map(String value) { return Integer.parseInt(value); }
});

Rich functions provide, in addition to the user-defined function (map, reduce, etc.), four methods: open, close, getRuntimeContext, and setRuntimeContext. These are useful for parameterizing the function (see Passing Parameters to Functions), creating and finalizing local state, accessing broadcast variables (see Broadcast Variables), and for accessing runtime information such as accumulators and counters (see Accumulators and Counters), and information on iterations (see Iterations).

Back to top

Accumulators & Counters #

Accumulators are simple constructs with an add operation and a final accumulated result, which is available after the job ended.

The most straightforward accumulator is a counter: You can increment it using the Accumulator.add(V value) method. At the end of the job Flink will sum up (merge) all partial results and send the result to the client. Accumulators are useful during debugging or if you quickly want to find out more about your data.

Flink currently has the following built-in accumulators. Each of them implements the Accumulator interface.

  • IntCounter, LongCounter, and DoubleCounter: See below for an example using a counter.
  • Histogram : A histogram implementation for a discrete number of bins. Internally it is just a map from Integer to Integer. You can use this to compute distributions of values, e.g. the distribution of words-per-line for a word count program.

How to use accumulators:

First you have to create an accumulator object (here a counter) in the user-defined transformation function where you want to use it.

private IntCounter numLines = new IntCounter();

Second you have to register the accumulator object, typically in the open() method of the rich function. Here you also define the name.

getRuntimeContext().addAccumulator("num-lines", this.numLines);

You can now use the accumulator anywhere in the operator function, including in the open() and close() methods.

this.numLines.add(1);

The overall result will be stored in the JobExecutionResult object which is returned from the execute() method of the execution environment (currently this only works if the execution waits for the completion of the job).

myJobExecutionResult.getAccumulatorResult("num-lines");
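
Putting the four snippets together, a consolidated sketch (the class name LineCounter and the surrounding job wiring are assumptions):

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class LineCounter extends RichFlatMapFunction<String, String> {

    // Step 1: create the accumulator object.
    private final IntCounter numLines = new IntCounter();

    @Override
    public void open(Configuration parameters) {
        // Step 2: register it under a job-wide unique name.
        getRuntimeContext().addAccumulator("num-lines", this.numLines);
    }

    @Override
    public void flatMap(String value, Collector<String> out) {
        // Step 3: increment anywhere in the operator function.
        this.numLines.add(1);
        out.collect(value);
    }
}

// Step 4: after env.execute() returns, read the merged result:
// JobExecutionResult result = env.execute("count lines");
// Integer total = result.getAccumulatorResult("num-lines");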

All accumulators share a single namespace per job. Thus you can use the same accumulator in different operator functions of your job. Flink will internally merge all accumulators with the same name.

A note on accumulators and iterations: Currently the result of accumulators is only available after the overall job has ended. We plan to also make the result of the previous iteration available in the next iteration. You can use Aggregators to compute per-iteration statistics and base the termination of iterations on such statistics.

Custom accumulators:

To implement your own accumulator you simply have to write your implementation of the Accumulator interface. Feel free to create a pull request if you think your custom accumulator should be shipped with Flink.

You have the choice to implement either Accumulator or SimpleAccumulator.

Accumulator<V,R> is most flexible: It defines a type V for the value to add, and a result type R for the final result. E.g. for a histogram, V is a number and R is a histogram. SimpleAccumulator is for the cases where both types are the same, e.g. for counters.

 

Overview | Apache Flink https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/operators/overview/

Operators transform one or more DataStreams into a new DataStream. Programs can combine multiple transformations into sophisticated dataflow topologies.

This section gives a description of the basic transformations, the effective physical partitioning after applying those as well as insights into Flink’s operator chaining.
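
As a hedged illustration of such a topology (host, port, and the parity key are made-up details), a few basic transformations chained together:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Integer> lengthSumsPerParity = env
        .socketTextStream("localhost", 9999)
        .filter(s -> !s.isEmpty())      // drop empty lines
        .map(String::length)            // String -> Integer
        .keyBy(len -> len % 2)          // partition by parity of the length
        .reduce(Integer::sum);          // running sum per key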

 

Windows | Apache Flink https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/operators/windows/

Windows #

Windows are at the heart of processing infinite streams. Windows split the stream into “buckets” of finite size, over which we can apply computations. This document focuses on how windowing is performed in Flink and how the programmer can benefit to the maximum from its offered functionality.

The general structure of a windowed Flink program is presented below. The first snippet refers to keyed streams, while the second to non-keyed ones. As one can see, the only difference is the keyBy(...) call for the keyed streams and the window(...) which becomes windowAll(...) for non-keyed streams. This is also going to serve as a roadmap for the rest of the page.

Keyed Windows

stream
       .keyBy(...)               <-  keyed versus non-keyed windows
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

Non-Keyed Windows

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

In the above, the commands in square brackets ([…]) are optional. This reveals that Flink allows you to customize your windowing logic in many different ways so that it best fits your needs.

Note: Evictor is still not supported in Python DataStream API.

 

Window Lifecycle #

In a nutshell, a window is created as soon as the first element that should belong to this window arrives, and the window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees removal only for time-based windows and not for other types, e.g. global windows (see Window Assigners). For example, with an event-time-based windowing strategy that creates non-overlapping (or tumbling) windows every 5 minutes and has an allowed lateness of 1 min, Flink will create a new window for the interval between 12:00 and 12:05 when the first element with a timestamp that falls into this interval arrives, and it will remove it when the watermark passes the 12:06 timestamp.

In addition, each window will have a Trigger (see Triggers) and a function (ProcessWindowFunction, ReduceFunction, or AggregateFunction) (see Window Functions) attached to it. The function will contain the computation to be applied to the contents of the window, while the Trigger specifies the conditions under which the window is considered ready for the function to be applied. A triggering policy might be something like “when the number of elements in the window is more than 4”, or “when the watermark passes the end of the window”. A trigger can also decide to purge a window’s contents any time between its creation and removal. Purging in this case only refers to the elements in the window, and not the window metadata. This means that new data can still be added to that window.

Apart from the above, you can specify an Evictor (see Evictors) which will be able to remove elements from the window after the trigger fires and before and/or after the function is applied.

In the following we go into more detail for each of the components above. We start with the required parts in the above snippet (see Keyed vs Non-Keyed Windows, Window Assigners, and Window Functions) before moving to the optional ones.

Keyed vs Non-Keyed Windows #

The first thing to specify is whether your stream should be keyed or not. This has to be done before defining the window. Using the keyBy(...) will split your infinite stream into logical keyed streams. If keyBy(...) is not called, your stream is not keyed.

In the case of keyed streams, any attribute of your incoming events can be used as a key (more details here). Having a keyed stream will allow your windowed computation to be performed in parallel by multiple tasks, as each logical keyed stream can be processed independently from the rest. All elements referring to the same key will be sent to the same parallel task.

In case of non-keyed streams, your original stream will not be split into multiple logical streams and all the windowing logic will be performed by a single task, i.e. with parallelism of 1.

Window Assigners #

After specifying whether your stream is keyed or not, the next step is to define a window assigner. The window assigner defines how elements are assigned to windows. This is done by specifying the WindowAssigner of your choice in the window(...) (for keyed streams) or the windowAll() (for non-keyed streams) call.

WindowAssigner is responsible for assigning each incoming element to one or more windows. Flink comes with pre-defined window assigners for the most common use cases, namely tumbling windows, sliding windows, session windows and global windows. You can also implement a custom window assigner by extending the WindowAssigner class. All built-in window assigners (except the global windows) assign elements to windows based on time, which can either be processing time or event time. Please take a look at our section on event time to learn about the difference between processing time and event time and how timestamps and watermarks are generated.

Time-based windows have a start timestamp (inclusive) and an end timestamp (exclusive) that together describe the size of the window. In code, Flink uses TimeWindow when working with time-based windows which has methods for querying the start- and end-timestamp and also an additional method maxTimestamp() that returns the largest allowed timestamp for a given window.

In the following, we show how Flink’s pre-defined window assigners work and how they are used in a DataStream program. The following figures visualize the workings of each assigner. The purple circles represent elements of the stream, which are partitioned by some key (in this case user 1, user 2 and user 3). The x-axis shows the progress of time.

Tumbling Windows #

A tumbling windows assigner assigns each element to a window of a specified window size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes as illustrated by the following figure.

Tumbling Windows

The following code snippets show how to use tumbling windows.

DataStream<T> input = ...;

// tumbling event-time windows
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// tumbling processing-time windows
input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// daily tumbling event-time windows offset by -8 hours.
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);

Time intervals can be specified by using one of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.

As shown in the last example, tumbling window assigners also take an optional offset parameter that can be used to change the alignment of windows. For example, without offsets hourly tumbling windows are aligned with epoch, that is you will get windows such as 1:00:00.000 - 1:59:59.999, 2:00:00.000 - 2:59:59.999 and so on. If you want to change that you can give an offset. With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999, 2:15:00.000 - 3:14:59.999 etc. An important use case for offsets is to adjust windows to timezones other than UTC-0. For example, in China you would have to specify an offset of Time.hours(-8).

Sliding Windows #

The sliding windows assigner assigns elements to windows of fixed length. Similar to a tumbling windows assigner, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows can be overlapping if the slide is smaller than the window size. In this case elements are assigned to multiple windows.

For example, you could have windows of size 10 minutes that slides by 5 minutes. With this you get every 5 minutes a window that contains the events that arrived during the last 10 minutes as depicted by the following figure.

sliding windows

The following code snippets show how to use sliding windows.

DataStream<T> input = ...;

// sliding event-time windows
input
    .keyBy(<key selector>)
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows offset by -8 hours
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);

Time intervals can be specified by using one of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.

As shown in the last example, sliding window assigners also take an optional offset parameter that can be used to change the alignment of windows. For example, without offsets hourly windows sliding by 30 minutes are aligned with epoch, that is you will get windows such as 1:00:00.000 - 1:59:59.999, 1:30:00.000 - 2:29:59.999 and so on. If you want to change that you can give an offset. With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999, 1:45:00.000 - 2:44:59.999 etc. An important use case for offsets is to adjust windows to timezones other than UTC-0. For example, in China you would have to specify an offset of Time.hours(-8).

Session Windows #

The session windows assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, in contrast to tumbling windows and sliding windows. Instead a session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred. A session window assigner can be configured with either a static session gap or with a session gap extractor function which defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.

session windows

The following code snippets show how to use session windows.

DataStream<T> input = ...;

// event-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// event-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return session gap
    }))
    .<windowed transformation>(<window function>);

// processing-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// processing-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return session gap
    }))
    .<windowed transformation>(<window function>);

Static gaps can be specified by using one of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.

Dynamic gaps are specified by implementing the SessionWindowTimeGapExtractor interface.
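
A hedged sketch of such an extractor (the UserEvent type and its premium flag are invented for illustration): premium users get a longer inactivity gap.

import org.apache.flink.streaming.api.windowing.assigners.SessionWindowTimeGapExtractor;

// Hypothetical event type for the sketch.
public class UserEvent {
    public boolean premium;
}

public class UserGapExtractor implements SessionWindowTimeGapExtractor<UserEvent> {
    @Override
    public long extract(UserEvent element) {
        // The returned gap is in milliseconds.
        return element.premium ? 30 * 60 * 1000L : 10 * 60 * 1000L;
    }
}

// Used in place of the inline lambda above:
// .window(EventTimeSessionWindows.withDynamicGap(new UserGapExtractor()))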

Since session windows do not have a fixed start and end, they are evaluated differently than tumbling and sliding windows. Internally, a session window operator creates a new window for each arriving record and merges windows together if they are closer to each other than the defined gap. In order to be mergeable, a session window operator requires a merging Trigger and a merging Window Function, such as ReduceFunction, AggregateFunction, or ProcessWindowFunction.

Global Windows #

A global windows assigner assigns all elements with the same key to the same single global window. This windowing scheme is only useful if you also specify a custom trigger. Otherwise, no computation will be performed, as the global window does not have a natural end at which we could process the aggregated elements.

global windows

The following code snippets show how to use a global window.

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .<windowed transformation>(<window function>);

Window Functions #

After defining the window assigner, we need to specify the computation that we want to perform on each of these windows. This is the responsibility of the window function, which is used to process the elements of each (possibly keyed) window once the system determines that a window is ready for processing (see triggers for how Flink determines when a window is ready).

The window function can be one of ReduceFunction, AggregateFunction, or ProcessWindowFunction. The first two can be executed more efficiently (see State Size section) because Flink can incrementally aggregate the elements for each window as they arrive. A ProcessWindowFunction gets an Iterable for all the elements contained in a window and additional meta information about the window to which the elements belong.

A windowed transformation with a ProcessWindowFunction cannot be executed as efficiently as the other cases because Flink has to buffer all elements for a window internally before invoking the function. This can be mitigated by combining a ProcessWindowFunction with a ReduceFunction, or AggregateFunction to get both incremental aggregation of window elements and the additional window metadata that the ProcessWindowFunction receives. We will look at examples for each of these variants.

ReduceFunction #

ReduceFunction specifies how two elements from the input are combined to produce an output element of the same type. Flink uses a ReduceFunction to incrementally aggregate the elements of a window.

ReduceFunction can be defined and used like this:

DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce(new ReduceFunction<Tuple2<String, Long>>() {
      public Tuple2<String, Long> reduce(Tuple2<String, Long> v1, Tuple2<String, Long> v2) {
        return new Tuple2<>(v1.f0, v1.f1 + v2.f1);
      }
    });

The above example sums up the second fields of the tuples for all elements in a window.

AggregateFunction #

An AggregateFunction is a generalized version of a ReduceFunction that has three types: an input type (IN), accumulator type (ACC), and an output type (OUT). The input type is the type of elements in the input stream and the AggregateFunction has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, for merging two accumulators into one accumulator and for extracting an output (of type OUT) from an accumulator. We will see how this works in the example below.

Same as with ReduceFunction, Flink will incrementally aggregate input elements of a window as they arrive.

An AggregateFunction can be defined and used like this:


/**
 * The accumulator is used to keep a running sum and a count. The {@code getResult} method
 * computes the average.
 */
private static class AverageAggregate
    implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {
  @Override
  public Tuple2<Long, Long> createAccumulator() {
    return new Tuple2<>(0L, 0L);
  }

  @Override
  public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
    return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
  }

  @Override
  public Double getResult(Tuple2<Long, Long> accumulator) {
    return ((double) accumulator.f0) / accumulator.f1;
  }

  @Override
  public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
    return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
  }
}

DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate());

The above example computes the average of the second field of the elements in the window.

ProcessWindowFunction #

A ProcessWindowFunction gets an Iterable containing all the elements of the window, and a Context object with access to time and state information, which enables it to provide more flexibility than other window functions. This comes at the cost of performance and resource consumption, because elements cannot be incrementally aggregated but instead need to be buffered internally until the window is considered ready for processing.

The signature of ProcessWindowFunction looks as follows:

public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window> implements Function {

    /**
     * Evaluates the window and outputs none or several elements.
     *
     * @param key The key for which this window is evaluated.
     * @param context The context in which the window is being evaluated.
     * @param elements The elements in the window being evaluated.
     * @param out A collector for emitting elements.
     *
     * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
     */
    public abstract void process(
            KEY key,
            Context context,
            Iterable<IN> elements,
            Collector<OUT> out) throws Exception;
    
    /**
     * Deletes any state in the {@code Context} when the Window expires (the watermark passes its
     * {@code maxTimestamp} + {@code allowedLateness}).
     *
     * @param context The context to which the window is being evaluated
     * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
     */
    public void clear(Context context) throws Exception {}

    /**
     * The context holding window metadata.
     */
    public abstract class Context implements java.io.Serializable {
        /**
         * Returns the window that is being evaluated.
         */
        public abstract W window();

        /** Returns the current processing time. */
        public abstract long currentProcessingTime();

        /** Returns the current event-time watermark. */
        public abstract long currentWatermark();

        /**
         * State accessor for per-key and per-window state.
         *
         * <p><b>NOTE:</b>If you use per-window state you have to ensure that you clean it up
         * by implementing {@link ProcessWindowFunction#clear(Context)}.
         */
        public abstract KeyedStateStore windowState();

        /**
         * State accessor for per-key global state.
         */
        public abstract KeyedStateStore globalState();
    }

}

The key parameter is the key that is extracted via the KeySelector that was specified for the keyBy() invocation. In case of tuple-index keys or string-field references this key type is always Tuple and you have to manually cast it to a tuple of the correct size to extract the key fields.

ProcessWindowFunction can be defined and used like this:

DataStream<Tuple2<String, Long>> input = ...;

input
  .keyBy(t -> t.f0)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .process(new MyProcessWindowFunction());

/* ... */

public class MyProcessWindowFunction 
    extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {

  @Override
  public void process(String key, Context context, Iterable<Tuple2<String, Long>> input, Collector<String> out) {
    long count = 0;
    for (Tuple2<String, Long> in: input) {
      count++;
    }
    out.collect("Window: " + context.window() + "count: " + count);
  }
}

The example shows a ProcessWindowFunction that counts the elements in a window. In addition, the window function adds information about the window to the output.

Note that using ProcessWindowFunction for simple aggregates such as count is quite inefficient. The next section shows how a ReduceFunction or AggregateFunction can be combined with a ProcessWindowFunction to get both incremental aggregation and the added information of a ProcessWindowFunction.

ProcessWindowFunction with Incremental Aggregation #

ProcessWindowFunction can be combined with either a ReduceFunction, or an AggregateFunction to incrementally aggregate elements as they arrive in the window. When the window is closed, the ProcessWindowFunction will be provided with the aggregated result. This allows it to incrementally compute windows while having access to the additional window meta information of the ProcessWindowFunction.

You can also use the legacy WindowFunction instead of ProcessWindowFunction for incremental window aggregation.

Incremental Window Aggregation with ReduceFunction #

The following example shows how an incremental ReduceFunction can be combined with a ProcessWindowFunction to return the smallest event in a window along with the start time of the window.

DataStream<SensorReading> input = ...;

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .reduce(new MyReduceFunction(), new MyProcessWindowFunction());

// Function definitions

private static class MyReduceFunction implements ReduceFunction<SensorReading> {

  public SensorReading reduce(SensorReading r1, SensorReading r2) {
      return r1.value() > r2.value() ? r2 : r1;
  }
}

private static class MyProcessWindowFunction
    extends ProcessWindowFunction<SensorReading, Tuple2<Long, SensorReading>, String, TimeWindow> {

  public void process(String key,
                    Context context,
                    Iterable<SensorReading> minReadings,
                    Collector<Tuple2<Long, SensorReading>> out) {
      SensorReading min = minReadings.iterator().next();
      out.collect(new Tuple2<Long, SensorReading>(context.window().getStart(), min));
  }
}

Incremental Window Aggregation with AggregateFunction #

The following example shows how an incremental AggregateFunction can be combined with a ProcessWindowFunction to compute the average and also emit the key and window along with the average.

DataStream<Tuple2<String, Long>> input = ...;

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate(), new MyProcessWindowFunction());

// Function definitions

/**
 * The accumulator is used to keep a running sum and a count. The {@code getResult} method
 * computes the average.
 */
private static class AverageAggregate
    implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {
  @Override
  public Tuple2<Long, Long> createAccumulator() {
    return new Tuple2<>(0L, 0L);
  }

  @Override
  public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
    return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
  }

  @Override
  public Double getResult(Tuple2<Long, Long> accumulator) {
    return ((double) accumulator.f0) / accumulator.f1;
  }

  @Override
  public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
    return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
  }
}

private static class MyProcessWindowFunction
    extends ProcessWindowFunction<Double, Tuple2<String, Double>, String, TimeWindow> {

  public void process(String key,
                    Context context,
                    Iterable<Double> averages,
                    Collector<Tuple2<String, Double>> out) {
      Double average = averages.iterator().next();
      out.collect(new Tuple2<>(key, average));
  }
}

Using per-window state in ProcessWindowFunction #

In addition to accessing keyed state (as any rich function can) a ProcessWindowFunction can also use keyed state that is scoped to the window that the function is currently processing. In this context it is important to understand what the window that per-window state is referring to is. There are different “windows” involved:

  • The window that was defined when specifying the windowed operation: This might be tumbling windows of 1 hour or sliding windows of 2 hours that slide by 1 hour.
  • An actual instance of a defined window for a given key: This might be the time window from 12:00 to 13:00 for user-id xyz. This is based on the window definition and there will be many windows based on the number of keys that the job is currently processing and based on what time slots the events fall into.

Per-window state is tied to the latter of those two. Meaning that if we process events for 1000 different keys and events for all of them currently fall into the [12:00, 13:00) time window then there will be 1000 window instances that each have their own keyed per-window state.

There are two methods on the Context object that a process() invocation receives that allow access to the two types of state:

  • globalState(), which allows access to keyed state that is not scoped to a window
  • windowState(), which allows access to keyed state that is also scoped to the window

This feature is helpful if you anticipate multiple firings for the same window, as can happen when you have late firings for data that arrives late or when you have a custom trigger that does speculative early firings. In such a case you would store information about previous firings or the number of firings in per-window state.

When using windowed state it is important to also clean up that state when a window is cleared. This should happen in the clear() method.
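
A minimal sketch (class and state names are assumptions) that uses windowState() to count how many times each window instance has fired, and cleans up in clear():

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class FiringCounter
        extends ProcessWindowFunction<Long, String, String, TimeWindow> {

    private final ValueStateDescriptor<Integer> firingsDesc =
            new ValueStateDescriptor<>("firings", Integer.class);

    @Override
    public void process(String key, Context ctx, Iterable<Long> elements,
                        Collector<String> out) throws Exception {
        // Per-window state: one counter per (key, window instance).
        ValueState<Integer> firings = ctx.windowState().getState(firingsDesc);
        int n = firings.value() == null ? 0 : firings.value();
        firings.update(n + 1);
        out.collect(key + " window " + ctx.window() + " firing #" + (n + 1));
    }

    @Override
    public void clear(Context ctx) throws Exception {
        // Clean up per-window state when the window expires.
        ctx.windowState().getState(firingsDesc).clear();
    }
}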

WindowFunction (Legacy) #

In some places where a ProcessWindowFunction can be used you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and does not have some advanced features, such as per-window keyed state. This interface will be deprecated at some point.

The signature of a WindowFunction looks as follows:

public interface WindowFunction<IN, OUT, KEY, W extends Window> extends Function, Serializable {

  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key The key for which this window is evaluated.
   * @param window The window that is being evaluated.
   * @param input The elements in the window being evaluated.
   * @param out A collector for emitting elements.
   *
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  void apply(KEY key, W window, Iterable<IN> input, Collector<OUT> out) throws Exception;
}

It can be used like this:

DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .apply(new MyWindowFunction());

Triggers #

A Trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function. Each WindowAssigner comes with a default Trigger. If the default trigger does not fit your needs, you can specify a custom trigger using trigger(...).

The trigger interface has five methods that allow a Trigger to react to different events:

  • The onElement() method is called for each element that is added to a window.
  • The onEventTime() method is called when a registered event-time timer fires.
  • The onProcessingTime() method is called when a registered processing-time timer fires.
  • The onMerge() method is relevant for stateful triggers and merges the states of two triggers when their corresponding windows merge, e.g. when using session windows.
  • Finally the clear() method performs any action needed upon removal of the corresponding window.

Two things to notice about the above methods are:

  1. The first three decide how to act on their invocation event by returning a TriggerResult. The action can be one of the following:
  • CONTINUE: do nothing,
  • FIRE: trigger the computation,
  • PURGE: clear the elements in the window, and
  • FIRE_AND_PURGE: trigger the computation and clear the elements in the window afterwards.
  2. Any of these methods can be used to register processing- or event-time timers for future actions.

Fire and Purge #

Once a trigger determines that a window is ready for processing, it fires, i.e., it returns FIRE or FIRE_AND_PURGE. This is the signal for the window operator to emit the result of the current window. Given a window with a ProcessWindowFunction, all elements are passed to the ProcessWindowFunction (possibly after passing them to an evictor). Windows with a ReduceFunction or AggregateFunction simply emit their eagerly aggregated result.

When a trigger fires, it can either FIRE or FIRE_AND_PURGE. While FIRE keeps the contents of the window, FIRE_AND_PURGE removes its content. By default, the pre-implemented triggers simply FIRE without purging the window state.

Purging will simply remove the contents of the window and will leave any potential meta-information about the window and any trigger state intact.

Default Triggers of WindowAssigners #

The default Trigger of a WindowAssigner is appropriate for many use cases. For example, all the event-time window assigners have an EventTimeTrigger as default trigger. This trigger simply fires once the watermark passes the end of a window.

The default trigger of the GlobalWindow is the NeverTrigger, which never fires. Consequently, you always have to define a custom trigger when using a GlobalWindow.

By specifying a trigger using trigger() you are overwriting the default trigger of a WindowAssigner. For example, if you specify a CountTrigger for TumblingEventTimeWindows you will no longer get window firings based on the progress of time but only by count. Right now, you have to write your own custom trigger if you want to react based on both time and count.

Built-in and Custom Triggers #

Flink comes with a few built-in triggers.

  • The (already mentioned) EventTimeTrigger fires based on the progress of event-time as measured by watermarks.
  • The ProcessingTimeTrigger fires based on processing time.
  • The CountTrigger fires once the number of elements in a window exceeds the given limit.
  • The PurgingTrigger takes as argument another trigger and transforms it into a purging one.

If you need to implement a custom trigger, you should check out the abstract Trigger class. Please note that the API is still evolving and might change in future versions of Flink.
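
As a hedged sketch of the "both time and count" case mentioned above (all names are assumptions; the state handling is modeled loosely on the built-in CountTrigger), a custom trigger could fire early every maxCount elements and still fire when the watermark passes the window end. Note this sketch does not implement onMerge(), so it is not suitable for merging (session) windows.

import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class CountOrEventTimeTrigger extends Trigger<Object, TimeWindow> {

    private final long maxCount;

    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", (a, b) -> a + b, LongSerializer.INSTANCE);

    public CountOrEventTimeTrigger(long maxCount) {
        this.maxCount = maxCount;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window,
                                   TriggerContext ctx) throws Exception {
        // Fire once the watermark passes the end of the window.
        ctx.registerEventTimeTimer(window.maxTimestamp());
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE; // early firing; window contents are kept
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
        ctx.getPartitionedState(countDesc).clear();
    }
}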

Evictors #

Flink’s windowing model allows specifying an optional Evictor in addition to the WindowAssigner and the Trigger. This can be done using the evictor(...) method (shown in the beginning of this document). The evictor has the ability to remove elements from a window after the trigger fires and before and/or after the window function is applied. To do so, the Evictor interface has two methods:

/**
 * Optionally evicts elements. Called before windowing function.
 *
 * @param elements The elements currently in the pane.
 * @param size The current number of elements in the pane.
 * @param window The {@link Window}
 * @param evictorContext The context for the Evictor
 */
void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

/**
 * Optionally evicts elements. Called after windowing function.
 *
 * @param elements The elements currently in the pane.
 * @param size The current number of elements in the pane.
 * @param window The {@link Window}
 * @param evictorContext The context for the Evictor
 */
void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

The evictBefore() contains the eviction logic to be applied before the window function, while the evictAfter() contains the one to be applied after the window function. Elements evicted before the application of the window function will not be processed by it.

Flink comes with three pre-implemented evictors. These are:

  • CountEvictor: keeps up to a user-specified number of elements from the window and discards the remaining ones from the beginning of the window buffer.
  • DeltaEvictor: takes a DeltaFunction and a threshold, computes the delta between the last element in the window buffer and each of the remaining ones, and removes the ones with a delta greater or equal to the threshold.
  • TimeEvictor: takes as argument an interval in milliseconds and for a given window, it finds the maximum timestamp max_ts among its elements and removes all the elements with timestamps smaller than max_ts - interval.

By default, all the pre-implemented evictors apply their logic before the window function.

Specifying an evictor prevents any pre-aggregation, as all the elements of a window have to be passed to the evictor before applying the computation. This means windows with evictors will create significantly more state.
Note: Evictor is still not supported in Python DataStream API.

Flink provides no guarantees about the order of the elements within a window. This implies that although an evictor may remove elements from the beginning of the window, these are not necessarily the ones that arrive first or last.
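
For orientation, a brief usage sketch reusing MyProcessWindowFunction from the window-function examples above; CountEvictor.of(100) keeps at most the last 100 elements of each window buffer before the function runs:

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.evictors.CountEvictor;
import org.apache.flink.streaming.api.windowing.time.Time;

input
    .keyBy(t -> t.f0)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .evictor(CountEvictor.of(100))
    .process(new MyProcessWindowFunction());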

Allowed Lateness #

When working with event-time windowing, it can happen that elements arrive late, i.e. the watermark that Flink uses to keep track of the progress of event-time is already past the end timestamp of a window to which an element belongs. See event time and especially late elements for a more thorough discussion of how Flink deals with event time.

By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows you to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped, and its default value is 0. Elements that arrive after the watermark has passed the end of the window, but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.

In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state, as also described in the Window Lifecycle section.

By default, the allowed lateness is set to 0. That is, elements that arrive behind the watermark will be dropped.

You can specify an allowed lateness like this:

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .<windowed transformation>(<window function>);

When using the GlobalWindows window assigner no data is ever considered late because the end timestamp of the global window is Long.MAX_VALUE.

Getting late data as a side output #

Using Flink’s side output feature you can get a stream of the data that was discarded as late.

You first need to specify that you want to get late data using sideOutputLateData(OutputTag) on the windowed stream. Then, you can get the side-output stream on the result of the windowed operation:

final OutputTag<T> lateOutputTag = new OutputTag<T>("late-data"){};

DataStream<T> input = ...;

SingleOutputStreamOperator<T> result = input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .sideOutputLateData(lateOutputTag)
    .<windowed transformation>(<window function>);

DataStream<T> lateStream = result.getSideOutput(lateOutputTag);

Late elements considerations #

When specifying an allowed lateness greater than 0, the window along with its content is kept after the watermark passes the end of the window. In these cases, when a late but not dropped element arrives, it could trigger another firing for the window. These firings are called late firings, as they are triggered by late events and in contrast to the main firing which is the first firing of the window. In case of session windows, late firings can further lead to merging of windows, as they may “bridge” the gap between two pre-existing, unmerged windows.

The elements emitted by a late firing should be treated as updated results of a previous computation, i.e., your data stream will contain multiple results for the same computation. Depending on your application, you need to take these duplicated results into account or deduplicate them.

Working with window results #

The result of a windowed operation is again a DataStream; no information about the windowed operation is retained in the result elements, so if you want to keep meta-information about the window you have to manually encode that information in the result elements in your ProcessWindowFunction. The only relevant information that is set on the result elements is the element timestamp. This is set to the maximum allowed timestamp of the processed window, which is end timestamp - 1, since the window-end timestamp is exclusive. Note that this is true for both event-time windows and processing-time windows, i.e. after a windowed operation elements always have a timestamp, but this can be an event-time timestamp or a processing-time timestamp. For processing-time windows this has no special implications, but for event-time windows this, together with how watermarks interact with windows, enables consecutive windowed operations with the same window sizes. We will cover this after taking a look at how watermarks interact with windows.

Interaction of watermarks and windows #

Before continuing in this section you might want to take a look at our section about event time and watermarks.

When watermarks arrive at the window operator this triggers two things:

  • the watermark triggers computation of all windows where the maximum timestamp (which is end-timestamp - 1) is smaller than the new watermark
  • the watermark is forwarded (as is) to downstream operations

Intuitively, a watermark “flushes” out any windows that would be considered late in downstream operations once they receive that watermark.

Consecutive windowed operations #

As mentioned before, the way the timestamp of windowed results is computed and how watermarks interact with windows allows stringing together consecutive windowed operations. This can be useful when you want to do two consecutive windowed operations where you want to use different keys but still want elements from the same upstream window to end up in the same downstream window. Consider this example:

DataStream<Integer> input = ...;

DataStream<Integer> resultsPerKey = input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .reduce(new Summer());

DataStream<Integer> globalResults = resultsPerKey
    .windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(new TopKWindowFunction());

In this example, the results for time window [0, 5) from the first operation will also end up in time window [0, 5) in the subsequent windowed operation. This allows calculating a sum per key and then calculating the top-k elements within the same window in the second operation.

Useful state size considerations #

Windows can be defined over long periods of time (such as days, weeks, or months) and therefore accumulate very large state. There are a couple of rules to keep in mind when estimating the storage requirements of your windowing computation:

  1. Flink creates one copy of each element per window to which it belongs. Given this, tumbling windows keep one copy of each element (an element belongs to exactly one window unless it is dropped late). In contrast, sliding windows create several copies of each element, as explained in the Window Assigners section. Hence, a sliding window of size 1 day and slide 1 second might not be a good idea.

  2. ReduceFunction and AggregateFunction can significantly reduce the storage requirements, as they eagerly aggregate elements and store only one value per window. In contrast, just using a ProcessWindowFunction requires accumulating all elements.

  3. Using an Evictor prevents any pre-aggregation, as all the elements of a window have to be passed through the evictor before applying the computation (see Evictors).

 

 Data Sources | Apache Flink https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/sources/

 

 

Data Sources #

This page describes Flink’s Data Source API and the concepts and architecture behind it. Read this, if you are interested in how data sources in Flink work, or if you want to implement a new Data Source.

If you are looking for pre-defined source connectors, please check the Connector Docs.

Data Source Concepts #

Core Components

A Data Source has three core components: Splits, the SplitEnumerator, and the SourceReader.

  • A Split is a portion of data consumed by the source, like a file or a log partition. Splits are the granularity by which the source distributes the work and parallelizes reading data.

  • The SourceReader requests Splits and processes them, for example by reading the file or log partition represented by the Split. The SourceReaders run in parallel on the Task Managers in the SourceOperators and produce the parallel stream of events/records.

  • The SplitEnumerator generates the Splits and assigns them to the SourceReaders. It runs as a single instance on the Job Manager and is responsible for maintaining the backlog of pending Splits and assigning them to the readers in a balanced manner.

The Source class is the API entry point that ties the above three components together.

Illustration of SplitEnumerator and SourceReader interacting

Unified Across Streaming and Batch

The Data Source API supports both unbounded streaming sources and bounded batch sources, in a unified way.

The difference between both cases is minimal: In the bounded/batch case, the enumerator generates a fixed set of splits, and each split is necessarily finite. In the unbounded streaming case, one of the two is not true (splits are not finite, or the enumerator keeps generating new splits).

Examples #

Here are some simplified conceptual examples to illustrate how the data source components interact, in streaming and batch cases.

Note that this does not accurately describe how the Kafka and File source implementations work; parts are simplified, for illustrative purposes.

Bounded File Source

The source has the URI/Path of a directory to read, and a Format that defines how to parse the files.

  • A Split is a file, or a region of a file (if the data format supports splitting the file).
  • The SplitEnumerator lists all files under the given directory path. It assigns Splits to the next reader that requests a Split. Once all Splits are assigned, it responds to requests with NoMoreSplits.
  • The SourceReader requests a Split and reads the assigned Split (file or file region) and parses it using the given Format. If it does not get another Split, but a NoMoreSplits message, it finishes.

Unbounded Streaming File Source

This source works the same way as described above, except that the SplitEnumerator never responds with NoMoreSplits and periodically lists the contents under the given URI/Path to check for new files. Once it finds new files, it generates new Splits for them and can assign them to the available SourceReaders.

Unbounded Streaming Kafka Source

The source has a Kafka Topic (or list of Topics or Topic regex) and a Deserializer to parse the records.

  • A Split is a Kafka Topic Partition.
  • The SplitEnumerator connects to the brokers to list all topic partitions involved in the subscribed topics. The enumerator can optionally repeat this operation to discover newly added topics/partitions.
  • The SourceReader reads the assigned Splits (Topic Partitions) using the KafkaConsumer and deserializes the records using the provided Deserializer. The splits (Topic Partitions) do not have an end, so the reader never reaches the end of the data.

Bounded Kafka Source

Same as above, except that each Split (Topic Partition) has a defined end offset. Once the SourceReader reaches the end offset for a Split, it finishes that Split. Once all assigned Splits are finished, the SourceReader finishes.
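
As a concrete counterpart to this description (assuming the flink-connector-kafka dependency), the FLIP-27 KafkaSource builder exposes exactly this bounded/unbounded distinction; setBounded() gives every partition split a defined end offset:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

// Unbounded: partition splits have no end, so readers never finish.
KafkaSource<String> unbounded = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

// Bounded: identical, but each split ends at the offsets current when the job starts.
KafkaSource<String> bounded = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setBounded(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();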

The Data Source API #

This section describes the major interfaces of the new Source API introduced in FLIP-27, and provides tips to the developers on the Source development.

Source #

The Source API is a factory style interface to create the following components.

  • Split Enumerator
  • Source Reader
  • Split Serializer
  • Enumerator Checkpoint Serializer

In addition to that, the Source provides the boundedness attribute of the source, so that Flink can choose the appropriate mode to run the Flink jobs.

The Source implementations should be serializable as the Source instances are serialized and uploaded to the Flink cluster at runtime.

SplitEnumerator #

The SplitEnumerator is expected to be the “brain” of the Source. Typical implementations of the SplitEnumerator do the following:

  • SourceReader registration handling
  • SourceReader failure handling
    • The addSplitsBack() method will be invoked when a SourceReader fails. The SplitEnumerator should take back the split assignments that have not been acknowledged by the failed SourceReader.
  • SourceEvent handling
    • SourceEvents are custom events sent between SplitEnumerator and SourceReader. The implementation can leverage this mechanism to perform sophisticated coordination.
  • Split discovery and assignment
    • The SplitEnumerator can assign splits to the SourceReaders in response to various events, including discovery of new splits, new SourceReader registration, SourceReader failure, etc.

SplitEnumerator can accomplish the above work with the help of the SplitEnumeratorContext which is provided to the Source on creation or restore of the SplitEnumerator. The SplitEnumeratorContext allows a SplitEnumerator to retrieve necessary information of the readers and perform coordination actions. The Source implementation is expected to pass the SplitEnumeratorContext to the SplitEnumerator instance.

While a SplitEnumerator implementation can work well in a reactive way by only taking coordination actions when its method is invoked, some SplitEnumerator implementations might want to take actions actively. For example, a SplitEnumerator may want to periodically run split discovery and assign the new splits to the SourceReaders. Such implementations may find that the callAsync() method in the SplitEnumeratorContext is handy. The code snippet below shows how the SplitEnumerator implementation can achieve that without maintaining its own threads.

class MySplitEnumerator implements SplitEnumerator<MySplit, MyCheckpoint> {
    private final long DISCOVER_INTERVAL = 60_000L;

    private final SplitEnumeratorContext<MySplit> enumContext;

    /** The Source creates instances of SplitEnumerator and provides the context. */
    MySplitEnumerator(SplitEnumeratorContext<MySplit> enumContext) {
        this.enumContext = enumContext;
    }

    /**
     * A method to discover the splits.
     */
    private List<MySplit> discoverSplits() {...}
    
    @Override
    public void start() {
        ...
        enumContext.callAsync(this::discoverSplits, (splits, thrown) -> {
            Map<Integer, List<MySplit>> assignments = new HashMap<>();
            int parallelism = enumContext.currentParallelism();
            for (MySplit split : splits) {
                int owner = split.splitId().hashCode() % parallelism;
                assignments.computeIfAbsent(owner, s -> new ArrayList<>()).add(split);
            }
            enumContext.assignSplits(new SplitsAssignment<>(assignments));
        }, 0L, DISCOVER_INTERVAL);
        ...
    }
    ...
}

SourceReader #

The SourceReader is a component running in the Task Managers to consume the records from the Splits.

The SourceReader exposes a pull-based consumption interface. A Flink task keeps calling pollNext(ReaderOutput) in a loop to poll records from the SourceReader. The return value of the pollNext(ReaderOutput) method indicates the status of the source reader.

  • MORE_AVAILABLE - The SourceReader has more records available immediately.
  • NOTHING_AVAILABLE - The SourceReader does not have more records available at this point, but may have more records in the future.
  • END_OF_INPUT - The SourceReader has exhausted all the records and reached the end of data. This means the SourceReader can be closed.

In the interest of performance, a ReaderOutput is provided to the pollNext(ReaderOutput) method, so a SourceReader can emit multiple records in a single call of pollNext() if it has to. For example, sometimes the external system works at the granularity of blocks. A block may contain multiple records but the source can only checkpoint at the block boundaries. In this case the SourceReader can emit all the records in one block at a time to the ReaderOutput. However, the SourceReader implementation should avoid emitting multiple records in a single pollNext(ReaderOutput) invocation unless necessary. This is because the task thread that is polling from the SourceReader works in an event-loop and cannot block.

All the state of a SourceReader should be maintained inside the SourceSplits which are returned at the snapshotState() invocation. Doing this allows the SourceSplits to be reassigned to other SourceReaders when needed.
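
As a rough illustration of the statuses above and of keeping all state inside the splits, here is a hedged sketch of pollNext() and snapshotState(). The fetchedRecords queue, the assignedSplits list, the allSplitsFinished() helper and the MySplit type are hypothetical; a real reader is typically fed by fetcher threads (see the Split Reader API below).

@Override
public InputStatus pollNext(ReaderOutput<Integer> output) {
    Integer record = fetchedRecords.poll();
    if (record != null) {
        output.collect(record);
        // Tell the task whether another record can be served right away.
        return fetchedRecords.isEmpty()
                ? InputStatus.NOTHING_AVAILABLE
                : InputStatus.MORE_AVAILABLE;
    }
    return allSplitsFinished()
            ? InputStatus.END_OF_INPUT
            : InputStatus.NOTHING_AVAILABLE;
}

@Override
public List<MySplit> snapshotState(long checkpointId) {
    // All reader state lives in the splits, so they can be handed to
    // another SourceReader after a failover.
    return new ArrayList<>(assignedSplits);
}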

SourceReaderContext is provided to the Source upon a SourceReader creation. It is expected that the Source will pass the context to the SourceReader instance. The SourceReader can send SourceEvent to its SplitEnumerator through the SourceReaderContext. A typical design pattern of the Source is letting the SourceReaders report their local information to the SplitEnumerator who has a global view to make decisions.

The SourceReader API is a low-level API that allows users to deal with the splits manually and have their own threading model to fetch and hand over the records. To facilitate the SourceReader implementation, Flink has provided a SourceReaderBase class which significantly reduces the amount of work needed to write a SourceReader. It is highly recommended that connector developers take advantage of the SourceReaderBase instead of writing their SourceReaders from scratch. For more details please check the Split Reader API section.

Use the Source #

In order to create a DataStream from a Source, one needs to pass the Source to a StreamExecutionEnvironment. For example,

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Source mySource = new MySource(...);

DataStream<Integer> stream = env.fromSource(
        mySource,
        WatermarkStrategy.noWatermarks(),
        "MySourceName");
...

The Split Reader API #

The core SourceReader API is fully asynchronous and requires implementations to manually manage reading splits asynchronously. However, in practice, most sources perform blocking operations, like blocking poll() calls on clients (for example the KafkaConsumer), or blocking I/O operations on distributed file systems (HDFS, S3, …). To make this compatible with the asynchronous Source API, these blocking (synchronous) operations need to happen in separate threads, which hand over the data to the asynchronous part of the reader.

The SplitReader is the high-level API for simple synchronous reading/polling-based source implementations, like file reading, Kafka, etc.

The core is the SourceReaderBase class, which takes a SplitReader and creates fetcher threads running the SplitReader, supporting different consumption threading models.

SplitReader #

The SplitReader API only has three methods:

  • A blocking fetch method to return a RecordsWithSplitIds.
  • A non-blocking method to handle split changes.
  • A non-blocking wake-up method to wake up the blocking fetch operation.

The SplitReader only focuses on reading the records from the external system and is therefore much simpler than the SourceReader. Please check the Java doc of the class for more details. A skeleton of the three methods is sketched below.
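
A hedged skeleton, assuming a hypothetical MySplit type and a blocking client for the external system; method bodies are elided in the style of the snippets above.

public class MySplitReader implements SplitReader<Integer, MySplit> {

    @Override
    public RecordsWithSplitIds<Integer> fetch() throws IOException {
        // Blocking call into the external system, returning a batch of
        // records grouped by the split they were read from.
        ...
    }

    @Override
    public void handleSplitsChanges(SplitsChange<MySplit> splitsChanges) {
        // Non-blocking: remember newly added splits so that subsequent
        // fetch() calls read from them as well.
        ...
    }

    @Override
    public void wakeUp() {
        // Non-blocking: interrupt a fetch() that is currently blocked.
        ...
    }

    @Override
    public void close() throws Exception {
        ...
    }
}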

SourceReaderBase #

It is quite common that a SourceReader implementation does the following:

  • Have a pool of threads fetching from splits of the external system in a blocking way.
  • Handle the synchronization between the internal fetching threads and other methods invocations such as pollNext(ReaderOutput).
  • Maintain the per split watermark for watermark alignment.
  • Maintain the state of each split for checkpoint.

In order to reduce the work of writing a new SourceReader, Flink provides a SourceReaderBase class to serve as a base implementation of the SourceReader. SourceReaderBase has all the above work done out of the box. To write a new SourceReader, one can just let the SourceReader implementation inherit from SourceReaderBase, fill in a few methods, and implement a high-level SplitReader.

SplitFetcherManager #

The SourceReaderBase supports a few threading models out of the box, depending on the behavior of the SplitFetcherManager it works with. The SplitFetcherManager helps create and maintain a pool of SplitFetchers each fetching with a SplitReader. It also determines how to assign splits to each split fetcher.

As an example, as illustrated below, a SplitFetcherManager may have a fixed number of threads, each fetching from some splits assigned to the SourceReader.

[Figure: one fetcher per split threading model.]

The following code snippet implements this threading model.

/**
 * A SplitFetcherManager that has a fixed number of split fetchers and assigns splits
 * to the split fetchers based on the hash code of split IDs.
 */
public class FixedSizeSplitFetcherManager<E, SplitT extends SourceSplit> 
        extends SplitFetcherManager<E, SplitT> {
    private final int numFetchers;

    public FixedSizeSplitFetcherManager(
            int numFetchers,
            FutureCompletingBlockingQueue<RecordsWithSplitIds<E>> elementsQueue,
            Supplier<SplitReader<E, SplitT>> splitReaderSupplier) {
        super(elementsQueue, splitReaderSupplier);
        this.numFetchers = numFetchers;
        // Create numFetchers split fetchers.
        for (int i = 0; i < numFetchers; i++) {
            startFetcher(createSplitFetcher());
        }
    }

    @Override
    public void addSplits(List<SplitT> splitsToAdd) {
        // Group splits by their owner fetchers.
        Map<Integer, List<SplitT>> splitsByFetcherIndex = new HashMap<>();
        splitsToAdd.forEach(split -> {
            int ownerFetcherIndex = Math.floorMod(split.hashCode(), numFetchers); // floorMod keeps the index non-negative
            splitsByFetcherIndex
                    .computeIfAbsent(ownerFetcherIndex, s -> new ArrayList<>())
                    .add(split);
        });
        // Assign the splits to their owner fetcher.
        splitsByFetcherIndex.forEach((fetcherIndex, splitsForFetcher) -> {
            fetchers.get(fetcherIndex).addSplits(splitsForFetcher);
        });
    }
}

And a SourceReader using this threading model can be created as follows:

public class FixedFetcherSizeSourceReader<E, T, SplitT extends SourceSplit, SplitStateT>
        extends SourceReaderBase<E, T, SplitT, SplitStateT> {

    public FixedFetcherSizeSourceReader(
            FutureCompletingBlockingQueue<RecordsWithSplitIds<E>> elementsQueue,
            Supplier<SplitReader<E, SplitT>> splitFetcherSupplier,
            RecordEmitter<E, T, SplitStateT> recordEmitter,
            Configuration config,
            SourceReaderContext context) {
        super(
                elementsQueue,
                new FixedSizeSplitFetcherManager<>(
                        config.getInteger(SourceConfig.NUM_FETCHERS),
                        elementsQueue,
                        splitFetcherSupplier),
                recordEmitter,
                config,
                context);
    }

    @Override
    protected void onSplitFinished(Map<String, SplitStateT> finishedSplitIds) {
        // Do something in the callback for the finished splits.
    }

    @Override
    protected SplitStateT initializedState(SplitT split) {
        ...
    }

    @Override
    protected SplitT toSplitType(String splitId, SplitStateT splitState) {
        ...
    }
}

The SourceReader implementations can also implement their own threading model easily on top of the SplitFetcherManager and SourceReaderBase.

Event Time and Watermarks #

Event Time assignment and Watermark Generation happen as part of the data sources. The event streams leaving the Source Readers have event timestamps and (during streaming execution) contain watermarks. See Timely Stream Processing for an introduction to Event Time and Watermarks.

Applications based on the legacy SourceFunction typically generate timestamps and watermarks in a separate later step via stream.assignTimestampsAndWatermarks(WatermarkStrategy). This function should not be used with the new sources, because timestamps will be already assigned, and it will override the previous split-aware watermarks.

API #

The WatermarkStrategy is passed to the Source during creation in the DataStream API and creates both the TimestampAssigner and the WatermarkGenerator.

environment.fromSource(
    Source<OUT, ?, ?> source,
    WatermarkStrategy<OUT> timestampsAndWatermarks,
    String sourceName);

The TimestampAssigner and WatermarkGenerator run transparently as part of the ReaderOutput (or SourceOutput), so source implementors do not have to implement any timestamp extraction and watermark generation code.
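
For example, a typical strategy might look like the following sketch, assuming a hypothetical MyEvent type with a getEventTime() accessor:

// Tolerate events that arrive at most 5 seconds out of order, and take
// the event timestamp from a field of the event itself.
WatermarkStrategy<MyEvent> strategy = WatermarkStrategy
        .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, recordTimestamp) -> event.getEventTime());

DataStream<MyEvent> stream = env.fromSource(mySource, strategy, "MySourceName");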

Event Timestamps #

Event timestamps are assigned in two steps:

  1. The SourceReader may attach the source record timestamp to the event, by calling SourceOutput.collect(event, timestamp). This is relevant only for data sources that are record-based and have timestamps, such as Kafka, Kinesis, Pulsar, or Pravega. Sources that are not based on records with timestamps (like files) do not have a source record timestamp. This step is part of the source connector implementation and not parameterized by the application that uses the source.

  2. The TimestampAssigner, which is configured by the application, assigns the final timestamp. The TimestampAssigner sees the original source record timestamp and the event. The assigner can use the source record timestamp or access a field of the event to obtain the final event timestamp.

This two-step approach allows users to reference both timestamps from the source systems and timestamps in the event’s data as the event timestamp.

Note: When using a data source without source record timestamps (like files) and selecting the source record timestamp as the final event timestamp, events will get a default timestamp equal to LONG_MIN (=-9,223,372,036,854,775,808).
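
For step 1, a SourceReaderBase-style implementation typically attaches the source record timestamp in its RecordEmitter. A hedged sketch, where MyRecord (with value and timestamp accessors) and MySplitState are hypothetical:

public class MyRecordEmitter implements RecordEmitter<MyRecord, Integer, MySplitState> {
    @Override
    public void emitRecord(MyRecord record, SourceOutput<Integer> output, MySplitState splitState) {
        // Emitting with the record's own timestamp makes it visible to the
        // application-configured TimestampAssigner in step 2.
        output.collect(record.getValue(), record.getTimestamp());
    }
}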

Watermark Generation #

Watermark Generators are only active during streaming execution. Batch execution deactivates Watermark Generators; all related operations described below become effectively no-ops.

The data source API supports running watermark generators individually per split. That allows Flink to observe the event time progress per split individually, which is important to handle event time skew properly and prevent idle partitions from holding back the event time progress of the entire application.

[Figure: watermark generation in a Source with two splits.]

When implementing a source connector using the Split Reader API, this is automatically handled. All implementations based on the Split Reader API have split-aware watermarks out-of-the-box.

For an implementation of the lower level SourceReader API to use split-aware watermark generation, the implementation must output events from different splits to different outputs: the Split-local SourceOutputs. Split-local outputs can be created and released on the main ReaderOutput via the createOutputForSplit(splitId) and releaseOutputForSplit(splitId) methods. Please refer to the JavaDocs of the class and methods for details.
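
Roughly, inside pollNext(ReaderOutput) this looks like the following sketch; record, timestamp and splitId are hypothetical values:

// Route each record through the output that belongs to its split, so the
// split gets its own watermark generator.
SourceOutput<Integer> splitOutput = readerOutput.createOutputForSplit(splitId);
splitOutput.collect(record, timestamp);

// Once the split is finished, release its output so that its local
// watermark no longer holds back the source's combined watermark.
readerOutput.releaseOutputForSplit(splitId);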

Split Level Watermark Alignment #

Although source operator watermark alignment is handled by the Flink runtime, the source needs to additionally implement SourceReader#pauseOrResumeSplits and SplitReader#pauseOrResumeSplits to achieve split-level watermark alignment. Split-level watermark alignment is useful when multiple splits are assigned to a source reader. By default, these implementations throw an UnsupportedOperationException if pipeline.watermark-alignment.allow-unaligned-source-splits is set to false, there is more than one split assigned, and a split exceeds the watermark alignment threshold configured by the WatermarkStrategy. SourceReaderBase contains an implementation of SourceReader#pauseOrResumeSplits, so that inheriting sources only need to implement SplitReader#pauseOrResumeSplits. See the javadocs for more implementation hints.
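
A hedged sketch of the SplitReader side, assuming a hypothetical Kafka-like client that can pause and resume consumption per partition (client and toPartitions() are hypothetical):

@Override
public void pauseOrResumeSplits(
        Collection<MySplit> splitsToPause,
        Collection<MySplit> splitsToResume) {
    // Pause splits whose watermarks have run too far ahead of the
    // alignment group ...
    client.pause(toPartitions(splitsToPause));
    // ... and resume splits that are back within the allowed drift.
    client.resume(toPartitions(splitsToResume));
}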
