Keywords: user-defined function (UDF), operator, window, data source, watermark, event time
Window Function Types Supported by Flink SQL | Realtime Compute for Apache Flink (Alibaba Cloud Help Center) https://help.aliyun.com/zh/flink/developer-reference/overview-4
Time Attributes
Flink SQL supports two time attributes, Event Time and Processing Time (see Time Attributes for details), and Flink can perform window aggregations based on either of them. The differences between windowing on the two are as follows:
- Event Time: the event time that you supply, usually the original creation time of the data.
The system closes windows based on the Watermark generated from the data's Event Time: a window ends, and emits its result, only once the Watermark exceeds the window's end time. If no data that would trigger the window to end flows into Flink, the window never produces output. The Watermark of a single subtask is monotonically increasing; across multiple subtasks or multiple source tables, the minimum Watermark is taken.
If the source table contains an out-of-order record with a future timestamp, or some subtask or some partition of an upstream source table carries no data, windows may never be triggered to end, and the result data will be abnormal. You therefore need to set a reasonable offset according to how out-of-order your data is, and ensure that every subtask and every partition of the upstream source tables carries data. If an idle subtask or partition keeps the Watermark from advancing so that windows cannot end in time, you can add
table.exec.source.idle-timeout: 10s
to the additional Flink configuration to trigger the windows to end. For the meaning of this parameter, see Configuration.
Note that after data passes through a GroupBy, a dual-stream JOIN, or an OVER window node, the Watermark attribute is lost and Event Time can no longer be used for windowing.
- Processing Time: the local system time at which an event is processed.
Processing Time is generated by the Flink system and is not present in your original data, so you must explicitly define a Processing Time column. Because Processing Time is affected by how fast events reach the Flink system and by the order in which Flink processes data internally, replaying the same data may produce different results each time.
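For intuition, here is a minimal DataStream-API sketch of the same ideas, assuming a hypothetical Event type with a getCreationTime() accessor and an input called stream: a bounded-out-of-orderness watermark plays the role of the offset above, and withIdleness() plays the role of table.exec.source.idle-timeout.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

WatermarkStrategy<Event> strategy = WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))              // tolerate 5s of disorder
    .withTimestampAssigner((event, previous) -> event.getCreationTime()) // event time taken from the data
    .withIdleness(Duration.ofSeconds(10));                               // idle partitions stop holding back the watermark

stream.assignTimestampsAndWatermarks(strategy);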
Windows | Apache Flink https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/
In practice:
1. env.createTemporarySystemFunction("SubstringFunction", SubstringFunction.class);
(a sketch of this function follows after the summary)

Summary:
1. Operators execute user-defined functions.
2. Operators transform one or more DataStreams into a new DataStream. Programs can combine multiple transformations into sophisticated dataflow topologies. In other words, transformations of state.
3. Windows are at the heart of processing infinite streams.
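The SubstringFunction registered above is modeled on the scalar-function example in Flink's Table API UDF documentation; a minimal sketch, assuming the usual (value, begin, end) substring signature:

import org.apache.flink.table.functions.ScalarFunction;

// Once registered via createTemporarySystemFunction, this is callable in SQL
// as SubstringFunction(s, begin, end).
public static class SubstringFunction extends ScalarFunction {
  public String eval(String s, Integer begin, Integer end) {
    return s.substring(begin, end);
  }
}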
User-Defined Functions | Apache Flink https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/user_defined_functions/
User-Defined Functions #
Most operations require a user-defined function. This section lists different ways of how they can be specified. We also cover Accumulators, which can be used to gain insights into your Flink application.
Implementing an interface #
The most basic way is to implement one of the provided interfaces:
class MyMapFunction implements MapFunction<String, Integer> {
public Integer map(String value) { return Integer.parseInt(value); }
}
data.map(new MyMapFunction());
Anonymous classes #
You can pass a function as an anonymous class:
data.map(new MapFunction<String, Integer>() {
public Integer map(String value) { return Integer.parseInt(value); }
});
Java 8 Lambdas #
Flink also supports Java 8 Lambdas in the Java API.
data.filter(s -> s.startsWith("http://"));
data.reduce((i1,i2) -> i1 + i2);
Rich functions #
All transformations that require a user-defined function can instead take as argument a rich function. For example, instead of
class MyMapFunction implements MapFunction<String, Integer> {
public Integer map(String value) { return Integer.parseInt(value); }
}
you can write
class MyMapFunction extends RichMapFunction<String, Integer> {
public Integer map(String value) { return Integer.parseInt(value); }
}
and pass the function as usual to a map transformation:
data.map(new MyMapFunction());
Rich functions can also be defined as an anonymous class:
data.map(new RichMapFunction<String, Integer>() {
public Integer map(String value) { return Integer.parseInt(value); }
});
Rich functions provide, in addition to the user-defined function (map, reduce, etc.), four methods: open, close, getRuntimeContext, and setRuntimeContext. These are useful for parameterizing the function (see Passing Parameters to Functions), creating and finalizing local state, accessing broadcast variables (see Broadcast Variables), and for accessing runtime information such as accumulators and counters (see Accumulators and Counters), and information on iterations (see Iterations).
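A minimal sketch of that lifecycle, with assumed names: open() runs once per parallel task instance before any records are processed, close() runs afterwards, and the runtime context exposes information such as the subtask index.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

class TaggingMapper extends RichMapFunction<String, String> {
  private transient String tag; // derived state built in open(), not serialized with the function

  @Override
  public void open(Configuration parameters) {
    // runs once per parallel instance, before the first map() call
    tag = "subtask-" + getRuntimeContext().getIndexOfThisSubtask();
  }

  @Override
  public String map(String value) {
    return tag + ":" + value;
  }

  @Override
  public void close() {
    tag = null; // release anything acquired in open()
  }
}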
Accumulators & Counters #
Accumulators are simple constructs with an add operation and a final accumulated result, which is available after the job ended.
The most straightforward accumulator is a counter: You can increment it using the Accumulator.add(V value) method. At the end of the job Flink will sum up (merge) all partial results and send the result to the client. Accumulators are useful during debugging or if you quickly want to find out more about your data.
Flink currently has the following built-in accumulators. Each of them implements the Accumulator interface.
- IntCounter , LongCounter and DoubleCounter : See below for an example using a counter.
- Histogram : A histogram implementation for a discrete number of bins. Internally it is just a map from Integer to Integer. You can use this to compute distributions of values, e.g. the distribution of words-per-line for a word count program.
How to use accumulators:
First you have to create an accumulator object (here a counter) in the user-defined transformation function where you want to use it.
private IntCounter numLines = new IntCounter();
Second you have to register the accumulator object, typically in the open() method of the rich function. Here you also define the name.
getRuntimeContext().addAccumulator("num-lines", this.numLines);
You can now use the accumulator anywhere in the operator function, including in the open() and close() methods.
this.numLines.add(1);
The overall result will be stored in the JobExecutionResult object which is returned from the execute() method of the execution environment (currently this only works if the execution waits for the completion of the job).
myJobExecutionResult.getAccumulatorResult("num-lines");
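Putting the four steps together, a minimal sketch (the flatMap pipeline, job name, and env as the execution environment are assumptions for illustration):

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

class LineCountingFunction extends RichFlatMapFunction<String, String> {
  private final IntCounter numLines = new IntCounter(); // step 1: create

  @Override
  public void open(Configuration parameters) {
    getRuntimeContext().addAccumulator("num-lines", this.numLines); // step 2: register
  }

  @Override
  public void flatMap(String value, Collector<String> out) {
    this.numLines.add(1); // step 3: use anywhere in the function
    out.collect(value);
  }
}

// step 4: read the merged result after the job completes
JobExecutionResult result = env.execute("accumulator-example");
Integer numLines = result.getAccumulatorResult("num-lines");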
All accumulators share a single namespace per job. Thus you can use the same accumulator in different operator functions of your job. Flink will internally merge all accumulators with the same name.
A note on accumulators and iterations: Currently the result of accumulators is only available after the overall job has ended. We plan to also make the result of the previous iteration available in the next iteration. You can use Aggregators to compute per-iteration statistics and base the termination of iterations on such statistics.
Custom accumulators:
To implement your own accumulator you simply have to write your implementation of the Accumulator interface. Feel free to create a pull request if you think your custom accumulator should be shipped with Flink.
You have the choice to implement either Accumulator or SimpleAccumulator. Accumulator<V,R> is most flexible: It defines a type V for the value to add, and a result type R for the final result. E.g. for a histogram, V is a number and R is a histogram. SimpleAccumulator is for the cases where both types are the same, e.g. for counters.
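As a sketch of the SimpleAccumulator case, here is a hypothetical accumulator tracking the maximum value seen, assuming the interface's add/getLocalValue/resetLocal/merge/clone methods:

import org.apache.flink.api.common.accumulators.Accumulator;
import org.apache.flink.api.common.accumulators.SimpleAccumulator;

// Value type and result type are both Long, so SimpleAccumulator fits.
public class MaxAccumulator implements SimpleAccumulator<Long> {
  private long max = Long.MIN_VALUE;

  @Override public void add(Long value) { max = Math.max(max, value); }
  @Override public Long getLocalValue() { return max; }
  @Override public void resetLocal() { max = Long.MIN_VALUE; }

  @Override
  public void merge(Accumulator<Long, Long> other) {
    // Flink merges partial results of same-named accumulators across instances
    max = Math.max(max, other.getLocalValue());
  }

  @Override
  public MaxAccumulator clone() {
    MaxAccumulator copy = new MaxAccumulator();
    copy.max = this.max;
    return copy;
  }
}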
Overview | Apache Flink https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/operators/overview/
Operators transform one or more DataStreams into a new DataStream. Programs can combine multiple transformations into sophisticated dataflow topologies.
This section gives a description of the basic transformations, the effective physical partitioning after applying those as well as insights into Flink’s operator chaining.
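On the chaining point: Flink chains compatible operators into a single task by default, and the DataStream API exposes hints to influence this. A minimal sketch, reusing MyMapFunction from the UDF section above:

// Start a new chain beginning with this operator (it may chain to successors,
// but will not be chained to its predecessor).
stream.map(new MyMapFunction()).startNewChain();

// Exclude this operator from any chain.
stream.map(new MyMapFunction()).disableChaining();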
Windows | Apache Flink https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/operators/windows/
Windows #
Windows are at the heart of processing infinite streams. Windows split the stream into “buckets” of finite size, over which we can apply computations. This document focuses on how windowing is performed in Flink and how the programmer can benefit to the maximum from its offered functionality.
The general structure of a windowed Flink program is presented below. The first snippet refers to keyed streams, while the second to non-keyed ones. As one can see, the only difference is the keyBy(...) call for the keyed streams and the window(...) which becomes windowAll(...) for non-keyed streams. This is also going to serve as a roadmap for the rest of the page.
Keyed Windows
stream
.keyBy(...) <- keyed versus non-keyed windows
.window(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
Non-Keyed Windows
stream
.windowAll(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
In the above, the commands in square brackets ([…]) are optional. This reveals that Flink allows you to customize your windowing logic in many different ways so that it best fits your needs.
Note: Evictor is still not supported in Python DataStream API.
Window Lifecycle #
In a nutshell, a window is created as soon as the first element that should belong to this window arrives, and the window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees removal only for time-based windows and not for other types, e.g. global windows (see Window Assigners). For example, with an event-time-based windowing strategy that creates non-overlapping (or tumbling) windows every 5 minutes and has an allowed lateness of 1 min, Flink will create a new window for the interval between 12:00 and 12:05 when the first element with a timestamp that falls into this interval arrives, and it will remove it when the watermark passes the 12:06 timestamp.
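A minimal sketch of that exact setup, in the placeholder style of the snippets above:

stream
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.minutes(5))) // e.g. the 12:00 - 12:05 window
    .allowedLateness(Time.minutes(1))                     // window removed once the watermark passes 12:06
    .<windowed transformation>(<window function>);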
In addition, each window will have a Trigger (see Triggers) and a function (ProcessWindowFunction, ReduceFunction, or AggregateFunction) (see Window Functions) attached to it. The function will contain the computation to be applied to the contents of the window, while the Trigger specifies the conditions under which the window is considered ready for the function to be applied. A triggering policy might be something like “when the number of elements in the window is more than 4”, or “when the watermark passes the end of the window”. A trigger can also decide to purge a window’s contents any time between its creation and removal. Purging in this case only refers to the elements in the window, and not the window metadata. This means that new data can still be added to that window.
Apart from the above, you can specify an Evictor (see Evictors) which will be able to remove elements from the window after the trigger fires and before and/or after the function is applied.
In the following we go into more detail for each of the components above. We start with the required parts in the above snippet (see Keyed vs Non-Keyed Windows, Window Assigners, and Window Functions) before moving to the optional ones.
Keyed vs Non-Keyed Windows #
The first thing to specify is whether your stream should be keyed or not. This has to be done before defining the window. Using the keyBy(...) will split your infinite stream into logical keyed streams. If keyBy(...) is not called, your stream is not keyed.
In the case of keyed streams, any attribute of your incoming events can be used as a key (more details here). Having a keyed stream will allow your windowed computation to be performed in parallel by multiple tasks, as each logical keyed stream can be processed independently from the rest. All elements referring to the same key will be sent to the same parallel task.
In case of non-keyed streams, your original stream will not be split into multiple logical streams and all the windowing logic will be performed by a single task, i.e. with parallelism of 1.
Window Assigners #
After specifying whether your stream is keyed or not, the next step is to define a window assigner. The window assigner defines how elements are assigned to windows. This is done by specifying the WindowAssigner of your choice in the window(...) (for keyed streams) or the windowAll() (for non-keyed streams) call.

A WindowAssigner is responsible for assigning each incoming element to one or more windows. Flink comes with pre-defined window assigners for the most common use cases, namely tumbling windows, sliding windows, session windows and global windows. You can also implement a custom window assigner by extending the WindowAssigner class. All built-in window assigners (except the global windows) assign elements to windows based on time, which can either be processing time or event time. Please take a look at our section on event time to learn about the difference between processing time and event time and how timestamps and watermarks are generated.

Time-based windows have a start timestamp (inclusive) and an end timestamp (exclusive) that together describe the size of the window. In code, Flink uses TimeWindow when working with time-based windows, which has methods for querying the start and end timestamp and also an additional method maxTimestamp() that returns the largest allowed timestamp for a given window.
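A hypothetical sketch of querying those bounds from inside a ProcessWindowFunction (the Event type is an assumption for illustration):

import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

class WindowBounds extends ProcessWindowFunction<Event, String, String, TimeWindow> {
  @Override
  public void process(String key, Context ctx, Iterable<Event> elements, Collector<String> out) {
    TimeWindow w = ctx.window();
    // start is inclusive, end is exclusive; maxTimestamp() is the largest allowed timestamp
    out.collect(key + " window [" + w.getStart() + ", " + w.getEnd() + ") max=" + w.maxTimestamp());
  }
}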
In the following, we show how Flink’s pre-defined window assigners work and how they are used in a DataStream program. The following figures visualize the workings of each assigner. The purple circles represent elements of the stream, which are partitioned by some key (in this case user 1, user 2 and user 3). The x-axis shows the progress of time.
Tumbling Windows #
A tumbling windows assigner assigns each element to a window of a specified window size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes as illustrated by the following figure.
The following code snippets show how to use tumbling windows.
DataStream<T> input = ...;
// tumbling event-time windows
input
.keyBy(<key selector>)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.<windowed transformation>(<window function>);
// tumbling processing-time windows
input
.keyBy(<key selector>)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.<windowed transformation>(<window function>);
// daily tumbling event-time windows offset by -8 hours.
input
.keyBy(<key selector>)
.window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
.<windowed transformation>(<window function>);
Time intervals can be specified by using one of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.
As shown in the last example, tumbling window assigners also take an optional offset parameter that can be used to change the alignment of windows. For example, without offsets hourly tumbling windows are aligned with epoch, that is you will get windows such as 1:00:00.000 - 1:59:59.999, 2:00:00.000 - 2:59:59.999 and so on. If you want to change that you can give an offset. With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999, 2:15:00.000 - 3:14:59.999 etc. An important use case for offsets is to adjust windows to timezones other than UTC-0. For example, in China you would have to specify an offset of Time.hours(-8).
Sliding Windows #
The sliding windows assigner assigns elements to windows of fixed length. Similar to a tumbling windows assigner, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows can be overlapping if the slide is smaller than the window size. In this case elements are assigned to multiple windows.
For example, you could have windows of size 10 minutes that slides by 5 minutes. With this you get every 5 minutes a window that contains the events that arrived during the last 10 minutes as depicted by the following figure.
The following code snippets show how to use sliding windows.
DataStream<T> input = ...;
// sliding event-time windows
input
.keyBy(<key selector>)
.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.<windowed transformation>(<window function>);
// sliding processing-time windows
input
.keyBy(<key selector>)
.window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.<windowed transformation>(<window function>);
// sliding processing-time windows offset by -8 hours
input
.keyBy(<key selector>)
.window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
.<windowed transformation>(<window function>);
Time intervals can be specified by using one of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.
As shown in the last example, sliding window assigners also take an optional offset parameter that can be used to change the alignment of windows. For example, without offsets hourly windows sliding by 30 minutes are aligned with epoch, that is you will get windows such as 1:00:00.000 - 1:59:59.999, 1:30:00.000 - 2:29:59.999 and so on. If you want to change that you can give an offset. With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999, 1:45:00.000 - 2:44:59.999 etc. An important use case for offsets is to adjust windows to timezones other than UTC-0. For example, in China you would have to specify an offset of Time.hours(-8).
Session Windows #
The session windows assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, in contrast to tumbling windows and sliding windows. Instead a session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred. A session window assigner can be configured with either a static session gap or with a session gap extractor function which defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.
The following code snippets show how to use session windows.
DataStream<T> input = ...;
// event-time session windows with static gap
input
.keyBy(<key selector>)
.window(EventTimeSessionWindows.withGap(Time.minutes(10)))
.<windowed transformation>(<window function>);
// event-time session windows with dynamic gap
input
.keyBy(<key selector>)
.window(EventTimeSessionWindows.withDynamicGap((element) -> {
// determine and return session gap
}))
.<windowed transformation>(<window function>);
// processing-time session windows with static gap
input
.keyBy(<key selector>)
.window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
.<windowed transformation>(<window function>);
// processing-time session windows with dynamic gap
input
.keyBy(<key selector>)
.window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
// determine and return session gap
}))
.<windowed transformation>(<window function>);
Static gaps can be specified by using one of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on. Dynamic gaps are specified by implementing the SessionWindowTimeGapExtractor interface.
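A hypothetical extractor, assuming an Event type with an isPremium() flag; the returned gap is in milliseconds:

input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withDynamicGap(
        (SessionWindowTimeGapExtractor<Event>) element ->
            element.isPremium() ? 30 * 60 * 1000L : 10 * 60 * 1000L)) // 30 min vs 10 min gap
    .<windowed transformation>(<window function>);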
Since session windows do not have a fixed start and end, they are evaluated differently than tumbling and sliding windows. Internally, a session window operator creates a new window for each arriving record and merges windows together if they are closer to each other than the defined gap. In order to be mergeable, a session window operator requires a merging Trigger and a merging Window Function, such as ReduceFunction, AggregateFunction, or ProcessWindowFunction.
Global Windows #
A global windows assigner assigns all elements with the same key to the same single global window. This windowing scheme is only useful if you also specify a custom trigger. Otherwise, no computation will be performed, as the global window does not have a natural end at which we could process the aggregated elements.
The following code snippets show how to use a global window.
DataStream<T> input = ...;
input
.keyBy(<key selector>)
.window(GlobalWindows.create())
.<windowed transformation>(<window function>);
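Because a global window has no natural end, it is the trigger that makes it emit. A sketch pairing it with the built-in CountTrigger, which fires once a key has accumulated the given number of elements:

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .trigger(CountTrigger.of(100)) // fire whenever 100 elements have accumulated for a key
    .<windowed transformation>(<window function>);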
Window Functions #
After defining the window assigner, we need to specify the computation that we want to perform on each of these windows. This is the responsibility of the window function, which is used to process the elements of each (possibly keyed) window once the system determines that a window is ready for processing (see triggers for how Flink determines when a window is ready).
The window function can be one of ReduceFunction, AggregateFunction, or ProcessWindowFunction. The first two can be executed more efficiently (see State Size section) because Flink can incrementally aggregate the elements for each window as they arrive. A ProcessWindowFunction gets an Iterable for all the elements contained in a window and additional meta information about the window to which the elements belong.

A windowed transformation with a ProcessWindowFunction cannot be executed as efficiently as the other cases because Flink has to buffer all elements for a window internally before invoking the function. This can be mitigated by combining a ProcessWindowFunction with a ReduceFunction or AggregateFunction to get both incremental aggregation of window elements and the additional window metadata that the ProcessWindowFunction receives. We will look at examples for each of these variants.
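In outline, the combined variant passes both functions together (the function names here are placeholders; reduce() and aggregate() both have overloads that accept the additional ProcessWindowFunction):

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce(new MyReduceFunction(), new MyProcessWindowFunction());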
ReduceFunction #
A ReduceFunction specifies how two elements from the input are combined to produce an output element of the same type. Flink uses a ReduceFunction to incrementally aggregate the elements of a window.

A ReduceFunction can be defined and used like this:
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.reduce(new ReduceFunction<Tuple2<String, Long>>() {
public Tuple2<String, Long> reduce(Tuple2<String, Long> v1, Tuple2<String, Long> v2) {
return new Tuple2<>(v1.f0, v1.f1 + v2.f1);
}
});
The above example sums up the second fields of the tuples for all elements in a window.
AggregateFunction #
An AggregateFunction is a generalized version of a ReduceFunction that has three types: an input type (IN), accumulator type (ACC), and an output type (OUT). The input type is the type of elements in the input stream and the AggregateFunction has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, for merging two accumulators into one accumulator and for extracting an output (of type OUT) from an accumulator. We will see how this works in the example below.

Same as with ReduceFunction, Flink will incrementally aggregate input elements of a window as they arrive.

An AggregateFunction can be defined and used like this:
/**
* The accumulator is used to keep a running sum and a count. The {@code getResult} method
* computes the average.
*/
private static class AverageAggregate
implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {
@Override
public Tuple2<Long, Long> createAccumulator() {
return new Tuple2<>(0L, 0L);
}
@Override
public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
}
@Override
public Double getResult(Tuple2<Long, Long> accumulator) {
return ((double) accumulator.f0) / accumulator.f1;
}
@Override
public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
}
}
DataStream<Tuple2<String, Long>> input = ...;
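Used like the ReduceFunction example above, the AverageAggregate would then be passed to aggregate(); a sketch:

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate());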