Looking into the Transformation process of the join operator, I found that it uses union and coGroup, which is rather special, so it's worth walking through carefully.
Join demo code: two streams can only be joined within a window. When joining unbounded data, a window must be specified so the unbounded data becomes bounded. Flink keeps part of each stream's data in state to perform the join, and clears data that falls outside the join window (by time, by count, or per the assigned window). Only then can the state stay at a reasonable size instead of growing without bound until it exceeds its limits and the job fails.
val join = process.join(map)
  .where(str => str)
  .equalTo(str => str)
  .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
  .apply(new JoinFunction[String, String, String] {
    override def join(first: String, second: String): String = {
      first + ";" + second
    }
  })
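For context, here is a minimal runnable sketch around that snippet. The original demo doesn't show how the two inputs named process and map are built, so the socket sources and ports below are just illustrative assumptions:

import org.apache.flink.api.common.functions.JoinFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object JoinDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // two unbounded inputs; the socket sources are placeholders, not from the original post
    val process: DataStream[String] = env.socketTextStream("localhost", 9001)
    val map: DataStream[String] = env.socketTextStream("localhost", 9002)

    val joined = process.join(map)
      .where(str => str)
      .equalTo(str => str)
      // the 1-minute tumbling window is what bounds the state kept for the join
      .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
      .apply(new JoinFunction[String, String, String] {
        override def join(first: String, second: String): String = first + ";" + second
      })

    joined.print()
    env.execute("join-demo")
  }
}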
It starts with input1.join(input2), which creates a JoinedStreams from the two streams; input1 and input2 are the left and right streams respectively.
DataStream.scala
def join[T2](otherStream: DataStream[T2]): JoinedStreams[T, T2] = {
  new JoinedStreams(this, otherStream)
}
where, equalTo, and window contain nothing of interest, so we skip them.
The apply method is the key part of the join operator:
It first uses the JoinedStreams' input1 and input2 to create a JavaJoinedStreams (a very literal name; as mentioned in the previous post, Flink's functionality is implemented in Java and the Scala API is essentially a shell over the Java implementation).
It then chains the join configuration onto the JavaJoinedStreams object; trigger, evictor, and allowedLateness are null if they were not specified.
The result is passed to asScalaStream, which converts the Java DataStream back into a Scala DataStream for subsequent use.
JoinedStreams.scala
def apply[T: TypeInformation](function: JoinFunction[T1, T2, T]): DataStream[T] = {
  // create the JavaJoinedStreams
  val join = new JavaJoinedStreams[T1, T2](input1.javaStream, input2.javaStream)

  asScalaStream(join
    .where(keySelector1)
    .equalTo(keySelector2)
    .window(windowAssigner)
    .trigger(trigger)
    .evictor(evictor)
    .allowedLateness(allowedLateness)
    // apply join
    .apply(clean(function), implicitly[TypeInformation[T]]))
}
Now look at the Java JoinedStreams' apply method: it converts the JoinedStreams into a CoGroupedStreams to handle the join operator. input1/input2, where, equalTo, and the rest are carried over directly, and finally CoGroupedStreams' apply method is invoked.
JoinedStreams.java
public <T> DataStream<T> apply(JoinFunction<T1, T2, T> function, TypeInformation<T> resultType) {
    // clean the closure
    function = input1.getExecutionEnvironment().clean(function);

    // join becomes coGroup here; input1 and input2 are carried over unchanged
    coGroupedWindowedStream = input1.coGroup(input2)
            .where(keySelector1)
            .equalTo(keySelector2)
            .window(windowAssigner)
            .trigger(trigger)
            .evictor(evictor)
            .allowedLateness(allowedLateness);

    // delegate to coGroupedWindowedStream's apply for the actual processing
    return coGroupedWindowedStream
            .apply(new JoinCoGroupFunction<>(function), resultType);
}
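The JoinCoGroupFunction used here is what adapts a JoinFunction to a CoGroupFunction: for every key and window it loops over all pairs from the two sides, which is exactly inner-join semantics. A Scala sketch of that logic (the real class is a private inner class of JoinedStreams.java, so this is an illustration rather than the actual code):

import java.lang.{Iterable => JIterable}
import org.apache.flink.api.common.functions.{CoGroupFunction, JoinFunction}
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

// sketch: wrap a JoinFunction so it can run where a CoGroupFunction is expected
class JoinCoGroupSketch[T1, T2, T](joinFunction: JoinFunction[T1, T2, T])
    extends CoGroupFunction[T1, T2, T] {

  override def coGroup(first: JIterable[T1], second: JIterable[T2], out: Collector[T]): Unit = {
    // emit one joined record per pair; if either side is empty, nothing is emitted
    for (v1 <- first.asScala; v2 <- second.asScala) {
      out.collect(joinFunction.join(v1, v2))
    }
  }
}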
CoGroupedStreams' apply method is even more interesting. It converts input1 and input2 into DataStreams of type TaggedUnion<T1, T2> by calling map(new Input1Tagger<T1, T2>()) and map(new Input2Tagger<T1, T2>()) on them, giving both streams the same type; in each emitted element only its own side carries data, and the other side is simply null.
It also combines the two streams' key selectors into a unionKeySelector.
The unioned stream is then used to create a KeyedStream with the unionKeySelector, which sets up the partitioning PartitionTransformation, and the window is generated on top.
Finally, windowedStream.apply is called.
CoGroupedStreams.java
public <T> DataStream<T> apply(CoGroupFunction<T1, T2, T> function, TypeInformation<T> resultType) {
    // clean the closure
    function = input1.getExecutionEnvironment().clean(function);

    // the UnionTypeInfo for the union: a combination of the two input types
    UnionTypeInfo<T1, T2> unionType = new UnionTypeInfo<>(input1.getType(), input2.getType());
    // the KeySelector for the union: wraps the two key selectors
    UnionKeySelector<T1, T2, KEY> unionKeySelector = new UnionKeySelector<>(keySelector1, keySelector2);

    // turn input1 into a DataStream<TaggedUnion<T1, T2>> with return type unionType
    DataStream<TaggedUnion<T1, T2>> taggedInput1 = input1
            .map(new Input1Tagger<T1, T2>())
            .setParallelism(input1.getParallelism())
            .returns(unionType);
    // turn input2 into a DataStream<TaggedUnion<T1, T2>> with return type unionType
    DataStream<TaggedUnion<T1, T2>> taggedInput2 = input2
            .map(new Input2Tagger<T1, T2>())
            .setParallelism(input2.getParallelism())
            .returns(unionType);

    // union the two streams; both now have the same type: DataStream<TaggedUnion<T1, T2>>
    DataStream<TaggedUnion<T1, T2>> unionStream = taggedInput1.union(taggedInput2);

    // we explicitly create the keyed stream to manually pass the key type information in
    // build a KeyedStream from the unioned stream (which adds the partitioning
    // PartitionTransformation), then call window to create the windowed stream
    windowedStream =
            new KeyedStream<TaggedUnion<T1, T2>, KEY>(unionStream, unionKeySelector, keyType)
                    .window(windowAssigner);

    if (trigger != null) {
        windowedStream.trigger(trigger);
    }
    if (evictor != null) {
        windowedStream.evictor(evictor);
    }
    if (allowedLateness != null) {
        windowedStream.allowedLateness(allowedLateness);
    }

    // call windowedStream.apply with a CoGroupWindowFunction as the argument
    return windowedStream.apply(new CoGroupWindowFunction<T1, T2, T, KEY, W>(function), resultType);
}
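The CoGroupWindowFunction handed to windowedStream.apply here is what undoes the union inside each window: it splits the TaggedUnion elements back into the two original sides before invoking the user's CoGroupFunction. A conceptual Scala sketch of that splitting step (TaggedUnion is a private inner class of CoGroupedStreams.java, so a local stand-in named Tagged is used here):

import scala.collection.mutable.ArrayBuffer

// local stand-in for Flink's private TaggedUnion, for illustration only
case class Tagged[T1, T2](one: T1, two: T2) {
  def isOne: Boolean = one != null
}

// sketch of CoGroupWindowFunction's core: per key and window, separate the
// unioned elements back into the two input sides, then coGroup the two groups
def splitAndCoGroup[T1, T2, T](
    values: Iterable[Tagged[T1, T2]],
    coGroup: (Seq[T1], Seq[T2]) => Seq[T]): Seq[T] = {
  val ones = ArrayBuffer.empty[T1]
  val twos = ArrayBuffer.empty[T2]
  for (v <- values) {
    if (v.isOne) ones += v.one else twos += v.two
  }
  coGroup(ones, twos)
}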
The map methods of Input1Tagger/Input2Tagger:
private static class Input1Tagger<T1, T2> implements MapFunction<T1, TaggedUnion<T1, T2>> {
    private static final long serialVersionUID = 1L;

    @Override
    public TaggedUnion<T1, T2> map(T1 value) throws Exception {
        // tag elements from input1 as the "one" side of the union
        return TaggedUnion.one(value);
    }
}

private static class Input2Tagger<T1, T2> implements MapFunction<T2, TaggedUnion<T1, T2>> {
    private static final long serialVersionUID = 1L;

    @Override
    public TaggedUnion<T1, T2> map(T2 value) throws Exception {
        // tag elements from input2 as the "two" side of the union
        return TaggedUnion.two(value);
    }
}
The one/two methods of TaggedUnion:
public static <T1, T2> TaggedUnion<T1, T2> one(T1 one) {
    // one: the right-hand value is null
    return new TaggedUnion<>(one, null);
}

public static <T1, T2> TaggedUnion<T1, T2> two(T2 two) {
    // two: the left-hand value is null
    return new TaggedUnion<>(null, two);
}
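With TaggedUnion in mind, the unionKeySelector mentioned earlier (its source is not quoted above) simply dispatches to whichever key selector matches the side that carries data. A Scala rendering of that logic, illustrative rather than the actual private class (a stand-in mirrors TaggedUnion's isOne/getOne/getTwo accessors):

import org.apache.flink.api.java.functions.KeySelector

// stand-in for the private TaggedUnion, mirroring its accessors
class TaggedUnionSketch[T1, T2](one: T1, two: T2) {
  def isOne: Boolean = one != null
  def getOne: T1 = one
  def getTwo: T2 = two
}

// sketch of UnionKeySelector: extract the key from whichever side holds data,
// so matching elements from both inputs land in the same key group
class UnionKeySelectorSketch[T1, T2, KEY](
    keySelector1: KeySelector[T1, KEY],
    keySelector2: KeySelector[T2, KEY])
    extends KeySelector[TaggedUnionSketch[T1, T2], KEY] {

  override def getKey(value: TaggedUnionSketch[T1, T2]): KEY =
    if (value.isOne) keySelector1.getKey(value.getOne)
    else keySelector2.getKey(value.getTwo)
}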
The KeyedStream constructor:
public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
    this(
            dataStream,
            // wrap the input in a PartitionTransformation with a KeyGroupStreamPartitioner
            new PartitionTransformation<>(
                    dataStream.getTransformation(),
                    new KeyGroupStreamPartitioner<>(keySelector, StreamGraphGenerator.DEFAULT_LOWER_BOUND_MAX_PARALLELISM)),
            keySelector,
            keyType);
}
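The KeyGroupStreamPartitioner created here routes each record by key group rather than by raw hash. Roughly, the mapping works like the sketch below, which mirrors KeyGroupRangeAssignment in flink-runtime (exact details may differ across versions):

import org.apache.flink.util.MathUtils

// key -> key group: murmur-hash the key's hashCode, modulo maxParallelism
def assignToKeyGroup(key: AnyRef, maxParallelism: Int): Int =
  MathUtils.murmurHash(key.hashCode) % maxParallelism

// key group -> parallel subtask index of the downstream operator
def operatorIndexForKeyGroup(maxParallelism: Int, parallelism: Int, keyGroupId: Int): Int =
  keyGroupId * parallelism / maxParallelism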
The windowedStream.apply method:
public <R> SingleOutputStreamOperator<R> apply(WindowFunction<T, R, K, W> function, TypeInformation<R> resultType) {
    function = input.getExecutionEnvironment().clean(function);
    return apply(new InternalIterableWindowFunction<>(function), resultType, function);
}
In the inner apply we see a familiar shape again: first build the operator, then the Transformation.
private <R> SingleOutputStreamOperator<R> apply(
        InternalWindowFunction<Iterable<T>, R, K, W> function,
        TypeInformation<R> resultType,
        Function originalFunction) {

    // operator name
    final String opName = generateOperatorName(windowAssigner, trigger, evictor, originalFunction, null);
    // the keySelector here is the UnionKeySelector from before
    KeySelector<T, K> keySel = input.getKeySelector();

    WindowOperator<K, T, Iterable<T>, R, W> operator;

    if (evictor != null) {
        @SuppressWarnings({"unchecked", "rawtypes"})
        TypeSerializer<StreamRecord<T>> streamRecordSerializer =
                (TypeSerializer<StreamRecord<T>>) new StreamElementSerializer(
                        input.getType().createSerializer(getExecutionEnvironment().getConfig()));

        ListStateDescriptor<StreamRecord<T>> stateDesc =
                new ListStateDescriptor<>("window-contents", streamRecordSerializer);

        operator = new EvictingWindowOperator<>(windowAssigner,
                windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
                keySel,
                input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
                stateDesc,
                function,
                trigger,
                evictor,
                allowedLateness,
                lateDataOutputTag);

    } else {
        ListStateDescriptor<T> stateDesc = new ListStateDescriptor<>("window-contents",
                input.getType().createSerializer(getExecutionEnvironment().getConfig()));

        // create the WindowOperator for the window
        operator = new WindowOperator<>(windowAssigner,
                windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
                keySel,
                input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
                stateDesc,
                function,
                trigger,
                allowedLateness,
                lateDataOutputTag);
    }

    // call transform to generate the Transformation
    return input.transform(opName, resultType, operator);
}
In the join operator's doTransform step, a OneInputTransformation is created first, and then a SingleOutputStreamOperator is built from it and returned, so the final result is a DataStream in the form of a SingleOutputStreamOperator.
@PublicEvolving
public <R> SingleOutputStreamOperator<R> transform(
        String operatorName,
        TypeInformation<R> outTypeInfo,
        OneInputStreamOperatorFactory<T, R> operatorFactory) {

    return doTransform(operatorName, outTypeInfo, operatorFactory);
}

protected <R> SingleOutputStreamOperator<R> doTransform(
        String operatorName,
        TypeInformation<R> outTypeInfo,
        StreamOperatorFactory<R> operatorFactory) {

    // read the output type of the input Transform to coax out errors about MissingTypeInfo
    transformation.getOutputType();

    OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
            this.transformation,
            operatorName,
            operatorFactory,
            outTypeInfo,
            environment.getParallelism());

    @SuppressWarnings({"unchecked", "rawtypes"})
    SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);

    getExecutionEnvironment().addOperator(resultTransform);

    return returnStream;
}
To briefly summarize the conversion process so far:
* 1 The join is first turned into a Scala JoinedStreams, then a Java JoinedStreams
* 2 The Java JoinedStreams is converted into a CoGroupedStreams
* 3 The CoGroupedStreams is converted into a union (UnionTransformation)
* 4 Which is then turned into a KeyedStream (PartitionTransformation)
* 5 And finally into a SingleOutputStreamOperator (OneInputTransformation)
It looks a bit convoluted, but overall it follows much the same pattern as before.
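If you want to verify the chain yourself, you can peek at the internal Transformation objects of the joined stream. getTransformation and getInput are internal APIs whose shape varies across Flink versions, so treat this as a hedged sketch:

import org.apache.flink.streaming.api.transformations.OneInputTransformation

// joined is the DataStream returned by the join's apply above
val transform = joined.javaStream.getTransformation
println(transform.getClass.getSimpleName) // expected: OneInputTransformation

transform match {
  case one: OneInputTransformation[_, _] =>
    // the window operator's input should be the keyed PartitionTransformation
    println(one.getInput.getClass.getSimpleName) // expected: PartitionTransformation
  case _ =>
}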