Flink Runtime Architecture
1. System Architecture
The Flink runtime has two main components: the JobManager and the TaskManager.
JobManager: the actual manager (master), responsible for management and scheduling. Without high availability there is only one.
TaskManager: can be understood as the worker (slave). There can be one or more.
During job submission and task processing the system looks like this:
The client is not part of the processing system; it is only responsible for submitting the job. It calls the program's main method, converts the code into a dataflow graph (Dataflow Graph), finally generates the JobGraph, and sends it to the JobManager. It can also fetch the execution status and results of the job from the JobManager. Once the TaskManagers have started, the JobManager establishes connections with them, converts the JobGraph into an executable ExecutionGraph, and distributes it to the available TaskManagers.
1. JobManager
The task management and scheduling center of a cluster. The JobManager in turn contains three components:
1. JobMaster
Responsible for handling a single job. The JobMaster receives the application to execute, including the jar package, the dataflow graph, and the JobGraph. The JobMaster converts the JobGraph into a physical-level dataflow graph called the execution graph (ExecutionGraph), which contains all tasks that can run concurrently. The JobMaster then requests the resources needed to execute the tasks from the ResourceManager; once enough resources are obtained, it distributes the execution graph to the TaskManagers that actually run it.
While the job is running, the JobMaster is responsible for all operations that need central coordination, such as coordinating checkpoints.
2. ResourceManager
Mainly responsible for resource allocation and management. "Resources" here mainly means the task slots of the TaskManagers. A task slot is the unit of resource scheduling in a Flink cluster; it contains a set of CPU and memory resources that the machine uses to execute computation. Every task must be assigned to a slot to be executed.
3. Dispatcher
Mainly provides a REST interface for submitting applications, and starts a new JobMaster component for every newly submitted job. The Dispatcher also starts a Web UI that conveniently displays and monitors information about job execution.
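For example (a hedged sketch: the JobManager host name and the jar id are placeholders, the jar name is the one used later in this post), a job can be uploaded and started through the Dispatcher's REST endpoints instead of the CLI:

# Upload the application jar to the cluster
curl -X POST -H "Expect:" -F "jarfile=@./study-flink-1.0-SNAPSHOT.jar" http://<jobmanager-host>:8081/jars/upload
# Run the uploaded jar; <jar-id> is returned by the upload call
curl -X POST "http://<jobmanager-host>:8081/jars/<jar-id>/run?entry-class=cn.qz.SocketStreamWordCount&parallelism=2"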
2. TaskManager
The worker process in Flink, also called a worker. A cluster contains one or more TaskManagers, and each TaskManager holds a certain number of task slots. The number of slots limits how many tasks a TaskManager can process in parallel.
After starting, a TaskManager registers its slots with the ResourceManager; on instruction from the ResourceManager it offers one or more slots to a JobMaster, which can then assign tasks to them.
During execution, a TaskManager can buffer data and exchange data with other TaskManagers running the same application.
2. Job Submission Flow
This can be illustrated by the following figure:
(1) The client application submits the job to the JobManager through the REST interface provided by the Dispatcher.
(2) The Dispatcher starts a JobMaster and hands the job (including the JobGraph) to it.
(3) The JobMaster parses the JobGraph into an ExecutionGraph, derives the required amount of resources, and requests those resources (slots) from the ResourceManager.
(4) The ResourceManager coordinates the resources.
(5) After starting, each TaskManager registers its available slots with the ResourceManager.
(6) The ResourceManager tells the TaskManagers to provide slots for the new job.
(7) The TaskManagers connect to the corresponding JobMaster and offer their slots.
(8) The JobMaster distributes the tasks to be executed to the TaskManagers.
(9) The TaskManagers execute the tasks and may exchange data with each other.
3. Key Concepts
These core concepts help answer:
1) How are tasks derived from a Flink program?
2) How many tasks does a stream processing program actually contain?
3) How many slots are needed to execute those tasks?
1. Dataflow Graph
Flink is a streaming computation framework, and a Flink program is essentially a sequence of operations; every incoming record passes through each step once. Each operation is called an "operator". You can think of the program as a pipeline built from operators, with data flowing through it in order like water.
Every program consists of three parts: sources (source operators, responsible for reading data), transformations (transformation operators, responsible for processing data), and sinks (sink operators, responsible for emitting the output).
At runtime, a Flink program is mapped to a graph in which all operators are connected in their logical order; this graph is called the logical dataflow (dataflow graph). A dataflow graph is an arbitrary directed acyclic graph (DAG). Every dataflow in the graph starts with one or more sources and ends with one or more sinks.
In code, apart from sources and sinks, an API call whose return type is SingleOutputStreamOperator counts as an operator; otherwise it is not counted as an operator and should be seen only as an intermediate transformation. For example, keyBy returns a KeyedStream, so it is not an operator, whereas org.apache.flink.streaming.api.datastream.KeyedStream#sum(int) is an operator.
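A minimal illustration of that distinction (a sketch that assumes singleOutputStreamOperator is the Tuple2<String, Long> stream from the socket word-count program shown later in this post): keyBy only repartitions the stream and returns a KeyedStream, while sum returns a SingleOutputStreamOperator and therefore shows up as an operator in the plan.

// keyBy: not an operator, just a hash repartition of the stream
KeyedStream<Tuple2<String, Long>, String> keyed = singleOutputStreamOperator.keyBy(t -> t.f0);

// sum: an operator (internally named "Keyed Aggregation")
SingleOutputStreamOperator<Tuple2<String, Long>> summed = keyed.sum(1);

// print: a sink operator (internally named "Print to Std. Out")
summed.print();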
Common operators:
source: reading a text file, a socket, a custom input, etc.
transformation: flatMap, map, filter, process, and also the aggregations sum, max, maxBy, min, minBy, etc. (all of which are named "Keyed Aggregation")
sink: print, printToErr, writeAsText, writeAsCsv, etc.
From the source of org.apache.flink.streaming.api.datastream.DataStream you can see that every operator has a specific name:

// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
// Excerpt: only the members that show how operators are named are kept; imports and the remaining members are elided.
package org.apache.flink.streaming.api.datastream;

@Public
public class DataStream<T> {

    protected final StreamExecutionEnvironment environment;
    protected final Transformation<T> transformation;

    public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper, TypeInformation<R> outputType) {
        return this.transform("Map", outputType, new StreamMap<>(this.clean(mapper)));
    }

    public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper, TypeInformation<R> outputType) {
        return this.transform("Flat Map", outputType, new StreamFlatMap<>(this.clean(flatMapper)));
    }

    @Internal
    public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction, TypeInformation<R> outputType) {
        ProcessOperator<T, R> operator = new ProcessOperator<>(this.clean(processFunction));
        return this.transform("Process", outputType, operator);
    }

    public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter) {
        return this.transform("Filter", this.getType(), new StreamFilter<>(this.clean(filter)));
    }

    @PublicEvolving
    public DataStreamSink<T> print() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>();
        return this.addSink(printFunction).name("Print to Std. Out");
    }

    @PublicEvolving
    public DataStreamSink<T> printToErr() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>(true);
        return this.addSink(printFunction).name("Print to Std. Err");
    }

    @PublicEvolving
    public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {
        return this.doTransform(operatorName, outTypeInfo, SimpleOperatorFactory.of(operator));
    }

    protected <R> SingleOutputStreamOperator<R> doTransform(String operatorName, TypeInformation<R> outTypeInfo, StreamOperatorFactory<R> operatorFactory) {
        OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
                this.transformation, operatorName, operatorFactory, outTypeInfo, this.environment.getParallelism());
        SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator<>(this.environment, resultTransform);
        this.getExecutionEnvironment().addOperator(resultTransform);
        return returnStream;
    }

    public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction) {
        StreamSink<T> sinkOperator = new StreamSink<>(this.clean(sinkFunction));
        DataStreamSink<T> sink = new DataStreamSink<>(this, sinkOperator);
        this.getExecutionEnvironment().addOperator(sink.getTransformation());
        return sink;
    }

    // ... union/connect/keyBy/partitioning/window/write*/executeAndCollect and other members elided ...
}
The aggregation operator APIs in org.apache.flink.streaming.api.datastream.KeyedStream:

// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
// Excerpt: only the aggregation-related members are kept; imports and the remaining members are elided.
// Note that sum/min/max/minBy/maxBy all go through aggregate(), which names the operator "Keyed Aggregation".
package org.apache.flink.streaming.api.datastream;

@Public
public class KeyedStream<T, KEY> extends DataStream<T> {

    private final KeySelector<T, KEY> keySelector;
    private final TypeInformation<KEY> keyType;

    public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> reducer) {
        ReduceTransformation<T, KEY> reduce = new ReduceTransformation<>(
                "Keyed Reduce", this.environment.getParallelism(), this.transformation,
                this.clean(reducer), this.keySelector, this.getKeyType());
        this.getExecutionEnvironment().addOperator(reduce);
        return new SingleOutputStreamOperator<>(this.getExecutionEnvironment(), reduce);
    }

    public SingleOutputStreamOperator<T> sum(int positionToSum) {
        return this.aggregate(new SumAggregator<>(positionToSum, this.getType(), this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> min(int positionToMin) {
        return this.aggregate(new ComparableAggregator<>(positionToMin, this.getType(), AggregationType.MIN, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> max(int positionToMax) {
        return this.aggregate(new ComparableAggregator<>(positionToMax, this.getType(), AggregationType.MAX, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> minBy(int positionToMinBy, boolean first) {
        return this.aggregate(new ComparableAggregator<>(positionToMinBy, this.getType(), AggregationType.MINBY, first, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy, boolean first) {
        return this.aggregate(new ComparableAggregator<>(positionToMaxBy, this.getType(), AggregationType.MAXBY, first, this.getExecutionConfig()));
    }

    protected SingleOutputStreamOperator<T> aggregate(AggregationFunction<T> aggregate) {
        return this.reduce(aggregate).name("Keyed Aggregation");
    }

    // ... key validation, process/KeyedProcess, windows, interval joins, queryable state and other members elided ...
}
Take the earlier socket example. In the Show Plan view, each box can be read as one task (a task may consist of multiple subtasks, and the number of subtasks can be understood as the task's parallelism). Why the operators are merged the way they are becomes clear after the section on operator chaining (one-to-one operators with the same parallelism are chained together).
(1) With the parallelism set to 2, the Show Plan looks like this:
(2) With the parallelism set to 1, the Show Plan looks like this:
2. Parallelism
1. What is parallel computation?
What we really want is "data parallelism": when multiple records arrive at the same time, we can read them in at the same time and run flatMap and other operations on them on different nodes.
2. Parallel subtasks and parallelism
To achieve parallel execution, an operator is replicated to multiple nodes, and an incoming record is processed by any one of the copies. In this way one operator task is split into several parallel subtasks, which are distributed to different nodes, giving true parallel computation.
During Flink execution, each operator can consist of one or more subtasks, which run completely independently in different threads, on different physical machines, or in different containers.
The number of subtasks of a particular operator is called its parallelism. The parallelism of a stream program can be taken as the maximum parallelism among all of its operators. Different operators in the same program may have different parallelism.
As shown in the figure below:
The dataflow here has four operators: source, map, window, and sink. Except for the final sink, every operator has a parallelism of 2. The whole program therefore contains 7 subtasks and needs at least 2 partitions to execute, so the parallelism of this program can be considered to be 2.
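A minimal sketch of how subtasks are counted (host/port are placeholders and the imports are the same as in the SocketStreamWordCount program later in this post; a socket source is always single-parallelism, so the numbers differ slightly from the figure):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);                                   // default parallelism of the job

env.socketTextStream("localhost", 7777)                  // socket source: always 1 subtask
        .flatMap((String line, Collector<String> words) ->
                Arrays.stream(line.split(" ")).forEach(words::collect))
        .returns(Types.STRING)                           // flatMap: 2 subtasks
        .map(word -> Tuple2.of(word, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))  // map: 2 subtasks
        .keyBy(t -> t.f0)
        .sum(1)                                          // keyed aggregation: 2 subtasks
        .print().setParallelism(1);                      // sink: 1 subtask

// 1 + 2 + 2 + 2 + 1 = 8 subtasks in total; the program's parallelism is the maximum, i.e. 2.
env.execute("subtask-count-demo");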
3. Setting the parallelism
Settings follow a "nearest wins" rule: the setting closest to the operator takes effect first.
(1) Set in code
// Global setting
executionEnvironment.setParallelism(3);

// Per-operator setting
txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .setParallelism(3);
(2) Set at submission time (it can also be set in the Web UI)
./flink-1.13.0/bin/flink run -c cn.qz.SocketStreamWordCount -p 2 ./study-flink-1.0-SNAPSHOT.jar
(3) Change the default parallelism directly in the cluster configuration file flink-conf.yaml:
parallelism.default: 1
None of these settings is mandatory; they are resolved from nearest to farthest (per-operator setting > env-level setting > -p at submission > config default). Note that some operators ignore a configured parallelism; for example, the operator reading a socket text stream does not support parallel execution. In a development environment the default parallelism is the number of CPU cores of the current machine (the default number of task slots is also the number of CPU cores).
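A small sketch of that precedence (the value 8 is arbitrary; imports as in the SocketStreamWordCount program below): the operator-level call wins over the env-level setting, which in turn wins over -p and flink-conf.yaml.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);                               // env level: overrides -p and parallelism.default

env.socketTextStream("localhost", 7777)              // socket source: always runs with parallelism 1
        .flatMap((String line, Collector<String> words) ->
                Arrays.stream(line.split(" ")).forEach(words::collect))
        .returns(Types.STRING)
        .setParallelism(8)                           // operator level: this flatMap gets 8 subtasks, not 2
        .print();                                    // no explicit setting: inherits the env-level value, 2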
4. Test example
Again using the socket stream as the example.
(1) Submit with a parallelism of 2 and look at the tasks:
As shown above, the name column shows the name of each operator (the names we saw assigned in the source code), followed by the number of subtasks.
(2) Input on port 7777
[root@k8smaster01 conf]# nc -l 7777
hello china and beijing
what is your name?
my name is qz.
(3) View the detailed information of the output task
View the subtask information:
(4) View the standard output on the machines running the two subtasks. Notice the index prefixed to each output line (it can be understood as the partition index, i.e. the task slot index).
Output on the machine running the first subtask:
The second machine:
My understanding here: the maximum parallelism is the maximum number of resources (task slots) a job can be given. The tasks are processed in parallel, which can be understood as parallel processing across different machines (how many threads run in parallel on each machine is examined later with task slots; for now each machine has one task slot).
Supplement: a test for points 1 and 2 above
For example, the following program:
package cn.qz;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment (stream execution environment)
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read the socket text stream
        DataStreamSource<String> txtDataSource = executionEnvironment.socketTextStream("192.168.13.107", 7777);

        // 3. Transform the data format
        SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                })
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)); // lambdas use generics; because of type erasure the type information must be declared explicitly

        // 4. Group by key
        KeyedStream<Tuple2<String, Long>, String> tuple2StringKeyedStream = singleOutputStreamOperator.keyBy(t -> t.f0);

        // 5. Sum
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = tuple2StringKeyedStream.sum(1);

        // 6. Print
        sum.print();
        System.out.println("========");

        // 7. Max
        SingleOutputStreamOperator<Tuple2<String, Long>> tuple2SingleOutputStreamOperator = singleOutputStreamOperator.keyBy(t -> t.f0).max(1);
        tuple2SingleOutputStreamOperator.printToErr();

        // 8. Execute
        executionEnvironment.execute();
    }
}
Debug it and inspect the related objects: you can see the default parallelism and the corresponding transformations.
3. Operator Chains
Looking at the plan shown in the Web UI, we notice that the number of nodes does not correspond one-to-one to the operators in the code: some nodes merge several tasks into one larger task. The reason is explained below.
1. Data transfer between operators
A data stream can be transferred between operators either in a one-to-one forwarding mode or in a redistributing mode that shuffles the partitioning; which one applies depends on the kind of operator.
(1) One-to-one (forwarding)
In this mode the stream keeps its partitioning and the order of its elements. Take the source and map operators in the figure: once the source has read a record it can hand it straight to map, without repartitioning and without reordering. This means a map subtask sees exactly the same elements, in the same order, as produced by the corresponding source subtask, preserving a one-to-one relationship. map, filter, flatMap and similar operators all have this one-to-one relationship with their input.
(2) Redistributing
In this mode the partitioning of the stream changes, for example between the map operator and the subsequent keyBy/window/apply operators in the figure, and between the keyBy/window operator and the sink.
Each operator subtask sends data to different downstream target tasks according to the transfer strategy. For example, keyBy is a grouping operation that essentially repartitions by the hash of the key; and going from the window operator with parallelism 2 to the sink with parallelism 1 uses rebalance, which spreads the data evenly across the downstream subtasks. All of these transfer modes redistribute the stream.
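A small sketch of where each mode appears in code (host/port are placeholders; imports as in the SocketStreamWordCount program above):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);

env.socketTextStream("localhost", 7777)          // socket source, parallelism 1
        // parallelism 1 -> 2: records are rebalanced (round-robin) to the map subtasks
        .map(String::trim)
        // map -> filter: same parallelism, one-to-one forwarding (and a candidate for chaining)
        .filter(line -> !line.isEmpty())
        // keyBy: hash repartition so that equal keys reach the same downstream subtask
        .keyBy(word -> word)
        .reduce((a, b) -> a + "|" + b)
        // parallelism 2 -> 1: rebalanced again before the sink
        .print().setParallelism(1);

env.execute("partitioning-demo");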
2. Operator chaining
One-to-one operators with the same parallelism can be linked directly into one larger task, so the original operators become part of the merged task. Each task is executed by one thread. This is operator chaining. After merging, the job looks like this:
After merging there are five tasks, executed in parallel by five threads. Chaining operators reduces the hand-offs between threads and improves throughput.
Flink chains operators by default. To forbid chaining or to customize it, specific settings can be applied to operators in the code:
// Disable chaining for this operator
SingleOutputStreamOperator<String> singleOutputStreamOperator = txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .disableChaining();

// Start a new chain from this operator
SingleOutputStreamOperator<String> singleOutputStreamOperator2 = txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .startNewChain();
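Chaining can also be switched off globally on the execution environment (a small sketch; this disables chaining for every operator in the job rather than for a single one):

// Disable operator chaining for the whole job
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.disableOperatorChaining();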
4. JobGraph and ExecutionGraph
The graphs Flink uses to schedule and execute a job come in four layers, in this order:
logical stream graph -> JobGraph -> ExecutionGraph -> physical graph
Taking the socket example, the transformation process is as follows:
1. Logical stream graph
The nodes of this graph generally correspond to the operator calls. It is produced on the client.
2. JobGraph
The JobGraph is the dataflow graph after optimization. The main optimization is merging qualifying nodes into a single task node, forming operator chains. This is also done on the client, and the JobGraph is handed to the JobMaster when the job is submitted.
3. ExecutionGraph
After receiving the JobGraph, the JobMaster uses it to generate the ExecutionGraph. The ExecutionGraph is the parallelized version of the JobGraph and the core data structure of scheduling. It differs from the JobGraph in that tasks are split into subtasks and the way data is passed between tasks is made explicit.
4. Physical graph
After generating the ExecutionGraph, the JobMaster distributes it to the TaskManagers. The TaskManagers deploy the tasks according to the ExecutionGraph, and the final physical execution process forms the physical graph.
The physical graph builds on the ExecutionGraph and further fixes where data is stored and exactly how it is sent and received.
5. Tasks and Task Slots
In the earlier tests our three TaskManagers had 3 task slots in total. When cn.qz.SocketStreamWordCount was submitted with a parallelism of 2, there should have been 5 tasks (1 + 2 + 2), yet only two task slots were occupied. The reason is explained below.
1. Task slots
In Flink a worker (TaskManager) is a JVM process, and as a process it can start multiple independent threads to execute multiple subtasks.
The number of such independent task-executing threads in a TaskManager is the number of task slots. It defaults to 1 and can be changed. After modifying flink-conf.yaml as follows, each node has 4 slots, so 3 nodes give 12 slots in total.
taskmanager.numberOfTaskSlots: 4
Note that slots currently only isolate memory; they do not isolate CPU. In practice the slot count should be tuned to the number of CPU cores.
2. Slot sharing
By default, Flink allows subtasks to share slots, which is why the job above can run with just 2 slots (the maximum number of parallel subtasks of any single task).
Subtasks of different task nodes may share one slot; put differently, the parallel subtasks of the same task must be placed in different slots. With a parallelism of 2, one possible placement is:
At this point you may wonder: if we want to use the computing resources as fully as possible, why run multiple tasks in one task slot (one slot handling several jobs)?
The reason is that different tasks put very different loads on the resources. source, map, and sink may take very little processing time, while transformations such as window take long (resource-intensive tasks). If every task had its own slot, the upstream source (blocked waiting for notifications from the downstream window task, effectively backpressure) and the downstream sink could sit idle for long stretches while the window tasks are overloaded, leaving resource utilization unbalanced. Hence slot sharing: putting resource-intensive and lightweight tasks into the same slot lets them balance the share of resources among themselves.
If you want a task to occupy a slot exclusively, or want only certain operators to share slots, you can configure slot sharing groups: only subtasks belonging to the same slot sharing group share slots, and tasks in different groups must be placed in different slots.
SingleOutputStreamOperator<String> singleOutputStreamOperator = txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .slotSharingGroup("1");
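A rough sketch of the effect (the group name "agg" and the parallelism values are made up for illustration): operators in different sharing groups cannot share slots, so the required slot count becomes the sum of the maximum parallelism of each group.

// Default group: source + flatMap + map (parallelism 2). Group "agg": the keyed sum and the sink after it (parallelism 2).
// Required slots = max(default group) + max("agg" group) = 2 + 2 = 4.
txtDataSource
        .flatMap((String line, Collector<String> words) ->
                Arrays.stream(line.split(" ")).forEach(words::collect))
        .returns(Types.STRING)
        .map(word -> Tuple2.of(word, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .sum(1).slotSharingGroup("agg")   // from here on, operators inherit the "agg" group unless set otherwise
        .print();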
3. Task slots and parallelism
The parallelism of a whole stream program is the largest parallelism among all of its operators, and that is also the number of slots the program needs (assuming no slot sharing groups are specified).
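For example (a sketch with made-up parallelism values; imports as in the SocketStreamWordCount program above): the largest operator parallelism below is 4, so the job needs 4 slots even though the other operators use fewer.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 7777)              // 1 subtask
        .map(String::toUpperCase).setParallelism(4)  // 4 subtasks -> job parallelism = 4
        .print().setParallelism(2);                  // 2 subtasks

// All operators are in the default sharing group, so 4 slots are enough to run this job.
env.execute("slot-demo");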