Flink Runtime Architecture

1. System Architecture

The Flink runtime has two major components: the JobManager (job manager) and the TaskManager (task manager).

JobManager: the real manager (master), responsible for management and scheduling. Without high availability there is only one.

TaskManager: can be understood as the worker (slave). There can be one or more.

The system during job submission and task processing looks like this:

  The client is not part of the processing system; it is only responsible for submitting jobs. It calls the program's main method, converts the code into a dataflow graph (Dataflow Graph), finally generates a job graph (JobGraph), and sends it to the JobManager. It can also fetch the execution status and results of a job from the JobManager. Once a TaskManager starts, the JobManager establishes a connection with it, converts the job graph (JobGraph) into an executable execution graph (ExecutionGraph), and distributes it to the available TaskManagers.
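  A minimal sketch of the client's role (assuming Flink 1.13-style APIs and a reachable cluster; the address and job name below are made up for illustration). The client only declares the pipeline in main(); executeAsync() turns it into a JobGraph, submits it, and returns a JobClient through which the client can ask the JobManager for the job's status:

import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ClientSubmitDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The client only declares the pipeline; no data is processed on the client side.
        env.socketTextStream("192.168.13.107", 7777).print();
        // executeAsync() generates and submits the JobGraph and returns a handle to the running job.
        JobClient jobClient = env.executeAsync("client-demo");
        // The client can query the JobManager for the job's execution status.
        System.out.println(jobClient.getJobID());
        System.out.println(jobClient.getJobStatus().get()); // e.g. RUNNING
    }
}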

1. JobManager

    The center of task management and scheduling in a cluster. The JobManager in turn contains three components:

1. JobMaster

  Responsible for handling an individual job. The JobMaster receives the application to execute, which includes the JAR package, the dataflow graph, and the job graph. The JobMaster converts the JobGraph into a physical-level dataflow graph called the execution graph (which contains all tasks that can be executed concurrently). The JobMaster then sends a request to the ResourceManager to apply for the resources necessary to execute the tasks; once enough resources are obtained, it distributes the execution graph to the TaskManagers that actually run it.

  While the job runs, the JobMaster is responsible for all operations that need central coordination, such as coordinating checkpoints.

2. ResourceManager

  Mainly responsible for allocating and managing resources. The resources here mainly refer to the task slots of the TaskManagers. A task slot is the unit of resource scheduling in a Flink cluster; it contains a set of CPU and memory resources that a machine uses to perform computation. Every task must be assigned to a slot to execute.

3. Dispatcher

  Mainly responsible for providing a REST interface for submitting applications, and for starting a new JobMaster component for every newly submitted job. The Dispatcher also starts a web UI to conveniently display and monitor information about job execution.
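  As a small illustration of that REST interface (a sketch assuming a local cluster on the default REST port 8081; adjust the address for your setup), plain JDK HTTP is enough to query the jobs overview endpoint:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestOverviewDemo {
    public static void main(String[] args) throws Exception {
        // Default REST address of a local Flink cluster; adjust host/port as needed.
        URL url = new URL("http://localhost:8081/jobs/overview");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // a JSON list of jobs and their states
            }
        }
    }
}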

2. TaskManager

  The worker process in Flink. A cluster contains one or more TaskManagers, and each TaskManager holds a certain number of task slots. The number of slots limits how many tasks a TaskManager can process in parallel.

  After starting, a TaskManager registers its slots with the ResourceManager; upon the ResourceManager's instruction, it offers one or more slots to the JobMaster, which uses them to assign tasks.

  During execution, a TaskManager can buffer data and exchange data with other TaskManagers running the same application.

2. Job Submission Flow

   It can be illustrated by the figure below:

 

(1) The client app submits the job to the JobManager via the REST interface provided by the Dispatcher.

(2) The Dispatcher starts a JobMaster and hands the job (including the JobGraph) to it.

(3) The JobMaster parses the JobGraph into an ExecutionGraph, determines the amount of resources required, and requests resources (slots) from the ResourceManager.

(4) The ResourceManager coordinates the resources.

(5) When TaskManagers start, they register their available slots with the ResourceManager.

(6) The ResourceManager tells the TaskManagers to offer slots for the new job.

(7) The TaskManagers connect to the corresponding JobMaster and offer their slots.

(8) The JobMaster distributes the tasks to be executed to the TaskManagers.

(9) The TaskManagers execute the tasks and can exchange data with one another.

3. Key Concepts

  These core concepts answer the following questions:

1) How do we get tasks from a Flink program?

2) How many tasks does a stream processing program actually contain?

3) How many slots are needed to finally execute those tasks?

1. Dataflow Graph

  Flink is a streaming computation framework; a program's structure is really just a chain of defined operations. Each incoming record triggers each step of the computation once. Each operation is called an "operator", so we can think of a program as a pipeline made of operators, with data flowing through it in order, like water.

  All programs consist of three parts: sources (source operators, which read data), transformations (transformation operators, which process data), and sinks (sink operators, which emit the output).

  At runtime, a Flink program is mapped to a graph in which all operators are joined in their logical order; this graph is called the logical dataflow, or dataflow graph. A dataflow graph can be an arbitrary directed acyclic graph (DAG). Every dataflow in the graph starts with one or more sources and ends with one or more sinks.

  In the code, aside from sources and sinks, an API call whose return value is SingleOutputStreamOperator can be regarded as an operator; otherwise it does not count as one (it is only an intermediate transformation). For example, keyBy returns a KeyedStream, so it is not an operator, while org.apache.flink.streaming.api.datastream.KeyedStream#sum(int) is an operator.
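  To see this distinction in runnable form (a minimal sketch; the input data and job name are made up for illustration):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorVsTransformation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        SingleOutputStreamOperator<Tuple2<String, Long>> wordCounts = env
                .fromElements("hello", "flink", "hello")
                .map(w -> Tuple2.of(w, 1L))                       // map is an operator
                .returns(Types.TUPLE(Types.STRING, Types.LONG));

        // keyBy only re-partitions the stream: it returns a KeyedStream, not an operator.
        KeyedStream<Tuple2<String, Long>, String> keyed = wordCounts.keyBy(t -> t.f0);

        // sum is an aggregation operator (named "Keyed Aggregation" in the source shown below).
        SingleOutputStreamOperator<Tuple2<String, Long>> summed = keyed.sum(1);

        summed.print(); // print is a sink
        env.execute("operator-vs-transformation");
    }
}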

  Common operators:

source: reading text files, sockets, custom inputs, etc.

transformation: processing operations such as flatMap, map, filter, and process; sum, max, maxBy, min, minBy, etc. are all aggregation operators (each named "Keyed Aggregation")

sink: print, printToErr, writeAsText, writeAsCsv, etc.

From the source code of org.apache.flink.streaming.api.datastream.DataStream we can see that every operator has a specific name:

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
//

package org.apache.flink.streaming.api.datastream;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.flink.annotation.Experimental;
import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.Public;
import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.api.common.operators.Keys;
import org.apache.flink.api.common.operators.ResourceSpec;
import org.apache.flink.api.common.operators.Keys.ExpressionKeys;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.connector.sink.Sink;
import org.apache.flink.api.dag.Transformation;
import org.apache.flink.api.java.Utils;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.io.CsvOutputFormat;
import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.typeutils.InputTypeConfigurable;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.fs.FileSystem.WriteMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.api.functions.sink.SocketClientSink;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperatorFactory;
import org.apache.flink.streaming.api.operators.ProcessOperator;
import org.apache.flink.streaming.api.operators.SimpleOperatorFactory;
import org.apache.flink.streaming.api.operators.StreamFilter;
import org.apache.flink.streaming.api.operators.StreamFlatMap;
import org.apache.flink.streaming.api.operators.StreamMap;
import org.apache.flink.streaming.api.operators.StreamOperatorFactory;
import org.apache.flink.streaming.api.operators.StreamSink;
import org.apache.flink.streaming.api.operators.collect.ClientAndIterator;
import org.apache.flink.streaming.api.operators.collect.CollectResultIterator;
import org.apache.flink.streaming.api.operators.collect.CollectSinkOperator;
import org.apache.flink.streaming.api.operators.collect.CollectSinkOperatorFactory;
import org.apache.flink.streaming.api.operators.collect.CollectStreamSink;
import org.apache.flink.streaming.api.transformations.OneInputTransformation;
import org.apache.flink.streaming.api.transformations.PartitionTransformation;
import org.apache.flink.streaming.api.transformations.TimestampsAndWatermarksTransformation;
import org.apache.flink.streaming.api.transformations.UnionTransformation;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
import org.apache.flink.streaming.api.windowing.evictors.CountEvictor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.streaming.runtime.operators.util.AssignerWithPeriodicWatermarksAdapter.Strategy;
import org.apache.flink.streaming.runtime.partitioner.BroadcastPartitioner;
import org.apache.flink.streaming.runtime.partitioner.CustomPartitionerWrapper;
import org.apache.flink.streaming.runtime.partitioner.ForwardPartitioner;
import org.apache.flink.streaming.runtime.partitioner.GlobalPartitioner;
import org.apache.flink.streaming.runtime.partitioner.RebalancePartitioner;
import org.apache.flink.streaming.runtime.partitioner.RescalePartitioner;
import org.apache.flink.streaming.runtime.partitioner.ShufflePartitioner;
import org.apache.flink.streaming.runtime.partitioner.StreamPartitioner;
import org.apache.flink.streaming.util.keys.KeySelectorUtil;
import org.apache.flink.util.CloseableIterator;
import org.apache.flink.util.Preconditions;

@Public
public class DataStream<T> {
    protected final StreamExecutionEnvironment environment;
    protected final Transformation<T> transformation;

    public DataStream(StreamExecutionEnvironment environment, Transformation<T> transformation) {
        this.environment = (StreamExecutionEnvironment)Preconditions.checkNotNull(environment, "Execution Environment must not be null.");
        this.transformation = (Transformation)Preconditions.checkNotNull(transformation, "Stream Transformation must not be null.");
    }

    @Internal
    public int getId() {
        return this.transformation.getId();
    }

    public int getParallelism() {
        return this.transformation.getParallelism();
    }

    @PublicEvolving
    public ResourceSpec getMinResources() {
        return this.transformation.getMinResources();
    }

    @PublicEvolving
    public ResourceSpec getPreferredResources() {
        return this.transformation.getPreferredResources();
    }

    public TypeInformation<T> getType() {
        return this.transformation.getOutputType();
    }

    protected <F> F clean(F f) {
        return this.getExecutionEnvironment().clean(f);
    }

    public StreamExecutionEnvironment getExecutionEnvironment() {
        return this.environment;
    }

    public ExecutionConfig getExecutionConfig() {
        return this.environment.getConfig();
    }

    @SafeVarargs
    public final DataStream<T> union(DataStream<T>... streams) {
        List<Transformation<T>> unionedTransforms = new ArrayList();
        unionedTransforms.add(this.transformation);
        DataStream[] var3 = streams;
        int var4 = streams.length;

        for(int var5 = 0; var5 < var4; ++var5) {
            DataStream<T> newStream = var3[var5];
            if (!this.getType().equals(newStream.getType())) {
                throw new IllegalArgumentException("Cannot union streams of different types: " + this.getType() + " and " + newStream.getType());
            }

            unionedTransforms.add(newStream.getTransformation());
        }

        return new DataStream(this.environment, new UnionTransformation(unionedTransforms));
    }

    public <R> ConnectedStreams<T, R> connect(DataStream<R> dataStream) {
        return new ConnectedStreams(this.environment, this, dataStream);
    }

    @PublicEvolving
    public <R> BroadcastConnectedStream<T, R> connect(BroadcastStream<R> broadcastStream) {
        return new BroadcastConnectedStream(this.environment, this, (BroadcastStream)Preconditions.checkNotNull(broadcastStream), broadcastStream.getBroadcastStateDescriptors());
    }

    public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key) {
        Preconditions.checkNotNull(key);
        return new KeyedStream(this, (KeySelector)this.clean(key));
    }

    public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key, TypeInformation<K> keyType) {
        Preconditions.checkNotNull(key);
        Preconditions.checkNotNull(keyType);
        return new KeyedStream(this, (KeySelector)this.clean(key), keyType);
    }

    /** @deprecated */
    @Deprecated
    public KeyedStream<T, Tuple> keyBy(int... fields) {
        return !(this.getType() instanceof BasicArrayTypeInfo) && !(this.getType() instanceof PrimitiveArrayTypeInfo) ? this.keyBy((Keys)(new ExpressionKeys(fields, this.getType()))) : this.keyBy((KeySelector)KeySelectorUtil.getSelectorForArray(fields, this.getType()));
    }

    /** @deprecated */
    @Deprecated
    public KeyedStream<T, Tuple> keyBy(String... fields) {
        return this.keyBy((Keys)(new ExpressionKeys(fields, this.getType())));
    }

    private KeyedStream<T, Tuple> keyBy(Keys<T> keys) {
        return new KeyedStream(this, (KeySelector)this.clean(KeySelectorUtil.getSelectorForKeys(keys, this.getType(), this.getExecutionConfig())));
    }

    /** @deprecated */
    @Deprecated
    public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, int field) {
        ExpressionKeys<T> outExpressionKeys = new ExpressionKeys(new int[]{field}, this.getType());
        return this.partitionCustom(partitioner, (Keys)outExpressionKeys);
    }

    /** @deprecated */
    @Deprecated
    public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, String field) {
        ExpressionKeys<T> outExpressionKeys = new ExpressionKeys(new String[]{field}, this.getType());
        return this.partitionCustom(partitioner, (Keys)outExpressionKeys);
    }

    public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, KeySelector<T, K> keySelector) {
        return this.setConnectionType(new CustomPartitionerWrapper((Partitioner)this.clean(partitioner), (KeySelector)this.clean(keySelector)));
    }

    private <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, Keys<T> keys) {
        KeySelector<T, K> keySelector = KeySelectorUtil.getSelectorForOneKey(keys, partitioner, this.getType(), this.getExecutionConfig());
        return this.setConnectionType(new CustomPartitionerWrapper((Partitioner)this.clean(partitioner), (KeySelector)this.clean(keySelector)));
    }

    public DataStream<T> broadcast() {
        return this.setConnectionType(new BroadcastPartitioner());
    }

    @PublicEvolving
    public BroadcastStream<T> broadcast(MapStateDescriptor<?, ?>... broadcastStateDescriptors) {
        Preconditions.checkNotNull(broadcastStateDescriptors);
        DataStream<T> broadcastStream = this.setConnectionType(new BroadcastPartitioner());
        return new BroadcastStream(this.environment, broadcastStream, broadcastStateDescriptors);
    }

    @PublicEvolving
    public DataStream<T> shuffle() {
        return this.setConnectionType(new ShufflePartitioner());
    }

    public DataStream<T> forward() {
        return this.setConnectionType(new ForwardPartitioner());
    }

    public DataStream<T> rebalance() {
        return this.setConnectionType(new RebalancePartitioner());
    }

    @PublicEvolving
    public DataStream<T> rescale() {
        return this.setConnectionType(new RescalePartitioner());
    }

    @PublicEvolving
    public DataStream<T> global() {
        return this.setConnectionType(new GlobalPartitioner());
    }

    @PublicEvolving
    public IterativeStream<T> iterate() {
        return new IterativeStream(this, 0L);
    }

    @PublicEvolving
    public IterativeStream<T> iterate(long maxWaitTimeMillis) {
        return new IterativeStream(this, maxWaitTimeMillis);
    }

    public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper) {
        TypeInformation<R> outType = TypeExtractor.getMapReturnTypes((MapFunction)this.clean(mapper), this.getType(), Utils.getCallLocationName(), true);
        return this.map(mapper, outType);
    }

    public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper, TypeInformation<R> outputType) {
        return this.transform("Map", outputType, (OneInputStreamOperator)(new StreamMap((MapFunction)this.clean(mapper))));
    }

    public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {
        TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes((FlatMapFunction)this.clean(flatMapper), this.getType(), Utils.getCallLocationName(), true);
        return this.flatMap(flatMapper, outType);
    }

    public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper, TypeInformation<R> outputType) {
        return this.transform("Flat Map", outputType, (OneInputStreamOperator)(new StreamFlatMap((FlatMapFunction)this.clean(flatMapper))));
    }

    @PublicEvolving
    public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction) {
        TypeInformation<R> outType = TypeExtractor.getUnaryOperatorReturnType(processFunction, ProcessFunction.class, 0, 1, TypeExtractor.NO_INDEX, this.getType(), Utils.getCallLocationName(), true);
        return this.process(processFunction, outType);
    }

    @Internal
    public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction, TypeInformation<R> outputType) {
        ProcessOperator<T, R> operator = new ProcessOperator((ProcessFunction)this.clean(processFunction));
        return this.transform("Process", outputType, (OneInputStreamOperator)operator);
    }

    public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter) {
        return this.transform("Filter", this.getType(), (OneInputStreamOperator)(new StreamFilter((FilterFunction)this.clean(filter))));
    }

    @PublicEvolving
    public <R extends Tuple> SingleOutputStreamOperator<R> project(int... fieldIndexes) {
        return (new StreamProjection(this, fieldIndexes)).projectTupleX();
    }

    public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream) {
        return new CoGroupedStreams(this, otherStream);
    }

    public <T2> JoinedStreams<T, T2> join(DataStream<T2> otherStream) {
        return new JoinedStreams(this, otherStream);
    }

    /** @deprecated */
    @Deprecated
    public AllWindowedStream<T, TimeWindow> timeWindowAll(Time size) {
        return this.environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime ? this.windowAll(TumblingProcessingTimeWindows.of(size)) : this.windowAll(TumblingEventTimeWindows.of(size));
    }

    /** @deprecated */
    @Deprecated
    public AllWindowedStream<T, TimeWindow> timeWindowAll(Time size, Time slide) {
        return this.environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime ? this.windowAll(SlidingProcessingTimeWindows.of(size, slide)) : this.windowAll(SlidingEventTimeWindows.of(size, slide));
    }

    public AllWindowedStream<T, GlobalWindow> countWindowAll(long size) {
        return this.windowAll(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(size)));
    }

    public AllWindowedStream<T, GlobalWindow> countWindowAll(long size, long slide) {
        return this.windowAll(GlobalWindows.create()).evictor(CountEvictor.of(size)).trigger(CountTrigger.of(slide));
    }

    @PublicEvolving
    public <W extends Window> AllWindowedStream<T, W> windowAll(WindowAssigner<? super T, W> assigner) {
        return new AllWindowedStream(this, assigner);
    }

    public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(WatermarkStrategy<T> watermarkStrategy) {
        WatermarkStrategy<T> cleanedStrategy = (WatermarkStrategy)this.clean(watermarkStrategy);
        int inputParallelism = this.getTransformation().getParallelism();
        TimestampsAndWatermarksTransformation<T> transformation = new TimestampsAndWatermarksTransformation("Timestamps/Watermarks", inputParallelism, this.getTransformation(), cleanedStrategy);
        this.getExecutionEnvironment().addOperator(transformation);
        return new SingleOutputStreamOperator(this.getExecutionEnvironment(), transformation);
    }

    /** @deprecated */
    @Deprecated
    public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(AssignerWithPeriodicWatermarks<T> timestampAndWatermarkAssigner) {
        AssignerWithPeriodicWatermarks<T> cleanedAssigner = (AssignerWithPeriodicWatermarks)this.clean(timestampAndWatermarkAssigner);
        WatermarkStrategy<T> wms = new Strategy(cleanedAssigner);
        return this.assignTimestampsAndWatermarks((WatermarkStrategy)wms);
    }

    /** @deprecated */
    @Deprecated
    public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(AssignerWithPunctuatedWatermarks<T> timestampAndWatermarkAssigner) {
        AssignerWithPunctuatedWatermarks<T> cleanedAssigner = (AssignerWithPunctuatedWatermarks)this.clean(timestampAndWatermarkAssigner);
        WatermarkStrategy<T> wms = new org.apache.flink.streaming.runtime.operators.util.AssignerWithPunctuatedWatermarksAdapter.Strategy(cleanedAssigner);
        return this.assignTimestampsAndWatermarks((WatermarkStrategy)wms);
    }

    @PublicEvolving
    public DataStreamSink<T> print() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction();
        return this.addSink(printFunction).name("Print to Std. Out");
    }

    @PublicEvolving
    public DataStreamSink<T> printToErr() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction(true);
        return this.addSink(printFunction).name("Print to Std. Err");
    }

    @PublicEvolving
    public DataStreamSink<T> print(String sinkIdentifier) {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction(sinkIdentifier, false);
        return this.addSink(printFunction).name("Print to Std. Out");
    }

    @PublicEvolving
    public DataStreamSink<T> printToErr(String sinkIdentifier) {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction(sinkIdentifier, true);
        return this.addSink(printFunction).name("Print to Std. Err");
    }

    /** @deprecated */
    @Deprecated
    @PublicEvolving
    public DataStreamSink<T> writeAsText(String path) {
        return this.writeUsingOutputFormat(new TextOutputFormat(new Path(path)));
    }

    /** @deprecated */
    @Deprecated
    @PublicEvolving
    public DataStreamSink<T> writeAsText(String path, WriteMode writeMode) {
        TextOutputFormat<T> tof = new TextOutputFormat(new Path(path));
        tof.setWriteMode(writeMode);
        return this.writeUsingOutputFormat(tof);
    }

    /** @deprecated */
    @Deprecated
    @PublicEvolving
    public DataStreamSink<T> writeAsCsv(String path) {
        return this.writeAsCsv(path, (WriteMode)null, "\n", CsvOutputFormat.DEFAULT_FIELD_DELIMITER);
    }

    /** @deprecated */
    @Deprecated
    @PublicEvolving
    public DataStreamSink<T> writeAsCsv(String path, WriteMode writeMode) {
        return this.writeAsCsv(path, writeMode, "\n", CsvOutputFormat.DEFAULT_FIELD_DELIMITER);
    }

    /** @deprecated */
    @Deprecated
    @PublicEvolving
    public <X extends Tuple> DataStreamSink<T> writeAsCsv(String path, WriteMode writeMode, String rowDelimiter, String fieldDelimiter) {
        Preconditions.checkArgument(this.getType().isTupleType(), "The writeAsCsv() method can only be used on data streams of tuples.");
        CsvOutputFormat<X> of = new CsvOutputFormat(new Path(path), rowDelimiter, fieldDelimiter);
        if (writeMode != null) {
            of.setWriteMode(writeMode);
        }

        return this.writeUsingOutputFormat(of);
    }

    @PublicEvolving
    public DataStreamSink<T> writeToSocket(String hostName, int port, SerializationSchema<T> schema) {
        DataStreamSink<T> returnStream = this.addSink(new SocketClientSink(hostName, port, schema, 0));
        returnStream.setParallelism(1);
        return returnStream;
    }

    /** @deprecated */
    @Deprecated
    @PublicEvolving
    public DataStreamSink<T> writeUsingOutputFormat(OutputFormat<T> format) {
        return this.addSink(new OutputFormatSinkFunction(format));
    }

    @PublicEvolving
    public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {
        return this.doTransform(operatorName, outTypeInfo, SimpleOperatorFactory.of(operator));
    }

    @PublicEvolving
    public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperatorFactory<T, R> operatorFactory) {
        return this.doTransform(operatorName, outTypeInfo, operatorFactory);
    }

    protected <R> SingleOutputStreamOperator<R> doTransform(String operatorName, TypeInformation<R> outTypeInfo, StreamOperatorFactory<R> operatorFactory) {
        this.transformation.getOutputType();
        OneInputTransformation<T, R> resultTransform = new OneInputTransformation(this.transformation, operatorName, operatorFactory, outTypeInfo, this.environment.getParallelism());
        SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(this.environment, resultTransform);
        this.getExecutionEnvironment().addOperator(resultTransform);
        return returnStream;
    }

    protected DataStream<T> setConnectionType(StreamPartitioner<T> partitioner) {
        return new DataStream(this.getExecutionEnvironment(), new PartitionTransformation(this.getTransformation(), partitioner));
    }

    public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction) {
        this.transformation.getOutputType();
        if (sinkFunction instanceof InputTypeConfigurable) {
            ((InputTypeConfigurable)sinkFunction).setInputType(this.getType(), this.getExecutionConfig());
        }

        StreamSink<T> sinkOperator = new StreamSink((SinkFunction)this.clean(sinkFunction));
        DataStreamSink<T> sink = new DataStreamSink(this, sinkOperator);
        this.getExecutionEnvironment().addOperator(sink.getTransformation());
        return sink;
    }

    @Experimental
    public DataStreamSink<T> sinkTo(Sink<T, ?, ?, ?> sink) {
        this.transformation.getOutputType();
        return new DataStreamSink(this, sink);
    }

    public CloseableIterator<T> executeAndCollect() throws Exception {
        return this.executeAndCollect("DataStream Collect");
    }

    public CloseableIterator<T> executeAndCollect(String jobExecutionName) throws Exception {
        return this.executeAndCollectWithClient(jobExecutionName).iterator;
    }

    public List<T> executeAndCollect(int limit) throws Exception {
        return this.executeAndCollect("DataStream Collect", limit);
    }

    public List<T> executeAndCollect(String jobExecutionName, int limit) throws Exception {
        Preconditions.checkState(limit > 0, "Limit must be greater than 0");
        ClientAndIterator<T> clientAndIterator = this.executeAndCollectWithClient(jobExecutionName);
        Throwable var4 = null;

        try {
            ArrayList results;
            for(results = new ArrayList(limit); clientAndIterator.iterator.hasNext() && limit > 0; --limit) {
                results.add(clientAndIterator.iterator.next());
            }

            ArrayList var6 = results;
            return var6;
        } catch (Throwable var15) {
            var4 = var15;
            throw var15;
        } finally {
            if (clientAndIterator != null) {
                if (var4 != null) {
                    try {
                        clientAndIterator.close();
                    } catch (Throwable var14) {
                        var4.addSuppressed(var14);
                    }
                } else {
                    clientAndIterator.close();
                }
            }

        }
    }

    ClientAndIterator<T> executeAndCollectWithClient(String jobExecutionName) throws Exception {
        TypeSerializer<T> serializer = this.getType().createSerializer(this.getExecutionEnvironment().getConfig());
        String accumulatorName = "dataStreamCollect_" + UUID.randomUUID().toString();
        StreamExecutionEnvironment env = this.getExecutionEnvironment();
        CollectSinkOperatorFactory<T> factory = new CollectSinkOperatorFactory(serializer, accumulatorName);
        CollectSinkOperator<T> operator = (CollectSinkOperator)factory.getOperator();
        CollectResultIterator<T> iterator = new CollectResultIterator(operator.getOperatorIdFuture(), serializer, accumulatorName, env.getCheckpointConfig());
        CollectStreamSink<T> sink = new CollectStreamSink(this, factory);
        sink.name("Data stream collect sink");
        env.addOperator(sink.getTransformation());
        JobClient jobClient = env.executeAsync(jobExecutionName);
        iterator.setJobClient(jobClient);
        return new ClientAndIterator(jobClient, iterator);
    }

    @Internal
    public Transformation<T> getTransformation() {
        return this.transformation;
    }
}

org.apache.flink.streaming.api.datastream.KeyedStream, whose API provides the aggregation operators on keyed streams:

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
//

package org.apache.flink.streaming.api.datastream;

import java.util.ArrayList;
import java.util.Stack;
import java.util.UUID;
import org.apache.commons.lang3.StringUtils;
import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.Public;
import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.api.common.InvalidProgramException;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.Utils;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.typeutils.EnumTypeInfo;
import org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo;
import org.apache.flink.api.java.typeutils.PojoTypeInfo;
import org.apache.flink.api.java.typeutils.TupleTypeInfoBase;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.aggregation.AggregationFunction;
import org.apache.flink.streaming.api.functions.aggregation.ComparableAggregator;
import org.apache.flink.streaming.api.functions.aggregation.SumAggregator;
import org.apache.flink.streaming.api.functions.aggregation.AggregationFunction.AggregationType;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.functions.query.QueryableAppendingStateOperator;
import org.apache.flink.streaming.api.functions.query.QueryableValueStateOperator;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.api.operators.KeyedProcessOperator;
import org.apache.flink.streaming.api.operators.LegacyKeyedProcessOperator;
import org.apache.flink.streaming.api.operators.StreamOperatorFactory;
import org.apache.flink.streaming.api.operators.co.IntervalJoinOperator;
import org.apache.flink.streaming.api.transformations.OneInputTransformation;
import org.apache.flink.streaming.api.transformations.PartitionTransformation;
import org.apache.flink.streaming.api.transformations.ReduceTransformation;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
import org.apache.flink.streaming.api.windowing.evictors.CountEvictor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.streaming.runtime.partitioner.KeyGroupStreamPartitioner;
import org.apache.flink.streaming.runtime.partitioner.StreamPartitioner;
import org.apache.flink.util.Preconditions;

@Public
public class KeyedStream<T, KEY> extends DataStream<T> {
    private final KeySelector<T, KEY> keySelector;
    private final TypeInformation<KEY> keyType;

    public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector) {
        this(dataStream, keySelector, TypeExtractor.getKeySelectorTypes(keySelector, dataStream.getType()));
    }

    public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
        this(dataStream, new PartitionTransformation(dataStream.getTransformation(), new KeyGroupStreamPartitioner(keySelector, 128)), keySelector, keyType);
    }

    @Internal
    KeyedStream(DataStream<T> stream, PartitionTransformation<T> partitionTransformation, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
        super(stream.getExecutionEnvironment(), partitionTransformation);
        this.keySelector = (KeySelector)this.clean(keySelector);
        this.keyType = this.validateKeyType(keyType);
    }

    private TypeInformation<KEY> validateKeyType(TypeInformation<KEY> keyType) {
        Stack<TypeInformation<?>> stack = new Stack();
        stack.push(keyType);
        ArrayList unsupportedTypes = new ArrayList();

        while(true) {
            TypeInformation typeInfo;
            do {
                if (stack.isEmpty()) {
                    if (!unsupportedTypes.isEmpty()) {
                        throw new InvalidProgramException("Type " + keyType + " cannot be used as key. Contained UNSUPPORTED key types: " + StringUtils.join(unsupportedTypes, ", ") + ". Look at the keyBy() documentation for the conditions a type has to satisfy in order to be eligible for a key.");
                    }

                    return keyType;
                }

                typeInfo = (TypeInformation)stack.pop();
                if (!this.validateKeyTypeIsHashable(typeInfo)) {
                    unsupportedTypes.add(typeInfo);
                }
            } while(!(typeInfo instanceof TupleTypeInfoBase));

            for(int i = 0; i < typeInfo.getArity(); ++i) {
                stack.push(((TupleTypeInfoBase)typeInfo).getTypeAt(i));
            }
        }
    }

    private boolean validateKeyTypeIsHashable(TypeInformation<?> type) {
        try {
            return type instanceof PojoTypeInfo ? !type.getTypeClass().getMethod("hashCode").getDeclaringClass().equals(Object.class) : !isArrayType(type) && !isEnumType(type);
        } catch (NoSuchMethodException var3) {
            return false;
        }
    }

    private static boolean isArrayType(TypeInformation<?> type) {
        return type instanceof PrimitiveArrayTypeInfo || type instanceof BasicArrayTypeInfo || type instanceof ObjectArrayTypeInfo;
    }

    private static boolean isEnumType(TypeInformation<?> type) {
        return type instanceof EnumTypeInfo;
    }

    @Internal
    public KeySelector<T, KEY> getKeySelector() {
        return this.keySelector;
    }

    @Internal
    public TypeInformation<KEY> getKeyType() {
        return this.keyType;
    }

    protected DataStream<T> setConnectionType(StreamPartitioner<T> partitioner) {
        throw new UnsupportedOperationException("Cannot override partitioning for KeyedStream.");
    }

    protected <R> SingleOutputStreamOperator<R> doTransform(String operatorName, TypeInformation<R> outTypeInfo, StreamOperatorFactory<R> operatorFactory) {
        SingleOutputStreamOperator<R> returnStream = super.doTransform(operatorName, outTypeInfo, operatorFactory);
        OneInputTransformation<T, R> transform = (OneInputTransformation)returnStream.getTransformation();
        transform.setStateKeySelector(this.keySelector);
        transform.setStateKeyType(this.keyType);
        return returnStream;
    }

    public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction) {
        DataStreamSink<T> result = super.addSink(sinkFunction);
        result.getTransformation().setStateKeySelector(this.keySelector);
        result.getTransformation().setStateKeyType(this.keyType);
        return result;
    }

    /** @deprecated */
    @Deprecated
    @PublicEvolving
    public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction) {
        TypeInformation<R> outType = TypeExtractor.getUnaryOperatorReturnType(processFunction, ProcessFunction.class, 0, 1, TypeExtractor.NO_INDEX, this.getType(), Utils.getCallLocationName(), true);
        return this.process(processFunction, outType);
    }

    /** @deprecated */
    @Deprecated
    @Internal
    public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction, TypeInformation<R> outputType) {
        LegacyKeyedProcessOperator<KEY, T, R> operator = new LegacyKeyedProcessOperator((ProcessFunction)this.clean(processFunction));
        return this.transform("Process", outputType, operator);
    }

    @PublicEvolving
    public <R> SingleOutputStreamOperator<R> process(KeyedProcessFunction<KEY, T, R> keyedProcessFunction) {
        TypeInformation<R> outType = TypeExtractor.getUnaryOperatorReturnType(keyedProcessFunction, KeyedProcessFunction.class, 1, 2, TypeExtractor.NO_INDEX, this.getType(), Utils.getCallLocationName(), true);
        return this.process(keyedProcessFunction, outType);
    }

    @Internal
    public <R> SingleOutputStreamOperator<R> process(KeyedProcessFunction<KEY, T, R> keyedProcessFunction, TypeInformation<R> outputType) {
        KeyedProcessOperator<KEY, T, R> operator = new KeyedProcessOperator((KeyedProcessFunction)this.clean(keyedProcessFunction));
        return this.transform("KeyedProcess", outputType, operator);
    }

    @PublicEvolving
    public <T1> KeyedStream.IntervalJoin<T, T1, KEY> intervalJoin(KeyedStream<T1, KEY> otherStream) {
        return new KeyedStream.IntervalJoin(this, otherStream);
    }

    /** @deprecated */
    @Deprecated
    public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size) {
        return this.environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime ? this.window(TumblingProcessingTimeWindows.of(size)) : this.window(TumblingEventTimeWindows.of(size));
    }

    /** @deprecated */
    @Deprecated
    public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size, Time slide) {
        return this.environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime ? this.window(SlidingProcessingTimeWindows.of(size, slide)) : this.window(SlidingEventTimeWindows.of(size, slide));
    }

    public WindowedStream<T, KEY, GlobalWindow> countWindow(long size) {
        return this.window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(size)));
    }

    public WindowedStream<T, KEY, GlobalWindow> countWindow(long size, long slide) {
        return this.window(GlobalWindows.create()).evictor(CountEvictor.of(size)).trigger(CountTrigger.of(slide));
    }

    @PublicEvolving
    public <W extends Window> WindowedStream<T, KEY, W> window(WindowAssigner<? super T, W> assigner) {
        return new WindowedStream(this, assigner);
    }

    public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> reducer) {
        ReduceTransformation<T, KEY> reduce = new ReduceTransformation("Keyed Reduce", this.environment.getParallelism(), this.transformation, (ReduceFunction)this.clean(reducer), this.keySelector, this.getKeyType());
        this.getExecutionEnvironment().addOperator(reduce);
        return new SingleOutputStreamOperator(this.getExecutionEnvironment(), reduce);
    }

    public SingleOutputStreamOperator<T> sum(int positionToSum) {
        return this.aggregate(new SumAggregator(positionToSum, this.getType(), this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> sum(String field) {
        return this.aggregate(new SumAggregator(field, this.getType(), this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> min(int positionToMin) {
        return this.aggregate(new ComparableAggregator(positionToMin, this.getType(), AggregationType.MIN, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> min(String field) {
        return this.aggregate(new ComparableAggregator(field, this.getType(), AggregationType.MIN, false, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> max(int positionToMax) {
        return this.aggregate(new ComparableAggregator(positionToMax, this.getType(), AggregationType.MAX, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> max(String field) {
        return this.aggregate(new ComparableAggregator(field, this.getType(), AggregationType.MAX, false, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> minBy(String field, boolean first) {
        return this.aggregate(new ComparableAggregator(field, this.getType(), AggregationType.MINBY, first, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> maxBy(String field, boolean first) {
        return this.aggregate(new ComparableAggregator(field, this.getType(), AggregationType.MAXBY, first, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> minBy(int positionToMinBy) {
        return this.minBy(positionToMinBy, true);
    }

    public SingleOutputStreamOperator<T> minBy(String positionToMinBy) {
        return this.minBy(positionToMinBy, true);
    }

    public SingleOutputStreamOperator<T> minBy(int positionToMinBy, boolean first) {
        return this.aggregate(new ComparableAggregator(positionToMinBy, this.getType(), AggregationType.MINBY, first, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy) {
        return this.maxBy(positionToMaxBy, true);
    }

    public SingleOutputStreamOperator<T> maxBy(String positionToMaxBy) {
        return this.maxBy(positionToMaxBy, true);
    }

    public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy, boolean first) {
        return this.aggregate(new ComparableAggregator(positionToMaxBy, this.getType(), AggregationType.MAXBY, first, this.getExecutionConfig()));
    }

    protected SingleOutputStreamOperator<T> aggregate(AggregationFunction<T> aggregate) {
        return this.reduce(aggregate).name("Keyed Aggregation");
    }

    @PublicEvolving
    public QueryableStateStream<KEY, T> asQueryableState(String queryableStateName) {
        ValueStateDescriptor<T> valueStateDescriptor = new ValueStateDescriptor(UUID.randomUUID().toString(), this.getType());
        return this.asQueryableState(queryableStateName, valueStateDescriptor);
    }

    @PublicEvolving
    public QueryableStateStream<KEY, T> asQueryableState(String queryableStateName, ValueStateDescriptor<T> stateDescriptor) {
        this.transform("Queryable state: " + queryableStateName, this.getType(), new QueryableValueStateOperator(queryableStateName, stateDescriptor));
        stateDescriptor.initializeSerializerUnlessSet(this.getExecutionConfig());
        return new QueryableStateStream(queryableStateName, stateDescriptor, this.getKeyType().createSerializer(this.getExecutionConfig()));
    }

    @PublicEvolving
    public QueryableStateStream<KEY, T> asQueryableState(String queryableStateName, ReducingStateDescriptor<T> stateDescriptor) {
        this.transform("Queryable state: " + queryableStateName, this.getType(), new QueryableAppendingStateOperator(queryableStateName, stateDescriptor));
        stateDescriptor.initializeSerializerUnlessSet(this.getExecutionConfig());
        return new QueryableStateStream(queryableStateName, stateDescriptor, this.getKeyType().createSerializer(this.getExecutionConfig()));
    }

    @PublicEvolving
    public static class IntervalJoined<IN1, IN2, KEY> {
        private final KeyedStream<IN1, KEY> left;
        private final KeyedStream<IN2, KEY> right;
        private final long lowerBound;
        private final long upperBound;
        private final KeySelector<IN1, KEY> keySelector1;
        private final KeySelector<IN2, KEY> keySelector2;
        private boolean lowerBoundInclusive;
        private boolean upperBoundInclusive;

        public IntervalJoined(KeyedStream<IN1, KEY> left, KeyedStream<IN2, KEY> right, long lowerBound, long upperBound, boolean lowerBoundInclusive, boolean upperBoundInclusive) {
            this.left = (KeyedStream)Preconditions.checkNotNull(left);
            this.right = (KeyedStream)Preconditions.checkNotNull(right);
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
            this.lowerBoundInclusive = lowerBoundInclusive;
            this.upperBoundInclusive = upperBoundInclusive;
            this.keySelector1 = left.getKeySelector();
            this.keySelector2 = right.getKeySelector();
        }

        @PublicEvolving
        public KeyedStream.IntervalJoined<IN1, IN2, KEY> upperBoundExclusive() {
            this.upperBoundInclusive = false;
            return this;
        }

        @PublicEvolving
        public KeyedStream.IntervalJoined<IN1, IN2, KEY> lowerBoundExclusive() {
            this.lowerBoundInclusive = false;
            return this;
        }

        @PublicEvolving
        public <OUT> SingleOutputStreamOperator<OUT> process(ProcessJoinFunction<IN1, IN2, OUT> processJoinFunction) {
            Preconditions.checkNotNull(processJoinFunction);
            TypeInformation<OUT> outputType = TypeExtractor.getBinaryOperatorReturnType(processJoinFunction, ProcessJoinFunction.class, 0, 1, 2, TypeExtractor.NO_INDEX, this.left.getType(), this.right.getType(), Utils.getCallLocationName(), true);
            return this.process(processJoinFunction, outputType);
        }

        @PublicEvolving
        public <OUT> SingleOutputStreamOperator<OUT> process(ProcessJoinFunction<IN1, IN2, OUT> processJoinFunction, TypeInformation<OUT> outputType) {
            Preconditions.checkNotNull(processJoinFunction);
            Preconditions.checkNotNull(outputType);
            ProcessJoinFunction<IN1, IN2, OUT> cleanedUdf = (ProcessJoinFunction)this.left.getExecutionEnvironment().clean(processJoinFunction);
            IntervalJoinOperator<KEY, IN1, IN2, OUT> operator = new IntervalJoinOperator(this.lowerBound, this.upperBound, this.lowerBoundInclusive, this.upperBoundInclusive, this.left.getType().createSerializer(this.left.getExecutionConfig()), this.right.getType().createSerializer(this.right.getExecutionConfig()), cleanedUdf);
            return this.left.connect(this.right).keyBy(this.keySelector1, this.keySelector2).transform("Interval Join", outputType, operator);
        }
    }

    @PublicEvolving
    public static class IntervalJoin<T1, T2, KEY> {
        private final KeyedStream<T1, KEY> streamOne;
        private final KeyedStream<T2, KEY> streamTwo;
        private KeyedStream.IntervalJoin.TimeBehaviour timeBehaviour;

        IntervalJoin(KeyedStream<T1, KEY> streamOne, KeyedStream<T2, KEY> streamTwo) {
            this.timeBehaviour = KeyedStream.IntervalJoin.TimeBehaviour.EventTime;
            this.streamOne = (KeyedStream)Preconditions.checkNotNull(streamOne);
            this.streamTwo = (KeyedStream)Preconditions.checkNotNull(streamTwo);
        }

        public KeyedStream.IntervalJoin<T1, T2, KEY> inEventTime() {
            this.timeBehaviour = KeyedStream.IntervalJoin.TimeBehaviour.EventTime;
            return this;
        }

        public KeyedStream.IntervalJoin<T1, T2, KEY> inProcessingTime() {
            this.timeBehaviour = KeyedStream.IntervalJoin.TimeBehaviour.ProcessingTime;
            return this;
        }

        @PublicEvolving
        public KeyedStream.IntervalJoined<T1, T2, KEY> between(Time lowerBound, Time upperBound) {
            if (this.timeBehaviour != KeyedStream.IntervalJoin.TimeBehaviour.EventTime) {
                throw new UnsupportedTimeCharacteristicException("Time-bounded stream joins are only supported in event time");
            } else {
                Preconditions.checkNotNull(lowerBound, "A lower bound needs to be provided for a time-bounded join");
                Preconditions.checkNotNull(upperBound, "An upper bound needs to be provided for a time-bounded join");
                return new KeyedStream.IntervalJoined(this.streamOne, this.streamTwo, lowerBound.toMilliseconds(), upperBound.toMilliseconds(), true, true);
            }
        }

        static enum TimeBehaviour {
            ProcessingTime,
            EventTime;

            private TimeBehaviour() {
            }
        }
    }
}

Take the earlier socket example: in the show plan view, each box can be understood as one task (a task may have multiple subtasks, and the number of subtasks can be understood as the parallelism). Why two operators end up merged into one box will become clear after the section on operator chaining (one-to-one operators with the same parallelism are chained together).

(1) With the parallelism set to 2, the show plan looks like this:

(2) With the parallelism set to 1, the show plan looks like this:

2. Parallelism

1. What is parallel computing?

  What we want can be understood as "data parallelism": when multiple records arrive at the same time, we can read them in simultaneously and run operations such as flatMap on them on different nodes.

2. Parallel subtasks and parallelism

  To achieve parallelism, we replicate an operator onto multiple nodes; when data arrives, any one of them can process it. In this way, an operator task is split into several parallel subtasks, which are then distributed to different nodes, giving true parallel computing.

  During Flink execution, each operator can have one or more subtasks, and these subtasks execute completely independently in different threads, on different physical machines, or in different containers.

  The number of subtasks of a particular operator is called its parallelism. The parallelism of a stream program can be considered the maximum parallelism among all of its operators. Different operators within one program may have different parallelism.

As shown in the figure below:

  The dataflow here has four operators: source, map, window, and sink. Except for the final sink, each operator has parallelism 2. The whole program contains 7 subtasks and needs at least two partitions to execute, so the parallelism of this program can be considered 2.
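  A sketch mirroring the figure's setup (illustrative only: fromSequence stands in for the source because a socket source cannot be parallel, and the key and window size are arbitrary):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ParallelismDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromSequence(0, 999).setParallelism(2)                     // source: 2 subtasks
                .map(i -> Tuple2.of(i % 10, 1L))
                .returns(Types.TUPLE(Types.LONG, Types.LONG))
                .setParallelism(2)                                     // map: 2 subtasks
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1).setParallelism(2)                              // window aggregation: 2 subtasks
                .print().setParallelism(1);                            // sink: 1 subtask
        // 2 + 2 + 2 + 1 = 7 subtasks in total; the job's parallelism is max(2, 2, 2, 1) = 2.
        env.execute("parallelism-demo");
    }
}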

3. Setting the parallelism

Settings follow the nearest-wins principle: the setting applied closest to the operator takes effect first.

(1) In code

// Global setting
executionEnvironment.setParallelism(3);

// Per-operator setting
txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .setParallelism(3)

(2) At submission time (it can also be set in the web UI)

./flink-1.13.0/bin/flink run -c cn.qz.SocketStreamWordCount -p 2 ./study-flink-1.0-SNAPSHOT.jar

(3) Change the default parallelism directly in the cluster configuration file flink-conf.yaml:

parallelism.default: 1

  None of these settings is mandatory; they are matched from nearest to farthest (per-operator setting > env setting > -p at submission > config-file default). Note that for some operators a configured parallelism will not take effect; for example, the operator that reads a socket text stream does not support parallelism at all. In a development environment, the default parallelism is the number of CPU cores of the current machine (the default number of task slots is also the CPU core count).

4. Test example

  Again take the socket stream as the example.

(1) Submit with parallelism 2 and view the job:

   As shown above. The name is each operator's name, which we saw being assigned in the source code earlier; after it comes the number of subtasks.

(2) Input on port 7777

[root@k8smaster01 conf]# nc -l 7777
hello china and beijing
what is your name?
my name is qz.

(3) View the details of the output task

 View the subtask information:

(4) Check the standard output on the machines where the two subtasks run: note the number prefixed to each output line (it can be understood as the partition index, i.e. the task slot index)

Output on the machine running the first subtask:

 The second machine:

   My own understanding here: the maximum parallelism is the maximum number of resources (task slots) a job can be assigned. The tasks are processed in parallel at the same time, which can be understood as parallel processing across different machines (how many threads run in parallel on each machine is studied later under task slots; for now, each machine has one task slot).

Addendum: a test for questions 1 and 2 above

  For example, the following program:

package cn.qz;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment (stream execution environment)
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Read the socket text stream
        DataStreamSource<String> txtDataSource = executionEnvironment.socketTextStream("192.168.13.107", 7777);
        // 3. Convert the data format
        SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                })
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)); // the lambda uses generics, so because of type erasure the type information must be declared explicitly
        // 4. Group by key
        KeyedStream<Tuple2<String, Long>, String> tuple2StringKeyedStream = singleOutputStreamOperator.keyBy(t -> t.f0);
        // 5. Sum
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = tuple2StringKeyedStream.sum(1);
        // 6. Print
        sum.print();
        System.out.println("========");
        // 7. Max
        SingleOutputStreamOperator<Tuple2<String, Long>> tuple2SingleOutputStreamOperator = singleOutputStreamOperator.keyBy(t -> t.f0).max(1);
        tuple2SingleOutputStreamOperator.printToErr();
        // 8. Execute
        executionEnvironment.execute();
    }
}

  Debugging and inspecting the related objects shows the default parallelism and the corresponding transformations.

3. Operator Chains

  Looking at the plan shown by the web UI, the nodes do not correspond one-to-one with the operators in the code. Some nodes join several tasks together, merging them into one larger task. The reason is explained below.

1. Data transport between operators

  A dataflow can transport data between operators in one-to-one forwarding mode, or in a shuffled redistributing mode; which one applies depends on the kind of operator.

(1) One-to-one forwarding

  In this mode, the stream preserves the partitioning and the order of elements. Take the source and map operators in the figure: after the source reads a record, it can hand it straight to map, with no repartitioning and no reordering. This means the subtasks of the map operator see exactly the same elements, in the same number and order, as produced by the subtasks of the source operator: a one-to-one relationship. Operators such as map, filter, and flatMap all have this one-to-one relationship.

(2) Redistributing

  In this mode, the partitioning of the stream changes, for example between the map and the subsequent keyBy/window/apply operators in the figure, and between the keyBy/window operator and the sink.

  Each operator's subtasks send data to different downstream target tasks according to the data-transport strategy. For example, keyBy is a grouping operation that essentially repartitions by the hash of the key; and when going from the window operator with parallelism 2 to the sink with parallelism 1, the transport mode is rebalance, which spreads the data evenly across the downstream subtasks. All of these transport modes cause a redistribution (redistribute) of the stream.
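  These redistribution strategies can also be requested explicitly. The partitioning methods below are the ones visible in the DataStream source shown earlier (a sketch with made-up input data):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RepartitionDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> stream = env.fromElements("a", "b", "c");

        stream.rebalance().print("rebalance"); // round-robin, evens out downstream load
        stream.shuffle().print("shuffle");     // random redistribution (ShufflePartitioner)
        stream.broadcast().print("broadcast"); // every downstream subtask receives every record
        stream.global().print("global");       // everything goes to the first downstream subtask

        env.execute("repartition-demo");
    }
}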

2. Merging operator chains

  One-to-one operators with the same parallelism can be linked directly together to form one larger task, so that the original operators become part of the merged task; each task is executed by one thread. This is operator chaining. After merging, it looks like the figure below:

 

After merging there are five tasks, executed in parallel by five threads. Chaining operators reduces switching between threads and improves throughput.

Flink merges operators according to the chaining rules by default. To forbid chaining, or to customize it, the code can apply some specific settings to an operator:

// Disable operator chaining for this operator
SingleOutputStreamOperator<String> words = txtDataSource
        .flatMap((String line, Collector<String> out) -> {
            Arrays.stream(line.split(" ")).forEach(out::collect);
        })
        .disableChaining();

// Start a new chain beginning with this operator
SingleOutputStreamOperator<String> words2 = txtDataSource
        .flatMap((String line, Collector<String> out) -> {
            Arrays.stream(line.split(" ")).forEach(out::collect);
        })
        .startNewChain();

4. Job Graph and Execution Graph

The graphs Flink uses to schedule and execute tasks come in four layers, in order:

logical stream graph -> job graph -> execution graph -> physical graph

Taking the socket program as an example, the transformation process looks like this:

 1. Logical stream graph

  The nodes in this graph generally correspond to operator operations. It is produced on the client.
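  The client can print this logical plan as JSON before anything is submitted (a sketch; this is the same kind of plan the web UI's show plan view renders):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ShowPlanDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("hello flink").print();
        // A JSON description of the logical stream graph, produced entirely on the client.
        System.out.println(env.getExecutionPlan());
    }
}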

2. Job graph

  The job graph is the dataflow graph after optimization. The main optimization is connecting eligible nodes together and merging them into a single task node, forming operator chains. This also happens on the client, and the job graph is passed to the JobMaster when the job is submitted.

3. Execution graph

  After receiving the JobGraph, the JobMaster uses it to generate the execution graph. The ExecutionGraph is the parallelized version of the JobGraph and is the core data structure in scheduling. It differs from the job graph in that tasks are split into subtasks and the way data is passed between tasks is made explicit.

4. Physical graph

  After generating the execution graph, the JobMaster distributes it to the TaskManagers. The TaskManagers deploy tasks according to the execution graph, and the final physical execution process forms the physical graph.

  On top of the execution graph, the physical graph further pins down where data is stored and exactly how it is sent and received.

5. Tasks and Task Slots

   In the earlier tests, our three TaskManagers had 3 task slots in total. When submitting cn.qz.SocketStreamWordCount with parallelism 2, 5 subtasks (1 + 2 + 2) should be shown, yet the job occupied only two task slots. The reason is explained below.

 1. Task slots

In Flink, a worker (TaskManager) is a JVM process; being a process, it can start several independent threads to execute multiple subtasks.

The number of independent task-executing threads in Flink is the number of task slots; the default is 1 and it can be changed. After modifying flink-conf.yaml as below, each node has 4 slots, so 3 nodes give 12 slots in total.

taskmanager.numberOfTaskSlots: 4

Note that slots currently isolate only memory, not CPU. In practice, the slot count should be tuned according to the number of CPU cores.

2. Slot sharing among tasks

  By default, Flink allows subtasks to share slots, so the 2 subtasks can be completed with just two slots (the maximum subtask count of any single task).

  Subtasks of different task nodes can share one slot; put another way, the multiple subtasks of the same task must be placed in different slots. For example, with parallelism 2, a possible layout is:

  A question may arise here: since we want to make maximum use of computing resources, why process multiple tasks in parallel inside one task slot (one slot doing several things)?

  The reason: different tasks use resources differently. Source, map, and sink may need very little processing time, while transformations such as window take long (resource-intensive tasks). If every task had its own slot, the upstream source (blocked waiting for notifications from the downstream window task, effectively backpressure) and the downstream sink could sit idle for long stretches while the window tasks are overwhelmed: unbalanced resource utilization. Hence slot sharing: putting resource-intensive and lightweight tasks into the same slot lets them divide up the slot's resources among themselves.

  If you want a task to occupy a slot exclusively, or only some operators to share slots, you can set a sharing group: only subtasks belonging to the same slot-sharing group share slots, and tasks in different groups must be assigned to different slots.

        SingleOutputStreamOperator<String> words = txtDataSource
                .flatMap((String line, Collector<String> out) -> {
                    Arrays.stream(line.split(" ")).forEach(out::collect);
                })
                .slotSharingGroup("1");

3. Task slots and parallelism

  The parallelism of the whole stream program is the maximum parallelism among all of its operators, and that is also the number of slots it needs (when no sharing groups are specified).

 
