Flink Runtime Architecture
1. System Architecture
The Flink runtime has two main components: the JobManager and the TaskManager.
JobManager: the actual manager (master), responsible for management and scheduling. Without high availability there is only one.
TaskManager: can be understood as the worker (slave). There can be one or more.
During job submission and task processing the system looks like this:
The client is not part of the processing system; it is only responsible for submitting the job. It calls the program's main method, converts the code into a dataflow graph (Dataflow Graph), finally generates the JobGraph, and sends it to the JobManager. It can also fetch the execution status and results of the job from the JobManager. Once the TaskManagers have started, the JobManager establishes connections with them, converts the JobGraph into an executable ExecutionGraph, and distributes it to the available TaskManagers.
1. JobManager
The task management and scheduling center of a cluster. The JobManager in turn contains three components:
1. JobMaster
Responsible for handling a single job. The JobMaster receives the application to execute, including the jar package, the dataflow graph, and the JobGraph. The JobMaster converts the JobGraph into a physical-level dataflow graph called the execution graph (ExecutionGraph), which contains all tasks that can run concurrently. The JobMaster then requests the resources needed to execute the tasks from the ResourceManager; once enough resources are obtained, it distributes the execution graph to the TaskManagers that actually run it.
While the job is running, the JobMaster is responsible for all operations that need central coordination, such as coordinating checkpoints.
2. ResourceManager
Mainly responsible for resource allocation and management. "Resources" here mainly means the task slots of the TaskManagers. A task slot is the unit of resource scheduling in a Flink cluster; it contains a set of CPU and memory resources that the machine uses to execute computation. Every task must be assigned to a slot to be executed.
3. Dispatcher
Mainly provides a REST interface for submitting applications, and starts a new JobMaster component for every newly submitted job. The Dispatcher also starts a Web UI that conveniently displays and monitors information about job execution.
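For example (a hedged sketch: the JobManager host name and the jar id are placeholders, the jar name is the one used later in this post), a job can be uploaded and started through the Dispatcher's REST endpoints instead of the CLI:

# Upload the application jar to the cluster
curl -X POST -H "Expect:" -F "jarfile=@./study-flink-1.0-SNAPSHOT.jar" http://<jobmanager-host>:8081/jars/upload
# Run the uploaded jar; <jar-id> is returned by the upload call
curl -X POST "http://<jobmanager-host>:8081/jars/<jar-id>/run?entry-class=cn.qz.SocketStreamWordCount&parallelism=2"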
2. TaskManager
The worker process in Flink, also called a worker. A cluster contains one or more TaskManagers, and each TaskManager holds a certain number of task slots. The number of slots limits how many tasks a TaskManager can process in parallel.
After starting, a TaskManager registers its slots with the ResourceManager; on instruction from the ResourceManager it offers one or more slots to a JobMaster, which can then assign tasks to them.
During execution, a TaskManager can buffer data and exchange data with other TaskManagers running the same application.
2. Job Submission Flow
This can be illustrated by the following figure:
(1) The client application submits the job to the JobManager through the REST interface provided by the Dispatcher.
(2) The Dispatcher starts a JobMaster and hands the job (including the JobGraph) to it.
(3) The JobMaster parses the JobGraph into an ExecutionGraph, derives the required amount of resources, and requests those resources (slots) from the ResourceManager.
(4) The ResourceManager coordinates the resources.
(5) After starting, each TaskManager registers its available slots with the ResourceManager.
(6) The ResourceManager tells the TaskManagers to provide slots for the new job.
(7) The TaskManagers connect to the corresponding JobMaster and offer their slots.
(8) The JobMaster distributes the tasks to be executed to the TaskManagers.
(9) The TaskManagers execute the tasks and may exchange data with each other.
3. Key Concepts
These core concepts help answer:
1) How are tasks derived from a Flink program?
2) How many tasks does a stream processing program actually contain?
3) How many slots are needed to execute those tasks?
1. Dataflow Graph
Flink is a streaming computation framework, and a Flink program is essentially a sequence of operations; every incoming record passes through each step once. Each operation is called an "operator". You can think of the program as a pipeline built from operators, with data flowing through it in order like water.
Every program consists of three parts: sources (source operators, responsible for reading data), transformations (transformation operators, responsible for processing data), and sinks (sink operators, responsible for emitting the output).
At runtime, a Flink program is mapped to a graph in which all operators are connected in their logical order; this graph is called the logical dataflow (dataflow graph). A dataflow graph is an arbitrary directed acyclic graph (DAG). Every dataflow in the graph starts with one or more sources and ends with one or more sinks.
In code, apart from sources and sinks, an API call whose return type is SingleOutputStreamOperator counts as an operator; otherwise it is not counted as an operator and should be seen only as an intermediate transformation. For example, keyBy returns a KeyedStream, so it is not an operator, whereas org.apache.flink.streaming.api.datastream.KeyedStream#sum(int) is an operator.
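A minimal illustration of that distinction (a sketch that assumes singleOutputStreamOperator is the Tuple2<String, Long> stream from the socket word-count program shown later in this post): keyBy only repartitions the stream and returns a KeyedStream, while sum returns a SingleOutputStreamOperator and therefore shows up as an operator in the plan.

// keyBy: not an operator, just a hash repartition of the stream
KeyedStream<Tuple2<String, Long>, String> keyed = singleOutputStreamOperator.keyBy(t -> t.f0);

// sum: an operator (internally named "Keyed Aggregation")
SingleOutputStreamOperator<Tuple2<String, Long>> summed = keyed.sum(1);

// print: a sink operator (internally named "Print to Std. Out")
summed.print();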
Common operators:
source: reading a text file, a socket, a custom input, etc.
transformation: flatMap, map, filter, process, and also the aggregations sum, max, maxBy, min, minBy, etc. (all of which are named "Keyed Aggregation")
sink: print, printToErr, writeAsText, writeAsCsv, etc.
From the source of org.apache.flink.streaming.api.datastream.DataStream you can see that every operator has a specific name:

// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
// Excerpt: only the members that show how operators are named are kept; imports and the remaining members are elided.
package org.apache.flink.streaming.api.datastream;

@Public
public class DataStream<T> {

    protected final StreamExecutionEnvironment environment;
    protected final Transformation<T> transformation;

    public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper, TypeInformation<R> outputType) {
        return this.transform("Map", outputType, new StreamMap<>(this.clean(mapper)));
    }

    public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper, TypeInformation<R> outputType) {
        return this.transform("Flat Map", outputType, new StreamFlatMap<>(this.clean(flatMapper)));
    }

    @Internal
    public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction, TypeInformation<R> outputType) {
        ProcessOperator<T, R> operator = new ProcessOperator<>(this.clean(processFunction));
        return this.transform("Process", outputType, operator);
    }

    public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter) {
        return this.transform("Filter", this.getType(), new StreamFilter<>(this.clean(filter)));
    }

    @PublicEvolving
    public DataStreamSink<T> print() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>();
        return this.addSink(printFunction).name("Print to Std. Out");
    }

    @PublicEvolving
    public DataStreamSink<T> printToErr() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>(true);
        return this.addSink(printFunction).name("Print to Std. Err");
    }

    @PublicEvolving
    public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {
        return this.doTransform(operatorName, outTypeInfo, SimpleOperatorFactory.of(operator));
    }

    protected <R> SingleOutputStreamOperator<R> doTransform(String operatorName, TypeInformation<R> outTypeInfo, StreamOperatorFactory<R> operatorFactory) {
        OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
                this.transformation, operatorName, operatorFactory, outTypeInfo, this.environment.getParallelism());
        SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator<>(this.environment, resultTransform);
        this.getExecutionEnvironment().addOperator(resultTransform);
        return returnStream;
    }

    public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction) {
        StreamSink<T> sinkOperator = new StreamSink<>(this.clean(sinkFunction));
        DataStreamSink<T> sink = new DataStreamSink<>(this, sinkOperator);
        this.getExecutionEnvironment().addOperator(sink.getTransformation());
        return sink;
    }

    // ... union/connect/keyBy/partitioning/window/write*/executeAndCollect and other members elided ...
}
The aggregation operator APIs in org.apache.flink.streaming.api.datastream.KeyedStream:

// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
// Excerpt: only the aggregation-related members are kept; imports and the remaining members are elided.
// Note that sum/min/max/minBy/maxBy all go through aggregate(), which names the operator "Keyed Aggregation".
package org.apache.flink.streaming.api.datastream;

@Public
public class KeyedStream<T, KEY> extends DataStream<T> {

    private final KeySelector<T, KEY> keySelector;
    private final TypeInformation<KEY> keyType;

    public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> reducer) {
        ReduceTransformation<T, KEY> reduce = new ReduceTransformation<>(
                "Keyed Reduce", this.environment.getParallelism(), this.transformation,
                this.clean(reducer), this.keySelector, this.getKeyType());
        this.getExecutionEnvironment().addOperator(reduce);
        return new SingleOutputStreamOperator<>(this.getExecutionEnvironment(), reduce);
    }

    public SingleOutputStreamOperator<T> sum(int positionToSum) {
        return this.aggregate(new SumAggregator<>(positionToSum, this.getType(), this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> min(int positionToMin) {
        return this.aggregate(new ComparableAggregator<>(positionToMin, this.getType(), AggregationType.MIN, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> max(int positionToMax) {
        return this.aggregate(new ComparableAggregator<>(positionToMax, this.getType(), AggregationType.MAX, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> minBy(int positionToMinBy, boolean first) {
        return this.aggregate(new ComparableAggregator<>(positionToMinBy, this.getType(), AggregationType.MINBY, first, this.getExecutionConfig()));
    }

    public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy, boolean first) {
        return this.aggregate(new ComparableAggregator<>(positionToMaxBy, this.getType(), AggregationType.MAXBY, first, this.getExecutionConfig()));
    }

    protected SingleOutputStreamOperator<T> aggregate(AggregationFunction<T> aggregate) {
        return this.reduce(aggregate).name("Keyed Aggregation");
    }

    // ... key validation, process/KeyedProcess, windows, interval joins, queryable state and other members elided ...
}
Take the earlier socket example. In the Show Plan view, each box can be read as one task (a task may consist of multiple subtasks, and the number of subtasks can be understood as the task's parallelism). Why the operators are merged the way they are becomes clear after the section on operator chaining (one-to-one operators with the same parallelism are chained together).
(1) With the parallelism set to 2, the Show Plan looks like this:
(2) With the parallelism set to 1, the Show Plan looks like this:
2. Parallelism
1. What is parallel computation?
What we really want is "data parallelism": when multiple records arrive at the same time, we can read them in at the same time and run flatMap and other operations on them on different nodes.
2. Parallel subtasks and parallelism
To achieve parallel execution, an operator is replicated to multiple nodes, and an incoming record is processed by any one of the copies. In this way one operator task is split into several parallel subtasks, which are distributed to different nodes, giving true parallel computation.
During Flink execution, each operator can consist of one or more subtasks, which run completely independently in different threads, on different physical machines, or in different containers.
The number of subtasks of a particular operator is called its parallelism. The parallelism of a stream program can be taken as the maximum parallelism among all of its operators. Different operators in the same program may have different parallelism.
As shown in the figure below:
The dataflow here has four operators: source, map, window, and sink. Except for the final sink, every operator has a parallelism of 2. The whole program therefore contains 7 subtasks and needs at least 2 partitions to execute, so the parallelism of this program can be considered to be 2.
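A minimal sketch of how subtasks are counted (host/port are placeholders and the imports are the same as in the SocketStreamWordCount program later in this post; a socket source is always single-parallelism, so the numbers differ slightly from the figure):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);                                   // default parallelism of the job

env.socketTextStream("localhost", 7777)                  // socket source: always 1 subtask
        .flatMap((String line, Collector<String> words) ->
                Arrays.stream(line.split(" ")).forEach(words::collect))
        .returns(Types.STRING)                           // flatMap: 2 subtasks
        .map(word -> Tuple2.of(word, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))  // map: 2 subtasks
        .keyBy(t -> t.f0)
        .sum(1)                                          // keyed aggregation: 2 subtasks
        .print().setParallelism(1);                      // sink: 1 subtask

// 1 + 2 + 2 + 2 + 1 = 8 subtasks in total; the program's parallelism is the maximum, i.e. 2.
env.execute("subtask-count-demo");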
3. Setting the parallelism
Settings follow a "nearest wins" rule: the setting closest to the operator takes effect first.
(1) Set in code
// Global setting
executionEnvironment.setParallelism(3);

// Per-operator setting
txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .setParallelism(3);
(2) Set at submission time (it can also be set in the Web UI)
./flink-1.13.0/bin/flink run -c cn.qz.SocketStreamWordCount -p 2 ./study-flink-1.0-SNAPSHOT.jar
(3) Change the default parallelism directly in the cluster configuration file flink-conf.yaml:
parallelism.default: 1
None of these settings is mandatory; they are resolved from nearest to farthest (per-operator setting > env-level setting > -p at submission > config default). Note that some operators ignore a configured parallelism; for example, the operator reading a socket text stream does not support parallel execution. In a development environment the default parallelism is the number of CPU cores of the current machine (the default number of task slots is also the number of CPU cores).
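A small sketch of that precedence (the value 8 is arbitrary; imports as in the SocketStreamWordCount program below): the operator-level call wins over the env-level setting, which in turn wins over -p and flink-conf.yaml.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);                               // env level: overrides -p and parallelism.default

env.socketTextStream("localhost", 7777)              // socket source: always runs with parallelism 1
        .flatMap((String line, Collector<String> words) ->
                Arrays.stream(line.split(" ")).forEach(words::collect))
        .returns(Types.STRING)
        .setParallelism(8)                           // operator level: this flatMap gets 8 subtasks, not 2
        .print();                                    // no explicit setting: inherits the env-level value, 2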
4. Test example
Again using the socket stream as the example.
(1) Submit with a parallelism of 2 and look at the tasks:
As shown above, the name column shows the name of each operator (the names we saw assigned in the source code), followed by the number of subtasks.
(2) Input on port 7777
[root@k8smaster01 conf]# nc -l 7777
hello china and beijing
what is your name?
my name is qz.
(3) View the detailed information of the output task
View the subtask information:
(4) View the standard output on the machines running the two subtasks. Notice the index prefixed to each output line (it can be understood as the partition index, i.e. the task slot index).
Output on the machine running the first subtask:
The second machine:
My understanding here: the maximum parallelism is the maximum number of resources (task slots) a job can be given. The tasks are processed in parallel, which can be understood as parallel processing across different machines (how many threads run in parallel on each machine is examined later with task slots; for now each machine has one task slot).
Supplement: a test for points 1 and 2 above
For example, the following program:
package cn.qz;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment (stream execution environment)
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read the socket text stream
        DataStreamSource<String> txtDataSource = executionEnvironment.socketTextStream("192.168.13.107", 7777);

        // 3. Transform the data format
        SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                })
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)); // lambdas use generics; because of type erasure the type information must be declared explicitly

        // 4. Group by key
        KeyedStream<Tuple2<String, Long>, String> tuple2StringKeyedStream = singleOutputStreamOperator.keyBy(t -> t.f0);

        // 5. Sum
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = tuple2StringKeyedStream.sum(1);

        // 6. Print
        sum.print();
        System.out.println("========");

        // 7. Max
        SingleOutputStreamOperator<Tuple2<String, Long>> tuple2SingleOutputStreamOperator = singleOutputStreamOperator.keyBy(t -> t.f0).max(1);
        tuple2SingleOutputStreamOperator.printToErr();

        // 8. Execute
        executionEnvironment.execute();
    }
}
Debug it and inspect the related objects: you can see the default parallelism and the corresponding transformations.
3. Operator Chains
Looking at the plan shown in the Web UI, we notice that the number of nodes does not correspond one-to-one to the operators in the code: some nodes merge several tasks into one larger task. The reason is explained below.
1. Data transfer between operators
A data stream can be transferred between operators either in a one-to-one forwarding mode or in a redistributing mode that shuffles the partitioning; which one applies depends on the kind of operator.
(1) One-to-one (forwarding)
In this mode the stream keeps its partitioning and the order of its elements. Take the source and map operators in the figure: once the source has read a record it can hand it straight to map, without repartitioning and without reordering. This means a map subtask sees exactly the same elements, in the same order, as produced by the corresponding source subtask, preserving a one-to-one relationship. map, filter, flatMap and similar operators all have this one-to-one relationship with their input.
(2) Redistributing
In this mode the partitioning of the stream changes, for example between the map operator and the subsequent keyBy/window/apply operators in the figure, and between the keyBy/window operator and the sink.
Each operator subtask sends data to different downstream target tasks according to the transfer strategy. For example, keyBy is a grouping operation that essentially repartitions by the hash of the key; and going from the window operator with parallelism 2 to the sink with parallelism 1 uses rebalance, which spreads the data evenly across the downstream subtasks. All of these transfer modes redistribute the stream.
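A small sketch of where each mode appears in code (host/port are placeholders; imports as in the SocketStreamWordCount program above):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);

env.socketTextStream("localhost", 7777)          // socket source, parallelism 1
        // parallelism 1 -> 2: records are rebalanced (round-robin) to the map subtasks
        .map(String::trim)
        // map -> filter: same parallelism, one-to-one forwarding (and a candidate for chaining)
        .filter(line -> !line.isEmpty())
        // keyBy: hash repartition so that equal keys reach the same downstream subtask
        .keyBy(word -> word)
        .reduce((a, b) -> a + "|" + b)
        // parallelism 2 -> 1: rebalanced again before the sink
        .print().setParallelism(1);

env.execute("partitioning-demo");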
2. Operator chaining
One-to-one operators with the same parallelism can be linked directly into one larger task, so the original operators become part of the merged task. Each task is executed by one thread. This is operator chaining. After merging, the job looks like this:
After merging there are five tasks, executed in parallel by five threads. Chaining operators reduces the hand-offs between threads and improves throughput.
Flink chains operators by default. To forbid chaining or to customize it, specific settings can be applied to operators in the code:
// Disable chaining for this operator
SingleOutputStreamOperator<String> singleOutputStreamOperator = txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .disableChaining();

// Start a new chain from this operator
SingleOutputStreamOperator<String> singleOutputStreamOperator2 = txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .startNewChain();
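Chaining can also be switched off globally on the execution environment (a small sketch; this disables chaining for every operator in the job rather than for a single one):

// Disable operator chaining for the whole job
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.disableOperatorChaining();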
4. JobGraph and ExecutionGraph
The graphs Flink uses to schedule and execute a job come in four layers, in this order:
logical stream graph -> JobGraph -> ExecutionGraph -> physical graph
Taking the socket example, the transformation process is as follows:
1. Logical stream graph
The nodes of this graph generally correspond to the operator calls. It is produced on the client.
2. JobGraph
The JobGraph is the dataflow graph after optimization. The main optimization is merging qualifying nodes into a single task node, forming operator chains. This is also done on the client, and the JobGraph is handed to the JobMaster when the job is submitted.
3. ExecutionGraph
After receiving the JobGraph, the JobMaster uses it to generate the ExecutionGraph. The ExecutionGraph is the parallelized version of the JobGraph and the core data structure of scheduling. It differs from the JobGraph in that tasks are split into subtasks and the way data is passed between tasks is made explicit.
4. Physical graph
After generating the ExecutionGraph, the JobMaster distributes it to the TaskManagers. The TaskManagers deploy the tasks according to the ExecutionGraph, and the final physical execution process forms the physical graph.
The physical graph builds on the ExecutionGraph and further fixes where data is stored and exactly how it is sent and received.
5. Tasks and Task Slots
In the earlier tests our three TaskManagers had 3 task slots in total. When cn.qz.SocketStreamWordCount was submitted with a parallelism of 2, there should have been 5 tasks (1 + 2 + 2), yet only two task slots were occupied. The reason is explained below.
1. Task slots
In Flink a worker (TaskManager) is a JVM process, and as a process it can start multiple independent threads to execute multiple subtasks.
The number of such independent task-executing threads in a TaskManager is the number of task slots. It defaults to 1 and can be changed. After modifying flink-conf.yaml as follows, each node has 4 slots, so 3 nodes give 12 slots in total.
taskmanager.numberOfTaskSlots: 4
Note that slots currently only isolate memory; they do not isolate CPU. In practice the slot count should be tuned to the number of CPU cores.
2. Slot sharing
By default, Flink allows subtasks to share slots, which is why the job above can run with just 2 slots (the maximum number of parallel subtasks of any single task).
Subtasks of different task nodes may share one slot; put differently, the parallel subtasks of the same task must be placed in different slots. With a parallelism of 2, one possible placement is:
At this point you may wonder: if we want to use the computing resources as fully as possible, why run multiple tasks in one task slot (one slot handling several jobs)?
The reason is that different tasks put very different loads on the resources. source, map, and sink may take very little processing time, while transformations such as window take long (resource-intensive tasks). If every task had its own slot, the upstream source (blocked waiting for notifications from the downstream window task, effectively backpressure) and the downstream sink could sit idle for long stretches while the window tasks are overloaded, leaving resource utilization unbalanced. Hence slot sharing: putting resource-intensive and lightweight tasks into the same slot lets them balance the share of resources among themselves.
If you want a task to occupy a slot exclusively, or want only certain operators to share slots, you can configure slot sharing groups: only subtasks belonging to the same slot sharing group share slots, and tasks in different groups must be placed in different slots.
SingleOutputStreamOperator<String> singleOutputStreamOperator = txtDataSource
        .flatMap((String line, Collector<String> words) -> {
            Arrays.stream(line.split(" ")).forEach(words::collect);
        })
        .slotSharingGroup("1");
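A rough sketch of the effect (the group name "agg" and the parallelism values are made up for illustration): operators in different sharing groups cannot share slots, so the required slot count becomes the sum of the maximum parallelism of each group.

// Default group: source + flatMap + map (parallelism 2). Group "agg": the keyed sum and the sink after it (parallelism 2).
// Required slots = max(default group) + max("agg" group) = 2 + 2 = 4.
txtDataSource
        .flatMap((String line, Collector<String> words) ->
                Arrays.stream(line.split(" ")).forEach(words::collect))
        .returns(Types.STRING)
        .map(word -> Tuple2.of(word, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .sum(1).slotSharingGroup("agg")   // from here on, operators inherit the "agg" group unless set otherwise
        .print();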
3. Task slots and parallelism
The parallelism of a whole stream program is the largest parallelism among all of its operators, and that is also the number of slots the program needs (assuming no slot sharing groups are specified).
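For example (a sketch with made-up parallelism values; imports as in the SocketStreamWordCount program above): the largest operator parallelism below is 4, so the job needs 4 slots even though the other operators use fewer.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 7777)              // 1 subtask
        .map(String::toUpperCase).setParallelism(4)  // 4 subtasks -> job parallelism = 4
        .print().setParallelism(2);                  // 2 subtasks

// All operators are in the default sharing group, so 4 slots are enough to run this job.
env.execute("slot-demo");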