Flink SideOutput 和 Filter 分流对比

Flink 分流有Filter、Split(已经废弃移除)、Side Output进行分流,到底时有什么区别,哪个种更好呢?

对比

代码对比

直接上代码对比:

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.source.SocketTextStreamFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class FlinkSideOutput {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        DataStreamSource<String> streamSource = executionEnvironment.addSource(new SocketTextStreamFunction("localhost", 9999, "\n", 1));
        OutputTag<String> zero = new OutputTag<String>("zero") {
        };
        SingleOutputStreamOperator<String> process = streamSource.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String s, ProcessFunction<String, String>.Context context, Collector<String> collector) throws Exception {
                System.out.println("SideOuput:" + s);
                if (Integer.parseInt(s) % 2 == 0) {
                    context.output(zero, s);
                } else {
                    collector.collect(s);
                }
            }
        });
        process.print();
        process.getSideOutput(zero).print();
        System.out.println(executionEnvironment.getExecutionPlan());
        executionEnvironment.execute();
    }
}
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SocketTextStreamFunction;


public class FlinkStreamFilter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        DataStreamSource<String> streamSource = executionEnvironment.addSource(new SocketTextStreamFunction("localhost", 9999, "\n", 1));
        streamSource.filter(s -> {
            System.out.println("filter % 2 == 0 ,value:" + s);
            return Integer.parseInt(s) % 2 == 0;
        }).print();
        streamSource.filter(s -> {
            System.out.println("filter % 2 != 0 ,value:" + s);
            return Integer.parseInt(s) % 2 != 0;
        }).print();
        System.out.println(executionEnvironment.getExecutionPlan());
        executionEnvironment.execute();
    }
}

从代码上看,SideOutput的写法更为麻烦,且对于代码理解上Filter 更容易去理解,如果了解Spark的可能会想到Filter 这边可能会对 streamSource 进行两次计算。

执行图对比

当全局 parallelism = 1 时

从SideOutput 这个执行图Name看到

Custom Source -> Process , Process 分流到两个Sink:Print

从Filter 这个执行图的Name 可以看到

Custom Source 分成两个流,Custom Source -> Filter -> Sink:Print , Custom Source -> Filter -> Sink:Print

当把 Sink的 Parallelism = 8 ,刚刚说的就更明显了。

比Operator 图对比

通过打印计划

executionEnvironment.getExecutionPlan()

SideOuput:

{
  "nodes" : [ {
    "id" : 1,
    "type" : "Source: Custom Source",
    "pact" : "Data Source",
    "contents" : "Source: Custom Source",
    "parallelism" : 1
  }, {
    "id" : 2,
    "type" : "Process",
    "pact" : "Operator",
    "contents" : "Process",
    "parallelism" : 1,
    "predecessors" : [ {
      "id" : 1,
      "ship_strategy" : "FORWARD",
      "side" : "second"
    } ]
  }, {
    "id" : 3,
    "type" : "Sink: Print to Std. Out",
    "pact" : "Data Sink",
    "contents" : "Sink: Print to Std. Out",
    "parallelism" : 1,
    "predecessors" : [ {
      "id" : 2,
      "ship_strategy" : "FORWARD",
      "side" : "second"
    } ]
  }, {
    "id" : 5,
    "type" : "Sink: Print to Std. Out",
    "pact" : "Data Sink",
    "contents" : "Sink: Print to Std. Out",
    "parallelism" : 1,
    "predecessors" : [ {
      "id" : 2,
      "ship_strategy" : "FORWARD",
      "side" : "second"
    } ]
  } ]
}

对应的图就是:

Filter:

{
  "nodes" : [ {
    "id" : 1,
    "type" : "Source: Custom Source",
    "pact" : "Data Source",
    "contents" : "Source: Custom Source",
    "parallelism" : 1
  }, {
    "id" : 2,
    "type" : "Filter",
    "pact" : "Operator",
    "contents" : "Filter",
    "parallelism" : 1,
    "predecessors" : [ {
      "id" : 1,
      "ship_strategy" : "FORWARD",
      "side" : "second"
    } ]
  }, {
    "id" : 4,
    "type" : "Filter",
    "pact" : "Operator",
    "contents" : "Filter",
    "parallelism" : 1,
    "predecessors" : [ {
      "id" : 1,
      "ship_strategy" : "FORWARD",
      "side" : "second"
    } ]
  }, {
    "id" : 3,
    "type" : "Sink: Print to Std. Out",
    "pact" : "Data Sink",
    "contents" : "Sink: Print to Std. Out",
    "parallelism" : 1,
    "predecessors" : [ {
      "id" : 2,
      "ship_strategy" : "FORWARD",
      "side" : "second"
    } ]
  }, {
    "id" : 5,
    "type" : "Sink: Print to Std. Out",
    "pact" : "Data Sink",
    "contents" : "Sink: Print to Std. Out",
    "parallelism" : 1,
    "predecessors" : [ {
      "id" : 4,
      "ship_strategy" : "FORWARD",
      "side" : "second"
    } ]
  } ]
}

这样就很明显了,Filter分流,Source会把数据发两次,一次给ID=2 的FilterOperator ,一次给ID=4 的FilterOperator,然后进行过滤分别转发给下一个Sink,而 SideOutput,Source直接把数据分发给Process,Process然后分别在内部根据逻辑转发给不同的Sink。

这下就很清楚了,SideOutput 是优于Filter ,Filter的方式会对数据进行两个过滤,SideOutput是只有一次。

验证

nc -lk 9999
1
2
3
4
5
6
7
8
9
10

执行代码

SideOutput输出:

SideOuput:1
1
SideOuput:2
2
SideOuput:3
3
SideOuput:4
4
SideOuput:5
5
SideOuput:6
6
SideOuput:7
7
SideOuput:8
8
SideOuput:9
9
SideOuput:0
0

Filter输出:

filter % 2 == 0 ,value:1
filter % 2 != 0 ,value:1
1
filter % 2 == 0 ,value:2
2
filter % 2 != 0 ,value:2
filter % 2 == 0 ,value:3
filter % 2 != 0 ,value:3
3
filter % 2 == 0 ,value:4
4
filter % 2 != 0 ,value:4
filter % 2 == 0 ,value:5
filter % 2 != 0 ,value:5
5
filter % 2 == 0 ,value:6
6
filter % 2 != 0 ,value:6
filter % 2 == 0 ,value:7
filter % 2 != 0 ,value:7
7
filter % 2 == 0 ,value:8
8
filter % 2 != 0 ,value:8
filter % 2 == 0 ,value:9
filter % 2 != 0 ,value:9
9
filter % 2 == 0 ,value:0
0
filter % 2 != 0 ,value:0

结果和我们说的一样

结论

Filter 不建议使用在分流场景,只是一个过滤器

SideOutput 推荐使用分流场景

posted on 2023-03-08 11:35  chouc  阅读(239)  评论(0编辑  收藏  举报

导航