



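1. Tumbling Window Join


DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>).equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2) /* size, illustrative */))
    .apply(new JoinFunction<Integer, Integer, String>() {
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });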

2. Sliding Window Join


DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>).equalTo(<KeySelector>)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply(new JoinFunction<Integer, Integer, String>() {
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });

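3. Session Window Join


DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>).equalTo(<KeySelector>)
    .window(EventTimeSessionWindows.withGap(Time.milliseconds(1) /* gap, illustrative */))
    .apply(new JoinFunction<Integer, Integer, String>() {
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });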

All three of the above are inner joins; they differ only in the window type.
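
To make the window semantics concrete, here is a minimal plain-Java sketch (no Flink dependency) of what a tumbling window inner join computes. It assumes that all elements share one key, that each element's value equals its timestamp in milliseconds, and that windows start at 0 with no offset. The class name `TumblingJoinSketch` and its helpers are hypothetical and only illustrate the pairing rule, not Flink's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class TumblingJoinSketch {

    // Start of the tumbling window containing ts (assumes ts >= 0 and no window offset).
    static long windowStart(long ts, long size) {
        return ts - (ts % size);
    }

    // Inner join: emit "orange,green" for every pair that falls into the same window.
    static List<String> join(long[] orange, long[] green, long size) {
        List<String> out = new ArrayList<>();
        for (long o : orange) {
            for (long g : green) {
                if (windowStart(o, size) == windowStart(g, size)) {
                    out.add(o + "," + g);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] orange = {0, 1, 2, 3, 4, 5, 6, 7};
        long[] green = {0, 1, 3, 4};
        // Elements 6 and 7 of the orange stream find no partner and are dropped.
        System.out.println(join(orange, green, 2));
    }
}
```

Because the join is an inner join, a window that contains elements from only one stream produces no output at all.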

4. Interval Join


right.timestamp ∈ [left.timestamp + lowerBound; left.timestamp + upperBound]

In the example below, we join two streams, 'orange' and 'green', with a lower bound of -2 milliseconds and an upper bound of +1 millisecond. By default, these boundaries are inclusive, but .lowerBoundExclusive() and .upperBoundExclusive() can be applied to change the behaviour.

Using the more formal notation again this will translate to

orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound

Note: the interval join currently supports only event time.


DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
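
The inclusive bounds are easy to check by hand. The following plain-Java sketch (no Flink dependency) applies the condition orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound to two small arrays of timestamps. The class `IntervalJoinSketch`, its method names, and the sample timestamps are assumptions made for illustration; the sketch also assumes all elements share one key.

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalJoinSketch {

    // True when greenTs lies in [orangeTs + lowerBound, orangeTs + upperBound], both ends inclusive.
    static boolean matches(long orangeTs, long greenTs, long lowerBound, long upperBound) {
        return orangeTs + lowerBound <= greenTs && greenTs <= orangeTs + upperBound;
    }

    // Emit "orange,green" for every pair that satisfies the interval condition.
    static List<String> join(long[] orange, long[] green, long lowerBound, long upperBound) {
        List<String> out = new ArrayList<>();
        for (long o : orange) {
            for (long g : green) {
                if (matches(o, g, lowerBound, upperBound)) {
                    out.add(o + "," + g);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] orange = {0, 2, 3, 4, 5, 7};
        long[] green = {0, 1, 6, 7};
        // Same bounds as the example: lower bound -2 ms, upper bound +1 ms.
        System.out.println(join(orange, green, -2, 1));
    }
}
```

With these inputs the orange element 4 matches nothing, because no green timestamp lies in [2, 5].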


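Inner joins alone are not enough. How can a left or right outer join be implemented? The answer is the coGroup() operator. It is called much like join() and also requires a window, but CoGroupFunction is more flexible than JoinFunction: user-defined logic decides how records from the left and/or right stream are matched and what is emitted.

The following example implements a left join of the click stream with the order stream, using a plain nested-loop join (two nested loops).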

// clickRecordStream and orderRecordStream are assumed names for the click and order streams
clickRecordStream
  .coGroup(orderRecordStream)
  .where(record -> record.getMerchandiseId())
  .equalTo(record -> record.getMerchandiseId())
  // coGroup() requires a window; this assigner and its size are assumed for illustration
  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
  .apply(new CoGroupFunction<AnalyticsAccessLogRecord, OrderDoneLogRecord, Tuple2<String, Long>>() {
    @Override
    public void coGroup(Iterable<AnalyticsAccessLogRecord> accessRecords, Iterable<OrderDoneLogRecord> orderRecords,
                        Collector<Tuple2<String, Long>> collector) throws Exception {
      for (AnalyticsAccessLogRecord accessRecord : accessRecords) {
        boolean isMatched = false;
        for (OrderDoneLogRecord orderRecord : orderRecords) {
          // The right stream has a matching record
          collector.collect(new Tuple2<>(accessRecord.getMerchandiseName(), orderRecord.getPrice()));
          isMatched = true;
        }
        if (!isMatched) {
          // The right stream has no matching record
          collector.collect(new Tuple2<>(accessRecord.getMerchandiseName(), null));
        }
      }
    }
  })
  .print().setParallelism(1);
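
The matching logic can be tried out without a Flink cluster. The sketch below is a plain-Java stand-in for what the coGroup body does for one key within one window: every left record is emitted at least once, with null on the right side when nothing matches. The class `LeftOuterJoinSketch`, the string/long element types, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LeftOuterJoinSketch {

    // Left outer join of the records that share one key within one window.
    static List<String> leftJoin(List<String> left, List<Long> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            boolean matched = false;
            for (Long r : right) {
                // A right-side record exists: emit the joined pair.
                out.add(l + "," + r);
                matched = true;
            }
            if (!matched) {
                // No right-side record: keep the left record and pad with null.
                out.add(l + ",null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(leftJoin(Arrays.asList("book"), Arrays.asList(10L, 20L)));
        System.out.println(leftJoin(Arrays.asList("pen"), Collections.<Long>emptyList()));
    }
}
```

A right outer join follows by swapping the roles of the two loops, and a full outer join by additionally emitting right records when the left side is empty.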


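1. Preloaded dimension table

  • Pros: simple to implement.
  • Cons: the data is held in memory, so this approach only suits small dimension tables that change infrequently. A timer can be set up in open() to reload the table periodically, but updates may still reach the job late.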
class MapJoinDemo1 extends RichMapFunction<Tuple2<String, Integer>, Tuple3<String, Integer, String>> {
    // Dimension table: city id -> city name
    Map<Integer, String> dim;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Load the dimension data into memory once, when the task starts
        dim = new HashMap<>();
        dim.put(1001, "beijing");
        dim.put(1002, "shanghai");
        dim.put(1003, "wuhan");
        dim.put(1004, "changsha");
    }

    @Override
    public Tuple3<String, Integer, String> map(Tuple2<String, Integer> value) throws Exception {
        // Look up the city name by city id; fall back to "" for unknown ids
        String cityName = "";
        if (dim.containsKey(value.f1)) {
            cityName = dim.get(value.f1);
        }
        return new Tuple3<>(value.f0, value.f1, cityName);
    }
}
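
The periodic reload mentioned in the drawbacks could look roughly like the following plain-Java sketch. It is not tied to Flink: `RefreshingDimTable`, its `loader` supplier, and the refresh period are all hypothetical, and in a real job `open()` and `close()` would be called from the corresponding RichMapFunction methods.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class RefreshingDimTable {
    private final Supplier<Map<Integer, String>> loader; // hypothetical: reads the full dimension table
    private final long periodMillis;
    private volatile Map<Integer, String> dim = new HashMap<>();
    private ScheduledExecutorService scheduler;

    RefreshingDimTable(Supplier<Map<Integer, String>> loader, long periodMillis) {
        this.loader = loader;
        this.periodMillis = periodMillis;
    }

    // Load once up front, then swap in a fresh snapshot on a fixed schedule.
    void open() {
        dim = loader.get();
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "dim-refresh");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                dim = loader.get();
            } catch (RuntimeException e) {
                // Keep the previous snapshot; an uncaught exception would cancel future reloads.
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    // Returns "" for unknown ids, like the map() method in the example above.
    String lookup(int cityId) {
        return dim.getOrDefault(cityId, "");
    }

    void close() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }
}
```

Between two reloads the job still sees the old snapshot, which is exactly the staleness the drawback describes; a shorter period narrows the gap at the cost of more load on the source.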

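2. Dimension table in hot storage


  • Pros: the dimension data is not limited by memory, so it can be very large.
  • Cons: because the dimension table lives in external storage, lookups are bounded by that store's read speed; synchronizing the dimension table into the store also adds latency.

(1) Use a cache to reduce the lookup load

A cache can hold the frequently accessed part of the dimension table and so reduce the number of calls to the external system, for example Guava Cache.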

class MapJoinDemo1 extends RichMapFunction<Tuple2<String, Integer>, Tuple3<String, Integer, String>> {
    LoadingCache<Integer, String> dim;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Cache lookups with Guava's LoadingCache
        dim = CacheBuilder.newBuilder()
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .removalListener(new RemovalListener<Integer, String>() {
                    @Override
                    public void onRemoval(RemovalNotification<Integer, String> removalNotification) {
                        System.out.println(removalNotification.getKey() + " was removed, value: " + removalNotification.getValue());
                    }
                })
                .build(new CacheLoader<Integer, String>() {
                    @Override
                    public String load(Integer cityId) throws Exception {
                        // Called on a cache miss
                        String cityName = readFromHbase(cityId);
                        return cityName;
                    }
                });
    }

    // Stand-in for a read from HBase
    private String readFromHbase(Integer cityId) {
        Map<Integer, String> temp = new HashMap<>();
        temp.put(1001, "beijing");
        temp.put(1002, "shanghai");
        temp.put(1003, "wuhan");
        temp.put(1004, "changsha");
        String cityName = "";
        if (temp.containsKey(cityId)) {
            cityName = temp.get(cityId);
        }
        return cityName;
    }

    @Override
    public Tuple3<String, Integer, String> map(Tuple2<String, Integer> value) throws Exception {
        String cityName = "";
        if (dim.get(value.f1) != null) {
            cityName = dim.get(value.f1);
        }
        return new Tuple3<>(value.f0, value.f1, cityName);
    }
}
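
To see how much a cache actually saves, it helps to count the external calls. The sketch below is a small JDK-only LRU cache built on LinkedHashMap, used here as a swapped-in stand-in for Guava's LoadingCache (it evicts by size rather than by write time). `LruLookupCache`, its `loader` function, and the `loads()` counter are hypothetical, and the class is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class LruLookupCache<K, V> {
    private final Function<K, V> loader; // hypothetical: one external lookup per call
    private final Map<K, V> cache;
    private int loads = 0;

    LruLookupCache(int maxSize, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    // Return the cached value, or load it from the external system on a miss.
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            cache.put(key, value);
        }
        return value;
    }

    // Number of times the external system was actually called.
    int loads() {
        return loads;
    }
}
```

For example, with a capacity of 2, the access sequence 1, 1, 2, 3, 1 triggers four loads: the second access to key 1 is a hit, but key 1 is evicted when key 3 arrives and must be loaded again.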

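(2) Use async I/O to raise lookup throughput

  • Timeouts: a query that times out counts as a failed read or write and must be handled as a failure.
  • Concurrency: if too many requests are in flight, Flink's back-pressure mechanism is triggered to throttle the upstream.
  • Out-of-order results: how to deal with reordered results depends on the use case; Flink supports two modes, unordered and ordered.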
public class JoinDemo3 {
    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> textStream = env.socketTextStream("localhost", 9000, "\n")
                .map(p -> {
                    // Input format: user name,city id
                    String[] list = p.split(",");
                    return new Tuple2<String, Integer>(list[0], Integer.valueOf(list[1]));
                })
                .returns(new TypeHint<Tuple2<String, Integer>>() {
                });

        // Ordered mode: results are emitted in input order
        DataStream<Tuple3<String,Integer, String>> orderedResult = AsyncDataStream
                .orderedWait(textStream, new JoinDemo3AyncFunction(), 1000L, TimeUnit.MILLISECONDS, 2);

        // Unordered mode: results are emitted as soon as they complete
        DataStream<Tuple3<String,Integer, String>> unorderedResult = AsyncDataStream
                .unorderedWait(textStream, new JoinDemo3AyncFunction(), 1000L, TimeUnit.MILLISECONDS, 2);

        orderedResult.print();
        unorderedResult.print();
        env.execute();
    }

    // Input: (user name, city id); output: Tuple3<user name, city id, city name>
    static class JoinDemo3AyncFunction extends RichAsyncFunction<Tuple2<String, Integer>, Tuple3<String, Integer, String>> {
        // Connection settings
        private static String jdbcUrl = "jdbc:mysql://";
        private static String username = "root";
        private static String password = "123";
        private static String driverName = "com.mysql.jdbc.Driver";
        java.sql.Connection conn;
        PreparedStatement ps;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            Class.forName(driverName);
            conn = DriverManager.getConnection(jdbcUrl, username, password);
            ps = conn.prepareStatement("select city_name from tmp.city_info where id = ?");
        }

        @Override
        public void close() throws Exception {
            super.close();
            if (ps != null) {
                ps.close();
            }
            if (conn != null) {
                conn.close();
            }
        }

        @Override
        public void asyncInvoke(Tuple2<String, Integer> input, ResultFuture<Tuple3<String,Integer, String>> resultFuture) throws Exception {
            // Query by city id. Note that this JDBC call blocks, so the lookup itself is not truly asynchronous.
            ps.setInt(1, input.f1);
            ResultSet rs = ps.executeQuery();
            String cityName = null;
            if (rs.next()) {
                cityName = rs.getString(1);
            }
            List<Tuple3<String, Integer, String>> list = new ArrayList<>();
            list.add(new Tuple3<>(input.f0,input.f1, cityName));
            resultFuture.complete(list);
        }

        @Override
        public void timeout(Tuple2<String, Integer> input, ResultFuture<Tuple3<String,Integer, String>> resultFuture) throws Exception {
            // On timeout, emit the record with an empty city name instead of failing the job
            List<Tuple3<String, Integer, String>> list = new ArrayList<>();
            list.add(new Tuple3<>(input.f0,input.f1, ""));
            resultFuture.complete(list);
        }
    }
}
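
The three concerns listed for async I/O (timeouts, a cap on in-flight requests, and result ordering) can also be illustrated outside Flink. The following sketch uses only the JDK (Java 9 or later for `completeOnTimeout`). The class `AsyncLookupSketch`, the artificially slow `slowLookup`, and all parameter values are assumptions for illustration; this is not how Flink's AsyncDataStream is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class AsyncLookupSketch {

    // Hypothetical external lookup: city 1002 is artificially slow.
    static String slowLookup(int cityId) {
        try {
            Thread.sleep(cityId == 1002 ? 2000 : 10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "city-" + cityId;
    }

    static List<String> lookupAll(List<Integer> cityIds, int capacity, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(capacity);
        Semaphore inFlight = new Semaphore(capacity);
        List<CompletableFuture<String>> futures = new ArrayList<>();
        try {
            for (int cityId : cityIds) {
                // Concurrency cap: the caller blocks here once `capacity` requests are in flight.
                inFlight.acquireUninterruptibly();
                CompletableFuture<String> f = CompletableFuture.supplyAsync(() -> slowLookup(cityId), pool);
                // Timeout: complete with "" instead of waiting for a slow lookup.
                f.completeOnTimeout("", timeoutMs, TimeUnit.MILLISECONDS);
                f.whenComplete((value, error) -> inFlight.release());
                futures.add(f);
            }
            // Ordering: reading the futures in submission order preserves the input order.
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> f : futures) {
                results.add(f.join());
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(lookupAll(List.of(1001, 1002, 1003), 2, 500));
    }
}
```

Blocking the caller on the semaphore plays the role of back pressure here: no new request is issued until an earlier one finishes or times out. Emitting results in completion order instead would correspond to the unordered mode.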

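3. Broadcast dimension table

Use Flink's Broadcast State to broadcast the dimension data stream to the downstream tasks and perform the join there. Its characteristics:

  • Pros: changes to the dimension data are reflected in the results immediately.
  • Cons: the data is kept in memory, so only fairly small dimension tables are supported.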
        // Dimension stream; input format: city id,city name
        DataStream<Tuple2<Integer, String>> cityStream = env.socketTextStream("localhost", 9001, "\n")
                .map(p -> {
                    String[] list = p.split(",");
                    return new Tuple2<Integer, String>(Integer.valueOf(list[0]), list[1]);
                })
                .returns(new TypeHint<Tuple2<Integer, String>>() {
                });

        final MapStateDescriptor<Integer, String> broadcastDesc = new MapStateDescriptor<>("broad1", Integer.class, String.class);
        BroadcastStream<Tuple2<Integer, String>> broadcastStream = cityStream.broadcast(broadcastDesc);

        DataStream<Tuple3<String, Integer, String>> result = textStream.connect(broadcastStream)
                .process(new BroadcastProcessFunction<Tuple2<String, Integer>, Tuple2<Integer, String>, Tuple3<String, Integer, String>>() {
                    @Override
                    public void processElement(Tuple2<String, Integer> value, ReadOnlyContext ctx, Collector<Tuple3<String, Integer, String>> out) throws Exception {
                        // Main stream: look up the city name in the read-only broadcast state
                        ReadOnlyBroadcastState<Integer, String> state = ctx.getBroadcastState(broadcastDesc);
                        String cityName = "";
                        if (state.contains(value.f1)) {
                            cityName = state.get(value.f1);
                        }
                        out.collect(new Tuple3<>(value.f0, value.f1, cityName));
                    }

                    @Override
                    public void processBroadcastElement(Tuple2<Integer, String> value, Context ctx, Collector<Tuple3<String, Integer, String>> out) throws Exception {
                        // Broadcast stream: insert or update the dimension record
                        System.out.println("Received broadcast record: " + value);
                        ctx.getBroadcastState(broadcastDesc).put(value.f0, value.f1);
                    }
                });
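4. Temporal table function join

A temporal table is a view of a continuously changing table at a specific point in time. A temporal table function is a table function that takes a time argument and returns the view of the temporal table at that point in time.

The dimension data stream can be mapped to a temporal table and the main stream joined against it, which makes it possible to join with a specific version of the dimension data (its state at some moment in the past).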
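
One way to picture joining against a version is a per-key history ordered by timestamp, where a lookup returns the latest entry at or before the requested time. The plain-Java sketch below shows this idea with a TreeMap. `VersionedDimTable` and its methods are hypothetical and only model the semantics; they do not describe how Flink stores temporal tables internally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class VersionedDimTable {
    // city id -> (version timestamp -> city name)
    private final Map<Integer, TreeMap<Long, String>> versions = new HashMap<>();

    // Record that cityId maps to cityName from time ts onwards.
    void update(int cityId, long ts, String cityName) {
        versions.computeIfAbsent(cityId, k -> new TreeMap<>()).put(ts, cityName);
    }

    // Value valid at time ts: the latest version with timestamp <= ts, or null if none exists yet.
    String lookupAsOf(int cityId, long ts) {
        TreeMap<Long, String> history = versions.get(cityId);
        if (history == null) {
            return null;
        }
        Map.Entry<Long, String> entry = history.floorEntry(ts);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedDimTable table = new VersionedDimTable();
        table.update(1001, 10L, "beijing");
        table.update(1001, 20L, "peking");
        // A main-stream record with time 15 sees the version written at time 10.
        System.out.println(table.lookupAsOf(1001, 15L));
    }
}
```

A main-stream record therefore joins with the dimension value that was valid at the record's own time, not with whatever value happens to be current when the record is processed.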

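The temporal table function join has the following characteristics:

  • Pros: the dimension data can be very large, updates are picked up promptly, no external storage is required, and different versions of the dimension data can be joined.
  • Cons: it is only available through the Flink SQL API.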
public class JoinDemo5 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, bsSettings);

        // Main stream; input format: user name,city id
        DataStream<Tuple2<String, Integer>> textStream = env.socketTextStream("localhost", 9000, "\n")
                .map(p -> {
                    String[] list = p.split(",");
                    return new Tuple2<String, Integer>(list[0], Integer.valueOf(list[1]));
                })
                .returns(new TypeHint<Tuple2<String, Integer>>() {
                });

        // Dimension stream; input format: city id,city name
        DataStream<Tuple2<Integer, String>> cityStream = env.socketTextStream("localhost", 9001, "\n")
                .map(p -> {
                    String[] list = p.split(",");
                    return new Tuple2<Integer, String>(Integer.valueOf(list[0]), list[1]);
                })
                .returns(new TypeHint<Tuple2<Integer, String>>() {
                });

        // "ps.proctime" appends a processing-time attribute named ps to each table
        Table userTable = tableEnv.fromDataStream(textStream, "user_name,city_id,ps.proctime");
        Table cityTable = tableEnv.fromDataStream(cityStream, "city_id,city_name,ps.proctime");

        // Temporal table function over cityTable: time attribute "ps", primary key "city_id"
        TemporalTableFunction dimCity = cityTable.createTemporalTableFunction("ps", "city_id");
        tableEnv.registerFunction("dimCity", dimCity);

        Table result = tableEnv
                .sqlQuery("select u.user_name,u.city_id,d.city_name from " + userTable + " as u " +
                        ", Lateral table (dimCity(u.ps)) d " +
                        "where u.city_id=d.city_id");
        DataStream<Row> resultDs = tableEnv.toAppendStream(result, Row.class);
        resultDs.print();
        env.execute();
    }
}








