In Flink 1.10, SQL became officially production-ready. While trying it out, I ran into this problem: KafkaTableSink only supports 'update-mode' = 'append', as in the DDL below:

CREATE TABLE user_log_sink (
    user_id VARCHAR,
    item_id VARCHAR,
    category_id VARCHAR,
    behavior VARCHAR,
    ts TIMESTAMP(3)
) WITH (
    'connector.type' = 'kafka',
    'connector.version' = 'universal',
    'connector.topic' = 'user_behavior_sink',
    'connector.properties.zookeeper.connect' = 'venn:2181',
    'connector.properties.bootstrap.servers' = 'venn:9092',
    'update-mode' = 'append',  -- only 'append' is supported
    'format.type' = 'json'
);

At first glance this seems reasonable, since Kafka is append-only: you can write records to it, but not delete them.

Official docs: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/connect.html#kafka-connector

But as soon as the SQL contains a GROUP BY, it no longer works. For example:

SELECT item_id, category_id, behavior, max(ts), min(proctime), max(proctime), count(user_id)
FROM user_log
group by item_id, category_id, behavior;

It fails with:

Exception in thread "main" org.apache.flink.table.api.TableException: AppendStreamTableSink requires that Table has only insert changes.

A fellow user asked about this in the community before; the suggestion given was to convert the result via the DataStream API.

A quick look at the KafkaTableSink source shows this inheritance chain, rooted in AppendStreamTableSink:

public class KafkaTableSink extends KafkaTableSinkBase
public abstract class KafkaTableSinkBase implements AppendStreamTableSink<Row>

A (non-windowed) GROUP BY statement emits retractions per key, so it requires a RetractStreamTableSink or an UpsertStreamTableSink:

public interface RetractStreamTableSink<T> extends StreamTableSink<Tuple2<Boolean, T>>
public interface UpsertStreamTableSink<T> extends StreamTableSink<Tuple2<Boolean, T>>

Note: RetractStreamTableSink is generally used inside Flink itself; UpsertStreamTableSink is the better fit for connecting to external storage systems.
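
Both interfaces use the same encoding: every element is a Tuple2<Boolean, Row>, where f0 = true marks an add/upsert and f0 = false a retraction/delete. A minimal toy sketch (my own illustration, not from the connector code; values are invented) of how the same aggregate update looks in each mode:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.types.Row;

// Toy illustration of the Tuple2<Boolean, Row> change-flag encoding.
public class ChangeFlagDemo {
    public static void main(String[] args) {
        Row oldAgg = Row.of("item_1", 1L); // count after the first record
        Row newAgg = Row.of("item_1", 2L); // count after the second record

        // Retract mode: the old result is withdrawn before the new one is added.
        System.out.println(Tuple2.of(true, oldAgg));  // (true,item_1,1)
        System.out.println(Tuple2.of(false, oldAgg)); // (false,item_1,1)  retract
        System.out.println(Tuple2.of(true, newAgg));  // (true,item_1,2)

        // Upsert mode (keyed by item): the new result overwrites the old one;
        // false only appears when a key is deleted outright.
        System.out.println(Tuple2.of(true, newAgg));  // (true,item_1,2)
    }
}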

At this point the reason a GROUP BY result cannot be written directly through KafkaTableSink is clear. Next, we implement a KafkaUpsertTableSink of our own to solve the problem.

I implemented the custom KafkaUpsertTableSink by referencing the following classes:

KafkaTableSink
KafkaTableSinkBase
KafkaTableSourceSinkFactory
KafkaTableSourceSinkFactoryBase
KafkaValidator
and
HBaseUpsertTableSink
Elasticsearch7UpsertTableSink
Elasticsearch7UpsertTableSinkFactory

See also the previous translated post: "Flink Table API & SQL User-defined Sources & Sinks".

The approach is to copy the whole Kafka connector stack and modify it:

MyKafkaValidator is copied straight from KafkaValidator, changing the connector type value and the update-mode validation:

// identify the new connector as 'myKafka' so it does not clash with the built-in one
public static final String CONNECTOR_TYPE_VALUE_KAFKA = "myKafka";

@Override
public void validate(DescriptorProperties properties) {
    super.validate(properties);
    // accept 'upsert' instead of 'append' for 'update-mode'
    // (UPDATE_MODE / UPDATE_MODE_VALUE_UPSERT come from StreamTableDescriptorValidator)
    properties.validateEnumValues(UPDATE_MODE, true, Collections.singletonList(UPDATE_MODE_VALUE_UPSERT));

    properties.validateValue(CONNECTOR_TYPE, CONNECTOR_TYPE_VALUE_KAFKA, false);

    properties.validateString(CONNECTOR_TOPIC, false, 1, Integer.MAX_VALUE);

    validateStartupMode(properties);

    validateKafkaProperties(properties);

    validateSinkPartitioner(properties);
}

KafkaUpsertTableSinkBase is changed to implement UpsertStreamTableSink:

public abstract class KafkaUpsertTableSinkBase implements UpsertStreamTableSink<Row>

The consumeDataStream implementation changes accordingly: flatten the DataStream<Tuple2<Boolean, Row>> into a DataStream<Row> that the Kafka producer can accept:

public DataStreamSink<?> consumeDataStream(DataStream<Tuple2<Boolean, Row>> dataStream) {

    final SinkFunction<Row> kafkaProducer = createKafkaProducer(
            topic,
            properties,
            serializationSchema,
            partitioner);
    // update by venn
    return dataStream
            .flatMap(new FlatMapFunction<Tuple2<Boolean, Row>, Row>() {
                @Override
                public void flatMap(Tuple2<Boolean, Row> element, Collector<Row> out) throws Exception {
                    // f0 == true means upsert, f0 == false means delete;
                    // delete messages are simply dropped here
                    if (element.f0) {
                        out.collect(element.f1);
                    }
                }
            })
            .addSink(kafkaProducer)
            .setParallelism(dataStream.getParallelism())
            .name(TableConnectorUtils.generateRuntimeName(this.getClass(), getFieldNames()));
}
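
One design note: the flatMap above silently discards the delete (false) messages. If downstream consumers need to observe deletions, an alternative (not what this code does) would be to also serialize the false messages, for example as Kafka tombstone records keyed by the group-by fields.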

KafkaUpsertTableSink extends KafkaUpsertTableSinkBase:

public class KafkaUpsertTableSink extends KafkaUpsertTableSinkBase

and adjusts the corresponding implementation:

@Override
public TypeInformation<Row> getRecordType() {
    return TypeInformation.of(Row.class);
}
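
Besides getRecordType, UpsertStreamTableSink also requires setKeyFields and setIsAppendOnly. A minimal sketch of those overrides (the store-and-ignore bodies below are my assumption, not shown above; since this sink discards deletes anyway, the key information is not actually used):

private String[] keyFields;
private boolean isAppendOnly;

// the planner calls this with the unique key it derives from the query,
// i.e. the group-by fields in our case
@Override
public void setKeyFields(String[] keys) {
    this.keyFields = keys;
}

// the planner calls this with true if the query produces only inserts
@Override
public void setIsAppendOnly(Boolean isAppendOnly) {
    this.isAppendOnly = isAppendOnly;
}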

KafkaUpsertTableSourceSinkFactoryBase implements StreamTableSourceFactory<Row> and StreamTableSinkFactory<Row>, just like KafkaTableSourceSinkFactoryBase:

public abstract class KafkaUpsertTableSourceSinkFactoryBase implements
        StreamTableSourceFactory<Row>,
        StreamTableSinkFactory<Row>

and KafkaUpsertTableSourceSinkFactory extends KafkaUpsertTableSourceSinkFactoryBase:

public class KafkaUpsertTableSourceSinkFactory extends KafkaUpsertTableSourceSinkFactoryBase
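
For the factory to be picked, its requiredContext() must advertise the new connector type; otherwise the property matching will never select it. Roughly like this (a simplified sketch of what KafkaTableSourceSinkFactoryBase does; the exact constants in the real class may differ):

import java.util.HashMap;
import java.util.Map;

import static org.apache.flink.table.descriptors.ConnectorDescriptorValidator.CONNECTOR_PROPERTY_VERSION;
import static org.apache.flink.table.descriptors.ConnectorDescriptorValidator.CONNECTOR_TYPE;

// inside KafkaUpsertTableSourceSinkFactoryBase
@Override
public Map<String, String> requiredContext() {
    Map<String, String> context = new HashMap<>();
    // must be "myKafka" to match MyKafkaValidator.CONNECTOR_TYPE_VALUE_KAFKA
    context.put(CONNECTOR_TYPE, MyKafkaValidator.CONNECTOR_TYPE_VALUE_KAFKA);
    context.put(CONNECTOR_PROPERTY_VERSION, "1");
    return context;
}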

Note: change every reference to KafkaValidator in the copied code to MyKafkaValidator.

The last, very important step: add a META-INF/services folder under the resources directory, create a file named org.apache.flink.table.factories.TableFactory in it, and list the new factory class there:

A TableFactory allows creating different table-related instances from string-based properties. All available factories are called to match against the given set of properties, and the corresponding factory class is selected.

Factories are discovered via Java's Service Provider Interfaces (SPI). This means every dependency and JAR file should contain a file org.apache.flink.table.factories.TableFactory in its META-INF/services resource directory, listing all the table factories it provides.

Note: without this file, the new factory will not be loaded.
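
The file contains one fully-qualified factory class name per line; here that is a single line (the package is my assumption, based on the linked repository):

com.venn.source.KafkaUpsertTableSourceSinkFactory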

With the code changes done, let's try it out:

---sourceTable
CREATE TABLE user_log(
    user_id VARCHAR,
    item_id VARCHAR,
    category_id VARCHAR,
    behavior VARCHAR,
    ts TIMESTAMP(3),
    proctime as PROCTIME()
) WITH (
    'connector.type' = 'kafka',
    'connector.version' = 'universal',
    'connector.topic' = 'user_behavior',
    'connector.properties.zookeeper.connect' = 'venn:2181',
    'connector.properties.bootstrap.servers' = 'venn:9092',
    'connector.startup-mode' = 'earliest-offset',
    'format.type' = 'json'
);

---sinkTable
CREATE TABLE user_log_sink (
    item_id VARCHAR,
    category_id VARCHAR,
    behavior VARCHAR,
    max_tx TIMESTAMP(3),
    min_prc TIMESTAMP(3),
    max_prc TIMESTAMP(3),
    coun BIGINT
) WITH (
    'connector.type' = 'myKafka',
    'connector.version' = 'universal',
    'connector.topic' = 'user_behavior_sink',
    'connector.properties.zookeeper.connect' = 'venn:2181',
    'connector.properties.bootstrap.servers' = 'venn:9092',
    'update-mode' = 'upsert',
    'format.type' = 'json'
);

---insert
INSERT INTO user_log_sink
SELECT item_id, category_id, behavior, max(ts), min(proctime), max(proctime), count(user_id)
FROM user_log
group by item_id, category_id, behavior;

The execution graph:

[execution graph screenshot]

Inspecting the Boolean flag of each message inside KafkaUpsertTableSinkBase, the upsert messages:

[screenshot: upsert messages with flag = true]

The awkward part: I never saw a false message. Even when the same record is sent repeatedly, the count keeps increasing, yet no false message ever shows up to delete the previous result. In hindsight that matches upsert semantics: with the group-by fields as the unique key, each new aggregate upserts the previous value, and a false (delete) message would only appear if a key itself disappeared, which a plain GROUP BY never produces.

Output:

{"item_id":"3611281","category_id":"965809","behavior":"pv","max_tx":"2017-11-26T01:00:00Z","min_prc":"2020-04-08T05:20:26.694Z","max_prc":"2020-04-08T05:26:16.525Z","coun":10}
{"item_id":"3611281","category_id":"965809","behavior":"pv","max_tx":"2017-11-26T01:00:00Z","min_prc":"2020-04-08T05:20:26.694Z","max_prc":"2020-04-08T05:26:27.317Z","coun":11}

The complete code is available here: https://github.com/springMoon/flink-rookie/tree/master/src/main/scala/com/venn/source

My skills here are limited; the sample is for reference only.

Feel free to follow the Flink菜鸟 WeChat public account, which posts occasional articles on Flink development.

posted on 2020-04-08 13:30 by Flink菜鸟