flink1.11.1 从源码看table api以及表达式的问题
想玩一下flink1.11全新的table api,发觉全是坑啊,不知道是因为我新手哪里使用不对,还是flink开发者的疏忽导致的BUG。
table api新特性
在flink1.11中,单纯的sql字符串对字段的操作已经被标记为过时,看一下部分源码:
/** * Performs a selection operation. Similar to a SQL SELECT statement. The field expressions * can contain complex expressions and aggregations. * * <p>Example: * * <pre> * {@code * tab.select("key, value.avg + ' The average' as average") * } * </pre> * @deprecated use {@link #select(Expression...)} */ @Deprecated Table select(String fields); /** * Performs a selection operation. Similar to a SQL SELECT statement. The field expressions * can contain complex expressions and aggregations. * * <p>Scala Example: * * <pre> * {@code * tab.select($("key"), $("value").avg().plus(" The average").as("average")); * } * </pre> * * <p>Scala Example: * * <pre> * {@code * tab.select($"key", $"value".avg + " The average" as "average") * } * </pre> */ Table select(Expression... fields);
这里是通过select 举例,之外还有where、filter、groupBy等。
可以看api文档注释,使用$用来表示字段,并且还可以提供字段方法,使用起来比较方便,例如$("a").as("b"),$是一个静态方法。
/** * Creates an unresolved reference to a table's field. * * <p>Example: * <pre>{@code * tab.select($("key"), $("value")) * } * </pre> */ //CHECKSTYLE.OFF: MethodName public static ApiExpression $(String name) { return new ApiExpression(unresolvedRef(name)); }
除此之外还可以用row,如row("field"),我个人使用看来,$和row是等价的。
问题
- TableEnviroment接口并没有提供流输入
上篇文章说到TableEnviroment比起flink1.10的改动提供接受原始数据的方法,
但是这些方法没有支持输入流数据,需要流数据还得用回StreamTableEnvironment。
- StreamTableEnvironment新方法有问题
SteamTableEnviroment提供了新的重载方法fromDataStream,先看一下具体事先类
@Override public <T> Table fromDataStream(DataStream<T> dataStream, String fields) { List<Expression> expressions = ExpressionParser.parseExpressionList(fields); return fromDataStream(dataStream, expressions.toArray(new Expression[0])); } @Override public <T> Table fromDataStream(DataStream<T> dataStream, Expression... fields) { JavaDataStreamQueryOperation<T> queryOperation = asQueryOperation( dataStream, Optional.of(Arrays.asList(fields))); return createTable(queryOperation); }
再看接口文档注释
/** * Converts the given {@link DataStream} into a {@link Table} with specified field names. * * <p>There are two modes for mapping original fields to the fields of the {@link Table}: * * <p>1. Reference input fields by name: * All fields in the schema definition are referenced by name * (and possibly renamed using an alias (as). Moreover, we can define proctime and rowtime * attributes at arbitrary positions using arbitrary names (except those that exist in the * result schema). In this mode, fields can be reordered and projected out. This mode can * be used for any input type, including POJOs. * * <p>Example: * * <pre> * {@code * DataStream<Tuple2<String, Long>> stream = ... * Table table = tableEnv.fromDataStream( * stream, * $("f1"), // reorder and use the original field * $("rowtime").rowtime(), // extract the internally attached timestamp into an event-time * // attribute named 'rowtime' * $("f0").as("name") // reorder and give the original field a better name * ); * } * </pre> * * <p>2. Reference input fields by position: * In this mode, fields are simply renamed. Event-time attributes can * replace the field on their position in the input data (if it is of correct type) or be * appended at the end. Proctime attributes must be appended at the end. This mode can only be * used if the input type has a defined field order (tuple, case class, Row) and none of * the {@code fields} references a field of the input type. * * <p>Example: * * <pre> * {@code * DataStream<Tuple2<String, Long>> stream = ... * Table table = tableEnv.fromDataStream( * stream, * $("a"), // rename the first field to 'a' * $("b"), // rename the second field to 'b' * $("rowtime").rowtime() // extract the internally attached timestamp into an event-time * // attribute named 'rowtime' * ); * } * </pre> * * @param dataStream The {@link DataStream} to be converted. * @param fields The fields expressions to map original fields of the DataStream to the fields of the {@code Table}. * @param <T> The type of the {@link DataStream}. * @return The converted {@link Table}. */ <T> Table fromDataStream(DataStream<T> dataStream, Expression... fields);
看起来很美好,可以直接$("field")获得字段,as重命名,rowtime可以直接转为时间,但是实际上,用起来肯定报错,先看一段测试代码
Table words = tblEnv.fromDataStream(dataStream
, row("f0").as("word"));
说明一下tuple中第一个字段是f0,可以直接引用。运行到这一段代码就会报错:
org.apache.flink.table.api.ValidationException: Alias 'word' is not allowed if other fields are referenced by position.
at org.apache.flink.table.typeutils.FieldInfoUtils$IndexedExprToFieldInfo.visitAlias(FieldInfoUtils.java:500)
at org.apache.flink.table.typeutils.FieldInfoUtils$IndexedExprToFieldInfo.visit(FieldInfoUtils.java:478)
接下来看源码是怎么处理的
1 @Override 2 public FieldInfo visit(UnresolvedCallExpression unresolvedCall) { 3 if (unresolvedCall.getFunctionDefinition() == BuiltInFunctionDefinitions.AS) { 4 return visitAlias(unresolvedCall); 5 } else if (isRowTimeExpression(unresolvedCall)) { 6 validateRowtimeReplacesCompatibleType(unresolvedCall); 7 return createTimeAttributeField(getChildAsReference(unresolvedCall), TimestampKind.ROWTIME, null); 8 } else if (isProcTimeExpression(unresolvedCall)) { 9 validateProcTimeAttributeAppended(unresolvedCall); 10 return createTimeAttributeField(getChildAsReference(unresolvedCall), TimestampKind.PROCTIME, null); 11 } 12 13 return defaultMethod(unresolvedCall); 14 } 15 16 private FieldInfo visitAlias(UnresolvedCallExpression unresolvedCall) { 17 List<Expression> children = unresolvedCall.getChildren(); 18 String newName = extractAlias(children.get(1)); 19 20 Expression child = children.get(0); 21 if (isProcTimeExpression(child)) { 22 validateProcTimeAttributeAppended(unresolvedCall); 23 return createTimeAttributeField(getChildAsReference(child), TimestampKind.PROCTIME, newName); 24 } else { 25 throw new ValidationException( 26 format("Alias '%s' is not allowed if other fields are referenced by position.", newName)); 27 } 28 } 29 30 @Override 31 protected FieldInfo defaultMethod(Expression expression) { 32 throw new ValidationException("Field reference expression or alias on field expression expected."); 33 }
调用visitAlias时候报错了,看逻辑,必须是procTime才能使用as,很莫名其妙的判断。
再往上看实在visit中调用的visitAlias,看看visit有三个判断,四个分支,是procTime,可以使用,rowTime也可以使用,如果用了as,就进入了visitAlias里面,这里就必须要是procTime。那如果不用as起别名,直接用原生名称,会进入defaultMethod,也是抛个错误出来。按照代码逻辑呢,就是除了rowTime、proTime就别用这fromDataStream的重载方法了,很接口注释文档的内容很不符合,我也不敢说这一定是一个BUG,如果有大神在的话,可以交流一下这个方法到底是怎么用的,在什么场景用。
实际上使用以下方法就不会有问题fromDataStream,
@Override public <T> Table fromDataStream(DataStream<T> dataStream) { JavaDataStreamQueryOperation<T> queryOperation = asQueryOperation( dataStream, Optional.empty()); return createTable(queryOperation); }
如果要用字段别名,之后再用select方法就可以。从代码上可以看出三个DataStream都是复用一套方法,只不过这一个没有在后面参数写入字段而已,所以那个错误更加显得莫名其妙,我还是觉得是个Bug。
因为table api几乎所有算子都换了新的表达式入参,还有很多问题,待续。
posted on 2020-09-13 11:20 SaltFishYe 阅读(691) 评论(0) 编辑 收藏 举报