【presto】presto 查询hive分桶表问题

前言

在使用presto0.220版本的查询 hive的分桶表时，遇到报错，这里整理并记录一下。

报错内容

在这里插入图片描述

Error running query: Hive table 'tj_tmp.student_bucket' is corrupt. The number of files in the directory (16) does not match the declared bucket count (4) for partition: <UNPARTITIONED>

问题原因

先说结论：

presto默认是支持分桶表并不需要开启什么配置之类的。
但是有限制条件，分桶表必须是严格标准的。

那么什么严格标准的分桶表呢？

分桶表的文件数必须等于分桶数才行，不然会报错。
分桶表的HDFS目录的组织结构也必须符合标准，举个例子：我们使用的分桶表从HDFS的目录上看像是一张分区分桶表，但是实际上确只是一张分桶表。

presto从HDFS的文件组织形式与HIVE表信息做对比，发现不一致，则会抛出异常。例如上面那个问题，presto误以为是张分区分桶表，但是只从HIVE中拿到分桶字段，没有拿到分区字段，就会报错。

贴一下报错的代码,代码位置：

presto-hive/src/main/java/com/facebook/presto/hive/HiveBucketAdapterRecordCursor.java

注：我这里是presto-0.0220版本，不同的版本可能位置不同。

int bucket = HiveBucketing.getHiveBucket(tableBucketCount, typeInfoList, scratch);
if ((bucket - bucketToKeep) % partitionBucketCount != 0) {
     throw new PrestoException(HIVE_INVALID_BUCKET_FILES, format(
             "A row that is supposed to be in bucket %s is encountered. Only rows in bucket %s (modulo %s) are expected",
             bucket, bucketToKeep % partitionBucketCount, partitionBucketCount));
 }
 if (bucket == bucketToKeep) {
     return true;
 }

另一处：
presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java

// list all files in the partition
ArrayList<LocatedFileStatus> files = new ArrayList<>(partitionBucketCount);
try {
    Iterators.addAll(files, new HiveFileIterator(path, fileSystem, directoryLister, namenodeStats, FAIL));
}
catch (NestedDirectoryNotAllowedException e) {
    // Fail here to be on the safe side. This seems to be the same as what Hive does
    throw new PrestoException(
            HIVE_INVALID_BUCKET_FILES,
            format("Hive table '%s' is corrupt. Found sub-directory in bucket directory for partition: %s",
                    new SchemaTableName(table.getDatabaseName(), table.getTableName()),
                    splitFactory.getPartitionName()));
        }

// verify we found one file per bucket
if (files.size() != partitionBucketCount) {
    throw new PrestoException(
            HIVE_INVALID_BUCKET_FILES,
            format("Hive table '%s' is corrupt. The number of files in the directory (%s) does not match the declared bucket count (%s) for partition: %s",
                    new SchemaTableName(table.getDatabaseName(), table.getTableName()),
                    files.size(),
                    partitionBucketCount,
                    splitFactory.getPartitionName()));
}

测试

于是我们来测试一下，这个情况。

准备一张分桶表

create table student_bucket (id int, name string) CLUSTERED BY (id ) INTO 4 BUCKETS;

再来写点数据

insert into student_bucket values (3, "zhangsan");

可以看到表下有4个文件
在这里插入图片描述
查询下数据：

在这里插入图片描述
成功取到！

那么在试着多插入些数据

insert into student_bucket values (2, "zhangsan");

此时这张表下面已经有很多的文件了，远远超过分桶数。
在这里插入图片描述

再来查询一下：

果然报错！
在这里插入图片描述

总结

presto默认是支持分桶表并不需要开启什么配置之类的。
分桶表的组织形式也必须规范。

规范：
3. 分桶表下的文件数等于分桶数
4. 分桶表的文件组织形式必须能与HIVE建表（分区、分桶）对应上。

解决方案

将表在重新插入到一张表里，强制分桶。这样分桶数和文件数就一致了。重新导入表格强制分桶 set hive.enforce.bucketing = true;然后再insert。
需要修改presto的源码了去掉presto的代码限制。

好了到此！这里整理并记录一下！网上有说teradata版本的presto没有这个限制，有兴趣的可以试一下。

有三秋桂子，十里荷花。——柳永《望海潮·东南形胜》

posted @ 2022-11-10 19:25 彬在俊阅读(129) 评论(0) 编辑收藏举报

刷新页面返回顶部

彬在俊

【presto】presto 查询hive分桶表问题

前言

报错内容

问题原因

测试

总结

解决方案

公告