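Trying out zstd compression support for Parquet in Dremio

This post covers an attempt at adding zstd compression support to Dremio's Parquet writer. It outlines the approach for anyone who wants to try the same thing.

Changes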

  • ExecConstants option: sabot/kernel/src/main/java/com/dremio/exec/ExecConstants.java ("zstd" has to be among the allowed values)
 EnumeratedStringValidator PARQUET_WRITER_COMPRESSION_TYPE_VALIDATOR = new EnumeratedStringValidator(
      PARQUET_WRITER_COMPRESSION_TYPE, "snappy", "snappy", "gzip", "zstd", "none");
  • Kernel writer: sabot/kernel/src/main/java/com/dremio/exec/store/parquet/ParquetRecordWriter.java (add a "zstd" case)
 switch (codecName) {
      case "snappy":
        codec = CompressionCodecName.SNAPPY;
        break;
      case "lzo":
        codec = CompressionCodecName.LZO;
        break;
      case "gzip":
        codec = CompressionCodecName.GZIP;
        break;
      case "zstd":
        codec = CompressionCodecName.ZSTD;
        break;
      case "none":
      case "uncompressed":
        codec = CompressionCodecName.UNCOMPRESSED;
        break;
      default:
        throw new UnsupportedOperationException(String.format("Unknown compression type: %s", codecName));
    }
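
To show how the two changes fit together, here is a minimal stand-alone sketch. It is not Dremio source: the class name CodecNameSketch, the local Codec enum (a stand-in for Parquet's CompressionCodecName), the ALLOWED list and the lower-casing are all assumptions made for illustration. The point is that the option value has to pass validation first and is only then mapped to a codec.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch; not Dremio source code.
final class CodecNameSketch {
  // Local stand-in for org.apache.parquet.hadoop.metadata.CompressionCodecName.
  enum Codec { SNAPPY, LZO, GZIP, ZSTD, UNCOMPRESSED }

  // Assumed list of values the option validator accepts once "zstd" is added.
  static final List<String> ALLOWED = Arrays.asList("snappy", "gzip", "zstd", "none");

  // Mirrors the validator step: is this option value accepted at all?
  static boolean isAllowed(String value) {
    return value != null && ALLOWED.contains(value.toLowerCase(Locale.ROOT));
  }

  // Mirrors the writer step: map the option value to a codec.
  static Codec resolve(String codecName) {
    switch (codecName.toLowerCase(Locale.ROOT)) {
      case "snappy":
        return Codec.SNAPPY;
      case "lzo":
        return Codec.LZO;
      case "gzip":
        return Codec.GZIP;
      case "zstd":
        return Codec.ZSTD;
      case "none":
      case "uncompressed":
        return Codec.UNCOMPRESSED;
      default:
        throw new UnsupportedOperationException(
            String.format("Unknown compression type: %s", codecName));
    }
  }

  public static void main(String[] args) {
    String value = args.length > 0 ? args[0] : "zstd";
    System.out.println(value + " allowed=" + isAllowed(value) + " codec=" + resolve(value));
  }
}
```

If only the switch were changed and the validator left alone, the value would be rejected before the writer ever saw it, which is why both files need touching.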

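Build

After making the changes you need to compile and build Dremio yourself; see the official dremio-oss GitHub repository for details

Run

  • docker-compose environment
    Note that zstd-jni-1.5.0-1.jar is required. Dremio does not ship it by default, so you have to add it yourself; it can go in the 3rdparty directory
    Dockerfile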
 
FROM  dremio/dremio-oss:23.0.1
USER root
COPY dremio-env-profile /opt/dremio/conf/dremio-env
# dfs configuration support
COPY dremio.conf /opt/dremio/conf/dremio.conf
COPY --from=hengyunabc/arthas:latest /opt/arthas /opt/arthas
COPY zstd-jni-1.5.0-1.jar /opt/dremio/jars/3rdparty/
COPY libhadoop.so.1.0.0 /opt/dremio/lib/libhadoop.so
# self-built dremio-sabot-kernel
COPY dremio-sabot-kernel-23.0.1-202210141019030815-c1de8bcc.jar /opt/dremio/jars/dremio-sabot-kernel-23.0.1-202210141019030815-c1de8bcc.jar
USER dremio
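
Since a missing zstd-jni jar is an easy thing to trip over, a small classpath check can help while debugging the image. The sketch below is a hypothetical helper (ZstdJniCheck is my own name, not part of Dremio); the only outside fact it relies on is that zstd-jni ships the class com.github.luben.zstd.Zstd. It only locates the class and does not load the native library.

```java
// Hypothetical helper for debugging; not part of Dremio.
final class ZstdJniCheck {
  // Main entry class shipped in the zstd-jni jar.
  static final String ZSTD_CLASS = "com.github.luben.zstd.Zstd";

  // Returns true if the named class can be located on the current classpath.
  static boolean isOnClasspath(String className) {
    try {
      // initialize=false: locate the class without running its static initializers
      Class.forName(className, false, ZstdJniCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(ZSTD_CLASS + " on classpath: " + isOnClasspath(ZSTD_CLASS));
  }
}
```

Inside the container this could be run with the 3rdparty directory on the classpath, for example java -cp "/opt/dremio/jars/3rdparty/*:." ZstdJniCheck, assuming the compiled class sits in the current directory.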

docker-compose file

version: "3"
services:
  mongo:
    image: mongo:4.2
    environment:
      - "MONGO_INITDB_ROOT_USERNAME=root"
      - "MONGO_INITDB_ROOT_PASSWORD=dalongdemo"
    ports:
      - "27017:27017"
  minio:
    image: minio/minio
    ports:
      - "9002:9000"
      - "19001:19001"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY: minio123
    command: server --console-address :19001 --quiet /data
  mysql:
    image: mysql:5.6
    environment:
      - "MYSQL_ROOT_PASSWORD=dalong"
    ports:
      - "3306:3306"
  dremio:
    build: ./
    image: dalongrong/dremio:23.1-jprofiler
    volumes:
      - ./dat:/data
    ports:
      - "9047:9047"
      - "8849:8849"
      - "31010:31010"
  • Enable the setting

Add the Dremio support key store.parquet.compression=zstd

Issues

  • Hadoop zstd support

Fix: download a libhadoop.so that was built with zstd support. I found one and have made it available on GitHub

  • Creation result

  • Querying the data

  • Inspecting the generated Parquet files

Install parquet-tools yourself (preferably in a Python venv), then download the data from S3 and inspect it
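
Before pointing parquet-tools at a downloaded file, a quick structural check can rule out a truncated or wrong download. The following is only a sketch with a hypothetical class name (ParquetMagicCheck). It relies on the Parquet file layout, where a plain (unencrypted) file starts with the 4-byte magic PAR1 and ends with a 4-byte little-endian footer length followed by PAR1 again. It does not decode the compression codec, which lives in the Thrift-encoded footer metadata, so parquet-tools is still needed for that.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sanity check; it does not parse the footer metadata itself.
final class ParquetMagicCheck {
  static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  // Returns the footer length in bytes, or -1 if the file does not look like Parquet.
  static long footerLength(String path) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      long size = file.length();
      if (size < 12) {
        return -1; // needs at least magic (4) + footer length (4) + magic (4)
      }
      byte[] head = new byte[4];
      byte[] tail = new byte[8];
      file.readFully(head);
      file.seek(size - 8);
      file.readFully(tail);
      boolean headOk = Arrays.equals(head, MAGIC);
      boolean tailOk = Arrays.equals(Arrays.copyOfRange(tail, 4, 8), MAGIC);
      if (!headOk || !tailOk) {
        return -1;
      }
      // The footer length is an unsigned 32-bit little-endian integer.
      long length = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt() & 0xFFFFFFFFL;
      return length <= size - 12 ? length : -1;
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.out.println("usage: ParquetMagicCheck <file.parquet>");
      return;
    }
    System.out.println("footer length: " + footerLength(args[0]));
  }
}
```

A result of -1 means the file is not worth inspecting further; any other value is the size of the footer that parquet-tools will go on to decode.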

Notes

The complete runtime environment is on GitHub for reference

References

https://github.com/luben/zstd-jni
https://github.com/dremio/dremio-oss
https://facebook.github.io/zstd/
https://github.com/apache/parquet-format/blob/master/Compression.md
https://github.com/dremio/dremio-oss/blob/d41cb52143b6b0289fc8ed4d970bfcf410a669e8/services/jobs/src/main/java/com/dremio/service/jobs/JobResultsStore.java
https://github.com/rongfengliang/dremio-parquet-zstd

posted on 2022-10-24 21:44  荣锋亮