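Trying out zstd compression support for Parquet in Dremio
This post describes an attempt at adding zstd compression support to Dremio's Parquet writer. It outlines the approach for anyone who wants to do the same.
Changes
- ExecConstants option: sabot/kernel/src/main/java/com/dremio/exec/ExecConstants.java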
// "zstd" is added to the accepted values so that the option can be set to zstd
EnumeratedStringValidator PARQUET_WRITER_COMPRESSION_TYPE_VALIDATOR = new EnumeratedStringValidator(
    PARQUET_WRITER_COMPRESSION_TYPE, "snappy", "snappy", "gzip", "zstd", "none");
- ParquetRecordWriter in the kernel module: sabot/kernel/src/main/java/com/dremio/exec/store/parquet/ParquetRecordWriter.java
switch (codecName) {
  case "snappy":
    codec = CompressionCodecName.SNAPPY;
    break;
  case "lzo":
    codec = CompressionCodecName.LZO;
    break;
  case "gzip":
    codec = CompressionCodecName.GZIP;
    break;
  case "zstd":
    codec = CompressionCodecName.ZSTD;
    break;
  case "none":
  case "uncompressed":
    codec = CompressionCodecName.UNCOMPRESSED;
    break;
  default:
    throw new UnsupportedOperationException(String.format("Unknown compression type: %s", codecName));
}
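The two changes above work together: the option validator decides which strings are accepted for the setting, and the switch maps an accepted string to a codec. The following is a minimal, self-contained sketch of that interplay. It does not use Dremio's or Parquet's classes; `CodecResolutionSketch`, its `Codec` enum, the `ALLOWED` set and the `validate`/`resolve`/`fromOption` helpers are hypothetical stand-ins introduced only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CodecResolutionSketch {
    // Stand-in for Parquet's CompressionCodecName (assumption: only the values used here).
    enum Codec { SNAPPY, LZO, GZIP, ZSTD, UNCOMPRESSED }

    // Stand-in for the accepted values of the option validator, with "zstd" included.
    static final Set<String> ALLOWED =
        new HashSet<>(Arrays.asList("snappy", "gzip", "zstd", "none"));

    // Rejects any value that is not in the accepted set.
    static String validate(String value) {
        if (!ALLOWED.contains(value)) {
            throw new IllegalArgumentException("Option value not allowed: " + value);
        }
        return value;
    }

    // Mirrors the shape of the switch in the writer: name -> codec.
    static Codec resolve(String codecName) {
        switch (codecName) {
            case "snappy": return Codec.SNAPPY;
            case "lzo": return Codec.LZO;
            case "gzip": return Codec.GZIP;
            case "zstd": return Codec.ZSTD;
            case "none":
            case "uncompressed": return Codec.UNCOMPRESSED;
            default:
                throw new UnsupportedOperationException(
                    String.format("Unknown compression type: %s", codecName));
        }
    }

    // A value must pass validation before the switch ever sees it.
    static Codec fromOption(String value) {
        return resolve(validate(value));
    }

    public static void main(String[] args) {
        System.out.println(fromOption("zstd"));
    }
}
```

In this sketch a name that the switch handles but the accepted set does not list (for example "lzo") is still rejected at validation time, which is why both places have to be touched when enabling a new codec.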
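Build
After making the changes you need to compile and build Dremio yourself; see the official Dremio GitHub repository for details.
Running
- docker-compose environment
Note that zstd-jni-1.5.0-1.jar is required. Dremio does not ship it by default, so you have to add it yourself; it can go into the jars/3rdparty directory.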
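A quick way to confirm that the jar is actually visible to a JVM is to try to load a class from it. The sketch below assumes the zstd-jni entry-point class is `com.github.luben.zstd.Zstd`; the `ZstdJniCheck` class and its `locate` helper are hypothetical names used only for this illustration. It uses reflection, so it compiles and runs even when the jar is missing.

```java
import java.security.CodeSource;

final class ZstdJniCheck {
    // Entry-point class of the zstd-jni library (assumption stated in the text above).
    static final String ZSTD_CLASS = "com.github.luben.zstd.Zstd";

    // Returns where the class was loaded from, or null if it cannot be found.
    static String locate(String className) {
        try {
            // initialize=false: only look the class up, do not run static
            // initializers (so no native library is loaded by this check).
            Class<?> cls = Class.forName(className, false, ZstdJniCheck.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return source == null ? "(bootstrap class loader)" : source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String location = locate(ZSTD_CLASS);
        System.out.println(location == null
            ? "zstd-jni not found on the classpath"
            : "zstd-jni loaded from " + location);
    }
}
```

Run it with the same classpath you want to verify, for example with the 3rdparty jars directory on `-cp`; it only checks visibility of the class, not whether the native library loads.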
Dockerfile
FROM dremio/dremio-oss:23.0.1
USER root
COPY dremio-env-profile /opt/dremio/conf/dremio-env
# dremio.conf with dfs (distributed store) configuration
COPY dremio.conf /opt/dremio/conf/dremio.conf
COPY --from=hengyunabc/arthas:latest /opt/arthas /opt/arthas
COPY zstd-jni-1.5.0-1.jar /opt/dremio/jars/3rdparty/
COPY libhadoop.so.1.0.0 /opt/dremio/lib/libhadoop.so
# Self-built dremio-sabot-kernel jar
COPY dremio-sabot-kernel-23.0.1-202210141019030815-c1de8bcc.jar /opt/dremio/jars/dremio-sabot-kernel-23.0.1-202210141019030815-c1de8bcc.jar
USER dremio
docker-compose file
version: "3"
services:
  mongo:
    image: mongo:4.2
    environment:
      - "MONGO_INITDB_ROOT_USERNAME=root"
      - "MONGO_INITDB_ROOT_PASSWORD=dalongdemo"
    ports:
      - "27017:27017"
  minio:
    image: minio/minio
    ports:
      - "9002:9000"
      - "19001:19001"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY: minio123
    command: server --console-address :19001 --quiet /data
  mysql:
    image: mysql:5.6
    environment:
      - "MYSQL_ROOT_PASSWORD=dalong"
    ports:
      - "3306:3306"
  dremio:
    build: ./
    image: dalongrong/dremio:23.1-jprofiler
    volumes:
      - ./dat:/data
    ports:
      - "9047:9047"
      - "8849:8849"
      - "31010:31010"
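- Enabling the setting
Add the Dremio support key store.parquet.compression=zstd
Issues
- Hadoop zstd support
The Hadoop native library also needs zstd support. The fix is to download a libhadoop.so that was built with zstd; I found one and have made it available on GitHub.
- Creation result
- Querying the data
- Inspecting the generated Parquet file
Download and install parquet-tools yourself (preferably inside a Python venv), then download the data from S3 and inspect it.
Notes
I have put the complete runtime environment on GitHub for reference.
References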
https://github.com/luben/zstd-jni
https://github.com/dremio/dremio-oss
https://facebook.github.io/zstd/
https://github.com/apache/parquet-format/blob/master/Compression.md
https://github.com/dremio/dremio-oss/blob/d41cb52143b6b0289fc8ed4d970bfcf410a669e8/services/jobs/src/main/java/com/dremio/service/jobs/JobResultsStore.java
https://github.com/rongfengliang/dremio-parquet-zstd