Daft + Nessie REST catalog + Dremio integration, explained with code
I have covered this integration at a high level before; below is a quick walkthrough with code examples.
nessie + dremio deployment
Deployed with docker-compose; for background, see my earlier posts on dremio and nessie.
- docker-compose
version: "3"
services:
  zk:
    image: zookeeper
    ports:
      - 2181:2181
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "19001:19001"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY: minio123
      MINIO_API_SELECT_PARQUET: "on"
    command: server --console-address :19001 --quiet /data
  # dremio_coordinator-up:
  #   build: .
  #   privileged: true
  #   # command: /opt/dremio/bin/dremio-admin upgrade
  #   entrypoint: ["/opt/dremio/bin/dremio-admin", "upgrade"]
  #   cap_add:
  #     - ALL
  #   volumes:
  #     - ./conf/dremio_coor.conf:/opt/dremio/conf/dremio.conf
  #     - ./datas:/myappdemo
  #     - ./logback.xml:/opt/dremio/conf/logback.xml
  #     - ./data:/opt/dremio/data
  #     - ./logs:/opt/dremio/logs
  dremio_coordinator:
    build: .
    hostname: dremio-coordinator
    container_name: dremio-coordinator
    privileged: true
    cap_add:
      - ALL
    volumes:
      - ./conf/dremio_coor.conf:/opt/dremio/conf/dremio.conf
      - ./datas:/myappdemo
      - ./logback.xml:/opt/dremio/conf/logback.xml
      - ./data:/opt/dremio/data
      - /Users/dalong/mylearning/daft-learning/warehouse:/warehouse
      - ./logs:/opt/dremio/logs
    ports:
      - "9047:9047"
      - "31010:31010"
      - "9090:9090"
      - "9092:9092"
      - "8849:8849"
      - "32010:32010"
  dremio_executor_1:
    build: .
    hostname: dremio-executor-1
    container_name: dremio-executor-1
    privileged: true
    cap_add:
      - ALL
    volumes:
      - ./conf/dremio_exec.conf:/opt/dremio/conf/dremio.conf
      - ./dremio-telemetry-exec.yaml:/opt/dremio/conf/dremio-telemetry.yaml
      - ./datas:/myappdemo
      - ./logsv2:/opt/dremio/logs
      - ./logback.xml:/opt/dremio/conf/logback.xml
      - /Users/dalong/mylearning/daft-learning/warehouse:/warehouse
    ports:
      - "9048:9047"
      - "31011:31010"
      - "9091:9090"
      - "8859:8849"
      - "9093:9092"
    depends_on:
      - zk
      - dremio_coordinator
  pg:
    image: postgres:16.0
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=dalongdemo
  nessie:
    image: quay.io/projectnessie/nessie:0.100.2-java
    environment:
      - "JAVA_DEBUG=true"
      - "JAVA_DEBUG_PORT=*:5005"
      - nessie.version.store.type=JDBC
      - quarkus.datasource.username=postgres
      - quarkus.datasource.password=dalongdemo
      - quarkus.datasource.jdbc.url=jdbc:postgresql://pg:5432/postgres
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://demobucket/
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.catalog.service.s3.default-options.path-style-access=true
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.secrets.access-key.name=minio
      - nessie.catalog.secrets.access-key.secret=minio123
      - nessie.catalog.service.s3.default-options.endpoint=http://192.168.31.166:9000
      - nessie.catalog.service.s3.default-options.external-endpoint=http://192.168.31.166:9000
    ports:
      - "19120:19120"
      - "19121:19121"
      - "5005:5005"
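Before pointing clients at the stack, it helps to confirm the Nessie Iceberg REST endpoint is actually up. A minimal stdlib sketch (the host/port are the ones published in the compose file above; `/iceberg/v1/config` is the endpoint Iceberg REST clients fetch first):

```python
import json
import urllib.error
import urllib.request

NESSIE_BASE = "http://localhost:19120"  # port published in the compose file


def iceberg_config_url(base):
    # Iceberg REST clients call GET /v1/config first; Nessie serves the
    # Iceberg REST API under the /iceberg prefix
    return base.rstrip("/") + "/iceberg/v1/config"


def check_catalog(base, timeout=5):
    # Returns the catalog config dict if reachable, otherwise None
    try:
        with urllib.request.urlopen(iceberg_config_url(base), timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError):
        return None  # stack not (yet) reachable


if __name__ == "__main__":
    cfg = check_catalog(NESSIE_BASE)
    print("catalog ready" if cfg else "catalog not reachable yet")
```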
Daft integration
- Reference code
import daft
from daft.io import IOConfig, S3Config
from pyiceberg.catalog.rest import RestCatalog as ISqlCatalog
import pyarrow.parquet as pq
from daft.sql import SQLCatalog

# The s3.* settings here are optional, but configuring s3.region is recommended;
# without it, extra region-lookup API calls are made, which is slow
catalog = ISqlCatalog(
    "default",
    **{
        "uri": "http://192.168.31.166:19120/iceberg",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "minio",
        "s3.secret-access-key": "minio123",
        "s3.region": "us-east-1",
        "s3_url_style": "path",
    },
)

# Write data to Iceberg via pyarrow (Daft can also write directly)
# catalog.create_namespace("default")
df = pq.read_table("yellow_tripdata_2023-01.parquet")
table = catalog.create_table(
    "default.dalongdemov14",
    schema=df.schema,
    location="s3://demobucket/dalongdemov14",
)
table.append(df)

# Read the Iceberg table with Daft
mytable = catalog.load_table("default.dalongdemov14")
df2 = daft.read_iceberg(mytable)
print(df2.collect())
Dremio integration
The core is pointing Dremio at the REST catalog address, plus some s3-related settings.
- Reference configuration
These settings are similar to those used when integrating a Nessie catalog source:
fs.s3a.endpoint xxxxxx
fs.s3a.path.style.access true
fs.s3a.connection.ssl.enabled false
fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
fs.dremioS3.impl com.dremio.plugins.s3.store.S3FileSystem
dremio.s3.compat true
fs.s3a.access.key xxxx
fs.s3a.secret.key xxx
- Query results
Notes
The above is a simple integration recipe. With Daft and pyiceberg we can build a simple, lightweight ETL, and then plug in Dremio for convenient querying. Daft itself can also query the data directly, though its query capabilities are currently more limited.
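The note above frames Daft + pyiceberg as a lightweight ETL. One way to keep such a pipeline testable is to inject the extract and load steps as callables around a pure transform; everything below (function names, the fare_amount filter) is an illustrative sketch, not from the original post:

```python
def transform(rows):
    # Example cleanup step: drop rows with a non-positive fare_amount
    return [r for r in rows if r.get("fare_amount", 0) > 0]


def run_etl(extract, load):
    # extract -> transform -> load; returns the number of rows written.
    # In the post, extract would be pq.read_table(...) and load would be
    # table.append(...) against the Nessie REST catalog.
    batch = transform(extract())
    load(batch)
    return len(batch)


if __name__ == "__main__":
    sink = []
    n = run_etl(lambda: [{"fare_amount": 12.5}, {"fare_amount": -1.0}], sink.extend)
    print(f"loaded {n} rows")
```

Keeping the transform pure makes it unit-testable without MinIO or Nessie running.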
References
https://py.iceberg.apache.org/configuration/#fileio
https://github.com/projectnessie/nessie/releases/tag/nessie-0.90.2
https://github.com/apache/iceberg-python/issues/540
https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations/iceberg.html