Daft + Nessie REST catalog + Dremio integration: a quick code walkthrough

I have outlined this integration before; below is a quick walkthrough with code examples.

Deploying Nessie + Dremio

Deployment is based on docker-compose.

  • docker-compose

For background, see my earlier posts introducing Dremio and Nessie.

version: "3"
services:
  zk:
    image: zookeeper
    ports:
      - 2181:2181
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "19001:19001"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY: minio123
      MINIO_API_SELECT_PARQUET: "on"  # quote it: bare `on` is parsed as a YAML boolean
    command: server --console-address :19001 --quiet /data
  # dremio_coordinator-up:
  #   build: .
  #   privileged: true
  #   # command: /opt/dremio/bin/dremio-admin upgrade
  #   entrypoint: ["/opt/dremio/bin/dremio-admin", "upgrade"]
  #   cap_add:
  #     - ALL
  #   volumes:
  #     - ./conf/dremio_coor.conf:/opt/dremio/conf/dremio.conf
  #     - ./datas:/myappdemo
  #     - ./logback.xml:/opt/dremio/conf/logback.xml
  #     - ./data:/opt/dremio/data
  #     - ./logs:/opt/dremio/logs
  dremio_coordinator:
    build: .
    hostname: dremio-coordinator
    container_name: dremio-coordinator
    privileged: true
    cap_add:
      - ALL
    volumes:
      - ./conf/dremio_coor.conf:/opt/dremio/conf/dremio.conf
      - ./datas:/myappdemo
      - ./logback.xml:/opt/dremio/conf/logback.xml
      - ./data:/opt/dremio/data
      - /Users/dalong/mylearning/daft-learning/warehouse:/warehouse
      - ./logs:/opt/dremio/logs
    ports:
      - "9047:9047"
      - "31010:31010"
      - "9090:9090"
      - "9092:9092"
      - "8849:8849"
      - "32010:32010"
  dremio_executor_1:
    build: .
    hostname: dremio-executor-1
    container_name: dremio-executor-1
    privileged: true
    cap_add:
      - ALL
    volumes:
      - ./conf/dremio_exec.conf:/opt/dremio/conf/dremio.conf
      - ./dremio-telemetry-exec.yaml:/opt/dremio/conf/dremio-telemetry.yaml
      - ./datas:/myappdemo
      - ./logsv2:/opt/dremio/logs
      - ./logback.xml:/opt/dremio/conf/logback.xml
      - /Users/dalong/mylearning/daft-learning/warehouse:/warehouse
    ports:
      - "9048:9047"
      - "31011:31010"
      - "9091:9090"
      - "8859:8849"
      - "9093:9092"
    depends_on:
      - zk
      - dremio_coordinator
  pg:
    image: postgres:16.0
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=dalongdemo
  nessie:
    image: quay.io/projectnessie/nessie:0.100.2-java
    environment:
      - "JAVA_DEBUG=true"
      - "JAVA_DEBUG_PORT=*:5005"
      - nessie.version.store.type=JDBC
      - quarkus.datasource.username=postgres
      - quarkus.datasource.password=dalongdemo
      - quarkus.datasource.jdbc.url=jdbc:postgresql://pg:5432/postgres
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://demobucket/
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.catalog.service.s3.default-options.path-style-access=true
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.secrets.access-key.name=minio
      - nessie.catalog.secrets.access-key.secret=minio123
      - nessie.catalog.service.s3.default-options.endpoint=http://192.168.31.166:9000
      - nessie.catalog.service.s3.default-options.external-endpoint=http://192.168.31.166:9000
    ports:
      - "19120:19120"
      - "19121:19121"
      - "5005:5005"
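Once the stack is up, a quick way to confirm the containers are listening is to probe the published host ports. A minimal sketch (the service labels are my own; the ports match the compose file above):

```python
import socket

# Host-side ports published by the compose file above.
SERVICES = {
    "nessie-rest": 19120,
    "minio-s3": 9000,
    "dremio-ui": 9047,
    "postgres": 5432,
}

def is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if is_open("localhost", port) else "down"
        print(f"{name:12s} :{port} {state}")
```

This only checks that something accepts TCP connections; for Nessie a real health check would hit its HTTP management port (19121).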

Daft integration

  • Reference code
import daft
import pyarrow.parquet as pq

# RestCatalog talks to Nessie's Iceberg REST endpoint (the /iceberg path)
from pyiceberg.catalog.rest import RestCatalog

# The s3.* options can be omitted here, but setting s3.region is recommended:
# without it pyiceberg makes extra region-lookup API calls, which is slow
catalog = RestCatalog(
    "default",
    **{
        "uri": "http://192.168.31.166:19120/iceberg",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "minio",
        "s3.secret-access-key": "minio123",
        "s3.region": "us-east-1",
    },
)

# Write data into Iceberg via pyarrow (Daft can also write directly)
# catalog.create_namespace("default")

df = pq.read_table("yellow_tripdata_2023-01.parquet")

table = catalog.create_table(
    "default.dalongdemov14",
    schema=df.schema,
    location="s3://demobucket/dalongdemov14",
)
table.append(df)

# Read the Iceberg table back with Daft
mytable = catalog.load_table("default.dalongdemov14")
df2 = daft.read_iceberg(mytable)
print(df2.collect())
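After the append, the table's files land under the location passed to create_table. A small helper sketching the standard Iceberg directory layout under that location (paths are illustrative; the exact file names are generated by the writer):

```python
def table_layout(location: str) -> dict:
    """Standard Iceberg subdirectories under a table location."""
    base = location.rstrip("/")
    return {
        "metadata": f"{base}/metadata",  # *.metadata.json, manifest lists, manifests
        "data": f"{base}/data",          # the actual parquet data files
    }

layout = table_layout("s3://demobucket/dalongdemov14")
print(layout["metadata"])  # s3://demobucket/dalongdemov14/metadata
print(layout["data"])      # s3://demobucket/dalongdemov14/data
```

Browsing these prefixes in the MinIO console is a quick way to verify the write actually happened.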

Dremio integration

The core is pointing Dremio at the REST catalog address, plus a few S3-related settings.

  • Reference configuration

These settings are similar to those used for a Nessie catalog source.

fs.s3a.endpoint                     xxxxxx
fs.s3a.path.style.access            true
fs.s3a.connection.ssl.enabled       false
fs.s3a.aws.credentials.provider     org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
fs.dremioS3.impl                    com.dremio.plugins.s3.store.S3FileSystem
dremio.s3.compat                    true
fs.s3a.access.key                   xxxx
fs.s3a.secret.key                   xxx
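A quick sanity check of these properties before pasting them into the Dremio source settings can save a debugging round. A minimal sketch (the values here are sample local-MinIO settings, not the elided ones above; the checks encode MinIO-specific conventions):

```python
# Sample values for a local MinIO-backed source; substitute your own.
props = {
    "fs.s3a.endpoint": "192.168.31.166:9000",
    "fs.s3a.path.style.access": "true",
    "fs.s3a.connection.ssl.enabled": "false",
    "fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    "dremio.s3.compat": "true",
    "fs.s3a.access.key": "minio",
    "fs.s3a.secret.key": "minio123",
}

def check(props: dict) -> list:
    """Return a list of likely misconfigurations for an S3-compatible source."""
    problems = []
    if props.get("dremio.s3.compat") != "true":
        problems.append("dremio.s3.compat should be true for S3-compatible stores")
    if props.get("fs.s3a.path.style.access") != "true":
        problems.append("MinIO endpoints generally need path-style access")
    if not props.get("fs.s3a.access.key") or not props.get("fs.s3a.secret.key"):
        problems.append("access/secret key missing")
    return problems

print(check(props))  # [] when everything looks consistent
```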

  • Query results

Notes

The above is a simple integration setup. With Daft and pyiceberg we can build a simple, lightweight ETL pipeline, and plug in Dremio for convenient querying. Daft can also query the data directly, but its query capabilities are currently more limited.

References

https://py.iceberg.apache.org/configuration/#fileio

https://github.com/projectnessie/nessie/releases/tag/nessie-0.90.2

https://github.com/apache/iceberg-python/issues/540

https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations/iceberg.html

posted by 荣锋亮
