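Testing Daft's Iceberg integration

Daft integrates with Iceberg through pyiceberg. Below is a simple test.

Preparing the Iceberg table

The table is stored directly on the local filesystem.

  • Reference code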
import daft

from pyiceberg.catalog.sql import SqlCatalog as ISqlCatalog

import pyarrow.parquet as pq

from daft.sql import SQLCatalog


warehouse_path = "warehouse"
catalog = ISqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

# load the source data first so its schema is available for create_table
df = pq.read_table("yellow_tripdata_2023-01.parquet")

# write table
catalog.create_namespace("default")

table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
)
table.append(df)
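
One practical detail of the setup above: SQLite creates the catalog database file but not its parent directory, so the warehouse directory has to exist before the catalog is opened. The following is a minimal sketch and not part of the original test. The helper name `build_catalog_properties` is hypothetical, and resolving to an absolute path (on a POSIX system) is an assumption made to keep the `file://` URI unambiguous.

```python
from pathlib import Path


def build_catalog_properties(warehouse_dir: str) -> dict:
    """Create the warehouse directory and return SqlCatalog-style properties.

    Hypothetical helper for illustration; the keys mirror the ones used
    in the reference code ("uri" and "warehouse").
    """
    # Resolve to an absolute path so the file:// URI is unambiguous (assumption).
    path = Path(warehouse_dir).resolve()
    # SQLite cannot create the database file if the directory is missing.
    path.mkdir(parents=True, exist_ok=True)
    return {
        "uri": f"sqlite:///{path}/pyiceberg_catalog.db",
        "warehouse": f"file://{path}",
    }
```

The returned dictionary could then be passed along the lines of `ISqlCatalog("default", **build_catalog_properties("warehouse"))`.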

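Read and write operations

  • Read

Daft has SQL support built in. Note that to query a table through SQL, the table has to be supplied via a catalog (here a SQLCatalog).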

import daft

from pyiceberg.catalog.sql import SqlCatalog as ISqlCatalog

import pyarrow.parquet as pq

from daft.sql import SQLCatalog


warehouse_path = "warehouse"
catalog = ISqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

# read table
mytable = catalog.load_table("default.taxi_dataset")

df = daft.read_iceberg(mytable)

sql_catalog = SQLCatalog({"mydemo":df})

result_df = daft.sql("SELECT * FROM mydemo",catalog=sql_catalog)

print(result_df.collect())
  • Write

The current Daft release has a compatibility problem with newer pyiceberg (0.8.0), so it needs a small local fix before writes work.

Modify dataframe/dataframe.py as follows

if parse(pyiceberg.__version__) >= parse("0.7.0"):
    from pyiceberg.table import ALWAYS_TRUE, TableProperties
    if parse(pyiceberg.__version__) >= parse("0.8.0"):
        # pyiceberg >= 0.8.0: import the helper from pyiceberg.utils.properties
        from pyiceberg.utils.properties import property_as_bool
    else:
        # pyiceberg 0.7.x: the helper is a method on PropertyUtil
        from pyiceberg.table import PropertyUtil
        property_as_bool = PropertyUtil.property_as_bool

    tx = table.transaction()

    if mode == "overwrite":
        tx.delete(delete_filter=ALWAYS_TRUE)

    update_snapshot = tx.update_snapshot()

    manifest_merge_enabled = mode == "append" and property_as_bool(
        tx.table_metadata.properties,
        TableProperties.MANIFEST_MERGE_ENABLED,
        TableProperties.MANIFEST_MERGE_ENABLED_DEFAULT,
    )
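
The version gate in the patch above can be isolated into a small, dependency-free sketch. The function names `version_tuple` and `property_as_bool_location` and the tuple-based version parsing are assumptions made for illustration; the patch itself relies on `parse` (presumably `packaging.version.parse`) and on real imports. The sketch only reports where the helper would be looked up for a given pyiceberg version, following the two branches shown above.

```python
def version_tuple(version: str) -> tuple:
    """Turn a plain release string such as "0.8.0" into (0, 8, 0).

    Simplified stand-in for packaging.version.parse; it does not
    handle pre-release or local version suffixes.
    """
    return tuple(int(part) for part in version.split("."))


def property_as_bool_location(pyiceberg_version: str) -> str:
    """Return where the patched code looks up property_as_bool.

    Only meaningful for pyiceberg >= 0.7.0, the range the patch covers.
    """
    if version_tuple(pyiceberg_version) >= version_tuple("0.8.0"):
        return "pyiceberg.utils.properties.property_as_bool"
    return "pyiceberg.table.PropertyUtil.property_as_bool"


print(property_as_bool_location("0.7.1"))
print(property_as_bool_location("0.8.0"))
```

Keeping the lookup in one place like this makes it easier to see that exactly one branch applies to any installed version.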

Writing the data


import daft

from pyiceberg.catalog.sql import SqlCatalog as ISqlCatalog

import pyarrow.parquet as pq

from daft.sql import SQLCatalog


warehouse_path = "warehouse"
catalog = ISqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)


# read table
mytable = catalog.load_table("default.taxi_dataset")

df = daft.read_iceberg(mytable)

sql_catalog = SQLCatalog({"mydemo":df})

result_df = daft.sql("SELECT * FROM mydemo",catalog=sql_catalog)

print(result_df.collect())

result_df.write_iceberg(mytable)

print(result_df.collect())

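Notes

This was a simple test. Overall the integration is quite convenient; the one drawback is that support for newer pyiceberg versions has some problems, which you can fix yourself.

References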

https://github.com/apache/iceberg-python/blob/pyiceberg-0.7.1/pyiceberg/table/__init__.py

Posted by 荣锋亮