Daft Iceberg integration test
Daft uses pyiceberg to provide its Iceberg integration. Below is a simple test.
Preparing the Iceberg table
This test uses local file storage directly.
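One practical note before running the script: the SQL catalog stores its sqlite database file inside the warehouse directory, so that directory must exist first. A minimal sketch (the path matches the `warehouse_path` used below):

```python
import os

# the sqlite catalog file (pyiceberg_catalog.db) lives inside the
# warehouse directory, so create it before initializing SqlCatalog
os.makedirs("warehouse", exist_ok=True)
print(os.path.isdir("warehouse"))  # True
```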
- Reference code
import pyarrow.parquet as pq
from pyiceberg.catalog.sql import SqlCatalog as ISqlCatalog

warehouse_path = "warehouse"
catalog = ISqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
# load the source parquet first, since its schema seeds the table
df = pq.read_table("yellow_tripdata_2023-01.parquet")
# write table
catalog.create_namespace("default")
table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
)
table.append(df)
Read and write operations
- Read
Daft integrates SQL queries; note that reading an Iceberg table requires providing the catalog information.
import daft
from pyiceberg.catalog.sql import SqlCatalog as ISqlCatalog
from daft.sql import SQLCatalog

warehouse_path = "warehouse"
catalog = ISqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
# read table
mytable = catalog.load_table("default.taxi_dataset")
df = daft.read_iceberg(mytable)
# register the DataFrame under a name so SQL can reference it
sql_catalog = SQLCatalog({"mydemo": df})
result_df = daft.sql("SELECT * FROM mydemo", catalog=sql_catalog)
print(result_df.collect())
- Write
The current Daft release has problems with newer pyiceberg versions (0.8.0), so you need to patch Daft yourself before writes work.
Modify dataframe/dataframe.py as follows:
if parse(pyiceberg.__version__) >= parse("0.7.0"):
    from pyiceberg.table import ALWAYS_TRUE, TableProperties

    if parse(pyiceberg.__version__) >= parse("0.8.0"):
        from pyiceberg.utils.properties import property_as_bool
    else:
        from pyiceberg.table import PropertyUtil

        property_as_bool = PropertyUtil.property_as_bool

    tx = table.transaction()
    if mode == "overwrite":
        tx.delete(delete_filter=ALWAYS_TRUE)
    update_snapshot = tx.update_snapshot()
    manifest_merge_enabled = mode == "append" and property_as_bool(
        tx.table_metadata.properties,
        TableProperties.MANIFEST_MERGE_ENABLED,
        TableProperties.MANIFEST_MERGE_ENABLED_DEFAULT,
    )
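The patch branches on the installed pyiceberg version via `packaging`'s `parse`, which compares version components numerically rather than lexicographically. A quick illustration of why plain string comparison would be wrong here:

```python
from packaging.version import parse

# numeric component-wise comparison, as used by the Daft patch
print(parse("0.8.0") >= parse("0.7.0"))   # True
print(parse("0.10.0") > parse("0.9.1"))   # True (10 > 9)

# a lexicographic string comparison would get this case backwards
print("0.10.0" > "0.9.1")                 # False
```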
Write example:
import daft
from pyiceberg.catalog.sql import SqlCatalog as ISqlCatalog
from daft.sql import SQLCatalog

warehouse_path = "warehouse"
catalog = ISqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
# read the table, run a SQL query over it, then write the result back
mytable = catalog.load_table("default.taxi_dataset")
df = daft.read_iceberg(mytable)
sql_catalog = SQLCatalog({"mydemo": df})
result_df = daft.sql("SELECT * FROM mydemo", catalog=sql_catalog)
print(result_df.collect())
# append the query result to the Iceberg table (append is the default mode)
result_df.write_iceberg(mytable)
print(result_df.collect())
Notes
This was just a simple test. Overall the integration is quite convenient; the only catch is that support for newer pyiceberg versions has some issues, which you can patch yourself.
References
https://github.com/apache/iceberg-python/blob/pyiceberg-0.7.1/pyiceberg/table/__init__.py