A brief introduction to Kedro's PartitionedDataset

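Kedro's PartitionedDataset is a dataset type that treats every file under a common path as one partition. It can load data partition by partition and write data back out as partitions. A brief walkthrough of both follows.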

Reading partitioned data

  • Example catalog configuration
companies:
  type: partitions.PartitionedDataset
  path: s3://kedro/01_raw/companies
  credentials: dev_s3
  dataset: pandas.CSVDataset
  filename_suffix: '.csv'
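  • Using it in a node
from typing import Any, Callable, Dict, Tuple

import pandas as pd


def preprocess_companies(
    companies: Dict[str, Callable[[], pd.DataFrame]], parameters: Dict[str, Any]
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    print(parameters)
    print(companies)
    # PartitionedDataset passes a dict of partition ID -> load function;
    # a partition is only read when its function is called.
    frames = []
    for key, partition_data_func in companies.items():
        print(key)
        frames.append(partition_data_func())
    # Concatenate once instead of growing a DataFrame inside the loop.
    combine_all = pd.concat(frames, ignore_index=True, sort=True)
    # _is_true and _parse_percentage are assumed to be defined elsewhere in the project.
    combine_all["iata_approved"] = _is_true(combine_all["iata_approved"])
    combine_all["company_rating"] = _parse_percentage(combine_all["company_rating"])
    return combine_all, {"columns": combine_all.columns.tolist(), "data_type": "companies"}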
  • Pipeline definition
node(
    func=preprocess_companies,
    inputs=["companies","parameters"],
    outputs=["preprocessed_companies", "companies_columns"],
    name="preprocess_companies_node",
),
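  • Result: the node prints each partition key and returns all partitions combined into a single DataFrame.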
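
To make the loading contract easier to see without an S3 bucket, here is a minimal local sketch. It is not Kedro's implementation: `list_partitions` and `combine` are hypothetical helpers, and the temporary directory stands in for the `path` in the catalog. The sketch assumes the partition ID is the file path relative to the base directory with the filename suffix stripped, and that each value is a zero-argument function that reads the file only when called.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict

import pandas as pd


def list_partitions(base: Path, suffix: str) -> Dict[str, Callable[[], pd.DataFrame]]:
    """Hypothetical stand-in for the dict a node receives from a partitioned dataset."""
    partitions: Dict[str, Callable[[], pd.DataFrame]] = {}
    for file in sorted(base.rglob(f"*{suffix}")):
        relative = file.relative_to(base).as_posix()
        # Partition ID: relative path with the filename suffix stripped.
        partition_id = relative[: len(relative) - len(suffix)]
        # Bind `file` as a default argument so each loader keeps its own path.
        partitions[partition_id] = lambda f=file: pd.read_csv(f)
    return partitions


def combine(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    # Nothing is read from disk until each load function is called here.
    # Sorting the keys keeps the row order deterministic.
    frames = [load() for _, load in sorted(partitions.items())]
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    pd.DataFrame({"id": [1, 2]}).to_csv(base / "part_a.csv", index=False)
    pd.DataFrame({"id": [3]}).to_csv(base / "part_b.csv", index=False)

    partitions = list_partitions(base, ".csv")
    partition_ids = sorted(partitions)
    combined = combine(partitions)
```

Because the values are functions rather than DataFrames, a node can also skip or filter partitions by key before loading anything.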

Writing partitioned data

  • Catalog definition
# Read
companiesv3:
  type: pandas.CSVDataset
  filepath: s3://kedro/01_raw/companies.csv
  credentials: dev_s3
# Write, partitioned by the iata_approved column
companiesv4:
  type: partitions.PartitionedDataset
  path: s3://kedro/01_raw/companiesv2
  credentials: dev_s3
  dataset: pandas.CSVDataset
  filename_suffix: '.csv'
  • Using it in a node
def preprocess_companiesv2(companiesv3: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    print(companiesv3.head())
    # Each dict key becomes a partition ID; the value is the data saved for that partition.
    parts = {}
    for item in companiesv3["iata_approved"].unique():
        parts[f"item-{item}"] = companiesv3[companiesv3["iata_approved"] == item]
    return parts
  • Pipeline definition
node(
    func=preprocess_companiesv2,
    inputs=["companiesv3"],
    outputs="companiesv4",
    name="preprocess_companies_nodev2",
),
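  • Result: one CSV file per distinct iata_approved value, named item-<value>.csv, is written under s3://kedro/01_raw/companiesv2.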
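
The splitting step can also be written with `groupby`, which visits each distinct value once. The sketch below is one possible variant, not the original node: `split_by_column` and `split_lazily` are hypothetical names, and the sample data assumes `iata_approved` holds the raw strings "t" and "f". The second function returns functions instead of DataFrames, on the assumption (based on the lazy saving behaviour described in the Kedro documentation) that PartitionedDataset calls a callable value at save time.

```python
from typing import Callable, Dict

import pandas as pd


def split_by_column(df: pd.DataFrame, column: str) -> Dict[str, pd.DataFrame]:
    # groupby skips rows whose key is NaN by default.
    return {
        f"item-{value}": group.reset_index(drop=True)
        for value, group in df.groupby(column)
    }


def split_lazily(df: pd.DataFrame, column: str) -> Dict[str, Callable[[], pd.DataFrame]]:
    # Each value is a zero-argument function, so a partition is only
    # materialised when the function is called.
    return {
        f"item-{value}": (lambda v=value: df[df[column] == v].reset_index(drop=True))
        for value in df[column].dropna().unique()
    }


# Made-up sample rows for illustration only.
companies = pd.DataFrame({"id": [1, 2, 3], "iata_approved": ["t", "f", "t"]})

parts = split_by_column(companies, "iata_approved")
lazy_parts = split_lazily(companies, "iata_approved")
```

With `filename_suffix: '.csv'` in the catalog, the keys `item-t` and `item-f` would map to `item-t.csv` and `item-f.csv` under the dataset's `path`.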

Notes

Kedro's PartitionedDataset is useful both for loading many files in one go and for writing data out in partitions, and it is well worth trying.

References

https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.0.0.post1/api/kedro_datasets.partitions.PartitionedDataset.html#kedro_datasets.partitions.PartitionedDataset

posted on 2024-09-30 08:00 by 荣锋亮