A quick look at kedro's PartitionedDataset
kedro's PartitionedDataset is a fairly powerful dataset module: it can load data that is split across multiple partitions and write results back out as partitions. A short walkthrough of both capabilities follows.
Partitioned loading
- Reference catalog configuration
companies:
  type: partitions.PartitionedDataset
  path: s3://kedro/01_raw/companies
  credentials: dev_s3
  dataset: pandas.CSVDataset
  filename_suffix: '.csv'
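With this entry, every object under path whose name ends in filename_suffix becomes one partition, and load() returns a dictionary mapping each partition id (the relative path minus the suffix) to a callable that reads just that file. A minimal sketch of inspecting the partitions programmatically; the local path below is purely illustrative, the catalog entry above targets S3:

from kedro_datasets.partitions import PartitionedDataset

# Illustrative local copy of the data; the catalog entry above points at S3
dataset = PartitionedDataset(
    path="data/01_raw/companies",
    dataset="pandas.CSVDataset",
    filename_suffix=".csv",
)
partitions = dataset.load()  # dict: partition id -> load callable
for partition_id, load_func in sorted(partitions.items()):
    df = load_func()  # the CSV is only read when the callable is invoked
    print(partition_id, df.shape)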
- Node usage
import pandas as pd
from typing import Callable, Dict, Tuple

def preprocess_companies(companies: Dict[str, Callable[[], pd.DataFrame]], parameters: Dict) -> Tuple[pd.DataFrame, Dict]:
    print(parameters)
    print(companies)
    combine_all = pd.DataFrame()
    for key, partition_data_func in companies.items():
        print(key)
        # each value is a load callable, not a DataFrame: call it to read the partition
        partition_data = partition_data_func()
        combine_all = pd.concat([combine_all, partition_data], ignore_index=True, sort=True)
    # _is_true and _parse_percentage are the helpers from the spaceflights tutorial
    combine_all["iata_approved"] = _is_true(combine_all["iata_approved"])
    combine_all["company_rating"] = _parse_percentage(combine_all["company_rating"])
    return combine_all, {"columns": combine_all.columns.tolist(), "data_type": "companies"}
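Because the values are callables, the node can be smoke-tested without any catalog by handing it plain lambdas. A hypothetical check (it assumes the spaceflights helpers _is_true and _parse_percentage are defined in the same module; the partition ids and data are made up):

import pandas as pd

# Fake partitions dict mimicking what PartitionedDataset.load() hands the node
fake_partitions = {
    "part-a": lambda: pd.DataFrame({"iata_approved": ["t"], "company_rating": ["90%"]}),
    "part-b": lambda: pd.DataFrame({"iata_approved": ["f"], "company_rating": ["80%"]}),
}
df, meta = preprocess_companies(fake_partitions, parameters={})
print(df)
print(meta)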
- Pipeline definition
node(
    func=preprocess_companies,
    inputs=["companies", "parameters"],
    outputs=["preprocessed_companies", "companies_columns"],
    name="preprocess_companies_node",
),
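In the standard spaceflights layout this node sits inside the pipeline factory; a sketch of the surrounding create_pipeline, assuming Kedro 0.19-style imports:

from kedro.pipeline import node, pipeline

def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs=["companies", "parameters"],
                outputs=["preprocessed_companies", "companies_columns"],
                name="preprocess_companies_node",
            ),
        ]
    )

The node can then be run on its own with kedro run --nodes=preprocess_companies_node (older Kedro versions spell the flag --node).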
- Result
Partitioned writing
- Catalog definition
# source dataset (read side)
companiesv3:
  type: pandas.CSVDataset
  filepath: s3://kedro/01_raw/companies.csv
  credentials: dev_s3
# write side, partitioned on the iata_approved key
companiesv4:
  type: partitions.PartitionedDataset
  path: s3://kedro/01_raw/companiesv2
  credentials: dev_s3
  dataset: pandas.CSVDataset
  filename_suffix: '.csv'
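One option worth knowing for the write side: PartitionedDataset accepts an overwrite flag (overwrite: true in the catalog) that deletes existing partitions before each save, so files from earlier runs do not pile up next to new ones. A sketch of the same dataset built in code (credentials omitted; the catalog entry supplies dev_s3):

from kedro_datasets.partitions import PartitionedDataset

# overwrite=True clears previously written partitions before each save
companiesv4 = PartitionedDataset(
    path="s3://kedro/01_raw/companiesv2",
    dataset="pandas.CSVDataset",
    filename_suffix=".csv",
    overwrite=True,
)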
- Node usage
import pandas as pd
from typing import Dict

def preprocess_companiesv2(companiesv3: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    print(companiesv3.head())
    parts = {}
    # one output partition per distinct iata_approved value; each dict key
    # becomes a file name under `path`, with filename_suffix appended
    for item in companiesv3["iata_approved"].unique():
        parts[f"item-{item}"] = companiesv3[companiesv3["iata_approved"] == item]
    return parts
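Returning plain DataFrames works, but PartitionedDataset also supports lazy saving: when a dictionary value is a callable, it is only invoked at save time, so partitions are materialised one at a time instead of all being held in memory together. A hypothetical lazy variant of the node above (the name preprocess_companiesv2_lazy is made up):

import pandas as pd
from typing import Callable, Dict

def preprocess_companiesv2_lazy(companiesv3: pd.DataFrame) -> Dict[str, Callable[[], pd.DataFrame]]:
    parts = {}
    for item in companiesv3["iata_approved"].unique():
        # bind `item` as a default argument so each closure keeps its own value
        parts[f"item-{item}"] = lambda item=item: companiesv3[companiesv3["iata_approved"] == item]
    return parts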
- Pipeline definition
node(
    func=preprocess_companiesv2,
    inputs=["companiesv3"],
    outputs="companiesv4",
    name="preprocess_companies_nodev2",
),
- Result
Notes
kedro's PartitionedDataset is quite handy for both bulk loading of data and writing data out in partitions; it is well worth trying.