A quick trial of kedro
This is mainly a simple hands-on learning exercise.
Environment setup
- Install kedro
python -m venv venv
source venv/bin/activate
pip install kedro
- MinIO S3 storage
For easier testing, S3 (via MinIO) is used for data storage; note that the s3fs package also needs to be installed (see below). A docker-compose file for MinIO:
version: "3"
services:
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "9001:9001"
    command: server /data --console-address ":9001"
    environment:
      - MINIO_ACCESS_KEY=minio
      - MINIO_SECRET_KEY=minio123
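To bring MinIO up and create the bucket referenced by the data catalog below (s3://kedro/...), something like the following should work; the mc commands are my own addition and assume the MinIO client is installed (the alias name "local" is arbitrary):
docker-compose up -d
# point the MinIO client at the local instance
mc alias set local http://localhost:9000 minio minio123
# create the bucket used by the data catalog below
mc mb local/kedro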
Initialize the project
A project can be created either with kedro new or with a starter.
- Quick mode
kedro new --name=spaceflights --tools=viz --example=y
- Project structure
The project structure and the code will be explained in a later post.
./spaceflights
├── README.md
├── conf
│   ├── README.md
│   ├── base
│   │   ├── catalog.yml
│   │   ├── parameters.yml
│   │   ├── parameters_data_processing.yml
│   │   ├── parameters_data_science.yml
│   │   └── parameters_reporting.yml
│   └── local
│       └── credentials.yml
├── data
│   ├── 01_raw
│   │   ├── companies.csv
│   │   ├── reviews.csv
│   │   └── shuttles.xlsx
│   ├── 02_intermediate
│   ├── 03_primary
│   ├── 04_feature
│   ├── 05_model_input
│   ├── 06_models
│   ├── 07_model_output
│   └── 08_reporting
├── notebooks
├── pyproject.toml
├── requirements.txt
└── src
    └── spaceflights
        ├── __init__.py
        ├── __main__.py
        ├── pipeline_registry.py
        ├── pipelines
        └── settings.py
Install dependencies
cd spaceflights
pip install -r requirements.txt
Also upload the test data to S3 (this is just the data under spaceflights/data/01_raw in the template project). Note that s3fs also needs to be installed: pip install s3fs
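One possible way to upload it, reusing the mc alias created earlier (paths are illustrative, adjust to your layout):
# copy the raw data from the project into the kedro bucket
mc cp --recursive data/01_raw/ local/kedro/01_raw/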
Modify the data catalog to use S3 paths in conf/base/catalog.yml:
companies:
  type: pandas.CSVDataset
  filepath: s3://kedro/01_raw/companies.csv
  credentials: dev_s3

reviews:
  type: pandas.CSVDataset
  filepath: s3://kedro/01_raw/reviews.csv
  credentials: dev_s3

shuttles:
  type: pandas.ExcelDataset
  filepath: s3://kedro/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl
  credentials: dev_s3

preprocessed_companies:
  type: pandas.ParquetDataset
  filepath: s3://kedro/02_intermediate/preprocessed_companies.pq
  credentials: dev_s3

preprocessed_shuttles:
  type: pandas.ParquetDataset
  filepath: s3://kedro/02_intermediate/preprocessed_shuttles.pq
  credentials: dev_s3

model_input_table:
  type: pandas.ParquetDataset
  filepath: s3://kedro/03_primary/model_input_table.pq
  credentials: dev_s3

regressor:
  type: pickle.PickleDataset
  filepath: s3://kedro/06_models/regressor.pickle
  versioned: true
  credentials: dev_s3

metrics:
  type: tracking.MetricsDataset
  filepath: s3://kedro/09_tracking/metrics.json
  credentials: dev_s3

companies_columns:
  type: tracking.JSONDataset
  filepath: s3://kedro/09_tracking/companies_columns.json
  credentials: dev_s3

shuttle_passenger_capacity_plot_exp:
  type: plotly.PlotlyDataset
  filepath: s3://kedro/08_reporting/shuttle_passenger_capacity_plot_exp.json
  versioned: true
  credentials: dev_s3
  plotly_args:
    type: bar
    fig:
      x: shuttle_type
      y: passenger_capacity
      orientation: h
    layout:
      xaxis_title: Shuttles
      yaxis_title: Average passenger capacity
      title: Shuttle Passenger capacity

shuttle_passenger_capacity_plot_go:
  type: plotly.JSONDataset
  filepath: s3://kedro/08_reporting/shuttle_passenger_capacity_plot_go.json
  credentials: dev_s3
  versioned: true

dummy_confusion_matrix:
  type: matplotlib.MatplotlibWriter
  filepath: s3://kedro/08_reporting/dummy_confusion_matrix.png
  credentials: dev_s3
  versioned: true
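After editing the catalog, running kedro catalog list from the project root is a quick way to check that the entries still resolve (the exact output format depends on the Kedro version):
kedro catalog list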
Configure the access credentials in conf/local/credentials.yml:
dev_s3:
  key: minio
  secret: minio123
  client_kwargs:
    endpoint_url: http://localhost:9000
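A quick sanity check that these credentials and the endpoint work: a minimal sketch (my own addition) using s3fs directly, which is the same fsspec-based access the pandas datasets use under the hood:
import s3fs

# mirror the dev_s3 entry from conf/local/credentials.yml
fs = s3fs.S3FileSystem(
    key="minio",
    secret="minio123",
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)
# should list the raw files uploaded earlier
print(fs.ls("kedro/01_raw"))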
- Run
kedro run -p data_processing
Result
Data in MinIO (a quick read-back check is sketched below)
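As such a check, the intermediate output can be read back from MinIO with pandas; a minimal sketch, assuming the data_processing run above produced preprocessed_companies.pq:
import pandas as pd

# read an intermediate dataset straight from MinIO; storage_options are
# the same s3fs keyword arguments used in conf/local/credentials.yml
df = pd.read_parquet(
    "s3://kedro/02_intermediate/preprocessed_companies.pq",
    storage_options={
        "key": "minio",
        "secret": "minio123",
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    },
)
print(df.head())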
- Packaging
kedro packages the project as a standard Python .whl package, containing the code (pipelines) and the configuration; later use only needs the conf and data directories. Detailed usage will be covered in a later post (a short sketch follows below).
kedro package
Result
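A hedged sketch of using the packaged wheel: the wheel lands in dist/, but the file name and version below are illustrative and the exact flags depend on the Kedro version. As noted above, it is run from a directory that contains the conf and data directories:
# wheel name/version is illustrative
pip install dist/spaceflights-0.1-py3-none-any.whl
# run the packaged pipelines from a directory containing conf/ and data/
python -m spaceflights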
Notes
From hands-on experience, kedro is quite convenient to use; the example project covers both data_processing and data_science pipelines, and the project engineering is solid. It is well worth trying for data-processing projects.
References
https://docs.kedro.org/en/stable/get_started/install.html#installation-prerequisites
https://docs.kedro.org/en/stable/get_started/kedro_concepts.html
https://docs.kedro.org/en/stable/tutorial/package_a_project.html
https://github.com/kedro-org/kedro
https://github.com/kedro-org/kedro-plugins