A quick trial of kedro

This is a simple hands-on trial for learning purposes.

Environment setup

  • Install kedro
python -m venv venv
source venv/bin/activate
pip install kedro
  • MinIO S3 storage

For convenient testing, MinIO is used as S3-compatible storage; note that s3fs also needs to be installed (see below). Start MinIO with `docker-compose up -d` using the following compose file:

version: "3"
services:
  minio: 
    image: minio/minio
    ports:
       - "9000:9000"
       - "9001:9001"
    command: server /data --console-address ":9001"
    environment:
    - MINIO_ACCESS_KEY=minio
    - MINIO_SECRET_KEY=minio123
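The catalog configured later assumes a bucket named `kedro` exists in MinIO. A minimal sketch for creating it idempotently (the helper name `ensure_bucket` is mine; it takes any boto3-style S3 client):

```python
def ensure_bucket(s3_client, name="kedro"):
    """Create the bucket `name` if it does not already exist.

    `s3_client` is any boto3-style S3 client exposing list_buckets()
    and create_bucket(Bucket=...). Returns True when newly created.
    """
    existing = [b["Name"] for b in s3_client.list_buckets().get("Buckets", [])]
    if name not in existing:
        s3_client.create_bucket(Bucket=name)
        return True
    return False
```

With boto3 installed and MinIO running, this would be called as `ensure_bucket(boto3.client("s3", endpoint_url="http://localhost:9000", aws_access_key_id="minio", aws_secret_access_key="minio123"))`.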

Initialize the project

A project can be created with `kedro new`, or from a starter.

  • Quick mode
kedro new --name=spaceflights --tools=viz --example=y
  • Project structure

The project structure and code will be explained in a follow-up post.

./spaceflights
├── README.md
├── conf
│   ├── README.md
│   ├── base
│   │   ├── catalog.yml
│   │   ├── parameters.yml
│   │   ├── parameters_data_processing.yml
│   │   ├── parameters_data_science.yml
│   │   └── parameters_reporting.yml
│   └── local
│       └── credentials.yml
├── data
│   ├── 01_raw
│   │   ├── companies.csv
│   │   ├── reviews.csv
│   │   └── shuttles.xlsx
│   ├── 02_intermediate
│   ├── 03_primary
│   ├── 04_feature
│   ├── 05_model_input
│   ├── 06_models
│   ├── 07_model_output
│   └── 08_reporting
├── notebooks
├── pyproject.toml
├── requirements.txt
└── src
    └── spaceflights
        ├── __init__.py
        ├── __main__.py
        ├── pipeline_registry.py
        ├── pipelines
        └── settings.py

Install dependencies

cd spaceflights
pip install -r requirements.txt

Also upload the test data to S3 (the files under spaceflights/data/01_raw from the template project). Note that s3fs needs to be installed as well: pip install s3fs
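The upload can be scripted; a minimal sketch (the helper name `upload_raw_data` is mine; `fs` is expected to be an s3fs-style filesystem, and MinIO with a `kedro` bucket is assumed to be running):

```python
from pathlib import Path

def upload_raw_data(fs, local_dir, bucket="kedro"):
    """Upload every file under local_dir to s3://<bucket>/01_raw/ via an
    s3fs-style filesystem object exposing put(local_path, remote_path)."""
    uploaded = []
    for path in sorted(Path(local_dir).iterdir()):
        if path.is_file():
            remote = f"{bucket}/01_raw/{path.name}"
            fs.put(str(path), remote)
            uploaded.append(remote)
    return uploaded
```

With s3fs this would be used as `upload_raw_data(s3fs.S3FileSystem(key="minio", secret="minio123", client_kwargs={"endpoint_url": "http://localhost:9000"}), "data/01_raw")`.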


Modify the data catalog to use S3 paths in conf/base/catalog.yml:

companies:
  type: pandas.CSVDataset
  filepath: s3://kedro/01_raw/companies.csv
  credentials: dev_s3
reviews:
  type: pandas.CSVDataset
  filepath: s3://kedro/01_raw/reviews.csv
  credentials: dev_s3
shuttles:
  type: pandas.ExcelDataset
  filepath: s3://kedro/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl
  credentials: dev_s3
preprocessed_companies:
  type: pandas.ParquetDataset
  filepath: s3://kedro/02_intermediate/preprocessed_companies.pq
  credentials: dev_s3
preprocessed_shuttles:
  type: pandas.ParquetDataset
  filepath: s3://kedro/02_intermediate/preprocessed_shuttles.pq
  credentials: dev_s3
model_input_table:
  type: pandas.ParquetDataset
  filepath: s3://kedro/03_primary/model_input_table.pq
  credentials: dev_s3
regressor:
  type: pickle.PickleDataset
  filepath: s3://kedro/06_models/regressor.pickle
  versioned: true
  credentials: dev_s3
metrics:
  type: tracking.MetricsDataset
  filepath: s3://kedro/09_tracking/metrics.json
  credentials: dev_s3
companies_columns:
  type: tracking.JSONDataset
  filepath: s3://kedro/09_tracking/companies_columns.json
  credentials: dev_s3
shuttle_passenger_capacity_plot_exp:
  type: plotly.PlotlyDataset
  filepath: s3://kedro/08_reporting/shuttle_passenger_capacity_plot_exp.json
  versioned: true
  credentials: dev_s3
  plotly_args:
    type: bar
    fig:
      x: shuttle_type
      y: passenger_capacity
      orientation: h
    layout:
      xaxis_title: Shuttles
      yaxis_title: Average passenger capacity
      title: Shuttle Passenger capacity
 
shuttle_passenger_capacity_plot_go:
  type: plotly.JSONDataset
  filepath: s3://kedro/08_reporting/shuttle_passenger_capacity_plot_go.json
  credentials: dev_s3
  versioned: true
 
dummy_confusion_matrix:
  type: matplotlib.MatplotlibWriter
  filepath: s3://kedro/08_reporting/dummy_confusion_matrix.png
  credentials: dev_s3
  versioned: true 
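As a quick sanity check, every `filepath` in the catalog above should point into the MinIO bucket; a small stdlib-only sketch (the helper name `catalog_filepaths` is mine, and the sample embeds a subset of the catalog):

```python
def catalog_filepaths(catalog_text):
    """Extract the value of every `filepath:` line from catalog YAML text."""
    return [
        line.split(":", 1)[1].strip()
        for line in catalog_text.splitlines()
        if line.strip().startswith("filepath:")
    ]

sample = """
companies:
  type: pandas.CSVDataset
  filepath: s3://kedro/01_raw/companies.csv
  credentials: dev_s3
regressor:
  type: pickle.PickleDataset
  filepath: s3://kedro/06_models/regressor.pickle
  versioned: true
  credentials: dev_s3
"""
paths = catalog_filepaths(sample)
assert all(p.startswith("s3://kedro/") for p in paths)
```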

Configure the access credentials in conf/local/credentials.yml:

dev_s3:
    key: minio
    secret: minio123
    client_kwargs:
      endpoint_url : http://localhost:9000
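Conceptually, the `credentials: dev_s3` reference in each catalog entry is resolved against conf/local/credentials.yml before the dataset is constructed. A simplified illustration of that resolution (not Kedro's actual code):

```python
def resolve_credentials(dataset_cfg, credentials):
    """Replace a credentials reference in a dataset config with real values."""
    cfg = dict(dataset_cfg)  # copy so the original config stays untouched
    ref = cfg.pop("credentials", None)
    if ref is not None:
        cfg["credentials"] = credentials[ref]
    return cfg

dataset_cfg = {
    "type": "pandas.CSVDataset",
    "filepath": "s3://kedro/01_raw/companies.csv",
    "credentials": "dev_s3",
}
credentials = {
    "dev_s3": {
        "key": "minio",
        "secret": "minio123",
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    }
}
resolved = resolve_credentials(dataset_cfg, credentials)
```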
  • Run
kedro run  -p data_processing  
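What `kedro run` does, conceptually, is execute nodes in dependency order, pulling inputs from and writing outputs to the catalog. A toy stand-in (plain Python, not Kedro's actual implementation; node functions and dataset names are invented for illustration):

```python
def run_pipeline(nodes, catalog):
    """Run (func, inputs, output) nodes in dependency order against a dict catalog."""
    remaining = list(nodes)
    while remaining:
        for n in list(remaining):
            func, inputs, output = n
            if all(i in catalog for i in inputs):  # node is ready to run
                catalog[output] = func(*[catalog[i] for i in inputs])
                remaining.remove(n)
                break
        else:
            raise ValueError(f"unresolvable inputs for nodes: {remaining}")
    return catalog

# toy "data processing" nodes loosely mirroring the spaceflights example
def preprocess_companies(companies):
    return [c.upper() for c in companies]

def preprocess_shuttles(shuttles):
    return [s.strip() for s in shuttles]

def create_model_input(companies, shuttles):
    return list(zip(companies, shuttles))

nodes = [
    (create_model_input, ["preprocessed_companies", "preprocessed_shuttles"], "model_input_table"),
    (preprocess_companies, ["companies"], "preprocessed_companies"),
    (preprocess_shuttles, ["shuttles"], "preprocessed_shuttles"),
]
catalog = {"companies": ["acme", "spacey"], "shuttles": [" falcon ", " apollo "]}
result = run_pipeline(nodes, catalog)
```

Note that the first node in the list cannot run until the other two have produced its inputs; the runner resolves that ordering automatically, just as Kedro derives execution order from each node's declared inputs and outputs.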

Result (screenshot omitted)

MinIO data (screenshot omitted)

  • Package

kedro packages the project as a standard Python wheel (whl) containing the code (pipelines) and the configuration; to run the packaged project afterwards, only the conf and data directories are needed. Usage details will be covered in a follow-up post.

kedro package

Result (screenshot omitted)

Notes

In practice kedro is quite convenient to use; it covers both data processing and data science, and the project engineering is solid. It is well worth trying for data-processing projects.

References

https://docs.kedro.org/en/stable/get_started/install.html#installation-prerequisites
https://docs.kedro.org/en/stable/get_started/kedro_concepts.html
https://docs.kedro.org/en/stable/tutorial/package_a_project.html
https://github.com/kedro-org/kedro
https://github.com/kedro-org/kedro-plugins

posted on 2024-09-20 06:02 by 荣锋亮