kedro package: internals of running a packaged project
`kedro package` builds the data pipeline project you developed into a standard Python wheel (via the `build` module). Once the wheel is installed, the pipeline can be run directly as a project module. The notes below briefly walk through the internal processing.
Project structure
To make a kedro pipeline project runnable directly as a module, the kedro starter ships a __main__.py file, so the project can be executed with `python -m <package>`. For convenience on the command line, kedro also registers console scripts in pyproject.toml.
- Reference project layout
./spaceflights
├── README.md
├── conf
│   ├── README.md
│   ├── base
│   │   ├── catalog.yml
│   │   ├── parameters.yml
│   │   ├── parameters_data_processing.yml
│   │   ├── parameters_data_science.yml
│   │   └── parameters_reporting.yml
│   └── local
│       └── credentials.yml
├── data
│   ├── 01_raw
│   │   ├── companies.csv
│   │   ├── reviews.csv
│   │   └── shuttles.xlsx
│   ├── 02_intermediate
│   ├── 03_primary
│   ├── 04_feature
│   ├── 05_model_input
│   ├── 06_models
│   ├── 07_model_output
│   └── 08_reporting
├── notebooks
├── pyproject.toml
├── requirements.txt
└── src
    └── spaceflights
        ├── __init__.py
        ├── __main__.py  # required so the project can be run via python -m <projectname>
        ├── pipeline_registry.py
        ├── pipelines
        └── settings.py
- CLI mode
pyproject.toml defines the console-script entry point:
[project.scripts]
spaceflights = "spaceflights.__main__:main"
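When the wheel is installed, the installer generates a `spaceflights` executable from this entry. A minimal sketch of how the `module:attr` spec on the right-hand side is interpreted (the actual wrapper script is generated by pip; this only illustrates the resolution):

```python
# The installer turns "spaceflights.__main__:main" into a wrapper that
# imports the module and calls the attribute, roughly equivalent to:
#   mod = importlib.import_module(module_name); sys.exit(getattr(mod, attr)())
spec = "spaceflights.__main__:main"
module_name, _, attr = spec.partition(":")
print(module_name)  # -> spaceflights.__main__  (module to import)
print(attr)         # -> main                   (callable to invoke)
```

So `spaceflights` on the command line and `python -m spaceflights` both end up in the same `main()` function.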
Internal processing
The core is __main__.py: it configures the project (settings), discovers the pipelines, and hands off to the kedro CLI run command (actual execution is then delegated to one of the runners).
- __main__.py
"""spaceflights file for ensuring the package is executable
as `spaceflights` and `python -m spaceflights`
"""
from pathlib import Path
from kedro.framework.cli.utils import find_run_command
from kedro.framework.project import configure_project
def main(*args, **kwargs):
package_name = Path(__file__).parent.name
# 项目配置
configure_project(package_name)
# 运行项目
run = find_run_command(package_name)
run(*args, **kwargs)
if __name__ == "__main__":
main()
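Note how `Path(__file__).parent.name` derives the package name from the directory that contains __main__.py, so the same file works regardless of what the project is called. A small illustration (the install path here is hypothetical):

```python
from pathlib import Path

# hypothetical installed location of the project's __main__.py
main_file = "/site-packages/spaceflights/__main__.py"

# the parent directory's name is the package name
package_name = Path(main_file).parent.name
print(package_name)  # -> spaceflights
```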
- configure_project: configuration handling
Note that the configuration handled here is not the OmegaConf-based config kedro is known for, but settings management built on dynaconf; OmegaConf only comes into play later, during the actual pipeline run. The main job here is resolving how conf and data are handled at runtime, combined with the options defined in the project's settings.py.
def configure_project(package_name: str) -> None:
    """Configure a Kedro project by populating its settings with values
    defined in user's settings.py and pipeline_registry.py.
    """
    settings_module = f"{package_name}.settings"
    settings.configure(settings_module)
    pipelines_module = f"{package_name}.pipeline_registry"
    pipelines.configure(pipelines_module)

    # Once the project is successfully configured once, store PACKAGE_NAME as a
    # global variable to make it easily accessible. This is used by validate_settings()
    # below, and also by ParallelRunner on Windows, as package_name is required every
    # time a new subprocess is spawned.
    global PACKAGE_NAME  # noqa: PLW0603
    PACKAGE_NAME = package_name

    if PACKAGE_NAME:
        LOGGING.set_project_logging(PACKAGE_NAME)
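At its core, `settings.configure(settings_module)` imports the project's settings.py by its dotted name and reads attributes from it, falling back to framework defaults for anything the user did not set. A minimal sketch of that import-and-read pattern (the `demo_project.settings` module below is fabricated for illustration; it stands in for a real `spaceflights/settings.py`):

```python
import importlib
import sys
import types

# fabricate a stand-in for <package>.settings (a real project has this file)
settings_stub = types.ModuleType("demo_project.settings")
settings_stub.CONF_SOURCE = "conf"
sys.modules["demo_project.settings"] = settings_stub

# what settings.configure(...) does conceptually: import by dotted name,
# then read attributes, with defaults for anything the user did not define
settings_module = importlib.import_module("demo_project.settings")
conf_source = getattr(settings_module, "CONF_SOURCE", "conf")
context_class = getattr(settings_module, "CONTEXT_CLASS", "KedroContext")
print(conf_source, context_class)  # -> conf KedroContext
```

The dynaconf-based `settings` object adds validation and lazy resolution on top of this basic mechanism.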
- find_run_command processing
def find_run_command(package_name: str) -> Callable:
    """Find the run command to be executed.

    This is either the default run command defined in the Kedro framework or a
    run command defined by an installed plugin.

    Args:
        package_name: The name of the package being run.

    Raises:
        KedroCliError: If the run command is not found.

    Returns:
        Run command to be executed.
    """
    try:
        # first look for a project-level cli module
        project_cli = importlib.import_module(f"{package_name}.cli")
        # fail gracefully if cli.py does not exist
    except ModuleNotFoundError as exc:
        if f"{package_name}.cli" not in str(exc):
            raise
        # load the run commands registered via entry_points
        plugins = load_entry_points("project")
        run = _find_run_command_in_plugins(plugins) if plugins else None
        if run:
            # use run command from installed plugin if it exists
            return run  # type: ignore[no-any-return]
        # nothing found: fall back to the default run command
        # in `kedro.framework.cli.project`
        from kedro.framework.cli.project import run

        return run  # type: ignore[no-any-return]
    # fail badly if cli.py exists, but has no `cli` in it
    if not hasattr(project_cli, "cli"):
        raise KedroCliError(f"Cannot load commands from {package_name}.cli")
    return project_cli.run  # type: ignore[no-any-return]
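The lookup order above condenses to three steps: a project-level `<package>.cli` wins, then a plugin-provided run command, then the framework default. A toy re-implementation of just that ordering, over plain dictionaries instead of importable modules (all names here are illustrative, not kedro APIs):

```python
def find_run(package_name, modules, plugin_runs, default_run):
    """Toy version of find_run_command's lookup order."""
    cli_mod = modules.get(f"{package_name}.cli")  # 1) project-level cli module
    if cli_mod is None:
        if plugin_runs:                           # 2) plugin-provided run
            return plugin_runs[0]
        return default_run                        # 3) framework default
    if "cli" not in cli_mod:
        # mirrors the KedroCliError raised when cli.py lacks a `cli` object
        raise RuntimeError(f"Cannot load commands from {package_name}.cli")
    return cli_mod["run"]

assert find_run("p", {}, [], "default") == "default"
assert find_run("p", {}, ["plugin"], "default") == "plugin"
assert find_run("p", {"p.cli": {"cli": object(), "run": "project"}}, [], "default") == "project"
print("lookup order verified")
```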
The entry-point groups searched by default:
ENTRY_POINT_GROUPS = {
    "global": "kedro.global_commands",
    "project": "kedro.project_commands",
    "init": "kedro.init",
    "line_magic": "kedro.line_magic",
    "hooks": "kedro.hooks",
    "cli_hooks": "kedro.cli_hooks",
    "starters": "kedro.starters",
}
Processing in the project run command
def run(  # noqa: PLR0913
    tags: str,
    env: str,
    runner: str,
    is_async: bool,
    node_names: str,
    to_nodes: str,
    from_nodes: str,
    from_inputs: str,
    to_outputs: str,
    load_versions: dict[str, str] | None,
    pipeline: str,
    config: str,
    conf_source: str,
    params: dict[str, Any],
    namespace: str,
) -> None:
    """Run the pipeline."""
    runner_obj = load_obj(runner or "SequentialRunner", "kedro.runner")
    tuple_tags = tuple(tags)
    tuple_node_names = tuple(node_names)

    # create a session; the actual execution happens in session.run
    with KedroSession.create(
        env=env, conf_source=conf_source, extra_params=params
    ) as session:
        session.run(
            tags=tuple_tags,
            runner=runner_obj(is_async=is_async),
            node_names=tuple_node_names,
            from_nodes=from_nodes,
            to_nodes=to_nodes,
            from_inputs=from_inputs,
            to_outputs=to_outputs,
            load_versions=load_versions,
            pipeline_name=pipeline,
            namespace=namespace,
        )
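The `load_obj(runner or "SequentialRunner", "kedro.runner")` call resolves a dotted or bare name to a Python object, prefixing the default package when only a class name is given. A small re-implementation of that resolution, demonstrated with a stdlib object since kedro may not be installed here (this mimics kedro's utility but is only a sketch):

```python
import importlib

def load_obj(obj_path, default_obj_path=""):
    """Resolve 'pkg.mod.Name' (or bare 'Name' + default prefix) to an object."""
    dotted = obj_path if "." in obj_path else f"{default_obj_path}.{obj_path}"
    module_path, _, obj_name = dotted.rpartition(".")
    return getattr(importlib.import_module(module_path), obj_name)

# e.g. "SequentialRunner" + "kedro.runner" -> kedro.runner.SequentialRunner;
# demonstrated with the stdlib so the sketch runs without kedro installed:
cls = load_obj("OrderedDict", "collections")
print(cls.__name__)  # -> OrderedDict
```

This is why `--runner=ParallelRunner` on the CLI is enough to swap the execution strategy: the string is resolved to a class under `kedro.runner` and instantiated before being handed to `session.run`.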
Notes
The official docs include a reference architecture diagram; reading that diagram alongside the project structure and the code above makes the run mechanism much clearer.
References
https://docs.kedro.org/en/0.19.5/tutorial/package_a_project.html
https://docs.kedro.org/en/stable/kedro_project_setup/settings.html
https://docs.kedro.org/en/stable/extend_kedro/architecture_overview.html
https://python-poetry.org/docs/pyproject/#scripts