How a project built with kedro package runs internally

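kedro package builds a data pipeline project into a standard Python wheel (.whl) using the build module. Once the wheel is installed, the pipeline can be run directly through the project's module. This post briefly walks through what happens internally.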

Project structure

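So that a Kedro pipeline project can be run directly as a module, the Kedro starter includes a __main__.py file, which is what makes python -m <projectname> work. For convenience on the command line, Kedro also defines a scripts entry so the project can be invoked as a CLI command.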
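To make the role of __main__.py concrete, here is a minimal sketch that uses only the standard library. It builds a throwaway package called demo_project (a made-up name, not part of Kedro) and runs it the way python -m demo_project would, through runpy.run_module. The __main__.py it writes uses the same Path(__file__).parent.name trick as the Kedro template to work out the package name.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; "demo_project" is a made-up name.
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_project"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")
(package_dir / "__main__.py").write_text(
    "from pathlib import Path\n"
    "\n"
    "def main():\n"
    "    # The package name is simply the name of the containing folder.\n"
    "    return Path(__file__).parent.name\n"
    "\n"
    "if __name__ == '__main__':\n"
    "    RESULT = main()\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Equivalent in spirit to `python -m demo_project`: Python locates
# demo_project/__main__.py and executes it with __name__ == "__main__".
namespace = runpy.run_module("demo_project", run_name="__main__")
print(namespace["RESULT"])  # prints "demo_project"
```

Without a __main__.py in the package, the same call fails because Python has nothing to execute for the package itself.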

  • Reference code layout
./spaceflights
├── README.md
├── conf
│   ├── README.md
│   ├── base
│   │   ├── catalog.yml
│   │   ├── parameters.yml
│   │   ├── parameters_data_processing.yml
│   │   ├── parameters_data_science.yml
│   │   └── parameters_reporting.yml
│   └── local
│       └── credentials.yml
├── data
│   ├── 01_raw
│   │   ├── companies.csv
│   │   ├── reviews.csv
│   │   └── shuttles.xlsx
│   ├── 02_intermediate
│   ├── 03_primary
│   ├── 04_feature
│   ├── 05_model_input
│   ├── 06_models
│   ├── 07_model_output
│   └── 08_reporting
├── notebooks
├── pyproject.toml
├── requirements.txt
└── src
    └── spaceflights
        ├── __init__.py
        ├── __main__.py  # required so the project can be run with python -m <projectname>
        ├── pipeline_registry.py
        ├── pipelines
        └── settings.py
  • CLI mode
    The command is defined in pyproject.toml
[project.scripts]
spaceflights = "spaceflights.__main__:main"
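A [project.scripts] value has the form module:function. Once the wheel is installed, the spaceflights command imports spaceflights.__main__ and calls main. The sketch below shows that resolution step with importlib.metadata.EntryPoint from the standard library. Because the spaceflights package may not be installed where this runs, it points at json:dumps purely as a stand-in target (an assumption for illustration only), and it assumes Python 3.9 or later.

```python
from importlib.metadata import EntryPoint

# A [project.scripts] line such as
#     spaceflights = "spaceflights.__main__:main"
# is recorded as an entry point in the "console_scripts" group.
# "json:dumps" is a stand-in target so the example runs without the project.
entry = EntryPoint(name="demo-command", value="json:dumps", group="console_scripts")

print(entry.module)  # module part, before the colon
print(entry.attr)    # attribute part, after the colon

# load() imports the module and returns the named attribute, which is
# what the generated command-line wrapper ends up calling.
func = entry.load()
print(func({"ok": True}))
```

For the real project, the module part would be spaceflights.__main__ and the attribute would be main.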

Internals

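The core is __main__.py: it configures the project (settings and pipeline discovery) and then runs the pipeline through the Kedro framework's CLI run command (the actual execution is carried out by one of the runners).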

  • __main__.py
"""spaceflights file for ensuring the package is executable
as `spaceflights` and `python -m spaceflights`
"""
from pathlib import Path
 
from kedro.framework.cli.utils import find_run_command
from kedro.framework.project import configure_project
 
 
def main(*args, **kwargs):
    package_name = Path(__file__).parent.name
    # configure the project
    configure_project(package_name)
    # run the project
    run = find_run_command(package_name)
    run(*args, **kwargs)
 
if __name__ == "__main__":
    main()
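  • configure_project: configuration handling
    The settings handled here are not managed with omegaconf, which Kedro uses for loading project configuration, but with dynaconf; omegaconf only comes into play while the pipeline is actually running.
    This step mainly governs how conf and data are handled at run time, and it takes into account whatever the project's settings.py defines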
def configure_project(package_name: str) -> None:
    """Configure a Kedro project by populating its settings with values
    defined in user's settings.py and pipeline_registry.py.
    """
    settings_module = f"{package_name}.settings"
    settings.configure(settings_module)
 
    pipelines_module = f"{package_name}.pipeline_registry"
    pipelines.configure(pipelines_module)
 
    # Once the project is successfully configured once, store PACKAGE_NAME as a
    # global variable to make it easily accessible. This is used by validate_settings()
    # below, and also by ParallelRunner on Windows, as package_name is required every
    # time a new subprocess is spawned.
    global PACKAGE_NAME  # noqa: PLW0603
    PACKAGE_NAME = package_name
 
    if PACKAGE_NAME:
        LOGGING.set_project_logging(PACKAGE_NAME)
  • find_run_command: locating the run command
def find_run_command(package_name: str) -> Callable:
    """Find the run command to be executed.
       This is either the default run command defined in the Kedro framework or a run command defined by
       an installed plugin.
 
    Args:
        package_name: The name of the package being run.
 
    Raises:
        KedroCliError: If the run command is not found.
 
    Returns:
        Run command to be executed.
    """
    try:
        # first look for a project-level cli module
        project_cli = importlib.import_module(f"{package_name}.cli")
        # fail gracefully if cli.py does not exist
    except ModuleNotFoundError as exc:
        if f"{package_name}.cli" not in str(exc):
            raise
        # load the commands registered under the "project" entry point group
        plugins = load_entry_points("project")
        run = _find_run_command_in_plugins(plugins) if plugins else None
        if run:
            # use run command from installed plugin if it exists
            return run  # type: ignore[no-any-return]
        # no plugin provides a run command: fall back to the default one
        # from `kedro.framework.cli.project`
        from kedro.framework.cli.project import run
 
        return run  # type: ignore[no-any-return]
    # fail badly if cli.py exists, but has no `cli` in it
    if not hasattr(project_cli, "cli"):
        raise KedroCliError(f"Cannot load commands from {package_name}.cli")
    return project_cli.run  # type: ignore[no-any-return]
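The lookup order above (project-level cli.py, then plugins, then the framework default) can be sketched without Kedro installed. The helper resolve_run below is a hypothetical stand-in, not Kedro's API, and the three demo_* packages are throwaway names created only for this example. It also shows why the exception message is inspected: a missing cli.py is tolerated, while a missing dependency inside an existing cli.py is re-raised.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def resolve_run(package_name, plugin_run=None, default_run=None):
    """Hypothetical stand-in for find_run_command; not Kedro's API."""
    cli_module = f"{package_name}.cli"
    try:
        project_cli = importlib.import_module(cli_module)
    except ModuleNotFoundError as exc:
        # Only swallow the error when cli.py itself is missing; a missing
        # dependency inside cli.py must still surface.
        if cli_module not in str(exc):
            raise
        return plugin_run or default_run
    if not hasattr(project_cli, "cli"):
        raise RuntimeError(f"Cannot load commands from {cli_module}")
    return project_cli.run


def make_package(root, name, cli_source=None):
    """Create a throwaway package, optionally with a cli.py."""
    package_dir = root / name
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    if cli_source is not None:
        (package_dir / "cli.py").write_text(cli_source)


root = Path(tempfile.mkdtemp())
make_package(root, "demo_no_cli")
make_package(
    root, "demo_with_cli", "cli = object()\ndef run():\n    return 'project run'\n"
)
make_package(root, "demo_broken_cli", "import a_module_that_does_not_exist_xyz\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()


def default_run():
    return "default run"


def plugin_run():
    return "plugin run"


print(resolve_run("demo_no_cli", default_run=default_run)())
print(resolve_run("demo_no_cli", plugin_run=plugin_run, default_run=default_run)())
print(resolve_run("demo_with_cli", default_run=default_run)())

try:
    resolve_run("demo_broken_cli", default_run=default_run)
except ModuleNotFoundError as exc:
    print("re-raised:", exc.name)
```

The first three calls return the default, plugin, and project-level run functions respectively; the last one surfaces the import error instead of silently falling back.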

Entry point groups searched by default

ENTRY_POINT_GROUPS = {
    "global": "kedro.global_commands",
    "project": "kedro.project_commands",
    "init": "kedro.init",
    "line_magic": "kedro.line_magic",
    "hooks": "kedro.hooks",
    "cli_hooks": "kedro.cli_hooks",
    "starters": "kedro.starters",
}

What the project run command does

def run(  # noqa: PLR0913
    tags: str,
    env: str,
    runner: str,
    is_async: bool,
    node_names: str,
    to_nodes: str,
    from_nodes: str,
    from_inputs: str,
    to_outputs: str,
    load_versions: dict[str, str] | None,
    pipeline: str,
    config: str,
    conf_source: str,
    params: dict[str, Any],
    namespace: str,
) -> None:
    """Run the pipeline."""
 
    runner_obj = load_obj(runner or "SequentialRunner", "kedro.runner")
    tuple_tags = tuple(tags)
    tuple_node_names = tuple(node_names)
    # create a session; the actual work is done by session.run
    with KedroSession.create(
        env=env, conf_source=conf_source, extra_params=params
    ) as session:
        session.run(
            tags=tuple_tags,
            runner=runner_obj(is_async=is_async),
            node_names=tuple_node_names,
            from_nodes=from_nodes,
            to_nodes=to_nodes,
            from_inputs=from_inputs,
            to_outputs=to_outputs,
            load_versions=load_versions,
            pipeline_name=pipeline,
            namespace=namespace,
        )
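The call load_obj(runner or "SequentialRunner", "kedro.runner") turns a runner name into a class: a bare name is looked up in the default module, while a dotted path names its own module. The function below is an approximation of that behaviour written for illustration, not Kedro's actual code, and it uses standard-library names as stand-ins so it runs without Kedro.

```python
import importlib
from typing import Any


def load_obj_sketch(obj_path: str, default_module: str) -> Any:
    """Resolve a name to an object; an approximation, not Kedro's code."""
    module_path, _, obj_name = obj_path.rpartition(".")
    # A bare name such as "SequentialRunner" has no module part,
    # so it is looked up in the default module instead.
    module = importlib.import_module(module_path or default_module)
    return getattr(module, obj_name)


# Standard-library stand-ins, used so the example runs without Kedro:
# a bare name falls back to the default module ...
print(load_obj_sketch("OrderedDict", "collections"))
# ... while a dotted path names its own module.
print(load_obj_sketch("pathlib.PurePath", "collections"))
```

In the run command, the resolved class is then instantiated with is_async and handed to session.run as the runner.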

Notes

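The official documentation includes an architecture overview diagram. Reading it alongside the project structure and the code above makes the run mechanism much clearer.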

References

https://docs.kedro.org/en/0.19.5/tutorial/package_a_project.html
https://docs.kedro.org/en/stable/kedro_project_setup/settings.html
https://docs.kedro.org/en/stable/extend_kedro/architecture_overview.html
https://python-poetry.org/docs/pyproject/#scripts

posted on 2024-09-26 08:00  荣锋亮