kedro package: how a packaged project runs internally

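kedro package builds a data pipeline project into a standard Python wheel (.whl) package, using the build module. Once the wheel is installed, the pipeline can be run directly through the project module. The notes below walk briefly through what happens internally.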

Project structure

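So that a Kedro pipeline project can be run directly as a module, the Kedro starter ships a __main__.py file; that file is what makes python -m <projectname> work. For convenience on the command line, Kedro also declares a scripts entry, so the project can be invoked as a CLI command as well.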
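The role of __main__.py can be shown without Kedro at all. The sketch below is only an illustration under assumptions: it builds a throwaway package (the name demo_pkg_for_runpy and the helper run_package_as_module are made up for this example) and runs it with the standard library's runpy, which is the same machinery python -m uses.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path


def run_package_as_module(package_name: str) -> dict:
    """Create a throwaway package with a __main__.py and run it like `python -m`."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / package_name
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        # The generated __main__.py derives the package name from its own
        # location, the same trick the Kedro starter's __main__.py uses.
        (pkg_dir / "__main__.py").write_text(
            "from pathlib import Path\n"
            "PACKAGE = Path(__file__).parent.name\n"
            "RAN_AS = __name__\n"
        )
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # For a package, run_module executes <package>/__main__.py
            # and returns the resulting module globals.
            return runpy.run_module(package_name, run_name="__main__")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(package_name, None)


result = run_package_as_module("demo_pkg_for_runpy")
```

Here result["PACKAGE"] holds the directory name of the package and result["RAN_AS"] is "__main__", which is why the if __name__ == "__main__" guard in a real __main__.py fires under python -m.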

Reference project layout

./spaceflights
├── README.md
├── conf
│   ├── README.md
│   ├── base
│   │   ├── catalog.yml
│   │   ├── parameters.yml
│   │   ├── parameters_data_processing.yml
│   │   ├── parameters_data_science.yml
│   │   └── parameters_reporting.yml
│   └── local
│       └── credentials.yml
├── data
│   ├── 01_raw
│   │   ├── companies.csv
│   │   ├── reviews.csv
│   │   └── shuttles.xlsx
│   ├── 02_intermediate
│   ├── 03_primary
│   ├── 04_feature
│   ├── 05_model_input
│   ├── 06_models
│   ├── 07_model_output
│   └── 08_reporting
├── notebooks
├── pyproject.toml
├── requirements.txt
└── src
    └── spaceflights
        ├── __init__.py
        ├── __main__.py  # required so the project can be run with python -m <projectname>
        ├── pipeline_registry.py
        ├── pipelines
        └── settings.py

CLI mode

pyproject.toml declares the script entry point:

[project.scripts]

spaceflights = "spaceflights.__main__:main"
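A script entry such as spaceflights.__main__:main is a module path and an attribute name joined by a colon. As a rough sketch of how such a target is resolved, the standard library's importlib.metadata.EntryPoint can load one directly. The helper name resolve_script is hypothetical, and json:dumps is used as a stand-in target because spaceflights is only importable after the wheel is installed.

```python
import json
from importlib.metadata import EntryPoint


def resolve_script(name: str, target: str):
    """Resolve a `module:attr` script target to the callable it points at."""
    entry_point = EntryPoint(name=name, value=target, group="console_scripts")
    # load() imports the module part and fetches the attribute part.
    return entry_point.load()


# Stand-in target from the standard library (assumption for the demo only).
resolved = resolve_script("demo", "json:dumps")
```

With the real project installed, the same resolution applied to spaceflights.__main__:main would hand back the main function shown in the next section.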

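Internal handling

The core is __main__.py. It calls Kedro's internal project configuration, locates the run command (which in turn takes care of pipeline discovery), and runs the project through the Kedro framework CLI; the actual execution is carried out by whichever runner is selected.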

__main__.py

"""spaceflights file for ensuring the package is executable
as `spaceflights` and `python -m spaceflights`
"""
from pathlib import Path

from kedro.framework.cli.utils import find_run_command
from kedro.framework.project import configure_project

def main(*args, **kwargs):
    package_name = Path(__file__).parent.name
    # configure the project
    configure_project(package_name)
    # run the project
    run = find_run_command(package_name)
    run(*args, **kwargs)

if __name__ == "__main__":
    main()

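configure_project: project configuration

Note that this step does not use OmegaConf, the library Kedro uses for loading project configuration files. It is settings management built on dynaconf; OmegaConf only comes into play later, during the actual pipeline run.
Its main job is to set up how conf and data are handled at run time, combined with the options defined in the project's settings.py.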

def configure_project(package_name: str) -> None:
    """Configure a Kedro project by populating its settings with values
    defined in user's settings.py and pipeline_registry.py.
    """
    settings_module = f"{package_name}.settings"
    settings.configure(settings_module)

    pipelines_module = f"{package_name}.pipeline_registry"
    pipelines.configure(pipelines_module)

    # Once the project is successfully configured once, store PACKAGE_NAME as a
    # global variable to make it easily accessible. This is used by validate_settings()
    # below, and also by ParallelRunner on Windows, as package_name is required every
    # time a new subprocess is spawned.
    global PACKAGE_NAME  # noqa: PLW0603
    PACKAGE_NAME = package_name

    if PACKAGE_NAME:
        LOGGING.set_project_logging(PACKAGE_NAME)
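Both configure calls above only receive a module name, which suggests a "record now, import later" pattern. The following is a minimal sketch of that pattern, not Kedro's implementation: the class LazyPipelines and the in-memory module demo_pipeline_registry are invented for illustration, and the assumption is that the registry module exposes a register_pipelines() function returning a dict.

```python
import importlib
import sys
import types


class LazyPipelines:
    """Hypothetical stand-in: remembers a module name, imports it on first use."""

    def __init__(self) -> None:
        self._module_name = None
        self._data = None

    def configure(self, module_name: str) -> None:
        # Only record where to look; nothing is imported yet.
        self._module_name = module_name
        self._data = None

    def __getitem__(self, name: str):
        if self._data is None:
            # First access: import the registry module and build the mapping.
            module = importlib.import_module(self._module_name)
            self._data = module.register_pipelines()
        return self._data[name]


# Demo: an in-memory module stands in for <package>.pipeline_registry.
calls = []


def register_pipelines() -> dict:
    calls.append("loaded")
    return {"__default__": "demo pipeline"}


registry = types.ModuleType("demo_pipeline_registry")
registry.register_pipelines = register_pipelines
sys.modules["demo_pipeline_registry"] = registry

pipelines = LazyPipelines()
pipelines.configure("demo_pipeline_registry")
configured_without_loading = calls == []
default_pipeline = pipelines["__default__"]
```

The point of the design is that configuring a project stays cheap and cannot fail on a broken pipeline definition; errors surface only when a pipeline is actually requested.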

find_run_command: locating the run command

def find_run_command(package_name: str) -> Callable:
    """Find the run command to be executed.
       This is either the default run command defined in the Kedro framework or a run command defined by
       an installed plugin.

    Args:
        package_name: The name of the package being run.

    Raises:
        KedroCliError: If the run command is not found.

    Returns:
        Run command to be executed.
    """
    try:
        # first look for a project-level cli module
        project_cli = importlib.import_module(f"{package_name}.cli")
        # fail gracefully if cli.py does not exist
    except ModuleNotFoundError as exc:
        if f"{package_name}.cli" not in str(exc):
            raise
        # load the entry points registered in the "project" group
        plugins = load_entry_points("project")
        run = _find_run_command_in_plugins(plugins) if plugins else None
        if run:
            # use run command from installed plugin if it exists
            return run  # type: ignore[no-any-return]
        # use run command from `kedro.framework.cli.project`
        # nothing found: fall back to run from the default project module
        from kedro.framework.cli.project import run

        return run  # type: ignore[no-any-return]
    # fail badly if cli.py exists, but has no `cli` in it
    if not hasattr(project_cli, "cli"):
        raise KedroCliError(f"Cannot load commands from {package_name}.cli")
    return project_cli.run  # type: ignore[no-any-return]
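The lookup order above (project cli module, then plugin, then framework default) and the string check on the exception message can be tried out with the standard library alone. This is a simplified sketch under assumptions: resolve_run, demo_project, plugin_run and default_run are made-up names, and the hasattr check for a cli attribute is left out.

```python
import importlib
import sys
import types


def resolve_run(package_name, plugin_run=None, default_run=None):
    """Hypothetical stand-in for the lookup order: project cli, plugin, default."""
    try:
        project_cli = importlib.import_module(f"{package_name}.cli")
    except ModuleNotFoundError as exc:
        # Re-raise unless the missing module is exactly <package>.cli,
        # e.g. when the package itself is not installed.
        if f"{package_name}.cli" not in str(exc):
            raise
        return plugin_run or default_run
    return project_cli.run


# An in-memory package with no submodules stands in for a project without cli.py.
demo_project = types.ModuleType("demo_project")
demo_project.__path__ = []
sys.modules["demo_project"] = demo_project


def default_run():
    return "default run"


def plugin_run():
    return "plugin run"


chosen_without_plugin = resolve_run("demo_project", default_run=default_run)
chosen_with_plugin = resolve_run(
    "demo_project", plugin_run=plugin_run, default_run=default_run
)
```

A missing cli.py falls through quietly to the plugin or the default, whereas a package that cannot be imported at all still raises ModuleNotFoundError, because the error message then names the package rather than its cli submodule.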

Entry point groups searched by default

ENTRY_POINT_GROUPS = {
    "global": "kedro.global_commands",
    "project": "kedro.project_commands",
    "init": "kedro.init",
    "line_magic": "kedro.line_magic",
    "hooks": "kedro.hooks",
    "cli_hooks": "kedro.cli_hooks",
    "starters": "kedro.starters",
}

What the project run command does

def run(  # noqa: PLR0913
    tags: str,
    env: str,
    runner: str,
    is_async: bool,
    node_names: str,
    to_nodes: str,
    from_nodes: str,
    from_inputs: str,
    to_outputs: str,
    load_versions: dict[str, str] | None,
    pipeline: str,
    config: str,
    conf_source: str,
    params: dict[str, Any],
    namespace: str,
) -> None:
    """Run the pipeline."""

    runner_obj = load_obj(runner or "SequentialRunner", "kedro.runner")
    tuple_tags = tuple(tags)
    tuple_node_names = tuple(node_names)
    # create a session; the actual run is performed by session.run
    with KedroSession.create(
        env=env, conf_source=conf_source, extra_params=params
    ) as session:
        session.run(
            tags=tuple_tags,
            runner=runner_obj(is_async=is_async),
            node_names=tuple_node_names,
            from_nodes=from_nodes,
            to_nodes=to_nodes,
            from_inputs=from_inputs,
            to_outputs=to_outputs,
            load_versions=load_versions,
            pipeline_name=pipeline,
            namespace=namespace,
        )
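The call load_obj(runner or "SequentialRunner", "kedro.runner") turns a string into a class: a bare name is looked up in the default module, while a dotted path can point somewhere else. The sketch below shows that idea with a hypothetical helper, load_object; it is not Kedro's load_obj, and it uses standard library targets so it runs without Kedro installed.

```python
import importlib
from typing import Any


def load_object(obj_path: str, default_module: str = "") -> Any:
    """Hypothetical helper: import `module.attr`, or `attr` from a default module."""
    # "json.dumps" splits into ("json", "dumps"); a bare "OrderedDict" has no
    # module part, so the default module is used instead.
    module_name, _, attr = obj_path.rpartition(".")
    module = importlib.import_module(module_name or default_module)
    return getattr(module, attr)


ordered_dict_cls = load_object("OrderedDict", "collections")
dumps = load_object("json.dumps", "collections")
```

Read this way, leaving the runner option empty selects SequentialRunner from kedro.runner, and a fully qualified path would let a project supply its own runner class.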

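Notes

The official documentation includes a reference architecture diagram. Reading it alongside the project structure and the code above makes the run mechanism much clearer.

References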

https://docs.kedro.org/en/0.19.5/tutorial/package_a_project.html
https://docs.kedro.org/en/stable/kedro_project_setup/settings.html
https://docs.kedro.org/en/stable/extend_kedro/architecture_overview.html
https://python-poetry.org/docs/pyproject/#scripts

Posted on 2024-09-26 08:00 by 荣锋亮
