kedro data catalog version 处理简单说明

kedro data catalog 支持version 化,这样可以方便的使用特定时间的多数据,以下是对于version 的简单说明

参考使用

pikachu:
  type: kedro_pokemon.datasets.image_dataset.ImageDataset
  filepath: data/01_raw/pokemon-images-and-types/images/images/pikachu.png
  versioned: true

参考处理

核心是获取文件路径,因为版本化之后,会有一个基于时间格式创建的一个子目录,对于多版本的,会包含多个时间文件夹

  • 读取获取版本路径
def _get_load_path(self) -> PurePosixPath:
    if not self._version:
        # When versioning is disabled, load from original filepath
        return self._filepath
 
    load_version = self.resolve_load_version()
    return self._get_versioned_path(load_version)  # type: ignore[arg-type]

resolve_load_version处理

def resolve_load_version(self) -> str | None:
    """Compute the version the dataset should be loaded with."""
    if not self._version:
        return None
    if self._version.load:
        return self._version.load  # type: ignore[no-any-return]
    return self._fetch_latest_load_version()

_fetch_latest_load_version 处理,核心是通过_glob_function 函数遍历,之后是反向排序,实际上就是获取最新的

def _fetch_latest_load_version(self) -> str:
    # When load version is unpinned, fetch the most recent existing
    # version from the given path.
    pattern = str(self._get_versioned_path("*"))
    try:
        version_paths = sorted(self._glob_function(pattern), reverse=True)
    except Exception as exc:
        message = (
            f"Did not find any versions for {self}. This could be "
            f"due to insufficient permission. Exception: {exc}"
        )
        raise VersionNotFoundError(message) from exc
    most_recent = next(
        (path for path in version_paths if self._exists_function(path)), None
    )
    if not most_recent:
        message = f"Did not find any versions for {self}"
        raise VersionNotFoundError(message)
    return PurePath(most_recent).parent.name
  • 写入获取路径
    处理类似,只是一个生成时间格式路径的处理
def resolve_save_version(self) -> str | None:
    """Compute the version the dataset should be saved with."""
    if not self._version:
        return None
    if self._version.save:
        return self._version.save  # type: ignore[no-any-return]
    return self._fetch_latest_save_version()

_fetch_latest_save_version

@cachedmethod(cache=attrgetter("_version_cache"), key=partial(hashkey, "save"))
def _fetch_latest_save_version(self) -> str:
    """Generate and cache the current save version"""
    return generate_timestamp()

generate_timestamp 格式

VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"

说明

kedro data catalog version 的版本化实现内部并不难,但是功能上还是比较有用的,对于数据需要进行历史存储的比较有用

参考资料

https://docs.kedro.org/en/stable/data/advanced_data_catalog_usage.html#how-to-version-a-dataset-using-the-code-api
https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html

posted on 2024-09-23 00:02  荣锋亮  阅读(5)  评论(0编辑  收藏  举报

导航