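A brief look at how kedro data catalog versioning works
The kedro data catalog supports versioning, which makes it easy to work with the data as it existed at a specific point in time. Below is a brief explanation of how versioning is handled.
Example usage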
pikachu:
  type: kedro_pokemon.datasets.image_dataset.ImageDataset
  filepath: data/01_raw/pokemon-images-and-types/images/images/pikachu.png
  versioned: true
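How it works
The core of the implementation is resolving the file path. Once versioning is enabled, each version is stored in its own subdirectory named after a timestamp, so a dataset with several versions has several timestamp directories.
- Resolving the versioned path for loading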
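To make the layout concrete, here is a minimal sketch of how a versioned path can be composed. The helper name `versioned_path` is hypothetical, and the `<filepath>/<version>/<filename>` layout is an assumption inferred from the fact that the loading code reads the version from the parent directory name; this is not kedro's actual implementation.

```python
from pathlib import PurePosixPath


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Hypothetical helper: put the file inside a directory named after the version."""
    path = PurePosixPath(filepath)
    # Assumed layout: <filepath>/<version>/<filename>
    return path / version / path.name


path = versioned_path("data/01_raw/pikachu.png", "2024-01-01T10.00.00.000Z")
print(path)              # full versioned path
print(path.parent.name)  # the version is the parent directory name
```

Under this assumption, every save creates a new timestamp directory next to the existing ones, and the original filename is repeated inside each of them.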
def _get_load_path(self) -> PurePosixPath:
    if not self._version:
        # When versioning is disabled, load from original filepath
        return self._filepath

    load_version = self.resolve_load_version()
    return self._get_versioned_path(load_version)  # type: ignore[arg-type]
How resolve_load_version works
def resolve_load_version(self) -> str | None:
    """Compute the version the dataset should be loaded with."""
    if not self._version:
        return None
    if self._version.load:
        return self._version.load  # type: ignore[no-any-return]
    return self._fetch_latest_load_version()
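The resolution order can be illustrated with a small standalone sketch. `Version` here is a local named tuple with `load` and `save` fields standing in for kedro's own class, and the `fetch_latest` callback is a hypothetical parameter; both are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Callable, NamedTuple


class Version(NamedTuple):
    """Stand-in version spec: a field set to None means 'not pinned'."""

    load: str | None
    save: str | None


def resolve_load_version(
    version: Version | None, fetch_latest: Callable[[], str]
) -> str | None:
    if not version:
        return None  # versioning disabled
    if version.load:
        return version.load  # a pinned load version always wins
    return fetch_latest()  # unpinned: fall back to the newest version


def fake_latest() -> str:
    """Hypothetical lookup that would normally scan the storage."""
    return "2024-02-01T00.00.00.000Z"


disabled = resolve_load_version(None, fake_latest)
pinned = resolve_load_version(Version("2024-01-01T00.00.00.000Z", None), fake_latest)
unpinned = resolve_load_version(Version(None, None), fake_latest)
print(disabled, pinned, unpinned)
```

Note that `Version(None, None)` is still truthy because it is a non-empty tuple, so "versioning enabled but nothing pinned" is distinguishable from "versioning disabled".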
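How _fetch_latest_load_version works: it lists the candidate paths with _glob_function, sorts them in reverse order, and returns the version of the first path that actually exists, which is effectively the most recent one.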
def _fetch_latest_load_version(self) -> str:
    # When load version is unpinned, fetch the most recent existing
    # version from the given path.
    pattern = str(self._get_versioned_path("*"))
    try:
        version_paths = sorted(self._glob_function(pattern), reverse=True)
    except Exception as exc:
        message = (
            f"Did not find any versions for {self}. This could be "
            f"due to insufficient permission. Exception: {exc}"
        )
        raise VersionNotFoundError(message) from exc
    most_recent = next(
        (path for path in version_paths if self._exists_function(path)), None
    )
    if not most_recent:
        message = f"Did not find any versions for {self}"
        raise VersionNotFoundError(message)
    return PurePath(most_recent).parent.name
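The reverse sort works because the timestamp format orders lexicographically in the same way as chronologically, so the newest version ends up first. Below is a minimal sketch using only the standard library. It assumes a `<filepath>/<version>/<filename>` layout, `latest_version` is a hypothetical helper, and `glob.glob` and `os.path.exists` stand in for `_glob_function` and `_exists_function`.

```python
import glob
import os
import tempfile
from pathlib import Path, PurePath


def latest_version(filepath: str) -> str:
    """Hypothetical helper: return the newest version directory name."""
    base = Path(filepath)
    # Assumed layout: <filepath>/<version>/<filename>
    pattern = str(base / "*" / base.name)
    candidates = sorted(glob.glob(pattern), reverse=True)
    most_recent = next((p for p in candidates if os.path.exists(p)), None)
    if not most_recent:
        raise FileNotFoundError(f"Did not find any versions for {filepath}")
    return PurePath(most_recent).parent.name


with tempfile.TemporaryDirectory() as root:
    dataset = Path(root) / "pikachu.png"
    # Create two empty versions of the same file.
    for version in ("2024-01-01T10.00.00.000Z", "2024-03-05T08.30.00.000Z"):
        target = dataset / version / dataset.name
        target.parent.mkdir(parents=True)
        target.write_bytes(b"")
    newest = latest_version(str(dataset))
    print(newest)
```

With the two versions created above, the March timestamp sorts last in ascending order and therefore first after the reverse sort.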
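- Resolving the path for saving
The handling is similar, except that when no save version is pinned, a new timestamp-formatted version is generated.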
def resolve_save_version(self) -> str | None:
    """Compute the version the dataset should be saved with."""
    if not self._version:
        return None
    if self._version.save:
        return self._version.save  # type: ignore[no-any-return]
    return self._fetch_latest_save_version()
_fetch_latest_save_version
@cachedmethod(cache=attrgetter("_version_cache"), key=partial(hashkey, "save"))
def _fetch_latest_save_version(self) -> str:
    """Generate and cache the current save version"""
    return generate_timestamp()
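The caching matters because a timestamp generated on every call would differ each time. A minimal sketch of the idea, using a plain dict instead of the cachetools decorator; the class name `SaveVersionSketch` and its method are hypothetical and only mimic the behaviour described above.

```python
from datetime import datetime, timezone


class SaveVersionSketch:
    """Hypothetical class: generate the save version once, then reuse it."""

    def __init__(self) -> None:
        # Maps a cache key ("save") to the generated version string.
        self._version_cache = {}

    def fetch_latest_save_version(self) -> str:
        if "save" not in self._version_cache:
            now = datetime.now(tz=timezone.utc)
            self._version_cache["save"] = now.strftime("%Y-%m-%dT%H.%M.%S.%fZ")
        return self._version_cache["save"]


dataset = SaveVersionSketch()
first = dataset.fetch_latest_save_version()
second = dataset.fetch_latest_save_version()
print(first == second)
```

Because the value is stored on first use, every later call on the same instance resolves to the same version directory.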
Format used by generate_timestamp
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"
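As a sketch of how a version string in this format can be produced: the function name `generate_timestamp_sketch` is hypothetical, and trimming microseconds down to milliseconds reflects my understanding of kedro's behaviour rather than something shown in the excerpts above, so treat it as an assumption.

```python
from datetime import datetime, timezone

VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def generate_timestamp_sketch() -> str:
    """Hypothetical helper: UTC timestamp with millisecond precision."""
    current_ts = datetime.now(tz=timezone.utc).strftime(VERSION_FORMAT)
    # Assumption: keep milliseconds only, i.e. drop the last three
    # microsecond digits but keep the trailing "Z".
    return current_ts[:-4] + current_ts[-1:]


stamp = generate_timestamp_sketch()
print(stamp)
```

The format uses dots instead of colons in the time part, which keeps the string usable as a directory name, and its fixed-width fields are what make a plain string sort equivalent to a chronological sort.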
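Summary
The versioning implementation inside the kedro data catalog is not complicated, but the feature is quite useful, especially when historical copies of the data need to be kept.
References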
https://docs.kedro.org/en/stable/data/advanced_data_catalog_usage.html#how-to-version-a-dataset-using-the-code-api
https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html