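A brief look at the dbt-checkpoint source structure
As mentioned earlier, dbt-checkpoint parses dbt metadata and runs rule-based checks against it, and it is packaged as a set of pre-commit hooks. The notes below briefly describe the internal implementation.
Configuration
The core is the .pre-commit-hooks.yaml file, a standard pre-commit hook definition.
- Contents
The key fields are id, name, entry, and language. Each entry is actually a console_scripts command declared in the Python package's entry_points.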
- id: check-column-desc-are-same
  name: Check column descriptions are same
  description: Check the models have same descriptions for same column names.
  entry: check-column-desc-are-same
  language: python
  files: '.*\.(yml|yaml)$'
- id: check-column-name-contract
  name: Check column name contract
  description: Check column name abides to contract.
  entry: check-column-name-contract
  language: python
  types_or: [sql, yaml]
- id: check-macro-has-description
  name: Check the macro has description
  description: Ensures that the macro has description in properties file.
  entry: check-macro-has-description
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process yaml and sql
- id: check-macro-arguments-have-desc
  name: Check the macro arguments have description
  description: Ensures that the macro has arguments with descriptions in properties file.
  entry: check-macro-arguments-have-desc
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process yaml and sql
- id: check-model-columns-have-desc
  name: Check the model columns have description
  description: Ensures that the model has columns with descriptions in properties file.
  entry: check-model-columns-have-desc
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process yaml and sql
- id: check-model-has-all-columns
  name: Check the model has all columns in properties file
  description: Ensures that all columns in database are specified in properties file.
  entry: check-model-has-all-columns
  language: python
  types_or: [sql, yaml]
- id: check-model-has-description
  name: Check the model has description
  description: Ensures that the model has description in properties file.
  entry: check-model-has-description
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process yaml and sql
- id: check-model-has-meta-keys
  name: Check the model has keys in the meta part
  description: Ensures that the model has a list of valid meta keys.
  entry: check-model-has-meta-keys
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process YAML and SQL
- id: check-model-has-labels-keys
  name: Check the model has keys in the labels part
  description: Ensures that the model has a list of valid labels keys.
  entry: check-model-has-labels-keys
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process YAML and SQL
- id: check-model-has-properties-file
  name: Check the model has properties file
  description: Ensures that the model has properties file (schema file).
  entry: check-model-has-properties-file
  language: python
  types_or: [sql, yaml]
- id: check-model-has-tests-by-group
  name: Check model; number of tests from group of tests.
  description: Ensures that the model has a number of tests from a group of tests.
  entry: check-model-has-tests-by-group
  language: python
  types_or: [sql, yaml]
- id: check-model-has-tests-by-name
  name: Check the model tests by a test name
  description: Ensures that the model has a number of tests of a certain name (e.g. data, unique).
  entry: check-model-has-tests-by-name
  language: python
  types_or: [sql, yaml]
- id: check-model-has-tests-by-type
  name: Check the model tests by a test type
  description: Ensures that the model has a number of tests of a certain type (data, schema).
  entry: check-model-has-tests-by-type
  language: python
  types_or: [sql, yaml]
- id: check-model-has-tests
  name: Check that model has tests
  description: Ensures that the model has a number of tests.
  entry: check-model-has-tests
  language: python
  types_or: [sql, yaml]
- id: check-model-name-contract
  name: Check model name contract
  description: Check model name abides to contract.
  entry: check-model-name-contract
  language: python
  types_or: [sql]
- id: check-model-parents-schema
  name: Check parent models/sources from certain schema
  entry: check-model-parents-schema
  language: python
  types_or: [yaml, sql]
- id: check-model-parents-database
  name: Check parent models/sources from certain database
  entry: check-model-parents-database
  language: python
  types_or: [sql, yaml]
- id: check-model-parents-and-childs
  name: Check the model has a parents/childs
  description: Ensures the model has a specific number (max/min) of parents or/and childs.
  entry: check-model-parents-and-childs
  language: python
  types_or: [sql, yaml]
- id: check-model-tags
  name: Check the model has valid tags
  description: Ensures that the model has only valid tags from the provided list.
  entry: check-model-tags
  language: python
  types_or: [sql, yaml]
- id: check-script-has-no-table-name
  name: Check the script has not table name
  description: Ensures that the script is using only source or ref macro to specify the table name.
  entry: check-script-has-no-table-name
  language: python
  types_or: [sql]
- id: check-script-ref-and-source
  name: Check the script has existing refs and sources
  description: Ensures that the script contains only existing sources or macros.
  entry: check-script-ref-and-source
  language: python
  types_or: [sql]
- id: check-script-semicolon
  name: Check the script does not contain a semicolon
  description: Ensure that the script does not have a semicolon at the end of the file.
  entry: check-script-semicolon
  language: python
  types_or: [sql]
- id: check-source-childs
  name: Check the source has max/min number of childs.
  description: Ensures the source has a specific number (max/min) of childs.
  entry: check-source-childs
  language: python
  types_or: [sql]
- id: check-source-columns-have-desc
  name: Check for source column descriptions
  description: Ensures that the source has columns with descriptions in the properties file.
  entry: check-source-columns-have-desc
  language: python
  types_or: [yaml]
- id: check-source-has-all-columns
  name: Check source has all columns in properties file
  description: Ensures that all columns in the database are specified in the properties file.
  entry: check-source-has-all-columns
  language: python
  types_or: [yaml]
- id: check-source-table-has-description
  name: Check the source table has description
  description: Ensures that the source table has description in properties file.
  entry: check-source-table-has-description
  language: python
  types_or: [yaml]
- id: check-source-has-freshness
  name: Check the source has the freshness
  description: Ensures that the source has freshness options.
  entry: check-source-has-freshness
  language: python
  types_or: [yaml]
- id: check-source-has-loader
  name: Check the source has loader option
  description: Ensures that the source has loader option.
  entry: check-source-has-loader
  language: python
  types_or: [yaml]
- id: check-source-has-meta-keys
  name: Check the source has keys in the meta part
  description: Ensures that the source has a list of valid meta keys.
  entry: check-source-has-meta-keys
  language: python
  types_or: [yaml]
- id: check-source-has-labels-keys
  name: Check the source has keys in the labels part
  description: Ensures that the source has a list of valid labels keys.
  entry: check-source-has-labels-keys
  language: python
  types_or: [yaml]
- id: check-source-has-tests-by-name
  name: Check the source tests by test name
  description: Ensures that the source has a number of tests of a certain name (e.g. data, unique).
  entry: check-source-has-tests-by-name
  language: python
  types_or: [yaml]
- id: check-source-has-tests-by-type
  name: Check the source tests by test type
  description: Ensures that the source has a number of tests of a certain type (data, schema).
  entry: check-source-has-tests-by-type
  language: python
  types_or: [yaml]
- id: check-source-has-tests-by-group
  name: Check the source tests by test group
  description: Ensures that the source has a number of tests of a certain group (unique, unique-combination-of-columns).
  entry: check-source-has-tests-by-group
  language: python
  types_or: [yaml]
- id: check-source-has-tests
  name: Check the source has tests
  description: Ensures that the source has a number of tests.
  entry: check-source-has-tests
  language: python
  types_or: [yaml]
- id: check-source-tags
  name: Check the source has valid tags
  description: Ensures that the source has only valid tags from the provided list.
  entry: check-source-tags
  language: python
  types_or: [yaml]
- id: dbt-clean
  name: dbt clean
  description: Deletes all folders specified in the clean-targets.
  entry: dbt-clean
  language: python
  pass_filenames: false
- id: dbt-compile
  name: dbt compile
  description: Generates executable SQL from source model, test, and analysis files.
  entry: dbt-compile
  language: python
  types_or: [sql]
  require_serial: true
- id: dbt-deps
  name: dbt deps
  description: Pulls the most recent version of the dependencies listed in your packages.yml.
  entry: dbt-deps
  language: python
  pass_filenames: false
- id: dbt-docs-generate
  name: dbt docs generate
  description: The command is responsible for generating your project's documentation website.
  entry: dbt-docs-generate
  language: python
  pass_filenames: false
- id: dbt-parse
  name: dbt parse
  description: Generates manifest.json from source model, test, and analysis files.
  entry: dbt-parse
  language: python
  types_or: [sql]
  require_serial: true
- id: dbt-run
  name: dbt run
  description: Executes compiled sql model files.
  entry: dbt-run
  language: python
  require_serial: true
  types_or: [sql]
- id: dbt-test
  name: dbt test
  description: Runs tests on data in deployed models.
  entry: dbt-test
  language: python
  require_serial: true
  types_or: [sql]
- id: generate-missing-sources
  name: Generate missing sources
  description: If any source is missing this hook tries to create it.
  entry: generate-missing-sources
  language: python
  types_or: [sql]
- id: generate-model-properties-file
  name: Generate model properties file
  description: Generate model properties file if does not exists.
  entry: generate-model-properties-file
  language: python
  types_or: [sql]
  args:
    [
      "--properties-file",
      "/Users/tomsejr/Documents/03-Workspace/Private/jaffle_shop/{database}/{schema}/{name}.yml",
    ]
  require_serial: true
- id: unify-column-description
  name: Unify column description
  description: Unify column descriptions across all models
  entry: unify-column-description
  language: python
  files: '.*\.(yml|yaml)$'
  require_serial: true
- id: replace-script-table-names
  name: Replace script table names
  description: Replace table names with source or ref macros in the script.
  entry: replace-script-table-names
  language: python
  types_or: [sql]
- id: remove-script-semicolon
  name: Remove script semicolon
  description: Remove semicolon at the end of the script.
  entry: remove-script-semicolon
  language: python
  types_or: [sql]
- id: check-model-materialization-by-childs
  name: Check the materialization of the model by childs
  description: Controls the materialization of the model by its number of childs.
  entry: check-model-materialization-by-childs
  language: python
  types_or: [sql]
- id: check-exposure-has-meta-keys
  name: Check the exposure has keys in the meta part
  description: Ensures that the exposure has a list of valid meta keys.
  entry: check-exposure-has-meta-keys
  language: python
  types_or: [yaml]
- id: check-macro-has-meta-keys
  name: Check the macro has keys in the meta part
  description: Ensures that the macro has a list of valid meta keys.
  entry: check-macro-has-meta-keys
  language: python
  types_or: [yaml]
- id: check-seed-has-meta-keys
  name: Check the seed has keys in the meta part
  description: Ensures that the seed has a list of valid meta keys.
  entry: check-seed-has-meta-keys
  language: python
  types_or: [yaml]
- id: check-snapshot-has-meta-keys
  name: Check the snapshot has keys in the meta part
  description: Ensures that the snapshot has a list of valid meta keys.
  entry: check-snapshot-has-meta-keys
  language: python
  types_or: [sql, yaml]
- id: check-test-has-meta-keys
  name: Check the test has keys in the meta part
  description: Ensures that the test has a list of valid meta keys.
  entry: check-test-has-meta-keys
  language: python
  types_or: [sql]
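Entry implementation
Take one hook as a simple example. It parses the metadata and then evaluates a rule against it. To make this easier, the project provides a utility module, utils.py, which defines the types and the metadata parsing helpers.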
- check_macro_has_description.py
import argparse
import os
import time
from typing import Any, Dict, Optional, Sequence

from dbt_checkpoint.tracking import dbtCheckpointTracking
from dbt_checkpoint.utils import (
    JsonOpenError,
    add_default_args,
    get_dbt_manifest,
    get_filenames,
    get_macro_schemas,
    get_macro_sqls,
    get_macros,
    get_missing_file_paths,
    red,
)


# Rule evaluation based on the parsed metadata
def has_description(
    paths: Sequence[str], manifest: Dict[str, Any], exclude_pattern: str
) -> Dict[str, Any]:
    paths = get_missing_file_paths(paths, manifest, exclude_pattern=exclude_pattern)

    status_code = 0
    ymls = get_filenames(paths, [".yml", ".yaml"])
    sqls = get_macro_sqls(paths, manifest)
    filenames = set(sqls.keys())

    # get manifest macros that pre-commit found as changed
    macros = get_macros(manifest, filenames)
    # if user added schema but did not rerun the macro
    schemas = get_macro_schemas(list(ymls.values()), filenames)
    # convert to sets
    in_macros = {macro.filename for macro in macros if macro.macro.get("description")}
    in_schemas = {
        schema.macro_name for schema in schemas if schema.schema.get("description")
    }
    missing = filenames.difference(in_macros, in_schemas)

    for macro in missing:
        status_code = 1
        print(
            f"{red(sqls.get(macro))}: "
            f"does not have defined description or properties file is missing.",
        )
    return {"status_code": status_code}


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser()
    add_default_args(parser)

    args = parser.parse_args(argv)

    try:
        manifest = get_dbt_manifest(args)
    except JsonOpenError as e:
        print(f"Unable to load manifest file ({e})")
        return 1

    start_time = time.time()
    hook_properties = has_description(
        paths=args.filenames, manifest=manifest, exclude_pattern=args.exclude
    )
    end_time = time.time()
    script_args = vars(args)

    tracker = dbtCheckpointTracking(script_args=script_args)
    tracker.track_hook_event(
        event_name="Hook Executed",
        manifest=manifest,
        event_properties={
            "hook_name": os.path.basename(__file__),
            "description": "Check the macro has description.",
            "status": hook_properties.get("status_code"),
            "execution_time": end_time - start_time,
            "is_pytest": script_args.get("is_test"),
        },
    )

    return hook_properties.get("status_code")


if __name__ == "__main__":
    exit(main())
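The set-difference rule inside has_description can be tried in isolation. The following is a minimal sketch under stated assumptions, not the dbt-checkpoint implementation: it assumes a manifest-like dict whose "macros" entries carry "original_file_path" and "description" keys, and the helper name find_undocumented_macros and the sample data are hypothetical.

```python
from pathlib import Path
from typing import Any, Dict, Sequence, Set


def find_undocumented_macros(
    paths: Sequence[str], manifest: Dict[str, Any]
) -> Set[str]:
    """Return the changed macro files (by stem) that have no description."""
    # File stems of the changed .sql files, e.g. "macros/cents.sql" -> "cents".
    changed = {Path(p).stem for p in paths if p.endswith(".sql")}
    # Stems of macro files whose manifest entry has a non-empty description.
    documented = {
        Path(macro["original_file_path"]).stem
        for macro in manifest.get("macros", {}).values()
        if macro.get("description")
    }
    return changed - documented


# Hypothetical manifest fragment, for illustration only.
sample_manifest = {
    "macros": {
        "macro.demo.cents": {
            "original_file_path": "macros/cents.sql",
            "description": "Convert cents to dollars.",
        },
        "macro.demo.limit": {
            "original_file_path": "macros/limit.sql",
            "description": "",
        },
    }
}

missing = find_undocumented_macros(
    ["macros/cents.sql", "macros/limit.sql"], sample_manifest
)
# As in the hook, a non-empty "missing" set maps to a failing exit code.
status_code = 1 if missing else 0
print(sorted(missing), status_code)
```

The real hook additionally consults the properties files (the in_schemas set), so that a description added in YAML counts even before the manifest is regenerated; this sketch leaves that part out.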
- The console_scripts definition under entry_points
[options.entry_points]
console_scripts =
    check-column-desc-are-same = dbt_checkpoint.check_column_desc_are_same:main
    check-column-name-contract = dbt_checkpoint.check_column_name_contract:main
    check-macro-has-description = dbt_checkpoint.check_macro_has_description:main
    check-macro-arguments-have-desc = dbt_checkpoint.check_macro_arguments_have_desc:main
    check-model-columns-have-desc = dbt_checkpoint.check_model_columns_have_desc:main
    check-model-has-all-columns = dbt_checkpoint.check_model_has_all_columns:main
    check-model-has-description = dbt_checkpoint.check_model_has_description:main
    check-model-has-meta-keys = dbt_checkpoint.check_model_has_meta_keys:main
    check-model-has-labels-keys = dbt_checkpoint.check_model_has_labels_keys:main
    check-model-has-properties-file = dbt_checkpoint.check_model_has_properties_file:main
    check-model-has-tests-by-name = dbt_checkpoint.check_model_has_tests_by_name:main
    check-model-has-tests-by-type = dbt_checkpoint.check_model_has_tests_by_type:main
    check-model-has-tests-by-group = dbt_checkpoint.check_model_has_tests_by_group:main
    check-model-has-tests = dbt_checkpoint.check_model_has_tests:main
    check-model-name-contract = dbt_checkpoint.check_model_name_contract:main
    check-model-parents-and-childs = dbt_checkpoint.check_model_parents_and_childs:main
    check-model-parents-database = dbt_checkpoint.check_model_parents_database:main
    check-model-parents-schema = dbt_checkpoint.check_model_parents_schema:main
    check-model-tags = dbt_checkpoint.check_model_tags:main
    check-script-has-no-table-name = dbt_checkpoint.check_script_has_no_table_name:main
    check-script-ref-and-source = dbt_checkpoint.check_script_ref_and_source:main
    check-script-semicolon = dbt_checkpoint.check_script_semicolon:main
    check-source-childs = dbt_checkpoint.check_source_childs:main
    check-source-columns-have-desc = dbt_checkpoint.check_source_columns_have_desc:main
    check-source-has-all-columns = dbt_checkpoint.check_source_has_all_columns:main
    check-source-table-has-description = dbt_checkpoint.check_source_table_has_description:main
    check-source-has-freshness = dbt_checkpoint.check_source_has_freshness:main
    check-source-has-loader = dbt_checkpoint.check_source_has_loader:main
    check-source-has-meta-keys = dbt_checkpoint.check_source_has_meta_keys:main
    check-source-has-labels-keys = dbt_checkpoint.check_source_has_labels_keys:main
    check-source-has-tests-by-group = dbt_checkpoint.check_source_has_tests_by_group:main
    check-source-has-tests-by-name = dbt_checkpoint.check_source_has_tests_by_name:main
    check-source-has-tests-by-type = dbt_checkpoint.check_source_has_tests_by_type:main
    check-source-has-tests = dbt_checkpoint.check_source_has_tests:main
    check-source-tags = dbt_checkpoint.check_source_tags:main
    check-model-materialization-by-childs = dbt_checkpoint.check_model_materialization_by_childs:main
    dbt-clean = dbt_checkpoint.dbt_clean:main
    dbt-compile = dbt_checkpoint.dbt_compile:main
    dbt-deps = dbt_checkpoint.dbt_deps:main
    dbt-docs-generate = dbt_checkpoint.dbt_docs_generate:main
    dbt-parse = dbt_checkpoint.dbt_parse:main
    dbt-run = dbt_checkpoint.dbt_run:main
    dbt-test = dbt_checkpoint.dbt_test:main
    generate-missing-sources = dbt_checkpoint.generate_missing_sources:main
    generate-model-properties-file = dbt_checkpoint.generate_model_properties_file:main
    unify-column-description = dbt_checkpoint.unify_column_description:main
    replace-script-table-names = dbt_checkpoint.replace_script_table_names:main
    remove-script-semicolon = dbt_checkpoint.remove_script_semicolon:main
    check-snapshot-has-meta-keys = dbt_checkpoint.check_snapshot_has_meta_keys:main
    check-exposure-has-meta-keys = dbt_checkpoint.check_exposure_has_meta_keys:main
    check-macro-has-meta-keys = dbt_checkpoint.check_macro_has_meta_keys:main
    check-seed-has-meta-keys = dbt_checkpoint.check_seed_has_meta_keys:main
    check-test-has-meta-keys = dbt_checkpoint.check_test_has_meta_keys:main
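Each line in the listing above maps a command name to a "module:function" target. As a rough illustration of what that mapping means, the sketch below splits such a line and imports the target. The helper names parse_console_script and load_target are hypothetical, and this is not how pip or setuptools generate the actual launcher scripts; it only mirrors the idea.

```python
import importlib
from typing import Callable, Tuple


def parse_console_script(line: str) -> Tuple[str, str, str]:
    """Split 'name = package.module:func' into (name, module, func)."""
    name, target = (part.strip() for part in line.split("=", 1))
    module_name, func_name = target.split(":", 1)
    return name, module_name, func_name


def load_target(module_name: str, func_name: str) -> Callable:
    """Import the module and return the named callable."""
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# A line in the same format as the console_scripts section above.
name, module_name, func_name = parse_console_script(
    "check-macro-has-description = dbt_checkpoint.check_macro_has_description:main"
)
print(name, module_name, func_name)

# dbt_checkpoint may not be installed here, so the import step is shown
# with a standard-library target instead.
dumps = load_target("json", "dumps")
print(dumps({"ok": True}))
```

With dbt-checkpoint installed, load_target(module_name, func_name) would return the hook's main function, which is what the entry field in .pre-commit-hooks.yaml ends up invoking by command name.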
Usage
Usage follows the standard pre-commit workflow.
- Define .pre-commit-config.yaml
This file lists the hooks to run.
repos:
- repo: https://github.com/dbt-checkpoint/dbt-checkpoint
  rev: v1.2.1
  hooks:
  - id: dbt-parse
  - id: dbt-docs-generate
    args: ["--cmd-flags", "++no-compile"]
  - id: check-script-semicolon
  - id: check-script-has-no-table-name
  - id: check-model-has-all-columns
    name: Check columns - core
    files: ^models/core
  - id: check-model-has-all-columns
    name: Check columns - mart
    files: ^models/mart
  - id: check-model-columns-have-desc
    files: ^models/mart
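The configuration above registers check-model-has-all-columns twice with different files patterns. pre-commit treats files as a Python regular expression searched against each path relative to the repository root. The sketch below imitates only that filtering step; select_files and the staged file list are hypothetical, and the types_or filter that pre-commit also applies is not modeled.

```python
import re
from typing import List, Sequence


def select_files(pattern: str, filenames: Sequence[str]) -> List[str]:
    """Keep the filenames that the hook's `files` pattern matches."""
    regex = re.compile(pattern)
    return [name for name in filenames if regex.search(name)]


# Hypothetical staged files.
staged = [
    "models/core/customers.sql",
    "models/mart/orders.sql",
    "models/mart/orders.yml",
    "macros/cents.sql",
]

core_files = select_files(r"^models/core", staged)
mart_files = select_files(r"^models/mart", staged)
print(core_files)
print(mart_files)
```

Each hook instance therefore receives only its own subset of paths, which is why the same hook id can appear twice under different names.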
- Run
You can install it as a git pre-commit hook (pre-commit install) so that the checks run on git commit, or run it directly with pre-commit run --all-files.
Notes
The core of dbt-checkpoint is parsing dbt metadata, but it also wraps several dbt Core CLI commands, such as run, compile, deps, clean, test, docs generate, and parse. It is worth trying.
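To give a feel for what wrapping a CLI command as a hook can look like, here is a minimal sketch, not dbt-checkpoint's actual code. The run_command helper is hypothetical, and the Python interpreter stands in for the dbt executable so that the example runs without dbt installed.

```python
import subprocess
import sys
from typing import Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command, echo its output, and return its exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.stdout:
        print(result.stdout, end="")
    if result.stderr:
        print(result.stderr, end="", file=sys.stderr)
    return result.returncode


def main() -> int:
    # A real wrapper would build a dbt command line here; the Python
    # interpreter is a stand-in so the sketch runs anywhere.
    cmd = [sys.executable, "-c", "print('parsed')"]
    return run_command(cmd)


exit_code = main()
print(exit_code)
```

The important part is the return value: pre-commit treats a non-zero exit code from the hook as a failure, so passing the wrapped command's exit code through is enough to block the commit when the command fails.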
References
https://github.com/dbt-checkpoint/dbt-checkpoint
https://pre-commit.com/hooks.html