A brief look at the internal implementation of the dbt_artifacts package
I gave a short introduction to dbt_artifacts before; the following looks at it from the design and source-code angle, to make it easier to learn and use.
Usage
dbt_artifacts provides quite a few macros as well as quite a few models (staging models plus dim and fact models). The recommended way to use dbt_artifacts is through dbt's on-run-end hook, executing the {{ dbt_artifacts.upload_results(results) }} macro.
In practice it is recommended to guard on the target environment; for reference:
on-run-end:
- "{% if target.name == 'prod' %}{{ dbt_artifacts.upload_results(results) }}{% endif %}"
Internal processing
Internally it is essentially a pattern of predefined models plus macro calls: the macro invoked from on-run-end dynamically builds relations and calls a few other macros to obtain the data to write and the target relation. dbt_artifacts also ships with fairly complete model descriptions and documentation definitions.
- upload_results processing
{# dbt doesn't like us ref'ing in an operation so we fetch the info from the graph #}
{% macro upload_results(results) -%}
{# Only runs in the execute phase (compile or run); defines the datasets to load #}
{% if execute %}
{% set datasets_to_load = ['exposures', 'seeds', 'snapshots', 'invocations', 'sources', 'tests', 'models'] %}
{% if results != [] %}
{# When executing, and results are available, then upload the results #}
{% set datasets_to_load = ['model_executions', 'seed_executions', 'test_executions', 'snapshot_executions'] + datasets_to_load %}
{% endif %}
{# Upload each data set in turn #}
{% for dataset in datasets_to_load %}
{% do log("Uploading " ~ dataset.replace("_", " "), true) %}
{# Get the results that need to be uploaded #}
{# Fetch the data for the given dataset via the get_dataset_content macro #}
{% set objects = dbt_artifacts.get_dataset_content(dataset) %}
{# Upload in chunks to reduce the query size #}
{% if dataset == 'models' %}
{% set upload_limit = 50 if target.type == 'bigquery' else 100 %}
{% else %}
{% set upload_limit = 300 if target.type == 'bigquery' else 5000 %}
{% endif %}
{# Loop through each chunk in turn #}
{% for i in range(0, objects | length, upload_limit) -%}
{# Get just the objects to load on this loop #}
{% set content = dbt_artifacts.get_table_content_values(dataset, objects[i: i + upload_limit]) %}
{# Insert the content into the metadata table #}
{# Call insert_into_metadata_table to write the data #}
{{ dbt_artifacts.insert_into_metadata_table(
dataset=dataset,
fields=dbt_artifacts.get_column_name_list(dataset),
content=content
)
}}
{# Loop the next 'chunk' #}
{% endfor %}
{# Loop the next 'dataset' #}
{% endfor %}
{% endif %}
{%- endmacro %}
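As the comment at the top of the macro notes, get_dataset_content cannot ref() inside an operation, so it reads node metadata from dbt's graph context variable instead. A minimal sketch of the idea (the sketch name and body below are simplified assumptions; the package's actual macro handles more dataset types and the run results):
{% macro get_dataset_content_sketch(dataset) %}
    {# Sketch only: pull every node of one resource type out of the graph context,
       e.g. dataset='models' maps to resource_type 'model' #}
    {% set objects = graph.nodes.values()
        | selectattr('resource_type', 'equalto', dataset[:-1])
        | list %}
    {{ return(objects) }}
{% endmacro %}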
- insert_into_metadata_table processing
It uses adapter.dispatch to support per-adapter implementations; the core is to wrap the relation, fields, and content into an insert statement and execute it with run_query.
{% macro insert_into_metadata_table(dataset, fields, content) -%}
{% if content != "" %}
{# Get the relation that the results will be uploaded to #}
{% set dataset_relation = dbt_artifacts.get_relation(dataset) %}
{# Insert the data into the table #}
{{ return(adapter.dispatch('insert_into_metadata_table', 'dbt_artifacts')(dataset_relation, fields, content)) }}
{% endif %}
{%- endmacro %}
{% macro spark__insert_into_metadata_table(relation, fields, content) -%}
{% set insert_into_table_query %}
insert into {{ relation }} {{ fields }}
{{ content }}
{% endset %}
{% do run_query(insert_into_table_query) %}
{%- endmacro %}
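adapter.dispatch resolves to an adapter-prefixed variant such as spark__insert_into_metadata_table when one exists, and otherwise falls back to the default__ prefix. A sketch of what a default implementation can look like, following the same pattern as the spark variant above (the package's actual default may differ, for example in how row values are rendered):
{% macro default__insert_into_metadata_table(relation, fields, content) -%}
    {% set insert_into_table_query %}
    insert into {{ relation }} {{ fields }}
    values
    {{ content }}
    {% endset %}
    {# Same pattern: build the statement, then execute it via run_query #}
    {% do run_query(insert_into_table_query) %}
{%- endmacro %}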
- Predefined models
Normally we would do this through a seed or by creating the schema directly; dbt_artifacts works in essentially the same way, defining its base models through a CTE-based select (these models are the tables that the result data is later written into, and also the source for the downstream modeling).
For reference, a model definition; the schema is effectively created through a plain select:
with dummy_cte as (
select 1 as foo
)
{# type_string is a macro defined by the package, built on the column helper api.Column.translate_type #}
select
cast(null as {{ type_string() }}) as command_invocation_id,
cast(null as {{ type_string() }}) as node_id,
cast(null as {{ type_timestamp() }}) as run_started_at,
cast(null as {{ type_string() }}) as database,
cast(null as {{ type_string() }}) as schema,
cast(null as {{ type_string() }}) as name,
cast(null as {{ type_array() }}) as depends_on_nodes,
cast(null as {{ type_string() }}) as package_name,
cast(null as {{ type_string() }}) as path,
cast(null as {{ type_string() }}) as checksum,
cast(null as {{ type_string() }}) as materialization,
cast(null as {{ type_array() }}) as tags,
cast(null as {{ type_json() }}) as meta,
cast(null as {{ type_string() }}) as alias,
cast(null as {{ type_json() }}) as all_results
from dummy_cte
where 1 = 0
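The type_* helpers translate a generic type name into the adapter-specific one. A minimal sketch using api.Column.translate_type, as the comment above notes (the macro name matches the package; the exact body is an assumption):
{% macro type_string() %}
    {# Translate the generic 'string' type into the adapter's native type,
       e.g. STRING on BigQuery, varchar on many warehouses #}
    {{ return(api.Column.translate_type("string")) }}
{% endmacro %}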
The model configuration is standard dbt practice: it provides complete column descriptions and reuses docs blocks through the doc() macro:
version: 2
models:
- name: models
description: Staging model that contains metadata about model executions. One row per node per run.
columns:
- name: command_invocation_id
description: '{{ doc("command_invocation_id") }}'
- name: node_id
description: '{{ doc("node_id") }}'
- name: run_started_at
description: '{{ doc("run_started_at") }}'
- name: database
description: '{{ doc("database") }}'
- name: schema
description: '{{ doc("schema") }}'
- name: name
description: '{{ doc("name") }}'
- name: depends_on_nodes
description: '{{ doc("depends_on_nodes") }}'
- name: package_name
description: '{{ doc("package_name") }}'
- name: path
description: '{{ doc("path") }}'
- name: checksum
description: '{{ doc("checksum") }}'
- name: materialization
description: '{{ doc("materialization") }}'
- name: tags
description: '{{ doc("tags") }}'
- name: meta
description: '{{ doc("meta") }}'
- name: alias
description: '{{ doc("alias") }}'
- name: all_results
description: '{{ doc("all_results") }}'
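Each doc() reference above resolves to a docs block that the package defines once and reuses across models. A sketch of such a block (the syntax is standard dbt; the wording here is illustrative):
{% docs command_invocation_id %}
The unique id that dbt generates for a single invocation; it is used to join
execution results back to the nodes of that run.
{% enddocs %}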
- Other models built on the predefined models
New models are then developed with ref plus CTEs on top of the base models, producing dim and fact models that match real needs.
stg_dbt__exposures.sql staging model definition:
with base as (
select *
from {{ ref('exposures') }}
),
enhanced as (
select
{{ dbt_artifacts.generate_surrogate_key(['command_invocation_id', 'node_id']) }} as exposure_execution_id,
command_invocation_id,
node_id,
run_started_at,
name,
type,
owner,
maturity,
path,
description,
url,
package_name,
depends_on_nodes,
tags
from base
)
select * from enhanced
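generate_surrogate_key builds a deterministic key by hashing the concatenation of the null-safe input columns, in the spirit of dbt_utils.generate_surrogate_key. A sketch of the idea (the sketch name is illustrative, and the package's vendored implementation may differ in details such as null handling):
{%- macro generate_surrogate_key_sketch(field_list) -%}
    {%- set fields = [] -%}
    {%- for field in field_list -%}
        {# Coalesce so that null columns still produce a stable hash input #}
        {%- do fields.append("coalesce(cast(" ~ field ~ " as " ~ dbt.type_string() ~ "), '')") -%}
        {%- if not loop.last -%}{%- do fields.append("'-'") -%}{%- endif -%}
    {%- endfor -%}
    {# dbt's cross-database hash/concat helpers render the final expression #}
    {{ dbt.hash(dbt.concat(fields)) }}
{%- endmacro -%}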
dim_dbt__current_models.sql definition:
with base as (
select *
from {{ ref('stg_dbt__models') }}
),
model_executions as (
select *
from {{ ref('stg_dbt__model_executions') }}
),
latest_models as (
/* Retrieves the models present in the most recent run */
select *
from base
where run_started_at = (select max(run_started_at) from base)
),
latest_models_runs as (
/* Retrieves all successful run information for the models present in the most
recent run and ranks them based on query completion time */
select
model_executions.node_id
, model_executions.was_full_refresh
, model_executions.query_completed_at
, model_executions.total_node_runtime
, model_executions.rows_affected
{% if target.type == 'bigquery' %}
, model_executions.bytes_processed
{% endif %}
/* Row number by refresh and node ID */
, row_number() over (
partition by latest_models.node_id, model_executions.was_full_refresh
order by model_executions.query_completed_at desc /* most recent ranked first */
) as run_idx
/* Row number by node ID */
, row_number() over (
partition by latest_models.node_id
order by model_executions.query_completed_at desc /* most recent ranked first */
) as run_idx_id_only
from model_executions
inner join latest_models on model_executions.node_id = latest_models.node_id
where model_executions.status = 'success'
),
latest_model_stats as (
select
node_id
, max(case when was_full_refresh then query_completed_at end) as last_full_refresh_run_completed_at
, max(case when was_full_refresh then total_node_runtime end) as last_full_refresh_run_total_runtime
, max(case when was_full_refresh then rows_affected end) as last_full_refresh_run_rows_affected
{% if target.type == 'bigquery' %}
, max(case when was_full_refresh then bytes_processed end) as last_full_refresh_run_bytes_processed
{% endif %}
, max(case when run_idx_id_only = 1 then query_completed_at end) as last_run_completed_at
, max(case when run_idx_id_only = 1 then total_node_runtime end) as last_run_total_runtime
, max(case when run_idx_id_only = 1 then rows_affected end) as last_run_rows_affected
{% if target.type == 'bigquery' %}
, max(case when run_idx_id_only = 1 then bytes_processed end) as last_run_bytes_processed
{% endif %}
, max(case when not was_full_refresh then query_completed_at end) as last_incremental_run_completed_at
, max(case when not was_full_refresh then total_node_runtime end) as last_incremental_run_total_runtime
, max(case when not was_full_refresh then rows_affected end) as last_incremental_run_rows_affected
{% if target.type == 'bigquery' %}
, max(case when not was_full_refresh then bytes_processed end) as last_incremental_run_bytes_processed
{% endif %}
from latest_models_runs
where run_idx = 1
group by 1
),
final as (
select
latest_models.*
, latest_model_stats.last_full_refresh_run_completed_at
, latest_model_stats.last_full_refresh_run_total_runtime
, latest_model_stats.last_full_refresh_run_rows_affected
{% if target.type == 'bigquery' %}
, latest_model_stats.last_full_refresh_run_bytes_processed
{% endif %}
, latest_model_stats.last_run_completed_at
, latest_model_stats.last_run_total_runtime
, latest_model_stats.last_run_rows_affected
{% if target.type == 'bigquery' %}
, latest_model_stats.last_run_bytes_processed
{% endif %}
, latest_model_stats.last_incremental_run_completed_at
, latest_model_stats.last_incremental_run_total_runtime
, latest_model_stats.last_incremental_run_rows_affected
{% if target.type == 'bigquery' %}
, latest_model_stats.last_incremental_run_bytes_processed
{% endif %}
from latest_models
left join latest_model_stats
on latest_models.node_id = latest_model_stats.node_id
)
select * from final
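Once these models are built, ordinary queries against them answer operational questions. For example, a simple analysis to list the slowest models in the latest run (illustrative; the column names come from the dim model above):
select
    name,
    last_run_completed_at,
    last_run_total_runtime
from {{ ref('dim_dbt__current_models') }}
order by last_run_total_runtime desc
limit 10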
Notes
The above is only a rough walkthrough. dbt_artifacts also includes quite a few other capabilities, such as model migrations (driven by the evolution of dbt_artifacts itself) and more; for details, the source code is worth studying.
References
https://github.com/brooklyn-data/dbt_artifacts
https://brooklyn-data.github.io/dbt_artifacts/#!/overview