dbt 默认snapshot snapshot_staging_table 处理简单说明
dbt 默认snapshot 的处理使用了 snapshot_staging_table 中间表,了解内部处理有利于学习快照处理技术
macro 定义
内部的处理实际上就是对于已经存在的快照表数据与目前需要进行快照的查询数据进行比较,会处理添加的,删除的,以及交集数据,dbt
使用了自己的支持key,以及基于scd2 的模型进行数据存储
如下
{% macro snapshot_staging_table(strategy, source_sql, target_relation) -%}
{{ adapter.dispatch('snapshot_staging_table', 'dbt')(strategy, source_sql, target_relation) }}
{% endmacro %}
{% macro default__snapshot_staging_table(strategy, source_sql, target_relation) -%}
with snapshot_query as (
{{ source_sql }}
),
snapshotted_data as (
select *,
{{ strategy.unique_key }} as dbt_unique_key
from {{ target_relation }}
where dbt_valid_to is null
),
insertions_source_data as (
select
*,
{{ strategy.unique_key }} as dbt_unique_key,
{{ strategy.updated_at }} as dbt_updated_at,
{{ strategy.updated_at }} as dbt_valid_from,
nullif({{ strategy.updated_at }}, {{ strategy.updated_at }}) as dbt_valid_to,
{{ strategy.scd_id }} as dbt_scd_id
from snapshot_query
),
updates_source_data as (
select
*,
{{ strategy.unique_key }} as dbt_unique_key,
{{ strategy.updated_at }} as dbt_updated_at,
{{ strategy.updated_at }} as dbt_valid_from,
{{ strategy.updated_at }} as dbt_valid_to
from snapshot_query
),
{%- if strategy.invalidate_hard_deletes %}
deletes_source_data as (
select
*,
{{ strategy.unique_key }} as dbt_unique_key
from snapshot_query
),
{% endif %}
insertions as (
select
'insert' as dbt_change_type,
source_data.*
from insertions_source_data as source_data
left outer join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
where snapshotted_data.dbt_unique_key is null
or (
snapshotted_data.dbt_unique_key is not null
and (
{{ strategy.row_changed }}
)
)
),
updates as (
select
'update' as dbt_change_type,
source_data.*,
snapshotted_data.dbt_scd_id
from updates_source_data as source_data
join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
where (
{{ strategy.row_changed }}
)
)
{%- if strategy.invalidate_hard_deletes -%}
,
deletes as (
select
'delete' as dbt_change_type,
source_data.*,
{{ snapshot_get_time() }} as dbt_valid_from,
{{ snapshot_get_time() }} as dbt_updated_at,
{{ snapshot_get_time() }} as dbt_valid_to,
snapshotted_data.dbt_scd_id
from snapshotted_data
left join deletes_source_data as source_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
where source_data.dbt_unique_key is null
)
{%- endif %}
select * from insertions
union all
select * from updates
{%- if strategy.invalidate_hard_deletes %}
union all
select * from deletes
{%- endif %}
{%- endmacro %}
- scd_id 内部处理
目前是基于了md5
{% macro snapshot_hash_arguments(args) -%}
{{ adapter.dispatch('snapshot_hash_arguments', 'dbt')(args) }}
{%- endmacro %}
{% macro default__snapshot_hash_arguments(args) -%}
md5({%- for arg in args -%}
coalesce(cast({{ arg }} as varchar ), '')
{% if not loop.last %} || '|' || {% endif %}
{%- endfor -%})
{%- endmacro %}
参数处理上会结合配置的主键以及更新时间处理(当然可以自己调整)
{% set scd_id_expr = snapshot_hash_arguments([primary_key, updated_at]) %}
说明
了解dbt 内部默认快照的处理机制还是比较重要的,一个是可以学习,同时也有利于我们开发自己的快照实现
参考资料
core/dbt/include/global_project/macros/materializations/snapshots/strategies.sql
core/dbt/include/global_project/macros/materializations/snapshots/snapshot.sql
core/dbt/include/global_project/macros/materializations/snapshots/helpers.sql
https://docs.getdbt.com/reference/snapshot-properties