A brief note on the internals of the dbt_artifacts package

I have previously given a brief introduction to dbt_artifacts. The notes below look at it from the design and source-code angle, to make it easier to learn and use.

Usage

dbt_artifacts provides a good number of macros and also ships a set of models (staging models plus dim and fact models). The recommended way to use
dbt_artifacts is through dbt's on-run-end hook, executing the {{ dbt_artifacts.upload_results(results) }} macro.
In practice it is also worth guarding on the target environment, as in the following configuration:

on-run-end:
  - "{% if target.name == 'prod' %}{{ dbt_artifacts.upload_results(results) }}{% endif %}"

Internals

Internally the package follows a pattern of predefined models plus macro calls: the macro fired from on-run-end dynamically resolves the target relation and calls a few helper macros to collect the data to write. dbt_artifacts also ships fairly complete model descriptions and documentation definitions.

  • The upload_results macro
{# dbt doesn't like us ref'ing in an operation so we fetch the info from the graph #}
 
{% macro upload_results(results) -%}
    {# Only proceed in an execution phase (compile or run); the dataset list below defines the node types to upload #}
    {% if execute %}
 
        {% set datasets_to_load = ['exposures', 'seeds', 'snapshots', 'invocations', 'sources', 'tests', 'models'] %}
        {% if results != [] %}
            {# When executing, and results are available, then upload the results #}
            {% set datasets_to_load = ['model_executions', 'seed_executions', 'test_executions', 'snapshot_executions'] + datasets_to_load %}
        {% endif %}
 
        {# Upload each data set in turn #}
        {% for dataset in datasets_to_load %}
 
            {% do log("Uploading " ~ dataset.replace("_", " "), true) %}
 
            {# Get the results that need to be uploaded #}
            {# get_dataset_content returns the rows for the given dataset #}
            {% set objects = dbt_artifacts.get_dataset_content(dataset) %}
 
            {# Upload in chunks to reduce the query size #}
            {% if dataset == 'models' %}
                {% set upload_limit = 50 if target.type == 'bigquery' else 100 %}
            {% else %}
                {% set upload_limit = 300 if target.type == 'bigquery' else 5000 %}
            {% endif %}
 
            {# Loop through each chunk in turn #}
            {% for i in range(0, objects | length, upload_limit) -%}
 
                {# Get just the objects to load on this loop #}
                {% set content = dbt_artifacts.get_table_content_values(dataset, objects[i: i + upload_limit]) %}
 
                {# Insert the content into the metadata table #}
                {# insert_into_metadata_table performs the actual write #}
                {{ dbt_artifacts.insert_into_metadata_table(
                    dataset=dataset,
                    fields=dbt_artifacts.get_column_name_list(dataset),
                    content=content
                    )
                }}
 
            {# Loop the next 'chunk' #}
            {% endfor %}
 
        {# Loop the next 'dataset' #}
        {% endfor %}
 
    {% endif %}
 
{%- endmacro %}
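
The chunk loop above is plain Jinja: range(0, n, step) yields the start index of each chunk, and the slice objects[i: i + upload_limit] takes at most upload_limit rows per insert. A minimal standalone sketch of the same pattern (the literal list is made up for illustration):

{% set objects = ['a', 'b', 'c', 'd', 'e'] %}
{% for i in range(0, objects | length, 2) %}
    {# logs ['a', 'b'], then ['c', 'd'], then ['e'] #}
    {% do log(objects[i: i + 2], true) %}
{% endfor %}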

  • The insert_into_metadata_table macro
    It uses adapter.dispatch to select an adapter-specific implementation; at its core it wraps relation, fields, and content into an insert statement and executes it with run_query.

{% macro insert_into_metadata_table(dataset, fields, content) -%}
 
    {% if content != "" %}
 
        {# Get the relation that the results will be uploaded to #}
        {% set dataset_relation = dbt_artifacts.get_relation(dataset) %}
        {# Insert the data into the table #}
        {{ return(adapter.dispatch('insert_into_metadata_table', 'dbt_artifacts')(dataset_relation, fields, content)) }}
 
    {% endif %}
 
{%- endmacro %}
 
{% macro spark__insert_into_metadata_table(relation, fields, content) -%}
 
    {% set insert_into_table_query %}
    insert into {{ relation }} {{ fields }}
    {{ content }}
    {% endset %}
 
    {% do run_query(insert_into_table_query) %}
 
{%- endmacro %}
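
Concretely, get_column_name_list renders the field list and get_table_content_values renders the values, so the statement handed to run_query ends up looking roughly like the following (table name, columns shown, and values are illustrative, not real output):

insert into analytics.dbt_artifacts.model_executions
    (command_invocation_id, node_id, run_started_at, status, total_node_runtime, rows_affected)
values
    ('<invocation-uuid>', 'model.my_project.orders', '2024-06-16 07:00:00', 'success', 1.42, 1000)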
  • Predefined models
    Normally we would create such tables via seeds or by defining a schema directly; dbt_artifacts does much the same, using a CTE to build its base models (these models are the tables the run results are later written into, and also the source for downstream modeling).
    See the model definition below: the schema is effectively created through a select statement.
with dummy_cte as (
    select 1 as foo
)
{# type_string is a macro defined by the package, built on the column helper api.Column.translate_type #}
select
    cast(null as {{ type_string() }}) as command_invocation_id,
    cast(null as {{ type_string() }}) as node_id,
    cast(null as {{ type_timestamp() }}) as run_started_at,
    cast(null as {{ type_string() }}) as database,
    cast(null as {{ type_string() }}) as schema,
    cast(null as {{ type_string() }}) as name,
    cast(null as {{ type_array() }}) as depends_on_nodes,
    cast(null as {{ type_string() }}) as package_name,
    cast(null as {{ type_string() }}) as path,
    cast(null as {{ type_string() }}) as checksum,
    cast(null as {{ type_string() }}) as materialization,
    cast(null as {{ type_array() }}) as tags,
    cast(null as {{ type_json() }}) as meta,
    cast(null as {{ type_string() }}) as alias,
    cast(null as {{ type_json() }}) as all_results
from dummy_cte
where 1 = 0
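
The type_* helpers map a generic type name to the current adapter's concrete type via api.Column.translate_type (e.g. string becomes STRING on BigQuery, text or varchar elsewhere). A sketch of how such a macro can be written; this is an approximation, not the package's exact source:

{% macro type_string() %}
    {# translate_type converts the generic 'string' into the adapter-specific type name #}
    {{ return(api.Column.translate_type("string")) }}
{% endmacro %}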

The model configuration is standard dbt practice: it provides a full set of column descriptions and reuses doc blocks through the doc() macro.

version: 2
models:
  - name: models
    description: Staging model that contains metadata about model executions. One row per node per run.
    columns:
      - name: command_invocation_id
        description: '{{ doc("command_invocation_id") }}'
      - name: node_id
        description: '{{ doc("node_id") }}'
      - name: run_started_at
        description: '{{ doc("run_started_at") }}'
      - name: database
        description: '{{ doc("database") }}'
      - name: schema
        description: '{{ doc("schema") }}'
      - name: name
        description: '{{ doc("name") }}'
      - name: depends_on_nodes
        description: '{{ doc("depends_on_nodes") }}'
      - name: package_name
        description: '{{ doc("package_name") }}'
      - name: path
        description: '{{ doc("path") }}'
      - name: checksum
        description: '{{ doc("checksum") }}'
      - name: materialization
        description: '{{ doc("materialization") }}'
      - name: tags
        description: '{{ doc("tags") }}'
      - name: meta
        description: '{{ doc("meta") }}'
      - name: alias
        description: '{{ doc("alias") }}'
      - name: all_results
        description: '{{ doc("all_results") }}'
  • Other models built on the predefined models
    These are simply new models developed with ref plus CTEs, yielding dim and fact models that match real needs.
    stg_dbt__exposures.sql, the staging model definition:
with base as (
 
    select *
    from {{ ref('exposures') }}
 
),
 
enhanced as (
 
    select
        {{ dbt_artifacts.generate_surrogate_key(['command_invocation_id', 'node_id']) }} as exposure_execution_id,
        command_invocation_id,
        node_id,
        run_started_at,
        name,
        type,
        owner,
        maturity,
        path,
        description,
        url,
        package_name,
        depends_on_nodes,
        tags
    from base
 
)
 
select * from enhanced
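
generate_surrogate_key here works like the dbt_utils macro of the same name: each field is cast to text, null-coalesced, and the delimited concatenation is hashed. The rendered SQL is roughly as follows; the exact cast type and null placeholder vary by adapter and package version:

md5(
    coalesce(cast(command_invocation_id as varchar), '_null_')
    || '-' || coalesce(cast(node_id as varchar), '_null_')
) as exposure_execution_id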

The dim_dbt__current_models.sql definition:

with base as (
    select *
    from {{ ref('stg_dbt__models') }}
),
 
model_executions as (
    select *
    from {{ ref('stg_dbt__model_executions') }}
),
 
latest_models as (
    /* Retrieves the models present in the most recent run */
    select *
    from base
    where run_started_at = (select max(run_started_at) from base)
),
 
latest_models_runs as (
    /* Retrieves all successful run information for the models present in the most
    recent run and ranks them based on query completion time */
    select
        model_executions.node_id
        , model_executions.was_full_refresh
        , model_executions.query_completed_at
        , model_executions.total_node_runtime
        , model_executions.rows_affected
        {% if target.type == 'bigquery' %}
        , model_executions.bytes_processed
        {% endif %}
        /* Row number by refresh and node ID */
        , row_number() over (
            partition by latest_models.node_id, model_executions.was_full_refresh
            order by model_executions.query_completed_at desc /* most recent ranked first */
        ) as run_idx
        /* Row number by node ID */
        , row_number() over (
            partition by latest_models.node_id
            order by model_executions.query_completed_at desc /* most recent ranked first */
        ) as run_idx_id_only
    from model_executions
    inner join latest_models on model_executions.node_id = latest_models.node_id
    where model_executions.status = 'success'
),
 
latest_model_stats as (
    select
        node_id
        , max(case when was_full_refresh then query_completed_at end) as last_full_refresh_run_completed_at
        , max(case when was_full_refresh then total_node_runtime end) as last_full_refresh_run_total_runtime
        , max(case when was_full_refresh then rows_affected end) as last_full_refresh_run_rows_affected
        {% if target.type == 'bigquery' %}
        , max(case when was_full_refresh then bytes_processed end) as last_full_refresh_run_bytes_processed
        {% endif %}
        , max(case when run_idx_id_only = 1 then query_completed_at end) as last_run_completed_at
        , max(case when run_idx_id_only = 1 then total_node_runtime end) as last_run_total_runtime
        , max(case when run_idx_id_only = 1 then rows_affected end) as last_run_rows_affected
        {% if target.type == 'bigquery' %}
        , max(case when run_idx_id_only = 1 then bytes_processed end) as last_run_bytes_processed
        {% endif %}
        , max(case when not was_full_refresh then query_completed_at end) as last_incremental_run_completed_at
        , max(case when not was_full_refresh then total_node_runtime end) as last_incremental_run_total_runtime
        , max(case when not was_full_refresh then rows_affected end) as last_incremental_run_rows_affected
        {% if target.type == 'bigquery' %}
        , max(case when not was_full_refresh then bytes_processed end) as last_incremental_run_bytes_processed
        {% endif %}
    from latest_models_runs
    where run_idx = 1
    group by 1
),
 
final as (
    select
        latest_models.*
        , latest_model_stats.last_full_refresh_run_completed_at
        , latest_model_stats.last_full_refresh_run_total_runtime
        , latest_model_stats.last_full_refresh_run_rows_affected
        {% if target.type == 'bigquery' %}
        , latest_model_stats.last_full_refresh_run_bytes_processed
        {% endif %}
        , latest_model_stats.last_run_completed_at
        , latest_model_stats.last_run_total_runtime
        , latest_model_stats.last_run_rows_affected
        {% if target.type == 'bigquery' %}
        , latest_model_stats.last_run_bytes_processed
        {% endif %}
        , latest_model_stats.last_incremental_run_completed_at
        , latest_model_stats.last_incremental_run_total_runtime
        , latest_model_stats.last_incremental_run_rows_affected
        {% if target.type == 'bigquery' %}
        , latest_model_stats.last_incremental_run_bytes_processed
        {% endif %}
    from latest_models
    left join latest_model_stats
        on latest_models.node_id = latest_model_stats.node_id
)
 
select * from final
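
Once the package's models are built they can be queried like any other table. For example, a sketch for finding the slowest models in the latest run; your_metadata_schema is a placeholder for the schema the models were materialized in:

select
    name
    , last_run_completed_at
    , last_run_total_runtime
from your_metadata_schema.dim_dbt__current_models
order by last_run_total_runtime desc
limit 10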

Notes

The above is only a rough walkthrough; dbt_artifacts includes quite a few other capabilities as well, such as schema migrations between package versions (central to how dbt_artifacts has evolved). For the details, the source code is worth reading.

References

https://github.com/brooklyn-data/dbt_artifacts
https://brooklyn-data.github.io/dbt_artifacts/#!/overview
