A brief look at the internal implementation of the dbt_artifacts package
I gave a short introduction to dbt_artifacts before; the following looks at it from the design and source-code angle, to make it easier to learn and use.
Usage
dbt_artifacts provides quite a few macros as well as quite a few models (staging models plus dim and fact models). The recommended way to use dbt_artifacts is through dbt's on-run-end hook, executing the {{ dbt_artifacts.upload_results(results) }} macro.
In practice it is recommended to guard on the target environment; for reference:
on-run-end:
- "{% if target.name == 'prod' %}{{ dbt_artifacts.upload_results(results) }}{% endif %}"
Internal processing
Internally it is essentially a pattern of predefined models plus macro calls: the macro invoked from on-run-end dynamically builds relations and calls a few other macros to obtain the data to write and the target relation. dbt_artifacts also ships with fairly complete model descriptions and documentation definitions.
- upload_results processing
{# dbt doesn't like us ref'ing in an operation so we fetch the info from the graph #}
{% macro upload_results(results) -%}
{# Only runs in the execute phase (compile or run); defines the datasets to load #}
{% if execute %}
{% set datasets_to_load = ['exposures', 'seeds', 'snapshots', 'invocations', 'sources', 'tests', 'models'] %}
{% if results != [] %}
{# When executing, and results are available, then upload the results #}
{% set datasets_to_load = ['model_executions', 'seed_executions', 'test_executions', 'snapshot_executions'] + datasets_to_load %}
{% endif %}
{# Upload each data set in turn #}
{% for dataset in datasets_to_load %}
{% do log("Uploading " ~ dataset.replace("_", " "), true) %}
{# Get the results that need to be uploaded #}
{# Fetch the data for the given dataset via the get_dataset_content macro #}
{% set objects = dbt_artifacts.get_dataset_content(dataset) %}
{# Upload in chunks to reduce the query size #}
{% if dataset == 'models' %}
{% set upload_limit = 50 if target.type == 'bigquery' else 100 %}
{% else %}
{% set upload_limit = 300 if target.type == 'bigquery' else 5000 %}
{% endif %}
{# Loop through each chunk in turn #}
{% for i in range(0, objects | length, upload_limit) -%}
{# Get just the objects to load on this loop #}
{% set content = dbt_artifacts.get_table_content_values(dataset, objects[i: i + upload_limit]) %}
{# Insert the content into the metadata table #}
{# Call insert_into_metadata_table to write the data #}
{{ dbt_artifacts.insert_into_metadata_table(
dataset=dataset,
fields=dbt_artifacts.get_column_name_list(dataset),
content=content
)
}}
{# Loop the next 'chunk' #}
{% endfor %}
{# Loop the next 'dataset' #}
{% endfor %}
{% endif %}
{%- endmacro %}
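As the comment at the top of the macro notes, get_dataset_content cannot ref() inside an operation, so it reads node metadata from dbt's graph context variable instead. A minimal sketch of the idea (the sketch name and body below are simplified assumptions; the package's actual macro handles more dataset types and the run results):
{% macro get_dataset_content_sketch(dataset) %}
    {# Sketch only: pull every node of one resource type out of the graph context,
       e.g. dataset='models' maps to resource_type 'model' #}
    {% set objects = graph.nodes.values()
        | selectattr('resource_type', 'equalto', dataset[:-1])
        | list %}
    {{ return(objects) }}
{% endmacro %}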
- insert_into_metadata_table processing
It uses adapter.dispatch to support per-adapter implementations; the core is to wrap the relation, fields, and content into an insert statement and execute it with run_query.
{% macro insert_into_metadata_table(dataset, fields, content) -%}
{% if content != "" %}
{# Get the relation that the results will be uploaded to #}
{% set dataset_relation = dbt_artifacts.get_relation(dataset) %}
{# Insert the data into the table #}
{{ return(adapter.dispatch('insert_into_metadata_table', 'dbt_artifacts')(dataset_relation, fields, content)) }}
{% endif %}
{%- endmacro %}
{% macro spark__insert_into_metadata_table(relation, fields, content) -%}
{% set insert_into_table_query %}
insert into {{ relation }} {{ fields }}
{{ content }}
{% endset %}
{% do run_query(insert_into_table_query) %}
{%- endmacro %}
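adapter.dispatch resolves to an adapter-prefixed variant such as spark__insert_into_metadata_table when one exists, and otherwise falls back to the default__ prefix. A sketch of what a default implementation can look like, following the same pattern as the spark variant above (the package's actual default may differ, for example in how row values are rendered):
{% macro default__insert_into_metadata_table(relation, fields, content) -%}
    {% set insert_into_table_query %}
    insert into {{ relation }} {{ fields }}
    values
    {{ content }}
    {% endset %}
    {# Same pattern: build the statement, then execute it via run_query #}
    {% do run_query(insert_into_table_query) %}
{%- endmacro %}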
- Predefined models
Normally we would do this through a seed or by creating the schema directly; dbt_artifacts works in essentially the same way, defining its base models through a CTE-based select (these models are the tables that the result data is later written into, and also the source for the downstream modeling).
For reference, a model definition; the schema is effectively created through a plain select:
with dummy_cte as (
select 1 as foo
)
{# type_string is a macro defined by the package, built on the column helper api.Column.translate_type #}
select
cast(null as {{ type_string() }}) as command_invocation_id,
cast(null as {{ type_string() }}) as node_id,
cast(null as {{ type_timestamp() }}) as run_started_at,
cast(null as {{ type_string() }}) as database,
cast(null as {{ type_string() }}) as schema,
cast(null as {{ type_string() }}) as name,
cast(null as {{ type_array() }}) as depends_on_nodes,
cast(null as {{ type_string() }}) as package_name,
cast(null as {{ type_string() }}) as path,
cast(null as {{ type_string() }}) as checksum,
cast(null as {{ type_string() }}) as materialization,
cast(null as {{ type_array() }}) as tags,
cast(null as {{ type_json() }}) as meta,
cast(null as {{ type_string() }}) as alias,
cast(null as {{ type_json() }}) as all_results
from dummy_cte
where 1 = 0
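The type_* helpers translate a generic type name into the adapter-specific one. A minimal sketch using api.Column.translate_type, as the comment above notes (the macro name matches the package; the exact body is an assumption):
{% macro type_string() %}
    {# Translate the generic 'string' type into the adapter's native type,
       e.g. STRING on BigQuery, varchar on many warehouses #}
    {{ return(api.Column.translate_type("string")) }}
{% endmacro %}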
The model configuration is standard dbt practice: it provides complete column descriptions and reuses docs blocks through the doc() macro:
version: 2
models:
- name: models
description: Staging model that contains metadata about model executions. One row per node per run.
columns:
- name: command_invocation_id
description: '{{ doc("command_invocation_id") }}'
- name: node_id
description: '{{ doc("node_id") }}'
- name: run_started_at
description: '{{ doc("run_started_at") }}'
- name: database
description: '{{ doc("database") }}'
- name: schema
description: '{{ doc("schema") }}'
- name: name
description: '{{ doc("name") }}'
- name: depends_on_nodes
description: '{{ doc("depends_on_nodes") }}'
- name: package_name
description: '{{ doc("package_name") }}'
- name: path
description: '{{ doc("path") }}'
- name: checksum
description: '{{ doc("checksum") }}'
- name: materialization
description: '{{ doc("materialization") }}'
- name: tags
description: '{{ doc("tags") }}'
- name: meta
description: '{{ doc("meta") }}'
- name: alias
description: '{{ doc("alias") }}'
- name: all_results
description: '{{ doc("all_results") }}'
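Each doc() reference above resolves to a docs block that the package defines once and reuses across models. A sketch of such a block (the syntax is standard dbt; the wording here is illustrative):
{% docs command_invocation_id %}
The unique id that dbt generates for a single invocation; it is used to join
execution results back to the nodes of that run.
{% enddocs %}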
- Other models built on the predefined models
New models are then developed with ref plus CTEs on top of the base models, producing dim and fact models that match real needs.
stg_dbt__exposures.sql staging model definition:
with base as (
select *
from {{ ref('exposures') }}
),
enhanced as (
select
{{ dbt_artifacts.generate_surrogate_key(['command_invocation_id', 'node_id']) }} as exposure_execution_id,
command_invocation_id,
node_id,
run_started_at,
name,
type,
owner,
maturity,
path,
description,
url,
package_name,
depends_on_nodes,
tags
from base
)
select * from enhanced
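generate_surrogate_key builds a deterministic key by hashing the concatenation of the null-safe input columns, in the spirit of dbt_utils.generate_surrogate_key. A sketch of the idea (the sketch name is illustrative, and the package's vendored implementation may differ in details such as null handling):
{%- macro generate_surrogate_key_sketch(field_list) -%}
    {%- set fields = [] -%}
    {%- for field in field_list -%}
        {# Coalesce so that null columns still produce a stable hash input #}
        {%- do fields.append("coalesce(cast(" ~ field ~ " as " ~ dbt.type_string() ~ "), '')") -%}
        {%- if not loop.last -%}{%- do fields.append("'-'") -%}{%- endif -%}
    {%- endfor -%}
    {# dbt's cross-database hash/concat helpers render the final expression #}
    {{ dbt.hash(dbt.concat(fields)) }}
{%- endmacro -%}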
dim_dbt__current_models.sql definition:
with base as (
select *
from {{ ref('stg_dbt__models') }}
),
model_executions as (
select *
from {{ ref('stg_dbt__model_executions') }}
),
latest_models as (
/* Retrieves the models present in the most recent run */
select *
from base
where run_started_at = (select max(run_started_at) from base)
),
latest_models_runs as (
/* Retrieves all successful run information for the models present in the most
recent run and ranks them based on query completion time */
select
model_executions.node_id
, model_executions.was_full_refresh
, model_executions.query_completed_at
, model_executions.total_node_runtime
, model_executions.rows_affected
{% if target.type == 'bigquery' %}
, model_executions.bytes_processed
{% endif %}
/* Row number by refresh and node ID */
, row_number() over (
partition by latest_models.node_id, model_executions.was_full_refresh
order by model_executions.query_completed_at desc /* most recent ranked first */
) as run_idx
/* Row number by node ID */
, row_number() over (
partition by latest_models.node_id
order by model_executions.query_completed_at desc /* most recent ranked first */
) as run_idx_id_only
from model_executions
inner join latest_models on model_executions.node_id = latest_models.node_id
where model_executions.status = 'success'
),
latest_model_stats as (
select
node_id
, max(case when was_full_refresh then query_completed_at end) as last_full_refresh_run_completed_at
, max(case when was_full_refresh then total_node_runtime end) as last_full_refresh_run_total_runtime
, max(case when was_full_refresh then rows_affected end) as last_full_refresh_run_rows_affected
{% if target.type == 'bigquery' %}
, max(case when was_full_refresh then bytes_processed end) as last_full_refresh_run_bytes_processed
{% endif %}
, max(case when run_idx_id_only = 1 then query_completed_at end) as last_run_completed_at
, max(case when run_idx_id_only = 1 then total_node_runtime end) as last_run_total_runtime
, max(case when run_idx_id_only = 1 then rows_affected end) as last_run_rows_affected
{% if target.type == 'bigquery' %}
, max(case when run_idx_id_only = 1 then bytes_processed end) as last_run_bytes_processed
{% endif %}
, max(case when not was_full_refresh then query_completed_at end) as last_incremental_run_completed_at
, max(case when not was_full_refresh then total_node_runtime end) as last_incremental_run_total_runtime
, max(case when not was_full_refresh then rows_affected end) as last_incremental_run_rows_affected
{% if target.type == 'bigquery' %}
, max(case when not was_full_refresh then bytes_processed end) as last_incremental_run_bytes_processed
{% endif %}
from latest_models_runs
where run_idx = 1
group by 1
),
final as (
select
latest_models.*
, latest_model_stats.last_full_refresh_run_completed_at
, latest_model_stats.last_full_refresh_run_total_runtime
, latest_model_stats.last_full_refresh_run_rows_affected
{% if target.type == 'bigquery' %}
, latest_model_stats.last_full_refresh_run_bytes_processed
{% endif %}
, latest_model_stats.last_run_completed_at
, latest_model_stats.last_run_total_runtime
, latest_model_stats.last_run_rows_affected
{% if target.type == 'bigquery' %}
, latest_model_stats.last_run_bytes_processed
{% endif %}
, latest_model_stats.last_incremental_run_completed_at
, latest_model_stats.last_incremental_run_total_runtime
, latest_model_stats.last_incremental_run_rows_affected
{% if target.type == 'bigquery' %}
, latest_model_stats.last_incremental_run_bytes_processed
{% endif %}
from latest_models
left join latest_model_stats
on latest_models.node_id = latest_model_stats.node_id
)
select * from final
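Once these models are built, ordinary queries against them answer operational questions. For example, a simple analysis to list the slowest models in the latest run (illustrative; the column names come from the dim model above):
select
    name,
    last_run_completed_at,
    last_run_total_runtime
from {{ ref('dim_dbt__current_models') }}
order by last_run_total_runtime desc
limit 10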
Notes
The above is only a rough walkthrough. dbt_artifacts also includes quite a few other capabilities, such as model migrations (driven by the evolution of dbt_artifacts itself) and more; for details, the source code is worth studying.
References
https://github.com/brooklyn-data/dbt_artifacts
https://brooklyn-data.github.io/dbt_artifacts/#!/overview